File Systems: The state of the art

Generally overlooked, file systems are assuming a prominent place at the heart of new technologies that address some of the most vexing storage problems, such as scaling and performance. New file-system approaches provide the underpinnings for technologies such as clustering, global namespaces and wide-area computing.

For most of their relatively brief history, file systems have traveled below the radar, performing yeoman tasks. Nowadays, instead of being the unsung workhorses of the infrastructure, some file systems have achieved a higher profile and are home to some of the most cutting-edge innovation in data storage.

The change comes from a number of startups and established companies using advances in file-system technologies to solve some large, lingering problems such as quick I/O reads and writes, file locking and synchronization, support for geographically dispersed work groups and NAS consolidation. New file systems are the underpinnings of the following storage technologies:

  • High-performance clusters. The challenge of high-performance computing (HPC) has been to assemble hundreds or thousands of small compute nodes into an efficient, parallelized computing resource. At the core of this massive effort are file-system technologies that federate the independent actions of computing and storage nodes into clusters.
  • NAS clustering. The never-ending quest to create fast, modular and scalable NAS has received a significant boost in the past two years from many new vendors that are leveraging file systems to create high speed, reliable NAS computing environments. Because so many of the historical problems with NAS result from independent file systems being resized and migrated, the benefits of a clustered NAS solution are significant. By uniting file systems across multiple NAS devices, users can create one large NAS computing resource.
  • SAN clustering. The SAN world has rapidly moved to embrace new file-system technologies to solve a range of performance, consolidation and management issues. SANs and storage virtualization enable storage resources to be consolidated and shared across devices, but servers and their file systems still have access only to assigned storage resources--there's no data sharing across all servers. By deploying advanced file-system technologies in front of a SAN to enable data sharing across servers, it's possible to create a computing environment where any server on the SAN can touch any networked storage resource. Using advanced file systems for SAN clustering is at the heart of several hot trends, such as database consolidation and flexible scaling of applications and servers through dynamic data sharing.
  • Namespace management. Enterprises are deploying new file-system offerings to create a unified view of information resources across a number of discrete devices, each of which may still possess and preserve its own file-system images.
  • Wide-area computing. In the past year, the demand for distributed computing technologies has rocketed off the charts; wide-area file services (WAFS) is the key enabling technology. A range of advances in file systems is helping enterprises to create collaborative and consolidated work environments (see "Keep remote offices in sync," Storage magazine, October 2005).

File systems have evolved beyond the traditional file systems that commonly ship with workstation computers and servers. Journaling is one of the most basic and widespread improvements for traditional file-system architectures. In the event of an internal file-system error or unanticipated system shutdown, a traditional file system must rely on a time-consuming reboot with a granular data scan to recover itself (for Unix and Linux, the fsck command examines and repairs the file system). For larger deployments, this can translate into hours before the file system can assess its integrity and come back online. Needless to say, such downtime in critical server environments is now completely unacceptable.

How to choose a file system
Whether you're looking for a scalable NAS solution, a SAN cluster, a high-performance cluster or a wide-area deployment, there are certain factors every file system needs to offer. In any scenario, the following criteria will determine why one file system is a better fit than another for a particular environment or storage application:

Workload. The data workload will have a profound impact on the kind of file system deployed. Some file systems can't perform well under dynamic, random I/O workloads (e.g., databases), but excel in sequential data environments (e.g., digital content and streaming media). There's no perfect file system that's optimized across all workloads.

Scalability. Scalability issues include how a file-system technology handles the addition of new clients, servers, applications, networking elements and storage capacity. Finding the appropriate mix for a given deployment requires careful analysis. Vendors may have developed offerings that excel in one or two aspects of scalability (e.g., addition of computing and storage resources), but can't handle application loads at that scale.

Application goals. The kinds of applications to be supported will be one of the key determinants for selecting a file-system technology. For example, does the file system provide dynamic access to critical Oracle data stored within a SAN environment; provide a small number of Unix clients access to a high-performance computing and storage pool; or does it provide 80 Windows clients with a unified view of their network directories? The application goals for each of these examples will lead to markedly different vendor choices and file-system architectures.

Performance. Some sophisticated file systems support near-linear performance as new nodes are added to clusters, providing massively parallelized, end-to-end performance. Others degrade in performance quickly as new nodes are added, but provide superior capabilities for data sharing or collaboration across wide areas. All too often, users choose inadequately powered file systems or overspend on complicated architectures that could have been avoided.
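One way to put a number on "near-linear" is to compare measured throughput at each cluster size with the ideal of perfect linear scaling. The following is a minimal sketch in Python; the function name and the figures in the usage note are hypothetical and don't come from any vendor or benchmark.

```python
def scaling_efficiency(throughput_by_nodes):
    """Map node count -> fraction of ideal linear throughput achieved.

    `throughput_by_nodes` maps a node count to the aggregate throughput
    measured at that size. The smallest cluster is taken as the baseline.
    """
    base_nodes = min(throughput_by_nodes)
    per_node = throughput_by_nodes[base_nodes] / base_nodes
    return {
        nodes: measured / (nodes * per_node)
        for nodes, measured in sorted(throughput_by_nodes.items())
    }
```

With made-up measurements of 100MB/sec on one node, 200MB/sec on two and 300MB/sec on four, the function reports efficiencies of 1.0, 1.0 and 0.75: the cluster scales linearly to two nodes and then starts to fall behind.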

When pressed, most file-system vendors will admit that publicly stated performance metrics provide only the most basic outline of a product's suitability for a given environment. Any reputable file-system vendor will support actual testing of their file system in a real-world environment and encourage prospective buyers to speak with others who have deployed their technology for similar workloads.

Because of the criticality of enterprise data today, most data centers deploy a journaling file system (JFS). It's easiest to think of journaling as a rapid backup and recovery mechanism for the file system. A fully implemented JFS creates its own transaction logs for all meta data and user data actions within a file system. By logging file meta data and user data, a JFS can determine precisely what transactions had taken place up to the time of the failure, thereby ensuring full data integrity when the journal is replayed. The file system can then use the journal information to execute an immediate recovery, avoiding a time-consuming walk of the entire file system's data structure. This reduces recovery time from hours to seconds or minutes.
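The log-then-apply idea behind journaling can be illustrated with a toy example. The sketch below is a user-space analogy in Python, not how any particular file system implements its journal; the class name, the JSON record format and the file names are all assumptions made for illustration.

```python
import json
import os
import tempfile


class JournaledStore:
    """Toy key-value 'file system' that logs each change before applying it."""

    def __init__(self, journal_path):
        self.journal_path = journal_path
        self.data = {}      # in-memory state, lost on a crash
        self.replay()       # recover whatever the journal recorded

    def write(self, name, content):
        # 1. Append the intent to the journal and force it to stable storage.
        record = {"op": "write", "name": name, "content": content}
        with open(self.journal_path, "a", encoding="utf-8") as journal:
            journal.write(json.dumps(record) + "\n")
            journal.flush()
            os.fsync(journal.fileno())
        # 2. Only then apply the change to the live structures.
        self.data[name] = content

    def replay(self):
        """Rebuild state from the journal instead of scanning everything."""
        if not os.path.exists(self.journal_path):
            return
        with open(self.journal_path, encoding="utf-8") as journal:
            for line in journal:
                try:
                    record = json.loads(line)
                except json.JSONDecodeError:
                    break   # torn final record from the crash: ignore it
                self.data[record["name"]] = record["content"]


def demo():
    path = os.path.join(tempfile.mkdtemp(), "journal.log")
    store = JournaledStore(path)
    store.write("a.txt", "hello")
    store.write("b.txt", "world")
    # Simulate a crash: a half-written record, then the process dies.
    with open(path, "a", encoding="utf-8") as journal:
        journal.write('{"op": "write", "name": "c.tx')
    recovered = JournaledStore(path)    # the "reboot"
    return recovered.data
```

Calling demo() recovers the two completed writes and discards the torn third record. Recovery reads only the journal, which is why it takes seconds rather than a walk of every structure. This toy journal grows without bound; production journals are typically bounded and recycled once changes are safely on disk.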

Journaling has become a standard feature of advanced file systems and one of the cornerstones of high-availability initiatives. It's found in the most popular enterprise file systems that run on Unix and Linux platforms, including ext3, XFS, ReiserFS, VxFS and IBM Corp.'s JFS. Microsoft Corp.'s current NTFS release supports meta data journaling capabilities, with a more feature-rich Transactional NTFS version called TxF due in its Vista (formerly code-named Longhorn) operating system release.

Networked file systems
Networked file systems have become the foundation for data and resource sharing in the enterprise. Not surprisingly, the iconic deployment example is NFS, a 20-year-old technology now synonymous with the protocol it exports to drive a majority of the world's NAS deployments. The other major networked file system is the Common Internet File System (CIFS). At their core, both NFS and CIFS are hierarchical file systems that can export specialized protocols to their clients to enable the sharing of files under their control. NFS and CIFS support Unix and Windows clients, respectively. Despite its strong associations with the Unix community, NFS can support other operating systems, including Windows.

Networked file systems are at the center of developments in namespace aggregation, as described above. NFS has continued to add functionality in this respect, including inherent namespace aggregation across multiple machines in NFS V.4, its latest release. With namespace aggregation activated, multiple machines running NFS V.4 can share a common view of file information. Likewise, CIFS can leverage a software platform called Microsoft Distributed File System (DFS) to establish unified namespaces across multiple Windows machines. Despite the "distributed file system" moniker, Microsoft DFS isn't a true file-system technology, but rather a namespace aggregation tool that deploys atop the Windows operating system.

Storage professionals may note that the term virtual file system (VFS) is used increasingly by some vendors. Users should consider the term VFS a synonym for namespace aggregation, and think of it in the context of a networked file system. Specifically, this refers to grouping several file systems to create the virtual image of one file system. Underneath, each physical device retains its own file-system images.
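At its simplest, namespace aggregation is a lookup table that maps a path in the unified view to the device and file system that actually hold the data. The Python sketch below shows that idea only; the class, the server names and the export paths are hypothetical, and it doesn't represent the NFS V.4 or Microsoft DFS mechanisms.

```python
class UnifiedNamespace:
    """Maps paths in one virtual tree onto several independent file systems."""

    def __init__(self):
        self.mounts = {}    # virtual prefix -> (server, exported path)

    def add(self, prefix, server, export):
        self.mounts[prefix.rstrip("/")] = (server, export.rstrip("/"))

    def resolve(self, virtual_path):
        """Return (server, physical path) for a path in the unified view."""
        best = None
        for prefix in self.mounts:
            matches = (virtual_path == prefix
                       or virtual_path.startswith(prefix + "/"))
            # The longest matching prefix wins, so nested mappings work.
            if matches and (best is None or len(prefix) > len(best)):
                best = prefix
        if best is None:
            raise KeyError(virtual_path)
        server, export = self.mounts[best]
        return server, export + virtual_path[len(best):]


ns = UnifiedNamespace()
ns.add("/corp/eng", "filer1", "/vol/eng")
ns.add("/corp/eng/builds", "filer2", "/vol/builds")
```

Here ns.resolve("/corp/eng/specs/a.doc") returns ("filer1", "/vol/eng/specs/a.doc"), while ns.resolve("/corp/eng/builds/nightly.tar") returns ("filer2", "/vol/builds/nightly.tar"). Clients see one tree; each device keeps its own file-system image underneath.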

There are several companies developing technologies on top of these networked file systems to enhance namespace management, including Acopia Networks Inc., NeoPath Networks Inc., NuView Inc. and Rainfinity (acquired by EMC Corp.).

Advanced file-system approaches
Some of the more advanced challenges in the data center require file-system technologies that are even more complex than those discussed thus far. The following sections provide a more in-depth look at some of the cutting-edge developments in cluster file systems (CFS) and distributed file systems/parallel file systems (PFS).

Verify vendor performance claims
File-system vendors typically rely on two basic metrics to describe and compare their products' performance: operations per second (Ops/sec) and throughput. Ops/sec is an I/O measurement reflecting how many write-and-read actions a file system can handle vs. a given benchmarking suite. When vendors mention Ops/sec, they need to disclose the benchmark against which this measurement was achieved, otherwise the measurement is meaningless.
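To make the two metrics concrete, the Python sketch below times a fixed number of block writes and reads against a scratch file and reports both figures. It's a toy measurement under assumed parameters (4KB blocks, 256 blocks, a temporary directory), not a substitute for the benchmark suites vendors cite.

```python
import os
import tempfile
import time


def measure(path, block_size=4096, blocks=256):
    """Write then read `blocks` blocks; return (ops_per_sec, mb_per_sec)."""
    payload = b"x" * block_size
    start = time.perf_counter()
    with open(path, "wb") as f:
        for _ in range(blocks):
            f.write(payload)
        f.flush()
        os.fsync(f.fileno())        # make sure the writes reach the device
    with open(path, "rb") as f:
        while f.read(block_size):   # read back until end of file
            pass
    elapsed = max(time.perf_counter() - start, 1e-9)
    ops = 2 * blocks                            # one write and one read per block
    total_bytes = 2 * blocks * block_size
    return ops / elapsed, total_bytes / elapsed / 1e6


def demo():
    with tempfile.TemporaryDirectory() as scratch:
        return measure(os.path.join(scratch, "bench.dat"))
```

The two numbers are tied together by the block size: throughput equals Ops/sec multiplied by the bytes moved per operation. The same Ops/sec figure therefore means very different things at different block sizes. The reads here will usually be served from the operating system's cache rather than from disk, a reminder of how much test conditions shape the result.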

Throughput is typically measured in megabytes or gigabytes per second (MB/sec or GB/sec), indicating the total amount of data the file system--or the entire computing system based on that file system--can produce in a given time period. For users evaluating file-system software, it's important to determine the hardware used for throughput benchmarking to ensure a fair comparison to their own data center's hardware. Because file-system benchmarks (Iometer, IOzone) and related NAS protocol benchmarks (NetBench, SPECsfs) are necessarily broad and general in nature, vendors do whatever they can to make their product look better than competing products.

A CFS lets any storage node within the cluster read and write all blocks of the file system concurrently. In a CFS, sometimes referred to as a data-sharing cluster, data isn't owned by a portion of a file system or a particular machine, but is instead open across the entire cluster. As such, the ultimate goal of modern CFS is the creation of scalable file-system clusters spanning multiple applications on multiple servers, all riding on top of the same pool of shared networked storage. An architectural characteristic of CFS is that it has no deployments on clients; it resides only on the servers in the cluster.

To achieve this dynamic sharing of all reads and writes across all server and storage nodes, a CFS must possess sophisticated controls over what node has rights and access to a given piece of data at any given time. The technology within a CFS that plays traffic cop is called a distributed lock manager (DLM). A well-architected DLM enables a CFS to scale to dozens or hundreds of nodes, ensuring performance and total coherency of the data at all times. Today, most CFS have been architected with enterprise deployments in mind. As such, they place high emphasis on dynamic, transparent failover amongst server nodes and immediate recoverability of nodes without any data loss. It's no coincidence that CFS originated in the realm of high availability for databases (see "How to choose a file system," previous page).
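The traffic cop role can be sketched as a table of who holds which lock. The Python example below is a single-process stand-in for a DLM, assuming a common two-mode scheme (shared locks for readers, exclusive locks for writers); the class, node names and resource names are hypothetical. A real DLM spreads this state across the cluster and must also cope with node failure.

```python
class LockManager:
    """Single-process stand-in for a distributed lock manager."""

    def __init__(self):
        self.holders = {}   # resource -> (mode, set of node names)

    def acquire(self, node, resource, mode):
        """Grant a 'shared' or 'exclusive' lock; return True if granted."""
        if mode not in ("shared", "exclusive"):
            raise ValueError(f"unknown lock mode: {mode}")
        held = self.holders.get(resource)
        if held is None:
            self.holders[resource] = (mode, {node})
            return True
        held_mode, nodes = held
        if mode == "shared" and held_mode == "shared":
            nodes.add(node)     # any number of readers may share the lock
            return True
        return False            # conflicting request: caller must retry later

    def release(self, node, resource):
        held = self.holders.get(resource)
        if held is None:
            return
        _, nodes = held
        nodes.discard(node)
        if not nodes:
            del self.holders[resource]
```

With this sketch, two nodes can hold a shared lock on the same block at once, but a third node asking for an exclusive lock is refused until both readers release. Every such decision is a round of coordination, which is one reason lock management becomes harder as node counts grow.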

Over the past few years, CFS have taken up residence alongside SAN deployments. While a SAN enables the pooling of storage resources, it does nothing to change how servers access that networked data: each file system still owns a piece of the SAN. However, a CFS in conjunction with a SAN actually pools application I/O and storage so that any server can access any of the data within the networked storage environment. CFS has difficulty when scaling to large node counts in highly parallel computing environments, such as large HPC deployments with thousands of servers. Aside from the lack of networked storage architectures in such deployments, the DLM architecture of a CFS can face serious performance issues at such large scales.

Notable companies leveraging CFS for enterprise workloads in NAS and SAN clustering include Advanced Digital Information Corp. (ADIC) with StorNext, IBM's SAN File System, PolyServe Inc.'s Matrix Server, Red Hat Inc.'s Sistina and Silicon Graphics Inc.'s Clustered XFS (CXFS).

DFS and PFS are receiving a lot of attention across a range of applications. The terms distributed and parallel are interchangeable; some vendors call their product distributed while others go with the "parallel" moniker--the two approaches are architecturally and functionally analogous.

DFS/PFS enables thousands of servers to sustain parallel I/O into a file system, directory or single file with minimal coordination required between those servers. All DFS/PFS are two-layer, file-system architectures with clients and servers. On the client layer, the DFS/PFS creates a namespace that spans all of the machines and creates a single file-system presentation. Because it establishes "one big file system," the client layer enables any client to make requests into the cluster that are executed by the server layer.

The server layer of a DFS/PFS is responsible for all I/O operations, and can scale to support more than 2,500 clients in large implementations (see "Verify vendor performance claims," previous page). From a data storage perspective, the server layer of a DFS/PFS is functionally identical to the storage layer, sometimes even referred to simply as the system's storage nodes. This is because every DFS/PFS is architected so that each individual physical server maintains ownership of its own storage resources. In a DFS or PFS, storage isn't directly shared by other servers, as is the case in a CFS. Because of that difference, a DFS/PFS doesn't need to use SAN networks for storage.

A DFS/PFS uses various internode daemons, meta data and data control mechanisms to ensure that stored content is accessed only by a single client at any given time, ensuring data coherency. While some approaches use a centralized lock manager and meta data server to achieve this traffic cop control, others use non-hierarchical or segmented lock management approaches to achieve extremely high scalability and parallelized I/O. The result is a file-system architecture optimized for huge throughput across many machines.
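The two-layer design described above can be sketched in a few lines of Python. In this toy model, a client-side layer presents one namespace and spreads each file across storage nodes that each own their local storage. Round-robin striping, the tiny stripe size and all class names are assumptions chosen for illustration; actual products use their own placement and locking schemes.

```python
class StorageNode:
    """A server that owns its local storage; nothing is shared with peers."""

    def __init__(self, name):
        self.name = name
        self.chunks = {}    # (filename, stripe index) -> bytes


class ParallelFS:
    """Client-side view: one namespace over many independent storage nodes."""

    def __init__(self, nodes, stripe_size=4):
        self.nodes = nodes
        self.stripe_size = stripe_size
        self.catalog = {}   # filename -> stripe count (the meta data)

    def write(self, filename, data):
        stripes = [data[i:i + self.stripe_size]
                   for i in range(0, len(data), self.stripe_size)]
        for index, stripe in enumerate(stripes):
            # Round-robin placement: stripe i lands on node i mod N.
            node = self.nodes[index % len(self.nodes)]
            node.chunks[(filename, index)] = stripe
        self.catalog[filename] = len(stripes)

    def read(self, filename):
        count = self.catalog[filename]
        return b"".join(
            self.nodes[i % len(self.nodes)].chunks[(filename, i)]
            for i in range(count)
        )


nodes = [StorageNode("n1"), StorageNode("n2"), StorageNode("n3")]
pfs = ParallelFS(nodes)
pfs.write("movie.dat", b"abcdefghijkl")
```

After the write, each of the three nodes holds one 4-byte stripe, and pfs.read("movie.dat") reassembles the original bytes. The loops here run one stripe at a time; a real DFS/PFS would issue the transfers to all nodes in parallel, which is where the aggregate throughput comes from.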

Because of this architecture, DFS/PFS established an initial beachhead in HPC cluster applications. DFS/PFS are being increasingly deployed in enterprises for data-intensive apps such as digital content delivery and scalable NAS (see "Where CFS and DFS/PFS fit best," above right). A shortcoming of most DFS/PFS implementations has been an inability to handle the random I/O common in workloads such as databases. This relative weakness results from how these architectures handle the traffic cop issues of lock and meta data control among large numbers of server I/O nodes.

Notable companies leveraging DFS or PFS technologies include Exanet Inc., IBM's General Parallel File System (GPFS), Ibrix Inc., Isilon Systems Inc. and Lustre, the open-source initiative.

Cluster vs. distributed/parallel
The key question faced by any user is: When should I care about a cluster file system vs. a distributed or parallel file system?

CFS are well designed for comparatively smaller node-count deployments in enterprise environments. Because of their integration with SAN architectures and their extreme focus on high availability and immediate recovery, a CFS is well suited to clustered databases and consolidation of file or application servers into a networked storage pool. Additionally, enterprise CFS have been increasingly coupled with NAS protocols (NFS, CIFS) to create scalable NAS environments or, more accurately, "scalable NAS on SAN." Users should also consider CFS when high availability and recoverability for critical apps are absolutely essential.

DFS and PFS continue to dominate HPC environments. This is because they don't require SAN integration and are optimized for highly parallelized I/O to single clients. Additionally, this category of file-system technology is increasingly prevalent at the heart of scalable NAS offerings. These offerings typically leverage the DFS/PFS architecture to create an integrated plug-and-play cluster where every node behaves as an atomic member, contributing new CPU and storage to an overall system as they're added. Some DFS-based NAS offerings can achieve theoretical file-system sizes of more than 100TB and are demonstrating impressive real-world aggregate throughput. When client performance is critical or highly parallel I/O is a must, users should look to DFS/PFS solutions.

Both CFS and DFS/PFS will play important roles in the data center. CFS will undoubtedly continue to play a major role in enterprise virtualization initiatives because a CFS enables fine-grained scalability, and the sharing of multiple applications across all server and storage resources. Likewise, DFS/PFS will build traction as an engine for scalable NAS due to the easy atomic management capabilities these distributed file systems offer.

What's next?
Looking ahead 24 to 36 months, expect to see even more file-system innovations. Specifically, major vendors will integrate CFS technologies into their server and storage virtualization product families. Without easily deployed, transparent, shared file systems, all of the great utility computing dreams of major vendors will remain just dreams.

Accelerated by fast node-to-node interconnects, DFS/PFS will continue to boost performance across a range of I/O types, eventually challenging CFS for the lucrative clustered database market. This could have significant implications for the DBMS market as a whole when plug-and-play scalable databases become much easier to deploy.

Within three years, namespace aggregation technologies will become integrated features of servers, workstation file systems and enterprise storage switching platforms. In short, unified namespaces will become ubiquitous, an essential part of an enterprise file-system deployment. Both CFS and DFS/PFS will also continue to extend their support for wide-area geographies. Data centers will begin to aggressively deploy both types of technologies to support branch offices for better collaboration and consolidation activities for file and block data.

The once lowly file system has finally blossomed into a state of increasingly diverse innovation. For storage professionals, the complexity has increased, but the good news is that file systems are finally starting to make their jobs easier.
