Published: 01 Sep 2010
Companies of all sizes are being inundated with unstructured data that's straining the limits of traditional file storage. File virtualization can pool those strained resources and provide for future growth.
Unstructured data is growing at an unprecedented rate in all industries and has become one of the top challenges for IT departments. Market data from a variety of analyst and research firms paints a consistent picture: In most companies, the amount of unstructured (file-based) data outstrips structured data; it's spread across the enterprise; and it tends to reside on a motley assortment of isolated file stores that range from file servers to network-attached storage (NAS). Management pain points have reached a critical level and associated costs are skyrocketing.
How we ended up in this dilemma is well understood. On the one hand, we have the simplicity of implementing unstructured data stores via Windows and Linux file servers with direct-attached and storage-area network (SAN) storage; on the other hand, we have traditional NAS systems that are based on scale-up architectures with inherent limitations to scale. For instance, until NetApp released Ontap 8, its filers lacked advanced clustering and a global namespace; the only way to extend beyond a single NetApp filer was to buy a larger filer or deploy another one running independently from already installed systems.
The data storage industry is keenly aware of the situation, and vendors have taken different approaches to provide file system and NAS virtualization products that help to overcome the challenge at hand. Even though progress has been made, adoption has been tepid. "It has taken almost 10 years for block-based virtualization to take place," said Greg Schulz, founder and senior analyst at Stillwater, Minn.-based StorageIO Group. "NAS virtualization is still in an early stage and it will take time for it to be widely adopted."
Four ways to virtualize file access
Virtualizing file access by putting a logical layer between back-end file stores and clients, and providing a global namespace, is clearly the most promising approach to tackling the unstructured data challenge. It's akin to block-based storage virtualization; however, there isn't a single method of implementing file-access virtualization. Instead, we have several architectural approaches competing for a potentially lucrative file-access virtualization market.
1. File-system virtualization (aggregation) is one way of virtualizing file access. At a high level, file-system virtualization accumulates individual file systems into a pool that's accessed by clients as a single unit. In other words, clients see a single large namespace without being aware of the underlying file stores. The underlying file store could be a single NAS, or a mesh of various file servers and NAS systems. File-system virtualization products address two main problems: They give users a single virtual file store; and they offer storage management capabilities such as nondisruptive data migration and file-path persistency while files are moved between different physical file stores.
One of the great benefits of file-system virtualization is that it can be deployed in existing environments without having to rip out existing servers and NAS storage. On the downside, file-system aggregation doesn't address the problem of having to manage each file store individually.
2. Clustered file systems are another way of virtualizing file access. Clustered file systems are part of next-generation NAS systems designed to overcome the limitations of traditional scale-up NAS. They're usually composed of block-based storage nodes, typically starting with three nodes and scaling to petabytes of file storage simply by adding nodes. The clustered file system glues the nodes together by presenting a single file system with a single global namespace to clients. Vendors offering NAS systems based on clustered file systems include FalconStor Software Inc. (HyperFS), Hewlett-Packard (HP) Co. (StorageWorks X9000 Network Storage Systems), IBM (Scale Out Network Attached Storage, or SONAS), Isilon Systems Inc., Oracle Corp. (Sun Storage 7000 Unified Series), Panasas Inc., Quantum Corp. (StorNext) and Symantec Corp. (FileStore).
3. Clustered NAS is a third way of virtualizing file access. Clustered NAS architectures share many of the benefits of clustered file-system-based NAS. Instead of running a single file system that spreads across all nodes, clustered NAS systems run complete file systems on each node, aggregating them under a single root and presenting them as a single global namespace to connected clients. In a sense, clustered NAS is a combination of a scale-out, multi-node storage architecture and file-system aggregation. Instead of aggregating file systems of heterogeneous file stores, they aggregate file systems on native storage nodes. The BlueArc Corp. Titan and Mercury series of scale-out NAS systems are prime examples of clustered NAS systems.
4. NAS gateways can also be viewed as file-system virtualization devices. Sitting in front of block-based storage, they provide NFS and CIFS access to it. Offered by most NAS vendors, they usually allow bringing third-party, block-based storage into the NAS and, if supported by the NAS vendor, into the global namespace.
NAS systems and gateways based on clustered file system or clustered NAS architectures are next-generation NAS systems and won't integrate with existing legacy file stores; they usually replace them or run in parallel with them. This makes them more difficult to deploy as well as more expensive than file-system virtualization products. However, the benefit of having to manage only a single NAS, rather than many small data silos that are simply aggregated by a file-system virtualization product into a single namespace, more often than not justifies the additional effort and cost.
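To make the aggregation idea concrete, here is a minimal Python sketch of a global namespace that maps client-visible paths to physical file stores. It is an illustration only, not any vendor's implementation: the class name, the store names (winfs01, filer02) and the paths are all hypothetical, and a real product would also copy the data and handle locking during a migration.

```python
from dataclasses import dataclass, field
from typing import Dict, Tuple


@dataclass
class GlobalNamespace:
    """Hypothetical mapping of virtual paths to (file store, physical path)."""

    _table: Dict[str, Tuple[str, str]] = field(default_factory=dict)

    def publish(self, virtual_path: str, store: str, physical_path: str) -> None:
        # Make a file on a back-end store visible under a virtual path.
        self._table[virtual_path] = (store, physical_path)

    def resolve(self, virtual_path: str) -> Tuple[str, str]:
        # Clients only ever ask for the virtual path.
        return self._table[virtual_path]

    def migrate(self, virtual_path: str, new_store: str, new_physical_path: str) -> None:
        # Only the mapping changes in this sketch; the data copy is assumed
        # to have been done by some other (unshown) mechanism.
        if virtual_path not in self._table:
            raise KeyError(virtual_path)
        self._table[virtual_path] = (new_store, new_physical_path)


ns = GlobalNamespace()
ns.publish("/corp/finance/q3.xls", "winfs01", r"D:\shares\finance\q3.xls")
ns.publish("/corp/eng/specs.doc", "filer02", "/vol/vol1/eng/specs.doc")

# Move the finance file to another store; the client-visible path is unchanged.
ns.migrate("/corp/finance/q3.xls", "filer02", "/vol/vol2/finance/q3.xls")
print(ns.resolve("/corp/finance/q3.xls"))
```

The point of the sketch is file-path persistency: clients keep using /corp/finance/q3.xls before and after the move, while each underlying store still has to be managed on its own.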
|NAS virtualization terminology|
Namespace is the organization and presentation of file-system data, such as directory structure and files.
In a non-shared namespace, file-system information is confined to a single physical machine and not shared with others. Traditional scale-up NAS and server-based file stores are examples of products with non-shared namespaces.
Conversely, a shared namespace, also referred to as a global namespace, combines the namespace of multiple physical machines or nodes into a single namespace. It can be implemented by aggregating the namespaces of multiple machines and presenting them as a single federated namespace, as is usually the case in file-system virtualization and clustered NAS products, or it can be achieved via clustered file systems where a single file system spreads across multiple physical nodes.
Scale-up NAS is a file-based storage system that scales by upgrading the hardware within a system, such as adding faster CPUs, more memory and more disks. Its namespace spans one or two nodes clustered for high availability.
Scale-out NAS is a file-based storage system that provides scaling by adding nodes to the cluster. Available in N+1 (single redundant node) or N+M (each node has a redundant node) high-availability configurations, scale-out systems provide a namespace that spans multiple nodes, allowing access to data throughout all nodes in the namespace.
File-system virtualization use cases and selection criteria
Because ripping out existing file stores and replacing them with a scale-out NAS isn't an option in many situations, file-system virtualization products that aggregate the various file stores into a single global namespace can be viewed as complementary to scale-out and traditional NAS systems, especially during the extended time of transitioning from legacy file stores. "Many customers buy a NAS to get features like replication, archiving and snapshots, but they don't require these for all files," said Brian Gladstein, vice president (VP) of marketing at AutoVirt Inc. "We give them the ability to mix existing low-end file stores with fast filers and provide them with a single namespace."
Even in companies that can centralize their unstructured data onto a NAS with global namespace support, there will likely always be some storage silos that live outside the NAS. It could be departmental data or data that's deemed unworthy to reside on comparatively expensive NAS storage. File-system virtualization products allow combining rogue file stores with NAS devices into a global namespace.
A second use case for file-system aggregation is data migration. Acquisitions, storage infrastructure upgrades and data relocation projects are among the reasons for migrating data from one physical location to another. Because file-system aggregation products virtualize access to heterogeneous file stores, they're also simple yet effective data migration solutions.
Another use case for file-system aggregation is automated storage tiering. Equipped with policy engines for defining data migration rules based on file-system metadata -- such as last access date, file size and file type -- they enable automatic data movement to suitable storage tiers based on defined policies.
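A metadata-driven tiering policy of the kind described above can be sketched in a few lines of Python. This is a hypothetical rule format, not the policy language of any product mentioned in this article; the tier names, thresholds and file paths are assumptions chosen for illustration, and the first matching rule wins.

```python
import os
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import FrozenSet, List, Optional


@dataclass
class FileMeta:
    path: str
    size_bytes: int
    last_access: datetime

    @property
    def extension(self) -> str:
        # File type is approximated by the lowercase extension without the dot.
        return os.path.splitext(self.path)[1].lstrip(".").lower()


@dataclass
class TierRule:
    target_tier: str
    min_idle_days: int = 0
    min_size_bytes: int = 0
    extensions: Optional[FrozenSet[str]] = None  # None means any file type

    def matches(self, f: FileMeta, now: datetime) -> bool:
        if now - f.last_access < timedelta(days=self.min_idle_days):
            return False
        if f.size_bytes < self.min_size_bytes:
            return False
        if self.extensions is not None and f.extension not in self.extensions:
            return False
        return True


def choose_tier(f: FileMeta, rules: List[TierRule], now: datetime,
                default: str = "tier1") -> str:
    # Rules are evaluated in order; files matching no rule stay on the default tier.
    for rule in rules:
        if rule.matches(f, now):
            return rule.target_tier
    return default


now = datetime(2010, 9, 1)
rules = [
    TierRule("archive", min_idle_days=365),
    TierRule("tier2", min_idle_days=90, extensions=frozenset({"iso", "zip"})),
]
files = [
    FileMeta("/corp/eng/build.iso", 700_000_000, datetime(2010, 5, 1)),
    FileMeta("/corp/finance/q3.xls", 20_000, datetime(2010, 8, 30)),
    FileMeta("/corp/old/report.doc", 1_000, datetime(2008, 1, 1)),
]
placement = {f.path: choose_tier(f, rules, now) for f in files}
```

In this example the untouched ISO image lands on tier2, the recently used spreadsheet stays on tier1 and the two-year-old report goes to the archive tier. A real policy engine would then schedule the actual data movement behind the namespace.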
File-system virtualization products are available as appliances and software-only products. A software-only product offers the benefits of more flexible deployment on hardware of your choice, and the products usually have a lower degree of vendor lock-in. Conversely, appliance-based file-system virtualization products come in a proven, performance-optimized package and, because hardware and software are provided by the same vendor, there's less risk of finger pointing.
When comparing file-system virtualization products, the level at which virtualization occurs is a relevant evaluation criterion. For instance, while Microsoft Distributed File System (DFS) provides share-level virtualization, a product like F5 Networks Inc.'s ARX series provides file-level virtualization.
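The difference between the two levels can be illustrated with a small Python sketch. It models the concept only and does not reflect how DFS or ARX work internally; the function names, store names and paths are hypothetical. With share-level virtualization, the mapping granularity is the whole share; with file-level virtualization, individual files inside the same virtual directory tree can live on different stores.

```python
from typing import Dict, Tuple


def resolve_share_level(path: str, share_map: Dict[str, str]) -> Tuple[str, str]:
    """Map the first path component (the share) to a store."""
    parts = path.strip("/").split("/", 1)
    share = parts[0]
    rest = parts[1] if len(parts) > 1 else ""
    return share_map[share], rest


def resolve_file_level(path: str, file_map: Dict[str, str],
                       share_map: Dict[str, str]) -> str:
    """Look for a per-file placement first, then fall back to the share's store."""
    if path in file_map:
        return file_map[path]
    store, _ = resolve_share_level(path, share_map)
    return store


share_map = {"projects": "filer01"}                   # whole share on one filer
file_map = {"/projects/2008/old.zip": "archive01"}    # one file placed elsewhere

share_result = resolve_share_level("/projects/2008/old.zip", share_map)
old_store = resolve_file_level("/projects/2008/old.zip", file_map, share_map)
new_store = resolve_file_level("/projects/2010/new.doc", file_map, share_map)
```

Here the share-level lookup can only say that everything under /projects lives on filer01, whereas the file-level lookup places the old archive on archive01 while newer files remain on filer01. That finer granularity is what makes per-file tiering and migration possible.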
Intrusiveness and ease of deployment are also relevant characteristics to consider during a product evaluation. Ideally, a file-system virtualization product should require minimal client changes and the virtualized data on the back-end file stores shouldn't be changed.
Protocol support must also be considered. While some systems support only CIFS, products like F5's ARX and EMC Corp.'s Rainfinity support CIFS and NFS, which is relevant in environments with both Windows and Linux file stores. The presence of a policy engine and its capabilities are critical if the product's intended use is for data mobility and automated storage tiering.
File-system virtualization product sampler
File-system virtualization products are offered by a number of vendors, each coming from a different background with varying objectives.
AutoVirt File Virtualization software: Like Microsoft DFS, AutoVirt is a software-only product that runs on Windows servers.
The AutoVirt global namespace uses the CIFS protocol to interact with file servers, clients and DNS. When a client requests a file, DNS facilitates resolution to the appropriate storage device. The global namespace acts as an intermediary between client and DNS. With the AutoVirt global namespace in place, client shortcuts refer to the namespace. The namespace is the authority on the location of networked files and provides the final storage referral with the help of DNS.
AutoVirt can be introduced nondisruptively to clients, without the need to make any changes on clients, by populating the AutoVirt namespace server with the shares of existing file stores. Although it can be done manually, a data discovery service automates discovery of existing file stores and populates the AutoVirt namespace server with metadata. This differs from Microsoft DFS, which requires clients to be configured with the new DFS shares, rather than continuing to use existing file shares.
Also contrary to Microsoft DFS, AutoVirt provides a policy engine that enables rule-based data mobility across the environment to migrate, consolidate, replicate and tier data without affecting end-user access to networked files. The product is currently available for CIFS; AutoVirt plans to release a version for NFS by year's end.
EMC Rainfinity file virtualization appliance: Rainfinity is a family of file-system virtualization products that virtualize access to unstructured data, and provide data mobility and file tiering services. The Rainfinity Global Namespace Appliance provides a single mount point for clients and applications; the Rainfinity File Management Appliance delivers policy-based management to automate relocation of files to different storage tiers; and the Rainfinity File Virtualization Appliance provides nondisruptive data movement.
Contrary to F5's ARX, the Rainfinity File Virtualization Appliance architecture is designed to switch between in-band and out-of-band operations as needed. The appliance is out-of-band most of the time, and data flows between client systems and back-end file stores directly. It sits outside the data path until a migration is required and then switches to in-band operation.
F5 ARX Series: Acquired from Acopia in 2007 and rebranded as F5 ARX, the F5 ARX series is an inline file-system virtualization appliance. Usually deployed as an active-passive cluster, it's located between CIFS/NFS clients and heterogeneous CIFS/NFS file stores, presenting virtualized CIFS and NFS file systems to clients. Unstructured data is presented in a global virtualized namespace. Built like a network switch, it's available with 2 Gbps ports (ARX500), 12 Gbps ports (ARX2000) and 12 Gbps ports plus two 10 Gbps ports (ARX4000).
F5's ARX focuses on data mobility and automated storage tiering. Orchestrated by a policy engine, it performs bidirectional data movements between different tiers of heterogeneous storage in real time and transparently to users. Similar to AutoVirt, policies are based on file metadata, such as last-accessed date, creation date, file size and file type.
The fact that F5 ARX is an appliance allows it to provide a performance-optimized product that's hard for a software-only solution to match. Built on a split-path architecture, it has both a data path that passes data straight through the device for tasks that don't involve policies, and a control path for anything that requires policies. "We are a DFS on steroids," said Renny Shen, product marketing manager at F5. "While DFS gives you share-level virtualization, we give you file-level virtualization."
Microsoft DFS: Microsoft DFS is a set of client and server services that allow an organization using Microsoft Windows servers to organize distributed CIFS file shares into a distributed file system. DFS provides location transparency and redundancy to improve data availability in case of failure or heavy load by allowing shares in multiple locations to be logically grouped under a single DFS root.
DFS supports the replication of data between servers using File Replication Service (FRS) in server versions up to Windows Server 2003, and DFS Replication (DFSR) in Server 2003 R2, Server 2008 and later versions.
Microsoft DFS supports only Windows CIFS shares and has no provision for bringing NFS or NAS shares into the DFS global namespace. Furthermore, it lacks a policy engine that would enable intelligent data movements. As part of Windows Server, it's free and a good option for companies whose file stores reside mainly on Windows servers.
|Open source NAS virtualization|
NAS virtualization products are also available as open source software. For instance, the Apache Hadoop Distributed File System (HDFS) handles distribution and redundancy of files, and enables logical files that far exceed the size of any one data storage device. HDFS is designed for commodity hardware and supports anywhere from a few nodes to thousands of nodes. Another example of an open source file system is the Gluster clustered file system for building a scalable NAS with a single global namespace.
Instead of spending a lot of money for traditional NAS systems, an open source file system running on inexpensive hardware components seems like a good alternative. But open source file systems are usually not a good choice for the enterprise. They require significant tuning and maintenance efforts, as well as experts intimately familiar with the intricacies of the chosen software, and they don't come with the support that traditional NAS vendors offer. Availability, reliability, performance and support come first for enterprise storage, and these attributes are difficult to achieve with open source software. Open source file systems are a great choice for cloud storage providers and companies that have to make money on storage, as well as for the research and educational sector, but they're usually not the product of choice in the enterprise.
File virtualization outlook
Access to unstructured data hasn't changed much in the past 15 years, but big changes are happening now. NAS system architectures are moving toward more scalable, multi-node scale-out architectures with global namespace support. Indicative of the change is NAS behemoth NetApp's Ontap 8 release, which finally incorporates technology acquired from Spinnaker and enables customers to build multi-node NetApp clusters.
File-system virtualization products are complementing traditional scale-up and next-generation scale-out NAS systems to provide a global namespace across heterogeneous file stores in the enterprise. While they're currently mostly deployed for the purpose of data mobility and storage tiering, they're likely to play a significant role in the future in providing an enterprise-wide, global unified namespace for all unstructured data.
BIO: Jacob Gsoedl is a freelance writer and a corporate director for business systems. He can be reached at email@example.com.