There are two sides to the big data story: the more familiar one involves analytics using vast numbers of small files, but dealing with big file storage is another issue.
Much of the discussion around big data analytics involves dealing with extensive data sets that typically comprise thousands or millions of smaller data objects gleaned from sources such as Web traffic, transactional databases or machine sensor output. But there's another side to the big data discussion: rather than focusing on analytics using huge numbers of smaller files, the processes involved require the handling and manipulation of much larger files. Use cases include "big data archive" and similar applications, and some of the unique characteristics of big files warrant special consideration when it comes to storage system design.
Big file data defined
Typically, big file data involves some kind of images or video, with the most common example being digital content such as movies and television. The production processes used to create those assets generate some very large files, but it's not just the finished product that consumes so much storage. There are usually multiple variations of the raw footage created for different viewing platforms and consumer markets. And those file sizes just keep growing as each new technology -- HD, 3D, 4K and so on -- increases image resolution. The use of video surveillance has dramatically expanded with the availability of Web cameras and inexpensive video-processing gear. And as with motion pictures, the file sizes for these videos are directly related to their length, which for video surveillance can be hours' or days' worth of recordings. The resolution of many of these cameras compounds the issue and causes file sizes to swell significantly, with much the same effect that megapixel smartphone cameras had on personal data storage.
Satellite-based remote sensing and aerial photography are other examples of growing applications that create some enormous files, in both optical (pictures) and multispectral imagery. And with each new generation of satellites, the resolution increases, driving up file sizes. A similar example is scientific data from large sensor arrays and radio telescopes. When operational in the next decade, the Square Kilometre Array, a project that combines many radio antennas to reach a total collecting area of about one square kilometer, is expected to create 1 PB of new data each day.
Is big file data really a big deal?
The immovable object. When a storage system hits a point where it simply won't scale any larger, or it becomes so bottlenecked that access times and throughput are unacceptable, it's usually time to migrate to a new system. But with a big file data application, a migration may be nearly impossible. Few businesses or organizations have enough downtime to move petabytes of data, especially when new data is still flowing into the system. Like the proverbial "immovable object," large file archives can get so big that they become unmanageable within a traditional infrastructure.
Like the foundation of a building, these infrastructures are hard to change once they're set up and put into use. For this reason, big file data storage infrastructures must be designed for maximum flexibility, with the ability to be upgraded as nondisruptively as possible while the data stays in place.
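To put rough numbers on the migration problem, here is a minimal back-of-the-envelope sketch. The function name, the decimal units and the example figures (a 5 PB archive, a 1 GBps copy rate, 20 TB of new data per day) are illustrative assumptions, not measurements from any real system.

```python
SECONDS_PER_DAY = 86_400


def migration_days(dataset_tb, copy_gb_per_s, ingest_tb_per_day=0.0):
    """Estimate the days needed to copy an archive while new data keeps arriving.

    Assumes decimal units (1 TB = 1,000 GB) and a copy rate that is
    sustained around the clock, which is optimistic in practice.
    """
    copy_tb_per_day = copy_gb_per_s * SECONDS_PER_DAY / 1_000
    # Incoming data eats into the copy rate: only the difference shrinks the backlog.
    net_tb_per_day = copy_tb_per_day - ingest_tb_per_day
    if net_tb_per_day <= 0:
        return float("inf")  # the migration never catches up
    return dataset_tb / net_tb_per_day


if __name__ == "__main__":
    # Hypothetical 5 PB archive copied at a sustained 1 GBps.
    print(round(migration_days(5_000, 1.0), 1))        # idle system
    print(round(migration_days(5_000, 1.0, 20.0), 1))  # with 20 TB/day of new data
```

Even under these generous assumptions the copy runs for roughly two months on an idle system, and longer once ingest is factored in, which is why upgrading in place matters.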
The long haul. Keeping data for a long time isn't unique to big file data applications, but when each file adds another 10 GB (or 100 GB or more), retention quickly becomes an issue. Data retention isn't related to file size per se, but many of the files that people and companies want to keep are image based. Digital content such as video and audio is a good example, as are digital snapshots (Shutterfly maintains nearly 100 PB of photo data) and video surveillance files.
Long-term retention has historically been driven by regulatory compliance, but now data is just as likely to be kept for its possible reuse or for security. A good example is surveillance videos. Historically archived for legal reasons, these files are now being used to help analyze customers' shopping behavior. Storing this type of data for extended, often open-ended, timeframes creates operational cost issues. Maintaining the disk space to keep tens or hundreds of terabytes for years isn't trivial, but it's nothing compared to the power and floor space required to support petabytes of data on even the lowest cost disk available.
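A quick calculation shows how fast open-ended retention adds up. This is only a sketch: the ingest rate of 100 files per day and the five-year horizon are hypothetical, and the function assumes nothing is ever deleted.

```python
def retained_pb(files_per_day, gb_per_file, years):
    """Capacity (in decimal PB) accumulated when every file is kept indefinitely."""
    total_gb = files_per_day * gb_per_file * 365 * years
    return total_gb / 1_000_000  # 1 PB = 1,000,000 GB


if __name__ == "__main__":
    # Hypothetical workload: 100 new files a day, retained for five years.
    for size_gb in (10, 100):
        print(size_gb, "GB files:", retained_pb(100, size_gb, 5), "PB")
```

With 10 GB files that workload lands just under 2 PB; at 100 GB per file it exceeds 18 PB, before any copies for protection are counted.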
Human consumption. In a lot of big data analytics apps, computers perform the analysis, so data is often stored in the same data center that houses the database servers or in the same servers themselves, as with Hadoop clusters. But in big file data use cases, the data is often analyzed by people -- and people don't live in data centers. When that human "processing engine" wants to consume data on a tablet at home or a smartphone on the road, the storage infrastructure must deliver that data appropriately.
Most of the files are consumed in order, so they need to be streamed (often through a low-bandwidth connection), and can't be chopped up and reassembled upon delivery. To support that kind of consumption pattern, many big file data repositories need a random-access storage tier that can quickly send enough content to get the streaming process started, and then buffer the rest of the file. But that disk storage tier must be very large and very scalable, since it has to contain the first portions of the files in a very large archive and keep up when that archive grows.
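One way to reason about the size of that tier is to ask how much of each file must sit on disk to cover the delay before the rest can be fetched from slower storage. The sketch below assumes a fixed stream bit rate and a fixed recall latency, and assumes the slower tier can keep up with the stream once it starts delivering. The figures (8 Mbps, 120 seconds, one million files) are hypothetical.

```python
def head_size_mb(stream_mbps, recall_latency_s):
    """MB of each file to keep on disk so playback can start during the recall."""
    # Convert megabits per second to megabytes per second, then cover the wait.
    return stream_mbps / 8 * recall_latency_s


def head_tier_tb(file_count, stream_mbps, recall_latency_s):
    """Total disk capacity (decimal TB) needed to hold the head of every file."""
    return file_count * head_size_mb(stream_mbps, recall_latency_s) / 1_000_000


if __name__ == "__main__":
    print(head_size_mb(8, 120), "MB per file")
    print(head_tier_tb(1_000_000, 8, 120), "TB for one million files")
```

The per-file head is small, but multiplied across a large archive the tier grows linearly with the file count, which is the scaling pressure described above.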
Designing big file data storage systems
To address the special challenges of big file data, storage infrastructures must be designed with care. For example, the immovable object challenge dictates that maximum flexibility must be designed in. But more than that, the architecture should allow the storage system to scale while the data remains in place, using a modular building-block approach. Much more than just scale-out storage, these systems will typically include multiple types of storage in different modules or nodes, along with a global file system. They'll store the longest-term data on tape and add disk storage nodes and processing nodes as needed to scale the system to the right "shape" for the application. This mix of storage may include high-performance disk, high-capacity disk and flash in different combinations.
As an example, Silicon Graphics International (SGI) Corp.'s DMF can scale both horizontally and vertically, meaning more processing nodes can be added in parallel (scale out) to support greater performance, and capacity can be added in higher density storage devices (scale up) to keep costs down. This modular architecture also includes the back end, with parallel data mover nodes able to provide fast file access into and out of a repository. The result is a storage infrastructure that may look different to each organization that implements it, and can be modified to support hardware upgrades and new generations of storage media -- all without disrupting the workflow or physically moving the data set.
One large government agency doing weather analysis has experienced relentless data growth in a system over the past 20 years. Currently managing more than 60 PB, the system must move over 300 TB per day on the back end, while providing more than 100 GBps file access to NFS-attached clients. Its DMF-based system was able to scale incrementally to the current configuration of 52 1U edge server nodes, each providing 2 GBps of NFS throughput on the front end. Six 1U parallel mover nodes on the back end provide 60 GBps over Fibre Channel to manage the data movement between the multiple tiers of storage. In this way, the infrastructure is optimized to reduce cost and can be scaled to accommodate increased performance requirements over time.
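The figures in this example can be cross-checked with simple arithmetic. The sketch below uses only the numbers quoted above and assumes that "GBps" means gigabytes per second in decimal units. Comparing the average daily load with the back-end bandwidth is an added interpretation, not something the agency reported.

```python
SECONDS_PER_DAY = 86_400

EDGE_NODES = 52             # 1U edge server nodes on the front end
NFS_GB_PER_S_PER_NODE = 2   # NFS throughput per edge node
MOVER_NODES = 6             # 1U parallel mover nodes on the back end
BACKEND_GB_PER_S = 60       # aggregate Fibre Channel bandwidth
BACKEND_TB_PER_DAY = 300    # data moved per day on the back end

# Aggregate front-end bandwidth: consistent with "more than 100 GBps".
front_end_gb_per_s = EDGE_NODES * NFS_GB_PER_S_PER_NODE

# Share of the back-end bandwidth carried by each mover node.
per_mover_gb_per_s = BACKEND_GB_PER_S / MOVER_NODES

# 300 TB per day expressed as an average rate in GB per second.
avg_backend_gb_per_s = BACKEND_TB_PER_DAY * 1_000 / SECONDS_PER_DAY

# Fraction of back-end bandwidth the average daily load would use.
avg_utilization = avg_backend_gb_per_s / BACKEND_GB_PER_S

if __name__ == "__main__":
    print(front_end_gb_per_s, "GBps front end")
    print(per_mover_gb_per_s, "GBps per mover node")
    print(round(avg_backend_gb_per_s, 2), "GBps average back-end load")
    print(f"{avg_utilization:.1%} average back-end utilization")
```

The average back-end load works out to only a few gigabytes per second, so most of the 60 GBps would be headroom for bursts rather than steady-state traffic, if the daily figure is spread evenly.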
Tape has a role in big file data systems
The sheer size of big file data repositories, and the fact that much of the data they hold has no expiration date, dictates the use of tape. There's simply no other way to store the hefty volumes of data for that duration economically. The ongoing cost of power, cooling and floor space makes even low-power disk storage a non-starter in the multiple-petabyte domain of big file data.
In addition to its economics, tape has some enviable performance characteristics, especially when dealing with big files and the need to stream data into and out of the repository. LTO-6, for example, provides 160 MBps of native file throughput. When managed correctly, with adequate random access storage (disk and flash) on the front end, tape can be a very effective storage medium for big file data.
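The 160 MBps native rate translates into concrete streaming times and drive counts. The following is a rough sketch that ignores mount, load and seek time, compression, and drive contention. The 100 GB file and the 300 TB per day workload are example inputs, not a recommendation for any particular configuration.

```python
import math

SECONDS_PER_DAY = 86_400
LTO6_NATIVE_MB_PER_S = 160  # native (uncompressed) rate quoted above


def stream_seconds(file_gb, drive_mb_per_s=LTO6_NATIVE_MB_PER_S):
    """Seconds to stream one file from a single drive once the tape is positioned."""
    return file_gb * 1_000 / drive_mb_per_s


def drives_needed(tb_per_day, drive_mb_per_s=LTO6_NATIVE_MB_PER_S):
    """Minimum number of drives streaming nonstop to sustain a daily volume."""
    required_mb_per_s = tb_per_day * 1_000_000 / SECONDS_PER_DAY
    return math.ceil(required_mb_per_s / drive_mb_per_s)


if __name__ == "__main__":
    print(stream_seconds(100) / 60, "minutes for a 100 GB file")
    print(drives_needed(300), "drives to sustain 300 TB per day")
```

A 100 GB file streams in a little over 10 minutes, and a few dozen drives can absorb hundreds of terabytes a day, provided the disk and flash front end keeps them streaming.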
LTFS makes tape appealing for big file data
The Linear Tape File System (LTFS) is the file-aware, open format developed by the LTO consortium that enables cross-platform compatibility of LTO tapes. By creating an index partition on each tape, LTFS allows the files on tape (a linear medium) to be searched in a random-access fashion, like disk. To be clear, those files would still need to be transferred linearly, but LTFS greatly improves tape "searchability." It also makes each tape cartridge "self-describing," unhooking the files stored on tape from archive software in general and from the software or platform used to create it in the first place.
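The idea of an index that makes a linear medium searchable can be illustrated with a toy model. This is a conceptual sketch only: the class and field names are invented for illustration, and it does not reproduce the actual LTFS on-tape format or any LTFS software interface.

```python
from dataclasses import dataclass, field


@dataclass
class Extent:
    """Where a file lives on the toy tape."""
    start_block: int
    block_count: int


@dataclass
class ToyCartridge:
    """A self-describing cartridge: the index travels with the data."""
    index: dict = field(default_factory=dict)  # stands in for the index partition
    data: list = field(default_factory=list)   # stands in for the linear data area

    def write(self, name, blocks):
        # Data is appended linearly; the index records where it landed.
        self.index[name] = Extent(len(self.data), len(blocks))
        self.data.extend(blocks)

    def locate(self, name):
        # Lookup touches only the index, not the data area.
        return self.index[name]

    def read(self, name):
        # The file's blocks are still read in order from its starting position.
        extent = self.locate(name)
        return self.data[extent.start_block:extent.start_block + extent.block_count]


if __name__ == "__main__":
    tape = ToyCartridge()
    tape.write("a.mov", [b"a1", b"a2"])
    tape.write("b.mov", [b"b1"])
    print(sorted(tape.index))   # list files without scanning the data
    print(tape.locate("b.mov"))
    print(tape.read("b.mov"))
```

Because the index is stored on the cartridge itself, any system that understands the format can list and find files without consulting the archive software that wrote them.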
Historically, tape-based archives have been somewhat complex. Big files were stored in file systems but tape drives weren't "file aware," so these infrastructures required special archive software that would provide a file-based interface to users and applications on the front end but could still talk tape to the tape drives on the back end. Now LTFS, an open tape file format, simplifies tape access and improves its flexibility.
Spectra Logic recently released an interface called Deep Simple Storage Service (DS3) that takes the Amazon Simple Storage Service (S3) interface and adds commands for sequential data movement and removable media support. The result is a REST-based interface to tape, making it directly accessible to applications designed to communicate over the Internet and to systems designed around Amazon S3 APIs. DS3 also makes it easy to connect tape to object storage systems, which use REST APIs as well.
BlackPearl is Spectra's DS3-enabled appliance that provides a solid-state storage cache to clients on the front end while handling the direct connection to tape drives on the back end. It also manages data security and the long-term integrity of data on tape, plus conversion to the LTFS format.
But big file data storage must be more than a tape library and a disk cache. Given the access requirements of these large files, object storage systems are coming into favor because they can scale into petabytes more efficiently than traditional RAID-based storage systems.
Object storage offers advantages
Big file data storage systems are now being built with multiple disk tiers to support real-time analytics and fast access, along with a tape tier for long-term archive. Object storage systems, like Quantum Corp.'s Lattus, can provide the affordable capacity required to front-end even very large tape-based big file repositories and support the global availability use cases common to the motion picture and broadcast industries. Using sophisticated analytics, these companies can "geo-spread" the files that need to be the most accessible and can "batch migrate" them to the appropriate storage tiers even before they're needed.
Quantum's StorNext file system and data management system are built with a modular architecture that allows metadata processing to scale independently from storage capacity. This allows users to add metadata engines on the back end to support faster access to data on tape or on object storage, and to scale the file system on the front end to support more users and different platforms.
Big file data bottom line
Big file data creates some special storage challenges. Maintaining a storage system that can hold a mountain of data and still provide decent throughput and access performance is all but impossible using traditional infrastructures. These systems must be extremely flexible, able to support multiple types of random-access storage and tape, and allow users to upgrade and modify them while leaving the data in place. They also need a highly scalable, economical disk tier, like object storage, to support streaming of these large files to small devices over skinny pipes.
Big file archives must include tape and should also write data in open formats, such as LTFS, to support content processing workflows that involve multiple users on different platforms. Tape should use RESTful interfaces instead of the proprietary archive software layers of the past. This will allow tape to be directly connected to apps, object storage systems and even to the Internet, which most big file data repositories will be supporting.
About the author:
Eric Slack is an analyst at Storage Switzerland, an IT analyst firm focused on storage and virtualization.