Target audience will largely determine the selection of disk- or tape-based big data storage solutions in the media and entertainment (M&E) industry. Short clips are conducive to disk-based network-attached storage (NAS), while long-form video or film might work well with less-expensive tape, according to the chief technology officer at Wikibon.
But David Floyer, who also co-founded the Marlborough, Mass.-based Wikibon research and analysis firm, predicted that object-based storage will become an increasingly popular choice for media-rich big data because of the greater flexibility it affords.
In this interview with Carol Sliwa, a senior writer for TechTarget's Storage Media Group, Floyer also discussed the importance of metadata in delivering more fine-grained options to end users, the challenges associated with storing big data in the M&E industry and key points to consider when designing a storage environment for media-rich big data files.
How does an IT organization in the media and entertainment industry go about deciding what type of storage to use with big data?
David Floyer: It's going to fundamentally depend on the audience that you have paying for this particular service. If it's short clips that you're after, then obviously disk- and NAS-type systems are going to be the way to go because people will not want to wait a long period of time to get the data itself. If it is longer clips and whole films, then the choice moves much more toward tape because you can fill in that short gap at the beginning with another piece of information or an introduction or an advert or something like that, and 30 seconds or 60 seconds on the front of a film is not going to make much difference. And it's so much cheaper to hold it on tape.
You're looking at the convenience for the end user. The more metadata that you want to allow the user to be able to get at, and the more choice about what they can get at and look at and research, then of course that pushes it toward file-based systems -- though a tiered system, where the most popular clips are held on disk and less popular ones are held on tape, would be very satisfactory for most viewers if it was significantly cheaper to do that.
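As an illustration of the tiering idea above, the placement rule can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions, not a description of any product: the function name `assign_tiers`, the sample view counts and the `disk_slots` capacity are all hypothetical.

```python
def assign_tiers(view_counts, disk_slots):
    """Place the most-viewed clips on disk and the rest on tape.

    view_counts: dict mapping clip name -> recent view count (hypothetical data)
    disk_slots:  number of clips the disk tier can hold (hypothetical capacity)
    """
    # Rank clips from most to least popular; ties are broken by name
    # so the result is deterministic.
    ranked = sorted(view_counts, key=lambda clip: (-view_counts[clip], clip))
    return {
        clip: ("disk" if rank < disk_slots else "tape")
        for rank, clip in enumerate(ranked)
    }


views = {"trailer": 9500, "interview": 1200, "full_film": 300, "outtakes": 40}
placement = assign_tiers(views, disk_slots=2)
```

A real system would also weigh object size, retrieval cost and how quickly popularity changes; the sketch only shows the popularity cut-off.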
So, it's a question of knowing the audience, designing the system for the audience and then picking the best technologies to meet the end-user requirements. That's the way you should think about it. There's no right or wrong answer here. Tape is going to be excellent for an increasing number of [use cases], and obviously disk allows that greater flexibility and faster time to the first frame.
Which technology makes more sense for media-rich files: scale-out NAS or object-based storage?
Floyer: Scale-out NAS can support an object system, so the two are not incompatible. If you start with the object system and then put a file system on top, that's going to probably give you the greatest flexibility. And scale-out NAS systems are going in that direction anyway. So, how the underlying pieces of it are put together is important, and then it's equally important that you can access that in different ways.
For some people, accessing it with NFS is going to be absolutely OK and the right way of doing it. Others will want to see the underlying components and use that additional data in a lot of beneficial ways that could enhance the viewing experience. So, both are right. Both are good ways of doing it. But, increasingly, the underlying layer will be an object-based system.
Cost is of very, very significant importance here. If you're looking for future flexibility, then I think scale-out, fundamentally object-based systems are going to be the ones that provide the greatest flexibility and the greatest end-user value. And this is going to be true for both the media industry and for other uses of rich data as well.
It's no longer satisfactory just to think about it as a single file or even a series of clips of a file. You will want to add a lot of information. And the design of that is going to be much, much easier if you build that on an object-based system and put your file on top of that if you want to view it as a continuous stream. That to me is where it's going and where people will pay significantly more to view it in different ways that we can't even imagine today.
Why is media-rich data well suited to object storage?
Floyer: Traditionally, they've all been large files, and maybe they have been chopped up a bit to enable pieces to be taken out. But, fundamentally, it's been very large sequential files. What is increasingly important is that people can access and understand in more depth what is happening within this large file.
For example, people would like to know all of the times that Tom Hanks is on his own in a film. So, breaking it up into much smaller components, objects that have their own data about who is saying what or what words are being used, and having separate metadata around that large media file, those are becoming increasingly important to enable new ways that people can enjoy things, enjoy looking for particular scenes, enjoy comparing scenes across, for example, different films. So, object-based storage is beginning to make a very significant impact as the only effective way that you can break up that long file into smaller components. And also, there are some technical reasons why that's a help as well.
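The kind of query Floyer describes can be sketched as follows. This is an illustrative sketch only: the `SceneObject` schema, its field names and the sample timings are hypothetical, and a real object store would hold this metadata alongside the media objects rather than in a Python list.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SceneObject:
    """One object of a film plus its own metadata (hypothetical schema)."""
    sequence: int         # position of this object within the film
    start_s: float        # start time in seconds
    end_s: float          # end time in seconds
    on_screen: frozenset  # names of the people visible in this segment


def solo_segments(scenes, person):
    """Return (start, end) times of segments where only `person` is on screen."""
    alone = frozenset({person})
    return [(s.start_s, s.end_s) for s in scenes if s.on_screen == alone]


# Made-up sample data for a single film.
film = [
    SceneObject(0, 0.0, 42.0, frozenset({"Tom Hanks", "Co-star"})),
    SceneObject(1, 42.0, 95.5, frozenset({"Tom Hanks"})),
    SceneObject(2, 95.5, 130.0, frozenset({"Co-star"})),
    SceneObject(3, 130.0, 171.0, frozenset({"Tom Hanks"})),
]
alone = solo_segments(film, "Tom Hanks")
```

Because each object carries its own metadata, the query never has to open the media payload itself.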
Historically, one knock against object storage has been performance. Can object storage handle the big data of the media and entertainment industry from a performance standpoint?
Floyer: When you're using it as a file system in sequential mode, then the file system will reflect that and be organized such that it will automatically then link to the next one and the next one, and you'll have a sequence number within that, which will tell you where you are and look ahead to what you need to get. So, the traditional problems with object-based systems will go away. And you don't update a media file that much. You tend to have it as a sequential, in-place set of objects. So, for media, object systems are going to be fine for the most part. It's just that it gives that flexibility of additional information and additional access capability at a much finer level than the file-based systems can deliver.
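The sequence-number and look-ahead behavior described above can be illustrated with a short sketch. It assumes a plain Python dict standing in for the object store, keyed by a hypothetical `(title, sequence_number)` pair; the function name `read_sequential` and the `lookahead` parameter are likewise made up for the example.

```python
def read_sequential(store, title, lookahead=2):
    """Yield a title's objects in sequence order, prefetching a few ahead.

    store: dict mapping (title, sequence_number) -> bytes, standing in
           for an object store (an assumption for this sketch)
    """
    prefetched = {}
    seq = 0
    while (title, seq) in store:
        # Look ahead: fetch the current object and the next few
        # before the reader asks for them.
        for ahead in range(seq, seq + lookahead + 1):
            key = (title, ahead)
            if key in store and ahead not in prefetched:
                prefetched[ahead] = store[key]
        yield prefetched.pop(seq)
        seq += 1


store = {("film", i): f"chunk-{i}".encode() for i in range(5)}
stream = b"|".join(read_sequential(store, "film"))
```

Since media objects are rarely updated in place, the reader can rely on the sequence numbers alone to know what comes next.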
What features or capabilities of file-based and object-based storage systems are especially helpful for storing and managing media-rich files?
Floyer: There are several features that are useful in different areas. For example, if you want to cache media-rich data, then obviously you want to be able to have a system where you can distribute that cache and know what's there in different parts of the country, and you can point users at cached copies to reduce latency. So, caching is a very important feature.
Erasure coding is a very useful technology, particularly for these very, very large files, because it allows much lower overhead when spreading these large systems across multiple locations while still being able to recover. Using erasure coding and spreading a file across several different physical locations can be a very cost-effective way of reducing the number of copies and the management cost of those copies.
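The overhead argument can be made concrete with a little arithmetic. The sketch below assumes a maximum-distance-separable code such as Reed-Solomon, where a file is split into k data fragments plus m parity fragments and any k fragments suffice to rebuild it; the 10 + 4 layout and the three-copy baseline are illustrative choices, not figures from the interview.

```python
def replication_overhead(copies):
    """Raw bytes stored per byte of data when keeping full copies."""
    return float(copies)


def erasure_overhead(data_fragments, parity_fragments):
    """Raw bytes stored per byte of data for a k + m erasure code.

    The file is split into k data fragments and m parity fragments are
    added; any k of the k + m fragments are enough to rebuild the file.
    """
    return (data_fragments + parity_fragments) / data_fragments


three_copies = replication_overhead(3)   # tolerates the loss of 2 copies
ten_plus_four = erasure_overhead(10, 4)  # tolerates the loss of any 4 fragments
```

Under these assumptions, three full copies store 3 bytes for every byte of data, while the 10 + 4 layout stores 1.4 bytes and tolerates more simultaneous losses, at the price of extra computation and network traffic during a rebuild.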
When you come to tape, what's really important is that you can know what's on the tape and where things are on the tape. So, the introduction of the LTO-5, LTO-6 technologies and the introduction of LTFS [Linear Tape File System], the file system directory, on the tapes allows you to go much more quickly to the exact piece of information that's required. And these systems have led to a resurgence of tape as a way of holding large amounts of data that are relevant and being able to access them in a reasonable time. The cost of these tape systems is significantly lower -- five times, 10 times lower -- than the equivalent disk-based systems. So for people who want to watch a whole film, for example, waiting 30 seconds extra at the beginning of a tape really isn't an issue.