How should I spec a storage system for big data apps?
Big data apps fall into two broad categories. The first covers large-capacity applications used in industries such as media and entertainment, oil and gas exploration, and life sciences; the second covers analytics applications. For large-capacity apps, there are two primary issues: the bandwidth needed to transfer the large files typically found in these environments, and the ability to support a large number of files without slowing access as the file count grows. Bandwidth should be measured two ways: the bandwidth available for a single file transfer and the aggregate bandwidth required to transfer multiple files. Very high-capacity environments (hundreds of terabytes or more) may hold hundreds of thousands of files. With some file systems, access slows as the number of files increases because of the table structures they use. When choosing a file system, whether it runs on servers with SAN-attached storage or on scale-out NAS, it's important to know the maximum number of files it supports without a performance impact.
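The two bandwidth measures described above can be estimated with simple arithmetic. The sketch below is only an illustration: the function names, the 50 GB file size, the 100-second transfer target and the eight concurrent transfers are assumptions chosen for the example, not figures from any particular environment.

```python
# Rough sizing sketch. All names and numbers here are hypothetical,
# chosen only to illustrate single-file vs. aggregate bandwidth.

def single_file_bandwidth_mb_s(file_size_gb: float, target_seconds: float) -> float:
    """Bandwidth one transfer needs to move a file within the target time.

    Uses decimal units (1 GB = 1,000 MB).
    """
    if target_seconds <= 0:
        raise ValueError("target_seconds must be positive")
    return file_size_gb * 1000 / target_seconds


def aggregate_bandwidth_mb_s(per_transfer_mb_s: float, concurrent_transfers: int) -> float:
    """Bandwidth the storage system must sustain across simultaneous transfers."""
    if concurrent_transfers < 0:
        raise ValueError("concurrent_transfers must not be negative")
    return per_transfer_mb_s * concurrent_transfers


if __name__ == "__main__":
    # Assumed workload: 50 GB files, each to be moved in 100 seconds,
    # with 8 transfers running at the same time.
    per_file = single_file_bandwidth_mb_s(file_size_gb=50, target_seconds=100)
    total = aggregate_bandwidth_mb_s(per_file, concurrent_transfers=8)
    print(f"Per-file bandwidth:  {per_file:.0f} MB/s")
    print(f"Aggregate bandwidth: {total:.0f} MB/s")
```

Under these assumed numbers, each transfer needs 500 MB/s and the system needs 4,000 MB/s in aggregate. A system that meets one figure does not necessarily meet the other, which is why both are worth checking against a vendor's specifications.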
The term big data is more commonly used for analytics, and Hadoop is often mentioned as a big data analytics platform. For these apps, the emphasis is on fast, (nearly) real-time analysis of information. From a storage systems standpoint, the first concern should be the response time an analytics user experiences. Response time should be measured as the total time required to transfer data to and from the storage device. Measurements that exercise only the cache in the storage system should be ignored; the real work is storing data to and retrieving it from the device. Many IOPS calculations are completed using cache, so they can understate the response time users will actually see. Capacity and throughput (sustained bandwidth) requirements are also important.
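To see why cache-only figures can mislead, consider a simple weighted-average model of response time. This is a back-of-the-envelope sketch, not a benchmark method: the function name, the 0.2 ms cache latency, the 8 ms device latency and the 30% hit ratio are all assumed values used for illustration.

```python
# Simple model: average response time as a mix of cache hits and device
# accesses. The latencies and hit ratios below are hypothetical.

def effective_response_ms(hit_ratio: float, cache_ms: float, device_ms: float) -> float:
    """Average response time when a fraction `hit_ratio` of requests
    is served from cache and the rest goes to the storage device."""
    if not 0.0 <= hit_ratio <= 1.0:
        raise ValueError("hit_ratio must be between 0 and 1")
    return hit_ratio * cache_ms + (1.0 - hit_ratio) * device_ms


if __name__ == "__main__":
    cache_ms = 0.2   # assumed latency of a request served from cache
    device_ms = 8.0  # assumed latency of a request served from the device

    # A cache-only test behaves as if every request were a hit.
    cache_only = effective_response_ms(1.0, cache_ms, device_ms)

    # An analytics scan over a large data set may rarely hit cache;
    # 30% is an assumed hit ratio for illustration.
    mostly_device = effective_response_ms(0.3, cache_ms, device_ms)

    print(f"Cache-only measurement: {cache_only:.2f} ms")
    print(f"30% hit ratio:          {mostly_device:.2f} ms")
```

With these assumed values, the cache-only figure is 0.2 ms while the mixed workload averages about 5.66 ms, more than 25 times slower. The lower the hit ratio, the closer the response time users see gets to the device latency itself.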
Related Q&A from John Webster
John Webster describes how changes to HDFS and the NameNode can help to improve Hadoop infrastructure.
Analyst John Webster details issues with Hadoop architecture and what users can expect from Hadoop Version 2.0.
Understanding big data analytics, and how it differs from data warehousing, depends on time to information, content complexity and cost.