This content is part of the Essential Guide: An expert's guide to big data storage architecture

How should I spec a storage system for big data apps?

Analyst John Webster explains the difference between big data apps for large-capacity applications and analytics applications.

The term big data covers two kinds of applications. One is large-capacity applications used in industries such as media and entertainment, oil and gas exploration, and life sciences; the other is analytics applications. For large-capacity apps, there are two primary storage issues: the bandwidth needed to transfer the large files typically found in these environments, and the ability to support a large number of files without access slowing as the file count increases. Bandwidth should be measured two ways: the bandwidth for a single file transfer and the aggregate bandwidth required to transfer multiple files concurrently. In very high-capacity environments (at least hundreds of terabytes), there may be hundreds of thousands of files. With some file systems, access slows as the number of files increases because of the table structures they use. When choosing a file system, whether on servers with SAN-attached storage or on scale-out NAS, it's important to know the maximum number of files it supports without a performance impact.
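As a rough illustration of the two bandwidth measures described above, the following sketch estimates the single-file and aggregate bandwidth a large-capacity workload might need. The file size, transfer window and number of concurrent transfers are hypothetical values chosen for the example, and the function names are illustrative rather than part of any tool.

```python
def single_file_bandwidth_mb_s(file_size_gb: float, window_s: float) -> float:
    """Bandwidth (MB/s) needed to move one file within a time window.

    Uses decimal units (1 GB = 1000 MB), an assumption made for simplicity.
    """
    if window_s <= 0:
        raise ValueError("window_s must be positive")
    return file_size_gb * 1000 / window_s


def aggregate_bandwidth_mb_s(per_file_mb_s: float, concurrent_transfers: int) -> float:
    """Total bandwidth (MB/s) when several such transfers run at the same time."""
    return per_file_mb_s * concurrent_transfers


# Hypothetical workload: 50 GB files, each moved in 100 seconds, 8 at a time.
single = single_file_bandwidth_mb_s(file_size_gb=50, window_s=100)
total = aggregate_bandwidth_mb_s(single, concurrent_transfers=8)

print(f"Single-file bandwidth: {single:.0f} MB/s")
print(f"Aggregate bandwidth:   {total:.0f} MB/s")
```

A system that meets the single-file figure can still fall short on the aggregate one, which is why both numbers belong in the spec.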

The term big data is more commonly used for analytics, and Hadoop is often mentioned as a big data analytics platform. For these apps, the emphasis is on fast, (nearly) real-time analysis of information. From a storage systems standpoint, the first concern should be the response time an analytics user experiences. Response time should be measured as the total time required to transfer data to and from the storage device. Measurements that exercise only the storage system's cache should be ignored, because the real work is storing to and retrieving from the device; many IOPS figures are calculated using cache. The other requirements, capacity and throughput (sustained bandwidth), are also important.
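To see why cache-only measurements can mislead, consider a minimal sketch that models average response time as a weighted mix of cache hits and device accesses. This is a simplified textbook model, not a method from the answer above, and the latency values and hit ratios are hypothetical.

```python
def effective_response_ms(cache_hit_ratio: float, cache_ms: float, device_ms: float) -> float:
    """Average response time when a fraction of requests is served from cache.

    Simplified model: every request is either a cache hit or a device access.
    """
    if not 0.0 <= cache_hit_ratio <= 1.0:
        raise ValueError("cache_hit_ratio must be between 0 and 1")
    return cache_hit_ratio * cache_ms + (1.0 - cache_hit_ratio) * device_ms


# Hypothetical latencies: 0.2 ms from cache, 8 ms from the storage device.
CACHE_MS = 0.2
DEVICE_MS = 8.0

for hit_ratio in (1.0, 0.5, 0.0):
    avg = effective_response_ms(hit_ratio, CACHE_MS, DEVICE_MS)
    print(f"hit ratio {hit_ratio:.0%}: average response {avg:.1f} ms")
```

Under these assumed numbers, a benchmark that runs entirely from cache reports a response time many times better than what users see once requests reach the device.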
