How should I spec a storage system for big data apps?
The term big data covers two kinds of applications. The first is large-capacity applications, found in industries such as media and entertainment, oil and gas exploration, and life sciences; the second is analytics applications. For large-capacity apps, there are two primary issues: the bandwidth needed to transfer the large files typical of these environments, and the ability to support a large number of files without access slowing as the file count grows. Bandwidth should be specified in two ways: the bandwidth for a single file transfer, and the aggregate bandwidth required to transfer multiple files at once. Very high-capacity environments (hundreds of terabytes or more) may hold hundreds of thousands of files. With some file systems, access slows as the number of files increases because of the table structures they use. When choosing a file system, whether it runs on servers with SAN-attached storage or on scale-out NAS, it's important to know the maximum number of files it supports without a performance impact.
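To make the two bandwidth figures concrete, here is a rough sizing sketch in Python. The function name, the parameters, and the example workload (50 GB files, a 100-second transfer target, eight concurrent transfers) are illustrative assumptions, not figures from any particular environment, and the arithmetic ignores protocol overhead.

```python
def required_bandwidth(file_size_gb, target_seconds, concurrent_transfers):
    """Return (single_stream_MBps, aggregate_MBps) for a hypothetical workload.

    Uses decimal units (1 GB = 1000 MB) and assumes every transfer
    runs at the same rate with no protocol overhead.
    """
    if target_seconds <= 0 or concurrent_transfers < 1:
        raise ValueError("target_seconds must be > 0 and concurrent_transfers >= 1")
    # Bandwidth one file transfer needs to finish within the target time.
    single_stream = file_size_gb * 1000 / target_seconds
    # Bandwidth the system must sustain when all transfers run together.
    aggregate = single_stream * concurrent_transfers
    return single_stream, aggregate


# Assumed example: 50 GB files, 100 s target, 8 transfers at once.
single, total = required_bandwidth(50, 100, 8)
print(f"single stream: {single:.0f} MB/s, aggregate: {total:.0f} MB/s")
```

A storage system that meets the single-stream number but not the aggregate one will look fine in a one-file test and fall short under a realistic load, so both figures belong in the spec.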
The term big data is more commonly applied to analytics, and Hadoop is often mentioned as a big data analytics platform. For these apps, the emphasis is on fast, (nearly) real-time analysis of information. From a storage system standpoint, the first concern should be the response time an analytics user experiences. Response time should be measured as the total time required to transfer data to and from the storage device. Measurements that exercise only the cache in the storage system should be ignored, because the real work is storing to and retrieving from the device; many IOPS figures are calculated using cache. The capacity and throughput (sustained bandwidth) requirements are also important.
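As a minimal sketch of timing a write through to the device rather than to a cache, the Python below times each write together with an `os.fsync` call. The helper names, the 1 MiB probe size, and the sample count are assumptions for illustration. Note that `fsync` only removes the host's page cache from the measurement; a storage controller with its own write cache may still acknowledge the write early, so treat the result as a lower bound on device response time.

```python
import os
import statistics
import tempfile
import time


def timed_write_sync(path, payload):
    """Time one write, including the flush from the OS cache to storage."""
    start = time.perf_counter()
    with open(path, "wb") as f:
        f.write(payload)
        f.flush()               # push Python's buffer to the OS
        os.fsync(f.fileno())    # ask the OS to push its cache to the device
    return time.perf_counter() - start


def median_write_latency(directory, size_bytes=1 << 20, samples=5):
    """Median seconds per synced write of size_bytes in directory."""
    payload = os.urandom(size_bytes)
    timings = []
    for i in range(samples):
        path = os.path.join(directory, f"probe_{i}.bin")
        timings.append(timed_write_sync(path, payload))
        os.remove(path)         # clean up each probe file
    return statistics.median(timings)


# Demo against a temporary directory; point it at the storage under test.
with tempfile.TemporaryDirectory() as probe_dir:
    latency = median_write_latency(probe_dir)
    print(f"median synced 1 MiB write: {latency * 1000:.2f} ms")
```

Reads need the same care: a file that was just written is likely to be served from cache, so a fair read test uses data larger than the cache or data that has not been touched recently.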
This was first published in October 2012