Are solid-state drives necessary to perform analytics on big data efficiently?
To begin with, let me describe the three kinds of solid-state drive (SSD) deployment. The first is server-side cache, where an SSD is installed directly in the server. The second is storage-side cache, which I would characterize as Tier 0, where SSD forms a specific layer in the array and is used with automated storage tiering. The third is an all-SSD storage array.
Now, to answer the question directly: Are SSDs necessary to perform analytics on big data efficiently? The answer is no, but it depends on whether your environment is CPU-bound or I/O-bound. Analytics has two important components: processing and I/O. If you're CPU-bound, meaning the work falls very heavily on the processing side, then more I/O isn't going to buy you that much; you really need a faster processor. On the other hand, if you're reading huge amounts of data, whether you're pulling the same data in recursively or streaming through it with sequential reads, then you definitely could be I/O-bound, and SSD is certainly going to be helpful in performing big data analytics efficiently.
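One rough way to tell which side of that line a job falls on is to compare the CPU time it consumes with the wall-clock time it takes. The sketch below is only an illustration: the helper names `cpu_fraction` and `classify` and the 0.8 threshold are assumptions, not an established rule. It also measures only the current process, so a multithreaded job can report a fraction above 1.0, and a low fraction may mean waiting on the network rather than on storage.

```python
import time


def cpu_fraction(job):
    """Run job() and return CPU time used divided by wall-clock time elapsed."""
    wall_start = time.perf_counter()
    cpu_start = time.process_time()  # user + system CPU time of this process
    job()
    cpu_used = time.process_time() - cpu_start
    wall_used = time.perf_counter() - wall_start
    return cpu_used / wall_used if wall_used > 0 else 0.0


def classify(fraction, threshold=0.8):
    """Label a workload from its CPU fraction; the threshold is an assumed cutoff."""
    if fraction >= threshold:
        return "CPU-bound"
    return "I/O-bound (or otherwise waiting)"


if __name__ == "__main__":
    # Stand-in for an analytics job that mostly waits on storage.
    waiting_job = lambda: time.sleep(0.2)
    print(classify(cpu_fraction(waiting_job)))
```

A job that classifies as CPU-bound here is the case where faster storage is unlikely to help much; one that spends most of its time waiting is a candidate for a closer look at the I/O path.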
So if an environment is I/O-bound, the question is which SSD deployment is better. In many cases, if you're reading the same data over and over again, you're going to be better off with server-side cache or Tier 0. On the other hand, if you're reading huge amounts of data and the access is sequential rather than recursive, you may actually be better off with an all-SSD storage array, which delivers high performance across the entire data set rather than only for the cached portion.
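That rule of thumb can be written down as a small decision helper. This is a sketch under stated assumptions, not a sizing tool: the function name `suggest_deployment`, the idea of measuring reuse as total bytes read divided by unique bytes read, and the threshold of 2.0 are all hypothetical choices made for illustration. A real decision would also weigh cache capacity, cost and the array's tiering behavior.

```python
def suggest_deployment(io_bound, total_bytes_read, unique_bytes_read,
                       reuse_threshold=2.0):
    """Map a workload's read pattern to one of the SSD deployments discussed.

    reuse_threshold is an assumed cutoff: at or above it, the same data is
    considered to be read "over and over again".
    """
    if not io_bound:
        return "faster processor (more I/O is unlikely to help much)"
    if unique_bytes_read <= 0:
        raise ValueError("unique_bytes_read must be positive")

    # Average number of times each unique byte is read.
    reuse = total_bytes_read / unique_bytes_read
    if reuse >= reuse_threshold:
        return "server-side cache or Tier 0"
    return "all-SSD storage array"


if __name__ == "__main__":
    TB = 10**12
    # Same 1 TB read ten times over: heavy reuse.
    print(suggest_deployment(True, 10 * TB, 1 * TB))
    # 50 TB streamed through once: sequential, little reuse.
    print(suggest_deployment(True, 50 * TB, 50 * TB))
```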
About the author:
Phil Goodwin is a storage consultant and frequent TechTarget contributor.
This was first published in February 2014