The term big data typically describes two types of data: data from sensors attached to equipment, and even living things, consisting of millions of small log files; and data generated from rich media, comprising fewer, but much larger files. The first data type is analyzed as part of a big data analytics project, while the second is part of a big data archive project.
Both data types demand enormous amounts of capacity and have IT professionals scrambling to figure out the best way to store big data. In general, most organizations settle on a scale-out data lake as the repository. The next decision is where this data lake will reside: on-premises (private), in a public cloud or in a hybrid infrastructure. When making this decision, IT professionals need to consider the cost to store big data in the cloud and the cost to apply compute resources to it.
The cost to store big data in the cloud
Big data gets its name because of the amount of capacity it consumes. Analytics data comprises millions or billions of files that are individually small, but combined can consume petabytes (PB) of storage. Video and media information is typically smaller in file count, but significantly larger in size per file and can also consume petabytes of capacity. Servicing user requests quickly requires most of this data to be on a storage system with decent performance -- a hard disk drive-based scale-out storage system. Again, that system can be located on-premises, in the cloud or a mixture of both (hybrid).
The cloud has immediate appeal thanks to its very low startup costs and periodic billing. In addition, an organization does not have to factor in power, cooling and data center floor space. However, the regular billing of petabytes of information over the course of many years can become very expensive. In most cases, even after factoring in soft costs, an organization can store big data less expensively on-premises if its capacity demands are greater than 1 PB. Most organizations with capacity requirements of this size have already invested in an IT infrastructure and the associated personnel and processes to support the effort.
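The trade-off between low startup cost and recurring billing can be sketched as a simple cumulative-cost model. Every figure below is a made-up placeholder rather than a quoted market price, and the function and parameter names are hypothetical. The point is only the shape of the two curves: cloud cost starts at zero and grows with every bill, while on-premises cost starts high and then grows more slowly.

```python
# All prices are ASSUMED placeholders for illustration, not real quotes.
ASSUMED = {
    "price_per_tb_month": 20,   # cloud storage, $ per TB per month
    "capex_per_tb": 300,        # on-premises hardware, $ per TB, paid up front
    "opex_per_tb_month": 5,     # power and cooling, $ per TB per month
    "fixed_per_month": 10_000,  # staff and facilities, $ per month
}


def cloud_cost(capacity_tb, months, price_per_tb_month):
    """Cumulative pay-as-you-go bill: the same charge recurs every month."""
    return capacity_tb * price_per_tb_month * months


def on_prem_cost(capacity_tb, months, capex_per_tb, opex_per_tb_month,
                 fixed_per_month):
    """Up-front purchase plus recurring costs.

    fixed_per_month stands in for soft costs that do not shrink with
    capacity, so it weighs most heavily on small deployments.
    """
    upfront = capacity_tb * capex_per_tb
    recurring = (capacity_tb * opex_per_tb_month + fixed_per_month) * months
    return upfront + recurring


def first_month_cloud_costs_more(capacity_tb, horizon_months, a=ASSUMED):
    """Return the first month in which the cumulative cloud bill exceeds
    the cumulative on-premises cost, or None if that never happens
    within the horizon."""
    for month in range(1, horizon_months + 1):
        cloud = cloud_cost(capacity_tb, month, a["price_per_tb_month"])
        local = on_prem_cost(capacity_tb, month, a["capex_per_tb"],
                             a["opex_per_tb_month"], a["fixed_per_month"])
        if cloud > local:
            return month
    return None


small = first_month_cloud_costs_more(500, 60)    # 500 TB over five years
large = first_month_cloud_costs_more(2000, 60)   # 2 PB over five years
```

With these placeholder inputs, the 500 TB case never tips in favor of on-premises within five years, while the 2 PB case does. Where the crossover actually falls depends entirely on real quotes and real soft costs.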
The cost to compute big data
When determining where to store big data, administrators need to consider how compute will be applied to it. When an analytics job needs to process a data set, the goal is to process that data as fast as possible, provide an answer and then wait for the next request. The cloud has a tremendous advantage in this situation because computing resources can be scaled up and down far more efficiently in the cloud than on-premises. The cloud is an ideal location for CPU resources because the amount of compute consumed will vary widely and the cloud can respond to those changes dynamically. More importantly, an organization is billed for those resources only when they are used. In comparison, storage is a constant -- data always needs to be stored, can rarely be scaled down and often needs to be retained for very long periods of time, so the cloud provider has to bill for that resource consumption continuously.
The public cloud is so effective at providing scalable compute that its advantages can offset the extra cost it may take to store big data in the cloud.
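A rough way to see why bursty analytics favors elastic compute is to compare paying per node-hour used against owning a cluster sized for the busiest hour. This is a minimal sketch with hypothetical function names and an invented demand profile; the rates are placeholders, not provider prices.

```python
def elastic_compute_cost(hourly_demand, price_per_node_hour):
    """Pay only for the nodes actually used in each hour."""
    return sum(hourly_demand) * price_per_node_hour


def fixed_cluster_cost(hourly_demand, cost_per_node_hour):
    """An owned cluster is sized for the peak hour and costs the same
    whether it is busy or idle."""
    return max(hourly_demand) * len(hourly_demand) * cost_per_node_hour


def utilization(hourly_demand):
    """Fraction of a peak-sized cluster's node-hours that do real work."""
    peak = max(hourly_demand)
    if peak == 0:
        return 0.0
    return sum(hourly_demand) / (peak * len(hourly_demand))


# ASSUMED demand: one two-hour analytics burst in an eight-hour window.
demand = [0, 0, 100, 100, 0, 0, 0, 0]

# ASSUMED rates: the cloud node-hour is priced at 3x the owned node-hour.
cloud_bill = elastic_compute_cost(demand, price_per_node_hour=3)
owned_bill = fixed_cluster_cost(demand, cost_per_node_hour=1)
busy_share = utilization(demand)
```

In this invented profile the owned cluster is busy only a quarter of the time, so the elastic option comes out cheaper even at three times the hourly rate. A steadier workload would narrow or reverse that gap.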
Storing big data in the cloud: The hybrid twist
The traditional hybrid cloud storage model treats the cloud as a storage tier. Data is stored and processed locally, and moves to the cloud as it ages. The problem with this approach is that it assigns each side the job it is worst at: the data center handles compute, which it cannot scale easily, and the cloud handles long-term storage, which it bills for continuously. However, there are two emerging hybrid cloud models that leverage the strengths of the data center and the cloud:
- Direct-to-compute cloud. In this model, the organization owns the storage and has a direct connection to the compute at the cloud provider. The organization's storage is essentially across the street from the public cloud provider. The organization can use scalable compute but still has instant access to the data storage it owns.
- Cached-to-cloud. This model uses the caching technology standard in traditional hybrid deployments, but uses it in reverse. The most active data is cached to the cloud so the organization can take advantage of public cloud compute to process data, but the actual data is in the organization's private data center.
Both of these hybrid approaches provide an organization with the flexibility to use the strengths of the data center and the strengths of the public cloud.
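The cached-to-cloud idea can be illustrated with a small read-through cache. This is only a sketch: the article does not say which eviction policy such products use, so least-recently-used (LRU) is an assumption here, and the class name and the plain dictionary standing in for the on-premises store are hypothetical.

```python
from collections import OrderedDict


class CloudCache:
    """Read-through cache that keeps the most recently used objects near
    cloud compute while the full data set stays on-premises."""

    def __init__(self, origin, capacity):
        self.origin = origin          # stands in for the on-premises store
        self.capacity = capacity      # how many objects fit in the cloud cache
        self._cache = OrderedDict()   # ordered oldest to most recently used
        self.hits = 0
        self.misses = 0

    def read(self, key):
        if key in self._cache:
            # Served from the cloud-side copy; mark it as recently used.
            self._cache.move_to_end(key)
            self.hits += 1
            return self._cache[key]
        # Cache miss: fetch from the private data center over the link.
        self.misses += 1
        value = self.origin[key]
        self._cache[key] = value
        if len(self._cache) > self.capacity:
            # ASSUMED policy: evict the least recently used object.
            self._cache.popitem(last=False)
        return value

    def cached_keys(self):
        return list(self._cache)


on_prem = {"a": 1, "b": 2, "c": 3}
cache = CloudCache(on_prem, capacity=2)
for key in ["a", "b", "a", "c", "b"]:
    cache.read(key)
```

Every miss represents a trip back to the private data center, so the approach pays off when cloud compute jobs repeatedly touch a working set small enough to fit in the cache.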
Big data deployments in the cloud can become mired in decisions about which model to use. An organization with less than 1 PB of information will probably find that a 100% public cloud deployment is best. Beyond 1 PB, the decision gets more complicated. In general, the organization's on-premises data center needs to be part of the architecture, either by doing 100% of the processing on-premises or by using one of the hybrid models described earlier.
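As a starting point only, the 1 PB rule of thumb can be written down as a tiny helper. The function name is hypothetical, the threshold uses a decimal petabyte (1,000 TB), and treating exactly 1 PB as "beyond" is an assumption the article does not settle.

```python
PB_IN_TB = 1000  # decimal petabyte, an assumption about units


def suggest_deployment(capacity_tb, threshold_tb=PB_IN_TB):
    """Map total capacity to the article's rule-of-thumb starting point."""
    if capacity_tb < threshold_tb:
        return "public cloud"
    # At or above the threshold the on-premises data center should be
    # part of the architecture; which variant is a separate decision.
    return "on-premises or hybrid"
```

The helper deliberately stops at a coarse answer: above the threshold, choosing among full on-premises processing, direct-to-compute and cached-to-cloud still depends on workload and connectivity.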