One of the best features of public cloud storage has been the ability to easily store large volumes of data without having to deal with the headache of managing the infrastructure that supports it. With the kind of exponential data growth that continues in most organizations, the business of managing storage can be a chaotic chore, coping with growth and infrastructure refreshes. Cloud storage is potentially a lifesaver for IT departments that want to get off the treadmill of managing infrastructure so they can focus on the data. But is cloud archiving really a unique service and could the effort expended to use a cloud archive be more than the benefits achieved?
Why you should archive
Before looking at the technical aspects of cloud-based archiving, it's worth discussing why there's a need to archive data in the first place. The most obvious reason is cost; production systems (databases, files and unstructured data) contain large amounts of inactive and rarely accessed information, all of which is sitting on expensive primary storage. Companies are looking to retain data "forever" -- or at least for a very long time -- on the assumption that there is some future value to be obtained from the content. In some instances, regulatory constraints require data to be kept for long periods of time (decades) or, in the case of medical record-keeping, for at least the lifetime of the patient. Moving data out of the production environment might also have a significant impact on other production costs, including savings on database licenses, smaller virtual machines or physical hosts where licenses are based on data volume.
There are also operational aspects to storing large volumes of data in primary systems; the bigger the system, the longer the backup/restore or recovery process, and the bigger the backups will be. There's no benefit to continually backing up data that never changes when it can be archived off primary systems and stored and protected in a different way. The performance of production systems can also be affected. There's much more overhead in accessing and storing data in a database with 100 million rows than in one with only 500,000, for example.
Why you should consider cloud archive
So the need for an archive can be clearly seen, but why choose the cloud as the target destination? There are a number of operational benefits that are inherent in using cloud services that make it an attractive destination for archive data. These include:
- Elasticity. The cloud storage provider takes the responsibility and associated headaches of making sure that the archive capacity is always available to meet demand. The customer simply consumes the resource on the assumption that there is an infinite capacity available. There's no need to think about data center space, power, cooling or other physical aspects.
- Abstraction. The customer doesn't need to know or care how the data is being stored in the cloud, only that the service is being delivered by the cloud provider at some agreed-upon service level. This means the data could be on disk, tape, optical or any combination. The cloud vendor takes the responsibility for managing and refreshing the storage media and associated infrastructure over time, as technology ages and needs replacing.
- Durability. Primary storage resilience is measured in terms of availability or how much uptime the system delivers. We typically see figures quoted of five, six or now seven 9s, meaning 99.999% uptime or better. In the archive world, measurement is based on durability with a lower level of availability as the data is assumed to be accessed less frequently, but needs to be there in 5, 10, 20 or 50 years' time. Amazon Web Services' S3 offering, for example, touts durability levels of 99.999999999% or "eleven 9s."
- Cost. The cost of cloud storage is predictable and based on access profile and volume of data stored (more on this later), making rebilling and accounting easier.
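To put the availability and durability figures in perspective, the following is a rough sketch of the arithmetic rather than any provider's official calculation. It assumes a durability figure can be read as an independent, per-object annual survival probability, which is a simplification, and the object count used is purely hypothetical.

```python
# Illustrative arithmetic only; the function names and inputs are hypothetical.
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes, ignoring leap years


def annual_downtime_minutes(availability_pct: float) -> float:
    """Maximum downtime per year implied by an availability percentage."""
    return MINUTES_PER_YEAR * (1 - availability_pct / 100)


def expected_objects_lost_per_year(object_count: int, nines: int) -> float:
    """Expected yearly object loss for a durability quoted as `nines` 9s.

    Assumption: durability is an independent, per-object annual survival
    probability (eleven 9s means a loss probability of 10^-11 per object).
    """
    annual_loss_probability = 10.0 ** -nines
    return object_count * annual_loss_probability


if __name__ == "__main__":
    downtime = annual_downtime_minutes(99.999)
    losses = expected_objects_lost_per_year(10_000_000, 11)
    print(f"Five 9s availability allows about {downtime:.2f} minutes of downtime a year")
    print(f"10 million objects at eleven 9s: about {losses:.4f} objects lost a year")
```

Under these assumptions, five 9s of availability allows a little over five minutes of downtime a year, while eleven 9s of durability on 10 million objects works out to an expected loss of one object roughly every 10,000 years.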
So cloud archive makes sense; the question is, how do IT operations teams get the data in and out in a way that meets operational requirements?
The cost of cloud archiving
The cost structure for cloud-based archive can be very different to on-premises and is typically based on the volume of data stored, plus a charge for recalling and accessing the data in the future.
Cloud archiving considerations
Probably the most obvious concern about a cloud archive is that of security. How will my data be secured both in-flight across the network and at rest once it reaches the service provider's data center? The in-flight issue is easily resolved, as data moving in and out of cloud archives is carried over HTTPS (HTTP secured with SSL/TLS), so data transferred across the public network is protected in flight.
Most providers now also offer the ability to encrypt data stored within their clouds. As an extra level of security, customers can provide their own encryption keys to be used by the provider to encrypt data on the customer's behalf. Alternatively, data can be encrypted before sending it to the cloud. The choice of encryption option is dictated by the risk profile of the customer; provider-based encryption may be sufficient, whereas compliance rules or flat-out paranoia may dictate using personal encryption keys. In that instance, the customer must maintain the keys for future data retrieval, which can be a significant effort if data is intended to be stored for many years.
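As one illustration of the customer-supplied key option, Amazon S3 accepts a customer key on each request through three HTTP headers (often referred to as SSE-C). The sketch below only builds those headers from a 256-bit key; the function name is hypothetical, it sends nothing to any service, and other providers use different mechanisms.

```python
import base64
import hashlib
import os


def customer_key_headers(key: bytes) -> dict:
    """Build the S3 request headers for a customer-provided AES-256 key.

    Hypothetical helper for illustration: it prepares headers only and
    does not perform any upload or encryption itself.
    """
    if len(key) != 32:
        raise ValueError("AES-256 requires a 32-byte key")
    return {
        "x-amz-server-side-encryption-customer-algorithm": "AES256",
        # The key itself, base64-encoded.
        "x-amz-server-side-encryption-customer-key": base64.b64encode(key).decode("ascii"),
        # Base64-encoded MD5 digest of the key, used as an integrity check.
        "x-amz-server-side-encryption-customer-key-MD5": base64.b64encode(
            hashlib.md5(key).digest()
        ).decode("ascii"),
    }


if __name__ == "__main__":
    key = os.urandom(32)  # in practice this key must be kept safe for as long as the data
    for name in customer_key_headers(key):
        print(name)
```

The same key has to be presented again on every retrieval, which is exactly why long-term key management becomes the customer's problem.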
Archives in the cloud = No hardware headaches
Cloud-based archiving removes the headaches associated with planning and maintaining large archives, such as regular hardware and data format refreshes.
The second issue to consider is that of performance, or how quickly data can be stored in and retrieved from the cloud. Depending on the type of connectivity in place, the latency or round-trip time to write data into the cloud could be as high as 20 to 30 milliseconds. This level of response time is fine for sequential transfers but not so great for "random" access. In reality, most archive processes will not have a problem with latency, as they work on storing and retrieving large volumes of data, but updates to metadata, if cloud-based, could be a problem.
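A back-of-the-envelope model shows why the same latency affects the two access patterns so differently. This is a simplified sketch that assumes requests are issued one at a time with no pipelining or parallelism; the 100 MB/s link speed and the 4 KB request size in the demo are hypothetical.

```python
def serial_ops_per_second(round_trip_ms: float) -> float:
    """Upper bound on dependent operations per second when each waits one round trip."""
    return 1000.0 / round_trip_ms


def transfer_seconds(size_bytes: float, bandwidth_bytes_per_s: float,
                     round_trip_ms: float, requests: int = 1) -> float:
    """Estimated transfer time: one round trip per request plus time on the wire.

    Simplification: requests are strictly sequential and bandwidth is constant.
    """
    latency_cost = requests * round_trip_ms / 1000.0
    wire_time = size_bytes / bandwidth_bytes_per_s
    return latency_cost + wire_time


if __name__ == "__main__":
    size = 1_000_000_000       # 1 GB archive file
    bandwidth = 100_000_000    # hypothetical 100 MB/s link
    one_stream = transfer_seconds(size, bandwidth, 30)
    small_requests = transfer_seconds(size, bandwidth, 30, requests=size // 4000)
    print(f"Serial operations per second at 30 ms: {serial_ops_per_second(30):.1f}")
    print(f"One sequential stream: {one_stream:.2f} s")
    print(f"Same data as 4 KB requests: {small_requests:.0f} s")
```

At a 30 ms round trip, a single thread manages only about 33 dependent operations per second, so a workload made up of many small requests is dominated by latency, while one large sequential transfer barely notices it.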
Two other issues affect the performance of accessing data. First, the providers themselves may place restrictions on access. Amazon Web Services' Glacier, for example, provides a lower-cost alternative to S3 (Simple Storage Service) but provides access through a staging process that takes 3 to 5 hours to retrieve the data that is then available for up to 24 hours (after which it needs to be retrieved again). There are also transfer costs to access data above a free 1 GB limit with Glacier, which we will discuss later. Not all vendors have performance restrictions on data access; Google Cloud Storage Nearline, for example, offers response times (or access to first byte) of around three seconds for long-term archive data. There's clearly a trade-off in choosing the right price of service versus the performance the service offers.
Accessibility and data format are another area of concern when using cloud archives. Archive platforms are typically object stores accessible over Web-based protocols. On-premises data, however, may be in the form of structured data (like a database), semi-structured data (like emails) or unstructured data (like files). Each of these data formats will be associated with metadata that's used to describe the content. So how does this data get transformed into a generic object format? One answer is to use products that act as either gateways or archive platforms to provide the bridge between the local and the archive format. Examples include the AWS Storage Gateway, EMC CloudBoost, Microsoft Azure's StorSimple, Nasuni Cloud NAS and NetApp's AltaVault. Most of these products are conduits to cloud storage and don't directly integrate with a particular application. However, they do offer a more consumable protocol for archive data and the ability to cache some content locally, reducing the impact of always having to go back to cloud storage to access data. Application integration work may still be required, but that could be the case whether the data is on or offsite.
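To make the translation step more concrete, here is a minimal sketch of how a local file might be described as a generic object: a key, a body and a set of metadata. The function name, the key prefix and the metadata field names are all hypothetical, and a real gateway would also handle chunking, caching and the upload itself.

```python
import hashlib
import os
from datetime import datetime, timezone


def file_to_object(path: str, prefix: str = "archive") -> dict:
    """Describe a local file as an object: key, body and descriptive metadata.

    Hypothetical layout for illustration; it reads the whole file into memory,
    so it is only suitable for small files.
    """
    stat = os.stat(path)
    with open(path, "rb") as handle:
        body = handle.read()
    modified = datetime.fromtimestamp(stat.st_mtime, tz=timezone.utc)
    return {
        "key": f"{prefix}/{os.path.basename(path)}",
        "body": body,
        "metadata": {
            "original-path": os.path.abspath(path),
            "size-bytes": str(stat.st_size),
            "modified-utc": modified.isoformat(),
            # A content hash lets the data be verified when it is recalled.
            "sha256": hashlib.sha256(body).hexdigest(),
        },
    }
```

Carrying the original path, timestamps and a checksum as object metadata is one way to keep the context that the file system would otherwise have provided.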
Objects in the cloud
Cloud archiving provides the capability to store large volumes of data as objects, so some kind of protocol or format conversion (with metadata) is required to exploit cloud storage effectively.
Finally, you should consider cost. The cost of most on-premises archiving systems is typically based on the infrastructure itself, whereas the cost of cloud-based archiving will be driven by the volume of stored data and access profiles. As more data is stored in and recalled from the archive, the monthly costs increase. IT organizations need to be ready to deal with rebilling the cost back to their end users (where appropriate) and this will mean creating policies on data retention and retrieval as well as partitioning archive data into logical repositories (like vaults) that can be reported individually. The cost of using cloud storage could become a real issue where large volumes of data are placed with a single provider. This is because moving data between archives (and providers) could be cost-prohibitive even though it may be desirable for redundancy and risk reduction purposes.
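A simple per-vault model illustrates how a monthly bill might be broken down for rebilling. This is a sketch under stated assumptions: the rates and vault names are invented for the example and do not reflect any provider's price list, and real tariffs usually include further charges such as request fees.

```python
from dataclasses import dataclass


@dataclass
class Vault:
    """A logical repository that is reported and rebilled separately."""
    name: str
    stored_gb: float
    recalled_gb: float


def monthly_cost(vault: Vault, storage_rate_per_gb: float,
                 retrieval_rate_per_gb: float, free_recall_gb: float = 0.0) -> float:
    """Storage charge plus a charge for any recall above the free allowance."""
    billable_recall = max(0.0, vault.recalled_gb - free_recall_gb)
    return vault.stored_gb * storage_rate_per_gb + billable_recall * retrieval_rate_per_gb


if __name__ == "__main__":
    # Hypothetical rates in dollars per GB per month; not real pricing.
    storage_rate, retrieval_rate = 0.01, 0.02
    vaults = [Vault("finance", 1000, 11), Vault("engineering", 5000, 0)]
    for vault in vaults:
        cost = monthly_cost(vault, storage_rate, retrieval_rate, free_recall_gb=1)
        print(f"{vault.name}: ${cost:.2f}")
```

Keeping the recall volume as a separate input makes it easy to see how a change in retrieval policy, rather than in the amount stored, moves the bill.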
One opportunity to reduce costs is to look at implementing data reduction technologies, such as deduplication and compression. These can be implemented in the application before the data is archived, or deployed within a cloud gateway. One such product is StorReduce from a startup company of the same name. The StorReduce appliance sits in the public cloud and accepts data in S3 format, writing data back out to AWS S3 in deduplicated format. The company claims up to 95% savings on data stored, which can result in significant cost reductions with large archives.
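The general idea behind deduplication can be sketched in a few lines. This example uses fixed-size chunks identified by their SHA-256 hash; it is a generic illustration and makes no claim about how StorReduce or any other product works internally. The function names and the 4 KB chunk size are assumptions.

```python
import hashlib


def deduplicate(data: bytes, chunk_size: int = 4096):
    """Split data into fixed-size chunks and keep one copy of each unique chunk.

    Returns (store, recipe): `store` maps a chunk's SHA-256 digest to its bytes,
    and `recipe` lists the digests in order so the data can be rebuilt.
    """
    store = {}
    recipe = []
    for offset in range(0, len(data), chunk_size):
        chunk = data[offset:offset + chunk_size]
        digest = hashlib.sha256(chunk).hexdigest()
        store.setdefault(digest, chunk)  # only the first copy is kept
        recipe.append(digest)
    return store, recipe


def rebuild(store: dict, recipe: list) -> bytes:
    """Reassemble the original data from the unique chunks and the recipe."""
    return b"".join(store[digest] for digest in recipe)


def savings_ratio(data: bytes, store: dict) -> float:
    """Fraction of the original size saved, ignoring the size of the recipe."""
    if not data:
        return 0.0
    stored_bytes = sum(len(chunk) for chunk in store.values())
    return 1 - stored_bytes / len(data)


if __name__ == "__main__":
    sample = b"A" * 4096 * 10 + b"B" * 4096  # highly repetitive test data
    store, recipe = deduplicate(sample)
    assert rebuild(store, recipe) == sample
    print(f"Unique chunks: {len(store)} of {len(recipe)}")
    print(f"Savings: {savings_ratio(sample, store):.0%}")
```

The savings depend entirely on how repetitive the data is, which is why vendor claims are usually quoted as "up to" a given percentage.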
Cloud archiving: Boon or bane?
Archiving can work as a cloud-only application, with consideration to the points raised here: security, performance, accessibility and cost. In deciding whether cloud is the right place to put archive data, the issues of flexibility in billing and management have to be weighed against the requirement to implement on-premises systems that translate data into object-compatible formats.
One final point to consider is how archive data will be accessed in the future. Having the data already in the cloud provides the ability to run cloud-based analytics against the archive. Accessing data within the cloud from cloud-based applications running as virtual instances typically doesn't attract any additional access costs, so cloud might prove the stepping stone to actually doing something useful with all of that cold data.