Just as high availability and resiliency are important considerations for local storage, they're equally important to organizations that rely upon cloud storage. Although cloud storage providers almost always use redundant hardware and offer customers a service-level agreement, it's common for protection gaps to exist that could potentially result in outages or even data loss.
Although cloud providers typically build redundancy into all levels of their infrastructure, redundancy alone can be inadequate to protect against an outage. Local component failures, WAN outages or cloud provider outages can lead to data becoming unavailable. At a minimum, organizations using cloud storage should deploy redundant cloud storage gateways and redundant WAN links. If the budget allows, a higher level of redundancy can be achieved by adopting a Bunch of Redundant Independent Clouds (BRIC) architecture.
The first step in ensuring high availability (HA) for cloud storage is to verify the level of protection you're receiving from your cloud storage provider. It's essential to ensure that the level of redundancy provided by your cloud storage vendor adheres to your business requirements. For instance, if your organization's data storage policy requires three copies of all data, a cloud storage provider that merely replicates your data to a secondary data center might be inadequate for your needs. You may find that you need to subscribe to a higher tier of service to receive the level of redundancy you require.
As important as cloud storage redundancy is, there are additional considerations that must be taken into account. To achieve true HA for cloud storage you also need to build redundancy into the way you connect to the cloud storage. Cloud providers construct their own infrastructures using a "fail first" mentality, but the provider has no control over the architecture used in your local infrastructure.
How does Tahoe-LAFS store data in the cloud?
The Tahoe Least-Authority File System (Tahoe-LAFS) is a distributed storage system for local or cloud storage. Tahoe-LAFS is different from other distributed storage systems in that it offers provider-independent security. In other words, data is encrypted before it is sent to the storage device.
Although Tahoe-LAFS can be used to distribute data across multiple clouds, it's important to understand that it doesn't replicate data in full; if it did, each cloud would hold a complete copy of the data. Instead, Tahoe-LAFS stripes data across storage clouds. This way, it makes more efficient use of storage while avoiding the latency issues common in data replication solutions.
Although Tahoe-LAFS is an open source solution, its use isn't limited to Linux shops. Builds now exist for Windows and Mac OS X.
Key places to consider redundancy: WAN connections, gateways
Two components are typically used to provide cloud storage connectivity: a WAN connection and a cloud storage gateway, also known as a cloud storage controller. Both of these must be addressed to achieve high availability.
One common solution to WAN redundancy is to lease redundant connections from separate WAN providers. If one provider has an outage, you should theoretically be able to maintain cloud connectivity through another provider's link.
Most cloud storage is based on an object storage platform. Since local storage tends to be block based, a mechanism is needed to translate between block and object storage. This task is usually handled by a cloud storage gateway appliance, which provides a global namespace for local and cloud storage.
Given the importance of a cloud storage gateway, it's critical to prevent it from becoming a single point of failure. If the cloud gateway is a physical appliance, the obvious solution is to deploy one or more additional appliances in accordance with your organization's redundancy requirements. But physical appliances can be expensive. If additional appliances aren't in your budget, talk to your vendor to determine whether other options might exist. You may find that you can use a lower-end appliance or even a virtual appliance if HA is your only goal.
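To see why both the WAN link and the gateway need redundancy, it can help to estimate the availability of the whole path. The sketch below is a rough back-of-the-envelope model, not a measurement: the availability figures are hypothetical, and it assumes components fail independently, which real links sharing a conduit or power feed may not.

```python
def series(availabilities):
    """Availability of a chain in which every component must be up."""
    result = 1.0
    for a in availabilities:
        result *= a
    return result


def parallel(availabilities):
    """Availability of redundant components where any one is sufficient.

    Assumes failures are independent of one another.
    """
    p_all_fail = 1.0
    for a in availabilities:
        p_all_fail *= (1.0 - a)
    return 1.0 - p_all_fail


# Hypothetical figures chosen only for illustration.
wan_link = 0.995
gateway = 0.999

# One WAN link feeding one gateway: both must work.
single_path = series([wan_link, gateway])

# Two WAN links from separate providers and two gateways.
redundant_path = series([
    parallel([wan_link, wan_link]),
    parallel([gateway, gateway]),
])

print(f"single path:    {single_path:.6f}")
print(f"redundant path: {redundant_path:.6f}")
```

Under these assumptions the weakest unprotected component dominates the result, which is why duplicating only the WAN link or only the gateway leaves a single point of failure in place.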
Erasure coding and cloud storage
Some cloud storage providers ensure storage redundancy by using a technique called erasure coding (or forward error correction). For example, Microsoft uses erasure coding to provide storage redundancy within Microsoft Azure (formerly Windows Azure).
Erasure coding is based on the idea that data redundancy isn't limited to what is possible with RAID levels such as RAID 5 and RAID 6. Instead, it specifies the number of storage devices (clouds, in this case) that will be used and the number of failures that must be tolerated. A mathematical formula is then used to determine how the data should be stored so the specified requirements are met.
Tahoe-LAFS makes use of erasure coding. The encoding parameters are configurable, but the defaults spread the data across 10 different storage devices (or clouds). This architecture tolerates the failure of up to seven storage devices. The volume of data that must be stored to provide this level of redundancy is roughly 3.3 times that of a single copy of the data.
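The figures above follow from simple arithmetic. In a k-of-n erasure code, the data is encoded into n shares and any k of them are enough to rebuild it. The sketch below assumes k = 3, which is inferred from the numbers quoted here (10 devices, seven tolerated failures); the function name is hypothetical and is not part of Tahoe-LAFS.

```python
def erasure_profile(k, n):
    """Summarize a k-of-n erasure code.

    k: number of shares needed to rebuild the data
    n: total number of shares written (one per storage device)
    """
    if not 0 < k <= n:
        raise ValueError("need 0 < k <= n")
    return {
        "tolerated_failures": n - k,  # shares that can be lost
        "expansion": n / k,           # stored bytes per byte of data
    }


# Assumed parameters: 10 shares in total, any 3 of which suffice.
profile = erasure_profile(3, 10)
print(profile["tolerated_failures"])
print(round(profile["expansion"], 1))
```

Changing the parameters shows the trade-off: a higher k lowers the storage overhead but also lowers the number of device failures the data can survive.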
In the case of virtual cloud storage gateway appliances, you'll have to consider the level of redundancy required. Because a virtual cloud gateway appliance is really nothing more than a virtual machine (VM), the appliance can easily be protected by the redundancy built into your server virtualization infrastructure. It's worth noting that while HA features such as a Hyper-V failover cluster can protect a virtual appliance against a physical hardware failure, hardware clusters do nothing to offer protection against a failure that occurs within the VM. It's therefore necessary to consider whether you might need to provide additional protection by deploying parallel virtual appliances.
When it comes to HA for cloud storage, it's important to take a cue from the past. In 2011, for example, Amazon Web Services suffered a major outage due to Elastic Block Store volumes within a single availability zone becoming "stuck" and consequently unable to service read/write requests. Even though this type of massive problem hasn't reappeared, it illustrates the point that cloud providers can and sometimes do have problems in spite of their built-in HA mechanisms. So it's a good idea to have a contingency plan in place in case a cloud storage provider suffers a data loss event.
BRIC architecture protects data and access
Conventional wisdom has long held that customers are at the mercy of a cloud storage provider when it comes to ensuring storage availability. After all, if a provider experiences an outage like Amazon did, it will surely affect its customers. However, BRIC architecture can help.
BRIC works similarly to a RAID array, but rather than worrying about individual disks, BRIC stripes data across multiple clouds. That way, if a cloud provider has an outage or data loss event, data is protected and remains accessible on other clouds.
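The RAID analogy can be made concrete with a small sketch. The code below is only an illustration of RAID 5-style XOR parity applied to clouds, not how any particular BRIC product works: it splits a payload into three data stripes plus one parity stripe, treats each stripe as living on a separate (hypothetical) cloud, and rebuilds the stripe held by a cloud that has gone offline.

```python
def split_with_parity(data, n_data):
    """Split data into n_data equal stripes plus one XOR parity stripe."""
    stripe_len = -(-len(data) // n_data)  # ceiling division
    padded = data.ljust(stripe_len * n_data, b"\0")
    stripes = [padded[i * stripe_len:(i + 1) * stripe_len]
               for i in range(n_data)]
    parity = bytearray(stripe_len)
    for stripe in stripes:
        for i, byte in enumerate(stripe):
            parity[i] ^= byte
    return stripes + [bytes(parity)]


def rebuild(stripes, missing_index):
    """Recover the single missing stripe by XORing all surviving stripes."""
    survivors = [s for i, s in enumerate(stripes) if i != missing_index]
    out = bytearray(len(survivors[0]))
    for stripe in survivors:
        for i, byte in enumerate(stripe):
            out[i] ^= byte
    return bytes(out)


original = b"cloud storage payload"
clouds = split_with_parity(original, 3)  # 3 data stripes + 1 parity stripe

# Simulate an outage at the cloud holding stripe 1.
lost = 1
damaged = list(clouds)
damaged[lost] = None

recovered = rebuild(damaged, lost)
data_stripes = [recovered if i == lost else damaged[i] for i in range(3)]
restored = b"".join(data_stripes)[:len(original)]
print(restored == original)
```

Single parity survives the loss of only one cloud. Tolerating more simultaneous outages requires a stronger code, such as the erasure coding used by Tahoe-LAFS.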
The biggest drawback to using BRIC architecture is cost. There are free, open source BRIC implementations such as the Tahoe Least-Authority File System (Tahoe-LAFS), but cloud storage providers usually bill customers based on the amount of storage they consume. If an organization uses BRIC to store multiple copies of data on separate clouds, its cloud storage costs multiply with each additional copy. That being the case, it's important that organizations considering a BRIC architecture accurately estimate their future storage needs and choose a storage striping method that maximizes protection while minimizing costs. Otherwise, cloud storage costs can quickly get out of hand.
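A quick estimate makes the cost trade-off easier to compare. The sketch below is illustrative only: the data volume and per-terabyte price are hypothetical, it assumes every cloud charges the same rate, and it ignores request and egress fees. Each scheme is described by its expansion factor, meaning stored bytes per byte of data.

```python
def monthly_cost(logical_tb, price_per_tb, expansion):
    """Estimated monthly bill for storing logical_tb of data."""
    return logical_tb * price_per_tb * expansion


# Hypothetical inputs chosen only for illustration.
logical_tb = 50
price_per_tb = 20.0

# Expansion factor per scheme: n full copies cost n times as much,
# while a k-of-n erasure code costs n / k times as much.
schemes = {
    "single copy": 1.0,
    "3 full copies": 3.0,
    "3-of-10 erasure coding": 10 / 3,
    "8-of-10 erasure coding": 10 / 8,
}

costs = {name: monthly_cost(logical_tb, price_per_tb, factor)
         for name, factor in schemes.items()}

for name, cost in costs.items():
    print(f"{name:<24} {cost:>10.2f}")
```

Running the same calculation against projected data growth shows how quickly the choice of striping parameters affects the bill.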