BOSTON -- The data storage choices of the world's largest online companies interest anyone trying to push the envelope...
to lower IT costs. The Vault Linux Storage and Filesystems Conference last week offered a glimpse at the Lambert open source-based cold storage engine of Alibaba Group Holding Ltd.
Alibaba designed Lambert as durable, low-cost storage using the open source Sheepdog distributed object storage system, low-speed hard disks and low-power commodity servers that can scale to handle exabytes of data that are rarely accessed.
Chinese e-commerce giant Alibaba raised $25 billion from its U.S. IPO last September, and its business ventures include the eBay-like Taobao.com, Tmall.com shopping site, Alibaba.com business-to-business trading, Alipay.com online payment, and Aliyun cloud services.
Alibaba intends to use Lambert -- named after the world's largest glacier -- as the underlying technology of the Aliyun (AliCloud) public cold data storage service, according to Coly Li, head of storage engineering for Alibaba Infrastructure Service (AIS).
Asked how Lambert compares to Amazon Glacier, Li said he doesn't know whether Lambert costs less. He knows only that the cost per GB with Lambert is "very cheap." Amazon's advertised price is one penny per GB per month for storage, plus additional fees for data upload, retrieval and transfer.
Sheepdog is the key ingredient in Lambert. Like many object storage systems, Sheepdog can run on commodity hardware and scale to thousands of nodes. The software manages the nodes and disks, aggregates capacity and performance linearly, and supports volume management features such as snapshots, cloning and thin provisioning, according to the project's website.
Li said the AIS team designed the Lambert cold storage system from scratch to store exabytes of data, in anticipation that data growth could accelerate within two or three years. Lambert currently operates on a smaller scale, with more than 1 PB in a select data center since going live in production in November, he said.
"We cannot afford data loss," Li said, "so we need a very long time to migrate all cold data from the existing storage system into Lambert, step by step, to make sure we have high durability of the data."
Aliyun cloud infrastructure uses software-defined sub-clusters for simplicity
High durability, low cost and flexibility were the top priorities when the AIS team commenced work on the hardware design for Lambert. The challenge was to keep the data available for many years on reliable yet inexpensive storage media. One obvious option for the storage media was tape, but Li said the team figured that using automatic robots, sometimes in small rooms, would be too costly.
One challenge Alibaba faces in China is its reliance on third-party data centers. Li said the company has to rent them, and the power supply, cooling and rack capacity can vary by site or region.
"We cannot ask them to follow our unique standard because the infrastructure is already there," said Li.
Alibaba considered following Facebook's approach of using Blu-Ray discs for cold storage. Li said the team currently has no evidence that Blu-Ray discs would meet the requirements for low cost and durability. He said Alibaba contacted several Blu-Ray vendors but hasn't seen enough progress at this point.
So the AIS team settled on cheap, low-performance hard-disk drives (HDDs). Alibaba's hardware design is based on the storage server part of the Project Scorpio data center standard and calls for 18 3.5-inch HDDs of 4 TB or 8 TB in a 1U server, or "case," and 32 of the 1U servers in a single rack, according to Li. The servers run Intel Atom processors, and the system uses a 10 gigabit Ethernet network.
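The raw capacity implied by those figures is simple arithmetic. The sketch below works it out per 1U server and per rack, using only the disk counts and sizes quoted above. The function and constant names are illustrative, the totals are raw capacity before any erasure-coding or spare-space overhead, and 1 PB is taken as 1,000 TB.

```python
# Raw capacity implied by the quoted hardware figures (before any
# erasure-coding or spare-space overhead). Names are illustrative only.
DISKS_PER_SERVER = 18   # 3.5-inch HDDs in each 1U server
SERVERS_PER_RACK = 32   # 1U servers in a single rack


def raw_capacity_tb(disk_tb: int) -> dict:
    """Return raw capacity per server and per rack, in TB."""
    per_server = DISKS_PER_SERVER * disk_tb
    per_rack = per_server * SERVERS_PER_RACK
    return {"per_server_tb": per_server, "per_rack_tb": per_rack}


for disk_tb in (4, 8):
    cap = raw_capacity_tb(disk_tb)
    print(f"{disk_tb} TB disks: {cap['per_server_tb']} TB per server, "
          f"{cap['per_rack_tb'] / 1000:.3f} PB per rack")
```

With 4 TB disks that comes to 72 TB per server and roughly 2.3 PB per rack; 8 TB disks double both figures.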
Li said the team wanted to push the system online as soon as possible and decided against building a large cluster. Instead, AIS designed a deployment unit of four "Scorpio" hardware racks with software-defined, distributed sub-clusters that span each rack in the unit. Li said the size of the deployment unit can expand to more racks and sub-clusters depending on data center space, but the focus remains on the quality of the individual sub-clusters.
"If the implementation is correct, most of the time, simple means reliable and high quality," he said.
Li said a front-end system gathers data in various formats, including compressed and encrypted, from internal and public sources and transforms the data into large objects for storage in Lambert. The average object size in Lambert is currently about 100 GB, but Li said the AIS team could change the size if necessary. The maximum size for a data object in the Sheepdog object storage system is 16 PB, he said.
A data object is stored in a single, specific software-defined sub-cluster. When a sub-cluster fills, it moves to a sealed state in which the hard disks are powered off and the memory and CPU go into an idle mode with extremely low power consumption, according to Li. New data objects then go into the next available software-defined sub-cluster.
Li said each of the sealed sub-clusters contains enough free space to tolerate failures of about 10% or 15% of the hard disks. The team changes hard disks only when there is no space for recovery, he said.
Deployed at large scale, the Lambert system consists of a collection of sealed servers, working servers storing data, and idle servers that are available for storage. The group of active servers is the smallest. Only a small group of sub-clusters is in a working state at any time, according to a diagram shown during Alibaba's presentation at the Vault conference.
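The idle, working and sealed life cycle described above can be modeled in a few lines. This is a minimal sketch under stated assumptions, not Alibaba's implementation: the class, the `store` function and the 15% reserve are hypothetical, with the reserve chosen from the "about 10% or 15%" spare space Li mentioned.

```python
from dataclasses import dataclass
from enum import Enum


class State(Enum):
    IDLE = "idle"        # available, holds no data yet
    WORKING = "working"  # currently accepting objects
    SEALED = "sealed"    # full; disks powered off


@dataclass
class SubCluster:
    name: str
    capacity_gb: int
    reserve_percent: int = 15  # assumed spare space kept for disk-failure recovery
    used_gb: int = 0
    state: State = State.IDLE

    @property
    def usable_gb(self) -> int:
        return self.capacity_gb * (100 - self.reserve_percent) // 100

    def fits(self, size_gb: int) -> bool:
        return self.used_gb + size_gb <= self.usable_gb


def store(sub_clusters: list, size_gb: int) -> str:
    """Place one object in the first sub-cluster with room; seal full ones."""
    for sc in sub_clusters:
        if sc.state is State.SEALED:
            continue
        if sc.fits(size_gb):
            sc.state = State.WORKING
            sc.used_gb += size_gb
            return sc.name
        if sc.state is State.WORKING:
            sc.state = State.SEALED  # full: seal it and move on
    raise RuntimeError("no sub-cluster has room for this object")


clusters = [SubCluster("sc-a", 1000), SubCluster("sc-b", 1000)]
placements = [store(clusters, 100) for _ in range(9)]
print(placements, [sc.state.value for sc in clusters])
```

In this toy run, each sub-cluster has 850 GB usable, so eight 100 GB objects land in the first one; the ninth seals it and goes to the second.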
Performance and capacity improved with erasure codes, RESTful API
Alibaba picked Sheepdog over other open source object storage options for its simplicity, according to Robin Dong, the chief software engineer of cold data storage at AIS. Dong said that Sheepdog includes only 35,000 lines of code. He added that Alibaba did not need a file system or POSIX interface and looked only at systems capable of distributed block storage.
Dong credited Yuan Liu, a former Alibaba employee who is one of the maintainers of Sheepdog, with adding support for erasure codes to help save hard disk space. Without erasure coding, Sheepdog's triple replication would mean, for example, storing 300 MB for every 100 MB of data. With erasure codes, using two parity blocks for every four data blocks, the team is protected against any two server failures and stores only 150 MB for every 100 MB of data. A data block is cut into four pieces, and two parity pieces are calculated from those four pieces.
Li noted that AIS can go to four parity blocks for every eight data blocks to protect against four server failures, or go to erasure codes that protect against even more server failures, depending on the requirements.
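The storage figures follow directly from the ratio of total blocks to data blocks. The sketch below reproduces that arithmetic for triple replication and for the two erasure-code layouts mentioned; the function names are illustrative, and it computes overhead only, not the parity data itself.

```python
from fractions import Fraction


def erasure_stored_mb(data_blocks: int, parity_blocks: int, data_mb: int = 100) -> Fraction:
    """MB written to disk for data_mb of user data under a k+m erasure code."""
    return Fraction(data_mb * (data_blocks + parity_blocks), data_blocks)


def replication_stored_mb(copies: int, data_mb: int = 100) -> int:
    """MB written to disk for data_mb of user data with full copies."""
    return data_mb * copies


print("3x replication:", replication_stored_mb(3), "MB")
for k, m in ((4, 2), (8, 4)):
    # A k+m code tolerates the loss of any m blocks.
    print(f"{k}+{m} erasure code:", erasure_stored_mb(k, m), "MB, tolerates", m, "failures")
```

Both the 4+2 and 8+4 layouts store 150 MB per 100 MB of data; the larger layout buys tolerance of four failures instead of two at the same overhead, at the price of spreading each object across more servers.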
In addition to erasure coding, Dong said he contributed support for the open source OpenStack Swift RESTful API for data access and control, hyper volumes of more than 600 PB for storing large objects, and performance improvements for data recovery.
Performance was slow when Sheepdog used the server as the node for its consistent hashing algorithm, according to Dong. If one disk failed, the server fetched the data from other servers and calculated the result of the erasure code to recover the lost data, he said.
To remove the bottleneck, the team used the hard disk instead of the server as the node for the consistent hashing algorithm, enabling all the servers in the sub-cluster to participate in the recovery work if a hard disk failed.
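A toy hash ring shows why making each disk a ring node spreads recovery work. This is a generic consistent-hashing sketch, not Sheepdog's code: the `Ring` class, the virtual-node count, and the server and disk names are all assumptions made for illustration.

```python
import bisect
import hashlib


def _hash(key: str) -> int:
    # Stable 64-bit hash so placement is the same on every run.
    return int.from_bytes(hashlib.sha1(key.encode()).digest()[:8], "big")


class Ring:
    """Consistent-hash ring with virtual nodes (hypothetical helper)."""

    def __init__(self, nodes, vnodes: int = 64):
        self._points = sorted((_hash(f"{n}#{i}"), n) for n in nodes for i in range(vnodes))
        self._keys = [p for p, _ in self._points]

    def lookup(self, key: str) -> str:
        # First ring point clockwise from the key's hash, wrapping around.
        idx = bisect.bisect(self._keys, _hash(key)) % len(self._points)
        return self._points[idx][1]


servers = [f"server{s}" for s in range(4)]
disks = [f"{s}/disk{d}" for s in servers for d in range(3)]  # each disk is a ring node
objects = [f"object-{i}" for i in range(2000)]

failed = "server0/disk0"
before = Ring(disks)
after = Ring([d for d in disks if d != failed])

placement_before = {o: before.lookup(o) for o in objects}
placement_after = {o: after.lookup(o) for o in objects}

lost = [o for o in objects if placement_before[o] == failed]
unchanged = all(placement_after[o] == placement_before[o]
                for o in objects if placement_before[o] != failed)
receiving_servers = {placement_after[o].split("/")[0] for o in lost}

print(len(lost), "objects re-homed; untouched objects unchanged:", unchanged)
print("servers receiving recovered data:", sorted(receiving_servers))
```

Only the objects that lived on the failed disk move, and their new homes are scattered over disks on several servers, so the rebuild is shared rather than funneled through one machine.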
"The recovery performance increased four times after our improvement," he said. "And the durability is much better."