The complete rundown on Docker data storage and containers
Containerization is a relatively new application development and deployment methodology, with the Docker platform emerging as the de facto container standard. Docker enables breaking monolithic applications into lightweight, portable application services that are packaged into isolated containers, allowing for the transition from traditional n-tier application architecture toward a microservices architecture of narrowly focused, independently deployable services.
Containers are elastic and can be started and shut down almost instantaneously, allowing an application service to scale by quickly spawning additional container instances on the same or across multiple hosts. They are instrumental in facilitating increasingly shorter release cycles for DevOps.
Containers are ephemeral by nature: all changes are lost when a container is destroyed or moved between nodes. This makes containers well suited for applications that don't depend on persistent data, such as an Apache Web Server serving static web content or stateless web services, but poses challenges when data needs to be persisted, as is the case with databases and editable, file-based content.
Containers bring a host of new storage challenges, however, so storage managers should understand the various options for storing and protecting container data, portability and persistence, and how to connect containers to legacy storage. Let's take a peek under the hood of the Docker platform to better comprehend containers and their transient nature.
Docker container data
Docker containers launch from read-only Docker images, which are templates that include all applications and files required to deliver intended application services. Docker images ensure containers are launched identically, regardless of environment.
Every Docker image starts with a base image, and every subsequent change -- such as the installation or update of an application -- adds a layer to that image. What makes Docker images so efficient is that only the additional layers (that is, changes from the base image) need to be stored, while the base image itself is merely referenced. Consequently, a container launched from a Docker image only occupies space for its own changes, and regardless of how many instances of the Docker image are launched as containers, they all share the same read-only base image.
Technically, this is accomplished through a union file system that combines these layers into a single image. Union file systems allow files and directories of separate file systems, known as branches, to be transparently overlaid to form a single coherent file system.
When the Docker platform runs a container from a Docker image, it adds a read-write layer on top of the image, leveraging the union file system. If an application in the container modifies an existing file, the file is copied to the read-write layer via copy-on-write, with the union file system hiding the original file in the read-only layer and making the updated file in the read-write layer accessible from within the container. When a Docker container is destroyed, all changes recorded in the read-write layer are lost. Therefore, if another container is run from the same Docker image, none of the changes made by some other container spawned by the same image will be present.
In other words, all changes are container-specific and ephemeral, and are lost when a container is removed.
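The layering behavior described above can be modeled in a few lines of Python. This is only a conceptual sketch, not how Docker implements its storage drivers: the layers are plain dictionaries with made-up paths and contents, and the standard library's ChainMap stands in for the union file system because it looks keys up layer by layer and sends every write to the topmost mapping.

```python
from collections import ChainMap

# Read-only image layers, modeled as path -> content mappings (hypothetical data).
BASE_LAYER = {"/etc/os-release": "base OS", "/app/config": "default"}
APP_LAYER = {"/app/server": "v1"}  # layer added by installing the application


def start_container(*image_layers):
    """Return a container view: a fresh read-write layer over the image layers."""
    read_write_layer = {}
    # ChainMap searches its maps left to right and sends every write to the
    # first one, so the image layers underneath are never modified.
    return ChainMap(read_write_layer, *image_layers)


first = start_container(APP_LAYER, BASE_LAYER)
first["/app/config"] = "tuned"       # the change lands in the read-write layer
first["/data/orders.db"] = "3 rows"  # new files also live only in that layer

print(first["/app/config"])          # tuned (the upper layer hides the original)
print(BASE_LAYER["/app/config"])     # default (the image layer is untouched)

del first  # destroying the container discards its read-write layer

second = start_container(APP_LAYER, BASE_LAYER)
print(second["/app/config"])         # default
print("/data/orders.db" in second)   # False
```

The second container starts from the unchanged image layers, which is why nothing written by the first container is visible to it.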
Docker data volumes
One way to persist data beyond the lifetime of a container and share data between containers is through Docker data volumes. Data volumes are folders and files outside of the union file system that reside as regular files and directories on the host file system beneath the /var/lib/docker directory. A data volume can be added to a container using the "-v" flag. The following example starts a container named TEST_CON with a data volume mounted at /TEST_VOL, and it creates a directory on the host beneath /var/lib/docker, where data written to /TEST_VOL is stored (IMAGE stands for the image to launch, which the run command requires):
$ docker run --name TEST_CON -v /TEST_VOL IMAGE
Besides creating a data volume during container creation using the "-v" flag, data volumes can be built directly into the container image by using the VOLUME instruction in the underlying Dockerfile. The fact that data in data volumes resides in a standard host directory has benefits: it can be browsed and edited from the host system, backed up, copied or moved in and out of the OS, and protected via standard Linux permissions.
You can also use data volumes to mount an existing host directory using the "-v" flag and a format that separates the host path and the mount point inside the container with a colon. The following example starts a container named TEST_CON and mounts the host directory /home/jg/data at /TEST_VOL inside the container:
$ docker run --name TEST_CON -v /home/jg/data:/TEST_VOL IMAGE
Mounting a host directory as a data volume in the Dockerfile via the VOLUME instruction is not supported, because host directories are system-specific and hardcoding them in an image would violate portability. Mounting a host directory as a data volume is a convenient way to quickly access data on a host in a container, especially for non-production use cases.
You can share data volumes between containers using the "--volumes-from" switch. The following example starts a container named BACKUP_CON and mounts all volumes from container TEST_CON inside container BACKUP_CON:
$ docker run --name BACKUP_CON --volumes-from TEST_CON IMAGE
This works whether the source container TEST_CON is running or not. Besides sharing volumes between containers, it also provides a convenient way for backing up data volumes.
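As a rough illustration of the backup use case, the Python sketch below assembles a one-off "docker run" command that mounts the volumes of TEST_CON, bind-mounts a host directory and archives the volume with tar. The ubuntu image, the /home/jg/backups directory, the archive name and the helper function itself are assumptions made for the example; the sketch only builds and prints the command, so it runs without a Docker host.

```python
import shlex


def backup_command(source_container, volume_path, host_backup_dir,
                   image="ubuntu", archive_name="backup.tar"):
    """Build a docker run command that archives one volume of source_container."""
    return [
        "docker", "run", "--rm",
        "--volumes-from", source_container,  # mount every volume of the source container
        "-v", f"{host_backup_dir}:/backup",  # host directory that receives the archive
        image,                               # assumed image; any image with tar would do
        "tar", "cvf", f"/backup/{archive_name}", volume_path,
    ]


cmd = backup_command("TEST_CON", "/TEST_VOL", "/home/jg/backups")
print(shlex.join(cmd))
# On a Docker host, the command could be executed with subprocess.run(cmd, check=True).
```

The printed line is an ordinary shell command, so it can be reviewed or pasted into a terminal before anything is executed.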
Docker data volume containers
Another option for managing persistent data in the Docker platform is through data volume containers. The idea is to create a container with one or more volumes, and mount those volumes in other containers using the "--volumes-from" switch. Since the data volume container merely serves as a data store without running applications, it need only be created (not stay active). The following example creates a data volume container DBCON with a data volume mounted at /DBDATA, which is then used by container TEST_CON (IMAGE again stands for the image to use):
$ docker create --name DBCON -v /DBDATA IMAGE
$ docker run --name TEST_CON --volumes-from DBCON IMAGE
Here, a data volume container abstracts the location of a data store, making the data container a logical mount point. It also persists data while application containers are created and destroyed.
Docker volume plug-ins
A big step forward, Docker volume plug-ins allow for the integration of external storage systems as Docker data volumes. First shown at DockerCon 2015 in the experimental 1.7 release of the Docker Engine, plug-ins were enabled when Docker decoupled volumes from container management to enable the management of persistent storage across an entire Docker Swarm cluster.
Docker volume plug-ins are powered by a plug-in API that enables third-party vendors to expose and extend storage system capabilities to Docker platform containers. The volume plug-in API acts as a control plane mechanism and defines which volume provider should be used, while the data path and storage functionality are handled by the configured volume provider (storage system). Within containers, storage integrated via a volume plug-in is presented as a mounted directory, regardless of how the storage system implements and exposes storage, which could be block, file or even object storage.
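To make the control plane role more concrete, the following Python sketch mimics the request and reply shapes of a few volume plug-in endpoints (Plugin.Activate, VolumeDriver.Create, Mount, Unmount and Remove). It is a simplified illustration, not a working plug-in: there is no HTTP server or socket, the storage system is an in-memory dictionary, and the /mnt/sketch mount path and the handle() dispatcher are invented for the example.

```python
import json

# In-memory stand-in for a storage system; a real driver would call the vendor's API.
volumes = {}  # volume name -> {"mountpoint": str, "mounted_by": set of container IDs}


def create(req):
    volumes.setdefault(req["Name"], {
        "mountpoint": f"/mnt/sketch/{req['Name']}",  # hypothetical host path
        "mounted_by": set(),
    })
    return {"Err": ""}


def mount(req):
    vol = volumes.get(req["Name"])
    if vol is None:
        return {"Err": f"no such volume: {req['Name']}"}
    vol["mounted_by"].add(req.get("ID", ""))
    # The reply only names a host path; the data path belongs to the storage system.
    return {"Mountpoint": vol["mountpoint"], "Err": ""}


def unmount(req):
    vol = volumes.get(req["Name"])
    if vol is None:
        return {"Err": f"no such volume: {req['Name']}"}
    vol["mounted_by"].discard(req.get("ID", ""))
    return {"Err": ""}


def remove(req):
    if volumes.pop(req["Name"], None) is None:
        return {"Err": f"no such volume: {req['Name']}"}
    return {"Err": ""}


HANDLERS = {
    "/Plugin.Activate": lambda req: {"Implements": ["VolumeDriver"]},
    "/VolumeDriver.Create": create,
    "/VolumeDriver.Mount": mount,
    "/VolumeDriver.Unmount": unmount,
    "/VolumeDriver.Remove": remove,
}


def handle(endpoint, body):
    """Dispatch one JSON request body to its handler and return the JSON reply."""
    handler = HANDLERS.get(endpoint)
    if handler is None:
        return json.dumps({"Err": f"unsupported endpoint: {endpoint}"})
    request = json.loads(body) if body else {}
    return json.dumps(handler(request))


for endpoint, body in [
    ("/Plugin.Activate", ""),
    ("/VolumeDriver.Create", '{"Name": "TEST_VOL", "Opts": {}}'),
    ("/VolumeDriver.Mount", '{"Name": "TEST_VOL", "ID": "abc123"}'),
    ("/VolumeDriver.Unmount", '{"Name": "TEST_VOL", "ID": "abc123"}'),
    ("/VolumeDriver.Remove", '{"Name": "TEST_VOL"}'),
]:
    print(endpoint, "->", handle(endpoint, body))
```

The sketch never moves any data itself: each handler only records state and answers with a mount point or an error string, which mirrors the control plane role described above.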
The role of orchestration in containerization
Containerized environments are usually distributed platforms with multiple nodes tied into a cluster and the ability to start and destroy containers on various nodes and reshuffle resources based on events and utilization. This is a complex task that requires special cluster management and orchestration tools:
- Docker addresses this with a combination of Docker Swarm and Docker Machine.
- Kubernetes is an alternative open source container cluster manager originally designed by Google. It is considered by many, including Red Hat, a more versatile and mature cluster manager. Case in point: Kubernetes had a volume plug-in for external storage before Docker.
- Apache Mesos is an even more versatile open source cluster manager for efficient resource isolation and resource sharing across distributed applications and frameworks.
Besides integrating containers with external storage, Docker volume plug-ins let you manage external storage from the Docker Universal Control Plane (UCP), a container management platform for Docker applications and infrastructure.
Data volume plug-ins are available from some storage vendors like NetApp with its NetApp Docker Volume Plugin (nDVP) for Data ONTAP that supports both iSCSI and NFS. Similarly, Hedvig provides a volume plug-in for its software-defined Distributed Storage Platform, using NFS. Unlike block storage volume drivers, Hedvig's NFS volume driver permits mounting the same volume on different hosts.
Vendor-agnostic volume plug-ins are offered by ClusterHQ with Flocker, Rancher Labs with Convoy and others. Storage-agnostic plug-ins simplify orchestration and portability, and work with a larger number of storage systems and orchestration tools. For instance, ClusterHQ Flocker not only supports over 15 storage systems, with more in the works, but also supports frameworks other than the Docker platform, including Jenkins, Kubernetes, Marathon and Mesos.
One way to use volume drivers is through the Docker "run" command. The following example creates a named volume, TEST_VOL, using the Flocker volume driver and makes it available within container TEST_CON at /webapp, with the Flocker volume plug-in mounting a volume from the configured storage system:
$ docker run --name TEST_CON --volume-driver=flocker -v TEST_VOL:/webapp IMAGE
This example illustrates how volume plug-ins decouple container management from storage management. The Flocker plug-in transparently interacts with the storage system based on volume plug-in configuration.
Hyper-convergence and containerization
Software-defined storage enables the hyper-convergence of application containers and storage, where application containers and persistent storage services run on the same platform. While the end goal is the containerization of the storage platform itself -- so storage can be served as a microservice from storage containers -- contemporary products vary in how far along they are and how they deliver on this vision:
- Diamanti, formerly known as Datawise.io, is developing a purpose-built, scale-out appliance that converges storage, networking and compute. Optimized for high performance, the appliance features a PCIe acceleration card with a Cavium OCTEON networking processor, four 10 GigE ports and NVM Express SSD cards. In beta, Diamanti's first product supports Kubernetes to orchestrate storage, networking and Docker containers and, according to the company, runs the complete Red Hat stack, including OpenShift.
- Joyent delivers a hyper-converged compute and storage cloud service, with containers, storage and network resources managed and orchestrated by Joyent's Triton software. Joyent offers both Docker and Joyent container services.
- Nexenta offers NexentaEdge, a software-only, scale-out block and object storage platform with cluster-wide deduplication and compression that supports container-based deployments. NexentaEdge storage is exposed to the host as native block devices and then shared as a volume through the ClusterHQ Flocker volume driver.
- Portworx, a Silicon Valley startup, is creating a containerized SDS product, PX-Lite, that's currently in beta and should be generally available by the time you read this. PX-Lite is delivered containerized and aggregates local Docker host storage, including all-flash and SATA arrays, into a single pool of block storage that's accessed by application containers through the Docker volume plug-in API. PX-Lite can run on-premises or in the cloud.
- Red Hat supports Docker containers in Red Hat Enterprise Linux (RHEL), RHEL Atomic and OpenShift, the latter a hyper-converged PaaS that combines core Docker container packaging, Kubernetes container cluster management, SDN and storage. Since March 2016, containerized Red Hat Gluster Storage can serve storage from a dedicated storage cluster for apps running in RHEL, RHEL Atomic or OpenShift, with hyper-convergence for all three environments expected this summer. Containerized Red Hat Ceph Storage for block and object storage is on Red Hat's roadmap.
The robust support of persistent storage in containerized applications was a prerequisite for, and coincides with, an increase in the adoption of the Docker platform in enterprises. When Docker first shipped, it was used primarily for stateless applications, and the most robust use of persistent storage was via integration with object stores like Amazon Simple Storage Service. The release of a revamped volume management system and volume plug-in framework in Docker release 1.9 was a game-changer, as it enabled the integration of containers and external storage systems.
We are also seeing the emergence of containerized software-defined storage products that are hyper-converged with application containers on the same infrastructure. These are glued together by Docker volume plug-ins or orchestration tools like Kubernetes and Mesos.