A data lake is designed to store information from a variety of sources, including Internet of Things devices and humans. Big data analytics or big data archives then access the data lake to process or deliver a subset of it to the requesting user. But a data lake architecture has to be more than just a giant disk drive.
While most IT planners worry about the cost of a data lake first, data durability and security should be the top priorities. Plenty of options can deliver a reasonable cost per gigabyte, but not many can meet the long-term data storage requirement of a data lake. The challenge is that much of the data stored in data lakes is never deleted. The value of this data lies in analyzing it and comparing it with new data year after year, and that value can offset capacity costs.
This is where data durability comes in -- for that data to be of value five or 10 years after it is originally stored, it has to be readable. All forms of media degrade over time. A data lake storage system has to protect against that degradation by continuously examining the stored data. If it finds a corrupted or degrading data set, it must use replication or erasure coding to generate a new, healthy copy.
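The check-and-repair cycle described above can be sketched in a few lines. This is a minimal illustration, not how any particular storage product works: it assumes simple full replication rather than erasure coding, uses a SHA-256 digest recorded at ingest as the integrity check, and the names `StoredObject`, `ingest` and `scrub` are hypothetical.

```python
import hashlib
from dataclasses import dataclass, field


def sha256_hex(data: bytes) -> str:
    """Return the SHA-256 digest of the data as a hex string."""
    return hashlib.sha256(data).hexdigest()


@dataclass
class StoredObject:
    """One data set kept as several replicas, plus the digest recorded at ingest.

    Hypothetical structure for illustration; real systems keep replicas on
    separate devices or nodes.
    """
    expected_digest: str
    replicas: list = field(default_factory=list)


def ingest(data: bytes, copies: int = 3) -> StoredObject:
    """Store the data as `copies` replicas and remember its digest."""
    return StoredObject(sha256_hex(data), [data for _ in range(copies)])


def scrub(obj: StoredObject) -> int:
    """Verify every replica and overwrite damaged ones from a healthy replica.

    Returns the number of replicas that were repaired.
    """
    healthy = next(
        (r for r in obj.replicas if sha256_hex(r) == obj.expected_digest),
        None,
    )
    if healthy is None:
        # Assumption: at this point the only option is an external copy.
        raise RuntimeError("no healthy replica left; restore from another copy")

    repaired = 0
    for i, replica in enumerate(obj.replicas):
        if sha256_hex(replica) != obj.expected_digest:
            obj.replicas[i] = healthy
            repaired += 1
    return repaired
```

A storage system would run something like `scrub` on a schedule across every object, so that silent corruption is caught while at least one good replica still exists.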
Securing the information within a data lake architecture is another challenge that's often overlooked. Security is potentially more important for this type of storage than any other. A data lake by definition attempts to put all the data eggs in one basket. If the security of that single storage repository is broken, an unwanted party may have access to all of an organization's data. Much of this data is also kept in a very easy-to-read format, such as JPEG or PDF files -- if your data lake architecture is not secure, an intruder can easily read the information.
It is therefore advisable to implement multiple levels of security, such as:
- Encrypt all the data in the data lake. Encrypting each data category with its own key limits the exposure if one key is compromised and still allows applications full access when needed.
- Store copies of all data in the data lake in a location that is disconnected and offline. The offline copy can be on tape or on another disk-based system that has its physical connection removed, except for when a copy is made or updated.
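The idea of separate keys per data category can be illustrated with a small key ring. This is a sketch under stated assumptions: the `CategoryKeyRing` class and the category names are hypothetical, keys are held in memory only for demonstration, and the encryption step itself is left out. In practice each key would be handed to an authenticated cipher from a vetted cryptography library and stored in a key management system.

```python
import secrets


class CategoryKeyRing:
    """Holds one independent random 256-bit key per data category.

    Hypothetical helper for illustration. Because the keys are generated
    independently, learning one category's key reveals nothing about the others.
    """

    def __init__(self) -> None:
        self._keys = {}

    def key_for(self, category: str) -> bytes:
        """Return the key for a category, creating it on first use."""
        if category not in self._keys:
            self._keys[category] = secrets.token_bytes(32)
        return self._keys[category]

    def categories(self) -> list:
        """List the categories that currently have a key."""
        return sorted(self._keys)


ring = CategoryKeyRing()
images_key = ring.key_for("images")        # assumed category name
documents_key = ring.key_for("documents")  # assumed category name
```

An application that only processes images would be given `images_key` alone, so a breach of that application exposes one category rather than the whole lake.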