Top data lake architecture concerns

While IT might be more concerned with the cost of its data lake architecture, data durability and security should be the most important issues.

A data lake is designed to store information from a variety of sources, including Internet of Things devices and humans. Big data analytics or big data archives then access the data lake to process or deliver a subset of it to the requesting user. But a data lake architecture has to be more than just a giant disk drive.

While most IT planners worry about the cost of a data lake first, data durability and security should be the top priorities. Plenty of options can deliver a reasonable cost per gigabyte, but not many can meet the long-term storage requirements of a data lake. The challenge is that much of the data stored in data lakes is never deleted. Its value lies in being analyzed and compared against new data year after year, and that value can offset capacity costs.

This is where data durability comes in: for that data to be of value five or 10 years after it is originally stored, it has to be readable. All forms of media degrade over time. A data lake storage system has to protect against that degradation by continuously examining the data it holds. If it finds a corrupted or degrading data set, it must use replication or erasure coding to generate a new copy.
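The check-and-repair cycle described above can be sketched in a few lines. This is a minimal illustration, not how any particular storage product works: it assumes full replication rather than erasure coding, models each replica as an in-memory dictionary, and assumes a SHA-256 checksum was recorded when each object was written. The names `scrub`, `replicas` and `checksums` are hypothetical.

```python
import hashlib
from typing import Dict, List, Tuple


def sha256(data: bytes) -> str:
    """Return the hex SHA-256 digest used as the object's checksum."""
    return hashlib.sha256(data).hexdigest()


def scrub(replicas: List[Dict[str, bytes]],
          checksums: Dict[str, str]) -> List[Tuple[str, int]]:
    """Verify every object on every replica and repair bad copies.

    replicas  -- hypothetical stand-in for storage nodes (name -> bytes)
    checksums -- checksum recorded at write time (name -> hex digest)
    Returns the (object name, replica index) pairs that were rewritten.
    """
    repaired = []
    for name, expected in checksums.items():
        # Find one copy whose contents still match the recorded checksum.
        good = next(
            (r[name] for r in replicas
             if name in r and sha256(r[name]) == expected),
            None,
        )
        if good is None:
            raise RuntimeError(f"no healthy copy of {name!r} left")
        # Overwrite every missing or corrupted copy with the healthy one.
        for index, replica in enumerate(replicas):
            if name not in replica or sha256(replica[name]) != expected:
                replica[name] = good
                repaired.append((name, index))
    return repaired


if __name__ == "__main__":
    original = b"sensor reading 42"
    recorded = {"obj1": sha256(original)}
    nodes = [{"obj1": original}, {"obj1": b"sensor reading 4X"}, {}]
    print(scrub(nodes, recorded))
```

A real system would run a pass like this on a schedule and in the background, so that degradation is caught while at least one healthy copy still exists.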

Securing the information within a data lake architecture is another challenge that's often overlooked. Security is potentially more important for this type of storage than any other. A data lake by definition attempts to put all the data eggs in one basket. If the security of that single storage repository is broken, an unwanted party may have access to all of an organization's data. Much of this data is also kept in a very easy-to-read format, such as JPEG or PDF files, so if your data lake architecture is not secure, anyone who gets in can easily consume the information.

It is therefore advisable to implement multiple levels of security, such as:

  • Encrypt all the data in the data lake. Encrypting each data category with its own key limits the exposure if one key is compromised and still allows applications full access when needed.
  • Store copies of all data in the data lake in a location that is disconnected and offline. The offline copy can be on tape or on another disk-based system that has its physical connection removed, except when a copy is made or updated.
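As an illustration of the first recommendation, the sketch below gives each data category its own 256-bit key. It is one possible approach, not a prescribed one: it derives the keys from a single master key with HKDF-SHA256 (RFC 5869), written here with only the Python standard library. The function names, the `data-lake/` label prefix and the category names are assumptions made for the example. The encryption itself is not shown; it would be done with an authenticated cipher from a vetted cryptographic library.

```python
import hashlib
import hmac


def hkdf_sha256(master_key: bytes, info: bytes,
                length: int = 32, salt: bytes = b"") -> bytes:
    """Derive `length` bytes from `master_key` using HKDF-SHA256 (RFC 5869)."""
    if not salt:
        # RFC 5869: an absent salt defaults to a string of zero bytes.
        salt = b"\x00" * hashlib.sha256().digest_size
    # Extract step: condense the input key into a pseudorandom key.
    prk = hmac.new(salt, master_key, hashlib.sha256).digest()
    # Expand step: chain HMAC blocks until enough output is produced.
    output = b""
    block = b""
    counter = 1
    while len(output) < length:
        block = hmac.new(prk, block + info + bytes([counter]),
                         hashlib.sha256).digest()
        output += block
        counter += 1
    return output[:length]


def category_key(master_key: bytes, category: str) -> bytes:
    """Return the 256-bit key for one data category (hypothetical helper)."""
    # The "data-lake/" prefix is an illustrative label, not a standard.
    return hkdf_sha256(master_key, b"data-lake/" + category.encode("utf-8"))


if __name__ == "__main__":
    import secrets

    master = secrets.token_bytes(32)  # would live in a key manager or HSM
    for name in ("hr", "finance", "sensor"):
        print(name, category_key(master, name).hex()[:16], "...")
```

With this design, a leaked category key exposes only that category, but the master key becomes the single secret to protect. Generating an independent random key per category and storing each in a key manager avoids that dependency at the cost of more keys to track.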

This was last published in September 2015