Big data sure is exciting to business folks, with all sorts of killer applications just waiting to be discovered. And you no doubt have a growing pile of data bursting the seams of your current storage infrastructure, with lots of requests to mine even more voluminous data streams. Haven't you been collecting microsecond end-user behavior across all your customers and prospects, not to mention collating the petabytes of data exhaust from instrumenting your systems to the nth degree? Imagine the insight management would have if they could look at all that data at once. Forget about data governance, data management, data protection and all those other IT worries -- you just need to land all that data in a relatively scale-cheap Hadoop cluster!
Seriously, though, big data lakes can meet growing data challenges and provide valuable new services to your business. Collecting a wide variety of business-relevant data sets in one place, and analyzing them with a mix of big data approaches that scale easily, creates many new data mining opportunities. The total potential value of a data lake grows with the amount of useful data it holds available for analysis. And because one of the key tenets of big data and the data lake concept is that you don't have to create a master schema ahead of time, new kinds of data can be added without redesigning what is already there, so non-linear growth is possible.
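To make the "no master schema ahead of time" idea concrete, here is a minimal sketch of schema-on-read in plain Python. The sample records, field names and the read_with_schema helper are all hypothetical, invented for illustration; a real lake would apply the same idea with its own query engines over files in HDFS or similar storage.

```python
import json

# Hypothetical raw events, landed in the lake as-is. They share no common schema.
RAW_EVENTS = [
    '{"user": "a17", "action": "click", "ms": 412}',
    '{"user": "b02", "action": "purchase", "amount": "19.99", "currency": "USD"}',
    '{"host": "web-3", "cpu_pct": 87.5}',
]


def read_with_schema(raw_lines, schema):
    """Project raw records onto `schema` (field name -> converter) at read time.

    Records missing any of the requested fields are skipped by this reader;
    nothing was rejected when the data was ingested.
    """
    rows = []
    for line in raw_lines:
        record = json.loads(line)
        if all(field in record for field in schema):
            rows.append({field: convert(record[field])
                         for field, convert in schema.items()})
    return rows


# Two different analyses impose two different schemas on the same raw data.
purchases = read_with_schema(RAW_EVENTS, {"user": str, "amount": float})
host_load = read_with_schema(RAW_EVENTS, {"host": str, "cpu_pct": float})

print(purchases)
print(host_load)
```

Each analysis decides which fields it needs and how to type them when it reads the data, which is why a new data source can be added without touching existing readers.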
The enterprise data lake, or data hub, concept was first proposed by big data vendors like Cloudera and Hortonworks, ostensibly using vanilla scale-out HDFS-based commodity storage. But it just so happens that the more data you keep on hand, the more storage of all kinds you will need. Eventually, all corporate data is likely to be considered big data. However, not all of that corporate data is best hosted on a commodity scale-out HDFS cluster.
So, today, traditional storage vendors are signing up to the big data lakes vision. From a storage marketing perspective, it seems like data lakes are the new cloud. "Everyone needs a data lake. How can you compete without one (or two or three)?" And there are a variety of enterprise storage options for big data, including remote enterprise storage that acts like HDFS, Hadoop virtualization that can translate other storage protocols into HDFS, and scalable software-defined storage.
Big, fast, now
Part of the value of the data lake is bringing varied data together. Another part of it is enabling big data analytics that don't require a pre-defined schema. And, big data architectures can now scale and deliver more real-time performance to users. While BI and the traditional data warehouse aren't dead, big data analytics and big data lakes are moving toward a more real-time kind of operational intelligence that can support "live" decision-making.
It is clear that Hadoop and its ecosystem have evolved beyond the science project stage, and are ready for production. Everything from management and analytics up to application development and deployment is becoming friendly to both IT and business users. Even advanced scale-out machine learning techniques are being baked into point-and-click big data mining software. However, IT still needs to be responsible for all the data in the lake, so we've outlined a few key capabilities below. An enterprise data lake should:
Host a centralized index of the inventory of data (and metadata) that is available, including sources, versioning, veracity and accuracy. Without automated support in this area, a big data lake will quickly become overwhelming.
Securely authorize, audit and grant access to subsets of data. The Hadoop ecosystem is evolving quickly in this area as rock-solid security is an absolute IT enterprise requirement. There are several emerging products to help secure big data assets at scale, and many are aimed at helping secure the data lake use case with high volumes of new data, many users and a growing asset value in need of protection.
Enable IT governance of what is in the data lake and assist in enforcing policies for retention and disposition (and, importantly, in tracking personally identifiable information). The best products (e.g., Dataguise) will enforce regulatory and compliance requirements no matter how much data or what kinds of data sets find their way into the data lake.
Ensure data protection at scale, for operational availability and BC/DR requirements. Ever had to remotely replicate everything? A data lake that becomes a key operational business platform with real-time data streams is a beast to synchronize remotely.
Provide agile analytics into and from the data lake using multiple analytical approaches (i.e., not just Hadoop) and data workflows. In some ways, Hadoop and HDFS are really software-defined storage products that are "data aware" enough to provide built-in analytics. There are others like Spark and proprietary analytics (think OLAP, or online analytical processing) like HP Vertica found in HP Haven that also play well in a data lake environment.
Many of these capabilities are found in today's enterprise storage products, and thus provide a clue as to why vendors are getting on the data lake bandwagon. Because of the similar economics of scale-out needed for cloud and big data, look to software-defined storage versions of enterprise-quality storage to become the leading storage products in this space.
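The inventory and governance capabilities listed above can be illustrated with a toy catalog. This is only a sketch under stated assumptions: the CatalogEntry fields, the Catalog class and the sample data sets are hypothetical, and real products track far more (lineage, veracity, access audits) against real storage.

```python
from dataclasses import dataclass
from datetime import date, timedelta


@dataclass
class CatalogEntry:
    """Metadata for one version of a data set (hypothetical fields)."""
    name: str
    source: str
    version: int
    ingested: date
    retention_days: int
    contains_pii: bool = False


class Catalog:
    """A centralized, in-memory index of what is in the lake."""

    def __init__(self):
        self._entries = {}

    def register(self, entry):
        # Key on (name, version) so older versions stay in the inventory.
        self._entries[(entry.name, entry.version)] = entry

    def latest(self, name):
        versions = [e for e in self._entries.values() if e.name == name]
        return max(versions, key=lambda e: e.version) if versions else None

    def due_for_disposal(self, today):
        # Retention policy: an entry is due once its retention window has passed.
        return sorted(
            (e.name, e.version)
            for e in self._entries.values()
            if e.ingested + timedelta(days=e.retention_days) < today
        )

    def pii_datasets(self):
        return sorted({e.name for e in self._entries.values() if e.contains_pii})


catalog = Catalog()
catalog.register(CatalogEntry("clickstream", "web logs", 1, date(2015, 1, 1), 90, contains_pii=True))
catalog.register(CatalogEntry("clickstream", "web logs", 2, date(2015, 6, 1), 90, contains_pii=True))
catalog.register(CatalogEntry("sensor_metrics", "system instrumentation", 1, date(2015, 5, 1), 365))

expired = catalog.due_for_disposal(date(2015, 7, 1))
print(expired)
print(catalog.pii_datasets())
```

Even at this scale, the point stands: retention and PII questions are only answerable if every data set is registered with its metadata when it lands, which is why automated cataloging matters once the lake grows.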
A dark and stormy data lake
Are data lakes really a good idea? One might start by asking whether we should be keeping all that data in the first place. And creating a single massive data ingestion point for the whole enterprise may just be creating a massive vulnerability. It's also unclear whether it is really a cost-effective approach, especially without the resources and scale of, say, Google or Facebook.
The data lake idea is probably best approached slowly, rather than as a wholesale data center re-architecture. Still, the potential value locked in our data, and the economics of massively shared scale-out approaches, will lead many organizations to drink from a data lake -- or at least a data pond.