What's the best way to distinguish big data analytics from data warehousing?
My answer will likely differ from that of people who favor the "Three Vs" approach to big data analytics -- volume, velocity and variety. I also think of three distinguishing characteristics, just not the same ones. Distributed computing architectures became platforms for doing analytics differently from a traditional data warehouse for three reasons:
Time to information. While an immediate response to a query isn't an absolute requirement, a response time of five seconds or less is often desired and can be delivered by a distributed computing cluster running Hadoop MapReduce, for example. Traditional data warehouses are burdened by the perception that a batch process runs overnight to produce results that reach decision makers the following morning. Another concept popular with new analytics platforms is the iterative query: run a query, get results, then run a second query against those results or combine them with the results of other queries. Data warehouses aren't known for ad hoc querying. However, data warehousing vendors are catching up in these areas.
Complex content. Again, the perception is that a traditional data warehouse runs analytics processes only against structured database data -- a small subset of the kinds of data business executives want to analyze. The new analytics platforms can process structured data as well as unstructured data generated, for example, on the Web and by mobile devices. However, traditional data warehouse vendors are catching up here as well.
Cost. Once you add up the costs for hardware and software, a traditional data warehouse is relatively expensive compared to a 60-node Hadoop cluster built on commodity server racks and open source software.
Hadoop and platforms like it came about specifically because a traditional data warehouse couldn't, for a relatively low cost, scale to handle large, complex data sets and deliver sub-five-second response times.
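The query-then-requery pattern described under "Time to information" can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the sample records, the field names and the helper functions `total_by` and `above` are hypothetical, and an in-memory list stands in for data that a real platform would spread across a cluster.

```python
from collections import defaultdict

# Hypothetical sample records; a real platform would read these
# from distributed storage rather than an in-memory list.
EVENTS = [
    {"region": "east", "device": "mobile", "sales": 120},
    {"region": "east", "device": "web", "sales": 80},
    {"region": "west", "device": "mobile", "sales": 200},
    {"region": "west", "device": "web", "sales": 50},
    {"region": "north", "device": "web", "sales": 30},
]


def total_by(records, key):
    """Sum the 'sales' field of the records, grouped by the given key."""
    totals = defaultdict(int)
    for rec in records:
        totals[rec[key]] += rec["sales"]
    return dict(totals)


def above(totals, threshold):
    """Keep only the groups whose total meets the threshold."""
    return {k: v for k, v in totals.items() if v >= threshold}


# Query 1: total sales per region, computed from the raw records.
by_region = total_by(EVENTS, "region")

# Query 2: runs against the results of query 1, not the raw records.
top_regions = above(by_region, 100)

# Query 3: uses query 2's results to narrow a new pass over the raw records.
top_by_device = total_by(
    [rec for rec in EVENTS if rec["region"] in top_regions], "device"
)

print(by_region)
print(top_regions)
print(top_by_device)
```

Each step consumes the previous step's output, so the analyst can refine the question without waiting for a new overnight batch run.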
Related Q&A from John Webster
John Webster describes how changes to HDFS and the NameNode can help to improve Hadoop infrastructure.
Analyst John Webster details issues with Hadoop architecture and what users can expect from Hadoop Version 2.0.
Scale-out and object-based storage systems are both built for scaling, but metadata characteristics are the difference maker.