What's the best way to distinguish big data analytics from data warehousing?
My answer will likely differ from that of people who favor the "Three Vs" approach to big data analytics -- volume, velocity and variety. I think of three distinguishing characteristics as well, just not the same ones. Distributed computing architectures became platforms for doing analytics as an alternative to a traditional data warehouse for three reasons:
Time to information. While an immediate response to a query isn't an absolute requirement, a response time of five seconds or less is often desired and can be delivered by a distributed computing cluster running Hadoop MapReduce, for example. Traditional data warehouses are burdened by the perception that a batch process runs overnight to produce results that are available to decision makers the following morning. Another concept popular with new analytics platforms is the ability to do iterative queries -- that is, run a query, get results, then run a second query against those results or combine them with other results. Data warehouses aren't known for ad hoc querying. However, data warehousing vendors are catching up in these areas.
Complex content. Again, the perception is that a traditional data warehouse runs analytics processes against structured database data -- a small subset of the kinds of data business executives want to get their analytical arms around. The new analytics platforms can process unstructured data generated, for example, on the Web and via mobile devices, as well as structured data. However, traditional data warehouse vendors are catching up here as well.
Cost. Once you add up the costs for hardware and software, a traditional data warehouse is relatively expensive compared to a 60-node Hadoop cluster built on commodity server racks and open source software.
Hadoop and platforms like it came about specifically because a traditional data warehouse couldn't, for a relatively low cost, scale to handle large, complex data sets and deliver sub-five-second response times.
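To make the iterative querying idea under "Time to information" more concrete, here is a minimal sketch of the pattern: run one query, keep its results, then run a second query against those results instead of the raw data. It uses Python's built-in sqlite3 module purely as a stand-in, not Hadoop or any particular analytics platform, and the table names, column names, sample data and the iterative_query function are all hypothetical.

```python
import sqlite3


def iterative_query(rows):
    """Run a first query, then a second query against its results."""
    # In-memory SQLite database as a stand-in for an analytics platform.
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE events (region TEXT, device TEXT, clicks INTEGER)")
    conn.executemany("INSERT INTO events VALUES (?, ?, ?)", rows)

    # Pass 1: aggregate the raw events and keep the result set around.
    conn.execute(
        "CREATE TEMP TABLE by_region AS "
        "SELECT region, SUM(clicks) AS total FROM events GROUP BY region"
    )

    # Pass 2: query the results of pass 1 rather than the raw data,
    # keeping only regions whose total is above the average total.
    cursor = conn.execute(
        "SELECT region, total FROM by_region "
        "WHERE total > (SELECT AVG(total) FROM by_region) "
        "ORDER BY total DESC"
    )
    result = cursor.fetchall()
    conn.close()
    return result


# Hypothetical sample data: (region, device, clicks).
SAMPLE_EVENTS = [
    ("east", "mobile", 120),
    ("east", "web", 80),
    ("west", "mobile", 40),
    ("north", "web", 30),
]

print(iterative_query(SAMPLE_EVENTS))  # prints [('east', 200)]
```

The second pass never touches the raw events table, which is the point of the pattern: each answer becomes the input to the next question, so the analyst can narrow in step by step without waiting for a new batch run.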