Once upon a time, organizations stored structured data in relational -- or sometimes NoSQL -- databases, and unstructured data in object storage systems like Amazon S3 or Google Cloud Storage. This distinction between structured and unstructured data storage has become less pronounced, however, and is having a significant impact on how organizations store, query and manage structured data.
The reason for this shift is the advent of platforms like Presto. A free and open source query engine, Presto and its ilk enable the use of low-cost object stores to persist data while still making it accessible through structured data access tools like SQL.
Let's consider a basic data warehouse populated with data from several online transaction processing systems, such as sales, inventory and fulfillment management systems.
The basic data warehouse
Data is extracted from these systems at regular intervals. It is then transformed and restructured for use in the data warehouse. The transformed data may be written directly to the data warehouse, or it may be persisted to files in an object storage system. In the latter case, a load process typically copies the data into the data warehouse.
In this scenario, the data warehouse runs on a platform like Amazon Redshift or Google Cloud BigQuery. Both platforms are managed services, but in the case of Redshift, you will have to select a server size and make some other management decisions. You will also need to compare the pros and cons of these services:
- The longstanding advantage of using services like these is that they enable users to query, report on and visualize the data using SQL and business intelligence reporting tools.
- The disadvantage in managing a data warehouse is cost, both in terms of infrastructure costs and management overhead.
Storing data on block storage, which is used in data warehouses running Redshift, for example, is more expensive than object storage.
Where Presto and Amazon Athena come in
But what if you could eliminate the last step in the extraction, transformation and load process? Rather than loading the transformed data into a relational or analytical database, you could operate on the data stored in object storage similar to the way you can operate on that data in a database management system (DBMS). You could then get the cost benefits of object storage and the query accessibility of databases.
This is where Presto and Amazon Athena -- which is based on Presto -- come in. These platforms provide a SQL interface to data that may be stored in object storage. When organizations store data in columnar file formats, such as Parquet and ORC, they get some of the advantages of databases that store data in columnar formats. Presto is a columnar query engine, so performance is optimized when it can read columns of data; specifically, it provides improved query performance relative to row-based file formats and a smaller storage footprint.
- Databases are designed for incremental updates, inserts and deletes. This is important when working with an analytics application that requires real-time or near-real-time updates.
- Querying files in object stores works better when bulk uploads are acceptable. In addition, many of the features of a relational DBMS, like integrity constraints, are not available when working directly with file-based storage.
While columnar file formats can improve performance relative to row-based formats, a DBMS will provide better performance. If you need to support large volumes of low-latency queries, then a relational or analytical database is the best option.
The biggest benefit of working with query-over-file platforms like Presto may be simplicity, especially in how they help blur the lines between structured and unstructured data storage and management. This is especially true when using a managed service like Amazon Athena.
Simpler pricing a contributing factor
Basic storage prices are converging and therefore becoming simpler, at least in Google Cloud. For example, BigQuery -- Google's managed analytical database designed for petabyte-scale data warehouses -- has adopted a storage pricing schedule that mirrors that of Google Cloud Storage.
Active data in BigQuery, which is data updated in the last 90 days, is billed at $0.02 per gigabyte, per month, while inactive data is billed at $0.01 per gigabyte, per month. Google Cloud Storage standard pricing is $0.02 per gigabyte, per month, and Nearline storage is $0.01 per gigabyte, per month.
If other cloud vendors follow a similar strategy, the distinction between structured and unstructured data storage will blur even further.