How we use data changes over time and across different scenarios. Take time-series data, for example. This type of data includes performance monitoring information, measurements from IoT sensors, streaming location data from mobile devices and other data that includes time as part of its unique identifying characteristics.
Recently generated time-series data is especially useful when analyzing data streams for anomalies or sending a coupon to a mobile device in close proximity to a business. Data such as this can be highly valuable, at least for a short period of time. Sending an alert to an engineer about a disk that is about to run out of space is useful only if the message arrives before the impending event. Similarly, a coupon attracting a potential customer to your store is far less valuable if it arrives after the customer is back home.
A time-series data store can be essential when creating statistical and machine learning models that detect anomalies. With enough data, algorithms can detect patterns that typically precede significant events. Running out of disk space is easy to predict; making inferences from data that shifts due to overall trends or seasonal variations requires more nuanced approaches to modeling. The more complex the modeling approach, the more likely it will benefit from additional data.
Even before machine learning became a regular part of the enterprise IT toolbox, older time-series data was useful for analysis. For example, comparing a store's sales for the month versus the same month in a previous year is a common way to assess performance. If you look at overall trends, say over the past six months, then a visualization that displays data at one-day levels of aggregation may be indistinguishable from one that uses one-minute levels of measurement.
Managers can take advantage of the different ways time-series data is used to optimize both storage costs and the performance of applications using the data.
Recently generated time-series data
Recent time-series data is the most likely to be queried, so it should be accessible with low latency. In some cases, such as stream processing, this may require using an in-memory cache to hold data before writing it to persistent storage. This is often the case in anomaly detection and other applications that need to detect and respond within a narrow window of time.
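As a minimal sketch of that pattern, the class below keeps the most recent points per series in memory using a bounded `deque`, so the freshest data can be read without a round trip to persistent storage. The class and method names are hypothetical, and a real implementation would also flush evicted points to a durable store:

```python
from collections import deque
from time import time


class RecentPointsCache:
    """In-memory buffer of the most recent points for each series.

    Hypothetical sketch: a production version would also flush points
    to persistent storage as they age out of the buffer.
    """

    def __init__(self, max_points=1000):
        self.max_points = max_points
        self.series = {}  # series_id -> deque of (timestamp, value)

    def append(self, series_id, value, timestamp=None):
        # deque(maxlen=...) silently drops the oldest point when full.
        buf = self.series.setdefault(series_id, deque(maxlen=self.max_points))
        buf.append((timestamp if timestamp is not None else time(), value))

    def latest(self, series_id, n=1):
        # Return the n most recent (timestamp, value) pairs, oldest first.
        buf = self.series.get(series_id, deque())
        return list(buf)[-n:]
```

An anomaly detector could then call `cache.latest("disk-1", 60)` to score the last minute of one-second samples without touching the database.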
When time-series data is persisted, you need to consider how it will be queried. For example, do you usually query a time-series data store with respect to some other value, like a customer, sensor or location identifier? Ideally, data that is likely to be accessed at the same time is stored together, such as in the same partition or database shard.
At the same time, you want to avoid writing too much data to a single partition or shard. This can lead to hot-spotting, in which a series of I/O operations is isolated to a small number of storage components instead of being parallelized across many of them.
This is a common concern with wide-column NoSQL databases, such as Google Cloud Bigtable and Apache Cassandra. But it can occur with any distributed data store, even Google Cloud Spanner, a horizontally scalable relational database.
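One common way to balance these two goals in a wide-column store is a salted row key: a small hash-derived bucket prefix spreads sequential writes across nodes, while the entity identifier and a zero-padded timestamp keep each entity's readings stored together in time order. The key layout and bucket count below are illustrative assumptions, not a prescribed Bigtable or Cassandra schema:

```python
import hashlib

SALT_BUCKETS = 8  # hypothetical bucket count; tune to the cluster size


def row_key(sensor_id: str, timestamp_ms: int) -> str:
    """Build a salted row key for a time-series reading.

    The salt prefix fans writes out across SALT_BUCKETS key ranges to
    avoid hot-spotting; within a bucket, keys for one sensor sort
    lexicographically by time, so range scans per sensor stay cheap.
    """
    digest = hashlib.md5(sensor_id.encode("utf-8")).hexdigest()
    bucket = int(digest, 16) % SALT_BUCKETS
    # Zero-pad the timestamp so string ordering matches time ordering.
    return f"{bucket:02d}#{sensor_id}#{timestamp_ms:013d}"
```

Reading a sensor's history then means scanning one contiguous key range (all readings share the sensor's bucket), while concurrent writes from many sensors land in different ranges. The trade-off is that queries spanning all sensors must fan out across every bucket.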
Older time-series data
When you're no longer likely to access older data at a fine-grained level, consider aggregating it and storing the aggregates instead.
For example, think about how you might use time-series data from a previous period. You might compare daily volumes or rates from a year ago with more recent data, but rarely, if ever, would you compare year-old data at a minute level. In this case, daily aggregates can be computed from your time-series data store and kept in low-latency storage for query purposes.
Simple totals and counts are often enough. But if you need more information about the data you are aggregating, you can keep descriptive statistics, such as mean and standard deviation, to describe the distribution of the original data.
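A simple roll-up along these lines can be sketched with the standard library alone: group raw points by calendar day, then keep the count, total, mean, and standard deviation for each day. The function name and the `(timestamp, value)` input shape are assumptions for illustration:

```python
from collections import defaultdict
from datetime import datetime, timezone
from statistics import mean, pstdev

def daily_rollup(points):
    """Aggregate (unix_timestamp, value) pairs into per-day statistics.

    Keeps count and total for simple queries, plus mean and population
    standard deviation to describe the distribution of the raw data
    that is being discarded or moved to colder storage.
    """
    by_day = defaultdict(list)
    for ts, value in points:
        day = datetime.fromtimestamp(ts, tz=timezone.utc).date().isoformat()
        by_day[day].append(value)
    return {
        day: {
            "count": len(vals),
            "total": sum(vals),
            "mean": mean(vals),
            "stddev": pstdev(vals),
        }
        for day, vals in by_day.items()
    }
```

Storing these four numbers per day, rather than 1,440 one-minute samples, is what makes keeping years of history in low-latency storage affordable.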
Aggregates are useful to satisfy queries that do not require fine-grained detail, but training machine learning models depends on large volumes of detailed data. For this use case, it makes sense to store fine-grained data on lower-cost storage until it is needed for model training. Since model training can take advantage of GPUs and other accelerators, one challenge is to keep these devices supplied with data at a rate they can process.
For example, Google designed Nearline Storage in Google Cloud for data that is not likely to be read more than once per month, making it a good candidate for large volume, long-term storage. When data is needed for machine learning tasks, you can copy it to low-latency persistent disks so it can be directly accessed by training programs.
The way we use data changes over time and by use case. This presents an opportunity to tailor our storage offerings to keep costs down while also optimizing for the way data is used. This applies just as much to a time-series data store as it does to other business data.