Scanrail - Fotolia
It's all about time. Time is a non-renewable resource that only moves forward. Once spent, it can't be reclaimed or rewound. Time is the construct that separates events and moments. With IT operations, there's never enough of it. No one ever said they needed less time. There's unceasing pressure to do more with every second.
Take the example of high-performance computing (HPC), aka supercomputers, and the ever-constant pursuit to increase the number of calculations or floating-point operations per second (FLOPS) they perform. Measured today in the dozens to hundreds of petaFLOPS, the next goal is exaFLOPS, a thousand times faster.
That same unrelenting burden is readily apparent in storage. Storage infrastructure performance is measured in IOPS and throughput per second -- megabytes to gigabytes per second moving to terabytes per second.
It's this perpetual hunger for greater infrastructure performance per second that's the unspoken premise behind the question: Does it make more sense to move the data or the compute?
No surprises here
Common wisdom has it that moving the compute to the data makes more sense, and in the vast majority of use cases, it does. It's generally easier and faster because the executable accounts for much less data to move around a network or interconnect than input data. Data movement requires network bandwidth, time and, in most situations, skilled manual labor. Labor requirements alone drive many IT managers to move the compute to the data.
An excellent example of how that works is the Hadoop architecture. The Hadoop JobTracker is proficient at scheduling individual tasks to the compute node where the required data resides. If that specific compute node is unavailable for a job, it tries to schedule that job on a compute node within the same rack. If that too is unavailable, it will schedule the job elsewhere on the grid. The principle behind the Hadoop JobTracker scheduling algorithm is that it takes too much time to move input data to the computational node. Moving data adds latency to the job and increases response time.
The Hadoop example is just one definitive circumstance when moving the compute to the data is an obvious fit. Another example is when the bandwidth between the compute and the data is insufficient. There are other conditions, however, when moving the compute would be so costly or complex as to be unfeasible. And there are settings where moving the data is more practical because of workflows, collaboration, security or infrastructure.
Let's break down each these scenarios -- compute to data, data to compute -- and clarify when it makes more sense to move the data or the compute.
When moving the compute to the data makes sense
As Hadoop JobTracker shows, it's usually simpler and faster to move the compute rather than the data. Infrastructure performance is often faster when the compute is closer to the data. This results directly from the shorter distance. Distance equals latency. Latency reduces infrastructure performance and is directly tied to the speed of light. As that old joke states, "The speed of light is not just a limit; it's the law!"
Speed of light and distance will continue to be a latency problem until wormhole technology is invented, which isn't happening anytime soon. Reducing distance and latency improves performance per second when latency is the performance per time bottleneck. When the bottleneck is the CPU, memory, software inefficiency or storage infrastructure, improvements will be limited, though.
Applications that benefit the most from moving the compute and the data closer together are likely to be latency-sensitive. Big volume transactional applications, such as high-frequency trading, are ideal because in this space microseconds (µs) equal millions of dollars. Other big volume transactional applications that benefit in this way include HPC, online transactional processing, e-commerce and online gaming. All of these applications see advantages from reducing latency by moving the compute closer to the data.
The value of moving the compute to the data becomes glaringly obvious when there's a large amount of data to be moved over a WAN. A good example is when you have, say, a petabyte of data in a cloud repository that must be moved on premises over a network connection. This is how hardware security modules, aka appliances, and cloud storage gateways work. They require data they've moved to a cloud be moved back and rehydrated into the original storage to be read, processed or altered.
Assuming a gigabit per second of bandwidth with the theoretical performance of 125 MBps, it will take quite a long time to get at that data. TCP/IP won't deliver the theoretical performance. But assuming it does, hypothetically, it will take at least 93 days to move that petabyte to its destination. The more likely scenario is throughput will top out at approximately 30% of rated performance, which pushes the time to approximately 309 days, or more than 10 months -- not exactly timely. Even with WAN optimizers the time consumed will fall somewhere in between, still excessively long.
Moving the compute to the data in this scenario takes appreciably less time. This is the process by which a compute instance is spun up with the application in the cloud to process the data stored in the cloud. After processing is complete, the instance is torn down so as not be charged for that instance beyond its use. The processing of the data occurs much sooner versus waiting on moving the data on premises to Hadoop JobTracker. It also saves huge costs in cloud storage egress fees.
Another type of application that benefits from having the compute closer to the data is the parallel I/O used in many HPC ecosystems. Here, an HPC processes large amounts of data in parallel via a parallel file system, such as Lustre, Panasas, Quobyte, IBM Spectrum Scale (formerly GPFS) and WekaIO. Meeting the latency demands of the parallel file system generally requires placing data inside each HPC server node's internal storage to shorten latency distance. Nonvolatile memory express over fabrics (NVMe-oF) and Remote Direct Memory Access are altering that requirement somewhat by reducing network latency and, to some extent, minimizing the effects of distance latencies.
However, keeping the compute relatively close to the stored data reduces both. The same principles employed so efficiently by Hadoop JobTracker are also used by HPC job schedulers, such as Slurm Workload Manager. Slurm, which is open source, adeptly schedules each job on the node or nodes with the most available resources closest to the data to get the best performance. New shared rack-scale flash storage simplifies this with only a modicum of additional NVMe-oF latencies in the 5 µs to 10 µs range.
Several relational SQL databases, such as Oracle Database and SAP HANA, have versions that keep all the data being processed in-memory. This use case exploits both the lower latency and higher performance of dynamic RAM and the lower latency of being extremely close to the compute to deliver much-improved database performance. High cost limits it to only the most demanding I/O performance applications, though.
Moving the compute nearer the data can also deliver significant infrastructure performance increases to some high-performance storage architectures. Storage software services are commonly CPU- and memory-intensive. By moving the compute closer to the data, many processes can be offloaded from the storage controller, freeing it up for reads/writes. This is the supposition behind SSD near-line controller coprocessing that offloads the main storage controller CPU from redundant, performance-slowing tasks. That doesn't always improve storage performance, however, and can add cost and slow performance.
When moving the data to the compute makes more sense
Moving the data to the compute typically means infrastructure performance isn't the primary concern. The primary concerns are likely cost, security, convenience and simplicity.
Take the use case of collaboration. It's nearly impossible to move the compute to the data when collaborators belong to different organizations, companies and geographic areas and use different applications. Few organizations will allow outsider access past security into their systems to implement unsanctioned applications. Therefore, collaboration workflows among heterogeneous organizations generally compel the data move to the compute. There are plenty of collaboration applications -- such as Box, Atlassian Confluence, Dropbox, Google Drive, Atlassian Jira and Microsoft OneDrive -- that all move the data to the compute.
Cloud bursting, used to take advantage of cloud compute elasticity when on-premises assets aren't available, is another use case when the data must be moved to the compute. In this scenario, data is copied and sent to the cloud, where a cloud-compute instance with the application is spun up and the data processed. Results are then migrated back on premises, cloud data copy is deleted and the cloud instance torn down. This is a cost-effective use case when on-premises compute is temporarily insufficient because of unexpected, temporary or seasonal demand.
Analytics is a use case discussed earlier for moving the compute to the data. However, analytics is also a use case when the data often has to be moved to the compute. Analytics engines must read data to analyze it. That requires data, more often than not, be moved to the analytics engine for effective analysis.
Spark is the analytics engine for Hadoop, where data needs to be on multiple Hadoop distributed file system (HDFS) nodes for best performance. And although there are file and object storage systems that can make resident file or object data appear as HDFS, they don't always provide adequate performance, which means data must be moved to Hadoop storage. NoSQL databases, such as Apache Cassandra, Couchbase, DataStax, Apache HBase, MemcacheDB, MongoDB, Neo4J, SAP OrientDB and Redis, generally require moving data to them as well.
Aggregation is another use case for moving the data to the compute, a good example of which is IoT. Thousands to millions to billions of devices generate data aggregated for real-time analytics and long-term batch discoveries. Artificial intelligence for IT operations (AIOps) is an example of this machine-data aggregation and real-time analytics. AIOps products come from many vendors, such as BMC Software, Correlsense, Corvil, Hewlett Packard Enterprise, IBM, ITRS Group, Moogsoft, SAP, Splunk and StrongBox Data Solutions.
Distributed multinode storage systems are also a use case when moving the data to the compute makes performance sense. Several distributed storage systems constantly load balance and move stored data across the different storage nodes to make sure performance is optimized.
So what's the answer?
Moving the compute to the data or the data to the compute to improve infrastructure performance and save time is use-case specific. It isn't absolute, and in several instances, both apply for different aspects of the use case. In the end, the answer to the question "Does it make more sense to move the data or the compute?" comes down to balancing application requirements for performance, functionality, cost, security, practicality and ease of use.
- Storage Insights Enables Broader Use of Storage Resource Management –Arrow and IBM
- StrongLink Autonomous Data and Storage Management –Fujifilm Recording Media USA, Inc.
- Storage Designs for Big Data and Real-Time Analytics –Western Digital
- Managed Apache Spark for Large-Scale Analytics –Instaclustr