Does data, like a celestial body, have its own gravitational pull that attracts applications and services into its orbit? That was the proposal made in 2010 by VMware’s Dave McCrory, who has recently put some mathematical rigor behind the idea. On his new website, DataGravity.org, McCrory outlines a formula for data gravity and asks the technical community for help in vetting and applying it.
In a 2011 post about data gravity, McCrory describes the basics of his principle.
Data Gravity is a theory around which data has mass. As data (mass) accumulates, it begins to have gravity. This Data Gravity pulls services and applications closer to the data. This attraction (gravitational force) is caused by the need for services and applications to have higher bandwidth and/or lower latency access to the data.
The principle has resonated deeply with the technical community. It is frequently discussed at conferences, including the recent GigaOM Structure, and written about in articles such as the ReadWriteCloud piece entitled "What 'Data Gravity' Means to Your Data." In that article, author Joe Brockmeier warns against casual investment in data storage that may generate significant gravity.
Whether it's a single-user application like iTunes, or a company wide project: You need to consider the implications of data gravity - once your data is in, how hard will it be to break the gravitational field?
The stronger the data gravity involved, the more cautious you should be when you choose your data storage solution. It's likely that once you have a sufficient amount of data wrapped up in a solution, it's going to be very difficult (if not impossible) to justify the costs of moving it away.
On his DataGravity.org site, McCrory describes his approach to quantifying the principle. First, he tackles the calculation of Data Mass.
The first thing that I learned was that in order to have Gravity, you must calculate Mass. While this is trivial Physics, applying this to an abstract concept is a bit more difficult. After a great deal of time and many versions, I have a current Mass formula for Data and a Mass formula for Applications (either or both of these could change at some point)
He calculates Data Mass by multiplying the data volume (the size of the data, measured in megabytes) by the data density (the compression ratio of the data).
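Expressed as a quick sketch, that definition is a single multiplication; the function and parameter names below are illustrative, not McCrory's own:

```python
def data_mass(size_mb: float, compression_ratio: float) -> float:
    """Data Mass = data volume (size in megabytes) * data density
    (the compression ratio of the data)."""
    return size_mb * compression_ratio

# e.g. 100,000 MB stored at a 3:1 compression ratio -> Data Mass of 300,000
```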
McCrory then calculates Application Mass by multiplying the application volume (the amount of memory used plus the amount of disk space used) by the application density (the sum of the memory compression ratio, the disk compression ratio, and the total CPU utilization).
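The same reading of Application Mass, again with names chosen only for illustration:

```python
def application_mass(memory_mb: float, disk_mb: float,
                     mem_compression: float, disk_compression: float,
                     cpu_utilization: float) -> float:
    """Application Mass = application volume * application density."""
    volume = memory_mb + disk_mb  # memory used plus disk space used
    density = mem_compression + disk_compression + cpu_utilization
    return volume * density
```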
To account for the significant impact of the network on data gravity, McCrory injects variables for network latency, network bandwidth, the number of requests per second, and the average size of those requests. He combines all of these factors to arrive at a single data gravity number. In an interview with InfoQ, McCrory shared that he considered and discarded many additional variables. He attempted to factor in the impact of create/read/update/delete operations against a given data mass, and even the type of storage the data rested upon, but ultimately decided that the formula published on his site captured the key aspects of data gravity.
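McCrory's full formula appears on DataGravity.org and is not reproduced here. Purely to make the shape of the calculation concrete, the sketch below combines the variables he names under an assumed Newtonian-style form, with the two masses multiplying and latency playing the role of distance; treat it as an illustration of the inputs, not as his published formula:

```python
def data_gravity(data_mass: float, app_mass: float,
                 latency_s: float, bandwidth_mbps: float,
                 requests_per_s: float, avg_request_mb: float) -> float:
    """Illustrative only: an assumed Newtonian-style combination of the
    inputs McCrory names. His published formula on DataGravity.org may
    treat the network terms differently."""
    # Assumption: heavier traffic (bandwidth, request rate, request size)
    # strengthens the pull, while latency weakens it as distance squared.
    network_pull = bandwidth_mbps * requests_per_s * avg_request_mb
    return (data_mass * app_mass * network_pull) / (latency_s ** 2)
```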
McCrory considers the number produced by this calculation to be relative to the network the data exists in. According to McCrory, each network is a universe, and a given data mass exists within that universe. While one can meaningfully compare data gravity numbers for two objects within the same network, McCrory does not yet have enough information to confidently compare data gravity across networks.
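Continuing the illustrative sketch above, two data masses sharing one network (the same "universe," in McCrory's terms) can be compared directly; all of the values here are invented:

```python
# Two data masses on the same network; values invented for illustration.
warehouse = data_mass(size_mb=512_000, compression_ratio=2.0)
logs = data_mass(size_mb=64_000, compression_ratio=4.0)
app = application_mass(memory_mb=8_192, disk_mb=20_480,
                       mem_compression=1.5, disk_compression=2.0,
                       cpu_utilization=0.65)

g_warehouse = data_gravity(warehouse, app, latency_s=0.002,
                           bandwidth_mbps=1_000, requests_per_s=500,
                           avg_request_mb=0.5)
g_logs = data_gravity(logs, app, latency_s=0.002,
                      bandwidth_mbps=1_000, requests_per_s=50,
                      avg_request_mb=8.0)

# Comparable within this one network; across networks, McCrory does not
# yet consider the numbers comparable.
print(g_warehouse > g_logs)
```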
On the DataGravity.org site, McCrory lists a few of the possible uses of this principle.
Reasons to move toward a source of Data Gravity (Increase Data Gravity)
- You need Lower Latency for your Application or Service
- You need Higher Bandwidth for your Application or Service
- You want to generate more Data Mass more quickly
- You are doing HPC
- You are doing Hadoop or Realtime Processing
Reasons to resist or move away from a source of Data Gravity (Decrease Data Gravity)
- You want to avoid lock-in / keep escape velocity low
- Application Portability
- Resiliency to Increases in Latency or Decreases in Bandwidth (aka Availability)
Data Gravity and Data Mass may have other uses as well:
- Making decisions of movement or location between two Data Masses
- Projecting Growth of Data Mass
- Projecting Increases of Data Gravity (Which could signal all sorts of things)
This formula is a work in progress, according to McCrory, and he is actively seeking real-world use cases and tests of this principle.