As the amount of data being collected and stored expands at astounding rates, scalability requirements once reserved to the Googles of the world are becoming more common and often require dedicated solutions. Daniel Peng and Frank Dabek just published a paper detailing Percolator, Google's indexing system which stores tens of petabytes of data and processes billions of updates per day on thousands of machines.
Updating an index of the web as documents are crawled requires continuously transforming a large repository of existing documents as new documents arrive. This task is one example of a class of data processing tasks that transform a large repository of data via small, independent mutations. These tasks lie in a gap between the capabilities of existing infrastructure. Databases do not meet the storage or throughput requirements of these tasks [and] MapReduce and other batch-processing systems cannot process small updates individually as they rely on creating large batches for efficiency.
Daniel and Frank explain that even though, the indexing itself is a bulk processing task that can be expressed as a series of MapReduce operations, the issues is that when the index needs to be updated, after recrawling a set of pages, it would not be sufficient to run the MapReduce operations on the new pages because of their links. The MapReduces must actually be run over the entire repository. Before Percolator, the search index was actually produced that way. The main issue with that approach is the latency introduced when starting over.
The solution lies in the ability to optimize incremental processing. One of the key design concepts behind Percolator was to provide random access to the repository to process documents individually, and thus avoiding the global processing required by MapReduce. Percolator was designed to provide cross-row, cross-table ACID-compliant transactions via "snapshot isolation" since many threads on many machines need to transform the repository. Percolator also provides "observers" which are invoked by the system whenever a user-specified column changes to assist developers keeping track of the state of the computation.
The authors add:
Percolator was built specifically for incremental processing and is not intended to supplant existing solutions for most data processing tasks. Computations where the result can’t be broken down into small updates (sorting a file, for example) are better handled by MapReduce.
Percolator is better suited when the computation requires strong consistency and has very large requirements in terms of data size, CPU,... In Google's case, its primary application is preparing Web pages for inclusion in the live Web search index. With Percolator, Google is now able to process documents as they are crawled, reducing the average latency by a factor of 100 and the average age of the document by 50%.
Percolator is built on top of the BigTable distributed storage system and consists of three binaries running on every machine of the cluster: a worker, a BigTable tablet server and a Google File System chunkserver.
All observers are linked into the Percolator worker, which scans the Bigtable for changed columns (“notifications”) and invokes the corresponding observers as a function call in the worker process. The observers perform transactions by sending read/write RPCs to Bigtable tablet servers, which in turn send read/write RPCs to GFS chunkservers.
Percolator has no central location for transaction management and lacks a global lock detector. Since Percolator doesn't have low latency requirements like a traditional DBMS running OLTP tasks, it takes a lazy approach to cleaning up locks introducing transaction commit delays of tens of seconds.
This increases the latency of conflicting transactions but allows the system to scale to thousands of machines...While it is possible to incrementally process data with- out the benefit of strong transactions, transactions make it more tractable for the user to reason about the state of the system and to avoid the introduction of errors into a long-lived repository.
Its architecture scales linearly over many orders of magnitude on commodity machines. In terms of performance, Percolator is somewhere between MapReduce and DBMSs. When compared to a DBMS, it uses far more resources to process a fixed amount of data because of its distributed architecture. It also introduces a 30-fold overhead. When compared to MapReduce, Percolator can process data with far lower latency, with an additional set of resources to support random lookups. The system has been delivering Google's websearch index since April 2010 and reduced the latency while requiring an acceptable increase in resources.
Do you see or foresee an increasing need for working with large data sets? Phil Wehlan recently asked the same question and requested your feedback.