eBay presented a keynote at Hadoop World describing the architecture of its completely rebuilt search engine, Cassini, slated to go live in 2012. Cassini indexes all item content and user metadata to produce better rankings and refreshes its indexes hourly. It is built on Apache Hadoop for the hourly index updates and Apache HBase for random access to item information. Hugh E. Williams, VP of Search, Experience & Platforms for eBay Marketplaces, delivered the keynote, outlining the scale, the technologies used, and lessons from an 18-month effort by over 100 engineers to completely rebuild eBay's core site search. The new platform, Cassini, will support:
- 97 million active buyers & sellers
- 250 million queries per day
- 200 million items live in over 50,000 categories
eBay already stores 9 PB of data in Hadoop and Teradata clusters for analysis, but Cassini will be its first production application that users interact with directly. The new system is more extensive than the current one (Galileo):
| Old System: Galileo | New System: Cassini |
| --- | --- |
| 10s of factors used for ranking | 100s of factors used for ranking |
| title-only match by default | uses all data to match by default |
| manual intervention for rollout, monitoring, remediation | automated rollout, monitoring, remediation |
Cassini will keep 90 days of historical data online (currently 1 billion items) and will include user and behavioral data in ranking. Most of the work required to support the search system is done in hourly batch jobs that run in Hadoop. All of the different kinds of indexes are generated in the same cluster, an improvement over Galileo, which used a separate cluster for each kind of indexing. The Hadoop environment allows eBay to rescore or reclassify the entire site inventory as improvements are created.
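eBay has not published Cassini's indexing code, but the shape of an hourly Hadoop index build can be sketched. The following is a minimal, hypothetical MapReduce job, not eBay's implementation: the tab-separated "itemId, item text" input format and all names are assumptions. It inverts item text into a term-to-posting-list index:

```java
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class HourlyIndexBuild {

    // Input lines are assumed to be "itemId<TAB>title and description text".
    public static class InvertMapper extends Mapper<Object, Text, Text, Text> {
        private final Text term = new Text();
        private final Text itemId = new Text();

        @Override
        protected void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            String[] parts = value.toString().split("\t", 2);
            if (parts.length < 2) return;
            itemId.set(parts[0]);
            // Emit one (term, itemId) pair per token in the item text.
            for (String token : parts[1].toLowerCase().split("\\W+")) {
                if (token.isEmpty()) continue;
                term.set(token);
                context.write(term, itemId);
            }
        }
    }

    // Collect all item IDs for a term into a single posting list.
    public static class PostingReducer extends Reducer<Text, Text, Text, Text> {
        @Override
        protected void reduce(Text term, Iterable<Text> itemIds, Context context)
                throws IOException, InterruptedException {
            StringBuilder postings = new StringBuilder();
            for (Text id : itemIds) {
                if (postings.length() > 0) postings.append(',');
                postings.append(id);
            }
            context.write(term, new Text(postings.toString()));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "hourly index build");
        job.setJarByClass(HourlyIndexBuild.class);
        job.setMapperClass(InvertMapper.class);
        job.setReducerClass(PostingReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));   // hourly item dump
        FileOutputFormat.setOutputPath(job, new Path(args[1])); // index shard output
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```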
Items are stored in HBase and are normally scanned during the hourly index updates. When a new item is listed, it is looked up in HBase and added to the live index within minutes. HBase also supports bulk and incremental item writes, along with fast item reads and writes for item annotation.
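As a rough illustration of that access pattern, the HBase client API of that era supports random reads, annotation writes, and full-table scans as shown below. The table name `items`, column family `d`, and qualifiers are hypothetical, not eBay's schema:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.util.Bytes;

public class ItemStore {
    private static final byte[] FAMILY = Bytes.toBytes("d"); // hypothetical column family

    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        HTable table = new HTable(conf, "items"); // hypothetical table name

        // Random read: fetch a newly listed item by ID for the live index.
        Result item = table.get(new Get(Bytes.toBytes("item-123456")));
        byte[] title = item.getValue(FAMILY, Bytes.toBytes("title"));
        System.out.println(Bytes.toString(title));

        // Incremental write: annotate the item, e.g. with a classifier result.
        Put annotation = new Put(Bytes.toBytes("item-123456"));
        annotation.add(FAMILY, Bytes.toBytes("category"), Bytes.toBytes("50290"));
        table.put(annotation);

        // Bulk read: scan all items, as the hourly index jobs do.
        ResultScanner scanner = table.getScanner(new Scan());
        for (Result row : scanner) {
            // feed each item into the index build...
        }
        scanner.close();
        table.close();
    }
}
```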
Williams said the team was familiar with running Hadoop, and it had worked reliably with few problems. By contrast, he said the "ride so far with HBase has been bumpy." Williams noted that eBay remains committed to the technology, has been contributing fixes for the issues it found, is learning fast, and that the last two weeks have gone smoothly. The engineering team was new to HBase and ran into several issues when testing at scale, such as:
* production cluster configuration for their workloads
* hardware issues
* stability: unstable region servers, unstable master, regions stuck in transition
* monitoring HBase health: problems often went undetected until they affected the live service, so the team is adding extensive monitoring
* managing multi-step MapReduce jobs (see the sketch after this list)
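Multi-step pipelines like the ones Williams mentioned are typically handled by declaring job dependencies and letting a controller submit them in order. Below is a minimal sketch using Hadoop's JobControl; the job names and paths are placeholders, and this is generic Hadoop usage rather than eBay's actual pipeline (mapper and reducer setup is omitted for brevity):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.jobcontrol.ControlledJob;
import org.apache.hadoop.mapreduce.lib.jobcontrol.JobControl;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class IndexPipeline {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();

        // Step 1: classify items (placeholder paths).
        Job classify = Job.getInstance(conf, "classify-items");
        FileInputFormat.addInputPath(classify, new Path("/items/hourly-dump"));
        FileOutputFormat.setOutputPath(classify, new Path("/items/classified"));

        // Step 2: build the index from the classified output.
        Job index = Job.getInstance(conf, "build-index");
        FileInputFormat.addInputPath(index, new Path("/items/classified"));
        FileOutputFormat.setOutputPath(index, new Path("/index/hourly"));

        ControlledJob step1 = new ControlledJob(classify, null);
        ControlledJob step2 = new ControlledJob(index, null);
        step2.addDependingJob(step1); // step2 runs only after step1 succeeds

        JobControl control = new JobControl("index-pipeline");
        control.addJob(step1);
        control.addJob(step2);

        // JobControl runs on its own thread, submitting jobs as dependencies clear.
        Thread runner = new Thread(control);
        runner.start();
        while (!control.allFinished()) {
            Thread.sleep(5000);
        }
        control.stop();
        System.exit(control.getFailedJobList().isEmpty() ? 0 : 1);
    }
}
```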
Overall, Williams felt the project was ambitious but had gone quickly and well, and that the team was able to use Hadoop and HBase to build a significantly improved search experience.