Foursquare's MongoDB Outage

Foursquare recently suffered a total site outage lasting eleven hours. The outage was caused by unexpected, uneven growth in their MongoDB database that their monitoring didn't detect. The outage was prolonged when an attempt to add a partition didn't work due to fragmentation, which required taking the database offline to compact it. This article provides more details on what happened, why the system failed, and the responses of Foursquare and 10gen to the incident.

Foursquare is a location-based social network which has been growing rapidly, reaching three million users in August. On October 4th, Foursquare experienced an eleven-hour outage because of capacity problems due to that growth. Nathan Folkman, Foursquare's Director of Operations, wrote a blog entry that both apologized to users and provided some technical details about what happened. Subsequently, 10gen's CTO Eliot Horowitz posted a more detailed post-mortem to the MongoDB users mailing list. 10gen develops and supports MongoDB, and provides support for Foursquare. This post-mortem generated a lively discussion, including more technical details from Foursquare's engineering lead, Harry Heymann.

Basic System Architecture

The critical system that was affected was Foursquare's database of user check-ins. Unlike many history databases, where only a small fraction of the data needs to be accessed at any given time, this one is hot throughout: 10gen's CEO Dwight Merriman told us that "For various reasons the entire DB is accessed frequently so the working set is basically its entire size." Because of this, the memory requirement for this database was the total size of the data it held. If the database size exceeded the RAM on the machine, the machine would thrash, generating more I/O requests than its four disks could service. In response to our questions he clarified that "a very high percentage of documents were touched very often. Much higher than you would expect."
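
Because the working set is effectively the entire database, a rough capacity check is simply to compare the data plus index size reported by dbStats against the machine's RAM. Below is a minimal sketch in Python with pymongo; the hostname, database name, and 80% threshold are illustrative assumptions, not details from Foursquare's setup:

    from pymongo import MongoClient

    client = MongoClient("shard0.example.com", 27017)
    stats = client["foursquare"].command("dbStats")   # sizes are reported in bytes

    ram_bytes = 66 * (1024 ** 3)                  # a 66 GB machine, as in the article
    working_set = stats["dataSize"] + stats["indexSize"]

    print("data + indexes: %.1f GB" % (working_set / (1024 ** 3)))
    if working_set > 0.8 * ram_bytes:             # arbitrary 80% headroom threshold
        print("WARNING: this shard no longer fits comfortably in RAM")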

Initially the database ran on a single EC2 instance with 66 GB of RAM. About two months ago, Foursquare had almost run out of capacity and migrated to a two-shard cluster, with each shard having 66 GB of RAM and replicating data to a slave. After this migration, each shard held approximately 33 GB of data. Data in the shards was partitioned into "200 evenly distributed chunks by user id" so all data for a given user is held in a single shard.
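
In MongoDB terms, the setup described above corresponds to enabling sharding on the check-in collection with the user id as the shard key, so that chunks are ranges of user ids. The following is a minimal sketch using pymongo against a mongos router; the database, collection, and field names are illustrative, not Foursquare's actual schema:

    from pymongo import MongoClient

    # Connect to a mongos router (hostname illustrative).
    mongos = MongoClient("mongos.example.com", 27017)

    # Enable sharding on the database, then shard the check-in collection
    # on user id so that all of a given user's check-ins land on one shard.
    mongos.admin.command("enableSharding", "foursquare")
    mongos.admin.command("shardCollection", "foursquare.checkins",
                         key={"userid": 1})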

 

The Outage

As the system continued to grow, the partitions grew in an imbalanced fashion.  Horowitz notes:

It’s easy to imagine how this might happen: assuming certain subsets of users are more active than others, it’s conceivable that their updates might all go to the same shard.

MongoDB splits chunks as they grow past 200 MB into two 100 MB chunks. The end result was that when the total system grew past 116 GB of data, one partition was 50 GB in size, but the other had exceeded the 66 GB of RAM available on its machine, causing unacceptable performance for requests that hit that shard.
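
That 200 MB split threshold is the cluster's maximum chunk size, which is stored in the sharding config database. Here is a hedged sketch of inspecting and lowering it (the value is in megabytes and only affects future splits; hostnames are illustrative):

    from pymongo import MongoClient

    mongos = MongoClient("mongos.example.com", 27017)
    config = mongos["config"]

    # Read the current maximum chunk size in MB (absent means the default).
    print(config.settings.find_one({"_id": "chunksize"}))

    # Lower the split threshold, e.g. to 64 MB, so individual chunks stay
    # small enough to migrate quickly; this applies to future splits only.
    config.settings.update_one({"_id": "chunksize"},
                               {"$set": {"value": 64}},
                               upsert=True)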

In an attempt to fix the system, the team added a third shard to the database, aiming to transfer 5% of the system's data so that all shards would fit within RAM. They transferred only 5% because, Horowitz notes, "we were trying to move as little as possible to get the site back up as fast as possible." However, this didn't relieve the performance problem on the full shard. As Horowitz notes:

...we ultimately discovered that the problem was due to data fragmentation on shard0. In essence, although we had moved 5% of the data from shard0 to the new third shard, the data files, in their fragmented state, still needed the same amount of RAM. This can be explained by the fact that Foursquare check-in documents are small (around 300 bytes each), so many of them can fit on a 4KB page.  Removing 5% of these just made each page a little more sparse, rather than removing pages altogether.


The data that was migrated was sparse because the documents were small and because "Shard key order and insertion order are different. This prevents data from being moved in contiguous chunks." In order to fix the performance problem, they had to compact the overfull shard. MongoDB currently only allows offline compaction of a shard. This took four hours, a function of the volume of data to compact and "the slowness of EBS volumes at the time." When this completed, the size of the shard had shrunk by 5% and they were able to bring the system back online, after an eleven-hour outage. No data was lost during the incident.
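
In command terms, the two recovery steps roughly correspond to registering a new shard through the mongos router and then running an offline compaction on the overfull shard with repairDatabase, which rewrites the data files. Below is a hedged sketch in pymongo; hostnames are illustrative, and as the post-mortem describes, the repair step means taking that shard out of service for the duration of the rewrite:

    from pymongo import MongoClient

    # 1. Register the new (third) shard with the cluster via mongos.
    mongos = MongoClient("mongos.example.com", 27017)
    mongos.admin.command("addShard", "shard2.example.com:27017")

    # 2. Compact the overfull shard by rewriting its data files offline.
    #    repairDatabase blocks the database while it runs, which is why
    #    this step translated into hours of downtime.
    shard0 = MongoClient("shard0.example.com", 27017)
    shard0["foursquare"].command("repairDatabase")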

Follow Up

After the system was restored, they set up additional shards and distributed the data evenly. To fix fragmentation, they compacted the partitions on each of the slaves, then switched the slaves to be masters and compacted the former masters. The system ended up using about 20 GB on each partition.
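
Once rebalancing is done, the evenness of the distribution can be sanity-checked by counting how many chunks of the collection each shard owns in the config database. A small sketch, reusing the illustrative names from above (the config.chunks fields shown here are those used by MongoDB versions of that era):

    from collections import Counter
    from pymongo import MongoClient

    mongos = MongoClient("mongos.example.com", 27017)

    # Each document in config.chunks records one chunk and the shard that owns it.
    chunks = mongos["config"]["chunks"].find({"ns": "foursquare.checkins"})
    per_shard = Counter(chunk["shard"] for chunk in chunks)

    for shard, count in sorted(per_shard.items()):
        print("%s holds %d chunks" % (shard, count))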

Horowitz noted that

The 10gen team is working on doing online incremental compaction of both data files and indexes.  We know this has been a concern in non-sharded systems as well.  More details about this will be coming in the next few weeks.

Horowitz notes:

The main thing to remember here is that once you’re at max capacity,
it’s difficult to add more capacity without some downtime when objects
are small.  However, if caught in advance, adding more shards on a
live system can be done with no downtime


The Foursquare team also responded by promising to improve communication and operational procedures, and, according to Folkman:

we’re also looking at things like artful degradation to help in these situations. There may be times when we’re overloaded in the future, and it would be better if certain functionalities were turned off rather than the whole site going down, obviously.


Heymann noted:

Overall we still remain huge fans of MongoDB @foursquare, and expect to be using it for a long time to come.

 

Community Response

In response to the post-mortem a number of questions were raised:

  1. Nat asked:
    Can repairDatabase() leverage multiple cores? Given that data is broken down into chunks, can we process them in parallel? 4 hours downtime seems to be eternal in internet space.

    Horowitz responded:
    Currently no.  We're going to be working on doing compaction in the
    background though so you won't have to do this large offline
    compaction at all.
  2. Alex Popescu asked:
    is there a real solution to the chunk migration/page size issue?

    Horowitz responded:
    Yes - we are working on online compaction, which would mitigate this.
  3. Suhail Doshi asked:
     I think the most obvious question is: How do we avoid maxing out our
     mongodb nodes and know when to provision new nodes?
     With the assumptions that: you're monitoring everything.
     What do we need to look at? How do we tell? What happens if you're a
     company with lots of changing scaling needs and features...it's tough
     to capacity plan when you're a startup.

    Horowitz answered:
    This is very application dependent.  In some cases you need all your
    indexes in ram, in other cases it can just be a small working set.
    One good way to determine this is figure out how much data you
    touch in a 10 minute period.  Indexes and documents.  Make sure
    you can either store that in ram or read that from disk in the same
    amount of time.
  4. Nat also asked about back pressure: "It seems that when data grows out of memory, the performance seems to degrade significantly."

    To which Roger Binns added:
    What MongoDB needs to do is decrease concurrency under load so that existing
    queries can complete before allowing new ones to add fuel to the fire.
    There is a ticket for this:
      http://jira.mongodb.org/browse/SERVER-574


There was also some discussion of whether solid state drives would have helped improve performance, but there was no conclusive statement about how they might have impacted it. One also wonders why the user id partitions grew in such an imbalanced fashion. Partitioning by a hash of the user id would presumably have been nearly balanced, so it's likely that a biased partitioning (e.g., putting older users on one shard) resulted in the imbalanced data growth.
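
To illustrate why hashing tends to balance growth (this is a generic sketch, not how MongoDB assigned chunks at the time): hashing the user id scatters early, highly active users across all shards, whereas range-partitioning sequential user ids can concentrate them on one shard.

    import hashlib

    NUM_SHARDS = 3  # illustrative

    def shard_for_user(user_id):
        """Map a user id to a shard by hashing it, so consecutive ids are
        spread roughly evenly instead of clustering in one range."""
        digest = hashlib.md5(str(user_id).encode("utf-8")).hexdigest()
        return int(digest, 16) % NUM_SHARDS

    # Early adopters (low user ids) end up scattered across all shards.
    print([shard_for_user(uid) for uid in range(1, 11)])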

Monitoring and Future Directions

We interviewed 10gen's CEO Dwight Merriman to better understand some issues in this case. In general, we asked how to best monitor large scale MongoDB deployments, to which he replied that there are many monitoring tools, noting that Munin is commonly used and that it has a MongoDB plugin. We asked:

From the description, it seems like you should be able to monitor the resident memory used by the mongo db process to get an alert as shard memory ran low. Is that viable?

If the database is bigger than RAM, MongoDB, like other db's, will tend to use all memory as a cache.  So using all memory doesn't indicate a problem.  Rather, we need to know when working set is approaching the size of ram.  This is rather hard with all databases. One good way is to monitor physical I/O closely and watch for increases.

Merriman agreed that in the case of Foursquare, where the entire database needs to be held in RAM, monitoring resident memory or just total database size would be sufficient to detect problems in advance. This implies that it would have been fairly simple to identify the problem before a shard filled up. It seems that no one expected the imbalanced growth, so whatever monitoring was in place was inadequate to identify the problem.
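
A minimal monitoring sketch along these lines, in Python with pymongo: compare the mongod process's resident memory (serverStatus reports it in MB) against the machine's RAM, and watch page faults as a rough proxy for physical I/O. The 80% threshold and hostname are illustrative assumptions:

    from pymongo import MongoClient

    RAM_MB = 66 * 1024   # a 66 GB shard machine, as in the article

    shard = MongoClient("shard0.example.com", 27017)
    status = shard.admin.command("serverStatus")

    resident_mb = status["mem"]["resident"]
    page_faults = status.get("extra_info", {}).get("page_faults")

    print("resident memory: %d MB of %d MB" % (resident_mb, RAM_MB))
    if resident_mb > 0.8 * RAM_MB:   # arbitrary 80% alert threshold
        print("ALERT: shard memory running low; add capacity before it fills")
    if page_faults is not None:
        print("cumulative page faults: %d" % page_faults)   # watch for sustained growth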

We asked what changes in development priorities resulted from the experience, to which Merriman replied that they will be working on background compaction features sooner. In addition, Horowitz had noted that MongoDB should "degrade much more gracefully. We'll be working on these enhancements soon as well." Merriman noted that MongoDB will allow re-clustering objects to push inactive objects onto pages that stay on disk, and that they believe memory-mapped files work well for them. Horowitz elaborated:

The big issue really is concurrency.  The VM side works well, the problem is a read-write lock is too coarse. 1 thread that causes a fault can cause a bigger slowdown than it should be able to. We'll be addressing this in a few ways: making yielding more intelligent, real intra-collection concurrency.
