Presto: Facebook’s Distributed SQL Query Engine

In the fall of 2012, Facebook started the project that became Presto. The goal was to give warehouse users a way to perform ad hoc analysis across hundreds of petabytes of data. After rejecting a few external projects, they decided to create their own distributed query engine.

Presto’s interface is based on ANSI SQL. Most distributed query engines require users to learn a new syntax. Sometimes the syntax is SQL-like, but none of them are as well-known and well-documented as real SQL. Facebook expects this decision to make training new users easier and faster. The reliance on ANSI SQL should also allow Presto to work with existing third-party tools.
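Because the interface is plain ANSI SQL, a query submitted to Presto looks like SQL a warehouse user already knows, and the same text runs unchanged on other ANSI-compliant engines. As an illustrative sketch (table and column names are hypothetical), here is such a query executed against Python's built-in sqlite3 as a stand-in engine:

```python
import sqlite3

# Stand-in engine; the query itself is plain ANSI SQL and would look
# the same submitted to Presto. Table and column names are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE page_views (country TEXT, views INTEGER)")
conn.executemany("INSERT INTO page_views VALUES (?, ?)",
                 [("US", 3), ("US", 5), ("BR", 2)])

rows = conn.execute("""
    SELECT country, SUM(views) AS total_views
    FROM page_views
    GROUP BY country
    ORDER BY total_views DESC
""").fetchall()

print(rows)  # [('US', 8), ('BR', 2)]
```

No Presto-specific dialect is involved, which is what lets existing SQL-speaking tools connect without modification.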

Internally Presto is based on pipelines. After the query has been analyzed and tasks assigned to the appropriate nodes, the “client pulls data from output stage, which in turn pulls data from underlying stages.” Martin Traverso continues,

The execution model of Presto is fundamentally different from Hive/MapReduce. Hive translates queries into multiple stages of MapReduce tasks that execute one after another. Each task reads inputs from disk and writes intermediate output back to disk. In contrast, the Presto engine does not use MapReduce. It employs a custom query and execution engine with operators designed to support SQL semantics. In addition to improved scheduling, all processing is in memory and pipelined across the network between stages. This avoids unnecessary I/O and associated latency overhead. The pipelined execution model runs multiple stages at once, and streams data from one stage to the next as it becomes available. This significantly reduces end-to-end latency for many types of queries.
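The streaming model Traverso describes can be sketched with Python generators: each stage pulls rows from the stage below it and forwards them as soon as they are available, instead of materializing each stage's full output to disk the way chained MapReduce jobs do. This is a toy illustration of the pull-based pipeline idea, not Presto's actual operator code:

```python
# Toy pull-based pipeline: each "stage" is a generator that pulls from
# the stage below it, so rows stream through without being materialized.

def scan_stage(rows):
    # Leaf stage: reads source rows one at a time.
    for row in rows:
        yield row

def filter_stage(upstream, predicate):
    # Intermediate stage: pulls from upstream, emits matching rows.
    for row in upstream:
        if predicate(row):
            yield row

def project_stage(upstream, fn):
    # Output stage: the client pulls from here, which pulls from below.
    for row in upstream:
        yield fn(row)

source = [{"user": "a", "clicks": 12}, {"user": "b", "clicks": 3},
          {"user": "c", "clicks": 40}]

pipeline = project_stage(
    filter_stage(scan_stage(source), lambda r: r["clicks"] > 10),
    lambda r: r["user"])

# All three stages run interleaved as the "client" pulls results.
results = list(pipeline)
print(results)  # ['a', 'c']
```

The key property is that no stage waits for the previous one to finish: a row flows from scan to filter to projection the moment it is produced, which is what cuts end-to-end latency.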

Presto is written in Java with a pluggable backend. For each data source, such as Hive, HBase, or Scribe, a data connector is needed. The connector provides Presto with metadata, information on which nodes hold the data, and a way to actually fetch the data as a stream.
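Those three connector responsibilities — metadata, data location, and a row stream — could be sketched as an interface like the following. The names here are hypothetical; Presto's real connector SPI is in Java and considerably more involved:

```python
from abc import ABC, abstractmethod
from typing import Iterator, List, Tuple

class Connector(ABC):
    """Hypothetical sketch of a Presto-style pluggable connector."""

    @abstractmethod
    def get_metadata(self, table: str) -> List[str]:
        """Return the column names for a table."""

    @abstractmethod
    def get_splits(self, table: str) -> List[str]:
        """Return identifiers for the nodes/chunks holding the data."""

    @abstractmethod
    def read_split(self, table: str, split: str) -> Iterator[Tuple]:
        """Stream the rows for one split."""

class InMemoryConnector(Connector):
    # Toy backend over a dict, standing in for Hive, HBase, or Scribe.
    def __init__(self, tables):
        self.tables = tables

    def get_metadata(self, table):
        return self.tables[table]["schema"]

    def get_splits(self, table):
        return list(self.tables[table]["splits"])

    def read_split(self, table, split):
        yield from self.tables[table]["splits"][split]

conn = InMemoryConnector({
    "events": {
        "schema": ["user", "action"],
        "splits": {"node1": [("a", "click")], "node2": [("b", "view")]},
    }
})

rows = [row for split in conn.get_splits("events")
        for row in conn.read_split("events", split)]
print(rows)  # [('a', 'click'), ('b', 'view')]
```

The engine stays source-agnostic: it plans against the metadata, schedules work where the splits live, and pulls rows through the stream interface, so adding a new backend means implementing only this contract.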

According to Martin, Presto is outperforming Hive/MapReduce by a factor of 10 when it comes to latency and CPU efficiency for most queries at Facebook. But they aren’t done yet and have plans to further improve performance. One such plan involves designing a new data format that reduces the amount of transformations needed as the data moves from one stage to the next.

Facebook is also working on removing some of the limitations in the current design.

The main restrictions at this stage are a size limitation on the join tables and cardinality of unique keys/groups. The system also lacks the ability to write output data back to tables (currently query results are streamed to the client).

Presto is available on GitHub under the Apache 2 license.
