InfoQ

Google Introduces Binary Encoding Format: Protocol Buffers

Posted by Werner Schuster on Jul 21, 2008 06:10 PM

Community: Java, Architecture, .NET, Ruby, SOA
Topics: Performance & Scalability, Web Services
Tags: Distributed Programming, CORBA, XML Schema, Google

Google recently open sourced Protocol Buffers - a Data Interchange Format. Behind the somewhat nondescript name hide:
  • an IDL to describe data formats
  • a binary encoding scheme to encode formats described in the IDL
  • data binding support using code generators, with Google providing C++, Python, Java implementations

The IDL is used to describe data formats. Here is an example from the Protocol Buffers project page:
    message Person {
      required int32 id = 1;
      required string name = 2;
      optional string email = 3;
    }
The numbers ("tags") assigned to the fields need to be specified explicitly so that the formats can evolve. If they were assigned automatically, a change to the format - say inserting a new field - could shift the numbers of the fields after it and cause trouble. Why? Because in the binary format, the tag identifies which field (in the protocol description) a particular chunk of bytes belongs to. Together with the rule that unknown tags are ignored, explicitly assigned tag numbers make it possible to add new fields as the format evolves, yet retain compatibility.
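To make the role of the tags concrete, here is a minimal sketch in Python of a tag-prefixed encoding in the style of the Protocol Buffers wire format. It assumes only two wire types (varints for integers, length-delimited data for strings) and non-negative integers; the helper names and the extra `phone` field with tag 4 are hypothetical and are not part of Google's library.

```python
VARINT = 0            # wire type for integers
LENGTH_DELIMITED = 2  # wire type for strings and bytes


def encode_varint(value: int) -> bytes:
    """Encode a non-negative integer as a base-128 varint (7 bits per byte)."""
    if value < 0:
        raise ValueError("this sketch only handles non-negative integers")
    out = bytearray()
    while value > 0x7F:
        out.append((value & 0x7F) | 0x80)  # high bit set: more bytes follow
        value >>= 7
    out.append(value)
    return bytes(out)


def decode_varint(buf: bytes, pos: int):
    """Return (value, next_position) for the varint starting at buf[pos]."""
    result = shift = 0
    while True:
        byte = buf[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        if byte < 0x80:
            return result, pos
        shift += 7


def encode_field(tag: int, value) -> bytes:
    """Encode one field as key + payload, where key = (tag << 3) | wire_type."""
    if isinstance(value, int):
        return encode_varint((tag << 3) | VARINT) + encode_varint(value)
    data = value.encode("utf-8")
    return (encode_varint((tag << 3) | LENGTH_DELIMITED)
            + encode_varint(len(data)) + data)


def decode_message(buf: bytes, known_fields: dict) -> dict:
    """Decode fields listed in known_fields (tag -> name); skip all other tags."""
    result = {}
    pos = 0
    while pos < len(buf):
        key, pos = decode_varint(buf, pos)
        tag, wire_type = key >> 3, key & 0x07
        if wire_type == VARINT:
            value, pos = decode_varint(buf, pos)
        elif wire_type == LENGTH_DELIMITED:
            length, pos = decode_varint(buf, pos)
            value = buf[pos:pos + length].decode("utf-8")
            pos += length
        else:
            raise ValueError(f"wire type {wire_type} not handled in this sketch")
        if tag in known_fields:  # unknown tags are read past and ignored
            result[known_fields[tag]] = value
    return result


# A newer writer adds a hypothetical field "phone" with tag 4.
new_message = (encode_field(1, 42) + encode_field(2, "Ada")
               + encode_field(4, "555-0100"))

# An older reader only knows tags 1-3 and still decodes the message.
old_reader_fields = {1: "id", 2: "name", 3: "email"}
print(decode_message(new_message, old_reader_fields))
```

The older reader recovers `id` and `name` and silently skips tag 4, because every field carries its own tag and enough length information to be stepped over.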

The format descriptions are stored in .proto files and are compiled into source code before use. Google's release comes with support for C++, Python and Java. Support for other languages is also becoming available, e.g. Ruby, Erlang, Perl, Haskell, and others. Anyone interested in adding support for another language will appreciate the reverse-engineered grammar of the .proto files as EBNF.

Language support means that .proto files are turned into code in the target language, consisting of classes that map to the formats defined in the .proto files. With this, it's possible to get an object from a binary, modify the object's fields and serialize the state back to the binary format.
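The round trip described above might look roughly like the following. This is a hand-written Python stand-in for the kind of class a code generator could emit for the `Person` message, not the output of Google's compiler; the method names `to_bytes` and `from_bytes` and the varint helpers are assumptions made for illustration, and only non-negative ids are handled.

```python
from dataclasses import dataclass
from typing import Optional


def _put_varint(out: bytearray, value: int) -> None:
    """Append a non-negative integer to out as a base-128 varint."""
    while value > 0x7F:
        out.append((value & 0x7F) | 0x80)
        value >>= 7
    out.append(value)


def _get_varint(buf: bytes, pos: int):
    """Return (value, next_position) for the varint starting at buf[pos]."""
    result = shift = 0
    while True:
        byte = buf[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        if byte < 0x80:
            return result, pos
        shift += 7


@dataclass
class Person:
    """Hypothetical stand-in for a class generated from the Person message."""
    id: int = 0
    name: str = ""
    email: Optional[str] = None  # optional field: omitted when None

    def to_bytes(self) -> bytes:
        out = bytearray()
        _put_varint(out, (1 << 3) | 0)  # key for tag 1, varint wire type
        _put_varint(out, self.id)
        for tag, text in ((2, self.name), (3, self.email)):
            if text is None:
                continue
            data = text.encode("utf-8")
            _put_varint(out, (tag << 3) | 2)  # length-delimited wire type
            _put_varint(out, len(data))
            out += data
        return bytes(out)

    @classmethod
    def from_bytes(cls, buf: bytes) -> "Person":
        person = cls()
        pos = 0
        while pos < len(buf):
            key, pos = _get_varint(buf, pos)
            tag, wire_type = key >> 3, key & 0x07
            if wire_type == 0:
                value, pos = _get_varint(buf, pos)
                if tag == 1:
                    person.id = value
            elif wire_type == 2:
                length, pos = _get_varint(buf, pos)
                text = buf[pos:pos + length].decode("utf-8")
                pos += length
                if tag == 2:
                    person.name = text
                elif tag == 3:
                    person.email = text
            else:
                raise ValueError(f"wire type {wire_type} not handled here")
        return person


# Binary -> object -> modified object -> binary
original = Person(id=1, name="Ada").to_bytes()
person = Person.from_bytes(original)
person.email = "ada@example.com"
updated = person.to_bytes()
```

Application code only touches typed attributes such as `person.email`; the byte-level details stay inside the generated class.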

As is usual with new Google projects, the release of Protocol Buffers caused quite a stir, with a lot of blog posts devoted to it. The release post on Google's blog explained the reason for Protocol Buffers, and mentioned that XML would be very inefficient as an encoding format. This caused a storm of blog posts - either arguing that Protocol Buffers would mean the end of XML or arguing that Protocol Buffers were inferior to XML. Ted Neward gives an explanation of the situation, with this conclusion:
In the end, if you want an endpoint that is loosely coupled and offers the maximum flexibility, stick with XML, either wrapped in a SOAP envelope or in a RESTful envelope as dictated by the underlying transport (which means HTTP, since REST over anything else has never really been defined clearly by the Restafarians). If you need a binary format, then Protocol Buffers are certainly one answer... but so is ICE, or even CORBA (though this is fast losing its appeal thanks to the slow decline of the players in this space). Don't lose sight of the technical advantages or disadvantages of each of those solutions just because something has the Google name on it.
With all the comparisons to XML or JSON, it's easy to miss that Protocol Buffers are a reimplementation of existing technologies. Besides the ones already mentioned, a widely used competing technology is ASN.1, which seems to be somewhat obscure and little known despite being several decades old. This is peculiar considering even a small sample of the formats that are described in ASN.1:
  • X.509 certificates (used for PKI in many systems, including SSL)
  • LDAP
  • Cryptographic Message Syntax (CMS) for email cryptography
  • PKCS#1 for RSA keys
  • 3G phone networks
ASN.1 has many uses; for example, data encoded using ASN.1 is used by everyone using telecommunication nowadays. ASN.1 is based on concepts similar to those of Protocol Buffers - it uses an IDL to describe formats and a compiler to generate the necessary code for a target language. A key difference, however, is that ASN.1 has multiple encodings, which makes it possible to choose an encoding method suited to a given purpose. The list of encodings includes, e.g., the Canonical Encoding Rules (CER), which enforce strict rules for the encoding (crucial for anything concerning digital signatures, which react badly to subtle differences), the Packed Encoding Rules (PER), and more. The XML Encoding Rules (XER) allow the data to be encoded as XML - which basically makes ASN.1 an alternative to XML Schema. Fast Web Services is a technology that maps XML Schemas to ASN.1 and then uses ASN.1's more efficient encodings between endpoints that support them.
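The idea of one abstract description with several encodings can be sketched as follows. This Python example hand-encodes a small record as a DER-style tag-length-value sequence (DER is another canonical ASN.1 encoding, chosen here because it is simple to show; the article itself names CER, PER and XER) and also renders the same record as XML. The record layout, the function names and the XML element names are assumptions for illustration; the XML form is only in the spirit of XER, not a conforming XER encoder.

```python
from xml.sax.saxutils import escape

# Universal ASN.1 tag bytes used by this sketch
TAG_INTEGER = 0x02
TAG_UTF8STRING = 0x0C
TAG_SEQUENCE = 0x30


def der_length(n: int) -> bytes:
    """Encode a length: short form below 128, long form otherwise."""
    if n < 0x80:
        return bytes([n])
    body = n.to_bytes((n.bit_length() + 7) // 8, "big")
    return bytes([0x80 | len(body)]) + body


def der_tlv(tag: int, content: bytes) -> bytes:
    """Build one tag-length-value triple."""
    return bytes([tag]) + der_length(len(content)) + content


def der_integer(value: int) -> bytes:
    """Encode a non-negative INTEGER as minimal big-endian two's complement."""
    if value < 0:
        raise ValueError("this sketch only handles non-negative integers")
    # bit_length() // 8 + 1 bytes leaves room for the sign bit
    content = value.to_bytes(value.bit_length() // 8 + 1, "big")
    return der_tlv(TAG_INTEGER, content)


def der_utf8string(text: str) -> bytes:
    return der_tlv(TAG_UTF8STRING, text.encode("utf-8"))


def person_to_der(person: dict) -> bytes:
    """Binary encoding: SEQUENCE { id INTEGER, name UTF8String }."""
    content = der_integer(person["id"]) + der_utf8string(person["name"])
    return der_tlv(TAG_SEQUENCE, content)


def person_to_xml(person: dict) -> str:
    """Textual encoding of the same abstract value (XER-like, simplified)."""
    return (f"<Person><id>{person['id']}</id>"
            f"<name>{escape(person['name'])}</name></Person>")


person = {"id": 42, "name": "Ada"}
print(person_to_der(person).hex())
print(person_to_xml(person))
```

Both functions encode the same abstract value; which one to use is decided by the endpoints, not by the format description.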

Another technology similar to Google's Protocol Buffers is Facebook's Thrift, which works in a similar way (see side by side comparison of Protocol Buffers and Thrift). A less successful technology is Binary XML, which has been pondered in the XML scene for a very long time but hasn't really arrived yet. In response to questions about Protocol Buffers, Erlang's creator Joe Armstrong mentioned UBF as a binary format for programs that doesn't require parsing.

The common goal of these technologies is to improve efficiency. It's possible to argue that the amount of data sent over the wire doesn't matter because compression can help with data size. However, compression/decompression is an extra step that has to be performed after/before using the data - the actual parsing process still works on the larger amount of data. In the case of XML this means reading the same element tags over and over - compare this to the numeric tags of, say, Protocol Buffers. Of course, this improvement depends on the actual format. A format that consists mostly of strings will not benefit as much as a format made up mostly of numeric data.
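The argument about compression can be illustrated with a rough Python sketch. It builds the same numeric records as XML and as a simplified binary layout (a one-byte numeric tag followed by a fixed four-byte integer, which is an assumption made for brevity and not the Protocol Buffers wire format), then compresses the XML with zlib. The element names and the record contents are made up for the example.

```python
import struct
import zlib


def as_xml(points) -> bytes:
    """Encode (x, y) pairs as XML; every value is wrapped in element tags."""
    rows = "".join(f"<point><x>{x}</x><y>{y}</y></point>" for x, y in points)
    return f"<points>{rows}</points>".encode("utf-8")


def as_tagged_binary(points) -> bytes:
    """Encode (x, y) pairs as: tag byte 1, int32 x, tag byte 2, int32 y."""
    out = bytearray()
    for x, y in points:
        out += struct.pack("<BiBi", 1, x, 2, y)  # little-endian, no padding
    return bytes(out)


points = [(i, i * i) for i in range(1000)]

xml = as_xml(points)
binary = as_tagged_binary(points)

# Compression shrinks what travels over the wire...
wire = zlib.compress(xml)

# ...but the receiver has to inflate it again, so the XML parser still
# has to walk through every byte of the original document.
parser_input = zlib.decompress(wire)

print("XML bytes:          ", len(xml))
print("XML bytes on wire:  ", len(wire))
print("XML bytes to parse: ", len(parser_input))
print("tagged binary bytes:", len(binary))
```

With mostly numeric data like this, the binary form is smaller before any compression is applied; a record made up of long strings would show a much smaller difference.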

Mark Pilgrim also put together a list of reactions to Protocol Buffers. Another aspect of Protocol Buffers was mentioned in a comment by a Google employee on Steve Vinoski's blog, although it's supposedly in heavy use inside Google.

Have you been in a situation where you considered a binary format for efficiency reasons? If yes - did you roll your own or did you use an existing technology?

7 comments

Should have adopted SDO by mani doraisamy Posted Jul 22, 2008 4:30 AM
Two more !!! by siva prasanna kumar P Posted Jul 22, 2008 6:21 AM
All big shops have are sick with "Not invented here" by Slava Imeshev Posted Jul 22, 2008 7:30 PM
Re: All big shops have are sick with by Slava Imeshev Posted Jul 22, 2008 7:35 PM
Re: All big shops have are sick with by Nikita Ivanov Posted Jul 23, 2008 3:11 PM
Adobe's AMF by Jim Greer Posted Jul 23, 2008 8:12 AM
Performance doubts by Jimmy zhang Posted Aug 4, 2008 12:29 PM
  1. Should have adopted SDO

    Jul 22, 2008 4:30 AM by mani doraisamy

    With ChangeSummary and XML support, SDO should have been a better choice.

  2. Two more !!!

    Jul 22, 2008 6:21 AM by siva prasanna kumar P

    Already there seems to be a huge debate going on about JSON vs XML, two more (thrift and pb) have popped up.

    The three most important characteristics which are must for any good data format are data structure, data types and data constraints.

    According to me currently only XML has all the three. I am not aware of any other format which has all these characteristics and widely accepted.

  3. All big shops have are sick with "Not invented here"

    Jul 22, 2008 7:30 PM by Slava Imeshev

    http://www.omg.org/gettingstarted/omg_idl.htm http://hessian.caucho.com/

    Or, maybe, this a key to innovation? Reinvent 100 wheels and 101st will be another big thing?

    Regards,

    Slava Imeshev
    Cacheonix: Clustered Java Cache

  4. Re: All big shops have are sick with

    Jul 22, 2008 7:35 PM by Slava Imeshev

    You can also under-invent the wheel by providing a message editor that does not parse link breaks and links :)

    http://www.omg.org/gettingstarted/omg_idl.htm

    http://hessian.caucho.com


    Regards,

    Slava Imeshev
    Cacheonix: Clustered Java Cache

  5. Adobe's AMF

    Jul 23, 2008 8:12 AM by Jim Greer

    It would be interested (to me, anyway) to also see in this comparison Adobe's AMF, another binary message format that has also been open-sourced.

  6. Re: All big shops have are sick with

    Jul 23, 2008 3:11 PM by Nikita Ivanov

    Agree here w/Slava. WTF is wrong with Caucho if one needs this data portability? Or SDO? I don't understand... Can someone from Google team provide a sensible reasoning to use theirs vs. others?

    Thanks,
    Nikita Ivanov.
    GridGain - Grid Computing Made Simple

  7. Performance doubts

    Aug 4, 2008 12:29 PM by Jimmy zhang

    It is actually not a foregone conclusion that protocol buffer will outperform XML, see this article for further analysis http://soa.sys-con.com/node/250512
