Google Introduces Binary Encoding Format: Protocol Buffers

Google recently open sourced Protocol Buffers, a data interchange format. Behind the somewhat nondescript name are:
  • an IDL to describe data formats
  • a binary encoding scheme to encode formats described in the IDL
  • data binding support using code generators, with Google providing C++, Python, Java implementations

The IDL is used to describe data formats; here is an example from the Protocol Buffers project page:
message Person {
  required int32 id = 1;
  required string name = 2;
  optional string email = 3;
}
The numbers ("tags") assigned to the fields need to be specified explicitly to allow the formats to evolve. If they were assigned automatically, a change to the format - say, inserting a new field - would cause trouble. Why? Because in the binary format, tags identify which field (in the protocol description) a particular chunk of bytes belongs to. Together with the rule that unknown tags are ignored, explicitly assigned tag numbers make it possible to add new fields as the format evolves while retaining compatibility.
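To make the role of the tags concrete, here is a minimal Python sketch of how the Person message could be laid out on the wire, and how a reader skips tags it doesn't know. It is a hand-written illustration, not Google's generated code: the function names, the `OLD_SCHEMA` table and the extra field 4 are assumptions made up for this example, and only non-negative integers and strings are handled.

```python
VARINT, LENGTH_DELIMITED = 0, 2  # the two wire types this sketch handles


def encode_varint(n):
    """Encode a non-negative integer as a base-128 varint."""
    if n < 0:
        raise ValueError("sketch only handles non-negative integers")
    out = bytearray()
    while True:
        byte = n & 0x7F
        n >>= 7
        if n:
            out.append(byte | 0x80)  # high bit set: more bytes follow
        else:
            out.append(byte)
            return bytes(out)


def decode_varint(buf, pos):
    """Return (value, next_position) for the varint starting at pos."""
    result = shift = 0
    while True:
        byte = buf[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return result, pos
        shift += 7


def encode_field(tag, value):
    """Encode one field; the key combines the tag number and the wire type."""
    if isinstance(value, int):
        return encode_varint((tag << 3) | VARINT) + encode_varint(value)
    data = value.encode("utf-8")
    key = encode_varint((tag << 3) | LENGTH_DELIMITED)
    return key + encode_varint(len(data)) + data


def decode_message(buf, known_tags):
    """Decode fields listed in known_tags; silently skip every other tag."""
    fields, pos = {}, 0
    while pos < len(buf):
        key, pos = decode_varint(buf, pos)
        tag, wire_type = key >> 3, key & 7
        if wire_type == VARINT:
            value, pos = decode_varint(buf, pos)
        elif wire_type == LENGTH_DELIMITED:
            length, pos = decode_varint(buf, pos)
            value = buf[pos:pos + length].decode("utf-8")
            pos += length
        else:
            raise ValueError("wire type not covered by this sketch")
        if tag in known_tags:
            fields[known_tags[tag]] = value
    return fields


# A reader that only knows the three fields of the Person example.
OLD_SCHEMA = {1: "id", 2: "name", 3: "email"}

# A newer writer adds a hypothetical field with tag 4.
message = encode_field(1, 1234) + encode_field(2, "Ada") + encode_field(4, "555-0100")
decoded = decode_message(message, OLD_SCHEMA)
print(decoded)
```

The old reader still recovers `id` and `name` because it can step over field 4 using the wire type alone; had tag numbers been assigned by position, inserting a field would have silently shifted the meaning of every later one.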

To use the format descriptions, which are stored in .proto files, they are compiled into source code. Google's release comes with support for C++, Python and Java. Support for other languages is also becoming available, e.g. Ruby, Erlang, Perl, Haskell, and others. Anyone interested in adding support for another language will appreciate the reverse-engineered grammar of the .proto files as EBNF.

Language support means that .proto files are turned into code in the target language, consisting of classes that map to the formats defined in the .proto files. With this, it's possible to obtain an object from binary data, modify the object's fields and serialize the state back to the binary format.
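The parse, modify, serialize cycle can be sketched with a hand-written stand-in for such a class. This is only an illustration of the idea under stated assumptions: the `Person` class below and its `parse` and `serialize` methods are hypothetical names written for this example, not the output of Google's compiler, and unlike a full implementation this sketch drops fields it doesn't recognize.

```python
from dataclasses import dataclass
from typing import Optional


def _varint(n):
    """Encode a non-negative integer as a base-128 varint."""
    if n < 0:
        raise ValueError("sketch only handles non-negative integers")
    out = bytearray()
    while n > 0x7F:
        out.append((n & 0x7F) | 0x80)
        n >>= 7
    out.append(n)
    return bytes(out)


def _read_varint(buf, pos):
    """Return (value, next_position) for the varint starting at pos."""
    result = shift = 0
    while True:
        byte = buf[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return result, pos
        shift += 7


def _string_field(tag, text):
    data = text.encode("utf-8")
    return _varint((tag << 3) | 2) + _varint(len(data)) + data


@dataclass
class Person:
    """Hypothetical stand-in for a class generated from the Person message."""
    id: int = 0
    name: str = ""
    email: Optional[str] = None

    def serialize(self):
        out = _varint((1 << 3) | 0) + _varint(self.id)
        out += _string_field(2, self.name)
        if self.email is not None:  # optional field: written only when set
            out += _string_field(3, self.email)
        return out

    @classmethod
    def parse(cls, buf):
        person, pos = cls(), 0
        while pos < len(buf):
            key, pos = _read_varint(buf, pos)
            tag, wire_type = key >> 3, key & 7
            if wire_type == 0:
                value, pos = _read_varint(buf, pos)
                if tag == 1:
                    person.id = value
            elif wire_type == 2:
                length, pos = _read_varint(buf, pos)
                text = buf[pos:pos + length].decode("utf-8")
                pos += length
                if tag == 2:
                    person.name = text
                elif tag == 3:
                    person.email = text
            else:
                raise ValueError("wire type not covered by this sketch")
        return person


wire = Person(id=1234, name="Ada").serialize()  # binary data from some sender
person = Person.parse(wire)                     # binary -> object
person.email = "ada@example.com"                # modify a field
updated = person.serialize()                    # object -> binary
```

Application code only touches typed attributes such as `person.email`; the byte-level details stay inside the class, which is the point of letting a code generator produce it.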

As is usual with new Google projects, the release of Protocol Buffers caused quite a stir, with a lot of blog posts devoted to it. The release post on Google's blog explained the reason for Protocol Buffers, and mentioned that XML would be very inefficient as an encoding format. This caused a storm of blog posts - either arguing that Protocol Buffers would mean the end of XML or arguing that Protocol Buffers were inferior to XML. Ted Neward gives an explanation of the situation, with this conclusion:
In the end, if you want an endpoint that is loosely coupled and offers the maximum flexibility, stick with XML, either wrapped in a SOAP envelope or in a RESTful envelope as dictated by the underlying transport (which means HTTP, since REST over anything else has never really been defined clearly by the Restafarians). If you need a binary format, then Protocol Buffers are certainly one answer... but so is ICE, or even CORBA (though this is fast losing its appeal thanks to the slow decline of the players in this space). Don't lose sight of the technical advantages or disadvantages of each of those solutions just because something has the Google name on it.
With all the comparisons to XML or JSON, it's easy to miss that Protocol Buffers are a reimplementation of existing technologies. Next to the already mentioned ones, a widely used competing technology is ASN.1, which seems to be somewhat obscure and little known despite being several decades old. This is peculiar if you look at a small sample of the formats that are described in ASN.1:
  • X.509 certificates (used for PKI in many systems, including SSL)
  • LDAP
  • Cryptographic Message Syntax (CMS) for email cryptography
  • PKCS#1 for RSA keys
  • 3G phone networks
ASN.1 has many uses; for example, data encoded using ASN.1 is used by everyone who uses telecommunications nowadays. ASN.1 is based on concepts similar to those of Protocol Buffers: it uses an IDL to describe formats and a compiler to generate the necessary code for a target language. A key difference, however, is that ASN.1 has multiple encodings, which makes it possible to choose an encoding method to suit the purpose. The list of encodings includes, e.g., the Canonical Encoding Rules (CER), which enforce strict rules for the encoding - crucial for anything concerning digital signatures, which react badly to subtle differences - as well as the Packed Encoding Rules (PER) and more. The XML Encoding Rules (XER) allow the data to be encoded as XML, which basically makes ASN.1 an alternative to XML Schema. Fast Web Services is a technology that maps XML Schemas to ASN.1 and then uses ASN.1's more efficient encodings between endpoints that support them.
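For a feel of what an ASN.1 binary encoding looks like, the following Python sketch writes two primitive values in the tag-length-value layout of the Basic Encoding Rules family, of which the canonical rules are a restricted form. It is a simplified illustration, not a full encoder: the helper names are invented for this example, and it assumes non-negative integers and contents shorter than 128 bytes.

```python
INTEGER_TAG = 0x02      # universal tag for INTEGER
UTF8STRING_TAG = 0x0C   # universal tag for UTF8String


def tlv(tag, content):
    """Wrap content as tag, length, value (short-form length only)."""
    if len(content) > 127:
        raise ValueError("sketch only supports contents up to 127 bytes")
    return bytes([tag, len(content)]) + content


def encode_integer(n):
    """Encode a non-negative INTEGER in the fewest bytes that keep the sign bit clear."""
    if n < 0:
        raise ValueError("sketch only handles non-negative integers")
    size = n.bit_length() // 8 + 1  # leaves room for a leading zero sign bit
    return tlv(INTEGER_TAG, n.to_bytes(size, "big"))


def encode_utf8_string(text):
    return tlv(UTF8STRING_TAG, text.encode("utf-8"))


print(encode_integer(1234).hex())
print(encode_utf8_string("Ada").hex())
```

Note the different emphasis compared with Protocol Buffers: here the leading byte names the type of the value, whereas a Protocol Buffers key names the field number plus a coarse wire type.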

Another technology similar to Google's Protocol Buffers is Facebook's Thrift, which works in a similar way (see the side-by-side comparison of Protocol Buffers and Thrift). A less successful technology is Binary XML, which has been pondered in the XML scene for a very long time but hasn't really arrived yet. In response to questions about Protocol Buffers, Erlang's creator Joe Armstrong mentioned UBF as a binary format for programs that doesn't require parsing.

The common goal of these technologies is to improve efficiency. It's possible to argue that the amount of data sent over the wire doesn't matter because compression can reduce it. However, compression and decompression are extra steps that have to be performed after producing and before using the data, and the actual parsing process still works on the larger, uncompressed data. In the case of XML this means reading the same element tags over and over; compare this to the numeric tags of, say, Protocol Buffers. Of course, this improvement depends on the actual format: a format that consists mostly of strings will not benefit as much as a format made up mostly of numeric data.
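A rough way to see this argument is to measure it. The Python sketch below builds the same 1,000 records once as XML and once in a Protocol Buffers style binary layout, then reports raw and zlib-compressed sizes. The record contents, the XML element names and the length-prefixed framing are assumptions made up for this comparison, so the numbers only illustrate the trend and are not a benchmark.

```python
import zlib


def varint(n):
    """Encode a non-negative integer as a base-128 varint."""
    if n < 0:
        raise ValueError("sketch only handles non-negative integers")
    out = bytearray()
    while n > 0x7F:
        out.append((n & 0x7F) | 0x80)
        n >>= 7
    out.append(n)
    return bytes(out)


def to_binary(record_id, name):
    """One record: field 1 as varint, field 2 as string, with a length prefix."""
    data = name.encode("utf-8")
    body = varint((1 << 3) | 0) + varint(record_id)
    body += varint((2 << 3) | 2) + varint(len(data)) + data
    return varint(len(body)) + body  # prefix so records can be told apart


def to_xml(record_id, name):
    """The same record with element tags repeated for every record."""
    return f"<person><id>{record_id}</id><name>{name}</name></person>".encode("utf-8")


records = [(i, f"user{i}") for i in range(1, 1001)]
xml_bytes = b"".join(to_xml(i, n) for i, n in records)
binary_bytes = b"".join(to_binary(i, n) for i, n in records)

sizes = {
    "xml": len(xml_bytes),
    "xml compressed": len(zlib.compress(xml_bytes)),
    "binary": len(binary_bytes),
    "binary compressed": len(zlib.compress(binary_bytes)),
}
for label, size in sizes.items():
    print(f"{label:>18}: {size} bytes")
```

Compression narrows the gap on the wire, but a parser on the receiving side is handed the uncompressed bytes, so the "xml" and "binary" rows are the ones that matter for parsing cost.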

Mark Pilgrim also put together a list of reactions to Protocol Buffers. Another aspect of Protocol Buffers mentioned in a comment by a Google employee on Steve Vinoski's blog, although it's supposedly in heavy use inside Google.

Have you been in a situation where you considered a binary format for efficiency reasons? If yes - did you roll your own or did you use an existing technology?
