Google has open sourced Gumbo, an HTML parsing library written in C. Gumbo adheres to the HTML5 parsing algorithm, passing all html5lib-0.95 tests, and has been tested on 2.5 billion pages indexed by Google.
According to the project’s description page, the purpose of releasing Gumbo is to provide developers with a lightweight HTML parsing library that has no outside dependencies and can be called from the majority of languages. The library could be included in web page validators, static analyzers, templating languages, refactoring tools, etc.
Google describes Gumbo as “robust and resilient to bad input”, but it does not recommend holding on to pointers into some of its internal data structures, because the ABI is likely to change in the future. The API, however, is considered fairly stable; the team is waiting for feedback from users before releasing version 1.0, which is expected in the near future.
Some of the features to be added in the future are:
- Support for the recent HTML5 spec changes that add the template tag.
- Support for fragment parsing.
- Full-featured error reporting.
- Bindings in other languages.
Prior to the standardization of the HTML5 parsing algorithm, each browser chose how to tokenize input pages and how to render them. And while HTML 4 specified what constituted valid markup, there was no guidance on what a browser should do when the input was not valid, and 95% of the world’s web pages did not pass the W3C reference validator. Validating HTML pages with a tool built on Gumbo helps ensure that pages will be parsed and rendered properly in all major browsers.