When asked about whether Ruby is a difficult language to parse, Wang Haofei, who works on this Google SoC project and the XRuby team, explains:
Yes it is. There are lots of ambiguities in the language. For example, "<<" can be either left shift operator or start of heredoc. To distinguish these two the lexer has to maintain state (context dependent): http://seclib.blogspot.com/2005/11/distinguish-leftshift-and-heredoc.htmlWhen it comes to a difficult task like this, it's always good to ask just how far along it really is. Wang Huofei:
Others likes ID/function ambiguity, expression substitution inside string, heredoc etc are all very difficult to deal with.
Since the first public release it can parse the entire ruby standard libraries and and Ruby on Rails (have not try the latest one recently): http://seclib.blogspot.com/2006/02/first-release-of-rubyfront.htmlXue Yong Zhi is the mentor for this Google SoC project and also works on XRuby.
Xue has fixed a few bugs since then, but generally it is very stable. We will write and run more tests during the SOC project and it may help us to uncover some unknown issues.
One big part of the Google SoC project seems to be porting the existing parser to ANTLR's new version 3. Wang Huofei:
1. ANTLR v3 is a rewrite of v2 and has significantly enhanced parsing strength via LL(*) parsing, v2 is much weaker (limited LL(k)) and it forced me to add some hacks to workaround some problems. While ANTLR based parser is easier to maintain than others, migration to v3 will help us to make it better and cleaner.The second item in this list brings up an a very interesting point. Ruby lacks a Ruby parser written in Ruby. This is an issue for writing tools that handle Ruby code. Code analyzers, refactoring tools and automated refactorings, formatters and more, are difficult, if not impossible, to write in Ruby, because there is no way to parse Ruby source with Ruby code. There are workarounds, like Ryan Davis' ParseTree which uses the parser of Ruby's interpreter (via a native extension) to get at the Abstract Syntax Tree (AST) for Ruby source. An AST is a tree representation of source code, and is necessary for tools that need to know about the structure of the code. Yet, even ParseTree is not a complete solution, since it's current versions don't give the source locations of individual nodes. Obviously, a refactoring algorithm that, for instance, wants to rename an identifier in a Ruby source file needs to know where the identifier actually is.
2. There will be a ruby backend for ANTLR v3, so that we may be able to have ruby parser in ruby.
3. ANTLR v3 's performance is much better.
This issue has become ever more obvious in the past year, due to the arrival of various Ruby IDEs. These ship with code analyzers (to warn about potential errors in the code) and the Eclipse based RDT is the first IDE to feature extensive refactoring support for Ruby. Other features are support for popular Ruby based files, such as Rake files, Ruby's equivalent of make or Ant. The problem: these tools are all built with Java (or other languages) and Java IDEs all use JRuby's parser.
This means that the functionality of these tools is locked in those languages, and even worse, often tied to particular IDEs. For instance, the logic for RDT's refactoring support is not available for, say, Ruby in Steel, an IDE built on Visual Studio. This is different than, for instance, in the Java space, where parsers are available. Tools such as PMD or Findbugs are naturally written in Java and thus available wherever Java runs and, more importantly, can be extended with Java code.
Because the Google SoC description of this project isn't 100% clear on the Ruby based parser, Wang Huofei clarifies the project plan:
It depends on how well we are doing. We are willing to do that even if it may not fit in SOC schedule.Good news.
One necessity for making code tools is the AST, to analyze source. ParseTree, as already mentioned, offers a format to represent Ruby source code. Tools, based on ParseTree, already exist, such as Ruby2Ruby which can turn an AST back into Ruby source code; useful if a tool wants to modify an AST and then output it as Ruby source. Rubinius, a project to implement a Ruby VM in Ruby, also uses ParseTree output to compile Ruby to the Rubinius bytecode, which is then interpreted. When asked about the output of the parser, Wang Haofei explains:
ANTLR has its own builtin AST support, and it is very handy if you want to serialize to a string or change to other struture. It looks similar to ParseTree's output. In XRuby we turned the AST into a DOM-like structure and use visitor pattern to generate java bytecode.While ParseTree output doesn't seem to be planned, it's entirely possible to translate the ANTLR generated AST into the ParseTree format. A similar approach is already used by JParseTree, a port of ParseTree that works on JRuby, now a part of JRuby Extras which provides JRuby ports of popular Ruby libraries.
One way to watch the progress of XRuby and it's parser project is the XRuby team blog.