JRuby Wrapping For ANTLR

I use ANTLR, a fantastic parser generator, for various parsing tasks and for experimenting with DSLs[1]. ANTLR is one of the tools that makes it possible to “write code that writes code”. Previously I wrote about capturing Java’s output and error streams in JRuby, and I used this trick to write a wrapper for ANTLR which makes it very easy to unit test grammars as well as to integrate ANTLR parsers into a Ruby class. I call it AntlrVelvet.

AntlrVelvet is just a Ruby module, so you need to require the file and then include the module in a class:

require 'antlr_velvet'

class Expr
  include AntlrVelvet  
end

And… that’s it. We now have a parser! By default, AntlrVelvet assumes you have given your class the same name as you gave your parser, in this case Expr. (You can easily override this.) ExprLexer.class and ExprParser.class need to be on the Java CLASSPATH somewhere, but if they’re in the same directory as antlr_velvet.rb this will be taken care of automatically.
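The naming convention can be sketched in plain Ruby. This is only an illustration of the idea, not AntlrVelvet’s actual code: the module GrammarNaming and the methods grammar_name, lexer_class_name and parser_class_name are names I made up for the example.

```ruby
# Hypothetical sketch: derive the lexer and parser class names from the
# name of the class that includes the module.
module GrammarNaming
  def self.included(base)
    base.extend(ClassMethods)
  end

  module ClassMethods
    # By default the grammar is assumed to share the class's name.
    def grammar_name
      name.split("::").last
    end

    def lexer_class_name
      "#{grammar_name}Lexer"
    end

    def parser_class_name
      "#{grammar_name}Parser"
    end
  end
end

class Expr
  include GrammarNaming
end

# A class whose name differs from the grammar overrides the default.
class Calculator
  include GrammarNaming

  def self.grammar_name
    "Expr"
  end
end

puts Expr.lexer_class_name          # ExprLexer
puts Calculator.parser_class_name   # ExprParser
```

A singleton method defined directly on the class takes precedence over the one supplied by the extended module, which is what makes the override in Calculator work.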

Let’s take a quick look at Expr.g, the ANTLR grammar file, and see what super-powers we have given our Ruby class with these 4 lines of boilerplate.

// http://www.antlr.org/wiki/display/ANTLR3/Expression+evaluator
// [The "BSD licence"] Copyright (c) 2005-2008 Terence Parr All rights reserved.

grammar Expr;

@header {
import java.util.HashMap;
}

@members {
/** Map variable name to Integer object holding value */
HashMap memory = new HashMap();
}

prog:   stat+ ;
                
stat:   expr NEWLINE {System.out.println($expr.value);}
    |   ID '=' expr NEWLINE
        {memory.put($ID.text, new Integer($expr.value));}
    |   NEWLINE
    ;

expr returns [int value]
    :   e=multExpr {$value = $e.value;}
        (   '+' e=multExpr {$value += $e.value;}
        |   '-' e=multExpr {$value -= $e.value;}
        )*
    ;

multExpr returns [int value]
    :   e=atom {$value = $e.value;} ('*' e=atom {$value *= $e.value;})*
    ; 

atom returns [int value]
    :   INT {$value = Integer.parseInt($INT.text);}
    |   ID
        {
        Integer v = (Integer)memory.get($ID.text);
        if ( v!=null ) $value = v.intValue();
        else System.err.println("undefined variable "+$ID.text);
        }
    |   '(' expr ')' {$value = $expr.value;}
    ;

ID  :   ('a'..'z'|'A'..'Z')+ ;
INT :   '0'..'9'+ ;
NEWLINE:'\r'? '\n' ;
WS  :   (' '|'\t')+ {skip();} ;

Expr.g is taken directly from the ANTLR wiki. It is an expression evaluator: we can assign integer values to variables and perform simple arithmetic operations.

Let’s start feeding in some strings. If you have JRuby and ANTLR installed, you can download antlr_velvet_demo.rb etc. and play along.

AntlrVelvet has convenience methods for parsing files as well as strings. We’ll just parse strings. To initialize a new string parser, you do:

parser = Expr.string_parser("1 + 1")

parser is now all ready to parse, but the parsing hasn’t happened yet. Because AntlrVelvet is designed for unit testing, it doesn’t make any assumptions about how you want to parse. If you call parser.prog, it will parse according to the Expr grammar’s main prog rule. You might only want to test a part of your grammar, however, such as an atom or an expr. You can call any rule defined in your grammar, e.g. parser.atom, and the string you passed at initialization will be evaluated against that rule.

So, let’s try this now:

puts parser.prog

we get:

line 0:-1 missing NEWLINE at '<EOF>'

Oops. Okay, we tried matching "1 + 1" against the prog rule and we got an error, because the prog rule expects one or more stat rules, each of which is newline-terminated. If we add a newline at the end:

parser = Expr.string_parser("1 + 1\n")
puts parser.prog

we get:

nil

nil? Honestly, how hard is 1 + 1 = 2? Okay, looking back at Expr.g I see that prog doesn’t actually return a value. Let’s try an expr since according to the grammar that should return an int.

parser = Expr.string_parser("1 + 1")
puts parser.expr
2

2! Excellent. After all this we can add 1 and 1.

Now, let’s take another look at stat, since that really does seem to be where the main action is. It doesn’t return anything, but it does call System.out.println(). Remember, AntlrVelvet captures Java’s output and error streams so it can look for parsing errors and raise an exception if it sees one. If you want to get at the contents of the output stream, you do this via the output method.

parser = Expr.string_parser("1 + 1\n")
parser.stat
puts parser.output.inspect
["2"]

So, we have an array with a 2 in it. This is because stat prints out expr.value every time it gets called. Now, let’s try calling prog again with some additional input.

parser = Expr.string_parser("
1 + 1
x = 1
y = 2
3*(x+y)
")
parser.prog
puts parser.output.inspect
["2", "9"]

The first “2” comes from 1 + 1. The lines x = 1 and y = 2 are of the form ID = expr NEWLINE. They are variable assignments rather than expr evaluations, so nothing is added to the output buffer; instead, the parser executes the memory.put ... action. The final line is an expr which evaluates to 9, so this is added to the output buffer. Note the extra newline at the end of our input: if this wasn’t there we would end up with an EOF error. (Of course you can easily write a grammar which doesn’t need a newline at the end if you want to.)
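The capture-then-check pattern behind output can be approximated in plain Ruby. This is a sketch of the pattern only: it swaps Ruby’s $stdout and $stderr, whereas AntlrVelvet has to redirect Java’s System.out and System.err, which are separate streams under JRuby. The names capture_streams and SketchError are invented for the example.

```ruby
require "stringio"

# Stand-in error class for this sketch.
class SketchError < StandardError; end

# Runs the block with $stdout and $stderr swapped for in-memory buffers.
# Returns the captured stdout lines as an array (nil if nothing was
# printed) and raises SketchError if anything was written to $stderr.
def capture_streams
  old_out, old_err = $stdout, $stderr
  out, err = StringIO.new, StringIO.new
  $stdout, $stderr = out, err
  begin
    yield
  ensure
    # Always restore the real streams, even if the block raises.
    $stdout, $stderr = old_out, old_err
  end
  raise SketchError, err.string.strip unless err.string.empty?
  lines = out.string.split("\n")
  lines.empty? ? nil : lines
end

p capture_streams { puts 1 + 1 }   # ["2"]

begin
  capture_streams { warn "line 0:-1 missing NEWLINE at '<EOF>'" }
rescue SketchError => e
  puts "parse error: #{e.message}"
end
```

Treating anything on the error stream as a failure is a blunt rule, but it suits a parser whose only error reporting is a message on that stream.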

Let’s do one more example.

puts Expr.string_parser("1 + 2 * (3 + 4 * (5 + 6 * (7 + 8 * 9)))").expr
3839

There’s some nice mutual recursion in Expr.g that lets it handle nested parentheses like this: expr calls multExpr, multExpr calls atom, and atom calls expr again for anything in parentheses. Go check it out if you haven’t seen it before.
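To see the shape of that recursion without reading generated Java, here is a hand-written Ruby sketch of the same three rules. It is not what ANTLR generates, and it leaves out variables and newlines; the class name TinyExpr and its helper methods are made up for the illustration.

```ruby
# Hand-written sketch of the expr -> multExpr -> atom -> expr recursion.
class TinyExpr
  def initialize(text)
    # Crude tokenizer: integers and the operators + - * ( ).
    # Anything else (including whitespace) is silently ignored.
    @tokens = text.scan(/\d+|[-+*()]/)
    @pos = 0
  end

  # expr : multExpr (('+' | '-') multExpr)*
  def expr
    value = mult_expr
    while ["+", "-"].include?(peek)
      op = advance
      rhs = mult_expr
      value = op == "+" ? value + rhs : value - rhs
    end
    value
  end

  # multExpr : atom ('*' atom)*
  def mult_expr
    value = atom
    while peek == "*"
      advance
      value *= atom
    end
    value
  end

  # atom : INT | '(' expr ')'
  def atom
    token = advance
    if token == "("
      value = expr   # back to the top rule: this closes the recursion
      raise ArgumentError, "expected ')'" unless advance == ")"
      value
    elsif token.to_s.match?(/\A\d+\z/)
      token.to_i
    else
      raise ArgumentError, "unexpected token #{token.inspect}"
    end
  end

  private

  def peek
    @tokens[@pos]
  end

  def advance
    token = @tokens[@pos]
    @pos += 1
    token
  end
end

puts TinyExpr.new("1 + 2 * (3 + 4 * (5 + 6 * (7 + 8 * 9)))").expr   # 3839
```

Operator precedence falls out of the call order: expr only ever sees values that mult_expr has already multiplied out.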

Now, Expr.g doesn’t do anything mathematically that we couldn’t do in Ruby. What it does do is let us accept a string and only perform the calculation if the string conforms to our grammar. So if, say, you were writing a web application and wanted to allow your users to perform arithmetic on your website, but for some reason you didn’t feel like doing this via[2] <%= eval(params[:arithmetic_expression]) %>, something like an ANTLR grammar might let you nicely sandbox and sanitize your users’ input. And Expr.g might not be able to do anything all that fancy, but a more complex grammar might give you (and your untrustworthy users) some really interesting functionality.

So, to recap, Expr.string_parser returns an object of class Expr, a Ruby class. We can call any of the grammar’s rules as a method on that object and they will parse the string according to that rule, and the method will return a value if the rule returns a value.

After we have done the parsing, we can call parser.output to see what’s in the output buffer. This will either be an array of strings or nil. We can also call parser.parser or parser.lexer to return the ExprParser or ExprLexer, the actual ANTLR objects which do the parsing. Remember Expr is just our wrapper class. We can even call parser.token_stream. And, since this is JRuby, we can have some introspection fun with these objects:

puts java_methods(parser.parser.methods)
already_parsed_rule
atom
backtracking_level
begin_resync
consume_until
display_recognition_error
emit_error_message
end_resync
equals
expr
grammar_file_name
hash_code
input
java_class
java_object
match
match_any
memoize
mismatch_is_missing_token
mismatch_is_unwanted_token
mult_expr
notify
notify_all
number_of_syntax_errors
prog
recover
recover_from_mismatched_set
report_error
reset
rule_invocation_stack
rule_memoization_cache_size
source_name
stat
synchronized
to_java_object
to_string
to_strings
token_names
token_stream
trace_in
trace_out
wait
puts java_methods(parser.token_stream.methods)
consume
discard_off_channel_tokens
discard_token_type
equals
get
hash_code
index
java_class
java_object
la
lt
mark
notify
notify_all
release
reset
rewind
seek
size
source_name
synchronized
to_java_object
to_string
token_source
tokens
wait
puts java_methods(parser.lexer.methods)
already_parsed_rule
backtracking_level
begin_resync
char_error_display
char_index
char_position_in_line
char_stream
consume_until
display_recognition_error
emit
emit_error_message
end_resync
equals
grammar_file_name
hash_code
java_class
java_object
line
m_id
m_int
m_newline
m_t__10
m_t__11
m_t__12
m_t__13
m_t__8
m_t__9
m_tokens
m_ws
match
match_any
match_range
memoize
mismatch_is_missing_token
mismatch_is_unwanted_token
next_token
notify
notify_all
number_of_syntax_errors
recover
recover_from_mismatched_set
report_error
reset
rule_invocation_stack
rule_memoization_cache_size
skip
source_name
synchronized
text
to_java_object
to_string
to_strings
token_names
trace_in
trace_out
wait

So, here we have the internals of ANTLR laid out. You can call any of these methods and look at what they return. Play with them. Use them for testing. Go into an interactive jirb session and follow the lexing and parsing step-by-step until you know exactly how the grammar does its thing.

By the way, AntlrVelvet uses method_missing to pass method calls like prog and stat on to the embedded ExprParser object, but you could also explicitly define any of these methods and call the ExprParser yourself. This can be useful if you want to do some Ruby post-processing or perhaps debugging. If you do this, you will need to invoke AntlrVelvet’s capture_java_streams yourself if you don’t want Java writing to your console. And remember, in JRuby Java’s System.out and Ruby’s $stdout are totally independent of each other. So if you have Ruby’s $stdout redirected to a logfile, Java’s System.out will still write to the console unless it, too, has been redirected.

class Expr
  def atom
    puts "Your atom is " + @parser.atom.to_s
  end
end

Expr.string_parser("99").atom
Your atom is 99
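If you haven’t used method_missing for delegation before, the general technique looks roughly like this. It is a generic sketch, not AntlrVelvet’s source: FakeParser stands in for the generated Java parser, and Wrapper is a made-up name.

```ruby
# Stand-in for a generated parser; the real one would be a Java object.
class FakeParser
  def atom
    99
  end
end

class Wrapper
  def initialize(parser)
    @parser = parser
  end

  # Forward any call we don't define ourselves to the wrapped parser,
  # as long as the parser actually has a public method of that name.
  def method_missing(name, *args, &block)
    if @parser.respond_to?(name)
      @parser.public_send(name, *args, &block)
    else
      super
    end
  end

  # Keep respond_to? consistent with what method_missing will accept.
  def respond_to_missing?(name, include_private = false)
    @parser.respond_to?(name, include_private) || super
  end
end

wrapper = Wrapper.new(FakeParser.new)
puts wrapper.atom                 # 99
puts wrapper.respond_to?(:atom)   # true
```

A method defined explicitly on the wrapper, like the atom override above, is found by normal method lookup first, so method_missing never sees the call.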

On to testing! I’m just using Ruby’s Test::Unit for now. To begin with, I define some custom assertions and convenience methods:

  def assert_nothing_left(parser)
    assert(parser.token_stream.index == parser.token_stream.size)
  end

  def assert_something_left(parser)
    assert(parser.token_stream.index < parser.token_stream.size)
  end
  
  def assert_valid_input(method, input_string)
    assert_nothing_raised do
      parser = Expr.string_parser(input_string)
      parser.send(method)
      assert_nothing_left(parser)
    end
  end
  
  def assert_too_much_input(method, input_string)
    parser = Expr.string_parser(input_string)
    parser.send(method)
    assert_something_left(parser)
  end
  
  def assert_invalid_input(method, input_string)
    assert_raise(AntlrError) { Expr.string_parser(input_string).send(method) }
  end
  
  def parsed_value(method, input_string)
    Expr.string_parser(input_string).send(method)
  end
  
  def parser_output(method, input_string)
    parser = Expr.string_parser(input_string)
    parser.send(method)
    parser.output
  end

To recap the examples from earlier in test form:

  
  def test_prog_without_newline
    assert_invalid_input(:prog, "1 + 1")
  end

  def test_prog_with_newline
    assert_valid_input(:prog, "1 + 1\n")
  end
  
  def test_expr
    assert_equal 2, parsed_value(:expr, "1 + 1")
  end

  def test_stat_output
    assert_equal ["2"], parser_output(:stat, "1 + 1\n")
  end
  
  def test_prog_output
    str = "
     1 + 1
     x = 1
     y = 2
     3*(x+y)
     "

    assert_equal ["2", "9"], parser_output(:prog, str)
  end

  def test_nasty_expr
    assert_equal 3839, parsed_value(:expr, "1 + 2 * (3 + 4 * (5 + 6 * (7 + 8 * 9)))")
  end

assert_valid_input, assert_invalid_input and assert_too_much_input each take a symbol naming the rule and an input string for the parser. assert_invalid_input expects the parsing to raise an AntlrError, meaning that ANTLR wrote something to Java’s System.err. Use it when you want to make sure that the input string is illegal for the rule in question, whether because it is too short or because it contains illegal characters or sequences. assert_too_much_input means that the input is acceptable, but after processing there’s something left over. In order for assert_valid_input to pass, ANTLR must not find any errors and the input string must be entirely consumed.

parsed_value and parser_output also take a symbol and an input string as arguments. They parse the string and return the grammar rule’s return value or the text written to System.out respectively. Then you can write assertions for what the value or output should be, as shown. assert_nothing_left and assert_something_left are mostly for internal use, since you have to initialize and run a parser yourself before passing it in to these functions, but of course you can call them directly if you want to.

I use ANTLR’s Java target because, as of this writing, the Ruby target for ANTLRv3 isn’t complete enough to be practical. Even though I’m not a fan of Java, I have to say that I really don’t mind the little bit of it I have to write in an ANTLR grammar. If you are a Ruby programmer, don’t let a bit of Java scare you away from the possibilities that ANTLR opens up. And, if you are a Java programmer, I hope this has demonstrated that JRuby can be a useful complement to Java, giving you a fantastic scripting language and testing environment for development, even if it doesn’t play any role in your final product.

As usual, all the source code is available for download from the sidebar. If you have any difficulty running any of the files, please let me know in the comments.

[1] However, if you want to write a syndicated advice column protocol then you might want a more state-machine-oriented tool like Ragel.

[2] Joke!! I’m KIDDING people!! You really think I would put something so unsafe in my code? Of course I meant <%=h eval(params[:arithmetic_expression]) %>.