Skip to content

Calculator Tutorial Part 2: Basic parser

farcaller edited this page Nov 18, 2012 · 1 revision

This is the lexer we ended up in part 1 of the tutorial:

class CalcLex < Rly::Lex
  literals '+-*/'
  ignore " \t\n"
  token :NUMBER, /\d+/ do |t|
    t.value = t.value.to_i
    t
  end

  on_error do |t|
    puts "Illegal character #{t.value}"
    t.lexer.pos += 1
    nil
  end
end

Now it's time to write some parsing rules. It's done by subclassing Rly::Yacc and defining a set of rules:

class CalcParse < Rly::Yacc
  rule 'statement : expression' do |st, e|
    st.value = e.value
  end

  rule 'expression : expression "+" expression
                   | expression "-" expression
                   | expression "*" expression
                   | expression "/" expression' do |ex, e1, op, e2|
    ex.value = e1.value.send(op.value, e2.value)
  end

  rule 'expression : NUMBER' do |ex, n|
    ex.value = n.value
  end
end

Rules are defined in a custom language where production result is defined before double colon sign, and source productions are defined next. You can also collapse several rules that have same result using a pipe ('|').

Each time you hit a rule while parsing, the block receives arguments based on your rule. First argument is always a returning value handler, others are mapped to whatever you have on the right side of the double colon.

By a convention, rules are named in all lowercase. You can reference lexer tokens by upper-case names and lexer literals by a one-character literal string.

Leetspeek note: actually, rules are "nonterminals" and tokens are "terminals".

For each rule you are supposed to set the value of first argument based on your processing logic. E.g.:

rule 'expression : NUMBER' do |ex, n|
  ex.value = n.value
end

this rule takes value of n (which is LexToken of type :NUMBER at this point) and writes it back to YaccSymbol named expression.

The first rule is the start one and it's the rule on which the parsing starts. Now, let's do some math:

parser = CalcParse.new(CalcLex.new) # => #<CalcParse ...>
parser.parse('2+2') # => 4

Doesn't it look awesome?

Now, a few notes here. You need to pass in the lexer instance that would be used to process input string into tokens. Also, the return value of parse method is actually a value of the most top-level YaccSymbol object (the one in 'statement' rule).

Let's now try something a bit complex:

parser.parse('2*2+2') # => 8

Whoops, math is broken (you should totally expect 6 at this point). Let's enable debug mode and look into the problem:

parser.parse('2*2+2', true)
State  : 0
Stack  : #<LexToken NUMBER '2'>
Action : Shift and goto state 3

State  : 3
Stack  : NUMBER #<LexToken * '*'>
Action : Reduce rule [expression -> NUMBER] with [2] and goto state 6
Result : 2

State  : 2
Stack  : expression #<LexToken * '*'>
Action : Shift and goto state 6

State  : 6
Stack  : expression * #<LexToken NUMBER '2'>
Action : Shift and goto state 3

State  : 3
Stack  : expression * NUMBER #<LexToken + '+'>
Action : Reduce rule [expression -> NUMBER] with [2] and goto state 6
Result : 2

State  : 10
Stack  : expression * expression #<LexToken + '+'>
Action : Shift and goto state 4

State  : 4
Stack  : expression * expression + #<LexToken NUMBER '2'>
Action : Shift and goto state 3

State  : 3
Stack  : expression * expression + NUMBER #<YaccSymbol $end ''>
Action : Reduce rule [expression -> NUMBER] with [2] and goto state 6
Result : 2

State  : 8
Stack  : expression * expression + expression #<YaccSymbol $end ''>
Action : Reduce rule [expression -> expression + expression] with [2, +, 2] and goto state 2
Result : 4

State  : 10
Stack  : expression * expression #<YaccSymbol $end ''>
Action : Reduce rule [expression -> expression * expression] with [2, *, 4] and goto state 4
Result : 8

State  : 2
Stack  : expression #<YaccSymbol $end ''>
Action : Reduce rule [statement -> expression] with [8] and goto state 1
Result : 8

State  : 1
Stack  : statement #<YaccSymbol $end ''>
Done   : Returning 8

You see the internals of rly's state machine here (if you wonder about states, check this mini-tutorial).

What's interesting for us is the reduce rule Reduce rule [expression -> expression + expression] with [2, +, 2] and goto state 2 happening before Reduce rule [expression -> expression * expression] with [2, *, 4] and goto state 4. This is due to the fact, that by default all rules have right precedence.

We need to fix this by specifying precedence rules manually:

precedence :left,  '+', '-'
precedence :left,  '*', '/'

Earlier defined precedence has lesser importance. Also note on the :left usage (precedence can be :left, :right or :nonassoc for non-associative).

With this fix everything seems to be good now:

parser.parse('2*2+2') # => 6