Home
Neotoma is a packrat parser-generator for Erlang for Parsing Expression Grammars (PEGs). It consists of a parsing-combinator library with memoization routines, a parser for PEGs, and a utility to generate parsers from PEGs. It is inspired by treetop, a Ruby library with similar aims, and parsec, the parser-combinator library for Haskell.
- Clone the repository:
$ git clone git://github.com/seancribbs/neotoma.git
- Build the library:
$ cd neotoma
$ make
- Symlink or copy the neotoma application into your lib path (if you configured erlang with --prefix=/usr/local, for example):
$ ln -s neotoma /usr/local/lib/erlang/lib/neotoma
- Start the Erlang shell and generate your parser:
$ erl
1> neotoma:file("mygrammar.peg").
ok
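Once the parser has been generated, a session along these lines should exercise it. This is only a sketch: it assumes the generated module is written next to the grammar as mygrammar.erl and exports parse/1 for strings and file/1 for files, and the input and file name shown are placeholders.

```erlang
%% Assumption: neotoma:file/1 wrote mygrammar.erl into the current directory.
2> c(mygrammar).
%% Assumption: the generated module exports parse/1; the input is a placeholder.
3> mygrammar:parse("some input text").
%% Assumption: file/1 reads the named file and parses its contents.
4> mygrammar:file("input.txt").
```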
Neotoma’s PEG grammars are based on the grammars from Bryan Ford’s thesis with some influences from Treetop. The basic format is thus:
nonterminal <- parsing_expression;
Where parsing_expression is any combination of nonterminals, terminals and sub-expressions (e, e1, e2 are parsing expressions) as described below:
Non-terminal symbol | some_nonterminal | All nonterminals on the RHS must have a corresponding rule/reduction. |
String | "Hello, world" | single- or double-quoted, quotes escaped with \\ |
Character class | [a-zA-Z0-9] | just as in the re module |
Any single character | . | |
Sequence | e1 e2 | |
Ordered choice | e1 / e2 | |
Grouping | (e) | |
Zero-width positive lookahead | &e | |
Zero-width negative lookahead | !e | |
Optional (zero-or-more) repetition | e* | |
Mandatory (one-or-more) repetition | e+ | |
Optional expression | e? | |
Label | name:e | Helps in extracting sub-expressions, creates {name, SubTree} tuples in the AST. |
Currently all reductions must end with a semi-colon (;). The first rule/reduction in your grammar will be considered the root of the parse-tree.
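As an illustration of the notation, here is a small grammar for sums and products of integers. The rule names are chosen for this example only. Because additive comes first, it is the root of the parse-tree.

```
additive  <- multitive "+" additive / multitive;
multitive <- primary "*" multitive / primary;
primary   <- "(" additive ")" / decimal;
decimal   <- [0-9]+;
```

Each rule combines sequence, ordered choice, strings and a character class from the table above, and every nonterminal used on the right-hand side has its own rule.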
Without specifying any transformations, Neotoma will return a nested list of the results of its parse, essentially an S-expression. In this form, the AST is not very useful; one needs to transform and annotate the tree into a useful data structure. Neotoma provides hooks into the parsing process in the form of the transform/3 function (or the inline code blocks). Once you have generated your parser, you can edit this function in the generated file. The prototype is thus:
transform('nonterminal', Node, Index)
- nonterminal is the nonterminal that was successfully parsed.
- Node is a list of the results from sub-expressions, which may be raw terminals or the transformations of other nonterminals.
- Index is a tuple representing the position of the parser at the start of this expression, in the form {{line, L}, {column, C}} where L and C are both integers.
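To make the shape of the three arguments concrete, here is a minimal sketch of a transform/3 that tags every node with the nonterminal that produced it and its starting position. The module name position_demo and the shape of the returned tuple are choices made for this example, not something Neotoma prescribes.

```erlang
-module(position_demo).
-export([transform/3]).

%% Wrap each parsed node in a tuple recording which nonterminal matched
%% and the line/column at which the match started.
transform(Symbol, Node, {{line, Line}, {column, Col}}) ->
    {Symbol, {Line, Col}, Node}.
```

Calling position_demo:transform(decimal, ["4", "2"], {{line, 1}, {column, 5}}) by hand returns {decimal, {1, 5}, ["4", "2"]}.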
While editing this within the generated parser is easy, Neotoma will overwrite your changes if you regenerate the parser. Therefore, I recommend that you specify an external module in which to do your transformations (or use inline blocks, as described below). Doing so will allow you to develop your grammar and transformations independently, without the parser-generator overwriting your transformations. You can do this by specifying the transform_module option to neotoma:file/2. The module will be generated for you if it does not exist already. An example:
1> neotoma:file("mygrammar.peg", [{transform_module, myast}]).
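What such an external module might contain is sketched below. It assumes a small arithmetic grammar, repeated in the comments, and evaluates the expression while parsing. The grammar and the clause bodies are illustrative only. Digits are converted via iolist_to_binary/1, so the decimal clause does not depend on whether terminals arrive as strings or binaries.

```erlang
-module(myast).
-export([transform/3]).

%% Hypothetical grammar this module is written against:
%%   additive  <- multitive "+" additive / multitive;
%%   multitive <- primary "*" multitive / primary;
%%   primary   <- "(" additive ")" / decimal;
%%   decimal   <- [0-9]+;

%% decimal: Node is the list of matched digits; join them into an integer.
transform(decimal, Node, _Index) ->
    binary_to_integer(iolist_to_binary(Node));
%% additive, first alternative: [Left, "+", Right] with both sides evaluated.
transform(additive, [Left, _Plus, Right], _Index) ->
    Left + Right;
%% multitive, first alternative: [Left, "*", Right].
transform(multitive, [Left, _Times, Right], _Index) ->
    Left * Right;
%% primary, first alternative: drop the parentheses, keep the inner value.
transform(primary, [_Open, Value, _Close], _Index) ->
    Value;
%% Everything else (the single-item alternatives) passes through unchanged.
transform(_Symbol, Node, _Index) ->
    Node.
```

The catch-all clause at the end keeps the module usable while the grammar is still changing, since nonterminals without a dedicated clause simply return their node untouched.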
As of version 1.3, Neotoma allows code inline with your grammar for AST transformation and additional support functions. Reductions may be optionally followed by a code block that is enclosed in backticks (`), or a single tilde (~). The code block will become the body of the transformation function. The ~ will create an identity transformation, equivalent to `Node`. Example from the JSON parser:
number <- int frac? exp?
`
case Node of
  [Int, [], []] -> list_to_integer(lists:flatten([Int]));
  [Int, Frac, []] -> list_to_float(lists:flatten([Int, Frac]));
  [Int, [], Exp] -> list_to_float(lists:flatten([Int, ".0", Exp]));
  _ -> list_to_float(lists:flatten(Node))
end
`;
The Node and Idx variables are available to your code block.
To add additional support functions, just put another backtick-delimited block at the bottom of the grammar. All code will be added verbatim to the generated parser.
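A minimal sketch of how the two kinds of block might sit together in one grammar file follows. The rule and the helper name to_int are made up for this example, and the helper assumes terminals arrive as strings, as in the JSON example.

```
decimal <- [0-9]+ `to_int(Node)`;

`
%% Hypothetical helper: flatten the matched digits and convert to an integer.
to_int(Digits) -> list_to_integer(lists:flatten(Digits)).
`
```

The first block becomes the body of the transformation for decimal, while the trailing block is copied verbatim into the generated parser, so to_int/1 is available to every inline block in the grammar.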
- Support for parsing in binary form/UTF.
- Support for LFE and Reia.