Skip to content

vnmakarov/yaep

Repository files navigation

YAEP -- standalone Earley parser library

  • YAEP is an abbreviation of Yet Another Earley Parser.
  • This standalone library is created for convenience.
  • The parser development is actually done as a part of the Dino language project.
  • YAEP is licensed under the MIT license.

YAEP features:

  • It is sufficiently fast and does not require much memory. This is the fastest implementation of the Earley parser which I know of. If you know a faster one, please send me a message. It can parse 300K lines of C program per second on modern computers and allocates about 5MB memory for 10K line C program.
  • YAEP does simple syntax directed translation, producing an abstract syntax tree as its output.
  • It can parse input described by an ambiguous grammar. In this case the parse result can be a single abstract tree or all possible abstract trees. YAEP produces a compact representation of all possible parse trees by using DAG instead of real trees.
  • YAEP can parse input described by an ambiguous grammar according to abstract node costs. In this case the parse result can be a minimal cost abstract tree or all possible minimal cost abstract trees. This feature can be used to code selection task in compilers.
  • It can perform syntax error recovery. Moreover its error recovery algorithm finds error recovery with a minimal number of ignored tokens. This permits implementing parsers with very good error recovery and reporting.
  • It has fast startup. There is only a tiny and insignificant delay between processing grammar and the start of parsing.
  • A grammar for YAEP can be constructed through function calls or using a YACC-like description syntax.

Usage example:

  • The following is a small example of how to use YAEP to parse expressions. We have omitted the functions read_token, syntax_error_func, and parse_alloc_func which are needed to provide tokens, print syntax error messages, and allocate memory for the parser.
static const char *description =
"\n"
"TERM NUMBER;\n"
"E : T         # 0\n"
"  | E '+' T   # plus (0 2)\n"
"  ;\n"
"T : F         # 0\n"
"  | T '*' F   # mult (0 2)\n"
"  ;\n"
"F : NUMBER    # 0\n"
"  | '(' E ')' # 1\n"
"  ;\n"
  ;

static void parse (void)
{
  struct grammar *g;
  struct earley_tree_node *root;
  int ambiguous_p;

  if ((g = earley_create_grammar ()) == NULL) {
      fprintf (stderr, "earley_create_grammar: No memory\n");
      exit (1);
  }
  if (earley_parse_grammar (g, TRUE, description) != 0) {
      fprintf (stderr, "%s\n", earley_error_message (g));
      exit (1);
    }
  if (earley_parse (g, read_token_func, syntax_error_func, parse_alloc_func,
                    NULL, &root, &ambiguous_p))
    fprintf (stderr, "earley_parse: %s\n", earley_error_message (g));
  earley_free_grammar (g);
}
  • To add error recovery, just add a reserved symbol error to the rules. Skipped terminals during error recovery will be represented in the resulting abstract tree by a node called error. For example, if you want to include expression- and statement-level error-recovery in a programming language grammar, the rules could look like the following:
  stmt : IF '(' expr ')' stmt ELSE stmt # if (2 4 6)
       | ...
       | error # 0
       ;
  expr : IDENT # 0
       | ...
       | error # 0
       ;
  • For more details, please see the documentation in directory src/, or the YAEP examples in files test*.c in directories test/C or test/C++.

Installing:

  • mkdir build
  • cd build
  • <srcdir>/configure --srcdir=<srcdir> --prefix=<prefix for install dirs> or cmake -DCMAKE_BUILD_TYPE=Release (make sure you have CMake installed)
  • make
  • make test (optional)
  • make install

Speed comparison of YACC, MARPA, YAEP, and GCC parsers:

  • Tested parsers:

    • YACC 1.9 from Linux Fedora Core 21.
    • MARPA C Library, version 8.3.0. A popular Earley parser implementation using the Practical Earley Parser algorithm and Leo Joop's approach.
    • The C parser in GCC-4.9.2.
    • YAEP as of Oct. 2015.
  • Grammar:

    • The base test grammar is the ANSI C grammar which is mostly a left recursion grammar.
    • For MARPA and YAEP, the grammar is slightly ambiguous as typenames are represented with the same kind of token as identifiers.
    • For the YACC description, typename is a separate token type distinct from other identifiers. The YACC description does not contain any actions except for a small number needed to give feedback to the scanner on how to treat the next identifier (as a typename or regular identifier).
  • Scanning test files for YACC, MARPA, and YAEP:

    • We prepare all tokens beforehand in order to exclude scanning time from our benchmark.
    • For YACC, at the scanning stage we do not yet distinguish identifiers and typenames.
  • Tests:

    • The first test is based on the file gen.c from parser-generator MSTA. The file was concatenated 10 times and the resulting file size was 67K C lines.
    • The second test is a pre-release version of gcc-4.0 for i686 with all the source code combined into one file (source). The file size was 635K C lines.
    • The C pre-processor was applied to the files.
    • Additional preparations were made for YACC, MARPA, and YAEP:
      • GCC extensions (mostly attributes and asm) were removed from the pre-processed files. The removed code is a tiny and insignificant fraction of the entire code.
      • A very small number of identifiers were renamed to avoid confusing the simple YACC actions to distinguish typenames and identifiers. So the resulting code is not correct as C code but it is correct from the syntactic point of view.
  • Measurements:

    • The result times are elapsed (wall) times.
    • Memory requirements are measured by comparing the output of Linux sbrk before and after parsing.
    • For GCC, memory was instead measured as max resident memory reported by /usr/bin/time.
  • How to reproduce: please use the shell script compare-parsers.tst from directory src.

  • Results:

    • First file (67K lines). Test machine is i7-2600 (4 x 3.4GHz) with 8GB memory under FC21.
Parse time only Overall Memory (parse only) MB
YACC 0.07 0.17 20
MARPA 3.48 3.77 516
YAEP 0.18 0.28 26
  • Second file (635K lines). Test machine is 2xE5-2697 (2 x 14 x 2.6GHz) with 128GB memory under FC21.
Parse time only Overall Memory (parse only) MB
YACC 0.25 0.55 120
gcc -fsyntax-only - 1.22 194
gcc -O0 - 19.37 761
MARPA 22.23 23.41 30310
YAEP 1.43 1.68 142
  • Conclusions:
    • YAEP without a scanner is up to 20 times faster than Marpa and requires up to 200 times less memory.
    • Still, it is 2.5 - 6 times slower (1.6 - 3 times when taking the scanner into account) than YACC.

Future directions

  • Implement YACC-style description syntax for operator precedence and associativity.
  • Implement bindings for popular scripting languages.
  • Introduce abstract node codes (instead of string labels) for faster work with abstract trees.
  • Permit nested abstract nodes in simple translation.