From b6ea82f1d3826bd697875423f1bd69c021ffa052 Mon Sep 17 00:00:00 2001
From: Ken Domino
Date: Mon, 26 Jun 2023 08:20:14 -0400
Subject: [PATCH] Add "writing target-agnostic grammars".

Signed-off-by: Ken Domino
---
 doc/index.md                    |  2 +
 doc/target-agnostic-grammars.md | 94 ++++++++++++++++++++-------------
 2 files changed, 59 insertions(+), 37 deletions(-)

diff --git a/doc/index.md b/doc/index.md
index d3bde038ea..9dfc2e2766 100644
--- a/doc/index.md
+++ b/doc/index.md
@@ -59,6 +59,8 @@ For those using Java, here's a great [set of ANTLR in Intellij notes](https://do
 
 * [Parser and lexer interpreters](interpreters.md)
 
+* [Writing target-agnostic grammars](target-agnostic-grammars.md)
+
 * [Resources](resources.md)
 
 # Building / releasing ANTLR itself
diff --git a/doc/target-agnostic-grammars.md b/doc/target-agnostic-grammars.md
index 00e9eedc75..6f867a615b 100644
--- a/doc/target-agnostic-grammars.md
+++ b/doc/target-agnostic-grammars.md
@@ -1,46 +1,66 @@
-# Target agnostic grammars
+# Writing target-agnostic grammars
 
-If your grammar is targeted to Python only, you may ignore the following. But if your goal is to get your Java parser to also run in Python, then you might find it useful.
+Some grammars may require side-effecting
+actions or [semantic predicates](https://github.com/antlr/antlr4/blob/dev/doc/predicates.md)
+in order to add context-sensitive parsing to what would normally be a context-free grammar.
 
-1. Do not embed production code inside your grammar. This is not portable and will not be. Move all your code to listeners or visitors.
-1. The only production code absolutely required to sit with the grammar should be semantic predicates, like:
+For example:
+* In Fortran90, [lines that begin with a 'C' in column 1
+denote a comment](https://github.com/antlr/grammars-v4/blob/43fbb16fec1d474d38a603cc6a6bcbe5edf07b1e/fortran/fortran90/slow/hw.f90#L1).
+These types of comments should be tokenized and placed on a comment token stream,
+not the default token stream. But, if the 'C' does not begin in
+column 1, then the input is invalid and should be flagged as such.
+```fortran
+c Hello World.
+ c This is a syntax error because 'c' does not start in column 1
+program hello
+    print *, 'Hello World!'
+end
 ```
-ID {$text.equals("test")}?
-```
-
-Unfortunately, this is not portable, as Java and Python (and other target languages) have different syntaxes for all but the simplest language elements. But you can work around it. The trick involves:
-
-* deriving your parser from a parser you provide, such as BaseParser
-* implementing utility methods, such as "isEqualText", in this BaseParser, in different files for each target language
-* invoking your utility methods in the semantic predicate from the `$parser` object
-
-Thanks to the above, you should be able to rewrite the above semantic predicate as follows:
-
-File `MyGrammarParser.g4`:
-```
-options { superClass = MyGrammarBaseParser; }
-...
-ID {$parser.isEqualText($text,"test")}?
+* In CSharp, double angle-brackets `'>>'` can mean either
+[a right shift expression](https://github.com/antlr/grammars-v4/blob/43fbb16fec1d474d38a603cc6a6bcbe5edf07b1e/csharp/examples/AllInOneNoPreprocessor.cs#L657C15-L657C17)
+or [a declaration that contains a type with
+nested templates](https://github.com/antlr/grammars-v4/blob/master/csharp/examples/AllInOneNoPreprocessor.cs#L463C33-L463C35).
+Since lexers in Antlr are not parser-aware,
+the lexer must tokenize the double angle-brackets as two separate tokens, one for each angle bracket.
+A semantic predicate should be added to disallow a space between the angle brackets when parsing an
+expression.
+```C#
+class Foo {
+    void Func()
+    {
+        int x = 1000 > > 2; // syntax error if a space exists in the double-angle bracket
+    }
+    Dictionary > mapping; // nested template declaration, valid
+}
 ```
 
-File `MyGrammarBaseParser.py`:
-```python
-from antlr4 import *
-
-class MyGrammarBaseParser(Parser):
+Antlr does not have a general-purpose language for actions and predicates;
+these must be written in the target language of the generated parser.
+The problem is that if one wants a grammar that "works" across a number of
+targets, like CSharp, Java, and Cpp, the grammar would need to be forked for
+each target. This adds the
+burden of maintaining a version of the grammar for each target. _Or so it would
+seem._
 
-    def isEqualText(a, b):
-        return a is b
-```
+It turns out that it is possible to write the grammar
+so that it appears target-independent; this is called the _target-agnostic format_.
 
-File `MyGrammarBaseParser.java`:
-```java
-import org.antlr.v4.runtime.*;
+## Rules in writing target-agnostic grammars
 
-public abstract class MyGrammarBaseParser extends Parser {
+1) Move all actions and semantic predicates written in target-specific
+code to base-class methods.
+2) Replace all actions and semantic predicates in the grammar with a single
+call to the method. For actions, use `{ this.name_of_action() }`. For predicates,
+use `{ this.name_of_predicate() }?`.
+3) Add a Python script called "transformGrammar.py" that replaces strings
+in the grammar per target.
+    a) For Cpp: replace `this.` strings with `this->`.
+    b) For PHP: replace `this.` strings with `$this->`.
+    c) For Python: replace `this.` strings with `self.`, `l.`, or `p.` depending on
+where the action or predicate is in the grammar.
+    d) Run `python transformGrammar.py *.g4` prior to generating the parser and lexer.
 
-    public static boolean isEqualText(a, b) {
-        return a.equals(b);
-    }
-}
-```
+## Examples of target-agnostic grammars
+* [fortran90](https://github.com/antlr/grammars-v4/tree/master/fortran/fortran90)
+* [csharp](https://github.com/antlr/grammars-v4/tree/master/csharp)
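
Note on rule 3 of the added document: the patch describes `transformGrammar.py` only in prose. The following is a minimal sketch of the string replacement it describes, not the script used in grammars-v4. The names `TARGET_PREFIX` and `transform_grammar_text`, and the predicate method `IsColumnOne`, are hypothetical. The Python entry is simplified to `self.`, whereas the document says `l.` or `p.` may be needed depending on where the action or predicate sits. Targets not listed are assumed to accept `this.` unchanged.

```python
import re

# Hypothetical per-target prefixes, following rule 3 (a)-(c) of the patch.
# Assumption: the Python target is simplified to `self.` here; the document
# notes that `l.` or `p.` may be required in some positions.
TARGET_PREFIX = {
    "Cpp": "this->",
    "PHP": "$this->",
    "Python": "self.",
}


def transform_grammar_text(text: str, target: str) -> str:
    """Rewrite `this.` in grammar text to the prefix the given target expects."""
    prefix = TARGET_PREFIX.get(target)
    if prefix is None:
        # Assumption: unlisted targets (e.g. Java, CSharp) take `this.` as-is.
        return text
    # A callable replacement keeps `$` and `>` literal in the output.
    return re.sub(r"\bthis\.", lambda _match: prefix, text)


if __name__ == "__main__":
    # Hypothetical lexer rule using a target-agnostic predicate call.
    rule = "COMMENT : {this.IsColumnOne()}? 'c' ~[\\r\\n]* -> channel(HIDDEN) ;"
    for target in ("Java", "Cpp", "PHP", "Python"):
        print(f"{target:7}", transform_grammar_text(rule, target))
```

A real script would read each `.g4` file given on the command line, apply the replacement, and write the result back before the ANTLR tool is run, as step (d) describes.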