This commit adds incremental parsing support to ANTLR4.
I have only updated the Java target and the out-of-tree TypeScript target (see tunnelvisionlabs/antlr4ts#414), but it should be very easy for someone who knows the language of another target to update it. The changes are deliberately minimal.
The Java version here is actually a backport of the TypeScript version, and took O(2 hours).
(As an aside, I have not written Java in a few years, so I fully expect there are things that could be done better.) The comments were originally written for the TypeScript version; I will go through and clean them up.
A detailed description of how it works is here (which also lists the outstanding issues), but it's a very straightforward implementation: it detects which rules could be affected by token changes. Rule contexts that can't have been affected by a set of token changes are reused, and their rules are not re-run. To account for possibly infinite lookahead/lookbehind, we keep track of how far ahead/behind the parser looked last time on each rule, and use that range as the bounds within which to look for changes.
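As a rough illustration of the reuse test described above, here is a minimal sketch in Java. It is not the code in this PR: `ReuseCheck`, `RuleSpan`, `minLooked`, `maxLooked`, and `canReuse` are hypothetical names, and it assumes changed tokens are identified by their index in the old token stream.

```java
import java.util.TreeSet;

// Hypothetical sketch, not the PR's implementation.
final class ReuseCheck {
    /** Token range a rule examined on the previous parse (assumed representation). */
    static final class RuleSpan {
        final int minLooked; // lowest token index consulted, including lookbehind
        final int maxLooked; // highest token index consulted, including lookahead

        RuleSpan(int minLooked, int maxLooked) {
            this.minLooked = minLooked;
            this.maxLooked = maxLooked;
        }
    }

    /**
     * A rule context can be reused only if no changed token falls inside
     * the range the rule looked at last time.
     */
    static boolean canReuse(RuleSpan span, TreeSet<Integer> changedTokens) {
        // Smallest changed index that is >= the start of the examined range.
        Integer firstChange = changedTokens.ceiling(span.minLooked);
        return firstChange == null || firstChange > span.maxLooked;
    }
}
```

With changed tokens at indices 10 and 42, a rule that looked at tokens 11 through 41 would be reused, while one that looked at tokens 5 through 10 would be re-run.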
The tests currently test on a simple grammar and the JavaLR grammar (which exercises the left recursion removal support).
The only class I've added that requires anything even mildly interesting of the runtime is the IncrementalParserData class.
Most of the work there is related to changing the start/end tokens of rule contexts to realign them with the token stream changes. If you only care about the text of the parse tree, and not the position/etc info, this is obviously unnecessary. I have not made this an option.
To track changed tokens and stream adjustments, the Java version of IncrementalParserData uses TreeMap/TreeSet. The TypeScript version uses arrays of ranges and binary search (see https://github.com/dberlin/antlr4ts/blob/incremental/src/IncrementalParserData.ts).
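To give a feel for why a TreeMap suits the stream-adjustment part, here is a small sketch. It is an assumption about one way to do it, not the actual IncrementalParserData code: `TokenOffsets`, `recordCumulativeShift`, and `toNewIndex` are hypothetical names, and the map is assumed to store, for each edit position in the old stream, the cumulative index shift that applies from that position onward.

```java
import java.util.Map;
import java.util.TreeMap;

// Hypothetical sketch, not the PR's implementation.
final class TokenOffsets {
    // Key: old token index where a shift starts to apply.
    // Value: cumulative shift for every old index >= key (until the next key).
    private final TreeMap<Integer, Integer> shiftFrom = new TreeMap<>();

    void recordCumulativeShift(int oldIndex, int cumulativeShift) {
        shiftFrom.put(oldIndex, cumulativeShift);
    }

    /** Maps a token index in the old stream to its index in the new stream. */
    int toNewIndex(int oldIndex) {
        // Closest edit position at or before this index, if any.
        Map.Entry<Integer, Integer> entry = shiftFrom.floorEntry(oldIndex);
        return entry == null ? oldIndex : oldIndex + entry.getValue();
    }
}
```

For example, if two tokens were inserted before old index 10 and one token was later removed before old index 50, the entries would be `10 -> +2` and `50 -> +1`. A reused rule context whose start token was at old index 49 would then be realigned to index 51. The lookup is a single `floorEntry` call, which plays the same role as the binary search over range arrays in the TypeScript version.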
I am happy to encapsulate this into a data structure in the runtime if anyone thinks it is worth it.
As for why do this at all: Yes, ANTLR is actually pretty fast.
My use case is a bit weird - large GCode files, which are often 20+ megabytes. As such, a single parse takes 6-10 seconds (for a 20 meg file).
Users often make small edits to various pieces.
(It's part of a vscode extension).
Lexing GCode is also completely trivial to do in a contextless fashion.
The incremental parser brings the reparse time down to <50ms.
I may get around to adding incremental lexing. As I'm sure Terence knows, this is "trickier".
I have the beginnings of support (elsewhere) based on some papers, but it is incestuous (the parser tells the lexer what tokens could be valid at a given change point and the lexer tries those rules). There are ways that don't do this, but some require being able to store/rewind/replay the transition state at each token, etc.