Incremental Parsing support #414
Conversation
…when the position/lookahead has not changed
I have actually backported this to ANTLR4 Java and submitted it to the main repo.
@dberlin this sounds interesting; I'll have a look.
SGTM. Incremental parsing is in good shape. I got slammed at work and then had another child, so I haven't had time to finish the incremental lexing.
(The incremental lexing info can be found here:
A bit late to the party, but this is super cool. Are there any plans to finish this up and merge it?
@dberlin, can you help me understand whether incremental parsing can help with a fundamental mismatch between the Java stream model (which uses blocking I/O) and the JavaScript stream model, where no blocking for I/O is permitted? In JavaScript, rather than pulling data from a stream, data arrives in chunks delivered by a callback (continuation-passing style) or, more recently, via Promises. Promises have led to language extensions such as async/await.
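To make that mismatch concrete, here is a minimal sketch (not antlr4ts code; all names are illustrative) of the two models: a pull-style character source that a Java-style lexer could block on, versus a push/async source where chunks arrive via Promises and the consumer must `await` them.

```typescript
// Pull model (Java-style): the consumer asks for the next character and, in
// Java, may block until it is available. Blocking like this is not possible
// in JavaScript.
interface PullCharSource {
  nextChar(): number; // returns -1 at EOF
}

// Push/async model (JavaScript-style): chunks arrive via callbacks or
// Promises, so the consumer suspends with `await` instead of blocking.
async function collectChunks(source: AsyncIterable<string>): Promise<string> {
  let text = "";
  for await (const chunk of source) {
    text += chunk; // data is pushed to us; we never block waiting for it
  }
  return text; // only now can a pull-style lexer safely run over the text
}
```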
@dberlin can you help me understand which parts of this feature currently require changes to the core library? I'm hoping to find a way for it to be used without changing the core code generator or runtime.
Let me do my best to go in order.

@AlexGustafsson I won't have a chance to finish this up anymore (I've had another kid since then, and I'm moving across the country).

I'm not as familiar with the ANTLR core runtime as y'all are, I expect. It may be possible to do this without modifying the code generator, but I'm not sure how. Beyond that, a lot of the complexity comes from the intermediate use of a token stream: we waste a bunch of time and energy returning, tracking, and processing unchanged tokens. If instead the incremental parser required an incremental lexer, it could be made to only ever request tokens for the changed areas, which would also clean up the whole interface you see here. (All of this unfortunately requires random access to the underlying text being lexed, but, on the plus side, it can be made non-blocking as a result.) The incremental lexer changes I posted have an IncrementalLexer.md file that goes into this a bit.

If you want to see another implementation that uses a similar strategy, to see if you can figure out a way, take a look at tree-sitter. I find it somewhat impenetrable, even knowing exactly how it works, but... It is also for LR, though both it and this are based on the same paper. In tree-sitter the incremental lexing is driven by the parser, but it is otherwise the same algorithm as the incremental lexer I posted.
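As a rough illustration of the "incremental parser drives an incremental lexer" idea described above, here is a hypothetical interface sketch. None of these names come from this PR or the incremental-lexer branch; they only show the shape: the lexer is told about the edit, tokens are requested on demand, and only changed regions ever get re-lexed, which is why random access to the underlying text is assumed.

```typescript
// Hypothetical types only -- not part of antlr4ts or this PR.

// Describes a single text edit so the lexer can invalidate affected tokens.
interface TextEdit {
  start: number;      // character offset where the edit begins
  oldLength: number;  // length of the span that was replaced
  newLength: number;  // length of the replacement text
}

interface HypotheticalIncrementalLexer {
  // Record an edit; tokens overlapping the edited region become invalid.
  applyEdit(edit: TextEdit): void;

  // Return the token at a given token index, re-lexing on demand only if
  // that region was invalidated. Unchanged tokens are served from cache.
  tokenAt(index: number): { type: number; startIndex: number; stopIndex: number };
}
```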
@dberlin, is this designed to handle incremental parsing in the sense that a portion of the text might have changed (as in an IDE doing syntax checking), or in the sense that the input stream continues to deliver characters beyond those previously available?
The former.
@sharwell, is there a pre-existing method in the code generator to alter what base class the generated class(es) are derived from? |
Let me differentiate capability from speed. Your case is equivalent to the case where all the text is added at the end, but it would be slow for the reasons the .md files cover (ANTLR's current way of doing this forces dealing with unchanged tokens). It could be made to handle this case very well if you did it the "incremental parser drives incremental lexer" way described in those files. In fact, it would be optimal and should be no slower than doing it all at once.
(And for example, tree-sitter, with the same algorithms, is used to parse a character at a time.)
Yes, there is. I started with that approach, and if I remember correctly the context creation in each rule is side-effecting, or something like that, which made it very hard to move the guard rule check from where it is. I probably still have the code around to do it that way somewhere if you want to see whether you can make it work.
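For reference, the pre-existing mechanism here is presumably ANTLR's grammar-level `superClass` option, which makes the generated parser extend a hand-written base class (depending on the target, the grammar may also need a header action so the generated file can import that class). Below is a minimal sketch of such a base class for antlr4ts; the `canReuseRule` hook is hypothetical and only marks where a reuse check could live if the generated rule bodies could be made to call it, which is exactly the hard part mentioned above.

```typescript
// A minimal sketch of a hand-written base class that a grammar could select
// with `options { superClass = IncrementalParserBase; }`.
import { Parser, TokenStream } from "antlr4ts";

export abstract class IncrementalParserBase extends Parser {
  constructor(input: TokenStream) {
    super(input);
  }

  // Hypothetical hook: return true when the cached subtree for this rule at
  // the current position can be reused, so the rule body could be skipped.
  protected canReuseRule(ruleIndex: number, startTokenIndex: number): boolean {
    return false; // this sketch always re-parses
  }
}
```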
@dberlin: Returning to a higher-level discussion, language support in IDEs like vscode often has two different lexers for the same language:

1. An incremental syntax-highlighting lexer, which operates on a line-by-line basis and whose only job is to classify tokens for color display purposes.
2. A full lexer and parser with semantic checking, which runs in a separate process to generate the "problems" and red squiggles. This process is a language service for the language.

Sub-second responsiveness in syntax highlighting is important, but highlighting only calls for an incremental lexer, no parsing, at least as I've seen it (in Visual Studio and Visual Studio Code). In fact, the current trend seems to be to use TextMate grammars (https://macromates.com/manual/en/language_grammars) for highlighting purposes, as vscode supports. Thus it's use case two (where the lexer and parser are integrated) where this sort of incremental parsing begins to become interesting. If the typical parse time you get for a multi-megabyte G-code file is measured in seconds, that seems pretty acceptable for use case two, so while this sort of incremental lexing and parsing may be interesting, there doesn't seem to be a pressing need for it.

I see this as different from streaming lexing and parsing, which I thought could be related to incremental parsing. The goal in streaming is managing back-pressure so that the memory requirements for a simple command-line tool don't expand excessively.
P.S. Because of the clever way ANTLR's ALL(*) works, the recognizer may perform much better after it's been warmed up on similar input. Did the 3-6 seconds for the 6.5 MB G-code file include the warm-up overhead?
Hey Burt,

(The ANTLR timings are for the second and later parses; the first parse is a lot slower due to the need to build out the state cache, etc., as you say.)

I'm fairly familiar with this style; I've worked on a lot of IDEs in the past 25 years. My day job is actually owning IDEs, programming languages, and a few other things for Google :)

A few things:

1. Lexing is actually not good enough anymore for a lot of languages, and in fact there is a large and growing chorus of complaints pushing vscode to move to tree-sitter as a result (which does sub-second responsiveness with parsing). See the long discussion starting at microsoft/vscode#50140, for example (there are other issues as well). See https://marketplace.visualstudio.com/items?itemName=georgewfraser.vscode-tree-sitter for an example comparing syntax highlighting. Note that the tree-sitter version is not just better, it's actually *faster* than the TextMate grammar engine in vscode by far (even though vscode is doing delayed updates and tree-sitter is reparsing on every keystroke). Truthfully, at least in the IDEs I worked on, lexing never really *was* good enough; it's just that outside of the Pan/Ensemble/Harmonia research projects, nobody bothered to make parsing fast enough (I actually worked on some incremental-parsing IDEs in the past). IDEs other than vscode, such as Atom, already use tree-sitter as an incremental parsing system. See https://github.blog/2018-10-31-atoms-new-parsing-system/

2. Atom and other IDEs that use tree-sitter (again, the same incremental algorithms as here) reparse on a per-keystroke basis and have no issue even with multi-megabyte files. I have a tree-sitter grammar for the same file, and the parsing/lexing time on changes is zero.

So overall I'd say "I believe what you say has been true in the past; I definitely would not agree it is true anymore." In fact, I'd put a significant amount of money on tree-sitter taking over the IDE space (this is an easy bet to make because it already has, outside of a few mainstays :P)

But, in the end, what y'all choose to do is up to you. Feel free to close this if you aren't interested. I wrote it mostly because I've been using ANTLR since the PCCTS days, and it was fun. I already have tree-sitter grammars for my work that I can use in IDEs (including vscode), etc.
Thanks @dberlin, you make a good case. I certainly am a bit out of date; I retired from MSFT a number of years back, and IDEs were never my focus. I think introducing incremental parsing to ANTLR through the Java PR antlr/antlr4#2527 is the way to go. It looks like that PR needs some minor rebasing and a response to a threading concern. @sharwell (and of course @parrt) are really the ANTLR4 / ALL(*) experts. I just tackled antlr4ts to get some hands-on experience with TypeScript, and I didn't care for the existing JavaScript target the last time I tried it. Anyway, take my input with a grain of salt; I am no expert.

I am a little surprised Google is interested in a G-code IDE, or is that a side project? Is there anything you can point me at about that?
Hey folks,
This pull request adds incremental parsing support to antlr4ts.
I'm explicitly not seeking to get this merged in its exact form; I'm more curious whether y'all think it is worth the time/energy to try to get it merged into ANTLR (either the reference or optimized runtime).
I wrote the code for the TypeScript runtime first because I'm using it with vscode. I am happy to backport it to Java.
As you can see, it is deliberately structured to be simple and small and to have no extra dependencies outside of the existing runtime. It does not require modifying any existing part of the runtime.
I've tested it on fairly complex grammars (the included tests exercise the JavaLR grammar, among others).
I added a doc (IncrementalParser.md) that explains how it works at a high level, as well as the outstanding issues. The tests currently cover basic add/remove/etc. on a simple grammar as well as the JavaLR grammar, and verify that the right parts of the parse tree did what they were supposed to.
My use case is a little weird: I am parsing G-code files, which can get quite large (hundreds of megabytes). While ANTLR is quite fast, the parse time on a 6.5 MB G-code file is already 3-6 seconds (depending on whether parse trees are built). With the incremental parser, adding a new line is O(10 ms).
We don't do incremental lexing, but for a lot of languages it's not difficult to do by hand (particularly if you only care about the text being right).
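The "do it by hand" approach is easiest to picture for a line-oriented language like G-code, where tokens never span lines. Here is an illustrative sketch (the token shape and helper names are assumptions, not code from this PR): re-lex only the edited lines and reuse the cached tokens for everything after them, fixing up their line numbers.

```typescript
// Illustrative sketch of hand-rolled incremental lexing for a line-oriented
// language (tokens never cross line boundaries). Not part of this PR.

interface Tok { type: string; text: string; line: number; column: number }

// Assumed per-line tokenizer: splits a line on whitespace. Real lexing rules
// would go here; the point is only that it works on one line at a time.
function lexLine(text: string, line: number): Tok[] {
  const toks: Tok[] = [];
  const re = /\S+/g;
  let m: RegExpExecArray | null;
  while ((m = re.exec(text)) !== null) {
    toks.push({ type: "WORD", text: m[0], line, column: m.index });
  }
  return toks;
}

// Re-lex only lines [firstChanged, lastChanged] of the new text and reuse the
// cached tokens for the unchanged tail, shifting their line numbers.
function relexLines(
  oldTokensByLine: Tok[][],
  newLines: string[],
  firstChanged: number,
  lastChanged: number
): Tok[][] {
  const result = oldTokensByLine.slice(0, firstChanged); // unchanged prefix
  for (let i = firstChanged; i <= lastChanged; i++) {
    result.push(lexLine(newLines[i], i)); // only the edited lines are re-lexed
  }
  const lineDelta = newLines.length - oldTokensByLine.length;
  for (let i = lastChanged + 1 - lineDelta; i < oldTokensByLine.length; i++) {
    result.push(oldTokensByLine[i].map(t => ({ ...t, line: t.line + lineDelta })));
  }
  return result;
}
```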
As mentioned, I'm happy to do the work for TS/Java and happy to push on it; I'm just trying to understand whether I'm the only one in the world who cares :)
Thanks for any thoughts/feedback!
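For readers skimming the thread, the reuse test that the "guard rule" check performs (skip re-parsing when a rule's position/lookahead has not changed) can be sketched in a few lines. The types below are illustrative stand-ins, not the classes this PR actually adds: a cached subtree is reusable when the edited token range does not overlap the tokens the subtree consumed, including any lookahead tokens it examined.

```typescript
// Illustrative sketch of the guard-rule reuse test; not the PR's actual API.

// Token-index interval covered by a cached subtree from the previous parse.
interface NodeSpan { startTokenIndex: number; stopTokenIndex: number }

// Token-index range touched by the edit in the old token stream.
interface EditedRange { firstChanged: number; lastChanged: number }

// A cached subtree can be reused when the edit does not overlap the tokens it
// consumed, including the extra lookahead it examined while parsing.
// (Subtrees entirely after the edit are reused after shifting their indexes.)
function canReuse(node: NodeSpan, lookahead: number, edit: EditedRange): boolean {
  const effectiveStop = node.stopTokenIndex + lookahead;
  return effectiveStop < edit.firstChanged || node.startTokenIndex > edit.lastChanged;
}
```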