Incremental Parsing support #414

Draft
wants to merge 11 commits into base: master

Conversation

dberlin
Contributor

@dberlin dberlin commented Apr 5, 2019

Hey folks,
This pull request adds incremental parsing support to antlr4ts.
I'm explicitly not seeking to get this merged in its exact form; I'm more curious whether y'all think it is worth the time/energy to try to get it merged into ANTLR (either the reference or optimized runtime).

I wrote the code for the typescript runtime first because i'm using it with vscode. I am happy to back port it to java.

As you can see, it is deliberately structured to be simple and small, with no extra dependencies outside of the existing runtime. It does not require modifying any existing part of the runtime.

I've tested it on fairly complex grammars (the included tests test it on the JavaLR grammar among others).

I added a doc (IncrementalParser.md) that explains how it works at a high level, as well as the outstanding issues. The tests currently test basic add/remove/etc on a simple grammar as well as the JavaLR grammar, and verify the right parts of the parse tree did what they were supposed to.

My use case is a little weird - i am parsing GCode files, which can get quite large (hundreds of megabytes). While ANTLR is quite fast, the parse time on a 6.5 meg gcode file is already 3-6 seconds (depending on whether parse trees are built). With the incremental parser, adding a new line is O(10ms).

We don't do incremental lexing but it's not difficult for a lot of languages to do by hand (particularly if you only care about text being right).

As mentioned, happy to do the work for TS/Java, and happy to push on it, just trying to understand if i'm the only one in the world who cares :)

Thanks for any thoughts/feedback!

@dberlin
Contributor Author

dberlin commented Apr 21, 2019

I have actually back ported this to ANTLR4 java and submitted it to the main repo.
I will keep this up to date with that (and if it gets turned down, i'll close this)

@BurtHarris
Collaborator

@dberlin this sounds interesting, I'll have a look.

@dberlin
Contributor Author

dberlin commented Jul 2, 2019

SGTM. Incremental parsing is in good shape.

I got slammed at work and then had another child, so i haven't had time to finish the incremental lexing.

@dberlin
Contributor Author

dberlin commented Jul 2, 2019

(the incremental lexing info can be found here:
antlr/antlr4#2534
The TL;DR is that it works but i did not finish changed token list generation.
So incremental parsing and lexing work fine individually, they just don't automatically integrate)

@AlexGustafsson

A bit late to the party, but this is super cool. Are there any plans on finishing this up and merging it?

@BurtHarris BurtHarris marked this pull request as ready for review April 10, 2020 18:04
@BurtHarris
Collaborator

@dberlin, can you help me understand if incremental parsing can help with a fundamental mismatch between the Java stream model (which uses blocking I/O) and the JavaScript stream model, where no blocking for I/O is permitted?

In JavaScript, rather than pulling data from a stream, data arrives in chunks, which are delivered by a callback (continuation passing style), or more recently using Promises. Promises have led to language extensions such as async/await, where the code can look very much as if it supported blocking I/O, but it's an illusion.
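As a rough illustration of that push model (not code from this PR), here is a minimal sketch in which chunks are handed to a consumer that must keep its own state between calls and return without blocking. `LineAssembler` and its methods are hypothetical names chosen for the example.

```typescript
// Hypothetical helper: accumulates pushed chunks and emits complete lines.
class LineAssembler {
  private pending = "";

  constructor(private readonly onLine: (line: string) => void) {}

  // Called whenever a chunk arrives; it must return without blocking,
  // so any partial line is kept in `pending` until the next chunk.
  push(chunk: string): void {
    this.pending += chunk;
    let newline = this.pending.indexOf("\n");
    while (newline !== -1) {
      this.onLine(this.pending.slice(0, newline));
      this.pending = this.pending.slice(newline + 1);
      newline = this.pending.indexOf("\n");
    }
  }

  // Called once when the stream ends; flushes a trailing partial line.
  end(): void {
    if (this.pending.length > 0) {
      this.onLine(this.pending);
      this.pending = "";
    }
  }
}

const lines: string[] = [];
const assembler = new LineAssembler((line) => lines.push(line));
// Chunk boundaries are arbitrary and can split a line in two.
for (const chunk of ["G1 X1", "0 Y20\nG1 ", "X15\n", "M2"]) {
  assembler.push(chunk);
}
assembler.end();
```

The consumer never asks for "the next character"; it only reacts to what has arrived, which is the part a pull-based parser loop cannot do without buffering or restructuring.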

@sharwell
Member

sharwell commented Apr 10, 2020

@dberlin can you help me understand the parts of this feature which currently require changes to the core library? I'm hoping to find a way that it can be used without needing to change the core code generator or runtime.

@BurtHarris BurtHarris marked this pull request as draft April 10, 2020 20:59
@dberlin
Contributor Author

dberlin commented Apr 10, 2020

Let me do my best to go in order.

@AlexGustafsson I won't have a chance to finish this up anymore (had another kid since then, and moving across the country)
@sharwell It is not possible to do it without adding support to the core runtime in various places (i.e., adding Incremental* classes), or at least, no way to do it occurs to me. I can express what it's doing, and did my best to do that already in IncrementalParsing.md; happy to help with any specifics.

I'm not as familiar with the ANTLR core runtime as y'all, I expect.

It may be possible to do this without modifying the code generator, but i'm not sure how to do it.
What would have to happen would be to move the guard check out of the code generator, and into the core runtime.
This would require further modifying IncrementalParser to try to make that work.
I tried it at the start and it was non-obvious to figure out all the changes and places, so i gave up in favor of small obvious changes to the code generator.
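The thread does not spell out what the guard check does. As a rough sketch, assuming it is a test at the start of each generated rule method that hands back the previously parsed subtree when none of its tokens changed, it might look something like this. `ReuseGuard`, `guard`, and the node shape are hypothetical names for illustration, not the PR's actual API, and the sketch ignores lookahead tokens, which a real implementation would presumably have to track as well.

```typescript
// Hypothetical shape of a previously parsed subtree (token indexes are inclusive).
interface ParsedNode {
  rule: string;
  startToken: number;
  stopToken: number;
}

// Hypothetical inclusive range of token indexes that changed since the last parse.
interface TokenRange {
  start: number;
  stop: number;
}

class ReuseGuard {
  private readonly previous = new Map<string, ParsedNode>();

  constructor(oldNodes: ParsedNode[], private readonly changed: TokenRange[]) {
    for (const node of oldNodes) {
      this.previous.set(`${node.rule}@${node.startToken}`, node);
    }
  }

  // Returns the old subtree if this rule was parsed at this token before and no
  // changed token falls inside it; otherwise undefined, meaning "parse normally".
  guard(rule: string, tokenIndex: number): ParsedNode | undefined {
    const old = this.previous.get(`${rule}@${tokenIndex}`);
    if (old === undefined) return undefined;
    const touched = this.changed.some(
      (c) => c.start <= old.stopToken && c.stop >= old.startToken,
    );
    return touched ? undefined : old;
  }
}

const guardDemo = new ReuseGuard(
  [
    { rule: "line", startToken: 0, stopToken: 2 },
    { rule: "line", startToken: 3, stopToken: 5 },
  ],
  [{ start: 4, stop: 4 }], // only token 4 changed
);
const reusedNode = guardDemo.guard("line", 0);   // untouched subtree
const reparsedNode = guardDemo.guard("line", 3); // overlaps the change
```

Whether such a check lives in generated code or in the runtime is exactly the trade-off discussed above; the sketch only shows the decision itself.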

Beyond that, a lot of the complexity is because of the intermediate use of a token stream.
You could simplify all of this a lot more if you required incremental lexing, and drove the incremental lexing from the incremental parser as it walked.
You could then, for example, get rid of IncrementalTokenStream and put most of that into the IncrementalParser. It would also be a lot faster.
What happens right now is that ANTLR parsers expect token streams to be complete, and seek/skip by calling nextToken, which also blocks.

We waste a bunch of time and energy returning/tracking and processing unchanged tokens.

If instead the incremental parser required an incremental lexer, it could be made to only ever request tokens for changed areas.

This would also clean up the whole interface you see here.
It would also let the incremental parser tell the lexer what next tokens would be acceptable, so that it knows whether it needs to relex further or not.

(All of this unfortunately requires random access to the underlying thing being lexed, but, on the plus side, can be made non-blocking as a result)
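To illustrate the idea of only lexing changed areas, here is a minimal sketch on a toy whitespace-delimited lexer. It is not the PR's IncrementalLexer or IncrementalTokenStream; `Token`, `Edit`, `nextToken`, `lexAll`, and `relex` are hypothetical names, and real lexers need lookahead and mode tracking that this ignores.

```typescript
interface Token {
  text: string;
  start: number; // offset of the first character; the token ends at start + text.length
}

interface Edit {
  start: number;    // offset where the change begins
  removed: number;  // number of old characters deleted
  inserted: number; // number of new characters inserted
}

// Toy lexer: returns the next whitespace-delimited token at or after `from`.
function nextToken(text: string, from: number): Token | undefined {
  let start = from;
  while (start < text.length && /\s/.test(text.charAt(start))) start++;
  if (start >= text.length) return undefined;
  let end = start;
  while (end < text.length && !/\s/.test(text.charAt(end))) end++;
  return { text: text.slice(start, end), start };
}

function lexAll(text: string): Token[] {
  const tokens: Token[] = [];
  for (let t = nextToken(text, 0); t !== undefined; t = nextToken(text, t.start + t.text.length)) {
    tokens.push(t);
  }
  return tokens;
}

function relex(old: Token[], newText: string, edit: Edit): { tokens: Token[]; relexed: number } {
  const delta = edit.inserted - edit.removed;
  const editEnd = edit.start + edit.removed;
  // Tokens ending strictly before the edit are separated from it by whitespace: reuse them.
  const prefix = old.filter((t) => t.start + t.text.length < edit.start);
  // Tokens starting strictly after the removed range only need their offsets shifted.
  const suffix = old
    .filter((t) => t.start > editEnd)
    .map((t) => ({ text: t.text, start: t.start + delta }));

  const tokens = [...prefix];
  const last = prefix[prefix.length - 1];
  const resync = suffix[0];
  let pos = last !== undefined ? last.start + last.text.length : 0;
  let relexed = 0;
  for (let t = nextToken(newText, pos); t !== undefined; t = nextToken(newText, pos)) {
    if (resync !== undefined && t.start === resync.start) {
      // Back in step with an untouched old token: everything after it is reusable.
      tokens.push(...suffix);
      return { tokens, relexed };
    }
    tokens.push(t);
    relexed++;
    pos = t.start + t.text.length;
  }
  return { tokens, relexed };
}

const oldText = "G1 X10 Y20 F300";
const newText = "G1 X125 Y20 F300"; // "10" at offset 4 replaced by "125"
const result = relex(lexAll(oldText), newText, { start: 4, removed: 2, inserted: 3 });
```

In this example only the token touching the edit is lexed again; the tokens before it are reused as-is and the tokens after it are reused with shifted offsets. Note that `relex` reads `newText` at arbitrary offsets, which is the random-access requirement mentioned above.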

The incremental lexer changes i posted have an IncrementalLexer.md file that goes into this a bit.

If you want to see another implementation that uses a similar strategy to this, to see if you can figure out a way, take a look at tree sitter. I find it somewhat impenetrable, even knowing exactly how it works, but ...

It's also for LR, but both this and that are based on the same paper. The incremental lexing is driven by the parser in tree-sitter, but is otherwise the identical algorithm to the incremental lexer i posted.

@BurtHarris
Collaborator

BurtHarris commented Apr 10, 2020

@dberlin, Is this designed to deal with incremental parsing as in a portion of the text might have changed (like in an IDE doing syntax checking), or as in the input stream continues to deliver characters beyond those previously available?

@dberlin
Contributor Author

dberlin commented Apr 10, 2020 via email

@BurtHarris
Collaborator

@sharwell, is there a pre-existing method in the code generator to alter what base class the generated class(es) are derived from?

@dberlin
Contributor Author

dberlin commented Apr 11, 2020

Let me differentiate capability from speed.
It is capable of handling your case now.

Your case is equivalent to the case where all the text is added at the end, but it would be slow for the reasons the .md files cover (ANTLR's current way of doing this forces dealing with unchanged tokens).

It could be made to deal with this case very well if you did the "incremental parser drives incremental lexer" way described in those files.

In fact, it would be optimal and should be no slower than doing it all at once.

@dberlin
Contributor Author

dberlin commented Apr 11, 2020

(and for example, tree sitter, with the same algorithms, is used to parse character at a time)

@dberlin
Contributor Author

dberlin commented Apr 11, 2020 via email

@BurtHarris
Collaborator

BurtHarris commented Apr 11, 2020

@dberlin: Returning to a higher level discussion, language support in IDEs like vscode often have two different lexers for the same language:

  1. An incremental syntax highlighting lexer which operates on a line-by-line basis, and whose only job is to classify tokens for color display purposes. These often function tightly integrated with the editor's buffer functionality.

  2. A full lexer and parser with semantic checking, which runs in a separate process (at least in vscode) to analyze for "problems" and generate the red squiggles. This process is a language service for the language.

Sub-second responsiveness in syntax highlighting is important, but highlighting only calls for an incremental lexer, no parsing, at least as I've seen it (in Visual Studio and Visual Studio Code.) In fact, the current trend seems to be to use TextMate grammars for highlighting purposes, as vscode supports.
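A minimal sketch of the line-by-line approach described in item 1, assuming the only state carried across lines is whether a `/* ... */` comment is open. `LineState`, `scanLine`, and `LineHighlighter` are hypothetical names for illustration and do not correspond to any editor's or TextMate's actual API.

```typescript
// Whether a /* ... */ comment is open at a line boundary.
type LineState = "code" | "comment";

// Toy classifier: returns the state at the end of `line`, given the state at its start.
function scanLine(line: string, state: LineState): LineState {
  let i = 0;
  while (i < line.length) {
    if (state === "code" && line.startsWith("/*", i)) {
      state = "comment";
      i += 2;
    } else if (state === "comment" && line.startsWith("*/", i)) {
      state = "code";
      i += 2;
    } else {
      i++;
    }
  }
  return state;
}

class LineHighlighter {
  private readonly lines: string[];
  private readonly endStates: LineState[] = [];

  constructor(lines: string[]) {
    this.lines = [...lines];
    this.rescanFrom(0);
  }

  // Replaces one line and returns how many lines had to be rescanned.
  editLine(index: number, text: string): number {
    this.lines[index] = text;
    return this.rescanFrom(index);
  }

  private rescanFrom(index: number): number {
    let scanned = 0;
    let state: LineState = index > 0 ? (this.endStates[index - 1] ?? "code") : "code";
    for (let i = index; i < this.lines.length; i++) {
      const end = scanLine(this.lines[i] ?? "", state);
      scanned++;
      const unchanged = this.endStates[i] === end;
      this.endStates[i] = end;
      // Once a line ends in the same state as before, later lines cannot be affected.
      if (unchanged) break;
      state = end;
    }
    return scanned;
  }
}

const highlighter = new LineHighlighter(["a", "b /* open", "still comment", "close */ c", "d"]);
const scannedAfterPlainEdit = highlighter.editLine(0, "a2");  // end state of line 0 is unchanged
const scannedAfterCommentEdit = highlighter.editLine(1, "b"); // removes the comment opener
```

An ordinary edit rescans a single line, while an edit that changes the carried state ripples forward only until some line ends in the same state it had before.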

Thus it's use-case two (where the lexer and parser are integrated) where this sort of incremental parsing begins to become interesting. If the typical parse time you get for a multi-megabyte g code file is measured in seconds, that seems pretty acceptable for use-case 2. So I think while this sort of incremental lexing and parsing may be interesting, there doesn't seem to be a pressing need for it.

I see this as different from the streaming lexing and parsing that I thought could be related to incremental parsing. The goal in streaming is managing back-pressure so that the memory requirements for a simple command-line tool don't expand excessively.

@BurtHarris
Collaborator

BurtHarris commented Apr 11, 2020

P.S. because of the clever way ANTLR's ALL(*) works, the recognizer may perform much better after it's been warmed up on similar input. Did the 6.5 meg g code file in 3-6 seconds include the warm-up overhead?

@dberlin
Contributor Author

dberlin commented Apr 11, 2020 via email

@BurtHarris
Collaborator

BurtHarris commented Apr 14, 2020

Thanks @dberlin, you make a good case. I certainly am a bit out-of-date, I retired from MSFT a number of years back, and IDEs were never my focus.

I think introducing incremental parsing to ANTLR thru Java antlr/antlr4#2527 is the way to go. It looks like that PR needs some minor rebasing, and response to a threading concern.

@sharwell (and of course @parrt) are really the ANTLR4 / ALL(*) experts. I just tackled antlr4ts to get some hands-on experience with TypeScript, and I didn't care for the existing JavaScript target last time I tried it. Anyway, take my input with a grain of salt, I am no expert.

I am a little surprised Google's interested in a GCode IDE, or is that a side project? Is there anything you can point me at about that?


4 participants