Using ECMAScript generators for parse functions #360
I haven't worked with JS generators before (but I know a similar concept from C#). They appear to me more like syntactic sugar. Yes, I can imagine converting the lexer into a generator, but the executed work is the same and usage only changes in the function calls involved.
You probably understand this, but perhaps my initial description focused too much on the low-level details of generators. At the highest level, this approach enables using the non-blocking Node.js streams programming model, which allows simple tools to be built as pipelines of readable -> transform -> transform -> ... -> writable. Here's a snippet (from SitePoint) illustrating an app that uses a gunzip transform:

```js
var fs = require('fs');
var zlib = require('zlib');

fs.createReadStream('input.txt.gz')
  .pipe(zlib.createGunzip())
  .pipe(fs.createWriteStream('output.txt'));
// node will automatically exit after the pipeline completes...
```

Transforms can use object mode to convert between text or byte streams and object streams. For ANTLR-based transforms, grammar-specific actions (or events) would pass objects to the next stage of the pipeline.

There is a necessary condition for this to work efficiently on large streams: the parsing done by an ANTLR-based transform must be async, in effect continuable when more data arrives. This makes stream-based applications quite conservative of memory (compared to a read-the-whole-file, then parse, then walk-the-parse-tree approach). Using generators for the generated recognizers provides that continuation capability.
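To make the object-mode idea concrete, here's a minimal sketch (the class name and object shape are made up for illustration, not antlr4ts API) of a transform that turns a byte stream into a stream of per-line objects, standing in for an ANTLR-driven transform that would emit token or rule-context objects:

```ts
import { Transform, TransformCallback } from 'stream';

// Hypothetical object-mode transform: consumes a byte stream and emits
// one object per line, the way an ANTLR-based transform might emit a
// token or rule-context object per recognized unit.
class LineObjects extends Transform {
  private pending = '';

  constructor() {
    super({ readableObjectMode: true });
  }

  _transform(chunk: Buffer, _enc: BufferEncoding, done: TransformCallback): void {
    this.pending += chunk.toString('utf8');
    const lines = this.pending.split('\n');
    this.pending = lines.pop() ?? ''; // keep the trailing partial line
    for (const text of lines) {
      this.push({ text, length: text.length });
    }
    done();
  }

  _flush(done: TransformCallback): void {
    if (this.pending.length > 0) {
      this.push({ text: this.pending, length: this.pending.length });
    }
    done();
  }
}
```

A downstream object-mode stage would then consume it via `fs.createReadStream('input.txt').pipe(new LineObjects())`.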
Hmm, would that work with the ALL(*) algorithm and potentially unlimited lookahead? Additionally, we need random seek within certain intervals. That doesn't work well with simple pipes.
> Potentially unlimited lookahead.

I'd expect the actual lookahead in most cases to be very limited (performance would suffer otherwise, right?), and I would expect something equivalent to the needed seek capability to be achievable by buffering within those intervals.

But consider parsing something simple, but potentially large, like a CSV file, perhaps outputting a subset of the columns on a line-by-line basis. There's really no reason to read the whole CSV file in before starting, or to build a parse tree, when line-at-a-time is probably a reasonable way to process it.
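For illustration, a minimal line-at-a-time sketch of that CSV scenario (the file name and column choice are invented; a real ANTLR-based transform would replace the naive split):

```ts
import * as fs from 'fs';
import * as readline from 'readline';

// Hypothetical line-at-a-time CSV pass: emit the first two columns of
// each record without ever holding the whole file (or any parse tree)
// in memory. The naive split(',') ignores quoted fields, which a real
// CSV grammar would handle properly.
const rl = readline.createInterface({
  input: fs.createReadStream('input.csv'),
  crlfDelay: Infinity,
});

rl.on('line', (line) => {
  const [first, second] = line.split(',');
  process.stdout.write(`${first},${second}\n`);
});
```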
Just noticed tsouza/yajs, which appears to be aiming for something along the lines I'm proposing. The author has a dependency on antlr4ts, but it doesn't look like it's been integrated.
I understand the desire to avoid having to load the entire input into memory; however, my bigger problem is that antlr4ts is still alpha and there has been no release in a long time. I wouldn't start even more changes before there is a GA version. And I really hope there will be one in the not-too-distant future. Keeping a project in alpha status for years is not a good sign for its liveliness.
Agreed. I'll add a tag for such post-v1 topics.
I'm closing this because I think there's a better way. |
TypeScript/ECMAScript generators can be used for an asynchronous parser based on coroutines without much overhead. The advantage of this is that it disconnects us from the Java stream mechanism, which was designed for environments supporting multiple threads, and allows native Node.js streams to be used to supply input to a parser. (JavaScript doesn't in general support threads, and blocking a thread in Node.js or a browser is bad form.)
For example, the hildjj/node-cbor decoder uses (an embedded version of) binary-parse-stream to make the parser able to function as a Node.js `stream.Transform` stream and support the stream `pipe()` operation, which is the preferred mechanism to implement for file-manipulation tools on the Node platform. While this would possibly require a change to the antlr4ts version 0 API, it seems to me that it wouldn't be conceptually hard to deal with, and it could still be used in a synchronous mode for those situations where the entire input stream is available at the start of parse/tokenize operations.
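That wiring can be sketched generically. Below, a generator-based parser yields a byte count whenever it needs more input, and a `Transform` subclass resumes it as chunks arrive. This is in the spirit of binary-parse-stream, but the names and protocol here are hypothetical, not that library's actual API:

```ts
import { Transform, TransformCallback } from 'stream';

// A parser coroutine: yields the number of bytes it needs next,
// and is resumed with a Buffer holding exactly that many bytes.
type ParserCoroutine = Generator<number, void, Buffer>;

class ParsingTransform extends Transform {
  private parser: ParserCoroutine;
  private buffered = Buffer.alloc(0);
  private needed: number;

  constructor(makeParser: (emit: (obj: unknown) => void) => ParserCoroutine) {
    super({ readableObjectMode: true });
    this.parser = makeParser((obj) => this.push(obj));
    this.needed = this.parser.next().value as number; // run to the first request
  }

  _transform(chunk: Buffer, _enc: BufferEncoding, done: TransformCallback): void {
    this.buffered = Buffer.concat([this.buffered, chunk]);
    // Resume the coroutine for as long as its current request can be met.
    while (this.buffered.length >= this.needed) {
      const bytes = this.buffered.subarray(0, this.needed);
      this.buffered = this.buffered.subarray(this.needed);
      const step = this.parser.next(bytes);
      if (step.done) break; // parser finished
      this.needed = step.value;
    }
    done();
  }
}
```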
Generators can use the `yield` expression to act as coroutines and suspend their operation. In the model I contemplate, `yield` would be used to indicate that the grammar requires more input than is currently available, and the production would be suspended until another `Buffer` of information became available.

For convenience in the synchronous case, the generated recognizer class could have static functions added for top-level productions, optimized for the case where the full parse input is available at the outset (as either a string or `Buffer`). These would throw if the input wasn't complete according to the rules of the production, and `yield` would never be called.

Any thoughts/comments on this would be appreciated. @sharwell, @mike-lischke?
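A minimal sketch of how that contract might look (the shapes below are hypothetical, not the code antlr4ts generates): a production suspends via `yield` until the driver supplies another `Buffer`, and a static-style helper covers the synchronous full-input case by treating any suspension after the input has been delivered as an incomplete-input error.

```ts
// A production is a generator that yields when it has exhausted the
// input it was given, and is resumed with the next Buffer.
type Production<T> = Generator<void, T, Buffer>;

// A toy "production": parse a 4-byte big-endian length followed by
// that many payload bytes, suspending whenever input runs short.
function* lengthPrefixed(): Production<Buffer> {
  let data = Buffer.alloc(0);
  while (data.length < 4) data = Buffer.concat([data, yield]);
  const len = data.readUInt32BE(0);
  while (data.length < 4 + len) data = Buffer.concat([data, yield]);
  return data.subarray(4, 4 + len);
}

// Static convenience for the synchronous case: the whole input is in
// hand, so a second suspension means the input was incomplete.
function parseComplete<T>(production: () => Production<T>, input: Buffer): T {
  const gen = production();
  let step = gen.next();                 // run to the first suspension point
  if (!step.done) step = gen.next(input); // deliver the full input at once
  if (!step.done) throw new Error('input incomplete for this production');
  return step.value;
}

const payload = parseComplete(lengthPrefixed, Buffer.from([0, 0, 0, 2, 0xab, 0xcd]));
```

A streaming driver would instead keep resuming the generator as chunks arrive, as in the `Transform` sketch above.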