Using ECMAScript generators for parse functions #360

Closed

BurtHarris opened this issue May 9, 2018 · 8 comments


@BurtHarris
Collaborator

TypeScript/ECMAScript generators can be used to build an asynchronous, coroutine-based parser without much overhead. The advantage is that it disconnects us from the Java stream mechanism, which was designed for environments that support multiple threads, and allows native Node.js streams to be used to supply input to a parser. (JavaScript doesn't generally support threads, and blocking a thread in Node.js or a browser is bad form.)

For example, the hildjj/node-cbor decoder uses (an embedded version of) binary-parse-stream so that the parser can function as a Node.js stream.Transform stream and support the pipe() operation, which is the preferred mechanism for file-manipulation tools on the Node platform.

While this might require a change to the antlr4ts version 0 API, it doesn't seem conceptually hard to deal with, and the approach could still be used in a synchronous mode for those situations where the entire input stream is available at the start of parse/tokenize operations.

Generators can use the yield expression to act as coroutines and suspend their operation. In the model I contemplate, yield would indicate that a grammar rule requires more input than is currently available, and the production would be suspended until another Buffer of data becomes available.
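
To make the coroutine model concrete, here's a minimal sketch in TypeScript. It is only an illustration, not antlr4ts code: a trivial whitespace "lexer" stands in for a generated recognizer, NEED_INPUT is a hypothetical sentinel, and the driver resumes the generator with each new chunk via next().

const NEED_INPUT = Symbol("need-input");

// A toy "lexer" coroutine: it yields completed tokens, and yields the
// NEED_INPUT sentinel whenever it has consumed all buffered data.
// The caller resumes it with the next chunk (or null at end of input).
function* lex(): Generator<string | typeof NEED_INPUT, void, string | null> {
  let pending = "";
  for (;;) {
    const chunk = yield NEED_INPUT;      // suspend until more input arrives
    if (chunk === null) break;           // end of stream
    pending += chunk;
    const parts = pending.split(/\s+/);
    pending = parts.pop() ?? "";         // retain a possibly-incomplete token
    for (const tok of parts) {
      if (tok.length > 0) yield tok;     // emit each completed token
    }
  }
  if (pending.length > 0) yield pending; // flush the final token
}

// Driver: feed chunks as they become available (e.g. from a stream).
const gen = lex();
gen.next();                              // prime to the first NEED_INPUT
const chunks: Array<string | null> = ["SELECT na", "me FROM t;", null];
for (const chunk of chunks) {
  let r = gen.next(chunk);
  while (!r.done && r.value !== NEED_INPUT) {
    console.log("token:", r.value);      // SELECT, name, FROM, t;
    r = gen.next();                      // pull the next token
  }
}

The same shape would scale to a generated recognizer: the yield points are exactly where the Java runtime would block on its input stream.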

For convenience in the synchronous case, the generated recognizer class could have static functions added for top-level productions, optimized for the case where the full parse input is available up front (as either a string or a Buffer). These would throw if the input wasn't complete according to the rules of the production, and yield would never be called.
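
Here's a hedged sketch of what such a static convenience function might look like, built on the toy lex generator above. In a real generated recognizer, a request for more input after the full text has been supplied is exactly the "input wasn't complete" case that should throw:

// Synchronous entry point over the coroutine: the whole input is
// supplied up front, so the generator should never starve.
function lexAll(input: string): string[] {
  const gen = lex();
  const tokens: string[] = [];
  gen.next();                            // prime to the first NEED_INPUT
  let r = gen.next(input);               // hand over the entire input
  while (!r.done && r.value !== NEED_INPUT) {
    tokens.push(r.value);
    r = gen.next();
  }
  if (!r.done) {
    // The generator asked for more input. A real generated recognizer
    // would throw here (incomplete production); the toy lexer just
    // needs the end-of-input signal to flush its last token.
    r = gen.next(null);
    while (!r.done) {
      if (r.value !== NEED_INPUT) tokens.push(r.value);
      r = gen.next();
    }
  }
  return tokens;
}

console.log(lexAll("SELECT name FROM t;")); // [ 'SELECT', 'name', 'FROM', 't;' ]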

Any thoughts/comments on this would be appreciated. @sharwell, @mike-lischke?

@mike-lischke
Contributor

I haven't worked with JS generators before (but I know a similar concept from C#). They appear to me to be more like syntactic sugar. Yes, I can imagine converting the lexer into a generator, but the work performed is the same, and usage only changes in which function is called (next vs. nextToken), no?

@BurtHarris
Collaborator Author

BurtHarris commented May 10, 2018

You probably understand this, but in ECMAScript, async/await is essentially syntactic sugar over Promises, and transpilers have long implemented it under the hood using generators and yield. But there are other ways to use yield.

Perhaps my initial description focused too much on the low-level details of generators. At the highest level, this approach enables the non-blocking Node.js streams programming model, which allows simple tools to be built as pipelines of readable -> transform -> transform -> ... -> writable. Here's a snippet (from SitePoint) illustrating an app that uses a gunzip transform.

const fs = require('fs');
const zlib = require('zlib');

fs.createReadStream('input.txt.gz')
  .pipe(zlib.createGunzip())
  .pipe(fs.createWriteStream('output.txt'));

// node will automatically exit after the pipeline completes...

Transforms can use object mode to convert text or byte streams (like those produced by the fs module) into objects for processing, and then (later in the pipeline) back into text or bytes for storage. (Somewhat PowerShell-like.)

For ANTLR-based transforms, grammar-specific actions (or events) would pass objects to the next stage of the pipeline via the push() function (following the streams naming convention).
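
A hedged sketch of that wiring in TypeScript (the class name and the injected recognize() function are hypothetical stand-ins for the generator-driven parser described above):

import { Transform, TransformCallback } from "stream";

class ParseTransform extends Transform {
  // `recognize` stands in for a grammar-specific recognizer that turns
  // a chunk of bytes into zero or more application objects.
  constructor(private recognize: (chunk: Buffer) => object[]) {
    super({ readableObjectMode: true }); // bytes in, objects out
  }

  _transform(chunk: Buffer, _enc: BufferEncoding, cb: TransformCallback): void {
    try {
      for (const obj of this.recognize(chunk)) {
        this.push(obj);                  // hand each object to the next stage
      }
      cb();
    } catch (err) {
      cb(err as Error);
    }
  }
}

A downstream stage then receives those objects through the usual 'data' events or another pipe().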

There is a necessary condition for this to work efficiently on large streams: the parsing done by an ANTLR-based transform must be async, in effect continuable when more data arrives. This makes stream-based applications quite conservative of memory (compared to a read-the-whole-file, then parse, then walk-the-parse-tree approach).

Using generators for the generated recognizers gives continuation capability (like awaiting a Promise), but yield is more general in that it can be performed multiple times (for a stream-like effect), while Promises are one-shot.
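
To make the one-shot vs. multi-shot contrast concrete:

// A Promise settles exactly once...
const once: Promise<number> = Promise.resolve(42);
once.then((n) => console.log(n));        // 42, and never anything else

// ...while a generator can suspend and deliver a value at every yield.
function* many(): Generator<number> {
  yield 1;
  yield 2;
  yield 3;
}
for (const n of many()) console.log(n);  // 1, 2, 3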

@mike-lischke
Contributor

Hmm, would that work with the ALL(*) algorithm and its potentially unlimited lookahead? Additionally, we need random seek within certain intervals. That doesn't work well with simple pipes.

@BurtHarris
Collaborator Author

BurtHarris commented May 16, 2018

Potentially unlimited lookahead: I'd expect the actual lookahead in most cases to be very limited. (Performance would suffer otherwise, right?) I would expect to have something equivalent to the mark()/release() mechanism of IntStream available to keep track of how much lookahead is in use. If we get into a situation where a mark hasn't been released, the memory requirements would obviously grow, but once a mark has been released, we shouldn't need to seek() back before it.
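
Here's a hedged sketch of that buffering policy (the class and its simplified mark()/release() semantics are illustrative, not the antlr4ts IntStream API): everything before the lowest outstanding mark, or before the current position when no marks are live, can be discarded, and seeking back past it is an error.

class BufferedCharStream {
  private data = "";
  private base = 0;                  // absolute index of data[0]
  private pos = 0;                   // absolute read position
  private marks: number[] = [];      // absolute positions of live marks

  append(chunk: string): void { this.data += chunk; }

  next(): string | undefined {
    const ch = this.data[this.pos - this.base];
    if (ch !== undefined) this.pos++;
    return ch;
  }

  mark(): number { this.marks.push(this.pos); return this.pos; }

  release(m: number): void {
    const i = this.marks.indexOf(m);
    if (i >= 0) this.marks.splice(i, 1);
    this.trim();
  }

  seek(to: number): void {
    if (to < this.base) throw new RangeError("seek before retained data");
    this.pos = to;
  }

  private trim(): void {
    // Everything before the lowest live mark (or the current position,
    // if no marks are outstanding) can never be revisited: discard it.
    const low = Math.min(this.pos, ...this.marks);
    if (low > this.base) {
      this.data = this.data.slice(low - this.base);
      this.base = low;
    }
  }
}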

But consider parsing something simple but potentially large, like a CSV file, perhaps outputting a subset of the columns on a line-by-line basis. There's really no reason to read the whole file before starting, or to build a parse tree, when a line at a time is probably a reasonable way to process it.
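
For illustration, here's how that CSV case might look as a hand-written object-mode transform (not ANTLR-generated; the file name and column choices are made up), emitting the selected columns of each row as soon as the line is complete:

import { Transform } from "stream";
import * as fs from "fs";

// Toy illustration: split incoming text into CSV rows and push only the
// selected columns, one array per line, without ever holding the whole
// file in memory.
function csvPick(columns: number[]): Transform {
  let tail = "";                            // partial line carried between chunks
  return new Transform({
    readableObjectMode: true,               // bytes in, row arrays out
    transform(chunk: Buffer, _enc, cb) {
      const lines = (tail + chunk.toString("utf8")).split("\n");
      tail = lines.pop() ?? "";             // keep the incomplete last line
      for (const line of lines) {
        const cells = line.split(",");
        this.push(columns.map((c) => cells[c]));
      }
      cb();
    },
    flush(cb) {
      if (tail.length > 0) {
        const cells = tail.split(",");
        this.push(columns.map((c) => cells[c]));
      }
      cb();
    },
  });
}

fs.createReadStream("data.csv")             // hypothetical input file
  .pipe(csvPick([0, 2]))                    // emit columns 0 and 2 of each row
  .on("data", (row) => console.log(row));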

@BurtHarris
Collaborator Author

Just noticed tsouza/yajs, which appears to be looking for something along the lines I'm proposing. The author has a dependency on antlr4ts, but it doesn't look like it has been integrated.

@mike-lischke
Contributor

mike-lischke commented May 16, 2018

I understand the desire to avoid loading the entire input into memory; however, my bigger concern is that antlr4ts is still alpha and there hasn't been a release in a long time. I wouldn't start even more changes before there is a GA version first. And I really hope there will be one in the not-too-distant future. Keeping a project in an alpha stage for years is not a good sign for its liveliness.

@BurtHarris
Collaborator Author

Agreed. I'll add a tag for such post-v1 topics.

@BurtHarris
Collaborator Author

I'm closing this because I think there's a better way.
