Using ECMAScript generators for parse functions #360

Closed

BurtHarris opened this issue May 9, 2018 · 8 comments


@BurtHarris
Collaborator

TypeScript/ECMAScript generators can be used to build an asynchronous, coroutine-based parser without much overhead. The advantage is that it disconnects us from the Java stream mechanism, which was designed for environments that support multiple threads, and allows native Node.js streams to be used to supply input to a parser. (JavaScript doesn't generally support threads, and blocking a thread in Node.js or a browser is bad form.)

For example, the hildjj/node-cbor decoder uses (an embedded version of) binary-parse-stream so that the parser can function as a Node.js stream.Transform stream and support the pipe() operation, which is the preferred mechanism for file-manipulation tools on the Node platform.

While this might require a change to the antlr4ts version 0 API, it doesn't seem conceptually hard to deal with, and the approach could still be used in a synchronous mode for those situations where the entire input stream is available at the start of parse/tokenize operations.

Generators can use the yield expression to act as coroutines and suspend their operation. In the model I contemplate, yield would indicate that a grammar rule requires more input than is currently available, and the production would be suspended until another Buffer of data becomes available.
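
To make the coroutine model concrete, here's a minimal sketch in TypeScript. It is only an illustration, not antlr4ts code: a trivial whitespace "lexer" stands in for a generated recognizer, NEED_INPUT is a hypothetical sentinel, and the driver resumes the generator with each new chunk via next().

const NEED_INPUT = Symbol("need-input");

// A toy "lexer" coroutine: it yields completed tokens, and yields the
// NEED_INPUT sentinel whenever it has consumed all buffered data.
// The caller resumes it with the next chunk (or null at end of input).
function* lex(): Generator<string | typeof NEED_INPUT, void, string | null> {
  let pending = "";
  for (;;) {
    const chunk = yield NEED_INPUT;      // suspend until more input arrives
    if (chunk === null) break;           // end of stream
    pending += chunk;
    const parts = pending.split(/\s+/);
    pending = parts.pop() ?? "";         // retain a possibly-incomplete token
    for (const tok of parts) {
      if (tok.length > 0) yield tok;     // emit each completed token
    }
  }
  if (pending.length > 0) yield pending; // flush the final token
}

// Driver: feed chunks as they become available (e.g. from a stream).
const gen = lex();
gen.next();                              // prime to the first NEED_INPUT
const chunks: Array<string | null> = ["SELECT na", "me FROM t;", null];
for (const chunk of chunks) {
  let r = gen.next(chunk);
  while (!r.done && r.value !== NEED_INPUT) {
    console.log("token:", r.value);      // SELECT, name, FROM, t;
    r = gen.next();                      // pull the next token
  }
}

The same shape would scale to a generated recognizer: the yield points are exactly where the Java runtime would block on its input stream.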

For convenience in the synchronous case, the generated recognizer class could have static functions added for top-level productions, optimized for the case where the full parse input is available up front (as either a string or a Buffer). These would throw if the input wasn't complete according to the rules of the production, and yield would never be called.
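
Here's a hedged sketch of what such a static convenience function might look like, built on the toy lex generator above. In a real generated recognizer, a request for more input after the full text has been supplied is exactly the "input wasn't complete" case that should throw:

// Synchronous entry point over the coroutine: the whole input is
// supplied up front, so the generator should never starve.
function lexAll(input: string): string[] {
  const gen = lex();
  const tokens: string[] = [];
  gen.next();                            // prime to the first NEED_INPUT
  let r = gen.next(input);               // hand over the entire input
  while (!r.done && r.value !== NEED_INPUT) {
    tokens.push(r.value);
    r = gen.next();
  }
  if (!r.done) {
    // The generator asked for more input. A real generated recognizer
    // would throw here (incomplete production); the toy lexer just
    // needs the end-of-input signal to flush its last token.
    r = gen.next(null);
    while (!r.done) {
      if (r.value !== NEED_INPUT) tokens.push(r.value);
      r = gen.next();
    }
  }
  return tokens;
}

console.log(lexAll("SELECT name FROM t;")); // [ 'SELECT', 'name', 'FROM', 't;' ]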

Any thoughts/comments on this would be appreciated. @sharwell, @mike-lischke?

@mike-lischke
Contributor

I haven't worked with JS generators before (but I know a similar concept from C#). They appear to me to be more like syntactic sugar. Yes, I can imagine converting the lexer into a generator, but the work performed is the same, and usage only changes in which function is called (next vs. nextToken), no?

@BurtHarris
Collaborator Author

BurtHarris commented May 10, 2018

You probably understand this, but in ECMAScript, async/await is essentially syntactic sugar over Promises, and transpilers have long implemented it under the hood using generators and yield. But there are other ways to use yield.

Perhaps my initial description focused too much on the low-level details of generators. At the highest level, this approach enables the non-blocking Node.js streams programming model, which allows simple tools to be built as pipelines of readable -> transform -> transform -> ... -> writable. Here's a snippet (from SitePoint) illustrating an app that uses a gunzip transform.

const fs = require('fs');
const zlib = require('zlib');

fs.createReadStream('input.txt.gz')
  .pipe(zlib.createGunzip())
  .pipe(fs.createWriteStream('output.txt'));

// node will automatically exit after the pipeline completes...

Transforms can use object mode to convert text or byte streams (like those produced by the fs module) into objects for processing, and then (later in the pipeline) back into text or bytes for storage. (Somewhat PowerShell-like.)

For ANTLR-based transforms, grammar-specific actions (or events) would pass objects to the next stage of the pipeline via the push() function (following the streams naming convention).
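
A hedged sketch of that wiring in TypeScript (the class name and the injected recognize() function are hypothetical stand-ins for the generator-driven parser described above):

import { Transform, TransformCallback } from "stream";

class ParseTransform extends Transform {
  // `recognize` stands in for a grammar-specific recognizer that turns
  // a chunk of bytes into zero or more application objects.
  constructor(private recognize: (chunk: Buffer) => object[]) {
    super({ readableObjectMode: true }); // bytes in, objects out
  }

  _transform(chunk: Buffer, _enc: BufferEncoding, cb: TransformCallback): void {
    try {
      for (const obj of this.recognize(chunk)) {
        this.push(obj);                  // hand each object to the next stage
      }
      cb();
    } catch (err) {
      cb(err as Error);
    }
  }
}

A downstream stage then receives those objects through the usual 'data' events or another pipe().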

There is a necessary condition for this to work efficiently on large streams: the parsing done by an ANTLR-based transform must be async, in effect continuable when more data arrives. This makes stream-based applications quite conservative of memory (compared to a read-the-whole-file, then parse, then walk-the-parse-tree approach).

Using generators for the generated recognizers gives continuation capability (like awaiting a Promise), but yield is more general in that it can be performed multiple times (for a stream-like effect), while Promises are one-shot.
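
To make the one-shot vs. multi-shot contrast concrete:

// A Promise settles exactly once...
const once: Promise<number> = Promise.resolve(42);
once.then((n) => console.log(n));        // 42, and never anything else

// ...while a generator can suspend and deliver a value at every yield.
function* many(): Generator<number> {
  yield 1;
  yield 2;
  yield 3;
}
for (const n of many()) console.log(n);  // 1, 2, 3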

@mike-lischke
Contributor

Hmm, would that work with the ALL(*) algorithm and its potentially unlimited lookahead? Additionally, we need random seek within certain intervals. That doesn't work well with simple pipes.

@BurtHarris
Collaborator Author

BurtHarris commented May 16, 2018

Potentially unlimited lookahead: I'd expect the actual lookahead in most cases to be very limited. (Performance would suffer otherwise, right?) I would expect to have something equivalent to the mark()/release() mechanism of IntStream available to keep track of how much lookahead is in use. If we get into a situation where a mark hasn't been released, the memory requirements would obviously grow, but once a mark has been released, we shouldn't need to seek() back before it.
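
Here's a hedged sketch of that buffering policy (the class and its simplified mark()/release() semantics are illustrative, not the antlr4ts IntStream API): everything before the lowest outstanding mark, or before the current position when no marks are live, can be discarded, and seeking back past it is an error.

class BufferedCharStream {
  private data = "";
  private base = 0;                  // absolute index of data[0]
  private pos = 0;                   // absolute read position
  private marks: number[] = [];      // absolute positions of live marks

  append(chunk: string): void { this.data += chunk; }

  next(): string | undefined {
    const ch = this.data[this.pos - this.base];
    if (ch !== undefined) this.pos++;
    return ch;
  }

  mark(): number { this.marks.push(this.pos); return this.pos; }

  release(m: number): void {
    const i = this.marks.indexOf(m);
    if (i >= 0) this.marks.splice(i, 1);
    this.trim();
  }

  seek(to: number): void {
    if (to < this.base) throw new RangeError("seek before retained data");
    this.pos = to;
  }

  private trim(): void {
    // Everything before the lowest live mark (or the current position,
    // if no marks are outstanding) can never be revisited: discard it.
    const low = Math.min(this.pos, ...this.marks);
    if (low > this.base) {
      this.data = this.data.slice(low - this.base);
      this.base = low;
    }
  }
}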

But consider parsing something simple but potentially large, like a CSV file, perhaps outputting a subset of the columns on a line-by-line basis. There's really no reason to read the whole file before starting, or to build a parse tree, when a line at a time is probably a reasonable way to process it.
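
For illustration, here's how that CSV case might look as a hand-written object-mode transform (not ANTLR-generated; the file name and column choices are made up), emitting the selected columns of each row as soon as the line is complete:

import { Transform } from "stream";
import * as fs from "fs";

// Toy illustration: split incoming text into CSV rows and push only the
// selected columns, one array per line, without ever holding the whole
// file in memory.
function csvPick(columns: number[]): Transform {
  let tail = "";                            // partial line carried between chunks
  return new Transform({
    readableObjectMode: true,               // bytes in, row arrays out
    transform(chunk: Buffer, _enc, cb) {
      const lines = (tail + chunk.toString("utf8")).split("\n");
      tail = lines.pop() ?? "";             // keep the incomplete last line
      for (const line of lines) {
        const cells = line.split(",");
        this.push(columns.map((c) => cells[c]));
      }
      cb();
    },
    flush(cb) {
      if (tail.length > 0) {
        const cells = tail.split(",");
        this.push(columns.map((c) => cells[c]));
      }
      cb();
    },
  });
}

fs.createReadStream("data.csv")             // hypothetical input file
  .pipe(csvPick([0, 2]))                    // emit columns 0 and 2 of each row
  .on("data", (row) => console.log(row));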

@BurtHarris
Collaborator Author

Just noticed tsouza/yajs, which appears to be looking for something along the lines I'm proposing. The author has a dependency on antlr4ts, but it doesn't look like it has been integrated.

@mike-lischke
Contributor

mike-lischke commented May 16, 2018

I understand the desire to avoid loading the entire input into memory; however, my bigger concern is that antlr4ts is still alpha and there hasn't been a release in a long time. I wouldn't start even more changes before there is a GA version first. And I really hope there will be one in the not-too-distant future. Keeping a project in an alpha stage for years is not a good sign for its liveliness.

@BurtHarris
Collaborator Author

Agreed. I'll add a tag for such post-v1 topics.

@BurtHarris
Collaborator Author

I'm closing this because I think there's a better way.
