---
True streaming parsers are an open problem in parsing theory because reads against streams/pipes/sockets/etc. are stateful.
---
While useful, general "stream" support in PEGs is not feasible, as mentioned: PEGs as a general construct allow unbounded backtracking. That said, there are several easy-to-implement workarounds for implementing a protocol with PEGs that involve defining a message-delimiter pattern, as sketched below.
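For concreteness, here is a minimal sketch of that delimiter workaround, assuming a "\n"-delimited protocol; `stream` and `handle` are placeholder names, not part of any existing API:

```janet
(def message-peg (peg/compile ~(* (<- (to "\n")) "\n")))

(defn run-messages [stream handle]
  (def buf @"")
  (forever
    # block until more bytes arrive; nil means EOF
    (unless (:read stream 1024 buf) (break))
    # drain every complete delimited message currently in the buffer
    (var m (peg/match message-peg buf))
    (while m
      (handle (first m))
      # drop the consumed payload plus its delimiter from the buffer
      (def tail (buffer/slice buf (inc (length (first m)))))
      (buffer/clear buf)
      (buffer/push-string buf tail)
      (set m (peg/match message-peg buf)))))
```

The PEG itself only ever sees a complete message, so backtracking stays bounded; all the stream statefulness lives in the buffering loop around it.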
As @CosmicToast has pointed out, there are fundamental issues with this method of abstraction. Also, using the event loop inside the PEG interpreter is not code that I want to write or deal with. Hardly "simple". I think a function like
---
Hi! I've been writing a Janet library that simulates mouse and keyboard events, to learn the language. During development I've found some things that could (in my opinion) be improved in Janet, and I thought it would be a good idea to write them down so I can submit some issues/PRs to discuss with the community. This is one of them.
One of the things Janet excels at is shell scripting and everything related.
Yet there's a common problem I haven't found an existing solution for - processing data from long-running processes, aka processing Janet streams.
The problem
How to read and process a stream line-by-line?
This is needed when processing output from os/spawn, parsing HTTP requests with Keep-Alive, implementing a simple getline for streams (e.g. netrepl), etc.
This is not necessarily about lines - any protocol / data format which doesn't specify packet length in advance will have similar problems when being read from a stream - JSON, HTTP, netstring, HTML/XML, etc.
What doesn't work:
(:read s :all) and then (string/split) - this is usually impossible because the stream never reaches EOF: e.g. an HTTP connection with Keep-Alive, REPL input, or long-lived processes (tail -f logs, tcpdump).
How it is done right now:
spork/http/http-header (link)
Very verbose and imperative - it needs a buffer that outlives the function, a last-index variable, a forever loop, and a ret variable.
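Condensed, the pattern looks roughly like this (a sketch, not the actual spork source; `header-peg` stands in for the real header grammar):

```janet
(defn read-header [conn buf]
  (var ret nil)
  (forever
    # try to parse everything received so far
    (when-let [res (peg/match header-peg buf)]
      (set ret res)
      (break))
    # not enough data yet: block for more bytes, bail out on EOF
    (unless (:read conn 1024 buf) (break)))
  ret)
```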
How others do it:
Janet parser - parser/consume and parser/produce
This works, but it's too verbose for such a simple (and common) thing (see the sketch after this list).
spork/netrepl
Uses messages from spork/msg with a length prefix to avoid the problem altogether. Very nice, but not always possible.
spork/getline
Reads data byte-by-byte. Works for a REPL, but is too slow for parsing lots of data.
Clojure
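For reference, the Janet parser approach mentioned above looks roughly like this (a sketch; `chunk` and `handle-form` are placeholders):

```janet
(def p (parser/new))

(defn feed
  "Push one chunk of input and handle every form it completes."
  [chunk handle-form]
  (parser/consume p chunk)
  (while (parser/has-more p)
    (handle-form (parser/produce p))))
```

This works when the stream carries Janet forms, but every call site has to manage the parser state and the drain loop by hand, which is the verbosity complained about above.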
Proposed solution - stream support in PEGs
Using PEGs would be awesome:
For example, the spork/http/read-header function would shrink from 24 lines to (peg/match peg conn). Right now the function uses peg/match anyway, but it's surrounded with lots of code to convert conn into a large enough buf.
Reading a line would be (to "\n").
PEGs already have a concept of consuming characters, so a PEG could work on a buffer like usual, and just call (:read stream 64 buf) whenever the buffer is exhausted but the PEG wants more characters.
Buffering
The biggest problem with all this is that a PEG could :read more bytes from a stream than it had to, and those bytes would be lost when the stream is used afterwards.
This could use some form of "unread" functionality, but I'm not sure what the best way to implement it is. Maybe a stream could hold an internal buffer to allow unreading up to the last read's size? That seems to be enough for almost all use cases.
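The unread idea could be prototyped in userspace before touching the stream type itself; a minimal sketch, with all names hypothetical:

```janet
(defn pushback-stream
  "Pair a stream with a pending buffer of unread bytes."
  [stream]
  @{:stream stream :pending @""})

(defn pb-read
  "Like (:read stream n buf), but serve previously unread bytes first."
  [pb n buf]
  (def pending (pb :pending))
  (if (empty? pending)
    (:read (pb :stream) n buf)
    (do
      (buffer/push-string buf pending)
      (buffer/clear pending)
      buf)))

(defn pb-unread
  "Give bytes back; the next pb-read returns them before new data."
  [pb bytes]
  (buffer/push-string (pb :pending) bytes))
```

A PEG engine with stream support could do the same internally: on finishing a match it would unread whatever it had over-consumed.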
Another way is to require the user to pass an external buffer that will be used for this. This is simpler to implement, but less elegant, and it requires a different API - (peg/match peg stream) would not be possible anymore.
Internal buffer (unread)
Pros:
- (peg/match peg stream) just works; no new API surface
Cons:
- every stream has to carry an internal buffer, complicating the stream type itself
External buffer
Pros:
- simpler to implement; streams stay unchanged
Cons:
- less elegant; requires a different API
I suggest using an external buffer and a function signature similar to this:
(peg/match peg buf &opt stream-to-get-extra-characters)
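Under that signature, usage might look like this (hypothetical; peg/match does not accept a trailing stream argument today):

```janet
(def buf @"")   # outlives the calls; holds bytes read but not yet consumed
(def line-peg (peg/compile ~(* (<- (to "\n")) "\n")))

(defn next-line [conn]
  # the engine would :read from conn into buf as needed, and leave any
  # over-read bytes in buf for the next call
  (if-let [m (peg/match line-peg buf conn)]
    (first m)))
```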