Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add a new (sub) PEG special #1344

Merged
merged 1 commit into from
Jan 1, 2024
Merged

Conversation

ianthehenry
Copy link
Contributor

@ianthehenry ianthehenry commented Dec 27, 2023

The first commit in this branch condenses the two of the existing opcodes into a single opcode, so that there are still exactly 32 PEG opcodes. I asked about this on Zulip, and as I said there I'm still not very confident that this code is necessary. It does remove a conditional branch on the main bytecode loop, but I worry there might be some 32-bitness thing that I don't understand that might make it even more costly to remove.

It then proposes two one new specials:

sub

(sub window patt) executes the window pattern, remembers how many bytes it matches, then executes patt over exactly the bytes matched by the window. This allows you to limit what the subpattern can match, e.g. (sub (to "\n") :foo) ensures that :foo does not have a chance to match beyond the current line. If patt also matches, then the full window is consumed by the (sub) rule.

Another good name for this might be limit, as in (limit (to "\n") patt), but this doesn't really communicate that the full matched text is equal to the window pattern.

On the surface this is similar to (cmt (% ...) ,|(peg/match ... $)), but differs in several ways: the subpattern in sub can still refer to :other patterns inside struct rules, the position and other helpers are still relative to the whole input, the pattern can leave multiple captures...


sep

I'll move this into its own PR, but I'm preserving the original description so the discussion still makes sense.

(sep separator patt) is a combination of sub, to, and thru. The canonical example would be something like ~(sep "," :w*) to match comma-separated words.

It's similar to ~(some (* (sub (to (+ ,separator -1)) ,patt) (? ,separator))), except that it will not consume terminating separators: the final (? ,separator) rule is conditional on the next instance of the subpattern matching. A more accurate translation would be:

(defn sep [separator patt]
  ~(? (sub (to (+ ,separator -1)) ,patt) (any (* ,separator (sub (to (+ ,separator -1)) ,patt)))))

Which is pretty unwieldy. I think this is a common enough pattern to deserve an optimized helper -- it compiles to a lot less bytecode and doesn't match separators twice.

Since sep repeats, it might be useful to have separte forms like sep-any and sep-some.

Neither of these combinators are really necessary, as you can do a one-pass thing that ensures that your inner patterns never advance past a separator (which is necessary in the case of a real csv parser that has to handle escapes). But they're convenient for lots of simple ad-hoc parsing tasks (hello from advent of code).

@bakpakin
Copy link
Member

The fact that there are 32 opcodes is entirely coincidental and I was not even aware of this. I suppose this could result in tighter code generation but I doubt it does.

@bakpakin
Copy link
Member

The sub pattern looks interesting; it's similar to the lookahead combinator but requiring that the pattern is inside the window is useful.

@ianthehenry ianthehenry force-pushed the peg-sub-special branch 2 times, most recently from 909b701 to dfb527b Compare December 28, 2023 03:35
@ianthehenry
Copy link
Contributor Author

Well that simplifies things! Yeah I couldn't work out why the bitmask was there so I didn't want to remove it. But I pulled that commit out.

Potential idea for distinguishing sep-some vs sep-any: an optional "miminum matches" argument. So (sep "," :w+ 0) for any, (sep "," :w+ 1) for some. I doubt you'd really use anything else, but (sep "," :w+ 10) would theoretically work... basically works like at-least.

Another idea is that sep should always require at least one match, and if you want sep-any you can just make it optional. Although that's... hmm. Don't love it.

My instinct is sep and sep+, using the regex sigils, but that's sort of inconsistent with PEGs...

@pepe
Copy link
Member

pepe commented Dec 28, 2023

Unfortunately, I cannot add anything meaningful to the implementation debate. Yet both specials proposed here make sense to me. In my current project, I am using PEG extensively for parsing small user inputs, and at least sub will be very handy.

@sogaiu
Copy link
Contributor

sogaiu commented Dec 28, 2023

Interested in sub (and am in the midst of implementing it in margaret).

I've not happened to have noticed a case of wanting a sep-like thing (perhaps just not recognizing opportunities), though it seemed a number of folks were discussing an AOC thing this year that might have benefited?

I wonder if there's some value in adding just one special at first (e.g. sub) and seeing how it works out in combination with the existing system and then after a bit of exercising, revisiting the other one. May be that's unnecessary.


For sub, perhaps a test involving error might be relevant:

  (try
    (peg/match ~(sequence "a"
                          (sub "bcd" (error "bc")))
               "abcdef")
    ([e] e))
  # =>
  "match error at line 1, column 2"

as it makes use of the line and column functionality and the PR touches get_linecol_from_position which is used from the error special.

@pepe
Copy link
Member

pepe commented Dec 28, 2023

@sogaiu I have the same feeling about sep, but the arguments in the description of this PR make it a little bit easier to believe for me :-). Anyway, I like the idea of starting with sub; if everything is correct, consider sep.

@sogaiu
Copy link
Contributor

sogaiu commented Dec 28, 2023

FWIW, in addition to the tests for sub contained in this PR, I've added some additional tests for the sub implementation in margaret. I generated the expected values via a janet compiled from this commit of this PR.

For the interested, they live here.

@ianthehenry
Copy link
Contributor Author

ianthehenry commented Dec 28, 2023

@sogaiu there is a test showing the behavior of the line/col/position helpers, which I think is sufficient to cover the error case as well (if I understand correctly).

Regarding sep, I think it does make sense to open a separate PR for that. I'm not sure if the behavior here is exactly right -- really I think of sep as a way to run string/split inside a PEG. Except that the implementation here was not exactly that -- it's sort of "string split for as long as you can but stop as soon as something fails to match." Which is maybe less useful than something that requires that all split substrings match exactly.

I just rebased this branch and updated the description. There is another high-level sub-inspired combinator that I think would be useful, I might open a separate PR for that and sep/split if this gets merged.

@ianthehenry ianthehenry changed the title Add two new PEG specials, (sub) and (sep) Add a new PEG specials (sub) Dec 28, 2023
@ianthehenry ianthehenry changed the title Add a new PEG specials (sub) Add a new (sub) PEG special Dec 28, 2023
@sogaiu
Copy link
Contributor

sogaiu commented Dec 29, 2023

@ianthehenry The latest commit looks good to me 👍

Probably it wasn't necessary but I also verified against the additional tests I mentioned earlier.

So far my impression is that sub is working well with other bits -- have my fingers crossed that remains the case and that we can see additional goodies you have in mind ;)

(sub) will first match one pattern, then match another pattern against the
text that the first pattern advanced over.
@ianthehenry
Copy link
Contributor Author

ianthehenry commented Dec 29, 2023

I found an issue with PEG unmarshaling and I added a previously-failing test for it -- the bytecode verifier was also doing the & 0x1f bitmasking, and I failed to remove that when I rebased out the 32-opcodes-only commit.

@bakpakin
Copy link
Member

bakpakin commented Jan 1, 2024

LGTM.

@bakpakin bakpakin merged commit e8ed961 into janet-lang:master Jan 1, 2024
8 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants