Add a new (sub) PEG special #1344
4a0798e to c4629ab
The fact that there are 32 opcodes is entirely coincidental and I was not even aware of this. I suppose this could result in tighter code generation but I doubt it does.
The sub pattern looks interesting; it's similar to the lookahead combinator but requiring that the pattern is inside the window is useful.
909b701 to dfb527b
Well that simplifies things! Yeah I couldn't work out why the bitmask was there so I didn't want to remove it. But I pulled that commit out. Potential idea for distinguishing Another idea is that My instinct is |
Unfortunately, I cannot add anything meaningful to the implementation debate. Yet both specials proposed here make sense to me. In my current project, I am using PEG extensively for parsing small user inputs, and at least |
Interested in I've not happened to have noticed a case of wanting a I wonder if there's some value in adding just one special at first (e.g. For
(try
  (peg/match ~(sequence "a"
                        (sub "bcd" (error "bc")))
             "abcdef")
  ([e] e))
# =>
"match error at line 1, column 2"
as it makes use of the line and column functionality and the PR touches |
@sogaiu I have the same feeling about |
FWIW, in addition to the tests for For the interested, they live here. |
@sogaiu there is a test showing the behavior of the Regarding I just rebased this branch and updated the description. There is another high-level |
dfb527b to 4503ca2
@ianthehenry The latest commit looks good to me 👍 Probably it wasn't necessary but I also verified against the additional tests I mentioned earlier. So far my impression is that |
(sub) will first match one pattern, then match another pattern against the text that the first pattern advanced over.
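A minimal sketch of that behavior, assuming a Janet build that includes the (sub) special from this PR. The binding names are illustrative only and do not come from the PR.

```janet
# Sketch only: needs a Janet build with the (sub) PEG special.
# Binding names below are illustrative, not part of the PR.

# The window "abcd" matches first; the subpattern then runs over
# just those four bytes and captures "abc".
(def captured
  (peg/match ~(sub "abcd" (capture "abc")) "abcdef"))

# The subpattern sees end-of-input at the end of the window,
# so it cannot match "abcde" even though the full text contains it.
(def too-long
  (peg/match ~(sub "abcd" "abcde") "abcdef"))

# On success (sub) advances past the whole window, not just past
# what the subpattern consumed, so "ef" can be matched next.
(def advanced
  (peg/match ~(sequence (sub "abcd" "ab") (capture "ef")) "abcdef"))

(pp captured)
(pp too-long)
(pp advanced)
```

If the special behaves as described, the first match should yield a single capture "abc", the second should fail with nil, and the third should capture "ef".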
4503ca2 to ea75086
I found an issue with PEG unmarshaling and I added a previously-failing test for it -- the bytecode verifier was also doing the |
LGTM.
The first commit in this branch condenses two of the existing opcodes into a single opcode, so that there are still exactly 32 PEG opcodes. I asked about this on Zulip, and as I said there I'm still not very confident that this code is necessary. It does remove a conditional branch on the main bytecode loop, but I worry there might be some 32-bitness thing that I don't understand that might make it even more costly to remove.

It then proposes ~~two~~ one new specials:

sub
(sub window patt) executes the window pattern, remembers how many bytes it matches, then executes patt over exactly the bytes matched by the window. This allows you to limit what the subpattern can match, e.g. (sub (to "\n") :foo) ensures that :foo does not have a chance to match beyond the current line. If patt also matches, then the full window is consumed by the (sub) rule.

Another good name for this might be limit, as in (limit (to "\n") patt), but this doesn't really communicate that the full matched text is equal to the window pattern.

On the surface this is similar to (cmt (% ...) ,|(peg/match ... $)), but differs in several ways: the subpattern in sub can still refer to :other patterns inside struct rules, the position and other helpers are still relative to the whole input, the pattern can leave multiple captures...

sep
(sep separator patt) is a combination of sub, to, and thru. The canonical example would be something like ~(sep "," :w*) to match comma-separated words.

It's similar to ~(some (* (sub (to (+ ,separator -1)) ,patt) (? ,separator))), except that it will not consume terminating separators: the final (? ,separator) rule is conditional on the next instance of the subpattern matching. A more accurate translation would be:

Which is pretty unwieldy. I think this is a common enough pattern to deserve an optimized helper -- it compiles to a lot less bytecode and doesn't match separators twice.

Since sep repeats, it might be useful to have separate forms like sep-any and sep-some.

Neither of these combinators is really necessary, as you can do a one-pass thing that ensures that your inner patterns never advance past a separator (which is necessary in the case of a real csv parser that has to handle escapes). But they're convenient for lots of simple ad-hoc parsing tasks (hello from advent of code).
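To make the description concrete, here is a small sketch of the line-limited window and of the (some (* (sub (to ...)) (? ...))) shape quoted above, written out for "," and :w*. It assumes a Janet build with the (sub) special; unbounded-peg, line-peg and words-peg are hypothetical names chosen for illustration, and only sub and to are used, not sep.

```janet
# Sketch only: needs a Janet build with the (sub) PEG special.
# unbounded-peg, line-peg and words-peg are illustrative names.

# Without a window, (any 1) runs to the end of the whole input.
(def unbounded-peg ~(capture (any 1)))

# With (to "\n") as the window, the same subpattern stops at the
# end of the current line.
(def line-peg ~(sub (to "\n") (capture (any 1))))

# The (some (* (sub (to (+ sep -1)) patt) (? sep))) shape from the
# description, spelled out for "," and :w*.
(def words-peg
  ~(some (sequence (sub (to (choice "," -1)) (capture :w*))
                   (opt ","))))

(pp (peg/match unbounded-peg "foo bar\nbaz"))
(pp (peg/match line-peg "foo bar\nbaz"))
(pp (peg/match words-peg "a,bc,d"))
```

The first match should capture the whole input including the newline, the second only "foo bar", and the third the three words "a", "bc" and "d". As the description notes, this expansion matches each separator twice (once in the window, once in the optional tail), which is the cost a dedicated helper would avoid.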