-
Notifications
You must be signed in to change notification settings - Fork 195
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Implement Perl extended character classes #553
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice patch! The precedence is implemented with the tight/loose thing?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The precedence is implemented with the tight/loose thing?
That's right. It's not mega-efficient but it's very robust, and it's accurate to Perl's behaviour. In a classic recursive descent parser, there's one separate parse function for each level of precedence in the grammar.
I will do the optimised parser in a PR after this one. I decided to make it all tested and feature-complete first.
Note! I haven't written the automated tests yet...
I have had a play with some invented Perl syntax tests and it all worked. Good work. Happy to help with documentation if needed. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
obviously missing tests, but assume that is why it is a draft?
not sure how "temporal" some of the "//" comments are either (assume to be fixed before an RC though)
Thank @PhilipHazel! What would you like to be added? I have put in a little section on the Perl syntax, inside |
5385503
to
5c714ca
Compare
That's right, just tests remaining.
All the "// TODO" comments will be merged with the PR, and addressed in a follow-up. All the "TODO" comments that will be merged have a GitHub ticket linked for the follow-up work. I have just addressed the one "// XXX" comment. |
@PhilipHazel There's a small design choice around one of Perl's strictness options. In Perl, an extended class
I haven't applied these restrictions inside my Perl extended implementation. They don't really seem necessary, certainly not for Perl compatibility. However, we could do them if we feel that users ought to be able to copy a working |
Documentation: pcre2syntax.3 needs updating - I am happy to do that if you like. Also I think your (very good) text in pcre2pattern.3 might benefit from a little more expansion about what can be an atom. I hadn't really studied Perl enough, and was surprised to learn that, though you can't have an individual character as a literal, you can have it as an escape. So "a" isn't valid, but "\x{61}" is, or of course you can use "[a]" (which I did know). Perl strict rules: I am more or less neutral on this, but quite happy to leave things as they are. |
I haven't done it myself, but I think Perl allows "replacing" their "re" implementation with other engines and one of them might be PCRE, I agree with you both that the "incpompatibilies" are minor enough that it should not hold this, but also suggested we might be able to "reuse" the flag that is being used to enable the other extended classes to also activate those for whoever might require them. |
If we did want to implement the Right now, I don't feel especially motivated to go to extra effort, to lock down three perfectly-OK bits of syntax, just so we can show more annoying messages to users "haha, you wrote your regex badly". No-one is likely to ever use the flag (PCRE2_EXTRA_PERL_RE_STRICT), and similarly |
I've added some docs in |
Excellent! But .... sorry to nitpick ... there are dates at the top and bottom of pcre2syntax.3 (and pcre2pattern.3, come to think of it) that need updating. The one at the bottom is the only one that gets copied through to the HTML version, which is why it is there as well as the date that is one of the arguments to the .TH macro at the top. I always find it useful to have dates on documents because things go out of date so quickly. |
OK! But one day, I'll remember to add a task to PrepareRelease to do that for us, by searching-for and updating all the dates with the git date:
|
... but I only update the dates when I do actually edit the files... so one knows if something has not been updated. |
Exactly. If I did automate the documentation dates, it would be based on git's per-file record of when it was last edited. |
Aha! Git is much cleverer than I imagined. I should have learned it earlier. |
The testing took me a while, but I've done it now. No longer a draft! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nice work, might be nice to add some caseless tests in 4, though
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Perhaps #559 should land first to add more testing.
testdata/testoutput2
Outdated
Failed: error 106 at offset 5: missing terminating ] for character class | ||
|
||
/(?[\n]/ | ||
Failed: error 215 at offset 6: unexpected ] with no following parenthesis in (?[...] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(?[
or (?[...])
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's intentional. I'm exercising not just every line of code, but also every branch in all the ifs
. So there are lots of tests where I take valid patterns, and chop bits off the end, in order to exercise the if (ptr < ptrend && ...)
part of the conditions.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I mean the error message should be changed. (?[...])
used most of the time
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see. I was following Perl's error message, which is:
$ echo '/(?[\n]/;' | perl
Unexpected ']' with no following ')' in (?[... in regex; marked by <-- HERE in m/(?[\n] <-- HERE / at - line 1.
I just tweaked the message a tiny bit. The in (?[...]
(or "in foo" in all error messages) was supposed to mean "in this text which you provided", and indeed (?[...]
is what the user provided.
I'll change it to (?[...
to exactly match Perl.
Failed: error 216 at offset 4: unexpected character in (?[...]) character class | ||
|
||
/(?[(])/ | ||
Failed: error 114 at offset 4: missing closing parenthesis |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Isn't parenthesis are normal characters in a class?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In an ordinary character class, yes, parentheses are not special. But inside (?[...])
they are a metacharacter for grouping arithmetic.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is it the same as '[' ? Or does it do something different?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In a Perl extended class, (
and [
are quite different. There are two parsing grammars, the "outer" expression grammar, and an "inner" character class grammar.
PerlExtended := "(?[" OuterExpr "])" ;
OuterExpr := OuterExpr2 ([|+^-] OuterExpr2)* ;
OuterExpr2 := OuterExpr3 ([&] OuterExpr3)* ;
OuterExpr3 := [!]* OuterExpr4 ;
OuterExpr4 := InnerClass | "(" OuterExpr ")" ;
InnerClass := "\\" [dDsSwW] | "[:" Posix ":]" | "[" NormalPerlCharacterClassSyntax "]" ;
So (
doesn't change the parse mode, it stays within the expression syntax (in which "-" is an operator). But "[" switches the parse mode to the ordinary Perl class syntax (in which "-" is a character range, maybe).
I'm very happy to wait, but I don't think it's necessary. All the C code in this PR is very well-behaved, and won't be affected by Clang's |
All added, thank you for these suggestions |
If you are ok with #560, I think we can do this really quickly. Then we have an extra layer of protection. |
cfb6fb0
to
f1c2406
Compare
Please rebase this patch. |
f1c2406
to
70c1a07
Compare
Gladly rebased. Thank you Zoltan! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
I have been reviewing this new feature, in particular the documentation. It is excellent but (of course there's a "but"; why else am I posting?) I think it might profit from a bit more explanation, possibly with examples. There were a few things I had to run tests in order to learn some exact behaviour. Rather than try to list suggested changes it will be easiest if I just edit the docs directly, which I will do later tomorrow unless anybody jumps up and says NO. I think also some error messages would be clearer if "class" was changed to "extended class" (which I can also do). Thanks to @NWilson and all who have worked on this. |
Thank you Philip! That would be very helpful. I'm grateful for any documentation improvements you could suggest or make yourself. |
I have just merged an amended version of pcre2pattern.3 and pcre2_error.c with my changes. Please grumble if I've got anything wrong. I think there is a typo in a comment in pcre2_compile.c line 4175 which says "Check for no preceding operand". Shouldn't that be "operator"? |
|
I realize that in UTS#18 mode there are just "classes", but Perl distinguishes "classes" and "extended classes" and some, at least, of the error messages are used in both modes. That's why I made the change. The problem I had with "check there is no preceding operand" was that the error message 113 says "no preceding operator". One of them must be wrong! I'll have a check on EBCDIC mentions. Good idea. I was thinking that I could perhaps usefully do a general trawl through the docs (a bit at a time to avoid boredom) before the next release. |
Oh I see! They're both right... but both super-unclear. The comment "check there is no preceding operand" was meant to mean, "we want to check that we don't have a preceding operand; if there is one, reject the regex". The error message says "unexpected expression in character class (no preceding operator)" because it's trying to say "you didn't put in an operator, which was required". The case is: The comment is ultimately unimportant. But if you think there's a better phrasing for that error message, please do improve it! |
Fixes #536