Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Implement Perl extended character classes #553

Merged
merged 1 commit into from
Nov 15, 2024

Conversation

NWilson
Copy link
Contributor

@NWilson NWilson commented Nov 8, 2024

Fixes #536

Copy link
Collaborator

@zherczeg zherczeg left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice patch! The precedence is implemented with the tight/loose thing?

src/pcre2_compile.c Show resolved Hide resolved
src/pcre2_compile.c Outdated Show resolved Hide resolved
Copy link
Contributor Author

@NWilson NWilson left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The precedence is implemented with the tight/loose thing?

That's right. It's not mega-efficient but it's very robust, and it's accurate to Perl's behaviour. In a classic recursive descent parser, there's one separate parse function for each level of precedence in the grammar.

I will do the optimised parser in a PR after this one. I decided to make it all tested and feature-complete first.

Note! I haven't written the automated tests yet...

@NWilson NWilson marked this pull request as draft November 8, 2024 08:55
@PhilipHazel
Copy link
Collaborator

I have had a play with some invented Perl syntax tests and it all worked. Good work. Happy to help with documentation if needed.

Copy link
Contributor

@carenas carenas left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

obviously missing tests, but assume that is why it is a draft?

not sure how "temporal" some of the "//" comments are either (assume to be fixed before an RC though)

src/pcre2_compile.c Outdated Show resolved Hide resolved
@NWilson
Copy link
Contributor Author

NWilson commented Nov 8, 2024

Happy to help with documentation if needed.

Thank @PhilipHazel! What would you like to be added? I have put in a little section on the Perl syntax, inside pcre2pattern.

@NWilson
Copy link
Contributor Author

NWilson commented Nov 8, 2024

obviously missing tests, but assume that is why it is a draft?

That's right, just tests remaining.

not sure how "temporal" some of the "//" comments are either (assume to be fixed before an RC though)

All the "// TODO" comments will be merged with the PR, and addressed in a follow-up. All the "TODO" comments that will be merged have a GitHub ticket linked for the follow-up work.

I have just addressed the one "// XXX" comment.

@NWilson
Copy link
Contributor Author

NWilson commented Nov 8, 2024

@PhilipHazel There's a small design choice around one of Perl's strictness options.

In Perl, an extended class (?[...]) always enables the use re strict; restrictions within the class. Some of those restrictions we already enforce in PCRE2 - things that are just plainly errors in the pattern like \y (unknown alpha escape) or \x{} (hex with zero digits). However, from studying the Perl source code a little, I have found three restrictions we don't currently implement (annoyingly, the Perl docs do not state anywhere that I can find all the changes made by use re strict; source sleuthing seems the only way to discover these):

  • Octal escapes require exactly three digits with re strict. This bans (?[ \0 \1 \12 ]).
  • Non-octal digits are not supported inside classes with re strict. This bans (?[ \8 \9 ]).
  • An unbracketed hex escape must have no less AND no more than two digits. This bans (?[ \x1 \x111 ])

I haven't applied these restrictions inside my Perl extended implementation. They don't really seem necessary, certainly not for Perl compatibility. However, we could do them if we feel that users ought to be able to copy a working (?[...]) regex from PCRE2 into Perl, and have it work there. We don't currently do that - PCRE2 is very much a superset of Perl syntax, so I don't see it as a hard requirement for our extended character class syntax to match Perl's exactly. As long as any valid Perl regex can be copied into PCRE2, we should be OK.#

@PhilipHazel
Copy link
Collaborator

Documentation: pcre2syntax.3 needs updating - I am happy to do that if you like. Also I think your (very good) text in pcre2pattern.3 might benefit from a little more expansion about what can be an atom. I hadn't really studied Perl enough, and was surprised to learn that, though you can't have an individual character as a literal, you can have it as an escape. So "a" isn't valid, but "\x{61}" is, or of course you can use "[a]" (which I did know).

Perl strict rules: I am more or less neutral on this, but quite happy to leave things as they are.

@carenas
Copy link
Contributor

carenas commented Nov 8, 2024

we could do them if we feel that users ought to be able to copy a working (?[...]) regex from PCRE2 into Perl

I haven't done it myself, but I think Perl allows "replacing" their "re" implementation with other engines and one of them might be PCRE, I agree with you both that the "incpompatibilies" are minor enough that it should not hold this, but also suggested we might be able to "reuse" the flag that is being used to enable the other extended classes to also activate those for whoever might require them.

@NWilson
Copy link
Contributor Author

NWilson commented Nov 8, 2024

If we did want to implement the re strict rules, I guess we'd add some new flag PCRE2_EXTRA_PERL_RE_STRICT, and set it to true when parsing (?[...]). Or we could make it private, so that customers can only get hold of the flag by using an extended class, but that seems a bit restrictive.

Right now, I don't feel especially motivated to go to extra effort, to lock down three perfectly-OK bits of syntax, just so we can show more annoying messages to users "haha, you wrote your regex badly". No-one is likely to ever use the flag (PCRE2_EXTRA_PERL_RE_STRICT), and similarly (?[...]) will get low usage, with even lower usage of the bad escapes.

@NWilson
Copy link
Contributor Author

NWilson commented Nov 8, 2024

I've added some docs in pcre2syntax.3.

@PhilipHazel
Copy link
Collaborator

Excellent! But .... sorry to nitpick ... there are dates at the top and bottom of pcre2syntax.3 (and pcre2pattern.3, come to think of it) that need updating. The one at the bottom is the only one that gets copied through to the HTML version, which is why it is there as well as the date that is one of the arguments to the .TH macro at the top. I always find it useful to have dates on documents because things go out of date so quickly.

@NWilson
Copy link
Contributor Author

NWilson commented Nov 8, 2024

OK! But one day, I'll remember to add a task to PrepareRelease to do that for us, by searching-for and updating all the dates with the git date:

git log -n1 --date=format:"%d %B %Y" --format=%cd ./doc/pcre2syntax.3

@PhilipHazel
Copy link
Collaborator

... but I only update the dates when I do actually edit the files... so one knows if something has not been updated.

@NWilson
Copy link
Contributor Author

NWilson commented Nov 10, 2024

Exactly. If I did automate the documentation dates, it would be based on git's per-file record of when it was last edited.

@PhilipHazel
Copy link
Collaborator

Aha! Git is much cleverer than I imagined. I should have learned it earlier.

@NWilson NWilson marked this pull request as ready for review November 12, 2024 15:36
@NWilson
Copy link
Contributor Author

NWilson commented Nov 12, 2024

The testing took me a while, but I've done it now.

No longer a draft!

Copy link
Contributor

@carenas carenas left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nice work, might be nice to add some caseless tests in 4, though

Copy link
Collaborator

@zherczeg zherczeg left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Perhaps #559 should land first to add more testing.

testdata/testoutput2 Show resolved Hide resolved
testdata/testoutput2 Outdated Show resolved Hide resolved
Failed: error 106 at offset 5: missing terminating ] for character class

/(?[\n]/
Failed: error 215 at offset 6: unexpected ] with no following parenthesis in (?[...]
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(?[ or (?[...])

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's intentional. I'm exercising not just every line of code, but also every branch in all the ifs. So there are lots of tests where I take valid patterns, and chop bits off the end, in order to exercise the if (ptr < ptrend && ...) part of the conditions.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I mean the error message should be changed. (?[...]) used most of the time

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see. I was following Perl's error message, which is:

$ echo '/(?[\n]/;' | perl
Unexpected ']' with no following ')' in (?[... in regex; marked by <-- HERE in m/(?[\n] <-- HERE / at - line 1.

I just tweaked the message a tiny bit. The in (?[...] (or "in foo" in all error messages) was supposed to mean "in this text which you provided", and indeed (?[...] is what the user provided.

I'll change it to (?[... to exactly match Perl.

Failed: error 216 at offset 4: unexpected character in (?[...]) character class

/(?[(])/
Failed: error 114 at offset 4: missing closing parenthesis
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Isn't parenthesis are normal characters in a class?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In an ordinary character class, yes, parentheses are not special. But inside (?[...]) they are a metacharacter for grouping arithmetic.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it the same as '[' ? Or does it do something different?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In a Perl extended class, ( and [ are quite different. There are two parsing grammars, the "outer" expression grammar, and an "inner" character class grammar.

PerlExtended := "(?[" OuterExpr "])" ;

OuterExpr := OuterExpr2 ([|+^-] OuterExpr2)* ;
OuterExpr2 := OuterExpr3 ([&] OuterExpr3)* ;
OuterExpr3 := [!]* OuterExpr4 ;
OuterExpr4 := InnerClass | "(" OuterExpr ")" ;

InnerClass := "\\" [dDsSwW] | "[:" Posix ":]" | "[" NormalPerlCharacterClassSyntax "]" ;

So ( doesn't change the parse mode, it stays within the expression syntax (in which "-" is an operator). But "[" switches the parse mode to the ordinary Perl class syntax (in which "-" is a character range, maybe).

testdata/testoutput5 Show resolved Hide resolved
@NWilson
Copy link
Contributor Author

NWilson commented Nov 12, 2024

Perhaps #559 should land first to add more testing.

I'm very happy to wait, but I don't think it's necessary. All the C code in this PR is very well-behaved, and won't be affected by Clang's -O2 vs -O0.

@NWilson
Copy link
Contributor Author

NWilson commented Nov 12, 2024

[Carlo] might be nice to add some caseless tests in 4, though

[Zoltan] Maybe a /B could clarify its structure.

[Zoltan] Is there a '\P' test?

All added, thank you for these suggestions

@zherczeg
Copy link
Collaborator

I'm very happy to wait, but I don't think it's necessary. All the C code in this PR is very well-behaved, and won't be affected by Clang's -O2 vs -O0.

If you are ok with #560, I think we can do this really quickly. Then we have an extra layer of protection.

@zherczeg
Copy link
Collaborator

Please rebase this patch.

@NWilson
Copy link
Contributor Author

NWilson commented Nov 15, 2024

Gladly rebased. Thank you Zoltan!

Copy link
Collaborator

@zherczeg zherczeg left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@zherczeg zherczeg merged commit e0d4eee into PCRE2Project:master Nov 15, 2024
18 checks passed
@PhilipHazel
Copy link
Collaborator

I have been reviewing this new feature, in particular the documentation. It is excellent but (of course there's a "but"; why else am I posting?) I think it might profit from a bit more explanation, possibly with examples. There were a few things I had to run tests in order to learn some exact behaviour. Rather than try to list suggested changes it will be easiest if I just edit the docs directly, which I will do later tomorrow unless anybody jumps up and says NO. I think also some error messages would be clearer if "class" was changed to "extended class" (which I can also do). Thanks to @NWilson and all who have worked on this.

@NWilson
Copy link
Contributor Author

NWilson commented Nov 24, 2024

Thank you Philip! That would be very helpful. I'm grateful for any documentation improvements you could suggest or make yourself.

@PhilipHazel
Copy link
Collaborator

I have just merged an amended version of pcre2pattern.3 and pcre2_error.c with my changes. Please grumble if I've got anything wrong. I think there is a typo in a comment in pcre2_compile.c line 4175 which says "Check for no preceding operand". Shouldn't that be "operator"?

@NWilson
Copy link
Contributor Author

NWilson commented Nov 26, 2024

  • In all the error messages you added "extended". I don't mind, but it was a conscious decision not to. In the UTS#18 case, do users know about "extended" classes? You just have [...] classes, rather than two different types, which (in that dialect) can include operators. In my mind, it's more that the UTS#18 dialect has an extended version of character classes, rather than any one particular character class being "extended". Still, the messages are probably an improvement though.
  • It's lovely that the EBCDIC support is documented; however, it's not available unless you use a very, very unusual compiler. For the benefit of people who don't know about it, perhaps the sections in pcre2pattern should start by saying "this is only relevant to certain mainframe environments; if you don't know what EBCDIC is, then your system is not using it, and the following section does not apply".
  • "Check for no preceding operand" was meant to mean, "check [ensure] that there is no preceding operand". I should make the comment more clear! I'll do a little PR for that.

@PhilipHazel
Copy link
Collaborator

I realize that in UTS#18 mode there are just "classes", but Perl distinguishes "classes" and "extended classes" and some, at least, of the error messages are used in both modes. That's why I made the change. The problem I had with "check there is no preceding operand" was that the error message 113 says "no preceding operator". One of them must be wrong!

I'll have a check on EBCDIC mentions. Good idea. I was thinking that I could perhaps usefully do a general trawl through the docs (a bit at a time to avoid boredom) before the next release.

@NWilson
Copy link
Contributor Author

NWilson commented Nov 27, 2024

The problem I had with "check there is no preceding operand" was that the error message 113 says "no preceding operator". One of them must be wrong!

Oh I see! They're both right... but both super-unclear.

The comment "check there is no preceding operand" was meant to mean, "we want to check that we don't have a preceding operand; if there is one, reject the regex".

The error message says "unexpected expression in character class (no preceding operator)" because it's trying to say "you didn't put in an operator, which was required".

The case is: /(?[\d !\w])/, where it encounters the '!' character but it's juxtaposed with a previous atom/leaf.

The comment is ultimately unimportant. But if you think there's a better phrasing for that error message, please do improve it!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Extended character classes: implement the Perl syntax
4 participants