Implement Perl extended character classes #553

NWilson · 2024-11-08T01:30:10Z

Fixes #536

zherczeg

Nice patch! The precedence is implemented with the tight/loose thing?

src/pcre2_compile.c

NWilson

The precedence is implemented with the tight/loose thing?

That's right. It's not mega-efficient but it's very robust, and it's accurate to Perl's behaviour. In a classic recursive descent parser, there's one separate parse function for each level of precedence in the grammar.

I will do the optimised parser in a PR after this one. I decided to make it all tested and feature-complete first.

Note! I haven't written the automated tests yet...

PhilipHazel · 2024-11-08T09:54:28Z

I have had a play with some invented Perl syntax tests and it all worked. Good work. Happy to help with documentation if needed.

carenas

obviously missing tests, but assume that is why it is a draft?

not sure how "temporal" some of the "//" comments are either (assume to be fixed before an RC though)

src/pcre2_compile.c

NWilson · 2024-11-08T11:29:18Z

Happy to help with documentation if needed.

Thank @PhilipHazel! What would you like to be added? I have put in a little section on the Perl syntax, inside pcre2pattern.

NWilson · 2024-11-08T11:51:21Z

obviously missing tests, but assume that is why it is a draft?

That's right, just tests remaining.

not sure how "temporal" some of the "//" comments are either (assume to be fixed before an RC though)

All the "// TODO" comments will be merged with the PR, and addressed in a follow-up. All the "TODO" comments that will be merged have a GitHub ticket linked for the follow-up work.

I have just addressed the one "// XXX" comment.

NWilson · 2024-11-08T12:30:25Z

@PhilipHazel There's a small design choice around one of Perl's strictness options.

In Perl, an extended class (?[...]) always enables the use re strict; restrictions within the class. Some of those restrictions we already enforce in PCRE2 - things that are just plainly errors in the pattern like \y (unknown alpha escape) or \x{} (hex with zero digits). However, from studying the Perl source code a little, I have found three restrictions we don't currently implement (annoyingly, the Perl docs do not state anywhere that I can find all the changes made by use re strict; source sleuthing seems the only way to discover these):

Octal escapes require exactly three digits with re strict. This bans (?[ \0 \1 \12 ]).
Non-octal digits are not supported inside classes with re strict. This bans (?[ \8 \9 ]).
An unbracketed hex escape must have no less AND no more than two digits. This bans (?[ \x1 \x111 ])

I haven't applied these restrictions inside my Perl extended implementation. They don't really seem necessary, certainly not for Perl compatibility. However, we could do them if we feel that users ought to be able to copy a working (?[...]) regex from PCRE2 into Perl, and have it work there. We don't currently do that - PCRE2 is very much a superset of Perl syntax, so I don't see it as a hard requirement for our extended character class syntax to match Perl's exactly. As long as any valid Perl regex can be copied into PCRE2, we should be OK.#

PhilipHazel · 2024-11-08T12:41:00Z

Documentation: pcre2syntax.3 needs updating - I am happy to do that if you like. Also I think your (very good) text in pcre2pattern.3 might benefit from a little more expansion about what can be an atom. I hadn't really studied Perl enough, and was surprised to learn that, though you can't have an individual character as a literal, you can have it as an escape. So "a" isn't valid, but "\x{61}" is, or of course you can use "[a]" (which I did know).

Perl strict rules: I am more or less neutral on this, but quite happy to leave things as they are.

carenas · 2024-11-08T13:40:46Z

we could do them if we feel that users ought to be able to copy a working (?[...]) regex from PCRE2 into Perl

I haven't done it myself, but I think Perl allows "replacing" their "re" implementation with other engines and one of them might be PCRE, I agree with you both that the "incpompatibilies" are minor enough that it should not hold this, but also suggested we might be able to "reuse" the flag that is being used to enable the other extended classes to also activate those for whoever might require them.

NWilson · 2024-11-08T14:51:58Z

If we did want to implement the re strict rules, I guess we'd add some new flag PCRE2_EXTRA_PERL_RE_STRICT, and set it to true when parsing (?[...]). Or we could make it private, so that customers can only get hold of the flag by using an extended class, but that seems a bit restrictive.

Right now, I don't feel especially motivated to go to extra effort, to lock down three perfectly-OK bits of syntax, just so we can show more annoying messages to users "haha, you wrote your regex badly". No-one is likely to ever use the flag (PCRE2_EXTRA_PERL_RE_STRICT), and similarly (?[...]) will get low usage, with even lower usage of the bad escapes.

NWilson · 2024-11-08T15:33:56Z

I've added some docs in pcre2syntax.3.

PhilipHazel · 2024-11-08T16:22:12Z

Excellent! But .... sorry to nitpick ... there are dates at the top and bottom of pcre2syntax.3 (and pcre2pattern.3, come to think of it) that need updating. The one at the bottom is the only one that gets copied through to the HTML version, which is why it is there as well as the date that is one of the arguments to the .TH macro at the top. I always find it useful to have dates on documents because things go out of date so quickly.

NWilson · 2024-11-08T16:37:19Z

OK! But one day, I'll remember to add a task to PrepareRelease to do that for us, by searching-for and updating all the dates with the git date:

git log -n1 --date=format:"%d %B %Y" --format=%cd ./doc/pcre2syntax.3

PhilipHazel · 2024-11-08T16:46:49Z

... but I only update the dates when I do actually edit the files... so one knows if something has not been updated.

NWilson · 2024-11-10T14:00:17Z

Exactly. If I did automate the documentation dates, it would be based on git's per-file record of when it was last edited.

PhilipHazel · 2024-11-10T17:04:42Z

Aha! Git is much cleverer than I imagined. I should have learned it earlier.

NWilson · 2024-11-12T15:38:07Z

The testing took me a while, but I've done it now.

No longer a draft!

carenas

nice work, might be nice to add some caseless tests in 4, though

zherczeg

Perhaps #559 should land first to add more testing.

testdata/testoutput2

zherczeg · 2024-11-12T19:01:34Z

testdata/testoutput2

+Failed: error 106 at offset 5: missing terminating ] for character class
+
+/(?[\n]/
+Failed: error 215 at offset 6: unexpected ] with no following parenthesis in (?[...]


(?[ or (?[...])

It's intentional. I'm exercising not just every line of code, but also every branch in all the ifs. So there are lots of tests where I take valid patterns, and chop bits off the end, in order to exercise the if (ptr < ptrend && ...) part of the conditions.

I mean the error message should be changed. (?[...]) used most of the time

I see. I was following Perl's error message, which is:

$ echo '/(?[\n]/;' | perl Unexpected ']' with no following ')' in (?[... in regex; marked by <-- HERE in m/(?[\n] <-- HERE / at - line 1.

I just tweaked the message a tiny bit. The in (?[...] (or "in foo" in all error messages) was supposed to mean "in this text which you provided", and indeed (?[...] is what the user provided.

I'll change it to (?[... to exactly match Perl.

zherczeg · 2024-11-12T19:03:30Z

testdata/testoutput2

+Failed: error 216 at offset 4: unexpected character in (?[...]) character class
+
+/(?[(])/
+Failed: error 114 at offset 4: missing closing parenthesis


Isn't parenthesis are normal characters in a class?

In an ordinary character class, yes, parentheses are not special. But inside (?[...]) they are a metacharacter for grouping arithmetic.

Is it the same as '[' ? Or does it do something different?

In a Perl extended class, ( and [ are quite different. There are two parsing grammars, the "outer" expression grammar, and an "inner" character class grammar.

PerlExtended := "(?[" OuterExpr "])" ; OuterExpr := OuterExpr2 ([|+^-] OuterExpr2)* ; OuterExpr2 := OuterExpr3 ([&] OuterExpr3)* ; OuterExpr3 := [!]* OuterExpr4 ; OuterExpr4 := InnerClass | "(" OuterExpr ")" ; InnerClass := "\\" [dDsSwW] | "[:" Posix ":]" | "[" NormalPerlCharacterClassSyntax "]" ;

So ( doesn't change the parse mode, it stays within the expression syntax (in which "-" is an operator). But "[" switches the parse mode to the ordinary Perl class syntax (in which "-" is a character range, maybe).

testdata/testoutput5

NWilson · 2024-11-12T20:36:32Z

Perhaps #559 should land first to add more testing.

I'm very happy to wait, but I don't think it's necessary. All the C code in this PR is very well-behaved, and won't be affected by Clang's -O2 vs -O0.

NWilson · 2024-11-12T21:09:51Z

[Carlo] might be nice to add some caseless tests in 4, though

[Zoltan] Maybe a /B could clarify its structure.

[Zoltan] Is there a '\P' test?

All added, thank you for these suggestions

zherczeg · 2024-11-13T05:24:43Z

I'm very happy to wait, but I don't think it's necessary. All the C code in this PR is very well-behaved, and won't be affected by Clang's -O2 vs -O0.

If you are ok with #560, I think we can do this really quickly. Then we have an extra layer of protection.

zherczeg · 2024-11-15T14:11:36Z

Please rebase this patch.

Fixes PCRE2Project#536

NWilson · 2024-11-15T14:18:04Z

Gladly rebased. Thank you Zoltan!

zherczeg

LGTM

PhilipHazel · 2024-11-24T17:17:01Z

I have been reviewing this new feature, in particular the documentation. It is excellent but (of course there's a "but"; why else am I posting?) I think it might profit from a bit more explanation, possibly with examples. There were a few things I had to run tests in order to learn some exact behaviour. Rather than try to list suggested changes it will be easiest if I just edit the docs directly, which I will do later tomorrow unless anybody jumps up and says NO. I think also some error messages would be clearer if "class" was changed to "extended class" (which I can also do). Thanks to @NWilson and all who have worked on this.

NWilson · 2024-11-24T21:39:18Z

Thank you Philip! That would be very helpful. I'm grateful for any documentation improvements you could suggest or make yourself.

PhilipHazel · 2024-11-26T16:04:19Z

I have just merged an amended version of pcre2pattern.3 and pcre2_error.c with my changes. Please grumble if I've got anything wrong. I think there is a typo in a comment in pcre2_compile.c line 4175 which says "Check for no preceding operand". Shouldn't that be "operator"?

NWilson · 2024-11-26T20:08:01Z

In all the error messages you added "extended". I don't mind, but it was a conscious decision not to. In the UTS#18 case, do users know about "extended" classes? You just have [...] classes, rather than two different types, which (in that dialect) can include operators. In my mind, it's more that the UTS#18 dialect has an extended version of character classes, rather than any one particular character class being "extended". Still, the messages are probably an improvement though.
It's lovely that the EBCDIC support is documented; however, it's not available unless you use a very, very unusual compiler. For the benefit of people who don't know about it, perhaps the sections in pcre2pattern should start by saying "this is only relevant to certain mainframe environments; if you don't know what EBCDIC is, then your system is not using it, and the following section does not apply".
"Check for no preceding operand" was meant to mean, "check [ensure] that there is no preceding operand". I should make the comment more clear! I'll do a little PR for that.

PhilipHazel · 2024-11-27T09:14:37Z

I realize that in UTS#18 mode there are just "classes", but Perl distinguishes "classes" and "extended classes" and some, at least, of the error messages are used in both modes. That's why I made the change. The problem I had with "check there is no preceding operand" was that the error message 113 says "no preceding operator". One of them must be wrong!

I'll have a check on EBCDIC mentions. Good idea. I was thinking that I could perhaps usefully do a general trawl through the docs (a bit at a time to avoid boredom) before the next release.

NWilson · 2024-11-27T15:33:21Z

The problem I had with "check there is no preceding operand" was that the error message 113 says "no preceding operator". One of them must be wrong!

Oh I see! They're both right... but both super-unclear.

The comment "check there is no preceding operand" was meant to mean, "we want to check that we don't have a preceding operand; if there is one, reject the regex".

The error message says "unexpected expression in character class (no preceding operator)" because it's trying to say "you didn't put in an operator, which was required".

The case is: /(?[\d !\w])/, where it encounters the '!' character but it's juxtaposed with a previous atom/leaf.

The comment is ultimately unimportant. But if you think there's a better phrasing for that error message, please do improve it!

zherczeg reviewed Nov 8, 2024

View reviewed changes

src/pcre2_compile.c Show resolved Hide resolved

src/pcre2_compile.c Outdated Show resolved Hide resolved

NWilson commented Nov 8, 2024

View reviewed changes

NWilson marked this pull request as draft November 8, 2024 08:55

carenas suggested changes Nov 8, 2024

View reviewed changes

src/pcre2_compile.c Outdated Show resolved Hide resolved

NWilson force-pushed the user/niwilson/perl-ext branch from 5385503 to 5c714ca Compare November 8, 2024 11:51

NWilson marked this pull request as ready for review November 12, 2024 15:36

carenas approved these changes Nov 12, 2024

View reviewed changes

zherczeg reviewed Nov 12, 2024

View reviewed changes

NWilson force-pushed the user/niwilson/perl-ext branch from cfb6fb0 to f1c2406 Compare November 13, 2024 08:03

zherczeg mentioned this pull request Nov 15, 2024

Test clang with and without optimisations #559

Merged

Implement Perl extended character classes

70c1a07

Fixes PCRE2Project#536

NWilson force-pushed the user/niwilson/perl-ext branch from f1c2406 to 70c1a07 Compare November 15, 2024 14:17

zherczeg approved these changes Nov 15, 2024

View reviewed changes

zherczeg merged commit e0d4eee into PCRE2Project:master Nov 15, 2024
18 checks passed

PhilipHazel mentioned this pull request Nov 16, 2024

Revise and extend character classes #13

Closed

Implement Perl extended character classes #553

Implement Perl extended character classes #553

Conversation

NWilson commented Nov 8, 2024 • edited Loading

zherczeg left a comment

Choose a reason for hiding this comment

NWilson left a comment

Choose a reason for hiding this comment

PhilipHazel commented Nov 8, 2024

carenas left a comment • edited Loading

Choose a reason for hiding this comment

NWilson commented Nov 8, 2024

NWilson commented Nov 8, 2024

NWilson commented Nov 8, 2024

PhilipHazel commented Nov 8, 2024

carenas commented Nov 8, 2024 • edited Loading

NWilson commented Nov 8, 2024

NWilson commented Nov 8, 2024

PhilipHazel commented Nov 8, 2024

NWilson commented Nov 8, 2024

PhilipHazel commented Nov 8, 2024

NWilson commented Nov 10, 2024

PhilipHazel commented Nov 10, 2024

NWilson commented Nov 12, 2024

carenas left a comment

Choose a reason for hiding this comment

zherczeg left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

NWilson commented Nov 12, 2024

NWilson commented Nov 12, 2024

zherczeg commented Nov 13, 2024

zherczeg commented Nov 15, 2024

NWilson commented Nov 15, 2024

zherczeg left a comment

Choose a reason for hiding this comment

PhilipHazel commented Nov 24, 2024

NWilson commented Nov 24, 2024

PhilipHazel commented Nov 26, 2024

NWilson commented Nov 26, 2024

PhilipHazel commented Nov 27, 2024

NWilson commented Nov 27, 2024

NWilson commented Nov 8, 2024 •

edited

Loading

carenas left a comment •

edited

Loading

carenas commented Nov 8, 2024 •

edited

Loading