Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Pattern /[^\D\P{Nd}]/utf matches to \x{1d7cf} #497

Open
zherczeg opened this issue Sep 26, 2024 · 11 comments
Open

Pattern /[^\D\P{Nd}]/utf matches to \x{1d7cf} #497

zherczeg opened this issue Sep 26, 2024 · 11 comments

Comments

@zherczeg
Copy link
Collaborator

zherczeg commented Sep 26, 2024

This test is in testinput5:

  re> /[^\D\P{Nd}]/B,utf
------------------------------------------------------------------
        Bra
        [^\x00-/:-\xff\P{Nd}]
        Ket
        End
------------------------------------------------------------------
data> \x{1d7cf}
 0: \x{1d7cf}

Currently this pattern matches to \x{1d7cf}

Since \D is in ascii mode (ucp is not enabled), /[\D]/utf matches to anything not 0-9. That should include \x{1d7cf}. This looks true:

  re> /[\D\P{Nd}]/B,utf
------------------------------------------------------------------
        Bra
        [\x00-/:-\xff\P{Nd}\x{100}-\x{10ffff}]
        Ket
        End
------------------------------------------------------------------
data> \x{1d7cf}
 0: \x{1d7cf}

Note: /[\P{Nd}]/utf does not match to \x{1d7cf}

Summary: both /[^\D\P{Nd}]/utf and /[\D\P{Nd}]/utf matches to \x{1d7cf}. Obviously a character class and its negated form cannot match to the same character, and I think the first one is incorrect. My newest code changes this pattern to no-match, and I wanted to discuss it.

@NWilson
Copy link
Contributor

NWilson commented Sep 26, 2024

This is a very interesting case!

There's something related here - which is the behaviour of \P{...} with /i modifier.

They specifically changed the behaviour of [^\P{...}] in ECMAScript. If you run the regex using /[^\P{...}]/iu vs using /[^\P{...}]/iv in JavaScript you get different results. (Note the change from "/u Unicode" to "/v Next-gen Unicode".)

They discussed this change here: tc39/proposal-regexp-v-flag#30

(Note - there's a lot of confusion and irrelevant chatter on that thread. Most of it just irrelevant.)

How character classes work

  • Take the characters and ranges, and union them to form a "set" of characters
  • Things like \D mean "the complement/inverted-set" for whatever characters are in the \d set. This is applied immediately, at compile-time.
  • Similarly for \p{...} and \P{...}.
  • Then, ^ is applied last to complement/invert the set of matched characters.
  • The input/match character is then tested to see if it's contained in the set.

With the /i modifier:

  • Each character that appears in the set is "case-folded".
    • So [A] means → add case_fold('A') to the set of matched characters
    • And [A-B] means → take the set of characters {A...B} and form the set {case_fold(c) for c ∈ {A...B}}, and add that
    • Obviously the input (match) string is read character-by-character and the input characters are case-folded too before matching against the set
    • The input string is case-handled character-by-character because the flag can be set using (?i:[...]) to apply to portions of the regex only.
  • The ambiguity that the ECMAScript guys got hung up on is around \P{...}
    • The old behaviour was: form the set for \p{...}. Then invert it to form the \P{...} set. Then finally form the set {case_fold(c) for c ∈ \P{...}}
    • The new behaviour in ECMAScript is to change the order. Form the \p{..} set, then case-fold it, then invert it, so that /\P{...}/i means complement({case_fold(c) for c ∈ \p{...}})

Things like \d and \D behave the same as \p{...} or \P{...}. They are all just shorthands for a set of characters.

@zherczeg
Copy link
Collaborator Author

Which \p{} is affected by case folding? It seemed to me (and I assumed), that classes always contains the other cases of all of its characters. Most properties are scripts or control characters. I suspect some extended script is affected.

@NWilson
Copy link
Contributor

NWilson commented Sep 26, 2024

Well \P{Ll} and \P{Lu} do, maybe others. It drastically affects their meaning under /i, whether you do the case-fold before or after the inversion implied by \P.

@zherczeg
Copy link
Collaborator Author

True. In PCRE2, /i does not affect properties. This way we don't need to generate that many databases.

@carenas
Copy link
Contributor

carenas commented Sep 26, 2024

In PCRE2, /i does not affect properties

Note that is no longer the case since #432

@PhilipHazel
Copy link
Collaborator

Perl does this:

Perl v5.40.0

/[\D\P{Nd}]/utf
\x{1d7cf}
No match

/[^\D\P{Nd}]/utf
\x{1d7cf}
0: \x{1d7cf}

Which I think is right - PCRE2 is currently wrong. As the character is greater than 255 and UCP is not set, the bit map set up by \D is not relevant and only \P should count. However, if the first pattern is just [\P{Nd}] there is no match. So there is indeed a bug in PCRE2 and I think it is /[\D\P{Nd}]/utf that is incorrect.

@PhilipHazel
Copy link
Collaborator

I'm not sure in #497 which one you think is wrong and have fixed...

@zherczeg
Copy link
Collaborator Author

\D matches to anything that is not 0-9, that includes all > 255 characters. Perl uses unicode for \D, that is why /[\D\P{Nd}]/utf does not match. [\d] matches to \x{1d7cf} in perl. Perl has no non-ucp mode.

@PhilipHazel
Copy link
Collaborator

Ah yes, of course. I was forgetting that. OK, let's merge your patch.

@NWilson
Copy link
Contributor

NWilson commented Nov 6, 2024

I made a random digression above, about the behaviour of /\P{...}/i.

PCRE2's behaviour matches Perl currently, and Python's regex module (the standard re module doesn't have \P{...}), and also JavaScript with the new v flag.

So our behaviour is all good, but I think we're maybe missing tests for (negated, case-insensitive \P{}). So I'll add them, to lock in this behaviour during any future refactoring.

NWilson added a commit to NWilson/pcre2 that referenced this issue Nov 6, 2024
PCRE2's behaviour matches Perl currently, and Python's regex module
(the standard re module doesn't have \P{...}).

However, JavaScript behaviour changed recently (when they added the new
'v' flag; the old 'u' flag's behaviour is unchanged), so this case is
worth additional testing.

Raised in issue PCRE2Project#497.
@NWilson
Copy link
Contributor

NWilson commented Nov 6, 2024

I think this can be closed now?

zherczeg pushed a commit that referenced this issue Nov 6, 2024
PCRE2's behaviour matches Perl currently, and Python's regex module
(the standard re module doesn't have \P{...}).

However, JavaScript behaviour changed recently (when they added the new
'v' flag; the old 'u' flag's behaviour is unchanged), so this case is
worth additional testing.

Raised in issue #497.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants