Pattern /[^\D\P{Nd}]/utf matches to \x{1d7cf} #497

zherczeg · 2024-09-26T09:50:17Z

This test is in testinput5:

  re> /[^\D\P{Nd}]/B,utf
------------------------------------------------------------------
        Bra
        [^\x00-/:-\xff\P{Nd}]
        Ket
        End
------------------------------------------------------------------
data> \x{1d7cf}
 0: \x{1d7cf}

Currently this pattern matches to \x{1d7cf}

Since \D is in ascii mode (ucp is not enabled), /[\D]/utf matches to anything not 0-9. That should include \x{1d7cf}. This looks true:

  re> /[\D\P{Nd}]/B,utf
------------------------------------------------------------------
        Bra
        [\x00-/:-\xff\P{Nd}\x{100}-\x{10ffff}]
        Ket
        End
------------------------------------------------------------------
data> \x{1d7cf}
 0: \x{1d7cf}

Note: /[\P{Nd}]/utf does not match to \x{1d7cf}

Summary: both /[^\D\P{Nd}]/utf and /[\D\P{Nd}]/utf matches to \x{1d7cf}. Obviously a character class and its negated form cannot match to the same character, and I think the first one is incorrect. My newest code changes this pattern to no-match, and I wanted to discuss it.

The text was updated successfully, but these errors were encountered:

NWilson · 2024-09-26T10:15:50Z

This is a very interesting case!

There's something related here - which is the behaviour of \P{...} with /i modifier.

They specifically changed the behaviour of [^\P{...}] in ECMAScript. If you run the regex using /[^\P{...}]/iu vs using /[^\P{...}]/iv in JavaScript you get different results. (Note the change from "/u Unicode" to "/v Next-gen Unicode".)

They discussed this change here: tc39/proposal-regexp-v-flag#30

(Note - there's a lot of confusion and irrelevant chatter on that thread. Most of it just irrelevant.)

How character classes work

Take the characters and ranges, and union them to form a "set" of characters
Things like \D mean "the complement/inverted-set" for whatever characters are in the \d set. This is applied immediately, at compile-time.
Similarly for \p{...} and \P{...}.
Then, ^ is applied last to complement/invert the set of matched characters.
The input/match character is then tested to see if it's contained in the set.

With the /i modifier:

Each character that appears in the set is "case-folded".
- So [A] means → add case_fold('A') to the set of matched characters
- And [A-B] means → take the set of characters {A...B} and form the set {case_fold(c) for c ∈ {A...B}}, and add that
- Obviously the input (match) string is read character-by-character and the input characters are case-folded too before matching against the set
- The input string is case-handled character-by-character because the flag can be set using (?i:[...]) to apply to portions of the regex only.
The ambiguity that the ECMAScript guys got hung up on is around \P{...}
- The old behaviour was: form the set for \p{...}. Then invert it to form the \P{...} set. Then finally form the set {case_fold(c) for c ∈ \P{...}}
- The new behaviour in ECMAScript is to change the order. Form the \p{..} set, then case-fold it, then invert it, so that /\P{...}/i means complement({case_fold(c) for c ∈ \p{...}})

Things like \d and \D behave the same as \p{...} or \P{...}. They are all just shorthands for a set of characters.

zherczeg · 2024-09-26T10:31:17Z

Which \p{} is affected by case folding? It seemed to me (and I assumed), that classes always contains the other cases of all of its characters. Most properties are scripts or control characters. I suspect some extended script is affected.

NWilson · 2024-09-26T13:44:46Z

Well \P{Ll} and \P{Lu} do, maybe others. It drastically affects their meaning under /i, whether you do the case-fold before or after the inversion implied by \P.

zherczeg · 2024-09-26T13:56:15Z

True. In PCRE2, /i does not affect properties. This way we don't need to generate that many databases.

carenas · 2024-09-26T15:55:20Z

In PCRE2, /i does not affect properties

Note that is no longer the case since #432

PhilipHazel · 2024-09-26T16:21:47Z

Perl does this:

Perl v5.40.0

/[\D\P{Nd}]/utf
\x{1d7cf}
No match

/[^\D\P{Nd}]/utf
\x{1d7cf}
0: \x{1d7cf}

Which I think is right - PCRE2 is currently wrong. As the character is greater than 255 and UCP is not set, the bit map set up by \D is not relevant and only \P should count. However, if the first pattern is just [\P{Nd}] there is no match. So there is indeed a bug in PCRE2 and I think it is /[\D\P{Nd}]/utf that is incorrect.

PhilipHazel · 2024-09-26T16:29:56Z

I'm not sure in #497 which one you think is wrong and have fixed...

zherczeg · 2024-09-26T16:37:11Z

\D matches to anything that is not 0-9, that includes all > 255 characters. Perl uses unicode for \D, that is why /[\D\P{Nd}]/utf does not match. [\d] matches to \x{1d7cf} in perl. Perl has no non-ucp mode.

PhilipHazel · 2024-09-26T16:44:54Z

Ah yes, of course. I was forgetting that. OK, let's merge your patch.

NWilson · 2024-11-06T10:34:59Z

I made a random digression above, about the behaviour of /\P{...}/i.

PCRE2's behaviour matches Perl currently, and Python's regex module (the standard re module doesn't have \P{...}), and also JavaScript with the new v flag.

So our behaviour is all good, but I think we're maybe missing tests for (negated, case-insensitive \P{}). So I'll add them, to lock in this behaviour during any future refactoring.

PCRE2's behaviour matches Perl currently, and Python's regex module (the standard re module doesn't have \P{...}). However, JavaScript behaviour changed recently (when they added the new 'v' flag; the old 'u' flag's behaviour is unchanged), so this case is worth additional testing. Raised in issue PCRE2Project#497.

NWilson · 2024-11-06T10:44:24Z

I think this can be closed now?

PCRE2's behaviour matches Perl currently, and Python's regex module (the standard re module doesn't have \P{...}). However, JavaScript behaviour changed recently (when they added the new 'v' flag; the old 'u' flag's behaviour is unchanged), so this case is worth additional testing. Raised in issue #497.

zherczeg mentioned this issue Sep 26, 2024

Simplify range data construction. #496

Merged

NWilson mentioned this issue Nov 6, 2024

Add additional tests for negated /[^\P{Lu}]/i behaviour #550

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Pattern /[^\D\P{Nd}]/utf matches to \x{1d7cf} #497

Pattern /[^\D\P{Nd}]/utf matches to \x{1d7cf} #497

zherczeg commented Sep 26, 2024 •

edited

Loading

NWilson commented Sep 26, 2024 •

edited

Loading

zherczeg commented Sep 26, 2024

NWilson commented Sep 26, 2024

zherczeg commented Sep 26, 2024

carenas commented Sep 26, 2024

PhilipHazel commented Sep 26, 2024

PhilipHazel commented Sep 26, 2024

zherczeg commented Sep 26, 2024

PhilipHazel commented Sep 26, 2024

NWilson commented Nov 6, 2024

NWilson commented Nov 6, 2024

Pattern /[^\D\P{Nd}]/utf matches to \x{1d7cf} #497

Pattern /[^\D\P{Nd}]/utf matches to \x{1d7cf} #497

Comments

zherczeg commented Sep 26, 2024 • edited Loading

NWilson commented Sep 26, 2024 • edited Loading

How character classes work

zherczeg commented Sep 26, 2024

NWilson commented Sep 26, 2024

zherczeg commented Sep 26, 2024

carenas commented Sep 26, 2024

PhilipHazel commented Sep 26, 2024

PhilipHazel commented Sep 26, 2024

zherczeg commented Sep 26, 2024

PhilipHazel commented Sep 26, 2024

NWilson commented Nov 6, 2024

NWilson commented Nov 6, 2024

zherczeg commented Sep 26, 2024 •

edited

Loading

NWilson commented Sep 26, 2024 •

edited

Loading