Simplify range data construction. #496

zherczeg · 2024-09-25T13:50:10Z

This patch uses the computed ranges to generate byte code rather than using add_to_class. It is a considerable simplification of the code.

carenas · 2024-09-25T13:56:20Z

src/pcre2_compile_class.c

-      size_t usize = utf_caseless_extend(start_char, *ptr++, options, buffer);
-      if (buffer != NULL) buffer += usize;
-      total_size += usize;
+      size = utf_caseless_extend(start_char, *ptr++, options, buffer);


FYI this will revert Philip's last fix for -Wshadow. Is that intended?

Yes. I wanted to remove the size_t but forgot to.

Never mind, I should have pulled it first before commenting; I guess that is why it was using size in the original to begin with ;). Nice work, and yes, GitHub is acting weird today with comments: yours didn't refresh after I posted mine.

Is this the last of your fixes needed to close #469?

Oh no. I found another issue with negated ASCII classes. And this is still just the range merge; the logarithmic search is still far off.
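For readers unfamiliar with the two steps mentioned here, the following is a minimal sketch of what "range merge" and "logarithmic search" could look like in isolation. It is not the code in src/pcre2_compile_class.c: the `cp_range` type and the `merge_ranges` and `ranges_contain` functions are hypothetical names invented for this illustration, and sorting with `qsort` is an assumption, not necessarily what the patch does.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical type: an inclusive range of code points [start, end]. */
typedef struct { uint32_t start, end; } cp_range;

/* Orders ranges by start, then by end. */
static int cmp_range(const void *a, const void *b)
{
  const cp_range *x = a, *y = b;
  if (x->start != y->start) return (x->start < y->start) ? -1 : 1;
  if (x->end != y->end) return (x->end < y->end) ? -1 : 1;
  return 0;
}

/* Sorts r in place, merges overlapping or adjacent ranges,
   and returns the number of ranges that remain. */
size_t merge_ranges(cp_range *r, size_t n)
{
  size_t out = 0;
  assert(r != NULL || n == 0);
  if (n == 0) return 0;
  qsort(r, n, sizeof(cp_range), cmp_range);
  for (size_t i = 1; i < n; i++)
  {
    /* Overlapping, or adjacent (start is exactly end + 1): extend. */
    if (r[i].start <= r[out].end || r[i].start - r[out].end == 1)
    {
      if (r[i].end > r[out].end) r[out].end = r[i].end;
    }
    else
      r[++out] = r[i];
  }
  return out + 1;
}

/* Binary search over sorted, non-overlapping ranges: O(log n) lookups. */
int ranges_contain(const cp_range *r, size_t n, uint32_t c)
{
  size_t lo = 0, hi = n;
  while (lo < hi)
  {
    size_t mid = lo + (hi - lo) / 2;
    if (c < r[mid].start) hi = mid;
    else if (c > r[mid].end) lo = mid + 1;
    else return 1;
  }
  return 0;
}
```

Once the ranges are sorted and disjoint, a membership test no longer needs a linear scan, which is presumably why the merge comes first.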

zherczeg · 2024-09-26T13:45:54Z

@PhilipHazel if you agree with my suggestion in #497, this patch is ready.

carenas · 2024-09-26T16:20:41Z

I think the following assertion is not correct:

Obviously a character class and its negated form cannot match to the same character

In PCRE2 it can; the reasons are historical and are described in #186.

In summary, our equivalent of Perl's "/u" requires both the utf and ucp modifiers to be set.

zherczeg · 2024-09-26T16:47:45Z

I am not sure I understand that part; it talks about configuring the modifier. Normally, if [C] matches something, [^C] must not match it, except for invalid UTF characters, which never match anything (like NaN for numbers).

carenas · 2024-09-26T17:02:03Z

The point is that without PCRE2_UCP (as you pointed out) all characters above 255 (in the 8-bit library) are not defined, so any [^C] would match them if PCRE2_UTF is enabled. As you pointed out, Perl has no non-UCP mode, but we do, and we even have UCP mode without UTF (e.g. in the 16-bit library).

I agree with you that they "shouldn't" match and that this is arguably a bug, but it is the currently expected behaviour when ONLY one of those options is set.

The "ambiguity" is resolved at compile time by the redefinition of \D that PCRE2_UCP drives as shown by:

PCRE2 version 10.44 2024-06-07 (8-bit)
  re> /[^\D\P{Nd}]/B,utf,ucp
------------------------------------------------------------------
        Bra
        [^\P{Nd}\P{Nd}]
        Ket
        End
------------------------------------------------------------------
data> \x{1d7cf}
 0: \x{1d7cf}
data> 
  re> /[\D\P{Nd}]/B,utf,ucp
------------------------------------------------------------------
        Bra
        [\P{Nd}\P{Nd}]
        Ket
        End
------------------------------------------------------------------
data> \x{1d7cf}
No match

carenas · 2024-09-26T17:23:39Z

Indeed, I think this might have just introduced a regression:

PCRE2 version 10.44 2024-06-07 (8-bit)
  re> /[^\D\P{Nd}]/B,utf,ascii_bsd
------------------------------------------------------------------
        Bra
        [^\x00-/:-\xff\P{Nd}]
        Ket
        End
------------------------------------------------------------------
PCRE2 version 10.45-DEV 2024-06-09 (8-bit)
  re> /[^\D\P{Nd}]/B,utf,ascii_bsd
------------------------------------------------------------------
        Bra
        [^\x00-/:-\xff\P{Nd}\x{100}-\x{10ffff}]
        Ket
        End
------------------------------------------------------------------

At least this one works, but the B output is confusing (not an issue introduced with this patch though):

 re> /[^\d]/B,utf,ascii_bsd
------------------------------------------------------------------
        Bra
        [\x00-/:-\xff] (neg)
        Ket
        End
------------------------------------------------------------------
data> 1
No match

zherczeg · 2024-09-26T17:38:01Z

It fixed that regression, and this is what I am talking about. \D matches anything that is not [0-9], which includes all characters above 255.

  re> /[\d]/B,utf
------------------------------------------------------------------
        Bra
        [0-9]
        Ket
        End
------------------------------------------------------------------
  re> /[\D]/B,utf
------------------------------------------------------------------
        Bra
        [\x00-/:-\xff] (neg)
        Ket
        End
------------------------------------------------------------------

The [\x00-/:-\xff] (neg) is the same as [\x00-/:-\xff\x{100}-\x{10ffff}]. This is negated above with ^.
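The equivalence above can be checked with plain range arithmetic. The following is a minimal sketch, assuming code points run from 0 to 0x10ffff and that a class is stored as sorted, non-overlapping inclusive ranges; `cp_range`, `MAX_CP` and `complement_ranges` are hypothetical names for this illustration, not PCRE2 identifiers.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_CP 0x10ffffu  /* assumed upper bound: last Unicode code point */

/* Hypothetical type: an inclusive range of code points [start, end]. */
typedef struct { uint32_t start, end; } cp_range;

/* Writes the complement of `in` (sorted, non-overlapping, within
   0..MAX_CP) to `out`, which must have room for n + 1 entries.
   Returns the number of ranges written. */
size_t complement_ranges(const cp_range *in, size_t n, cp_range *out)
{
  size_t count = 0;
  uint32_t next = 0;  /* first code point not yet accounted for */

  for (size_t i = 0; i < n; i++)
  {
    assert(in[i].start <= in[i].end && in[i].end <= MAX_CP);
    /* Emit the gap before this range, if there is one. */
    if (in[i].start > next)
      out[count++] = (cp_range){ next, in[i].start - 1 };
    next = in[i].end + 1;
  }

  /* Emit the tail after the last range, up to MAX_CP. */
  if (next <= MAX_CP)
    out[count++] = (cp_range){ next, MAX_CP };
  return count;
}
```

Complementing [0-9] (0x30 to 0x39) this way yields 0x00 to 0x2f and 0x3a to 0x10ffff. That is the same set as [\x00-/:-\xff\x{100}-\x{10ffff}] with the two adjacent upper ranges joined, and complementing it again gives back [0-9].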

carenas reviewed Sep 25, 2024

zherczeg marked this pull request as ready for review September 25, 2024 14:01

zherczeg force-pushed the simplify_class branch from d664f93 to 2c74c97 Compare September 25, 2024 14:05

zherczeg marked this pull request as draft September 26, 2024 10:01

zherczeg force-pushed the simplify_class branch 2 times, most recently from 8a78a3a to 0713f09 Compare September 26, 2024 12:44

zherczeg marked this pull request as ready for review September 26, 2024 12:45

zherczeg force-pushed the simplify_class branch from 0713f09 to 22d43ce Compare September 26, 2024 12:49

Simplify class range processing

c64a84b

zherczeg force-pushed the simplify_class branch from 22d43ce to c64a84b Compare September 26, 2024 13:03

PhilipHazel merged commit 46668dd into PCRE2Project:master Sep 26, 2024
14 checks passed

zherczeg deleted the simplify_class branch September 26, 2024 16:45
