Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Simplify range data construction. #496

Merged
merged 1 commit into from
Sep 26, 2024

Conversation

zherczeg
Copy link
Collaborator

@zherczeg zherczeg commented Sep 25, 2024

This patch uses the computed ranges to generate byte code rather than using add_to_class. It is a considerable simplification of the code.

size_t usize = utf_caseless_extend(start_char, *ptr++, options, buffer);
if (buffer != NULL) buffer += usize;
total_size += usize;
size = utf_caseless_extend(start_char, *ptr++, options, buffer);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

FYI this will revert Phillip's last fix for -Wshadow, is that intended?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes. I wanted to remove the size_t but forgot it.

Copy link
Contributor

@carenas carenas Sep 25, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nevermind, shold had pulled it first before commenting, guess that is why it was using size in the original to begin with then ;), nice work, and yes GitHub is acting weird today with comments, yours didn't refresh after I posted mine.

is this the last from your fixes to close #469?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh no. I found another issue with negated ascii classes. And this is still just the range merge, the logarithmic search is still very far.

@zherczeg zherczeg marked this pull request as ready for review September 25, 2024 14:01
@zherczeg zherczeg marked this pull request as draft September 26, 2024 10:01
@zherczeg zherczeg force-pushed the simplify_class branch 2 times, most recently from 8a78a3a to 0713f09 Compare September 26, 2024 12:44
@zherczeg zherczeg marked this pull request as ready for review September 26, 2024 12:45
@zherczeg
Copy link
Collaborator Author

@PhilipHazel if you agree my suggestion in #497 , this patch is ready

@carenas
Copy link
Contributor

carenas commented Sep 26, 2024

I think the following assertion is not correct:

Obviously a character class and its negated form cannot match to the same character

in PCRE2 it can, and the reasons are historic and described in #186.

In summary our "/u" Perl equivalent requires both utf and ucp modifiers to be set

@PhilipHazel PhilipHazel merged commit 46668dd into PCRE2Project:master Sep 26, 2024
14 checks passed
@zherczeg zherczeg deleted the simplify_class branch September 26, 2024 16:45
@zherczeg
Copy link
Collaborator Author

I am not sure I understand that part, it talks about configuring the modifier. Normally if [C] matches to something, [^C] must not match to that except for invalid utf characters, which never matches to anything like NaN in numbers.

@carenas
Copy link
Contributor

carenas commented Sep 26, 2024

The point is that without PCRE2_UCP(as you pointed out) all characters above 255 (in the 8-bit library) are not defined, so any [^C] would match them if PCRE2_UTF is enabled. As you pointed out Perl has no non-UCP mode, but we do, and we even have UCP mode without UTF (ex: in the 16-bit library).

Agree with you that they "shouldn't" match and that is arguably a bug, but it is the currently expected behaviour when ONLY one of those options are set.

The "ambiguity" is resolved at compile time by the redefinition of \D that PCRE2_UCP drives as shown by:

PCRE2 version 10.44 2024-06-07 (8-bit)
  re> /[^\D\P{Nd}]/B,utf,ucp
------------------------------------------------------------------
        Bra
        [^\P{Nd}\P{Nd}]
        Ket
        End
------------------------------------------------------------------
data> \x{1d7cf}
 0: \x{1d7cf}
data> 
  re> /[\D\P{Nd}]/B,utf,ucp
------------------------------------------------------------------
        Bra
        [\P{Nd}\P{Nd}]
        Ket
        End
------------------------------------------------------------------
data> \x{1d7cf}
No match

@carenas
Copy link
Contributor

carenas commented Sep 26, 2024

indeed I think this might had just introduced a regression:

PCRE2 version 10.44 2024-06-07 (8-bit)
  re> /[^\D\P{Nd}]/B,utf,ascii_bsd
------------------------------------------------------------------
        Bra
        [^\x00-/:-\xff\P{Nd}]
        Ket
        End
------------------------------------------------------------------
PCRE2 version 10.45-DEV 2024-06-09 (8-bit)
  re> /[^\D\P{Nd}]/B,utf,ascii_bsd
------------------------------------------------------------------
        Bra
        [^\x00-/:-\xff\P{Nd}\x{100}-\x{10ffff}]
        Ket
        End
------------------------------------------------------------------

At least this one works, but the B output is confusing (not an issue introduced with this patch though):

 re> /[^\d]/B,utf,ascii_bsd
------------------------------------------------------------------
        Bra
        [\x00-/:-\xff] (neg)
        Ket
        End
------------------------------------------------------------------
data> 1
No match

@zherczeg
Copy link
Collaborator Author

It is fixed that regression, and this is what I am talking about. \D matches anything not [0-9], which includes all > 255 characters.

  re> /[\d]/B,utf
------------------------------------------------------------------
        Bra
        [0-9]
        Ket
        End
------------------------------------------------------------------
  re> /[\D]/B,utf
------------------------------------------------------------------
        Bra
        [\x00-/:-\xff] (neg)
        Ket
        End
------------------------------------------------------------------

The [\x00-/:-\xff] (neg) is the same as [\x00-/:-\xff\x{100}-\x{10ffff}]. This is negated above with ^.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants