Even in 8-bit mode, perform range computation for char classes if UCP flag is set #527

alexdowad
Contributor

When testing another patch, I discovered that #474 caused a small change in the behavior of character classes when caseless mode and UCP were enabled.

Thank you to Zoltan Herczeg for suggesting a fix.

Closes GH-526.

@alexdowad
Contributor Author

@zherczeg, I've been going through your char class code and have figured out how some parts of it work, though there is still a lot which I don't understand well. If I understood your comment on #526 well, you suggested that this is what is needed to fix the unexpected behavior change from #474 which I discovered. Did I understand you correctly?

Collaborator

@zherczeg zherczeg left a comment

I worry that this is not enough. The caseless 's' and 'k' (and Turkish 'i' soon) might force an xclass, so we need to exclude them in some way.

A check around here would probably help, to stop processing ranges above 0xFF in 8-bit mode.

https://github.com/PCRE2Project/pcre2/blob/master/src/pcre2_compile.c#L6383

@zherczeg
Collaborator

Btw, quite a good patch for a first attempt.

@carenas
Contributor

carenas commented Oct 14, 2024

> The caseless 's' and 'k' (and Turkish 'i' soon) might force an xclass,

  re> /[sk]/iBI,ucp
------------------------------------------------------------------
        Bra
        [KSks]
        Ket
        End
------------------------------------------------------------------
Capture group count = 0
Options: caseless ucp
Starting code units: K S k s
Subject length lower bound = 1
data> 
  re> /[sk]/iBI,ucp,utf
------------------------------------------------------------------
        Bra
        [KSks\x{17f}\x{212a}]
        Ket
        End
------------------------------------------------------------------
Capture group count = 0
Options: caseless ucp utf
Starting code units: K S k s \xc5 \xe2
Subject length lower bound = 1

Without UTF, those characters can't be represented in the class.

@zherczeg
Collaborator

In 8-bit mode they cannot, but in 16-bit mode they can.

For me, 8-bit + UCP is the same as 16-bit + UCP, except that characters > 255 are simply not present, just as characters > 65535 are simply not present in 16-bit mode.

It looks like Perl has a different approach for this.
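
To make the "simply not present" idea concrete, here is a minimal sketch in C. It assumes hand-written caseless sets for 'k' and 's' (the names `set_k`, `set_s`, and `add_caseless_set` are hypothetical and do not come from the PCRE2 sources); the real data lives in PCRE2's Unicode tables. Members above the largest representable code point are dropped, and only members that are representable but do not fit in the 256-bit map would call for an extended class.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical caseless sets, written out by hand for illustration only. */
static const uint32_t set_k[] = { 0x4B, 0x6B, 0x212A };  /* K, k, KELVIN SIGN */
static const uint32_t set_s[] = { 0x53, 0x73, 0x017F };  /* S, s, LONG S */

/* Add every member of `set` that can occur in a subject string.
   Members above `max_cp` are skipped: they are "simply not present".
   Members below 256 go into the 256-bit map. The return value counts
   members that are representable but too large for the bitmap, i.e.
   the ones that would need an extended class. */
static size_t add_caseless_set(uint8_t bitmap[32], const uint32_t *set,
                               size_t n, uint32_t max_cp)
{
  size_t wide = 0;
  for (size_t i = 0; i < n; i++)
    {
    uint32_t c = set[i];
    if (c > max_cp) continue;
    if (c < 256) bitmap[c >> 3] |= (uint8_t)(1u << (c & 7));
    else wide++;
    }
  return wide;
}

/* Test whether code point c (< 256) is set in the bitmap. */
static int bitmap_has(const uint8_t bitmap[32], uint32_t c)
{
  return (bitmap[c >> 3] >> (c & 7)) & 1;
}
```

Under these assumptions, calling `add_caseless_set` with `max_cp = 0xFF` (8-bit, no UTF) never reports a wide member, while `max_cp = 0xFFFF` (16-bit) reports one for each of the two sets.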

@carenas
Contributor

carenas commented Oct 14, 2024

The main problem I have with our approach is that we are just grossly misinterpreting characters with UCP and not UTF. It is just not the same character.

UCP without UTF makes sense in the 16/32-bit libraries, where it could represent UCS, but in the 8-bit library we are reading 1 byte and assigning it properties that belong to a different character, just because the ord was the same.
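
A small sketch of the mismatch described here, assuming the subject happens to hold UTF-8 text. The helper names `decode_utf8_2` and `decode_byte` are made up for this illustration and are not PCRE2 functions. The same two bytes yield U+017F when decoded as UTF-8, but U+00C5 and U+00BF when each byte is taken as its own code point.

```c
#include <assert.h>
#include <stdint.h>

/* Decode one two-byte UTF-8 sequence (lead byte 0xC2..0xDF).
   Returns 0 for anything else; this sketch handles only that one form. */
static uint32_t decode_utf8_2(const unsigned char *p)
{
  if (p[0] < 0xC2 || p[0] > 0xDF) return 0;
  if ((p[1] & 0xC0) != 0x80) return 0;
  return ((uint32_t)(p[0] & 0x1F) << 6) | (uint32_t)(p[1] & 0x3F);
}

/* Without UTF, each 8-bit code unit is treated as a character by itself. */
static uint32_t decode_byte(const unsigned char *p)
{
  return p[0];
}
```

For the byte pair `0xC5 0xBF`, `decode_utf8_2` gives 0x17F (LATIN SMALL LETTER LONG S), whereas `decode_byte` gives 0xC5 and then 0xBF, which are different characters with different properties.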

@alexdowad
Contributor Author

Carlo, if UCP mode should be 'banned' for the 8-bit library because it doesn't make sense (which I tend to agree with), I suggest a separate issue could be opened to discuss that issue. I don't know the policies followed by PCRE2, but for some other OSS projects which I have worked on before, this would require a deprecation first.

If UCP was banned for the 8-bit library (i.e. using the PCRE2_UCP compile flag or (*UCP) would cause compilation to fail with an error), then all code to support that use case could be removed.

In the meantime, as much as it doesn't make sense, I think that UCP+8bit should be supported, because there is nothing in the documentation saying that it is illegal. The obvious thing for UCP+8bit to do is to treat each byte as a Unicode code point from U+0000 up to U+00FF.

There are a lot of things I don't know well here, so I may be completely off base.

@zherczeg
Collaborator

> The obvious thing for UCP+8bit to do is to treat each byte as a Unicode code point from U+0000 up to U+00FF.

As far as I know this is not happening. We use the pre-compiled tables for many things, and use UCP for other things. The pre-compiled tables depend on locales, and they can be very different from Unicode code points.

We might need to clean up these things at some point.

@PhilipHazel
Collaborator

> The obvious thing for UCP+8bit to do is to treat each byte as a Unicode code point from U+0000 up to U+00FF.

Yes, and I think that was behind the thinking when PCRE2_UCP was allowed in 8-bit mode. Note this change from 10.35:

  1. Changes in many areas of the code so that when Unicode is supported and
    PCRE2_UCP is set without PCRE2_UTF, Unicode character properties are used for
    upper/lower case computations on characters whose code points are greater than

There must have been a reason for this but I can't remember it. It might have been related to 16/32-bit. I think this is a relatively minor issue, because there are not likely to be many (any?) 8-bit use cases where PCRE2_UCP is set without PCRE2_UTF. So we shouldn't spend a lot of time on it. Simplest not to introduce any incompatibilities.

@zherczeg
Collaborator

I agree. Fortunately the new code is perfectly capable of generating the full bitmask for any properties, and of handling any caseless ranges. We just need to ensure that XCLASS is never generated in this case.

https://github.com/PCRE2Project/pcre2/blob/master/src/pcre2_compile_class.c#L531

The 8-bit case is simply not worth any optimizations, though. The extra checks should be guarded so that they apply to 8-bit only as well.

@NWilson
Contributor

NWilson commented Oct 14, 2024

> The obvious thing for UCP+8bit to do is to treat each byte as a Unicode code point from U+0000 up to U+00FF.

That seems reasonable to me, and it's what I've been assuming.

Note that this aligns well with Latin-1, which is a popular 8-bit encoding that was grandfathered into Unicode to fill the codepoints U+0080 to U+00FF. There is quite a lot of text out there where using 8-bit+UCP for Latin-1 interpretation would be accurate.
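
As a rough illustration of what a Latin-1 reading implies for caseless matching, the sketch below hard-codes the simple case pairs that stay inside U+0000 to U+00FF. The function name `latin1_other_case` is hypothetical and the rule is written by hand; it is not how PCRE2 computes case, which (as noted earlier in the thread) may involve locale-dependent tables.

```c
#include <assert.h>
#include <stdint.h>

/* Return the other case of a code point in U+0000..U+00FF when that
   counterpart also lies in U+0000..U+00FF; otherwise return c unchanged.
   U+00D7 (multiplication sign) and U+00F7 (division sign) have no case.
   U+00FF and U+00B5 have uppercase forms (U+0178, U+039C) that are out
   of range here, so they are left alone. */
static uint32_t latin1_other_case(uint32_t c)
{
  if (c >= 'a' && c <= 'z') return c - 0x20;
  if (c >= 'A' && c <= 'Z') return c + 0x20;
  if (c >= 0xE0 && c <= 0xFE && c != 0xF7) return c - 0x20;
  if (c >= 0xC0 && c <= 0xDE && c != 0xD7) return c + 0x20;
  return c;
}
```

With this rule, 0xE9 (é) pairs with 0xC9 (É), while 0xFF (ÿ) and 0xB5 (µ) map to themselves because their uppercase forms cannot appear in an 8-bit subject without UTF.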

@alexdowad
Contributor Author

@zherczeg, I've tried to apply your advice. Please have a look and tell me whether this looks right.

@@ -6479,6 +6485,9 @@ for (;; pptr++)
#ifdef SUPPORT_WIDE_CHARS /* Defined for 16/32 bits, or 8-bit with Unicode */
if ((xclass_props & XCLASS_REQUIRED) != 0)
{
/* We should never generate a (useless) xclass in 8-bit library when UTF flag is false */
PCRE2_ASSERT(PCRE2_CODE_UNIT_WIDTH != 8 || utf);
Collaborator

This is usually done with an #if since it looks better, even though this variant works.
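
For comparison, the two styles could look roughly like this. This is only a sketch: `DEMO_CODE_UNIT_WIDTH` is a hypothetical stand-in for PCRE2_CODE_UNIT_WIDTH so that the fragment compiles on its own, and both function names are invented for the example.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for PCRE2_CODE_UNIT_WIDTH. */
#ifndef DEMO_CODE_UNIT_WIDTH
#define DEMO_CODE_UNIT_WIDTH 8
#endif

/* Variant A: a single expression. The width comparison is a constant,
   so the compiler can fold it away in the 16/32-bit builds. */
static bool xclass_permitted_expr(bool utf)
{
  return DEMO_CODE_UNIT_WIDTH != 8 || utf;
}

/* Variant B: a preprocessor guard, so the check only exists in the
   8-bit build at all. */
static bool xclass_permitted_guarded(bool utf)
{
#if DEMO_CODE_UNIT_WIDTH == 8
  return utf;
#else
  (void)utf;
  return true;
#endif
}
```

Both variants give the same answer for every input; the difference is only in how visibly the 8-bit special case is separated from the rest of the code.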

/* If code unit width is 8 bits, and UCP flag is set, but UTF flag is not, we still
* generate cranges, but in that case we should not process any crange > 0xFF,
* because it's impossible to encounter code points > 0xFF in the subject string */
if (utf)
Collaborator

Do we need code indentation below? An if (!utf) range = end; also works. I have no preference.

Contributor

Avoiding conditional indentation can only help make the logic easier to follow, IMHO.
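
The flat style suggested above might be sketched as follows. The data layout (a sorted array of start/end pairs) and the function name `count_wide_ranges` are assumptions made for this example, not the structures used in pcre2_compile_class.c. Instead of wrapping the wide-range loop in `if (utf)`, the cursor is moved to the end of the list so the loop body never runs.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* `ranges` holds n_pairs sorted [start, end] pairs (hypothetical layout).
   Returns how many ranges would still need handling beyond the bitmap. */
static size_t count_wide_ranges(const uint32_t *ranges, size_t n_pairs,
                                bool utf)
{
  const uint32_t *p = ranges;
  const uint32_t *end = ranges + 2 * n_pairs;
  size_t wide = 0;

  /* Ranges that end below 256 are fully covered by the bitmap. */
  while (p < end && p[1] < 256) p += 2;

  /* Guard clause: in this 8-bit sketch, without UTF no code point above
     0xFF can occur in the subject, so treat the list as exhausted. */
  if (!utf) p = end;

  for (; p < end; p += 2) wide++;
  return wide;
}
```

With the ranges for K, k, U+017F and U+212A, the function reports two wide ranges when `utf` is true and none when it is false, and the main loop stays at a single indentation level.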

@alexdowad
Contributor Author

Update: The results of the CI build showed that my added if (utf) was not enough to prevent a (useless) xclass from being generated in some cases. It turns out that this would still happen for some regexes containing \p escapes.

I tried further suppressing this, but right now, when I run the test suite with ASan enabled, it is detecting a heap buffer overflow. I will analyze further and figure out why this is the case.

Successfully merging this pull request may close these issues.

Bug in char class handling, in caseless mode, with UCP (caused by PR 474)