Even in 8-bit mode, perform range computation for char classes if UCP flag is set #527

alexdowad · 2024-10-13T12:12:35Z

When testing another patch, I discovered that #474 caused a small change in the behavior of character classes when caseless mode and UCP were enabled.

Thank you to Zoltan Herczeg for suggesting a fix.

Closes GH-526.

alexdowad · 2024-10-13T12:14:09Z

@zherczeg, I've been going through your char class code and have figured out how some parts of it work, though there is still a lot which I don't understand well. If I understood your comment on #526 well, you suggested that this is what is needed to fix the unexpected behavior change from #474 which I discovered. Did I understand you correctly?

zherczeg

I worry that this is not enough. The caseless 's' and 'k' (and turkish 'i' soon) might force an xclass, we need to exclude them in some way.

A check around here could help probably, to terminate processing ranges >256 in 8 bit mode.

https://github.com/PCRE2Project/pcre2/blob/master/src/pcre2_compile.c#L6383

src/pcre2_compile_class.c

zherczeg · 2024-10-13T13:08:30Z

Btw quite good patch for first attempt.

carenas · 2024-10-14T03:55:08Z

The caseless 's' and 'k' (and turkish 'i' soon) might force an xclass,

  re> /[sk]/iBI,ucp
------------------------------------------------------------------
        Bra
        [KSks]
        Ket
        End
------------------------------------------------------------------
Capture group count = 0
Options: caseless ucp
Starting code units: K S k s
Subject length lower bound = 1
data> 
  re> /[sk]/iBI,ucp,utf
------------------------------------------------------------------
        Bra
        [KSks\x{17f}\x{212a}]
        Ket
        End
------------------------------------------------------------------
Capture group count = 0
Options: caseless ucp utf
Starting code units: K S k s \xc5 \xe2
Subject length lower bound = 1

Without utf those characters can't be represented in the class.

zherczeg · 2024-10-14T04:23:00Z

In 8 bit mode, they cannot, but in 16 bit mode, they can.

For me 8 bit + ucp is the same as 16 bit + ucp, except characters > 255 are simply not present, as characters > 65535 are simply not present for 16 bit mode.

It looks like perl has a different approach for this.

carenas · 2024-10-14T05:31:47Z

The main problem I have with our approach is that we are just grossly misinterpreting characters with UCP and not UTF. It is just not the same character.

UCP without UTF makes sense in the 16/32bit libraries were it could represent UCS, but in the 8 bit library we are reading 1 byte and assigning it properties that belong to a different character, just because the ord was the same.

alexdowad · 2024-10-14T05:39:00Z

Carlo, if UCP mode should be 'banned' for the 8-bit library because it doesn't make sense (which I tend to agree with), I suggest a separate issue could be opened to discuss that issue. I don't know the policies followed by PCRE2, but for some other OSS projects which I have worked on before, this would require a deprecation first.

If UCP was banned for the 8-bit library (i.e. using the PCRE2_UCP compile flag or (*UCP) would cause compilation to fail with an error), then all code to support that use case could be removed.

In the meantime, as much as it doesn't make sense, I think that UCP+8bit should be supported, because there is nothing in the documentation saying that it is illegal. The obvious thing for UCP+8bit to do is to treat each byte as a Unicode code point from U+0000 up to U+00FF.

There are a lot of things I don't know well here, so I may be completely off base.

zherczeg · 2024-10-14T06:43:46Z

The obvious thing for UCP+8bit to do is to treat each byte as a Unicode code point from U+0000 up to U+00FF.

As far as I know this is not happening. We use the pre-compiled tables for many things, and use ucp for other things. The pre-compiled tables depend on locals, and they can be very different from unicode codepoints.

We might need to clean up these things at some point.

PhilipHazel · 2024-10-14T07:22:57Z

The obvious thing for UCP+8bit to do is to treat each byte as a Unicode code point from U+0000 up to U+00FF.

Yes, and I think that was behind the thinking when PCRE2_UCP was allowed in 8-bit mode. Note this change from 10.35:

Changes in many areas of the code so that when Unicode is supported and
PCRE2_UCP is set without PCRE2_UTF, Unicode character properties are used for
upper/lower case computations on characters whose code points are greater than

There must have been a reason for this but I can't remember it. It might have related to 16/32-bit. I think this is a relatively minor issue, because there are not likely to be many (any?) 8-bit use cases where PCRE2_UCP is set without PCRE2_UCP. So we shouldn't spend a lot of time on it. Simplest not to introduce any incompatibilities.

zherczeg · 2024-10-14T08:35:57Z

I agree. Fortunately the new code is perfectly capable of generating the full bitmask for any properties, and handling anything caseless ranges. We just make to ensure that XCLASS is never generated in this case.

https://github.com/PCRE2Project/pcre2/blob/master/src/pcre2_compile_class.c#L531

The 8 bit case simply does not worth any optimizations though. The extra checks should be guarded to 8 bit as well.

NWilson · 2024-10-14T09:19:41Z

The obvious thing for UCP+8bit to do is to treat each byte as a Unicode code point from U+0000 up to U+00FF.

That seems reasonable to me, and it's what I've been assuming.

Note that this aligns well with Latin-1, which is a popular 8-bit encoding that was grandfathered into Unicode to fill the codepoints U+0080 to U+00FF. There is quite a lot of text out there where using 8-bit+UCP for Latin-1 interpretation would be accurate.

… flag is set When testing another patch, I discovered that PCRE2Project#474 caused a small change in the behavior of character classes when caseless mode and UCP were enabled. Thank you to Zoltan Herczeg for suggesting a fix. Closes PCRE2ProjectGH-526.

alexdowad · 2024-10-15T13:28:23Z

@zherczeg, I've tried to apply your advice, please have a look and tell me if this looks right or not.

zherczeg · 2024-10-15T14:10:11Z

src/pcre2_compile.c

@@ -6479,6 +6485,9 @@ for (;; pptr++)
 #ifdef SUPPORT_WIDE_CHARS /* Defined for 16/32 bits, or 8-bit with Unicode */
 if ((xclass_props & XCLASS_REQUIRED) != 0)
 {
+ /* We should never generate a (useless) xclass in 8-bit library when UTF flag is false */
+ PCRE2_ASSERT(PCRE2_CODE_UNIT_WIDTH != 8 || utf);


This is usually an #if since it looks better, even if this variant is working.

zherczeg · 2024-10-15T14:11:06Z

src/pcre2_compile.c

+ /* If code unit width is 8 bits, and UCP flag is set, but UTF flag is not, we still
+ * generate cranges, but in that case we should not process any crange > 0xFF,
+ * because it's impossible to encounter code points > 0xFF in the subject string */
+ if (utf)


Do we need code indentation below? An if (!utf) range = end; also works. I have no preference.

Avoiding conditional indentation can only help making the logic easier to follow IMHO

alexdowad · 2024-10-16T13:16:24Z

Update: The results of the CI build showed that my added if (utf) was not enough to prevent a (useless) xclass from being generated in some cases. It turns out that this would still happen for some regexes containing \p escapes.

I tried further suppressing this, but right now, when I run the test suite with ASan enabled, it is detecting a heap buffer overflow. I will analyze further and figure out why this is the case.

alexdowad force-pushed the fixucp branch from d4aba9d to d50527d Compare October 13, 2024 12:39

zherczeg requested changes Oct 13, 2024

View reviewed changes

src/pcre2_compile_class.c Outdated Show resolved Hide resolved

alexdowad force-pushed the fixucp branch from d50527d to bd74ff7 Compare October 15, 2024 13:28

zherczeg requested changes Oct 15, 2024

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Even in 8-bit mode, perform range computation for char classes if UCP flag is set #527

Even in 8-bit mode, perform range computation for char classes if UCP flag is set #527

alexdowad commented Oct 13, 2024

alexdowad commented Oct 13, 2024

zherczeg left a comment

zherczeg commented Oct 13, 2024

carenas commented Oct 14, 2024 •

edited

Loading

zherczeg commented Oct 14, 2024

carenas commented Oct 14, 2024

alexdowad commented Oct 14, 2024

zherczeg commented Oct 14, 2024

PhilipHazel commented Oct 14, 2024

zherczeg commented Oct 14, 2024

NWilson commented Oct 14, 2024

alexdowad commented Oct 15, 2024

zherczeg Oct 15, 2024

zherczeg Oct 15, 2024

carenas Oct 15, 2024

alexdowad commented Oct 16, 2024

Even in 8-bit mode, perform range computation for char classes if UCP flag is set #527

Are you sure you want to change the base?

Even in 8-bit mode, perform range computation for char classes if UCP flag is set #527

Conversation

alexdowad commented Oct 13, 2024

alexdowad commented Oct 13, 2024

zherczeg left a comment

Choose a reason for hiding this comment

zherczeg commented Oct 13, 2024

carenas commented Oct 14, 2024 • edited Loading

zherczeg commented Oct 14, 2024

carenas commented Oct 14, 2024

alexdowad commented Oct 14, 2024

zherczeg commented Oct 14, 2024

PhilipHazel commented Oct 14, 2024

zherczeg commented Oct 14, 2024

NWilson commented Oct 14, 2024

alexdowad commented Oct 15, 2024

zherczeg Oct 15, 2024

Choose a reason for hiding this comment

zherczeg Oct 15, 2024

Choose a reason for hiding this comment

carenas Oct 15, 2024

Choose a reason for hiding this comment

alexdowad commented Oct 16, 2024

carenas commented Oct 14, 2024 •

edited

Loading