Even in 8-bit mode, perform range computation for char classes if UCP flag is set

When testing another patch, I discovered that PCRE2Project#474 caused a small change
in the behavior of character classes when caseless mode and UCP were enabled.

Thank you to Zoltan Herczeg for suggesting a fix.

Closes PCRE2ProjectGH-526.
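
As a rough illustration of the behavior described above, the following is a minimal sketch of how a caseless class could be folded into a 256-bit map when UCP is set but UTF is not. It is not PCRE2 code: `latin1_other_case`, `class_add` and `class_has` are hypothetical names, and the case mapping is a simplification that assumes only the simple Unicode case pairs that fit in a single 8-bit code unit matter.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper (not part of PCRE2): the other case of a code point
   below 0x100 under Unicode rules, or c itself when the other case does not
   exist or does not fit in 8 bits (for example 0xFF, whose upper case is
   U+0178). 0xD7 and 0xF7 are the multiplication and division signs. */
static uint8_t latin1_other_case(uint8_t c)
{
  if (c >= 'A' && c <= 'Z') return (uint8_t)(c + 0x20);
  if (c >= 'a' && c <= 'z') return (uint8_t)(c - 0x20);
  if (c >= 0xC0 && c <= 0xDE && c != 0xD7) return (uint8_t)(c + 0x20);
  if (c >= 0xE0 && c <= 0xFE && c != 0xF7) return (uint8_t)(c - 0x20);
  return c;
}

/* Set the bit for c and, when caseless matching with UCP is requested,
   also the bit for its other case. */
static void class_add(uint8_t bitmap[32], uint8_t c, int caseless_ucp)
{
  assert(bitmap != NULL);
  bitmap[c / 8] |= (uint8_t)(1u << (c % 8));
  if (caseless_ucp)
    {
    uint8_t oc = latin1_other_case(c);
    bitmap[oc / 8] |= (uint8_t)(1u << (oc % 8));
    }
}

/* Return 1 if c is in the class, 0 otherwise. */
static int class_has(const uint8_t bitmap[32], uint8_t c)
{
  return (bitmap[c / 8] >> (c % 8)) & 1;
}
```

Calling `class_add(bitmap, 'a', 1)` and `class_add(bitmap, 0xC1, 1)` on a zeroed bitmap sets exactly the bits for `A`, `a`, `\xc1` and `\xe1`, which is the set reported as "Starting code units" in the test output added by this commit.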
alexdowad committed Oct 15, 2024
1 parent c9bf833 commit bd74ff7
Showing 4 changed files with 22 additions and 2 deletions.
11 changes: 10 additions & 1 deletion src/pcre2_compile.c
@@ -5891,7 +5891,7 @@ for (;; pptr++)
#if PCRE2_CODE_UNIT_WIDTH == 8
cranges = NULL;

-if (utf)
+if (utf || ucp)
#endif
{
if (lengthptr != NULL)
@@ -6388,6 +6388,12 @@ for (;; pptr++)
range = end;
}

+#if PCRE2_CODE_UNIT_WIDTH == 8
+/* If code unit width is 8 bits, and UCP flag is set, but UTF flag is not, we still
+* generate cranges, but in that case we should not process any crange > 0xFF,
+* because it's impossible to encounter code points > 0xFF in the subject string */
+if (utf)
+#endif
while (range < end)
{
uint32_t range_start = range[0];
@@ -6479,6 +6485,9 @@ for (;; pptr++)
#ifdef SUPPORT_WIDE_CHARS /* Defined for 16/32 bits, or 8-bit with Unicode */
if ((xclass_props & XCLASS_REQUIRED) != 0)
{
+/* We should never generate a (useless) xclass in 8-bit library when UTF flag is false */
+PCRE2_ASSERT(PCRE2_CODE_UNIT_WIDTH != 8 || utf);
+
*class_uchardata++ = XCL_END; /* Marks the end of extra data */
*code++ = OP_XCLASS;
code += LINK_SIZE;
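
The guard added around the `while (range < end)` loop can be pictured with the following minimal sketch. It is an illustration under stated assumptions, not the code in `pcre2_compile.c`: `collect_wide_ranges` is a hypothetical function, the code unit width is passed as a runtime argument instead of the `PCRE2_CODE_UNIT_WIDTH` macro, and the clamping of a range that straddles 0xFF is an assumption made only to keep the example complete.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical sketch (not PCRE2 code). Ranges are (start, end) pairs of
   code points. The part of each range above 0xFF is copied to out, and the
   number of pairs written is returned. In an 8-bit, non-UTF build no code
   point above 0xFF can occur in the subject, so nothing is collected. */
static size_t collect_wide_ranges(const uint32_t *range, const uint32_t *end,
                                  int code_unit_width, int utf, uint32_t *out)
{
  size_t n = 0;

  assert(range != NULL && end != NULL && out != NULL);
  assert(((end - range) % 2) == 0);

  /* Mirrors the intent of the new "if (utf)" guard in 8-bit mode. */
  if (code_unit_width == 8 && !utf) return 0;

  while (range < end)
    {
    uint32_t start = range[0];
    uint32_t stop = range[1];
    range += 2;

    if (stop <= 0xFF) continue;        /* fully covered by the 256-bit map */
    if (start <= 0xFF) start = 0x100;  /* keep only the wide part */

    out[n++] = start;
    out[n++] = stop;
    }

  return n / 2;
}
```

With the ranges `0x41-0x5A`, `0xC0-0x17F` and `0x212A-0x212A`, this sketch collects two wide ranges when `utf` is nonzero and none for an 8-bit width with `utf` equal to zero, which is the case the new `PCRE2_ASSERT` before `OP_XCLASS` relies on.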
2 changes: 1 addition & 1 deletion src/pcre2_compile_class.c
@@ -125,7 +125,7 @@ const uint32_t *skip_range = get_nocase_range(c);
uint32_t skip_start = skip_range[0];

#if PCRE2_CODE_UNIT_WIDTH == 8
-PCRE2_ASSERT(options & PARSE_CLASS_UTF);
+PCRE2_ASSERT(options & (PARSE_CLASS_UTF | PARSE_CLASS_CASELESS_UTF));
#endif

#if PCRE2_CODE_UNIT_WIDTH == 32
3 changes: 3 additions & 0 deletions testdata/testinput10
@@ -623,6 +623,9 @@
/X(\x{e1})Y/i,ucp,replace=>\L$1<,substitute_extended
X\x{c1}Y

+/[a\x{c1}]/iI,ucp
+\x{e1}
+
# Without UTF or UCP characters > 127 have only one case in the default locale.

/X(\x{e1})Y/replace=>\U$1<,substitute_extended
8 changes: 8 additions & 0 deletions testdata/testoutput10
@@ -1883,6 +1883,14 @@ Subject length lower bound = 1
X\x{c1}Y
1: >\xe1<

+/[a\x{c1}]/iI,ucp
+Capture group count = 0
+Options: caseless ucp
+Starting code units: A a \xc1 \xe1
+Subject length lower bound = 1
+\x{e1}
+0: \xe1
+
# Without UTF or UCP characters > 127 have only one case in the default locale.

/X(\x{e1})Y/replace=>\U$1<,substitute_extended