Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improve character classes #474

Merged
merged 1 commit into from
Sep 21, 2024
Merged

Conversation

zherczeg
Copy link
Collaborator

This is the first patch, which aims to rework character classes. It does not do too much, because it does not handle caseless matching.

When a class has >255 character (the bitset is perfect for ascii / EBCDIC), it sorts and merges the ranges when possible. The current code is careful about not increasing code size, but this will change later. Probably this will be the most challenging part in the future. My idea is that the meta code for classes will be stored elsewhere, and only a reference will be stored in the original pattern.

The purpose of this patch is opening discussion about what should we do with classes. Optimizing them in any way is worth it or not.

add_to_class(classbits, &class_uchardata, options, xoptions, cb,
range[0], range[1]);

if (class_uchardata > class_uchardata_base)
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This if statement is a code duplication, but I don't know what to do with it.

@zherczeg zherczeg force-pushed the xclass_reorg branch 2 times, most recently from b47513b to 1307a79 Compare September 20, 2024 12:33
@@ -5473,8 +5219,7 @@ Returns: the number of < 256 characters added

static unsigned int
add_to_class_internal(uint8_t *classbits, PCRE2_UCHAR **uchardptr,
uint32_t options, uint32_t xoptions, compile_block *cb, uint32_t start,
uint32_t end)
uint32_t options, compile_block *cb, uint32_t start, uint32_t end)
Copy link
Collaborator Author

@zherczeg zherczeg Sep 20, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The reason of removing xoptions is that the only flag which is used is PCRE2_EXTRA_CASELESS_RESTRICT, and that flag does not change, so cb->cx->extra_options is good enough to access it. Less arguments means faster function call as well.

@zherczeg
Copy link
Collaborator Author

Some statistics with -O3:

Binary size: old: 2020664 new: 2021096. Few bytes bigger.

Compilation time is slower:

/[\x{200}-\x{400}\x{1000}-\x{1600}\x{10000}-\x{10800}]/utf
 Old: 5.9259 microseconds
 New: 6.3427 microseconds

/[\x{100}-\x{400}]+/i,utf
 Old: 22.0599 microseconds
 New: 31.3692 microseconds

Runtime is a bit better:

/a+[\x{100}-\x{400}]/i,utf
 aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
 Old: 9.6301 microseconds
 New: 9.5551 microseconds

@zherczeg
Copy link
Collaborator Author

@PhilipHazel probably only you can check this code. The gain at this point is little, but it is possible to extend the code with more features in the future.

@zherczeg zherczeg marked this pull request as ready for review September 20, 2024 13:16
@zherczeg
Copy link
Collaborator Author

The difference is bigger on better tests (I forgot the auto possessify optimization):

/.+[\x{200}\x{201}\x{202}\x{203}\x{204}\x{205}\x{206}\x{207}\x{208}\x{209}\x{20a}\x{20b}]/s,utf
 Old: 13.0699 microseconds
 New: 8.3277 microseconds

JIT is not really affected, probably the test is too simple. Anyway, for corner cases the new method should be better. It is also possible to optimize the code further, but the patch is large enough.

@zherczeg
Copy link
Collaborator Author

Conflicts resolved.

@PhilipHazel PhilipHazel merged commit d3095b4 into PCRE2Project:master Sep 21, 2024
13 checks passed
@zherczeg zherczeg deleted the xclass_reorg branch September 21, 2024 14:52
alexdowad added a commit to alexdowad/pcre2 that referenced this pull request Oct 13, 2024
… flag is set

When testing another patch, I discovered that PCRE2Project#474 caused a small change
in the behavior of character classes when caseless mode and UCP were enabled.

Thank you to Zoltan Herczeg for suggesting a fix.

Closes PCRE2ProjectGH-526.
alexdowad added a commit to alexdowad/pcre2 that referenced this pull request Oct 13, 2024
… flag is set

When testing another patch, I discovered that PCRE2Project#474 caused a small change
in the behavior of character classes when caseless mode and UCP were enabled.

Thank you to Zoltan Herczeg for suggesting a fix.

Closes PCRE2ProjectGH-526.
alexdowad added a commit to alexdowad/pcre2 that referenced this pull request Oct 15, 2024
… flag is set

When testing another patch, I discovered that PCRE2Project#474 caused a small change
in the behavior of character classes when caseless mode and UCP were enabled.

Thank you to Zoltan Herczeg for suggesting a fix.

Closes PCRE2ProjectGH-526.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants