Improve character classes #474
add_to_class(classbits, &class_uchardata, options, xoptions, cb,
  range[0], range[1]);

if (class_uchardata > class_uchardata_base)
This `if` statement duplicates code, but I don't know what to do with it.
@@ -5473,8 +5219,7 @@ Returns: the number of < 256 characters added

 static unsigned int
 add_to_class_internal(uint8_t *classbits, PCRE2_UCHAR **uchardptr,
-  uint32_t options, uint32_t xoptions, compile_block *cb, uint32_t start,
-  uint32_t end)
+  uint32_t options, compile_block *cb, uint32_t start, uint32_t end)
The reason for removing `xoptions` is that the only flag used is `PCRE2_EXTRA_CASELESS_RESTRICT`, and that flag does not change, so `cb->cx->extra_options` is good enough to access it. Fewer arguments also mean a faster function call.
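To illustrate the idea, here is a minimal sketch of reading the flag through the compile block instead of passing it as an argument. The `demo_` structures, the `DEMO_EXTRA_CASELESS_RESTRICT` value, and both function names are hypothetical stand-ins, not the real PCRE2 definitions, which have many more fields.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical flag value and structures for illustration only; the real
   definitions live in the PCRE2 headers and have many more fields. */
#define DEMO_EXTRA_CASELESS_RESTRICT 0x00000080u

typedef struct demo_compile_context {
  uint32_t extra_options;
} demo_compile_context;

typedef struct demo_compile_block {
  demo_compile_context *cx;
} demo_compile_block;

/* Before: the caller passes the extra options explicitly. */
static int restricted_with_param(uint32_t xoptions)
{
  return (xoptions & DEMO_EXTRA_CASELESS_RESTRICT) != 0;
}

/* After: the flag is read through the compile block, so the extra
   argument is no longer needed. */
static int restricted_from_block(const demo_compile_block *cb)
{
  assert(cb != NULL && cb->cx != NULL);
  return (cb->cx->extra_options & DEMO_EXTRA_CASELESS_RESTRICT) != 0;
}
```

Both functions give the same answer as long as the flag does not change during compilation, which is the assumption stated in the comment above.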
Some statistics with -O3:
Binary size: old 2020664, new 2021096. The new binary is 432 bytes bigger.
Compilation time is slower:
Runtime is a bit better:
@PhilipHazel probably only you can check this code. The gain at this point is small, but it is possible to extend the code with more features in the future.
The difference is bigger on better tests (I forgot the auto possessify optimization):
JIT is not really affected, probably because the test is too simple. In any case, the new method should be better for corner cases. It is also possible to optimize the code further, but the patch is large enough already.
Conflicts resolved.
… flag is set When testing another patch, I discovered that PCRE2Project#474 caused a small change in the behavior of character classes when caseless mode and UCP were enabled. Thank you to Zoltan Herczeg for suggesting a fix. Closes PCRE2ProjectGH-526.
This is the first patch in an effort to rework character classes. It does not do much yet, because it does not handle caseless matching.
When a class contains characters above 255 (the bitset is perfect for ASCII / EBCDIC), the patch sorts and merges the ranges where possible. The current code is careful not to increase code size, but this will change later. This will probably be the most challenging part in the future. My idea is that the meta code for classes will be stored elsewhere, and only a reference to it will be stored in the original pattern.
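The sort-and-merge step described above can be sketched as follows. This is only an illustration of the general technique, not the code in this patch: the `char_range` type and the `compare_ranges` and `merge_ranges` functions are hypothetical names, and each range is assumed to satisfy `start <= end`.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical range type for this sketch; not a PCRE2 structure. */
typedef struct {
  uint32_t start;  /* first code point, inclusive */
  uint32_t end;    /* last code point, inclusive */
} char_range;

/* qsort comparator: order by start, then by end. */
static int compare_ranges(const void *a, const void *b)
{
  const char_range *ra = (const char_range *)a;
  const char_range *rb = (const char_range *)b;
  if (ra->start != rb->start) return (ra->start < rb->start) ? -1 : 1;
  if (ra->end != rb->end) return (ra->end < rb->end) ? -1 : 1;
  return 0;
}

/* Sorts ranges in place and merges those that overlap or are adjacent.
   Returns the number of ranges left at the front of the array. */
static size_t merge_ranges(char_range *ranges, size_t count)
{
  size_t out = 0;
  size_t i;

  assert(ranges != NULL || count == 0);
  if (count == 0) return 0;

  qsort(ranges, count, sizeof(char_range), compare_ranges);

  for (i = 1; i < count; i++)
    {
    char_range *last = &ranges[out];
    /* Adjacent (end + 1 == start) or overlapping: extend the last range.
       The UINT32_MAX check avoids overflow in end + 1. */
    if (last->end == UINT32_MAX || ranges[i].start <= last->end + 1)
      {
      if (ranges[i].end > last->end) last->end = ranges[i].end;
      }
    else
      ranges[++out] = ranges[i];
    }

  return out + 1;
}
```

For example, the ranges 0x100-0x17f and 0x180-0x1ff are adjacent and collapse into the single range 0x100-0x1ff, so fewer ranges need to be stored and tested at match time.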
The purpose of this patch is to open a discussion about what we should do with classes, and whether optimizing them in any way is worth it.