Revise and extend character classes #13

PhilipHazel · 2021-08-24T16:26:05Z

This issue records several potential upgrades to the handling of character classes in PCRE2. This could be a lot of work in both the interpreters and the JIT.

The current code in the compiler has been hacked into an untidy mess and the compiled code is also messy. A revised implementation is needed that is more uniform and can better handle Unicode characters so as to make matching more efficient. For example, bitmaps could be used for runs of characters other than just 0-0xFF. Or some better coding scheme could be devised.
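As an illustration only, here is a minimal Python sketch of one way bitmaps might cover runs of characters beyond 0-0xFF: the code point space is split into 256-entry pages, and a bitmap is kept only for pages that contain members. The name `PagedClass` and the page size are assumptions made for this sketch. They do not describe PCRE2's actual or planned compiled format.

```python
PAGE_BITS = 256  # assumed page size, chosen to mirror the existing 0-0xFF bitmap


class PagedClass:
    """Hypothetical character class stored as sparse 256-bit pages."""

    def __init__(self):
        # Maps page index -> integer used as a 256-bit bitmap.
        self.pages = {}

    def add_range(self, first, last):
        """Add every code point in the inclusive range first..last."""
        for cp in range(first, last + 1):
            page, bit = divmod(cp, PAGE_BITS)
            self.pages[page] = self.pages.get(page, 0) | (1 << bit)

    def __contains__(self, cp):
        """Test membership with one dictionary lookup and one bit test."""
        page, bit = divmod(cp, PAGE_BITS)
        return bool((self.pages.get(page, 0) >> bit) & 1)


thai_digits = PagedClass()
thai_digits.add_range(0x0E50, 0x0E59)  # THAI DIGIT ZERO..THAI DIGIT NINE
print(0x0E55 in thai_digits, 0x0041 in thai_digits)
```

A run that fits inside one page costs a single bitmap, so a match-time test stays a constant-time lookup regardless of how many characters the run contains.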
Perl has an experimental extended class feature as in this example:

/(?[ ( \p{Thai} + \p{Lao} ) & \p{Digit} ])/

Any new compiled format should be able to handle such things.
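To make the semantics of that example concrete, the following Python sketch evaluates the same expression with ordinary set operations and reduces the result to code point ranges, which is the kind of thing a compiler could do at compile time. It rests on stated assumptions: the Thai and Lao scripts are approximated by their Unicode blocks (the standard library has no script property), and `\p{Digit}` is taken to mean general category Nd. The helper `to_ranges` is hypothetical.

```python
import unicodedata

# Assumption: approximate each script by its Unicode block.
THAI = set(range(0x0E00, 0x0E80))
LAO = set(range(0x0E80, 0x0F00))

# Assumption: \p{Digit} is treated as general category Nd,
# restricted here to the two blocks of interest.
DIGIT = {
    cp for cp in range(0x0E00, 0x0F00)
    if unicodedata.category(chr(cp)) == "Nd"
}

# ( \p{Thai} + \p{Lao} ) & \p{Digit}
extended = (THAI | LAO) & DIGIT


def to_ranges(code_points):
    """Collapse a set of code points into sorted inclusive ranges."""
    ranges = []
    for cp in sorted(code_points):
        if ranges and cp == ranges[-1][1] + 1:
            ranges[-1] = (ranges[-1][0], cp)  # extend the current run
        else:
            ranges.append((cp, cp))  # start a new run
    return ranges


for first, last in to_ranges(extended):
    print(f"U+{first:04X}..U+{last:04X}")
```

Resolving the operators into a flat list of ranges up front means the matcher never has to evaluate union or intersection while scanning the subject.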

There was a request for a way of re-defining \w (and therefore \W, \b, and \B). An in-pattern sequence such as (?w=[...]) was suggested. The easiest way would be simply to inline the class, using lookarounds for \b and \B. Ideally the setting should last until the end of the group, which means remembering all previous settings; maybe a fixed amount of stack would do. How deep would anyone want to nest these things? Of course, this idea also suggests redefining \d and \s. Is this worth doing, given that named groups can be used? It would be more efficient, because it can be processed at compile time.
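The inlining idea can be tried out with any engine that supports lookarounds. The sketch below uses Python's `re` module rather than PCRE2, and the helper names `word_boundary` and `not_word_boundary` are hypothetical. It builds lookaround equivalents of \b and \B for a custom word class; it does not implement the proposed (?w=[...]) syntax or the group scoping.

```python
import re


def word_boundary(cls):
    """Lookaround equivalent of \\b for a bracketed word class `cls`."""
    # Either a word character precedes and none follows, or the reverse.
    return rf"(?:(?<={cls})(?!{cls})|(?<!{cls})(?={cls}))"


def not_word_boundary(cls):
    """Lookaround equivalent of \\B for a bracketed word class `cls`."""
    # Both sides are word characters, or neither side is.
    return rf"(?:(?<={cls})(?={cls})|(?<!{cls})(?!{cls}))"


# Assumed example: treat '-' as a word character as well.
W = "[A-Za-z0-9_-]"
B = word_boundary(W)

custom = re.compile(f"{B}{W}+{B}")
default = re.compile(r"\b\w+\b")

text = "well-known fix"
print(custom.findall(text))
print(default.findall(text))
```

With the custom class the hyphenated word is matched as one unit, whereas the built-in \w splits it at the hyphen.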

michaeltang4829 · 2024-07-17T23:48:09Z

Hi, I'm curious whether there's any new interest in this effort. As mentioned above, Perl already has experimental support for extended character classes. I believe the ECMAScript (JavaScript) and Rust engines already support character class set operations such as subtraction, intersection, and union, and Python is issuing FutureWarnings hinting that this feature is coming to its engine soon. For PCRE2, this would without a doubt be a huge effort to implement.

PhilipHazel · 2024-07-18T08:03:55Z

Quite a long time ago I had some ideas as to how to do this, but I didn't write them down and I got distracted with other things. It would, as you say, be a big effort, affecting the compiler, the interpreters, and the JIT coding. I won't be getting involved because I'm trying to get other folk to take over PCRE2 - see #426.

PhilipHazel · 2024-09-11T08:40:07Z

For the record, there has been some recent mention of this, and I have written down some ideas which may or may not actually be implementable. The attached text file contains them.

ExtendClass.txt

PhilipHazel · 2024-11-16T15:39:14Z

I am closing this because #553 implements extended character classes.

PhilipHazel added the enhancement (New feature or request) label Aug 24, 2021

SolitaryGrass mentioned this issue May 31, 2023

internal_dfa_match, a stack overflow occurred due to recursive calls. #258

Closed

PhilipHazel mentioned this issue Dec 25, 2023

Inconsistent behaviour of character classes + ucp in 16- and 32-bit mode #360

Closed

PhilipHazel mentioned this issue Sep 9, 2024

Rewrite regexes where common prefix can be pulled out from alternation branches #464

Draft

PhilipHazel closed this as completed Nov 16, 2024
