Revise and extend character classes #13

Closed
PhilipHazel opened this issue Aug 24, 2021 · 4 comments
Labels
enhancement New feature or request

Comments

@PhilipHazel
Collaborator

This issue records several potential upgrades to the handling of character classes in PCRE2. This could be a lot of work in both the interpreters and the JIT.

  1. The current code in the compiler has been hacked into an untidy mess and the compiled code is also messy. A revised implementation is needed that is more uniform and can better handle Unicode characters so as to make matching more efficient. For example, bitmaps could be used for runs of characters other than just 0-0xFF. Or some better coding scheme could be devised.

  2. Perl has an experimental extended class feature as in this example:

/(?[ ( \p{Thai} + \p{Lao} ) & \p{Digit} ])/

Any new compiled format should be able to handle such things.

  3. There was a request for a way of re-defining \w (and therefore \W, \b, and \B). An in-pattern sequence such as (?w=[...]) was suggested. Easiest way would be simply to inline the class, with lookarounds for \b and \B. Ideally the setting should last till the end of the group, which means remembering all previous settings; maybe a fixed amount of stack would do - how deep would anyone want to nest these things? Of course, this idea also suggests redefining \d and \s. Is this worth doing, given that named groups can be used? It would be more efficient, because it can be processed at compile time.
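
The "inline the class, with lookarounds for \b and \B" idea can be sketched roughly as below. This is only an illustration, not PCRE2 code: it uses Python's `re` module as a stand-in engine, the helper name `inline_word_class` is hypothetical, and the suggested `(?w=[...])` syntax is not parsed at all (the class body is passed in as an argument instead). It also ignores scoping to the end of a group, and it does not handle `\w` appearing inside an existing bracketed class.

```python
import re


def inline_word_class(pattern, cls):
    # Hypothetical helper: rewrite \w, \W, \b and \B in `pattern` so that
    # "word character" means "member of [cls]". `cls` is a class body such
    # as "A-Za-z'" and is assumed not to start with "^".
    word = "[" + cls + "]"
    expansions = {
        "w": word,
        "W": "[^" + cls + "]",
        # Boundary: a word char on exactly one side of the position.
        "b": "(?:(?<=%s)(?!%s)|(?<!%s)(?=%s))" % (word, word, word, word),
        # Non-boundary: word chars on both sides, or on neither side.
        "B": "(?:(?<=%s)(?=%s)|(?<!%s)(?!%s))" % (word, word, word, word),
    }

    def replace(match):
        # Any other escape (including an escaped backslash) is left alone.
        return expansions.get(match.group(1), match.group(0))

    # Scanning escape-by-escape means "\\w" is seen as an escaped backslash
    # followed by a literal "w", not as \w.
    return re.sub(r"\\(.)", replace, pattern, flags=re.S)


# Treat the apostrophe as a word character.
pattern = inline_word_class(r"\b\w+\b", "A-Za-z'")
print(re.findall(pattern, "don't stop"))
```

With the apostrophe included in the class, "don't" is matched as a single word, whereas the default \w would split it at the apostrophe. Because the rewrite happens before compilation, the cost is paid once at compile time, which is the efficiency argument made above.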
@michaeltang4829

michaeltang4829 commented Jul 17, 2024

Hi, I'm curious if there's any new interest in this effort? As mentioned above, Perl already has experimental support for extended character classes. I believe ECMAScript (JavaScript) and Rust engines already support character class set operations, e.g. subtraction, intersection, and union. And Python is issuing FutureWarnings hinting that this feature is coming to its engine soon. For PCRE2, this would without a doubt be a huge effort to implement.
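
For anyone unfamiliar with the set operations mentioned here, the following is a minimal sketch of union, intersection, and subtraction over classes stored as sorted, inclusive code point ranges. It is not how PCRE2 or any other engine represents classes; the function names are made up for illustration. The Thai and Lao ranges are the Unicode blocks, used as a rough stand-in for the scripts \p{Thai} and \p{Lao}, and the digit set is a small hand-picked subset rather than the full \p{Digit}.

```python
MAX_CODE_POINT = 0x10FFFF


def normalize(ranges):
    # Sort the ranges and merge any that overlap or touch.
    out = []
    for lo, hi in sorted(ranges):
        if out and lo <= out[-1][1] + 1:
            out[-1] = (out[-1][0], max(out[-1][1], hi))
        else:
            out.append((lo, hi))
    return out


def union(a, b):
    return normalize(a + b)


def intersection(a, b):
    # Both inputs must already be normalized (sorted, non-overlapping).
    out = []
    i = j = 0
    while i < len(a) and j < len(b):
        lo = max(a[i][0], b[j][0])
        hi = min(a[i][1], b[j][1])
        if lo <= hi:
            out.append((lo, hi))
        # Advance whichever range ends first.
        if a[i][1] < b[j][1]:
            i += 1
        else:
            j += 1
    return out


def complement(a):
    # Everything in 0..MAX_CODE_POINT not covered by the normalized input.
    out = []
    next_free = 0
    for lo, hi in a:
        if lo > next_free:
            out.append((next_free, lo - 1))
        next_free = hi + 1
    if next_free <= MAX_CODE_POINT:
        out.append((next_free, MAX_CODE_POINT))
    return out


def subtraction(a, b):
    return intersection(a, complement(b))


# Approximation of (\p{Thai} + \p{Lao}) & \p{Digit} using block ranges.
thai = [(0x0E00, 0x0E7F)]
lao = [(0x0E80, 0x0EFF)]
digits = [(0x30, 0x39), (0x0E50, 0x0E59), (0x0ED0, 0x0ED9)]
thai_or_lao_digits = intersection(union(thai, lao), digits)
print([(hex(lo), hex(hi)) for lo, hi in thai_or_lao_digits])
```

In this representation the whole expression can be evaluated once when the pattern is compiled, leaving an ordinary list of ranges to match against.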

@PhilipHazel
Collaborator Author

Quite a long time ago I had some ideas as to how to do this, but I didn't write them down and I got distracted with other things. It would, as you say, be a big effort, affecting the compiler, the interpreters, and the JIT coding. I won't be getting involved because I'm trying to get other folk to take over PCRE2 - see #426.

@PhilipHazel
Collaborator Author

For the record, there has been some recent mention of this, and I have written down some ideas which may or may not actually be implementable. The attached text file contains them.

ExtendClass.txt

@PhilipHazel
Collaborator Author

I am closing this because #553 implements extended character classes.
