Revise and extend character classes #13
Comments
Hi, I'm curious whether there's any new interest in this effort. As mentioned above, Perl already has experimental behavior with extended character classes. I believe the ECMAScript (JavaScript) and Rust engines already support character class set operations, e.g. subtraction, intersection, and union. And Python is issuing FutureWarnings hinting that this feature is coming to its engine soon. For PCRE2, this would without a doubt be a huge effort to implement.
Quite a long time ago I had some ideas as to how to do this, but I didn't write them down and I got distracted with other things. It would, as you say, be a big effort, affecting the compiler, the interpreters, and the JIT coding. I won't be getting involved because I'm trying to get other folk to take over PCRE2 - see #426.
For the record, there has been some recent mention of this, and I have written down some ideas which may or may not actually be implementable. The attached text file contains them.
I am closing this because #553 implements extended character classes.
This issue records several potential upgrades to the handling of character classes in PCRE2. This could be a lot of work in both the interpreters and the JIT.
The current code in the compiler has been hacked into an untidy mess and the compiled code is also messy. A revised implementation is needed that is more uniform and can better handle Unicode characters so as to make matching more efficient. For example, bitmaps could be used for runs of characters other than just 0-0xFF. Or some better coding scheme could be devised.
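To make the bitmap idea concrete, here is a minimal sketch in Python of a class stored as one 256-bit bitmap per 256-code-point "page". This is purely illustrative and is not how PCRE2 encodes classes; the name `PagedBitmapClass` and the page layout are assumptions made up for this example.

```python
PAGE_BITS = 256  # code points per page, mirroring the size of the existing 0-0xFF bitmap


class PagedBitmapClass:
    """Hypothetical character class: one 256-bit bitmap per page of code points."""

    def __init__(self, ranges):
        # Maps page number (code point // 256) to an integer used as a bitmap.
        self.pages = {}
        for lo, hi in ranges:
            for cp in range(lo, hi + 1):
                page, bit = divmod(cp, PAGE_BITS)
                self.pages[page] = self.pages.get(page, 0) | (1 << bit)

    def matches(self, ch):
        # One dictionary lookup plus one bit test, whatever the range count.
        page, bit = divmod(ord(ch), PAGE_BITS)
        return (self.pages.get(page, 0) >> bit) & 1 == 1


# Thai digits U+0E50..U+0E59 and Lao digits U+0ED0..U+0ED9 share page 0x0E.
digits = PagedBitmapClass([(0x0E50, 0x0E59), (0x0ED0, 0x0ED9)])
print(digits.matches("\u0e51"), digits.matches("a"))
```

Both runs land on the same page here, so the whole class needs a single bitmap; classes spread thinly over many pages would favour a range list instead.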
Perl has an experimental extended class feature as in this example:
/(?[ ( \p{Thai} + \p{Lao} ) & \p{Digit} ])/
Any new compiled format should be able to handle such things.
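One way to think about this is that the set expression can be evaluated once at compile time into a flat list of code point ranges. The Python sketch below does that for the Perl example, under loud assumptions: `THAI` and `LAO` are the Unicode block ranges, used only as rough stand-ins for the script properties, and `DIGIT` is a hand-picked subset of `\p{Digit}`. The helper names are hypothetical and nothing here reflects PCRE2's actual implementation.

```python
def union(a, b):
    """Union of two range lists; merges overlapping or adjacent ranges."""
    merged = []
    for lo, hi in sorted(a + b):
        if merged and lo <= merged[-1][1] + 1:
            merged[-1] = (merged[-1][0], max(merged[-1][1], hi))
        else:
            merged.append((lo, hi))
    return merged


def intersection(a, b):
    """Intersection of two sorted, non-overlapping range lists."""
    out = []
    i = j = 0
    while i < len(a) and j < len(b):
        lo = max(a[i][0], b[j][0])
        hi = min(a[i][1], b[j][1])
        if lo <= hi:
            out.append((lo, hi))
        # Advance whichever range ends first.
        if a[i][1] < b[j][1]:
            i += 1
        else:
            j += 1
    return out


# Assumed stand-in data, not real property tables.
THAI = [(0x0E00, 0x0E7F)]
LAO = [(0x0E80, 0x0EFF)]
DIGIT = [(0x0030, 0x0039), (0x0E50, 0x0E59), (0x0ED0, 0x0ED9)]

# ( \p{Thai} + \p{Lao} ) & \p{Digit}
result = intersection(union(THAI, LAO), DIGIT)
print([(hex(lo), hex(hi)) for lo, hi in result])
```

With the expression folded into plain ranges before matching, the matcher itself would not need to know about set operators at all.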