[WIP] Fix unicode property support #1278
c16b020
to
b966eba
Compare
- This is up for early review because I'm not sure about the dynamic creation of the table of Unicode properties. I tried just creating a list of them, but it was so slow for my editor to process that I couldn't even format the giant lookup table. I suspect that if we want to "bake" these, to avoid however long it takes to compute the table and maybe avoid any unexpected drift, it might make sense to dump them to YAML or something like that. I'm not sure of the best approach.
- I'm also guessing there's a better option than just dumping all the regexp node types into the other list of supported regexp nodes.
- We should probably do this for other regex types. We might be missing some of the POSIX classes, for instance (I have not checked yet).
- Prevents crashes when the source contains an unsupported property type.
- Related to #1234 (which was a very partial fix).
- Note that this turns our `\p{Latin}` formatting into `\p{latin}`. We could fix this with some very simple inflection, but I wanted to do the simplest approach first to demonstrate the problem, since this seems to be semantically equivalent. The Ruby docs use the uppercase form. I have a text file from the upstream regex toolkit that we could use to confirm inflection rules if we want to.
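The claim that `\p{Latin}` and `\p{latin}` are semantically equivalent can be spot-checked with a small script. This is a minimal sketch, not part of the PR: `SAMPLES` and `property_matches` are hypothetical names chosen for illustration, and the sample characters are arbitrary.

```ruby
# Illustrative sample characters (an assumption, not taken from the PR):
# three Latin-script letters, one Cyrillic letter, a digit, and a space.
SAMPLES = ['a', 'Z', 'é', 'я', '1', ' '].freeze

# Hypothetical helper: compile \p{<name>} and report, per sample,
# whether the resulting regexp matches.
def property_matches(name)
  regexp = Regexp.new("\\p{#{name}}")
  SAMPLES.map { |sample| regexp.match?(sample) }
end

upper = property_matches('Latin')
lower = property_matches('latin')

# Both spellings should classify every sample the same way.
puts "same results: #{upper == lower}"
```

A check like this only shows that both spellings behave the same on the chosen samples. It does not say which spelling the unparser should emit.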
4f1a339
to
95c90e7
Compare
@@ -12,7 +12,7 @@ class Transformer
       include AbstractType

       REGISTRY = Registry.new(
-        ->(type) { fail "No regexp transformer registered for: #{type}" }
+        ->(type) { } # fail "No regexp transformer registered for: #{type}" }
Actually, my "failing test" here probably won't fail with this disabled. I was trying to demonstrate that in another commit, but I couldn't get CI to build separately. Just try adding any not-explicitly-listed `\p{...}` group and it will fail mutant.
Mutant also fails on these regexp expressions (taken from
'\g<1>'
"\\g'1'"
'\g<0>'
"\\g'0'"
I can file a separate issue for it, but we have several problems with the regex handling in mutant. We have disabled these subjects in several places, but it's a headache. I am also trying to diff my environment with some configuration changes using
I can file a separate issue for this; I'm just brain-dumping right now. We probably need a much better regex corpus test.
@mbj Could you take a look and suggest how we should handle the regex types registry? I can play around with it a little more, but I don't know if I'll have time to bring this all the way to completion. It causes some headaches for the primary codebase we run
@dgollahon Overall I like the initiative. But I think we need some sync time on Discord to discuss details. Hit me up.
* Original work in #1278.
* Adapted to reflect the Unicode properties from the `regexp_parser` gem.
* The `regexp_parser` gem does not seem to map Ruby features properly for each release, so the list is subset by filtering against the Ruby regexp parser.
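The filtering step described above could look roughly like the following. This is a sketch under stated assumptions, not the code from the follow-up commit: `CANDIDATES` is a hypothetical stand-in for whatever names the gem reports, and `supported_property?` is an illustrative helper. It relies only on `Regexp.new` raising `RegexpError` for a property name the running Ruby does not know.

```ruby
# Hypothetical candidate names; in practice these might come from a gem's
# property list. 'NotARealProperty' is included to show a rejection.
CANDIDATES = %w[Latin Greek Alpha NotARealProperty].freeze

# Illustrative helper: ask the running Ruby whether it accepts \p{<name>}.
def supported_property?(name)
  Regexp.new("\\p{#{name}}")
  true
rescue RegexpError
  # Ruby rejects unknown property names when the pattern is compiled.
  false
end

# Keep only the names this Ruby version can actually compile.
SUPPORTED = CANDIDATES.select { |name| supported_property?(name) }.freeze

puts SUPPORTED.inspect
```

Probing at load time keeps the list in step with the Ruby version actually running, at the cost of compiling one regexp per candidate. Whether to cache or "bake" the result is the open question raised in the PR description.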
Closing in favor of #1319