[WIP] Fix unicode property support #1278

dgollahon · 2021-11-07T20:22:37Z

- This is up for early review because I'm not sure about creating the table of unicode properties dynamically. I tried just writing them out as a list, but the giant lookup table was so slow for my editor to process that I couldn't even format it. If we want to "bake" these, to avoid however long it takes to compute the table and maybe avoid unexpected drift, it might make sense to dump them to YAML or something like that. I'm not sure of the best approach.
- I'm also guessing there's a better option than dumping all the regexp node types into the other list of supported regexp nodes.
- We should probably do this for other regex types too. We might be missing some of the POSIX classes, for instance (I have not checked yet).
- Prevents crashes when the source contains an unsupported property type.
- Related to Fix regexp mutation p{Latin} #1234 (which was a very partial fix).
- Note that this turns our `\p{Latin}` formatting into `\p{latin}`. We could fix this with some very simple inflection, but I wanted to take the simplest approach first to demonstrate the problem, since the two forms seem to be semantically equivalent. The Ruby docs use the capitalized form. I have a text file from the upstream regex toolkit that we could use to confirm the inflection rules if we want to.
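To make the last point concrete, here is a minimal sketch that compares the two spellings on a few sample characters. It assumes Ruby's regexp engine treats property names case-insensitively, which is what the description relies on. The `inflect_property` helper is hypothetical (it is not part of mutant) and only illustrates one possible "very simple inflection"; it would not round-trip every property name.

```ruby
# Assumption: Ruby's regexp engine treats property names case-insensitively,
# so both spellings below should select the same characters.
upper = Regexp.new('\p{Latin}')
lower = Regexp.new('\p{latin}')

# "a" and U+00E9 are Latin script; U+03BB (Greek lambda) and "1" are not.
samples = ['a', "\u00e9", "\u03bb", '1']

samples.each do |char|
  puts "#{char.inspect}: upper=#{upper.match?(char)} lower=#{lower.match?(char)}"
end

# Hypothetical helper, not part of mutant: capitalize each underscore-separated
# segment so that "latin" is emitted as "Latin" again.
def inflect_property(name)
  name.split('_').map(&:capitalize).join('_')
end

puts inflect_property('latin')
```

Whether segment-wise capitalization is the right rule would still need to be checked against the upstream property list mentioned above.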


dgollahon · 2021-11-07T20:32:00Z

lib/mutant/ast/regexp/transformer.rb

@@ -12,7 +12,7 @@ class Transformer
        include AbstractType

        REGISTRY = Registry.new(
-          ->(type) { fail "No regexp transformer registered for: #{type}" }
+          ->(type) { } # fail "No regexp transformer registered for: #{type}" }


Actually, my "failing test" here probably won't fail with this disabled. I was trying to demonstrate that in another commit, but I couldn't get CI to build it separately. Just try adding any `\p{...}` group that is not explicitly listed and mutant will fail.
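For readers unfamiliar with the pattern in the diff, the following is a rough sketch of a registry with a fallback callable. `SketchRegistry` and the type symbols are made up for illustration and are not mutant's actual `Registry` implementation. It only shows why swapping the raising lambda for an empty one turns a crash into a silent `nil`.

```ruby
# Hypothetical stand-in for a type registry with a fallback callable.
# This is not mutant's Registry class, just an illustration of the idea.
class SketchRegistry
  def initialize(default)
    @default  = default
    @contents = {}
  end

  def register(type, handler)
    @contents[type] = handler
    self
  end

  # Return the registered handler, or whatever the fallback returns.
  def lookup(type)
    @contents.fetch(type) { @default.call(type) }
  end
end

strict  = SketchRegistry.new(->(type) { fail "No regexp transformer registered for: #{type}" })
lenient = SketchRegistry.new(->(_type) {})

# :example_known_type and :example_unknown_type are placeholder symbols.
strict.register(:example_known_type, :example_handler)

begin
  strict.lookup(:example_unknown_type)
rescue RuntimeError => error
  puts error.message
end

p lenient.lookup(:example_unknown_type)
```

With the strict fallback an unregistered type fails loudly at lookup time; with the empty lambda the caller receives `nil` and has to cope with it later.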

dgollahon · 2021-11-07T20:38:42Z

Mutant also fails on the following regexp expressions (taken from regexp_parser's specs), which are `regexp_number_call_backref` nodes:

'\g<1>'
"\\g'1'"
'\g<0>'
"\\g'0'"
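For context, these are subexpression calls by group number. A small sketch of how the two delimiter styles behave in plain Ruby, independent of mutant (the surrounding `(a)` group is added here only so that group 1 exists):

```ruby
# Both delimiter styles call group 1 again, so each pattern needs two "a"s.
by_angle = Regexp.new('(a)\g<1>')
by_quote = Regexp.new("(a)\\g'1'")

%w[aa ab].each do |input|
  puts "#{input}: angle=#{by_angle.match?(input)} quote=#{by_quote.match?(input)}"
end
```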

I can file a separate issue for it, but we have several problems with the regex handling in mutant. We have disabled the affected subjects in several places, but it's a headache. I am also trying to diff my environment across some configuration changes using `mutant environment subject list`, but it crashes, so I can't do that. I have just realized I can pass `--ignore-subject` to that command, but because I have a custom `mutant:disable` comment interpreter it's a bit of a pain to do. I also want to diff the environment without passing `--ignore-subject`, so I can make sure everything gets picked up regardless of the disables, but I can't easily do that right now.

I can file a separate issue for this, I'm just brain-dumping right now.

We probably need a much better regex corpus test.

dgollahon · 2021-11-07T20:43:33Z

@mbj Could you take a look and suggest how we should handle the regex types registry? I can play around with it a little more but I don't know if I'll have time to bring this all the way to completion. It causes some headaches for the primary codebase we run mutant on and I wanted to demonstrate the problem we have.

mbj · 2021-11-15T00:53:57Z

@dgollahon Overall I like the initiative. But I think we need some sync time on discord to discuss details. Hit me up.

* Original work in #1278
* Adapted to reflect the unicode properties from the `regexp_parser` gem.
* Regexp parser gem seems to not properly map ruby features for each release so subsetting via filtering against the ruby regexp parser.
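The filtering idea in the last bullet can be sketched roughly as follows. This is an assumption about the general approach, not the code that was merged: the candidate list is a made-up stand-in for whatever `regexp_parser` reports, and `supported_property?` is a hypothetical helper that asks the running Ruby whether it accepts a property name.

```ruby
# Hypothetical candidate names; in practice these would come from the
# regexp_parser gem rather than a hard-coded list.
candidates = %w[Latin Greek Alpha NotARealProperty]

# Hypothetical helper: a property counts as supported if the running Ruby
# can compile a pattern that uses it.
def supported_property?(name)
  Regexp.new("\\p{#{name}}")
  true
rescue RegexpError
  false
end

supported = candidates.select { |name| supported_property?(name) }

puts supported.inspect
```

Filtering against the interpreter itself keeps the table in step with whatever Ruby version is running, at the cost of compiling one throwaway pattern per candidate.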

mbj · 2022-04-24T02:39:03Z

Closing in favor of #1319


dgollahon force-pushed the fix-unicode-property-support branch from c16b020 to b966eba · November 7, 2021 20:22

dgollahon force-pushed the fix-unicode-property-support branch from 4f1a339 to 95c90e7 · November 7, 2021 20:23

dgollahon changed the title ~~Fix unicode property support~~ [WIP] Fix unicode property support Nov 7, 2021

mbj mentioned this pull request Apr 24, 2022

Fix unicode property support #1319

Merged

mbj closed this Apr 24, 2022

dgollahon deleted the fix-unicode-property-support branch May 1, 2022 23:05
