Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[WIP] Fix unicode property support #1278

Closed
wants to merge 1 commit into from

Conversation

dgollahon
Copy link
Collaborator

@dgollahon dgollahon commented Nov 7, 2021

  • This is up for early review because I'm not sure about the dynamic creation of the table of unicode properties. I tried just creating a list of them but it was so slow for my editor to process that I couldn't even format the giant lookup table. I suspect that if we want to "bake" these to avoid however long it takes to compute the table and maybe avoid any unexpected drift, it might make sense to dump to YAML or something like that. I'm not sure the best approach.
  • I'm also guessing there's a better option than just dumping all the regexp node types in the other list of supported regexp nodes.
  • We probably should do this for other regex types--we might be missing some of the posix classes, for instance (I have not checked yet).
  • Prevents crashes when having an unsupported property type in source.
  • Related to Fix regexp mutation p{Latin} #1234 (which was a very partial fix)
  • Note that this turns our \p{Latin} formatting into \p{latin}. We could fix this with some very simple inflection but I wanted to do the simplest approach first to demonstrate the problem since this seems to be semantically equivalent. The ruby docs use the uppercase form. I have a text file from the upstream regex toolkit that we could use to confirm inflection rules if we want to.

@dgollahon dgollahon force-pushed the fix-unicode-property-support branch from c16b020 to b966eba Compare November 7, 2021 20:22
- This is up for early review because I'm not sure about the dynamic creation of the table of unicode properties. I tried just creating a list of them but it was so slow for my editor to process that I couldn't even format the giant lookup table. I suspect that if we want to "bake" these to avoid however long it takes to compute the table and maybe avoid any unexpected drift, it might make sense to dump to YAML or something like that. I'm not sure the best approach.
- I'm also guessing there's a better option than just dumping all the regexp node types in the other list of supported regexp nodes.
- We probably should do this for other regex types--we might be missing some of the posix classes, for instance (I have not checked yet).
- Prevents crashes when having an unsupported property type in source.
- Related to #1234 (which was a very partial fix)
- Note that this turns our `\p{Latin}` formatting into `\p{latin}`. We could fix this with some very simple inflection but I wanted to do the simplest approach first to demonstrate the problem since this seems to be semantically equivalent. The ruby docs use the uppercase form. I have a text file from the upstream regex toolkit that we could use to confirm inflection rules if we want to.
@dgollahon dgollahon force-pushed the fix-unicode-property-support branch from 4f1a339 to 95c90e7 Compare November 7, 2021 20:23
@@ -12,7 +12,7 @@ class Transformer
include AbstractType

REGISTRY = Registry.new(
->(type) { fail "No regexp transformer registered for: #{type}" }
->(type) { } # fail "No regexp transformer registered for: #{type}" }
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

actually my "failing test" here probably won't fail with this disabled. i was trying to demonstrate that in another commit but I couldn't get CI to build separately. just try adding any not-explicitly-listed \p{...} group and it will fail mutant.

@dgollahon
Copy link
Collaborator Author

dgollahon commented Nov 7, 2021

Mutant also fails on these regexp expressions (taken from regexp_parser's specs): which are regexp_number_call_backref nodes.

'\g<1>'
"\\g'1'"
'\g<0>'
"\\g'0'"

I can file a separate issue for it but we have several problems with the regex handling in mutant. We have disabled these subjects in several places but it's a headache. I am also trying to diff my environment with some configuration changes using mutant environment subject list but it crashes so I can't do that. I have just realized I can pass --ignore-subject to that command but because I have a custom mutant:disable comment interpreter it's a little bit of a pain to do that. I also want to diff the environment without having to do --ignore-subject so I can make sure everything gets picked up regardless of the disables, but I can't easily do that right now.

I can file a separate issue for this, I'm just brain-dumping right now.

We probably need a much better regex corpus test.

@dgollahon dgollahon changed the title Fix unicode property support [WIP] Fix unicode property support Nov 7, 2021
@dgollahon
Copy link
Collaborator Author

@mbj Could you take a look and suggest how we should handle the regex types registry? I can play around with it a little more but I don't know if I'll have time to bring this all the way to completion. It causes some headaches for the primary codebase we run mutant on and I wanted to demonstrate the problem we have.

@mbj
Copy link
Owner

mbj commented Nov 15, 2021

@dgollahon Overall I like the initiative. But I think we need some sync time on discord to discuss details. Hit me up.

mbj pushed a commit that referenced this pull request Apr 24, 2022
* Original work in #1278
* Adapted to reflect the unicode properties from the `regexp_parser`
  gem.
* Regexp parser gem seems to not properly map ruby features for each
  release so subsetting via filtering against the ruby regexp parser.
mbj pushed a commit that referenced this pull request Apr 24, 2022
* Original work in #1278
* Adapted to reflect the unicode properties from the `regexp_parser`
  gem.
* Regexp parser gem seems to not properly map ruby features for each
  release so subsetting via filtering against the ruby regexp parser.
mbj pushed a commit that referenced this pull request Apr 24, 2022
* Original work in #1278
* Adapted to reflect the unicode properties from the `regexp_parser`
  gem.
* Regexp parser gem seems to not properly map ruby features for each
  release so subsetting via filtering against the ruby regexp parser.
@mbj
Copy link
Owner

mbj commented Apr 24, 2022

Closing in favor of #1319

@mbj mbj closed this Apr 24, 2022
mbj pushed a commit that referenced this pull request Apr 24, 2022
* Original work in #1278
* Adapted to reflect the unicode properties from the `regexp_parser`
  gem.
* Regexp parser gem seems to not properly map ruby features for each
  release so subsetting via filtering against the ruby regexp parser.
@dgollahon dgollahon deleted the fix-unicode-property-support branch May 1, 2022 23:05
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants