
Use regexp_property_values gem for accurate property mapping #40

Merged: 6 commits, May 20, 2024

Owner

@tom-lord tom-lord commented May 20, 2024

Replace usage of the Unicode ranges pstore (which had not been updated for Unicode 13.0 through 15.1+!) with the regexp_property_values gem.

The gem hooks into the C API to directly load the matching codepoints for named properties. It's something I wanted to do originally with this library, 10+ years ago, but didn't know how back then! 😄

Thanks to @jaynetics for pointing this out, and building the above gem.

This PR solves the long-standing issue: #14

It fixes several related issues:

  • Some named properties did not generate any examples, e.g. /\p{Carian}/. This was because the script to generate (by brute force!) any matching characters only searched up to 0xFFFF, but the only matching characters start higher than this, from 0x102A0: "𐊠𐊡𐊢𐊣𐊤𐊥𐊦𐊧𐊨𐊩..."
  • The gem previously only stored a maximum of 128 characters per property, for performance reasons, which meant not all possible matching characters would be included in examples. This limitation is no longer present.
  • Some named properties may have been missing entirely, e.g. Age=15.1, or overlooked in the original script, e.g. \p{In Miscellaneous Mathematical Symbols-B}. These are now covered, because the property data comes directly from the Onigmo API.
  • The gem now correctly handles property names padded with hyphens or spaces.
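
To illustrate the first point, here is a minimal sketch of a brute-force scan in plain Ruby. It is not the original generation script, and `matching_codepoints` is a hypothetical helper written for this example only. It assumes the running Ruby recognises the `Carian` script property. Scanning only up to 0xFFFF finds nothing, while scanning the full Unicode range finds the Carian characters.

```ruby
# Hypothetical helper (not part of regexp-examples or regexp_property_values):
# returns every codepoint in `range` whose character matches `regexp`.
def matching_codepoints(regexp, range)
  range.select do |codepoint|
    # Surrogates are not valid UTF-8 characters; matching against them raises.
    next false if (0xD800..0xDFFF).cover?(codepoint)

    regexp.match?([codepoint].pack('U'))
  end
end

carian = /\p{Carian}/

# The old upper bound: Basic Multilingual Plane only.
bmp_only = matching_codepoints(carian, 0x0000..0xFFFF)

# The full Unicode codepoint range.
full = matching_codepoints(carian, 0x0000..0x10FFFF)

puts "Matches up to 0xFFFF:   #{bmp_only.size}"
puts "Matches up to 0x10FFFF: #{full.size}"
puts format('First match: 0x%X', full.first) unless full.empty?
```

A full scan like this performs over a million regexp matches per property, which is why loading the codepoints directly from the regexp engine is the more practical approach.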

Examples:

/\p{Age=6.0}/.random_example #=> "䧖"
/\p{In Miscellaneous Mathematical Symbols-B}/.random_example #=> "⦉"
/\p{Lowercase-Letter}/.examples(max_group_results: 99999).count #=> 2155

@tom-lord tom-lord marked this pull request as ready for review May 20, 2024 18:40
@tom-lord tom-lord merged commit ed29069 into master May 20, 2024
@tom-lord tom-lord deleted the use_regexp_property_values branch May 20, 2024 18:42
@tom-lord tom-lord mentioned this pull request May 20, 2024