Grapheme text segmentation and test suite #3

Merged: 23 commits into roc-lang:main on Dec 31, 2023

Conversation

Collaborator

@lukewilliamboswell lukewilliamboswell commented Oct 27, 2023

This PR:

  • Includes a script that generates the test suite from the Unicode grapheme data file
  • Includes a script that generates the Emoji implementation from the Unicode data file
  • Implements the grapheme text segmentation algorithm, up to and including legacy grapheme clusters
  • Adds a CI script that runs several checks, including roc check, roc test, roc build, and roc docs

NOTE: Implementing extended grapheme clusters requires rules GB9a, GB9b, and GB9c, which are left for a future PR.
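
For readers unfamiliar with the segmentation rules, here is a minimal Python sketch of a few of the legacy rules (GB3, GB4, GB5, GB9, GB999). It is not the Roc code in this PR. The names split_graphemes, is_control, and is_extend are hypothetical, and the Extend property is only approximated through the general categories Mn and Me. Hangul (GB6 to GB8), emoji ZWJ sequences (GB11), and regional indicators (GB12, GB13) are deliberately left out.

```python
import unicodedata

ZWJ = "\u200d"  # ZERO WIDTH JOINER


def is_control(ch):
    # Approximation: treat general category Cc (includes CR and LF) as Control.
    return unicodedata.category(ch) == "Cc"


def is_extend(ch):
    # Approximation of Extend: nonspacing and enclosing marks, plus ZWJ.
    return ch == ZWJ or unicodedata.category(ch) in ("Mn", "Me")


def split_graphemes(text):
    """Split text into clusters using a simplified subset of the rules."""
    clusters = []
    for ch in text:
        if not clusters:
            clusters.append(ch)      # GB1: always break at start of text
            continue
        prev = clusters[-1][-1]
        if prev == "\r" and ch == "\n":
            clusters[-1] += ch       # GB3: do not break between CR and LF
        elif is_control(prev) or is_control(ch):
            clusters.append(ch)      # GB4, GB5: break around controls
        elif is_extend(ch):
            clusters[-1] += ch       # GB9: do not break before Extend or ZWJ
        else:
            clusters.append(ch)      # GB999: otherwise break everywhere
    return clusters
```

With this sketch, a base letter followed by a combining mark stays in one cluster, while a CR LF pair is kept together as a single cluster.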

Run Generation Scripts

To regenerate the generated files, run bash rebuild.sh

Tests

To run the Grapheme test suite, use roc test package/GraphemeTest.roc

Screenshot 2023-12-17 at 20 33 25

Examples

I tried to include an additional example that uses Grapheme.split, but significant compiler bugs prevented me from including it in this PR.

Here is a demo from the tests showing the function in use.

Screenshot 2023-12-17 at 20 34 01

@rtfeldman
Contributor

@lukewilliamboswell Just checking - should I hold off on review until the tests are passing? (I saw in the description you mentioned the TODOs, but I wanted to check!)

@lukewilliamboswell
Collaborator Author

Thank you for clarifying. I think those changes are better suited to another PR. I suspect it will be a challenge: I need to learn a lot more about emoji first, and we may need to change the approach or algorithm to support it. Any feedback on these changes would be most appreciated, thank you.

@lukewilliamboswell lukewilliamboswell changed the title More Grapheme and Emoji generation and tests WIP Grapheme text segmentation and test generation Dec 15, 2023
@lukewilliamboswell
Collaborator Author

Update on this PR: I've rewritten the script that generates the test suite, currently called GraphemeTestGen2.roc. I can now include or exclude tests based on the rules (or capabilities) they exercise. This is a big improvement, because I can see where the implementation has gaps and progressively improve support for the text segmentation rules.

I've also started on a new implementation of the text segmentation algorithm, currently called Grapheme2.roc.
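
As an illustration of rule-based filtering, here is a minimal Python sketch, not the GraphemeTestGen2.roc script itself. It assumes the line format of the Unicode GraphemeBreakTest.txt data file, where ÷ marks a break, × marks no break, and the trailing comment tags each boundary with a rule number such as [9.0]. The names parse_test_line and keep_test are hypothetical, and which tags to exclude is left to the caller.

```python
import re

# Rule tags appear in the comment part of each line, e.g. "[9.0]" or "[999.0]".
RULE_TAG = re.compile(r"\[(\d+\.\d+)\]")


def parse_test_line(line):
    """Return (expected clusters as lists of code points, set of rule tags)."""
    data, _, comment = line.partition("#")
    expected, current = [], []
    for token in data.split():
        if token == "÷":            # break: close the current cluster
            if current:
                expected.append(current)
                current = []
        elif token != "×":          # anything else is a hex code point
            current.append(int(token, 16))
    if current:
        expected.append(current)
    return expected, set(RULE_TAG.findall(comment))


def keep_test(rules, excluded):
    """Keep a test only if it exercises none of the excluded rule tags."""
    return rules.isdisjoint(excluded)
```

For example, a line tagged [9.0] is dropped when "9.0" is in the excluded set and kept otherwise, so a generator can skip tests for rules that are not implemented yet.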

@lukewilliamboswell lukewilliamboswell changed the title WIP Grapheme text segmentation and test generation Grapheme text segmentation and test suite Dec 17, 2023

# allocated extra space for the extra bytes as some CPs expand into
# multiple U8s, so this minimises extra allocations
capacity = List.withCapacity (50 + List.len cps)
Contributor

No need to change, but might be worth scaling the extra bytes based on length of list? Like maybe (List.len cps // 10) + List.len cps or something.
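
To make the trade-off concrete, here is a small Python sketch comparing the fixed padding in the diff with the proportional padding suggested above. The function names are hypothetical, and utf8_len simply measures how many bytes a list of code points really needs (it assumes no surrogate code points).

```python
def fixed_capacity(n):
    # Mirrors the diff: a constant 50 extra bytes regardless of input size.
    return 50 + n


def scaled_capacity(n):
    # Mirrors the suggestion: roughly 10% extra, using integer division.
    return n + n // 10


def utf8_len(cps):
    # Actual number of UTF-8 bytes needed for the given code points.
    return sum(len(chr(cp).encode("utf-8")) for cp in cps)
```

For short inputs the fixed padding is more generous, while for long inputs the scaled padding grows with the list. Either way, text made mostly of multi-byte code points can need up to four bytes per code point, so both are only hints that reduce reallocations.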

Contributor

@rtfeldman rtfeldman left a comment

This looks fantastic! 🤩 🤩 🤩 🤩 🤩

Really great stuff! I'm so happy to see this working! 💯

@lukewilliamboswell lukewilliamboswell merged commit 3b4ec1f into roc-lang:main Dec 31, 2023
1 check passed
2 participants