[stdlib] Add full unicode support for character casing functions #3496

mzaks · 2024-09-18T15:07:00Z

The code I used to generate the lookup tables can be found here: https://gist.github.com/mzaks/bbadaeebcf81a5200021af041568b26b

Signed-off-by: Maxim Zaks <maxim.zaks@gmail.com>

martinvuyk

Hi @mzaks, this is great! Just a bit of a nitpick: I think _unicode.mojo and _unicode_lookups.mojo should be in the utils package.

stdlib/src/collections/_unicode.mojo

martinvuyk · 2024-09-18T16:41:05Z

+    if (rune >> 7) == 0:  # This is 1 byte ASCII char
+        p[0] = rune.cast[DType.uint8]()
+        return 1
+
+    @always_inline
+    fn _utf8_len(val: UInt32) -> Int:
+        alias sizes = SIMD[DType.uint32, 4](
+            0, 0b1111_111, 0b1111_1111_111, 0b1111_1111_1111_1111
+        )
+        var values = SIMD[DType.uint32, 4](val)
+        var mask = values > sizes
+        return int(mask.cast[DType.uint8]().reduce_add())
+
+    var num_bytes = _utf8_len(rune)
+    var shift = 6 * (num_bytes - 1)
+    var mask = UInt32(0xFF) >> (num_bytes + 1)
+    var num_bytes_marker = UInt32(0xFF) << (8 - num_bytes)
+    p[0] = (((rune >> shift) & mask) | num_bytes_marker).cast[DType.uint8]()
+    for i in range(1, num_bytes):
+        shift -= 6
+        p[i] = (((rune >> shift) & 0b00111111) | 0b10000000).cast[DType.uint8]()
+    return num_bytes


This encoding logic is used in many other places. I'll open a PR to fix the chr implementation; I put that fix in #3239, but it ended up mixed with other problems and was forgotten.
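For readers who want to trace the encoding logic in the diff above, here is a minimal Python sketch of the same steps (Python rather than Mojo so it can be run as-is). The names utf8_len and encode_utf8 are illustrative and not part of the PR, and the sketch assumes a valid non-surrogate code point, which it does not check.

```python
def utf8_len(cp: int) -> int:
    """Count the UTF-8 bytes needed for a non-zero code point."""
    # Same idea as the SIMD comparison in the diff: count how many
    # of the size thresholds the code point exceeds.
    thresholds = (0, 0x7F, 0x7FF, 0xFFFF)
    return sum(cp > t for t in thresholds)


def encode_utf8(cp: int) -> bytes:
    """Encode one code point (assumed valid, not a surrogate)."""
    if cp >> 7 == 0:  # 1-byte ASCII fast path
        return bytes([cp])
    num_bytes = utf8_len(cp)
    shift = 6 * (num_bytes - 1)
    # Payload bits that fit in the leading byte.
    mask = 0xFF >> (num_bytes + 1)
    # Leading-byte marker: num_bytes one-bits followed by a zero bit.
    # The & 0xFF stands in for the uint8 cast in the Mojo code.
    marker = (0xFF << (8 - num_bytes)) & 0xFF
    out = [((cp >> shift) & mask) | marker]
    for _ in range(1, num_bytes):
        shift -= 6
        # Continuation bytes carry 6 payload bits under a 0b10 prefix.
        out.append(((cp >> shift) & 0b00111111) | 0b10000000)
    return bytes(out)
```

Comparing the result with Python's built-in encoder, for example encode_utf8(0x20AC) against "€".encode("utf-8"), is a quick way to check the bit arithmetic.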

JoeLoser · 2024-09-19T13:38:52Z

!sync

mzaks · 2024-09-19T14:02:26Z

> Hi @mzaks, this is great! Just a bit of a nitpick: I think _unicode.mojo and _unicode_lookups.mojo should be in the utils package.

Very good point. Just moved them to utils.

JoeLoser · 2024-09-19T14:22:21Z

!sync

mzaks · 2024-09-21T16:41:44Z

I tried a performance improvement based on static B+Tree described here: https://en.algorithmica.org/hpc/data-structures/s-tree/

It looks promising, but I think it will be better as a separate PR.
@JoeLoser what do you think?
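As a rough illustration of the layout the linked page describes, here is a minimal Python sketch of an implicit static B-tree search. It follows the simpler S-tree variant from that page, not the B+ variant, and it is not the code tried for this PR. The node size B = 4 and the names child, build and lower_bound are assumptions made for the example; Python will not show the cache or SIMD benefits, only the indexing scheme.

```python
import math

B = 4  # keys per node (assumed; real code sizes this to a cache line)
INF = math.inf  # padding sentinel, larger than any key


def child(k: int, i: int) -> int:
    """Index of the i-th child of node k in the implicit layout."""
    return k * (B + 1) + i + 1


def build(sorted_keys):
    """Lay out sorted keys as a list of B-key nodes, filled in-order."""
    nblocks = (len(sorted_keys) + B - 1) // B
    tree = [[INF] * B for _ in range(nblocks)]
    it = iter(sorted_keys)

    def fill(k):
        if k >= nblocks:
            return
        for i in range(B):
            fill(child(k, i))           # left subtree first
            tree[k][i] = next(it, INF)  # then the key itself
        fill(child(k, B))               # rightmost subtree last

    fill(0)
    return tree


def lower_bound(tree, x):
    """Smallest key >= x, or INF if there is none."""
    k, res = 0, INF
    while k < len(tree):
        node = tree[k]
        # Rank of x inside the node; a SIMD compare in a real version.
        i = sum(key < x for key in node)
        if i < B:
            res = node[i]
        k = child(k, i)
    return res
```

A membership test for a lookup table then becomes lower_bound(tree, cp) == cp.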

martinvuyk · 2024-09-21T20:59:46Z

> https://en.algorithmica.org/hpc/data-structures/s-tree/

what an awesome book, I have new reading material for a couple of months...

> I tried a performance improvement based on static B+Tree

I think in this case it might or might not be better. The lists are static and known at compile time, yet the binary search is generic. I think you could just hardcode values and ranges here, though I'm not sure how that much branching would affect performance. But there are some quite big ranges in some places, e.g.:

alias has_uppercase_mapping = List[UInt32, hint_trivial_type=True](
    ...
    0x007A,  # LATIN SMALL LETTER Z z
    0x00B5,  # MICRO SIGN µ
    0x00E0,  # LATIN SMALL LETTER A WITH GRAVE à
    ...
    0x0586,  # ARMENIAN SMALL LETTER FEH ֆ
    0x10D0,  # GEORGIAN LETTER AN ა
    ...
    0x2D2D,  # GEORGIAN SMALL LETTER AEN ⴭ
    0xA641,  # CYRILLIC SMALL LETTER ZEMLYA ꙁ
    ...
)

And if we are going to nitpick performance, I think ASCII letters should be prioritized rather than run through a generic algorithm, since this will realistically be used mostly for the first X letters. We could have an ASCII-optimized version and a separate Unicode one, at the cost of checking the first UTF-8 byte and the branch that follows.
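To make the suggestion concrete, here is a minimal Python sketch (not Mojo, and not code from the PR) that combines an ASCII fast path, hardcoded gap checks, and a generic binary search as the fallback. TABLE holds only the seven entries visible in the excerpt above, so it is a toy stand-in for the real lookup table. GAPS and the function name are assumptions for illustration, and the gap bounds hold only if the excerpted neighbours are adjacent in the real table.

```python
from bisect import bisect_left

# Toy excerpt: only the entries shown in the comment above.
TABLE = [0x007A, 0x00B5, 0x00E0, 0x0586, 0x10D0, 0x2D2D, 0xA641]

# Open intervals assumed to contain no entries, taken from
# neighbouring values in the excerpt.
GAPS = ((0x0586, 0x10D0), (0x2D2D, 0xA641))


def has_uppercase_mapping(cp: int) -> bool:
    # ASCII fast path: one range check, no table access.
    if cp < 0x80:
        return 0x61 <= cp <= 0x7A  # 'a'..'z'
    # Hardcoded large gaps: reject early without searching.
    for lo, hi in GAPS:
        if lo < cp < hi:
            return False
    # Generic fallback: binary search over the sorted table.
    i = bisect_left(TABLE, cp)
    return i < len(TABLE) and TABLE[i] == cp
```

Whether the extra branches pay off depends on the input mix, which is the open question raised in the comment; this sketch only shows the control flow.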

…arating (#47532)

[External] [stdlib] Fix chr impl taking funcs to string_slice and separating

Fix `chr` implementation taking funcs to `string_slice` and separating their respective functionalities. This code will be used elsewhere, e.g., PR #3496.

ORIGINAL_AUTHOR=martinvuyk <110240700+martinvuyk@users.noreply.github.com>
PUBLIC_PR_LINK=#3506
Co-authored-by: martinvuyk <110240700+martinvuyk@users.noreply.github.com>
Closes #3506
MODULAR_ORIG_COMMIT_REV_ID: dc8c96e58f5e272d01e3c26f6daf3ffe8f7c0b36

JoeLoser · 2024-10-16T13:48:19Z

!sync

JoeLoser · 2024-11-18T17:44:35Z

FYI I'm chasing after the issues internally that this PR is hitting, I'll have an update soon.

JoeLoser · 2024-11-18T18:29:10Z

> FYI I'm chasing after the issues internally that this PR is hitting, I'll have an update soon.

All merged now @mzaks! Thanks for the contribution 🚀

modularbot · 2024-11-18T18:29:36Z

✅🟣 This contribution has been merged 🟣✅

Your pull request has been merged to the internal upstream Mojo sources. It will be reflected here in the Mojo repository on the nightly branch during the next Mojo nightly release, typically within the next 24-48 hours.

We use Copybara to merge external contributions.

JoeLoser · 2024-11-19T13:55:56Z

> I tried a performance improvement based on static B+Tree described here: https://en.algorithmica.org/hpc/data-structures/s-tree/
>
> It looks promising, but I think it will be better as a separate PR. @JoeLoser what do you think?

Agree re separate PR if you think there are perf optimizations to be had here. Always like keeping it simpler as we're getting started.

mzaks · 2024-11-19T16:10:56Z

> > I tried a performance improvement based on static B+Tree described here: https://en.algorithmica.org/hpc/data-structures/s-tree/
> > It looks promising, but I think it will be better as a separate PR. @JoeLoser what do you think?
>
> Agree re separate PR if you think there are perf optimizations to be had here. Always like keeping it simpler as we're getting started.

Yes, I was thinking along the same lines: this PR is quite simple and provides a correct implementation of casing. I will create follow-up PRs to introduce performance-related changes.

Add full unicode support for character casing functions

f9b3dcc

Signed-off-by: Maxim Zaks <maxim.zaks@gmail.com>

mzaks requested a review from a team as a code owner September 18, 2024 15:07

mzaks mentioned this pull request Sep 18, 2024

[Feature Request] Support String.toupper and .tolower for non ASCII chars which have upper lower cases #1543

Open

1 task

martinvuyk reviewed Sep 18, 2024

modular-automation bot assigned JoeLoser Sep 19, 2024

modularbot added the imported-internally Signals that a given pull request has been imported internally. label Sep 19, 2024

mzaks and others added 4 commits September 19, 2024 15:46

Add full unicode support for character casing functions

652b4b6

Signed-off-by: Maxim Zaks <maxim.zaks@gmail.com>

Update stdlib/src/collections/_unicode.mojo

3ce71b8

Co-authored-by: martinvuyk <110240700+martinvuyk@users.noreply.github.com> Signed-off-by: Maxim Zaks <maxim.zaks@gmail.com>

Update stdlib/src/collections/_unicode.mojo

3fdcfed

Co-authored-by: martinvuyk <110240700+martinvuyk@users.noreply.github.com> Signed-off-by: Maxim Zaks <maxim.zaks@gmail.com>

Update stdlib/src/collections/_unicode.mojo

fa8c0c9

Co-authored-by: martinvuyk <110240700+martinvuyk@users.noreply.github.com> Signed-off-by: Maxim Zaks <maxim.zaks@gmail.com>

mzaks force-pushed the feature/support-unicode-character-casing branch from da0b9bf to fa8c0c9 Compare September 19, 2024 13:46

Merge branch 'feature/support-unicode-character-casing' of github.com…

d40c28a

…:mzaks/mojo into feature/support-unicode-character-casing # Conflicts: # stdlib/src/collections/_unicode.mojo

Move _unicode and _unicode_lookups to utils

258a088

Signed-off-by: Maxim Zaks <maxim.zaks@gmail.com>

mzaks force-pushed the feature/support-unicode-character-casing branch from c28d399 to 258a088 Compare September 19, 2024 14:05

martinvuyk mentioned this pull request Sep 19, 2024

[stdlib] Fix chr impl taking funcs to string_slice and separating #3506

Closed

JoeLoser mentioned this pull request Sep 19, 2024

[stdlib] Support negative indexing and checking bounds for VariadicList. #2886

Closed

modularbot added the merged-internally Indicates that this pull request has been merged internally label Nov 18, 2024
