[stdlib] Add full unicode support for character casing functions #3496
base: nightly
Signed-off-by: Maxim Zaks <maxim.zaks@gmail.com>
Hi @mzaks, this is great! Just a bit of a nitpick: I think `_unicode.mojo` and `_unicode_lookups.mojo` should be in the utils package.
stdlib/src/collections/_unicode.mojo
Outdated
    if (rune >> 7) == 0:  # This is 1 byte ASCII char
        p[0] = rune.cast[DType.uint8]()
        return 1

    @always_inline
    fn _utf8_len(val: UInt32) -> Int:
        alias sizes = SIMD[DType.uint32, 4](
            0, 0b1111_111, 0b1111_1111_111, 0b1111_1111_1111_1111
        )
        var values = SIMD[DType.uint32, 4](val)
        var mask = values > sizes
        return int(mask.cast[DType.uint8]().reduce_add())

    var num_bytes = _utf8_len(rune)
    var shift = 6 * (num_bytes - 1)
    var mask = UInt32(0xFF) >> (num_bytes + 1)
    var num_bytes_marker = UInt32(0xFF) << (8 - num_bytes)
    p[0] = (((rune >> shift) & mask) | num_bytes_marker).cast[DType.uint8]()
    for i in range(1, num_bytes):
        shift -= 6
        p[i] = (((rune >> shift) & 0b00111111) | 0b10000000).cast[DType.uint8]()
    return num_bytes
This code logic is used in many other places. I'll open a PR to fix the `chr` implementation, which I originally put in #3239, where it got mixed in with other problems and was forgotten.
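For readers following the encoding logic in the snippet under review, here is a minimal Python sketch of the same idea. It is a rendition for illustration, not the Mojo code from the PR: the function names `utf8_len` and `encode_utf8` are hypothetical, and the SIMD compare plus `reduce_add` is replaced by a plain sum over the same four thresholds.

```python
def utf8_len(cp: int) -> int:
    """Number of UTF-8 bytes for a code point (0 for cp == 0, as in the snippet)."""
    # Same thresholds as the `sizes` vector: 0, 0x7F, 0x7FF, 0xFFFF.
    # Each threshold the code point exceeds adds one byte.
    thresholds = (0, 0x7F, 0x7FF, 0xFFFF)
    return sum(cp > t for t in thresholds)


def encode_utf8(cp: int) -> bytes:
    """Encode one code point, following the shift/mask steps of the snippet."""
    if cp >> 7 == 0:  # 1-byte ASCII fast path
        return bytes([cp])

    num_bytes = utf8_len(cp)
    shift = 6 * (num_bytes - 1)
    # Payload bits that fit in the lead byte: 5, 4, or 3 bits.
    mask = 0xFF >> (num_bytes + 1)
    # Lead-byte marker 0b110xxxxx, 0b1110xxxx, or 0b11110xxx.
    # Python ints do not truncate, so keep only the low 8 bits explicitly.
    marker = (0xFF << (8 - num_bytes)) & 0xFF

    out = bytearray([((cp >> shift) & mask) | marker])
    for _ in range(1, num_bytes):
        shift -= 6
        # Continuation bytes carry 6 payload bits under a 0b10 prefix.
        out.append(((cp >> shift) & 0b00111111) | 0b10000000)
    return bytes(out)
```

Comparing the result with Python's built-in encoder, for example `encode_utf8(0x10D0) == chr(0x10D0).encode("utf-8")`, is a quick way to check the shift and mask arithmetic at each byte-length boundary.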
!sync
Signed-off-by: Maxim Zaks <maxim.zaks@gmail.com>
Co-authored-by: martinvuyk <110240700+martinvuyk@users.noreply.github.com> Signed-off-by: Maxim Zaks <maxim.zaks@gmail.com>
Co-authored-by: martinvuyk <110240700+martinvuyk@users.noreply.github.com> Signed-off-by: Maxim Zaks <maxim.zaks@gmail.com>
Co-authored-by: martinvuyk <110240700+martinvuyk@users.noreply.github.com> Signed-off-by: Maxim Zaks <maxim.zaks@gmail.com>
da0b9bf to fa8c0c9
…:mzaks/mojo into feature/support-unicode-character-casing # Conflicts: # stdlib/src/collections/_unicode.mojo
Very good point. Just moved them to utils.
Signed-off-by: Maxim Zaks <maxim.zaks@gmail.com>
c28d399 to 258a088
!sync
I tried a performance improvement based on the static B+ tree described here: https://en.algorithmica.org/hpc/data-structures/s-tree/. It looks promising, but I think it will be better as a separate PR.
What an awesome book, I have new reading material for a couple of months...
I think in this case it might or might not be better. The lists are static and known at compile time, yet the binary search is generic. I think you could just hardcode values and ranges here, though I'm not sure how that much branching would affect performance. But there are some quite big ranges in some places, e.g.:
alias has_uppercase_mapping = List[UInt32, hint_trivial_type=True](
    ...
    0x007A, # LATIN SMALL LETTER Z z
    0x00B5, # MICRO SIGN µ
    0x00E0, # LATIN SMALL LETTER A WITH GRAVE à
    ...
    0x0586, # ARMENIAN SMALL LETTER FEH ֆ
    0x10D0, # GEORGIAN LETTER AN ა
    ...
    0x2D2D, # GEORGIAN SMALL LETTER AEN ⴭ
    0xA641, # CYRILLIC SMALL LETTER ZEMLYA ꙁ
    ...
)
And if we are going to nitpick performance, I think ASCII letters should be prioritized rather than using a generic algorithm for something that is realistically going to be used mostly for the first X letters. We could use an ASCII-optimized version and a separate Unicode one, at the cost of checking the first UTF-8 byte and the branch that follows.
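The suggestion above (an ASCII fast path in front of the generic table lookup) can be sketched as follows. This is a Python illustration under stated assumptions, not the PR's Mojo code: `to_upper_codepoint` and `UPPERCASE_MAPPING` are hypothetical names, the key table holds only the seven code points visible in the excerpt, and the mapped values are taken from Python's own Unicode data rather than from the PR's lookup tables.

```python
from bisect import bisect_left

# Sorted keys copied from the excerpt above; the real table is far longer.
HAS_UPPERCASE_MAPPING = [0x007A, 0x00B5, 0x00E0, 0x0586, 0x10D0, 0x2D2D, 0xA641]

# Hypothetical parallel table of targets, derived from Python's Unicode
# database so that no mapping is hand-written here.
UPPERCASE_MAPPING = [ord(chr(cp).upper()) for cp in HAS_UPPERCASE_MAPPING]


def to_upper_codepoint(cp: int) -> int:
    """Uppercase a single code point; unmapped code points are returned as-is."""
    # ASCII fast path: one range check and a subtraction, no table access.
    if cp < 0x80:
        if 0x61 <= cp <= 0x7A:  # 'a'..'z'
            return cp - 0x20
        return cp

    # Generic path: binary search over the sorted key table.
    i = bisect_left(HAS_UPPERCASE_MAPPING, cp)
    if i < len(HAS_UPPERCASE_MAPPING) and HAS_UPPERCASE_MAPPING[i] == cp:
        return UPPERCASE_MAPPING[i]
    return cp
```

The trade-off is the one named in the comment: every non-ASCII character pays for one extra comparison and branch, while ASCII input skips the search entirely.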
…arating (#47532)
[External] [stdlib] Fix chr impl taking funcs to string_slice and separating

Fix `chr` implementation taking funcs to `string_slice` and separating their respective functionalities. This code will be used elsewhere e.g., PR [#3496](#3496 (comment))

ORIGINAL_AUTHOR=martinvuyk <110240700+martinvuyk@users.noreply.github.com>
PUBLIC_PR_LINK=#3506

Co-authored-by: martinvuyk <110240700+martinvuyk@users.noreply.github.com>

Closes #3506
MODULAR_ORIG_COMMIT_REV_ID: dc8c96e58f5e272d01e3c26f6daf3ffe8f7c0b36
!sync
FYI I'm chasing after the issues internally that this PR is hitting, I'll have an update soon.
All merged now @mzaks! Thanks for the contribution 🚀
✅🟣 This contribution has been merged 🟣✅ Your pull request has been merged to the internal upstream Mojo sources. It will be reflected here in the Mojo repository on the nightly branch during the next Mojo nightly release, typically within the next 24-48 hours. We use Copybara to merge external contributions.
Agree re separate PR if you think there are perf optimizations to be had here. Always like keeping it simpler as we're getting started.
Yes, I was also thinking along the same lines: this PR is quite simple and provides a correct implementation of casing. I will create follow-up PRs to introduce performance-related changes.
The code I used to generate the lookup tables can be found here: https://gist.github.com/mzaks/bbadaeebcf81a5200021af041568b26b