-
Notifications
You must be signed in to change notification settings - Fork 2.6k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[stdlib] Add more utf-8 validation unit tests #3405
[stdlib] Add more utf-8 validation unit tests #3405
Conversation
Signed-off-by: gabrieldemarmiesse <gabrieldemarmiesse@gmail.com>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice, thanks for improving the tests! Do you or other people in the community you've been working with have a more sketched out broader/longer-term design for incorporating UTF-8 strings in the stdlib more generally speaking (e.g. should it be String
type, a distinct type, how do we think about different ways of indexing (by byte, code point, etc.)?
assert_false(validate_utf8(sequence[])) | ||
|
||
|
||
def test_combinaison_good_utf8_sequences(): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
def test_combinaison_good_utf8_sequences(): | |
def test_combination_good_utf8_sequences(): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'll fix this and the other similar typo internally when I import it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
oupsy, my french is leaking
!sync |
About utf-8 and Strings longer term, there was quite a bit of discussion on the discord about this. (internal encoding + null terminator) and while there wasn't a consensus, there is a subset of ideas that everyone agreed on. I propose we do the non-controversial work first and then we'll be able to discuss the more controversial changes. Non-controversial changesEveryone agreed about the necessity of having utf-8 strings in one shape or another with a null terminator for compatibility with the OS. I propose we go in this direction first and make our Controversial changesNow for the rest of the ideas that we can discuss later, there was:
Those three changes were controversial, and complexe. Especially with Python devs talking about changing the internal representation of So yeah the immediate goal would be to integrate utf-8 to our |
✅🟣 This contribution has been merged 🟣✅ Your pull request has been merged to the internal upstream Mojo sources. It will be reflected here in the Mojo repository on the nightly branch during the next Mojo nightly release, typically within the next 24-48 hours. We use Copybara to merge external contributions, click here to learn more. |
[External] [stdlib] Add more utf-8 validation unit tests Part of my work on utf-8 validation. The new algorithm is more complex. So, to ensure everything is working as expected, add some additional unit tests. Co-authored-by: Gabriel de Marmiesse <gabrieldemarmiesse@gmail.com> Closes #3405 MODULAR_ORIG_COMMIT_REV_ID: 0dbb5c80b8326d6063f64f5bd998e509d72ecf67
Landed in b40ec35! Thank you for your contribution 🎉 |
[External] [stdlib] Add more utf-8 validation unit tests Part of my work on utf-8 validation. The new algorithm is more complex. So, to ensure everything is working as expected, add some additional unit tests. Co-authored-by: Gabriel de Marmiesse <gabrieldemarmiesse@gmail.com> Closes modular#3405 MODULAR_ORIG_COMMIT_REV_ID: 0dbb5c80b8326d6063f64f5bd998e509d72ecf67 Signed-off-by: Manuel Saelices <msaelices@gmail.com>
[External] [stdlib] Add more utf-8 validation unit tests Part of my work on utf-8 validation. The new algorithm is more complex. So, to ensure everything is working as expected, add some additional unit tests. Co-authored-by: Gabriel de Marmiesse <gabrieldemarmiesse@gmail.com> Closes #3405 MODULAR_ORIG_COMMIT_REV_ID: 0dbb5c80b8326d6063f64f5bd998e509d72ecf67
Part of my work on utf-8 validation. The new algorithm is more complexe, thus to improve our trust in it, I added new unit tests.