[stdlib] Add more utf-8 validation unit tests #3405

gabrieldemarmiesse · 2024-08-22T13:17:52Z

Part of my work on UTF-8 validation. The new algorithm is more complex, so to increase our confidence in it, I added new unit tests.
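
For readers who want a feel for what such tests cover, here is a minimal sketch in Python rather than Mojo. It is not the test file from this PR: `is_valid_utf8`, `GOOD_SEQUENCES`, `BAD_SEQUENCES` and `combinations_are_valid` are hypothetical names, and Python's strict decoder stands in for the stdlib's `validate_utf8` as a reference oracle.

```python
def is_valid_utf8(data: bytes) -> bool:
    """Reference check: defer to Python's strict UTF-8 decoder."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


# One well-formed sequence per encoded length.
GOOD_SEQUENCES = [
    b"a",                 # 1 byte, ASCII
    b"\xc3\xb1",          # 2 bytes, U+00F1
    b"\xe2\x82\xa1",      # 3 bytes, U+20A1
    b"\xf0\x90\x8c\xbc",  # 4 bytes, U+1033C
]

# Typical ways a byte sequence can be malformed.
BAD_SEQUENCES = [
    b"\xc3\x28",          # lead byte followed by a non-continuation byte
    b"\xa0\xa1",          # continuation byte where a lead byte is expected
    b"\xe2\x82",          # truncated 3-byte sequence
    b"\xc0\xaf",          # overlong encoding of "/"
    b"\xed\xa0\x80",      # encoded UTF-16 surrogate (U+D800)
    b"\xf4\x90\x80\x80",  # code point above U+10FFFF
]


def combinations_are_valid() -> bool:
    """Concatenating two valid sequences must still validate."""
    return all(
        is_valid_utf8(first + second)
        for first in GOOD_SEQUENCES
        for second in GOOD_SEQUENCES
    )
```

Testing pairs of sequences, and not only single ones, is useful for a vectorized validator because it moves sequence boundaries to different positions inside the buffer.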

Signed-off-by: gabrieldemarmiesse <gabrieldemarmiesse@gmail.com>

JoeLoser

Nice, thanks for improving the tests! Do you, or other people in the community you've been working with, have a more fleshed-out, longer-term design for incorporating UTF-8 strings into the stdlib more generally? For example: should it be the String type or a distinct type, and how do we think about the different ways of indexing (by byte, by code point, etc.)?
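
To make the indexing question concrete, here is a small Python sketch (not Mojo, and not a proposal for the stdlib API). It shows how byte positions and code point positions diverge once a string contains non-ASCII characters. `byte_offset_of_code_point` is a hypothetical helper and assumes its input is already valid UTF-8.

```python
text = "héllo"
encoded = text.encode("utf-8")

# Python's str counts code points; the encoded form counts bytes.
code_point_count = len(text)   # 5
byte_count = len(encoded)      # 6, because "é" takes two bytes


def byte_offset_of_code_point(data: bytes, index: int) -> int:
    """Return the byte offset where code point number `index` starts.

    Assumes `data` is valid UTF-8. The scan is linear in the length of
    the data, which is the cost that byte indexing avoids.
    """
    seen = 0
    for offset, byte in enumerate(data):
        # Continuation bytes look like 0b10xxxxxx; any other byte
        # starts a new code point.
        if byte & 0xC0 != 0x80:
            if seen == index:
                return offset
            seen += 1
    raise IndexError(index)
```

Here `byte_offset_of_code_point(encoded, 2)` lands on byte 3, because the second code point occupies bytes 1 and 2.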

JoeLoser · 2024-08-22T16:43:30Z

stdlib/test/utils/test_string_slice.mojo

+        assert_false(validate_utf8(sequence[]))
+
+
+def test_combinaison_good_utf8_sequences():


Suggested change

-def test_combinaison_good_utf8_sequences():
+def test_combination_good_utf8_sequences():

I'll fix this and the other similar typo internally when I import it.

Oops, my French is leaking.

JoeLoser · 2024-08-22T17:21:32Z

!sync

gabrieldemarmiesse · 2024-08-22T21:14:09Z

About UTF-8 and Strings longer term: there was quite a bit of discussion on the Discord about this (internal encoding + null terminator). While there wasn't a consensus, there is a subset of ideas that everyone agreed on. I propose we do the non-controversial work first, and then we'll be able to discuss the more controversial changes.

Non-controversial changes

Everyone agreed on the necessity of having UTF-8 strings in one shape or another, with a null terminator for compatibility with the OS. I propose we go in this direction first and make our String type a UTF-8 string with a null terminator always present.
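
As an illustration of why a null terminator and UTF-8 coexist well, here is a minimal Python sketch (not Mojo, and not how String is implemented). UTF-8 only produces a zero byte when encoding U+0000, so a trailing NUL is unambiguous as long as the text itself contains no U+0000. `to_c_buffer` and `read_as_c_string` are hypothetical helpers; `ctypes.c_char_p` is used only to mimic how C code would read the buffer.

```python
import ctypes


def to_c_buffer(text: str) -> bytes:
    """Encode `text` as UTF-8 and append the NUL terminator C APIs expect."""
    payload = text.encode("utf-8")
    if b"\x00" in payload:
        # An embedded U+0000 would make C code see a shorter string.
        raise ValueError("embedded NUL would truncate the string on the C side")
    return payload + b"\x00"


def read_as_c_string(buffer: bytes) -> bytes:
    """Read `buffer` the way C would: every byte up to the first NUL."""
    return ctypes.c_char_p(buffer).value
```

With these helpers, `read_as_c_string(to_c_buffer("héllo"))` gives back the six UTF-8 bytes of "héllo" without the terminator.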

Controversial changes

Now for the rest of the ideas, which we can discuss later:

Adding a struct parameter to String saying whether the null terminator is present, with it absent by default.
Adding a struct parameter to String indicating the internal encoding, notably to have an internal encoding compatible with Python, whose str is not stored as UTF-8 internally. This parameter could also indicate that a string is ASCII-only, and type promotion could be applied when mixing strings with different internal representations.
Adding a table for faster indexing in UTF-8 strings.
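
One possible reading of the last idea, sketched in Python rather than Mojo: store the byte offset of every `stride`-th code point, so that a lookup jumps to the nearest checkpoint and scans at most `stride - 1` code points from there. This is an assumption about what such a table could look like, not the design discussed on Discord; `build_offset_table`, `offset_of` and `stride` are hypothetical names, and the input is assumed to be valid UTF-8.

```python
def build_offset_table(data: bytes, stride: int = 4) -> list[int]:
    """Byte offsets of code points 0, stride, 2*stride, ... in `data`."""
    table = []
    count = 0
    for offset, byte in enumerate(data):
        if byte & 0xC0 != 0x80:  # not a continuation byte: a code point starts here
            if count % stride == 0:
                table.append(offset)
            count += 1
    return table


def offset_of(data: bytes, table: list[int], index: int, stride: int = 4) -> int:
    """Byte offset of code point `index`, using the checkpoint table."""
    if index < 0:
        raise IndexError(index)
    offset = table[index // stride]  # raises IndexError past the last checkpoint
    remaining = index % stride
    while remaining:
        # Step over the current code point: its lead byte, then its continuation bytes.
        offset += 1
        while offset < len(data) and data[offset] & 0xC0 == 0x80:
            offset += 1
        remaining -= 1
    if offset >= len(data):
        raise IndexError(index)
    return offset
```

The trade-off is memory against lookup time: a smaller `stride` means a larger table and shorter scans, and the table has to be rebuilt or patched whenever the string is mutated.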

Those three changes were controversial and complex, especially with Python devs talking about changing the internal representation of str to use UTF-8; see faster-cpython/ideas#684 from their faster-cpython initiative.

So yeah, the immediate goal would be to integrate UTF-8 into our String struct. We'll see about the rest later.

modularbot · 2024-08-23T00:42:22Z

✅🟣 This contribution has been merged 🟣✅

Your pull request has been merged to the internal upstream Mojo sources. It will be reflected here in the Mojo repository on the nightly branch during the next Mojo nightly release, typically within the next 24-48 hours.

We use Copybara to merge external contributions, click here to learn more.

[External] [stdlib] Add more utf-8 validation unit tests

Part of my work on utf-8 validation. The new algorithm is more complex. So, to ensure everything is working as expected, add some additional unit tests.

Co-authored-by: Gabriel de Marmiesse <gabrieldemarmiesse@gmail.com>
Closes #3405
MODULAR_ORIG_COMMIT_REV_ID: 0dbb5c80b8326d6063f64f5bd998e509d72ecf67

modularbot · 2024-08-24T20:29:39Z

Landed in b40ec35! Thank you for your contribution 🎉

[External] [stdlib] Add more utf-8 validation unit tests

Part of my work on utf-8 validation. The new algorithm is more complex. So, to ensure everything is working as expected, add some additional unit tests.

Co-authored-by: Gabriel de Marmiesse <gabrieldemarmiesse@gmail.com>
Closes modular#3405
MODULAR_ORIG_COMMIT_REV_ID: 0dbb5c80b8326d6063f64f5bd998e509d72ecf67
Signed-off-by: Manuel Saelices <msaelices@gmail.com>

[stdlib] Add more utf-8 validation unit tests

9adfa5d

Signed-off-by: gabrieldemarmiesse <gabrieldemarmiesse@gmail.com>

gabrieldemarmiesse requested a review from a team as a code owner August 22, 2024 13:17

Add todo

efdef33

Signed-off-by: gabrieldemarmiesse <gabrieldemarmiesse@gmail.com>

gabrieldemarmiesse mentioned this pull request Aug 22, 2024

[stdlib] Make utf8 validation ~10-13x faster on neon and sse4 #3401

Closed

JoeLoser self-assigned this Aug 22, 2024

JoeLoser approved these changes Aug 22, 2024

modularbot added the imported-internally label (signals that a given pull request has been imported internally) Aug 22, 2024

modularbot added the merged-internally label (indicates that this pull request has been merged internally) Aug 23, 2024

modularbot added the merged-externally label (merged externally in public mojo repo) Aug 24, 2024

modularbot closed this Aug 24, 2024

gabrieldemarmiesse mentioned this pull request Sep 8, 2024

[stdlib] Use SIMD to make b64encode 4.7x faster #3443

Closed
