Skip to content

Test that invalid UTF-8 byte sequences are rejected. #450

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 2 commits into from

Conversation

sunfishcode
Copy link
Member

WebAssembly/design#1016 in the design repository requires implementations to validate that import/export names are UTF-8. This PR contains test coverage for this feature only.

The tests may need to be modified depending on whether UTF-8 validation is implemented as a syntactic constraint or not, or other details, however they should offer a good starting point.

@rossberg
Copy link
Member

rossberg commented Apr 4, 2017

Looks good to me as far as the test cases go. However, on reflection it is most natural that these will be considered decoding errors, and that the AST itself represents names as vectors of code points. As you suspect, that implies that in the text format, errors will have to manifest themselves as immediate syntax errors. So to make the textual tests (the import/export ones) work you'll either need to turn them into individual .fail. files, or (preferably) express them in binary format like the custom section test, whose decoding happens separately from parsing.

But before the spec interpreter implements the UTF-8 restriction, these tests break CI on travis either way (and potentially downstream users running the test suite on waterfalls). For that reason alone I would suggest holding off landing until the spec has caught up. I'll try to get to it later this week.

@rossberg
Copy link
Member

rossberg commented Apr 5, 2017

See #454 for the interpreter implementation. After having implemented it I noticed that a few cases aren't tested in this PR:

  • non-scalar code points (U+D800-U+DFFF)
  • out-of-range code points (larger than U+10FFFF)
  • overlong encodings (using more bytes than necessary)

It would be great if you could include a few tests for those as well.

@kmiller68
Copy link

I may have missed it but it looks like there are no tests ensuring UTF-8s with a BOM fail to parse. It seems like that would be a good test too.

As a nit it might also be useful to put roughly why the UTF-8 should fail to validate on each of the tests.

@sunfishcode
Copy link
Member Author

This is superseded by #468, which I believe addresses all the feedback here.

@sunfishcode sunfishcode closed this May 4, 2017
@sunfishcode sunfishcode deleted the validate-utf8 branch May 5, 2017 16:26
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants