-
Notifications
You must be signed in to change notification settings - Fork 695
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Require import/export names to be UTF-8. #1016
Conversation
This implements the UTF-8 proposal described in #989 (comment). This does not currently rename "name" to "utf8-name", because if UTF-8 is required for import/export names, there's a greater appeal to just saying that all strings are UTF-8, though this is debatable.
BinaryEncoding.md
Outdated
@@ -253,9 +253,9 @@ The import section declares all imports that will be used in the module. | |||
| Field | Type | Description | | |||
| ----- | ---- | ----------- | | |||
| module_len | `varuint32` | module string length | | |||
| module_str | `bytes` | module string of `module_len` bytes | | |||
| module_str | `bytes` | module name: `module_len` bytes holding valid utf8 string | |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"UTF-8" here and elsewhere.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
@@ -48,7 +48,8 @@ In the future, other kinds of imports may be added. Imports are designed to | |||
allow modules to share code and data while still allowing separate compilation | |||
and caching. | |||
|
|||
All imports include two opaque names: a *module name* and an *export name*. The | |||
All imports include two opaque names: a *module name* and an *export name*, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In JS.md, which type of exception should occur if import or export are invalid UTF-8 strings?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
JS.md already seems to specify WebAssembly.CompileError
in this case.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It'd be a validation requirement, so WebAssembly.validate would return false, and APIs that throw would throw WebAssembly.CompileError. The design docs don't specify the details of validation, so there doesn't seem to be a clear place to specify this; perhaps we could just handle this in eventual spec PR?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
sgtm
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
lgtm, would be good to get input from @annevk / @tabatkins.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM apart from this minor nit.
BinaryEncoding.md
Outdated
@@ -253,9 +253,9 @@ The import section declares all imports that will be used in the module. | |||
| Field | Type | Description | | |||
| ----- | ---- | ----------- | | |||
| module_len | `varuint32` | module string length | | |||
| module_str | `bytes` | module string of `module_len` bytes | | |||
| module_str | `bytes` | module name: `module_len` bytes holding valid UTF-8 string | |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This would be a valid UTF-8 byte sequence. A string is what you get after you decode.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the correction! This is now fixed.
This document is describing the encoded bytes, rather than the string which one gets from decoding them. Also, make the descriptions of the byte sequence length fields more precise.
BinaryEncoding.md
Outdated
| field_len | `varuint32` | field name length | | ||
| field_str | `bytes` | field name: `field_len` bytes holding valid UTF-8 string | | ||
| module_len | `varuint32` | length of `module_str` in bytes | | ||
| module_str | `bytes` | module name: valid UTF-8 byte sequnce | |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
typo
I still think this is the wrong place to impose such a requirement, for all the reasons stated. However, I realised that under such a spec implementations could still be allowed to restrict the range of code points they accept (and thereby limit to ASCII in particular) by the same token that they are allowed to impose other implementation restrictions, such as on the number of local variables or sizes of functions, etc. So as long as there is agreement that it is legal for engines to implement such restrictions -- and we'll include it in the previously discussed (offline) but yet-to-be-written list on allowable implementation restrictions -- I would be fine with the change. |
@rossberg-chromium Yes, allowing embedders to impose additional constraints in this space would be fine with me. Are there any other comments on this PR? |
less validation rules on the base spec === smaller code base. I think it should be left to the upper layers to decided on the string format |
@wanderer The amount of code needed is quite small. Some implementations will already have a Unicode library linked in for other purposes, and for those that don't, here's a simple standalone implementation in C, for example: https://gist.github.com/sunfishcode/c050d4f60633c49ae6e54a3d45385031 In my experiment adding this to a production wasm decoder, the performance impact was negligible. An implementation which only accepted ASCII strings, as mentioned above, could be even simpler -- just check that no byte has the MSB set. |
Anything left to discuss before this can be merged? |
I believe all the concerns raised have been answered. |
This implements the UTF-8 proposal described in
#989 (comment).
This does not currently rename "name" to "utf8-name", because if UTF-8 is
required for import/export names, there's a greater appeal to just saying
that all strings are UTF-8, though this is debatable.