String encoding is often unspecified #968

Closed
DLehenbauer opened this issue Jan 26, 2017 · 7 comments
@DLehenbauer

In BinaryEncoding.md, 'fun_name_str' and 'local_name_str' specify "valid utf8 encoding".

Other string fields are less specific about the encoding (e.g., "module string of module_len bytes").

I assume UTF-8 is permitted everywhere?

@jfbastien
Member

BinaryEncoding.md should only specify length + bytes. Such a field is not a string, but rather a collection of bytes, and is treated as such.

An embedder such as JavaScript then can restrict encoding to something it can handle, such as UTF-8. This is what JS.md should do.

Any divergence from this is a spec bug.
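
To illustrate what "a collection of bytes, treated as such" implies for lookups, here is a minimal Python sketch. The `exports` table and the `lookup` helper are hypothetical names made up for this example and do not come from any spec document. The sketch assumes names are matched byte for byte with no decoding or normalization, so two spellings that render identically can still be different keys.

```python
import unicodedata

# Two spellings of the same visible text that differ at the byte level.
composed = unicodedata.normalize("NFC", "café").encode("utf-8")    # é as one code point
decomposed = unicodedata.normalize("NFD", "café").encode("utf-8")  # e + combining accent

# Hypothetical export table keyed by raw bytes (an assumption for illustration).
exports = {composed: 0}


def lookup(table, name_bytes):
    """Return the index stored under an exact byte-for-byte match, or None."""
    return table.get(name_bytes)
```

Under that assumption, `lookup(exports, composed)` finds the entry while `lookup(exports, decomposed)` does not, even though both would display the same way once decoded.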

@RyanLamansky

Hypothetically, an embedder could also ban UTF-8. The current spec builds in no assurances of interoperability among embedders.

@DLehenbauer
Author

DLehenbauer commented Jan 27, 2017

I'm assuming that the motivation is to emphasize that, at the binary level, strings are always treated as opaque binary keys and that the embedder has flexibility/responsibility in how it handles invalid identifiers, case-folding for case-insensitive languages, etc.

If so, perhaps BinaryEncoding.md could be explicit about the opaque treatment of strings, and still state that strings are UTF-8 encoded to close the interoperability issue at the binary level?

(...or was there some other motivation to leave the door open for UTF-16, etc.?)

PS - I am coming from the language compiler author's perspective, looking to the binary encoding spec to inform me about which bytes to emit. I realize that from the bytecode compiler side, it is desirable for string encoding to be an uninteresting detail.
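
From the emitting side, the question of "which bytes to emit" might be sketched as follows in Python. This is only a sketch under stated assumptions: the length prefix is taken to be an unsigned LEB128 integer of at most 32 bits, and the payload is taken to be UTF-8, which is the option discussed in this thread rather than something the quoted text guarantees. `encode_varuint32` and `encode_name` are hypothetical helper names.

```python
def encode_varuint32(value: int) -> bytes:
    """Encode a non-negative integer below 2**32 as unsigned LEB128."""
    if not 0 <= value < 2**32:
        raise ValueError("value out of range for a 32-bit unsigned integer")
    out = bytearray()
    while True:
        byte = value & 0x7F          # low seven bits
        value >>= 7
        if value:
            out.append(byte | 0x80)  # more bytes follow
        else:
            out.append(byte)         # final byte, continuation bit clear
            return bytes(out)


def encode_name(name: str) -> bytes:
    """Emit a byte-length prefix followed by the name's UTF-8 bytes (assumed encoding)."""
    payload = name.encode("utf-8")   # strict: raises on unencodable input such as lone surrogates
    return encode_varuint32(len(payload)) + payload
```

Note that the prefix counts bytes, not characters, so `encode_name("é")` carries a length of 2.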

@RyanLamansky

RyanLamansky commented Jan 27, 2017

The JavaScript API requires UTF-8 for every string it uses. So, if you intend to run your compiler's output directly in a web browser, UTF-8 is what you want.

The binary encoding document makes no mention of JavaScript at all. It's extremely "pure"... it tells you everything you need to know about reading and writing WASM with no discussion of how it fits into a larger ecosystem. You really need to read that, plus Semantics and JavaScript API, to get a full picture of how it actually works today.
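
The split between the binary layer and the embedder could look roughly like the Python sketch below. It is not taken from any implementation: `decode_varuint32`, `read_name`, and the `require_utf8` flag are hypothetical, and the length prefix is again assumed to be an unsigned LEB128 integer of at most 32 bits. The reader always returns the raw bytes, and only an embedder that asks for UTF-8 gets the strict decoding step.

```python
from typing import Optional, Tuple


def decode_varuint32(data: bytes, offset: int = 0) -> Tuple[int, int]:
    """Decode an unsigned LEB128 integer; return (value, offset after it)."""
    result = 0
    shift = 0
    while True:
        if offset >= len(data):
            raise ValueError("truncated length prefix")
        byte = data[offset]
        offset += 1
        result |= (byte & 0x7F) << shift
        if not byte & 0x80:          # continuation bit clear: last byte
            break
        shift += 7
        if shift >= 35:              # a 32-bit value fits in at most five bytes
            raise ValueError("length prefix too long")
    if result >= 2**32:
        raise ValueError("length prefix exceeds 32 bits")
    return result, offset


def read_name(
    data: bytes, offset: int = 0, require_utf8: bool = True
) -> Tuple[bytes, Optional[str], int]:
    """Read a length-prefixed name; optionally insist that it is valid UTF-8."""
    length, offset = decode_varuint32(data, offset)
    raw = data[offset:offset + length]
    if len(raw) != length:
        raise ValueError("truncated name")
    # Strict decoding raises UnicodeDecodeError on invalid UTF-8.
    text = raw.decode("utf-8") if require_utf8 else None
    return raw, text, offset + length
```

With `require_utf8=False` the bytes `b"\xff\xfe"` pass through untouched, while the default setting rejects them, which mirrors the interoperability concern raised earlier in the thread.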

@annevk
Member

annevk commented Mar 30, 2017

Was this fixed by 8e5ecc3?

@sunfishcode
Member

Indeed; this was fixed by 8e5ecc3, aka #1016. If there are any other areas where encodings are unspecified, please file new issues.

@DLehenbauer
Author

Thank you! :)
