String encoding is often unspecified #968

DLehenbauer · 2017-01-26T20:16:13Z

In BinaryEncoding.md, 'fun_name_str' and 'local_name_str' specify "valid utf8 encoding".

Other string fields are less specific about the encoding (e.g., "module string of module_len bytes").

I assume utf8 is permitted everywhere?

jfbastien · 2017-01-26T20:54:24Z

BinaryEncoding.md should only specify length + bytes. It's not a string, but rather a collection of bytes, and is treated as such.

An embedder such as JavaScript then can restrict encoding to something it can handle, such as UTF-8. This is what JS.md should do.

Any divergence from this is a spec bug.
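To illustrate the split described here, a minimal sketch in Python follows. It assumes a name field is stored as an unsigned LEB128 (`varuint32`) length followed by that many bytes. The helper names `read_varuint32`, `read_name_bytes`, and `embedder_decode_name` are hypothetical and not taken from any spec document. The binary-level reader returns raw bytes only, and a separate embedder-side step decides whether to insist on UTF-8.

```python
def read_varuint32(buf, pos):
    """Decode an unsigned LEB128 integer at pos; return (value, next_pos)."""
    result = 0
    shift = 0
    while True:
        byte = buf[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        if byte & 0x80 == 0:
            return result, pos
        shift += 7
        if shift >= 35:
            # A varuint32 occupies at most 5 bytes.
            raise ValueError("varuint32 is too long")


def read_name_bytes(buf, pos):
    """Binary level: read length + bytes and treat them as an opaque key."""
    length, pos = read_varuint32(buf, pos)
    end = pos + length
    if end > len(buf):
        raise ValueError("name runs past the end of the buffer")
    return buf[pos:end], end


def embedder_decode_name(raw):
    """Embedder level (assumed policy): require strictly valid UTF-8."""
    # bytes.decode is strict by default and raises UnicodeDecodeError.
    return raw.decode("utf-8")
```

With this split, two names compare equal at the binary level exactly when their bytes are equal, and only an embedder that wants text ever has to reject a malformed sequence.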

RyanLamansky · 2017-01-26T21:04:45Z

Hypothetically, an embedder could also ban UTF-8. The current spec builds in no assurances of interoperability among embedders.

DLehenbauer · 2017-01-27T17:54:30Z

I'm assuming that the motivation is to emphasize that, at the binary level, strings are always treated as opaque binary keys and that the embedder has flexibility/responsibility in how it handles invalid identifiers, case-folding for case-insensitive languages, etc.

If so, perhaps BinaryEncoding.md could be explicit about the opaque treatment of strings, and still state that strings are UTF-8 encoded to close the interoperability issue at the binary level?

(...or was there some other motivation to leave the door open for UTF-16, etc.?)

PS - I am coming from the language compiler author's perspective, looking to the binary encoding spec to inform me about which bytes to emit. I realize that from the bytecode compiler side, it is desirable for string encoding to be an uninteresting detail.

RyanLamansky · 2017-01-27T18:26:48Z

The JavaScript API requires UTF-8 for every string it uses. So, if you intend to run your compiler's output directly in a web browser, UTF-8 is what you want.
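From the compiler side, that amounts to something like the sketch below. It assumes each name is written as an unsigned LEB128 (`varuint32`) byte count followed by the UTF-8 bytes. `encode_varuint32` and `encode_name` are hypothetical helper names used only for illustration.

```python
def encode_varuint32(value):
    """Encode a non-negative integer below 2**32 as unsigned LEB128."""
    if not 0 <= value < 2**32:
        raise ValueError("value does not fit in a varuint32")
    out = bytearray()
    while True:
        byte = value & 0x7F
        value >>= 7
        if value:
            out.append(byte | 0x80)  # more bytes follow
        else:
            out.append(byte)  # last byte, continuation bit clear
            return bytes(out)


def encode_name(name):
    """Emit a name as its byte length followed by its UTF-8 bytes."""
    data = name.encode("utf-8")
    # The prefix counts encoded bytes, not characters.
    return encode_varuint32(len(data)) + data
```

Note that the length prefix is a byte count: `encode_name("é")` produces a prefix of 2, because `é` takes two bytes in UTF-8.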

The binary encoding document makes no mention of JavaScript at all. It's extremely "pure": it tells you everything you need to know about reading and writing WASM, with no discussion of how it fits into a larger ecosystem. You really need to read that, plus Semantics and the JavaScript API, to get a full picture of how it actually works today.

annevk · 2017-03-30T16:53:36Z

Was this fixed by 8e5ecc3?

sunfishcode · 2017-03-30T16:58:42Z

Indeed; this was fixed by 8e5ecc3, aka #1016. If there are any other areas where encodings are unspecified, please file new issues.

DLehenbauer · 2017-03-31T22:39:02Z

Thank you! :)

rossberg mentioned this issue Jan 30, 2017

UTF-8 decoding of import/export names in JS #970

Closed

sunfishcode modified the milestone: MVP Jan 31, 2017

flagxor added the imports/exports label Feb 3, 2017

sunfishcode closed this as completed Mar 30, 2017
