Add relevant items to CG-08-03 string discussion #830

Merged: 1 commit, Jul 29, 2021

Conversation

Contributor

@dcodeIO dcodeIO commented Jul 16, 2021

It has been suggested to me to increase the time slot allotted for the scheduled string discussion, and I'd like to use this suggestion to add relevant votes that I think should be decided before the string (and, by extension, char) type is restricted. In particular, there appears to be a general appetite to at least make sure that well-formed UTF-16 is properly supported, while I am personally very interested in hearing opinions on whether the single mechanism being proposed can indeed be considered in line with Wasm's high-level goals.

main/2021/CG-08-03.md (outdated)
* i.e. "maintain the backwards-compatible nature of the Web"
* i.e. "promote other compilers and tools targeting WebAssembly"
* i.e. "WebAssembly **also** supports non-web embeddings" (but not primarily)
1. Discussion: Is the Component Model in-line with our [high-level goals](https://webassembly.org/docs/high-level-goals/), with [what](https://webassembly.org/) we [communicate](https://www.w3.org/2020/03/webassembly-wg-charter), and [envisioned](https://www.youtube.com/watch?v=ZKydJPRosa0)? In particular:
Contributor

I'm not sure how productive this agenda item can be, given that we previously achieved consensus on pursuing the component model, but I don't have any further editorial suggestions.

Contributor Author

I would typically agree, but this was and still is a very controversial topic for a reason that I think hasn't gotten our full attention, because we are too afraid to address the elephant in the room. I think it can be healing for the group to finally say this out loud, so we can eventually move on.

Contributor Author

I'd also like to use the opportunity here to respectfully question the consensus we achieved on the component model:

  • In the next steps we voted on, it was claimed that "adapter functions [...] are just an optimization over using a fixed, canonical ABI", and I gave what I think was a reasonable presentation on why this is not the case for strings once the char type has been restricted to only allow specific semantics.
  • Right before voting, the champion clarified upon request that the vote excluded string semantics, yet it turned out that it in fact directly implied string semantics as a result of the scope change. It has been argued that this was necessary given the new requirements, yet doing so implies that choosing DOMString-compatible semantics is off the table anyhow, so the clarification would not have been necessary in the first place, and it could just have been stated right away.

As such I think that the most productive thing we can do is to have more discussion, and perhaps undo a potentially problematic scope change if it turns out that the group has an interest in doing so.

Member

The semantics of strings is a minor detail compared to the full scope of the component model we voted and achieved consensus on. I think it would set a bad precedent to allow that vote to be reconsidered. We should respect the decision of the community and continue working assuming the component model will have the scope and goals we voted on, no matter what our different technical opinions about string semantics and encodings.

Contributor Author

It seems that we disagree, and according to the process this is fine. But what do we do now? Propose a vote on whether to discuss the topic? What do we do when we disagree on adding it to the agenda as well? Propose a vote on whether to propose a vote on whether to discuss the topic? Have we maybe just found a reference cycle in the process? How do we invoke the backup collector? :)

Member

The vote was about the goals and requirements of the new component model (and whether to have one at all); things like cross-language composability, portability, virtualization, etc. These set an overall context for making design choices like string semantics (and 100s of other design choices) which are, of course, highly context dependent. Thus, the vote was on the context, not all the 100s of consequent design choices, and the clarification was meant to be clear on this, since it was an explicit concern you raised in an issue.

Contributor Author

I disagree that what you are saying is in line with voting on a "general direction" yet not achieving consensus on the concrete "next steps". This is not how I understood it, and had it been clear, I would not have voted NEUTRAL to do you a favor. In fact, I was pressured, against my reading of what is at stake here, by several WG members not to block consensus on the general vote, which I would question separately if this is the outcome.

Member

I'm having a hard time parsing what precisely you are disagreeing with. My claim is simply: there is strong CG consensus around the goals of the component model including cross-language interoperability and virtualizability, and thus these should be the starting points for any discussion about string semantics within the component model; we shouldn't start over from first principles.

Contributor

> In fact, I was pressured, against my reading of what is at stake here, by several WG members not to block consensus on the general vote, which I would question separately if this is the outcome.

@dcodeIO I take this to be referring to the many hours I spent in calls with you, giving advice on how to navigate this situation to your advantage, which you were very grateful for at the time. I don't appreciate your misrepresentation here.

I stand by my advice that it would have been poor behaviour to attempt to block the general component model vote on the grounds that the draft canonical ABI supported only UTF-8. To facilitate you, I explicitly brought up during the CG vote that we would need a separate future vote on the contents of the canonical ABI.

Contributor Author

I still appreciate all you have done, Conrad, and I respect you for your efforts. Wasm needs more people like you, but it also direly needs some honesty and integrity. In fact, I refrained in good faith from blocking the general component model vote, in the hope that it would motivate constructive discussion of my concerns, but now look at the outcome: A follow-up meeting in a smaller group was suggested, but then was refused due to "reluctance". I've presented my concerns reasonably I think, yet was met with the same old fallacies. I've made concrete suggestions on how to improve the proposal, yet was met with omission. I suggested concrete follow-up action items, yet was gamed again. Needless to say, my earlier general concerns on Wasm's non-web language-exclusive trajectory remain largely unaddressed to this day, also because an entire group of people decided to ghost me. And what's happening in this PR just makes me lose the remaining bits of trust in the process that you preserved in me, as well as in Wasm's goals, its vision, its principles, its values, and in particular its leadership.

* "execute in the same semantic universe as JavaScript"
* "maintain the backwards-compatible nature of the Web"
* "promote other compilers and tools targeting WebAssembly"
* "WebAssembly **also** supports non-web embeddings" (but not primarily)
1. Poll for retaining compatibility with `DOMString` - or otherwise - maintain a single list-of-USV `string` type
Contributor

Better, but "retaining compatibility with DOMString" is misleading. It's not as though introducing the component model "breaks" compatibility that was there before. Also, referring to the transmission of isolated surrogates across component boundaries as "compatibility with DOMString" severely obscures the true point of the vote.

I recommend

"Poll to either maintain a single list-of-USV string type, or alternatively allow isolated surrogates to cross component boundaries"

Contributor Author

Well, you either support DOMString, or not, then you are supporting USVString, no?

Contributor Author

Clarified the distinction between the two WebIDL concepts, and removed "compatibility" as per your request.

Contributor Author

@dcodeIO Jul 17, 2021

Btw, I think a very important aspect here is the word "single" that may eventually bite us since we cannot even introduce a working alternative in the future, or for the Web embedding. Would it be OK if I remove it, or emphasize it prominently, or split this aspect out similarly to making UTF-16 and Latin1 separate?

Member

I think @conrad-watt's suggested wording is good. I understand "single" to mean "in the current proposal", so it'd also be fine to include that sort of qualifier.

Contributor Author

@dcodeIO Jul 19, 2021

Hmm, yeah, makes sense. I wanted to be as diplomatic as possible, since I don't think this is an either-or, and as clear as possible, since people have argued in the past that I'd be advocating for just one option, when I'd actually prefer both. Do you see another way to clarify this nuance (or no need to), so my perspective isn't as easily misunderstood?

To give a concrete example, just for reference (we can talk about this separately), I think it would be possible to indicate statically that a module only supports list-of-USVs and produce an error during link time, to be resolved by the developer by either overriding or handling (in either module) accordingly.

Member

I don't think there's any need to clarify that nuance here on the agenda, but I'm sure it will come up in discussion and we can make sure it is reflected in the notes.

Member

I don't think it's correct to conflate "strings with lone surrogates" with DOMString (by saying "aka DOMString") because DOMString also generally suggests an encoding (WTF-16). As we keep saying in the other issue, this is a poll about string semantics so please use exactly the text @conrad-watt proposed.

Contributor Author

What if I say "DOMString semantics"? I think the WebIDL concept is useful here for those who may not be familiar with the topic.

Member

I think the agenda item should link to the relevant technical discussions to read beforehand which are then summarized during the discussion, not capture parts of the technical discussion within the agenda item itself, since the same unfamiliarity will likely conflate "DOMString semantics" with "two byte encodings".

@lukewagner
Member

Also, I don't think it's a good idea to add a poll for "compact UTF-16"; compared to the other topics, it has had relatively less discussion and supporting data so it seems premature to ask the CG to vote on it. In general, I don't think the goal here is to nail down every detail of strings, but just to get through this one sticking semantic issue so we can make progress on the proposal the normal way.

@dcodeIO
Contributor Author

dcodeIO commented Jul 20, 2021

I am open to removing the compact UTF bullet point. Just added it because it was part of your suggestion.

@dcodeIO
Contributor Author

dcodeIO commented Jul 22, 2021

I just figured that I can propose a vote on "compact UTF-16" as well if I wish, so I did. I used the wording suggested for the final poll, but added a more precise clarification of how it relates to enabling lossless transfer of DOMString through the string type. I've also updated the discussion point to specifically address my potential follow-up formal objection, in the hope that it can be addressed to achieve consensus.

main/2021/CG-08-03.md (outdated)
1. Summary of [interface-types/#135](https://github.com/WebAssembly/interface-types/issues/135) and discussion [40 min].
1. Poll for supporting [UTF-16](https://github.com/WebAssembly/interface-types/issues/136#issuecomment-861799460) lifting and lowering
1. Poll for supporting compact UTF-16 (Latin1) lifting and lowering
1. Potential formal objection: A "**single** list-of-USV `string` type" may conflict with our [high-level goals](https://webassembly.org/docs/high-level-goals/), with [what](https://webassembly.org/) we [communicate](https://www.w3.org/2020/03/webassembly-wg-charter), and [envisioned](https://www.youtube.com/watch?v=ZKydJPRosa0). In particular:
Contributor

Should this be on the agenda? If you want to formally object, then it could be done as part of the poll below.

Contributor Author

To be honest, I don't know why we are polling in the first place with my concerns and suggestions largely unaddressed.

Member

Although unanimity is the goal, it's by no means required of the consensus process. There are a number of times in the past when there was a non-unanimous poll, but, to make progress, the CG chairs determined overall consensus was reached and moved on. In any case, agreed with Piotr that objections are things to raise in the meeting, not in the agenda item which, according to the process, just states poll text as determined by the proposal champion(s).

Contributor Author

That's overall pretty clever, yeah.

* "maintain the backwards-compatible nature of the Web"
* "promote other compilers and tools targeting WebAssembly"
* "WebAssembly **also** supports non-web embeddings" (but not primarily)
1. Poll to maintain a **single** list-of-USV `string` type - or otherwise - allow isolated surrogates to cross component boundaries (enables lossless transfer of `DOMString`)
Contributor

  1. Could you add (byte strings) after isolated surrogates? I'm not sure if everybody is familiar with that terminology.
  2. Could you add at the cost of interoperability after DOMString? It's a trade-off.

Contributor Author

@dcodeIO Jul 22, 2021

  1. In WebIDL there is the concept of ByteString, which due to its value range cannot contain isolated surrogates unless reinterpreted. The concept of surrogates, paired or isolated, is exclusive to 16-bit encodings, and is extended to UTF-8 by WTF-8.
  2. For affected languages it enables interoperability without correctness cliffs. Similar trade-off, but different perspective. Ambiguous.
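To make the terminology in point 1 concrete, here is a minimal sketch of how an isolated (unpaired) surrogate can be detected in a JavaScript string. The helper name `hasIsolatedSurrogate` is hypothetical and only serves as an illustration; it is not part of any proposal.

```javascript
// Hypothetical helper: returns true if the string contains a surrogate code
// unit that is not part of a well-formed high/low pair.
function hasIsolatedSurrogate(str) {
  for (let i = 0; i < str.length; i++) {
    const unit = str.charCodeAt(i);
    if (unit >= 0xd800 && unit <= 0xdbff) {
      // High surrogate: must be followed by a low surrogate.
      const next = i + 1 < str.length ? str.charCodeAt(i + 1) : 0;
      if (next >= 0xdc00 && next <= 0xdfff) {
        i++; // well-formed pair, skip the low surrogate
      } else {
        return true;
      }
    } else if (unit >= 0xdc00 && unit <= 0xdfff) {
      // Low surrogate without a preceding high surrogate.
      return true;
    }
  }
  return false;
}

console.log(hasIsolatedSurrogate("a\u{1F600}b")); // false
console.log(hasIsolatedSurrogate("a\uD83D"));     // true
```

A string for which this returns `true` is a valid DOMString value but not a valid USVString value.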

@dschuff
Member

dschuff commented Jul 23, 2021

I agree with several of the posters here that advocating for a position doesn't belong in the agenda. I also agree that we have consensus on the goals of the Component Model (in particular, its emphasis on IT types as interchange formats between coarse-grained groups of modules). Having said that, I think that if you want to object to a particular proposal on the grounds that it's not consistent with wasm's goals as a whole, that's what the meeting discussion time is for. (But if there becomes consensus that a proposal fits with the goals of IT, a narrower framework than wasm as a whole, then that might not be a convincing argument). If you do want to give readers some background or even do some ahead-of-time persuasion (pre-suasion?) I think it would be ok to include some links in the agenda. IMO you could probably put as many as you wanted, but if you put too much, I'd expect people wouldn't read it all.
More generally I think we should discuss supporting UTF-16 lifting and lowering, as the opinions stated in WebAssembly/interface-types#136 seem to be that it would be valuable to have; and that getting some agreement on that direction would make it easier to get agreement on USV strings at the boundaries. We'd want to make it clear that these are separable in principle, but since people's opinions on one might be conditional on the other it probably makes sense to have some discussion of both before we do either of the votes. Since @yury91 agreed to move branch hinting to the next meeting (thanks!) we should have plenty of time for this.
One issue with the voting order though (i.e. UTF16-first) is that it doesn't necessarily make sense to say that we will support UTF16 lifting/lowering unless the IT string itself is a list-of-USV. We could just poll USV-strings first, or perhaps even poll UTF16 conditionally on USVstrings being accepted?

@dcodeIO
Contributor Author

dcodeIO commented Jul 23, 2021

I would strongly prefer to poll for UTF-16 first. Even if the list-of-USVs vote does not find consensus, which I expect, we'd at least have made forward progress on the matter and can iterate towards having support for W/UTF-8 and W/UTF-16 from the start instead of going through all this hassle again. If you prefer, I can amend it to "both UTF-16 and UTF-8" to lock in both.

@dcodeIO
Contributor Author

dcodeIO commented Jul 23, 2021

What I find much more problematic is the word "single" that is most likely going to bite us. We could just omit it to leave the door open to add a second string type if necessary. If not, I think we should separate this aspect.

@dschuff
Member

dschuff commented Jul 28, 2021

Forgive me (and correct me!) if I'm misrepresenting anyone's position or desired poll outcomes here... mostly I'm trying to make it as clear as possible so we can get a durable consensus.

It seems to me that "supporting UTF-16" lifting and lowering means different things (and in fact a whole different decision tree), based on whether or not we agree on list-of-USV semantics, though.
In WebAssembly/interface-types#136 (comment) @lukewagner mentions having specific adapter functions in the canonical ABI. But if the abstract string type is not list-of-USV but something else, then the adapter functions for UTF-16 specifically could be no-ops.
Can you say more about what adding UTF16 support would mean otherwise?

Regarding "single" list-of-USV as the string type: my general take on things that get left out of proposals is that the door is in some sense never completely closed. We had to leave a lot of things everyone wanted out of wasm MVP and add them in later. So having a single type with multiple lifting/lowering ops could be extended to multiple types in the future (just as there will certainly be additional types added to both IT and wasm itself, to go along with whatever set of types the GC MVP gets).

@dcodeIO
Contributor Author

dcodeIO commented Jul 28, 2021

Supporting lifting and lowering of UTF-16 avoids the inefficiency and inconvenience aspects I presented for languages utilizing 16-bit Unicode, in that these would not need to double re-encode for example (transcoding the string to UTF-8 before being able to lift, then transcoding again after lowering), or involve runtime fundamentals like memory allocation in order to do so. There is also the aspect that supporting a second encoding would motivate implementing the proposed realloc mechanism from the start instead of stubbing it out, as mentioned in the linked post.
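As an illustration of the double re-encoding described above, here is a sketch that models a 16-bit language's string storage as a `Uint16Array` and passes it through a UTF-8-only boundary. All function names are hypothetical, and `TextEncoder`/`TextDecoder` merely stand in for the transcoding work; this is not the canonical ABI.

```javascript
// Models how a UTF-16 language might hold a string in linear memory.
function toCodeUnits(str) {
  const units = new Uint16Array(str.length);
  for (let i = 0; i < str.length; i++) units[i] = str.charCodeAt(i);
  return units;
}

function fromCodeUnits(units) {
  let s = "";
  for (const u of units) s += String.fromCharCode(u);
  return s;
}

// Sender side with a UTF-8-only boundary: first transcode, which needs a
// freshly allocated buffer.
function lowerViaUtf8(units) {
  return new TextEncoder().encode(fromCodeUnits(units)); // UTF-16 -> UTF-8
}

// Receiver side: second transcode back into 16-bit storage.
function liftViaUtf8(bytes) {
  return toCodeUnits(new TextDecoder().decode(bytes)); // UTF-8 -> UTF-16
}

// With UTF-16 lifting and lowering, both sides could copy code units as-is.
function copyUtf16(units) {
  return units.slice();
}

const source = toCodeUnits("h\u00e9llo");
const viaUtf8 = liftViaUtf8(lowerViaUtf8(source)); // two transcoding passes
const direct = copyUtf16(source);                  // one plain copy
console.log(fromCodeUnits(viaUtf8) === fromCodeUnits(direct)); // true
```

For well-formed input both paths produce the same result; the difference is the two transcoding passes and the intermediate allocation on the UTF-8 path.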

Independently of whether UTF-16 lifting and lowering are supported in addition to UTF-8 lifting and lowering, we would have to decide separately whether we want to utilize the replacement strategy on unpaired surrogates, whether we'd prefer to trap to make the mismatch non-silent, or whether we'd want to provide an option to roundtrip losslessly, or a combination of these. This aspect is tied to "list-of-USVs", in that deciding for it would require such a choice, but only allow for the first two options. The last option, that is, roundtripping losslessly to aid languages that evolved from UCS-2 to UTF-16, will become conceptually impossible with "list-of-USVs", leading us to the incorrectness aspect that can have various implications for affected languages, as they deliberately chose to maintain integrity and not replace/trap, also over boundaries.
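The three options can be sketched as policies applied when a 16-bit string crosses a boundary. The function `lowerString` and the policy names are hypothetical and chosen for illustration only; under list-of-USV semantics only `"replace"` and `"trap"` would remain available.

```javascript
// Hypothetical policies for unpaired surrogates at a boundary:
//   "replace"  - substitute U+FFFD (silent, lossy)
//   "trap"     - throw, making the mismatch non-silent
//   "lossless" - pass the code unit through unchanged
function lowerString(str, policy) {
  if (policy !== "replace" && policy !== "trap" && policy !== "lossless") {
    throw new TypeError("unknown policy: " + policy);
  }
  let out = "";
  for (let i = 0; i < str.length; i++) {
    const unit = str.charCodeAt(i);
    const isHigh = unit >= 0xd800 && unit <= 0xdbff;
    const isLow = unit >= 0xdc00 && unit <= 0xdfff;
    if (isHigh && i + 1 < str.length) {
      const next = str.charCodeAt(i + 1);
      if (next >= 0xdc00 && next <= 0xdfff) {
        out += str[i] + str[i + 1]; // well-formed pair, copy both units
        i++;
        continue;
      }
    }
    if (isHigh || isLow) {
      // Any surrogate reaching this point is unpaired.
      if (policy === "replace") out += "\uFFFD";
      else if (policy === "trap") throw new RangeError("unpaired surrogate at index " + i);
      else out += str[i];
    } else {
      out += str[i];
    }
  }
  return out;
}

console.log(lowerString("a\uD83D", "replace") === "a\uFFFD");  // true
console.log(lowerString("a\uD83D", "lossless") === "a\uD83D"); // true
```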

While hard to quantify, I believe that this will likely lead to issues like rustwasm/wasm-bindgen#1348 all over the place when affected languages are involved, especially on arbitrary user input, in a cat-and-mouse game to mitigate the effects of silent data corruption or denial of service where it is currently hard to spot. The POC here is

```js
let myString = inputString.substring(0, 10); // user finds it funny to place an emoji at 9
map.set(myString, 42);

let alsoMyString = roundtripStringOverInterfaceTypesBoundaryButWhoKnows(myString);

map.get(alsoMyString); // undefined
if (myString == alsoMyString) {
  // false
}
```

which, even though it looks simple, can have various implications for affected languages, especially because code like this currently does not lead to hard sanitization or failures in these languages by design (except when calling APIs that require it but typically don't roundtrip at the level of function calls, like HTTP), nor over boundaries between themselves or to JS.
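A self-contained variant of the POC above can be run by standing in `TextEncoder`/`TextDecoder` for the boundary. This is an assumption made for illustration: `TextEncoder` takes a USVString, so it replaces unpaired surrogates with U+FFFD, which matches the replacement strategy but is not the Interface Types mechanism itself. The function name `roundtripOverUsvBoundary` is hypothetical.

```javascript
// Stand-in for a list-of-USV boundary using the replacement strategy.
function roundtripOverUsvBoundary(str) {
  return new TextDecoder().decode(new TextEncoder().encode(str));
}

const inputString = "012345678\u{1F600}tail"; // emoji starts at index 9
const myString = inputString.substring(0, 10); // cuts the surrogate pair in half

const map = new Map();
map.set(myString, 42);

const alsoMyString = roundtripOverUsvBoundary(myString);

console.log(map.get(alsoMyString));                   // undefined
console.log(myString === alsoMyString);               // false
console.log(alsoMyString.charCodeAt(9).toString(16)); // "fffd"
```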

In context, it has been suggested that affected languages could avoid the string type and use list u16 as an escape hatch, which again has various implications we haven't talked about much yet. The one that seems the most unfortunate to me is that as soon as a language has multiple encodings underneath, or does straight-forward optimizations like compact strings, it will not be possible to annotate the dynamic encoding statically, making the suggested escape hatch impractical. The other aspect is of course that it would neither integrate with JS (becomes Uint16Array) nor WASI (wants string).

In a similar context, it has been mentioned that languages that currently utilize potentially ill-formed UTF-16 (WTF-16) semantics/encoding could be switched to UTF-8. The one mechanism I know of that would allow doing this transparently, that is, without introducing replacement or trapping in existing string APIs, is Swift-like breadcrumbs, but then one would still end up with WTF-8 underneath, which is not an improvement. There is also a performance aspect here in that while breadcrumbs promise "amortized O(1)", with a stride of 64 like in Swift it will in relation and practice look more like "amortized O(32)" (roughly), which may on its own not be acceptable for all implementers.
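To show where the "roughly 32" figure comes from, here is a simplified sketch of a breadcrumb index over UTF-8 storage. It is not Swift's actual implementation: the names are hypothetical, the input is assumed to be well-formed UTF-8, and the crumb layout is a guess made for illustration. Each crumb records the byte offset of the scalar covering every 64th UTF-16 code unit, and a lookup scans forward from the nearest crumb.

```javascript
const STRIDE = 64; // distance between breadcrumbs, in UTF-16 code units

// Length of a UTF-8 sequence, derived from its lead byte (valid input assumed).
function utf8SeqLength(lead) {
  if (lead < 0x80) return 1;
  if (lead < 0xe0) return 2;
  if (lead < 0xf0) return 3;
  return 4;
}

// One crumb per multiple of STRIDE, pointing at the scalar that covers it.
function buildBreadcrumbs(bytes) {
  const crumbs = [];
  let utf16Index = 0;
  let byteOffset = 0;
  while (byteOffset < bytes.length) {
    const len = utf8SeqLength(bytes[byteOffset]);
    const units = len === 4 ? 2 : 1; // 4-byte sequences become surrogate pairs
    while (crumbs.length * STRIDE < utf16Index + units) {
      crumbs.push({ byteOffset, utf16Index });
    }
    utf16Index += units;
    byteOffset += len;
  }
  return crumbs;
}

// Maps a UTF-16 index to the byte offset of the scalar containing it, and
// reports how many scalars had to be scanned past the crumb.
function byteOffsetOfUtf16Index(bytes, crumbs, index) {
  const crumb = crumbs[Math.floor(index / STRIDE)];
  if (crumb === undefined) throw new RangeError("index out of range");
  let { byteOffset, utf16Index } = crumb;
  let steps = 0;
  while (byteOffset < bytes.length) {
    const len = utf8SeqLength(bytes[byteOffset]);
    const units = len === 4 ? 2 : 1;
    if (index < utf16Index + units) return { byteOffset, steps };
    utf16Index += units;
    byteOffset += len;
    steps++;
  }
  throw new RangeError("index out of range");
}

const bytes = new TextEncoder().encode("a".repeat(100) + "\u{1F600}" + "b".repeat(100));
const crumbs = buildBreadcrumbs(bytes);
console.log(byteOffsetOfUtf16Index(bytes, crumbs, 128)); // { byteOffset: 130, steps: 0 }
console.log(byteOffsetOfUtf16Index(bytes, crumbs, 127)); // { byteOffset: 129, steps: 62 }
```

An index that lands on a crumb needs no scan, while one just before the next crumb scans up to 63 code units, so a uniformly random index scans about half the stride on average.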

As such it seems likely to me that a "single" list-of-USV type will turn out to be insufficient, so I figured that it would be beneficial to future discussion to remove the word (we'd have a separate vote anyway), so that we do not make it harder than it needs to be to have a future poll on adding a second string type.

Sorry for the wall of text. This is all closely connected, and the most important aspects here are implicit.

@dschuff
Member

dschuff commented Jul 28, 2021

Thanks, I think that's actually a pretty nice summary of some points that you've previously left as even bigger walls of text and/or links to long and sometimes difficult-to-follow threads. As I alluded to above, IMO you'd help your case if you did even more of this kind of thing before the meeting; i.e. summarizing your arguments and building a case systematically and concisely. In other threads you've made some of the same points of course, but that kind of discussion/reply-in-context is often different from what you want in a case like this. And let's be honest, you've also said a lot of other things which have actively harmed your case and made people less willing to take your side even if they agree with you.

Anyway, back to the topic of the agenda. I think I'm OK polling for UTF-16 first. Probably just calling it "Poll for supporting UTF-16 lifting and lowering" as you have it now, is OK for this PR, with the idea that this would signal the CG's intent to support lifting and lowering to whatever type we end up having, and that the details could change depending on what else is decided. (and if others have suggestions on refinements even before the meeting let's discuss that too).
In this PR, let's also leave Luke's poll alone; if we want to also suggest changes to that, let's open another PR for it.
If you make those changes, I'll merge this PR. (You can also add links as I mentioned before, either in this or a future PR).

1. Poll for maintaining single list-of-USV `string` type
* [Summary of concerns](https://github.com/WebAssembly/interface-types/issues/135#issuecomment-888878363) by W/UTF-16 languages
Contributor

@conrad-watt Jul 29, 2021

I think this is still on the edge of editorialising, especially since this is @lukewagner's agenda item. It's not as though @lukewagner linked WebAssembly/interface-types#135 as "summary of benefits" (which could be an appropriate balance here if something like this line is to be included).

@dschuff proposed discussing changes to this agenda item as a separate PR, which I think makes sense since we really want to make sure the UTF-16 vote is timetabled.

Contributor Author

I have done as I was told and do think that the context is both useful and reasonably stated.

@dschuff
Member

dschuff commented Jul 29, 2021

I'm going to merge this PR (i.e. settling the issue of "should we add discussion/vote on UTF-16 lifting/lowering") and open another one to continue the discussion about what exactly is the best way to organize the agenda items.

@dschuff dschuff merged commit ec9f481 into WebAssembly:main Jul 29, 2021
@dschuff
Member

dschuff commented Jul 29, 2021

Continued in #845

6 participants