Custom sections in the text format #1153

Open
copy opened this issue Nov 10, 2017 · 17 comments
Comments

@copy

copy commented Nov 10, 2017

The design document on the text format says:

> WebAssembly will define a standardized text format that encodes a WebAssembly module with all its contained definitions in a way that is equivalent to the binary format.

However the specification doesn't specify how to encode custom sections: https://webassembly.github.io/spec/text/modules.html#text-module and wasm2wat ignores custom sections.

@binji
Member

binji commented Nov 14, 2017

Good point, there should be a way to specify these sections in the text format. It seems like this was probably discussed in the past, but I can't remember where that may have been.

@rossberg, any thoughts on this?

@rossberg
Member

Well, a couple of issues with expressing custom sections directly:

  • The text format does not reflect the section order of the binary format. Hence, it is not clear how it could express where custom sections are supposed to go.

  • The same goes for the abstract syntax in the spec, which does not currently represent custom sections, because they are immaterial. So we cannot easily specify how they would interconvert.

  • Since the structure of custom sections is custom, they could only be given as raw bytes in the text format. That makes them only mildly useful for anything but low-level tests.

In general, the custom section format is more like a detail of the binary format. The assumption was that relevant custom sections are not written verbatim but rather synthesised from the text format, like the name section or the binding section.

What we should think about, then, is a generic syntax for annotations that can be put anywhere in the syntax tree. That would already be needed by the binding proposal. My suggestion would be to use nodes of the form (@id ...) that would be allowed anywhere in an S-expr and are uninterpreted by the core spec. The id would roughly correspond to a certain custom section, so that it is generic and extensible in a similar manner. A given tool may choose to interpret certain annotations and turn them into custom sections according to some separate spec. With global annotations in the module body such a spec could even enable spelling custom sections almost verbatim (up to their position in the binary).

WDYT?
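For readers following along, the annotation idea in this comment can be sketched in a few lines of Python. This is an illustration only, not code from any WebAssembly tool: the tokenizer is deliberately simplified (it treats any list whose first atom starts with `@` as an annotation), and the names `parse`, `strip_annotations` and the `@hint` annotation are hypothetical.

```python
import re

# Simplified tokens: parentheses, double-quoted strings, and bare atoms.
TOKEN = re.compile(r'\(|\)|"(?:\\.|[^"\\])*"|[^\s()"]+')


def parse(text):
    """Parse S-expressions into nested Python lists of string atoms."""
    stack = [[]]
    for tok in TOKEN.findall(text):
        if tok == "(":
            stack.append([])
        elif tok == ")":
            if len(stack) == 1:
                raise SyntaxError("unbalanced ')'")
            node = stack.pop()
            stack[-1].append(node)
        else:
            stack[-1].append(tok)
    if len(stack) != 1:
        raise SyntaxError("unbalanced '('")
    return stack[0]


def is_annotation(node):
    """Assumption: an annotation is a list whose head atom starts with '@'."""
    return (isinstance(node, list) and bool(node)
            and isinstance(node[0], str) and node[0].startswith("@"))


def strip_annotations(node, found):
    """Return node without annotation sub-lists; removed ones go into found."""
    if not isinstance(node, list):
        return node
    kept = []
    for child in node:
        if is_annotation(child):
            found.append(child)
        else:
            kept.append(strip_annotations(child, found))
    return kept


source = '(module (@custom "foo" "bytes") (func (@hint hot) (param i32)))'
annotations = []
core = strip_annotations(parse(source), annotations)
```

A core-spec consumer would only look at `core`, while a tool that understands a particular annotation id could pick it out of `annotations` and turn it into a custom section.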

@lukewagner
Member

Since the core spec does have a defined notion of a custom section, I think it makes sense to give a fully-specified representation in the text format. While it's true that we'd have a hard time expressing the precise placement of the custom section, I expect it's fine to just say that (custom ...) sections are just appended to the end of the module in the order they are encountered.

(Honestly, I wonder about the utility of allowing custom sections anywhere but at the end; I bet we could remove that "feature" and nothing would break.)

@binji
Member

binji commented Nov 22, 2017

> The text format does not reflect the section order of the binary format. Hence, it is not clear how it could express where custom sections are supposed to go.

All known sections have to be ordered, so you could just use a number to specify which known section it comes after. Something like (custom 0 ...) would come before the type section (1). (custom 3 ...) would go after the function section (3) and before the table section (4).
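A rough Python sketch of this numeric scheme, assuming the eleven known section ids as they stood at the time of this thread (type = 1 through data = 11). The function name `layout` and the `custom:` labels are made up for illustration; sections are modeled as names only, not bytes.

```python
# Known section ids, in the order the binary format required in 2017.
KNOWN_SECTIONS = {
    1: "type", 2: "import", 3: "function", 4: "table", 5: "memory",
    6: "global", 7: "export", 8: "start", 9: "element", 10: "code", 11: "data",
}


def layout(present_ids, customs):
    """Return section labels in binary order.

    present_ids: ids of the known sections the module actually contains.
    customs: (after_id, name) pairs; after_id is the known section id the
             custom section follows, and 0 means "before the type section".
    """
    # Known sections sort as (id, 0); custom sections sort right behind the
    # id they name, keeping their source order via the running index.
    entries = [((sid, 0), KNOWN_SECTIONS[sid]) for sid in present_ids]
    for index, (after_id, name) in enumerate(customs):
        if after_id < 0:
            raise ValueError("placement must be non-negative")
        entries.append(((after_id, 1 + index), f"custom:{name}"))
    return [label for _, label in sorted(entries)]


order = layout([1, 3, 4, 10], [(3, "foo"), (0, "bar")])
```

Here `(custom 3 ...)` lands between the function and table sections and `(custom 0 ...)` lands before the type section, regardless of the order in which they were written.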

> Since the structure of custom sections is custom, they could only be given as raw bytes in the text format. That makes them only mildly useful for anything but low-level tests.

I agree if we assume the purpose of the text format is just to generate tests for the spec. But we're already using the text format as a way to express the contents of the binary, and AFAICT it doesn't lose much information currently. The only thing I can think of right now is the length of varint values and custom section data. Are there others?

Also, we could make it slightly nicer than raw bytes by having a structured data format. The name section and the reloc section follow the same basic structure of other sections, using varints, strings and vectors. If we provided those primitives we could make it pretty easy to generate. They wouldn't roundtrip very nicely of course. Something like this, maybe:

(custom 12 "foo"
  (string "hello")
  (vector
    (group (varuint32 1) (f32 3.4))
    (group (varuint32 2) (bytes "12345"))
  )
)
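To make the primitives concrete, here is a hedged Python sketch of how such a structured form might be lowered to bytes. The custom section framing (id 0, size, name, payload) and unsigned LEB128 follow the binary format; everything else is an assumption for illustration: `group` is taken to be plain concatenation, `(bytes ...)` is taken to be raw and not length-prefixed, and the placement number is left out.

```python
import struct


def varuint32(value):
    """Unsigned LEB128, as used for u32 values in the binary format."""
    if not 0 <= value < 2**32:
        raise ValueError("varuint32 out of range")
    out = bytearray()
    while True:
        byte = value & 0x7F
        value >>= 7
        out.append(byte | 0x80 if value else byte)  # high bit: more bytes follow
        if not value:
            return bytes(out)


def f32(value):
    """IEEE 754 single precision, little-endian."""
    return struct.pack("<f", value)


def string(text):
    """Length-prefixed UTF-8, like names in the binary format."""
    data = text.encode("utf-8")
    return varuint32(len(data)) + data


def vector(items):
    """Element count followed by the already-encoded elements."""
    return varuint32(len(items)) + b"".join(items)


def group(*fields):
    """Assumption: a group is just the concatenation of its fields."""
    return b"".join(fields)


def custom(name, *fields):
    """Section id 0, byte size of the contents, name, then the payload."""
    contents = string(name) + b"".join(fields)
    return b"\x00" + varuint32(len(contents)) + contents


section = custom(
    "foo",
    string("hello"),
    vector([
        group(varuint32(1), f32(3.4)),
        group(varuint32(2), b"12345"),  # assumption: raw bytes, no length prefix
    ]),
)
```

As noted in the comment, this direction is easy, but a decoder cannot recover the grouping from the bytes without knowing the section's schema, so it would not round-trip nicely.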

RE: annotations

Agreed, annotations would be useful. I believe @yurydelendik was suggesting something like this before, maybe he has some thoughts about it. And you're right, I think we could handle custom sections in a structured way doing this. But I'd like to see a way to handle a custom section that has unstructured data, or one that is unknown to the parser too.

@rossberg
Member

rossberg commented Nov 23, 2017 via email

@binji
Member

binji commented Nov 23, 2017

> That would be rather brittle and expose low-level details of the binary encoding. In particular, we have assumed that we may insert new sections anywhere in future extensions of the binary format, so a numeric scheme is not future-proof.

Right, I forgot that new known sections may not be ordered. I think it will still work, though. If we assume that all known sections can occur only 0 or 1 times, as is currently true, then it doesn't seem like this is a problem. The number can just mean which section the custom section is before in the given module. If the section doesn't occur in the module, we could say that the text for that section is invalid. If we decide later that a known section can occur more than once, we can extend the text format at the same time to indicate which section we mean. And if using a number is gross/ugly, we can always use the names given in the spec:

(@custom "foo" (after import) "...")
(@custom "bar" (before data) "...")

If we did this, we'd probably also want to require that you can't specify sections out of order. Not so sure about the before/after thing either, but it's easy to understand and allows all placements.
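A small Python sketch of how the named before/after placements could be resolved, including the "no out-of-order placements" rule. It assumes the 2017 section order and models sections as names only; `slot`, `layout` and the `custom:` labels are hypothetical.

```python
SECTION_ORDER = ["type", "import", "function", "table", "memory", "global",
                 "export", "start", "element", "code", "data"]


def slot(relation, anchor):
    """Map (before X) / (after X) to a gap index between known sections.

    Gap i sits directly before SECTION_ORDER[i]; the last gap is at the end.
    """
    index = SECTION_ORDER.index(anchor)  # raises ValueError for unknown names
    if relation == "before":
        return index
    if relation == "after":
        return index + 1
    raise ValueError(f"unknown relation {relation!r}")


def layout(present, customs):
    """Return binary order for the present known sections plus custom ones.

    customs: (name, relation, anchor) triples in source order. Placements
    must not go backwards, otherwise the text is rejected.
    """
    slots = [slot(rel, anchor) for _, rel, anchor in customs]
    if slots != sorted(slots):
        raise ValueError("custom sections are specified out of order")
    pending = list(zip(slots, (name for name, _, _ in customs)))
    result = []
    for gap in range(len(SECTION_ORDER) + 1):
        result += [f"custom:{name}" for s, name in pending if s == gap]
        if gap < len(SECTION_ORDER) and SECTION_ORDER[gap] in present:
            result.append(SECTION_ORDER[gap])
    return result


order = layout({"type", "import", "code"},
               [("foo", "after", "import"), ("bar", "before", "data")])
```

Note that the anchor does not have to be present in the module: `(before data)` still resolves to a well-defined position when there is no data section.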

> If we could adopt @lukewagner's suggestion of eliminating free placement of custom sections then I'd feel more comfortable, but I'm not sure how realistic that is.

It probably isn't used much, but I would prefer not to break compatibility over it.

> Sure thing, we can simply support (@custom "name" "contents") etc as a generic fallback. AFAICS, that could subsume the suggestion above.

Right, this covers everything, it just is inconvenient.

@lukewagner
Member

> It probably isn't used much, but I would prefer not to break compatibility over it.

In addition to text-format motivations, there's also the fact that if it's an infrequently used feature, it will be undertested and likely to have problems in practice. I know we've had specific bugs about custom sections in weird places.

Maybe worth putting discussion/poll on CG agenda?

@eholk

eholk commented Dec 14, 2017

At the most recent CG meeting, we had some opposition to the idea of requiring custom sections to be at the end. The reason is that some use cases for custom sections involve informing later stages of the compilation pipeline. For example, tools might want to provide extra hints (which functions should be compiled first, which locals should get registers, etc.) that VMs could optionally consume. In this case, we'd want to read the hints before we start streaming compilation of the code.

@binji
Member

binji commented Dec 19, 2017

First pass proposal overview for custom sections in text format: https://gist.github.com/binji/d1cfff7faaebb2aa4f8b1c995234e5a0

@binji
Member

binji commented Jan 9, 2018

I've updated the gist after some feedback. Sorry I didn't notice this earlier, it seems that gist comments don't show up in my notifications (or I missed them).

@Pauan

Pauan commented Jan 10, 2018

@binji GitHub doesn't send notifications for Gist comments, it's very annoying.

@AndrewScheidecker

I prototyped something similar to @binji's proposed syntax, but an issue I ran into is that it can express more information about the section ordering than the binary format. For example, a binary module with no data segments cannot distinctly encode (@custom (after code)) and (@custom (after data)). I can't think of a nice way to solve that problem without adding explicit order information to binary custom sections.
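The ambiguity described here can be shown with a short Python sketch that goes in the decoding direction: given the section order found in a binary, list every text placement that would have produced it. The model (section names only, 2017 section order) and the function name `consistent_placements` are assumptions for illustration, not the prototype mentioned above.

```python
SECTION_ORDER = ["type", "import", "function", "table", "memory", "global",
                 "export", "start", "element", "code", "data"]


def consistent_placements(binary_order, custom_index):
    """List every (relation, anchor) that yields this custom section position."""
    before = [s for s in binary_order[:custom_index] if s in SECTION_ORDER]
    after = [s for s in binary_order[custom_index + 1:] if s in SECTION_ORDER]
    # The custom section sits somewhere between the nearest present
    # neighbours; every gap in that range is indistinguishable in the binary.
    low = SECTION_ORDER.index(before[-1]) + 1 if before else 0
    high = SECTION_ORDER.index(after[0]) if after else len(SECTION_ORDER)
    placements = []
    for gap in range(low, high + 1):
        if gap > 0:
            placements.append(("after", SECTION_ORDER[gap - 1]))
        if gap < len(SECTION_ORDER):
            placements.append(("before", SECTION_ORDER[gap]))
    return placements


no_data = consistent_placements(["type", "function", "code", "custom:foo"], 3)
with_data = consistent_placements(
    ["type", "function", "code", "custom:foo", "data"], 3)
```

Without a data section, `no_data` contains both `("after", "code")` and `("after", "data")`; with a data section present, `with_data` no longer contains `("after", "data")`.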

@rossberg
Member

rossberg commented Oct 8, 2019

@AndrewScheidecker, you might want to discuss this over at the annotations proposal, which contains a more up-to-date and complete definition of custom section annotations.

To reply to your comment, though, I am not sure why you consider this a problem. There are many examples of the text format being able to express the same binary in multiple ways. How is this different?

The goal of these multiple forms is not to provide a unique way of describing placement, but to let you place something reliably, in a fashion that is agnostic to the actual absence or presence of particular sections. So this is working as intended. You pick the placement that is correct in the presence of all sections, but it will also work fine if a respective section happens to be absent. You don't have to worry about which case you're in.

@AndrewScheidecker

If it is useful to express ordering constraints relative to virtual sections that may or may not be present in the binary module, then it must be worthwhile to encode those constraints in the binary module somehow.

Imagine that some compiler produces a WASM object file with a custom section that needs to be ordered between the code and data sections, but that module does not contain a code section. If you want to link that object file with another that does have a code section, then you need some additional metadata (or knowledge of that particular custom section) to ensure that the custom section ends up after that code section and not before it in the linked WASM module.

There's no text format involved here, but this scenario would benefit from being able to express the ordering constraints relative to virtual sections that are proposed here for the text format only.

It's true that there's other information in the text format that is not present in the abstract syntax and binary format, but the stuff I can think of is all trivia: the interleaving of definitions of different kinds, function types that aren't explicitly declared up front, comments, whitespace, expression vs instruction syntax, etc.

@rossberg
Member

rossberg commented Oct 9, 2019

> If it is useful to express ordering constraints relative to virtual sections that may or may not be present in the binary module, then it must be worthwhile to encode those constraints in the binary module somehow.

I don't think that follows. You shouldn't think of placements as a restrictive mechanism but a descriptive one.

But more importantly, as you say, this has nothing to do with the text format. Your complaint is about the design of the binary format itself.

But that is an inherent and unsolvable (and known) problem with the notion of custom data. It is true that a generic tool dealing with unfamiliar custom sections cannot know how to handle them correctly. But that is a much more general problem. To be correct, a linker might need to combine or modify certain custom sections, but by their nature of being custom, it generally has no way of knowing if or how. Their placement probably is the smallest problem such a tool faces. There is no solution to this.

> It's true that there's other information in the text format that is not present in the abstract syntax and binary format, but the stuff I can think of is all trivia: the interleaving of definitions of different kinds, function types that aren't explicitly declared up front, comments, whitespace, expression vs instruction syntax, etc.

Function type desugaring in particular is way more complicated. ;)

@AndrewScheidecker

> I don't think that follows. You shouldn't think of placements as a restrictive mechanism but a descriptive one.
>
> But more importantly, as you say, this has nothing to do with the text format. Your complaint is about the design of the binary format itself.

My complaint is not about only the binary format, or only the text format, but about a mismatch between them. :)

What I'm doing for now is to restrict the text format to prohibit specifying ordering relative to virtual sections that are not present according to some predicate defined on the abstract syntax. When decoding a binary module, empty sections (or sections that may not be present according to the abstract syntax predicate) are ignored for purposes of inferring the custom section order.

With those changes, I can round-trip custom sections ast->text->ast and ast->binary->ast.
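One possible reading of this restriction, as a Python sketch. It is not the implementation described in the comment: the "present" predicate is reduced to simple set membership, sections are modeled as names only, and `check_placement` and `infer_placement` are hypothetical names.

```python
def check_placement(present, relation, anchor):
    """Reject placements whose anchor section is absent from the module."""
    if relation not in ("before", "after"):
        raise ValueError(f"unknown relation {relation!r}")
    if anchor not in present:
        raise ValueError(f"cannot place relative to absent section {anchor!r}")


def infer_placement(binary_order, custom_index):
    """Pick one placement for a decoded custom section.

    Only sections that actually occur in the binary are used as anchors:
    the nearest preceding known section if there is one, otherwise the
    first following one.
    """
    for name in reversed(binary_order[:custom_index]):
        if not name.startswith("custom:"):
            return ("after", name)
    for name in binary_order[custom_index + 1:]:
        if not name.startswith("custom:"):
            return ("before", name)
    return None  # the module has no known sections to anchor to


decoded = ["type", "function", "code", "custom:foo"]
placement = infer_placement(decoded, 3)
check_placement({"type", "function", "code"}, *placement)
```

Because the inferred anchor is always a section that exists in the module, the placement read back from a binary passes the same check that the text parser applies.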

> Function type desugaring in particular is way more complicated. ;)

The desugaring is non-trivial, but the additional information in the text format is "trivia" in the sense that it doesn't affect the meaning of the program.

@binji
Member

binji commented Jan 14, 2020

I think I see what you're saying @AndrewScheidecker. I agree it would be better to continue the discussion on the annotations proposal, however. Would you mind opening a new issue there instead? We haven't done much work on that recently, but if someone picked it up, I wouldn't want this concern to fall through the cracks.
