
Prototype GC instructions #2935


Closed
dcodeIO opened this issue Jun 30, 2020 · 22 comments

dcodeIO (Contributor) commented Jun 30, 2020

According to WebAssembly/gc#81 and the linked status document, early prototyping work on the GC proposal has been done in WABT and V8, and I'd love to start playing with it. Are there any plans on the Binaryen side already? I'd also be happy to help where I can, that is, if there's anything I can do that doesn't create more review overhead than it saves. Just let me know :)

Another interesting aspect is that once the respective instructions are available in Binaryen, there is also the opportunity to transform a GC-enabled module into a non-GC module by essentially polyfilling a runtime. I'd imagine that a transform pass like this would be useful for all sorts of tools and languages to ease the transition in the future (emit Wasm GC today, run without it), and perhaps some bits of what we already have at AS can be used to make this possible.

kripken (Member) commented Jun 30, 2020

I'm not aware of anyone planning to work on this. But I'm also not sure if it makes sense atm to experiment in multiple toolchain projects. cc @binji for thoughts.

I like the idea for a pass to lower GC to non-GC, yeah, that sounds like it could be useful for polyfilling etc.!

binji (Member) commented Jul 13, 2020

@tlively Yeah, this would be another benefit of having shared parsing/loading code for wabt and binaryen :-) Using wasp for that is still a long-term goal of mine, but it seems I never have quite enough time. That said, it's definitely not dropped.

tlively (Member) commented Jul 14, 2020

I also still hope to use wasp in Binaryen! We'll be in a good position to evaluate that project after the current Stack IR redesign/refactoring/replacement is done. That being said, GC support would require much more work in Binaryen than just the parsing that wasp would provide. If anyone from the wider community would be interested in taking a stab at implementing experimental GC support, I'd be happy to help out with guidance and code reviews.

dcodeIO (Contributor, Author) commented Jul 29, 2020

So I took an initial stab at this and added the instruction classes StructNew, StructGet, StructSet, ArrayNew, ArrayGet, ArraySet and ArrayLen (ref: WebAssembly/wabt#1345) in a way that Binaryen compiles again. Anything touching RTTs is more in a stub/TODO state still and there is no text parsing yet.

Consequently, the next aspect to get right is how to represent struct and array types. One strategy might be to:

  • Rename Signature to FunctionDefinition
  • Add StructDefinition (has multiple FieldDefinitions with type and mutability)
  • Add ArrayDefinition (has an element Type and mutability)
  • Add TypeDefinition as a union of FunctionDefinition | StructDefinition | ArrayDefinition
  • Fix up existing Signature usage accordingly
  • Continue implementing the instructions above similar to CallIndirect, referencing either a StructDefinition or ArrayDefinition

Does this approach sound ok?

binji (Member) commented Jul 29, 2020

I'm not sure exactly how binaryen does it, but the more invasive change (in my experience) is to ValueType, which now must also support an optional index (or optional name, for the text format).

tlively (Member) commented Jul 29, 2020

@dcodeIO The recently-redesigned Type is meant to be extensible to these new kinds of complex types, just like it has been extended to handle tuples already. Currently all types are interned in the indices map in wasm-type.cpp. Right now the key to that map is std::vector<Type> because the only complex types are tuples. To support new kinds of types, we will need to have a more interesting interned type that can represent the new type structures as well. If we had C++17, this would probably be a std::variant, but since we're still on C++14, we will have to do something more ad hoc with a tagged union.
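
For illustration, here is a minimal sketch (not Binaryen code; all names are hypothetical stand-ins) of the two options mentioned above. With C++17 the interned description could simply be a std::variant, whereas on C++14 the same shape has to be spelled out as a tagged union, which is what the later TypeDef sketch in this thread does.

#include <cstdint>
#include <variant>
#include <vector>

// Hypothetical stand-ins for the structures under discussion.
struct Type { uintptr_t id; };    // interned handle, like wasm-type.h's Type
using Tuple = std::vector<Type>;  // the only complex type today
struct Signature { Tuple params, results; };
struct Struct { /* fields */ };
struct Array { /* element */ };

// With C++17 available, the interned description could simply be:
using TypeInfo = std::variant<Tuple, Signature, Struct, Array>;

// On C++14, the equivalent is an explicit Kind enum plus a union over the
// same alternatives (see the TypeDef sketch further down in this thread).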

dcodeIO (Contributor, Author) commented Jul 29, 2020

I'm not sure exactly how binaryen does it, but the more invasive change (in my experience) is to ValueType, which now must also support an optional index (or optional name, for the text format).

Can you elaborate a bit on why this information has to live on the types directly? So far I imagined that we'd need some sort of mapping from types (global to all modules in Binaryen, describing the layouts of structs, arrays or functions) to the type section (local to each module, like function signatures), and that indices or names would be a property of the link from a global type to a specific module's type section entry. Very likely I'm overlooking something in the spec so far.

Currently all types are interned in the indices map in wasm-type.cpp.

I see, thanks, looking at it now. Hmm, perhaps separate indices, one for multi values, structs, arrays and signatures with a shared internal counter for the unique id? Structs can be keyed as a list of their fields (each field has a Type and a mutable flag) and arrays as a single field indicating their element type (unless arrays can have other fields?). Oh my.

binji (Member) commented Jul 29, 2020

Can you elaborate a bit why this information has to live on the types directly?

There are (at least) two new value types: ref $t and ref null $t, where $t is a type index. These are actually first introduced in the function-references proposal, but are used in the gc proposal too. So you can use these types anywhere that i32 etc. were previously allowed, e.g.

(type $t ...)
(table 0 (ref $t))
(global (ref $t) ...)
(func (param (ref $t)) (result (ref null $t))
  (local (ref null $t))
  (block (param (ref null $t)) (result (ref $t))
    ...
  )
)
...

tlively (Member) commented Jul 29, 2020

I see, thanks, looking at it now. Hmm, perhaps separate indices, one for multi values, structs, arrays and signatures with a shared internal counter for the unique id? Structs can be keyed as a list of their fields (each field has a Type and a mutable flag) and arrays as a single field indicating their element type (unless arrays can have other fields?). Oh my.

The map keys are used when creating a new Type to look up whether an instance of that type already exists. Since we know what kind of type we are trying to create, having separate maps for each kind would work. However, a Type itself is just a pointer to a richer interned data structure that describes the structure of the type, so the Type does not know what kind of type it is without looking at that interned data structure. That means we still need a unified data type that can describe the layout of any kind of type. Given that, I'm not sure having separate maps buys us much.
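
As a rough illustration of that interning idea (hypothetical names and layout, not the actual wasm-type.cpp code): the unified description is the map key, and a Type is effectively just the id handed back by the lookup.

#include <cstddef>
#include <cstdint>
#include <functional>
#include <unordered_map>
#include <vector>

// Hypothetical unified description used as the map key.
struct TypeDesc {
  int kind;                      // which kind of type this describes
  std::vector<uintptr_t> parts;  // child type ids, flattened
  bool operator==(const TypeDesc& other) const {
    return kind == other.kind && parts == other.parts;
  }
};

struct TypeDescHash {
  size_t operator()(const TypeDesc& desc) const {
    size_t h = std::hash<int>{}(desc.kind);
    for (auto part : desc.parts) {
      h ^= std::hash<uintptr_t>{}(part) + 0x9e3779b9 + (h << 6) + (h >> 2);
    }
    return h;
  }
};

// Return a stable id for `desc`, creating one only if this structure has not
// been interned before; a Type would then just wrap the returned id.
uintptr_t intern(const TypeDesc& desc) {
  static std::unordered_map<TypeDesc, uintptr_t, TypeDescHash> interned;
  auto it = interned.find(desc);
  if (it != interned.end()) {
    return it->second;
  }
  uintptr_t id = interned.size() + 1;  // the real code stores a pointer instead
  interned.emplace(desc, id);
  return id;
}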

dcodeIO (Contributor, Author) commented Jul 29, 2020

Thanks guys! So, would it be viable to establish something like

class Type {
  enum Kind {
    Default,
    Multi,
    Function,
    Struct,
    Array
  };
  uintptr_t id;
  Type::Kind kind;
  union {
    MultivalueDefinition multiDef;
    FunctionDefinition funcDef;
    StructDefinition structDef;
    ArrayDefinition arrayDef;
  };
  bool isStruct() { return kind == Type::Kind::Struct; }
  bool isNullable() {
    switch (kind) {
      ...
      case Type::Kind::Struct: return structDef.nullable;
      ...
      default: return false;
    }
  }
  ...
};

struct MultivalueDefinition {
  std::vector<Type> types;
};

struct FunctionDefinition {
  std::vector<Type> params;
  std::vector<Type> results;
  bool nullable;
};

struct Field {
  Type type;
  bool mutable_;
};

struct StructDefinition {
  std::vector<Field> fields;
  bool nullable;
};

struct ArrayDefinition {
  Field element;
  bool nullable;
};

with instances of MultivalueDefinition, FunctionDefinition, Field, StructDefinition and ArrayDefinition being globally deduplicated just like the types referencing them, and the $t in (ref $t ...) being represented by the per-module index of such a type in the module's type section array (inserted when reading / looked up when printing / otherwise mostly irrelevant)?

If not, what'd be a good alternative fitting into the existing code base?

tlively (Member) commented Jul 29, 2020

Yeah, that's almost what I have in mind. The only difference is that your code snippet above adds a kind field to Type, which would increase the size of every Type in the IR. We definitely need a TypeKind somewhere, but I would put it into a new TypeDescription tagged union struct that is private to wasm-type.cpp rather than in Type itself. The TypeDescription would replace the std::vector<Type> as the thing that gets interned.

A few other notes so I don't have to repeat them later in code review:

  • I think we should standardize on the "tuple" terminology for tuple types, rather than "multi."

  • The Definition suffix on the structs seems unnecessary; I think it could be shortened to Def or omitted.

  • Rather than defining a new struct MultivalueDefinition, I think it would be simpler for all the client code to just have using Tuple = std::vector<Type> (or just omit it entirely) to avoid a layer of type wrapping.

  • Instead of having separate params and results fields, FunctionDefinition can just contain a Signature.

binji (Member) commented Jul 29, 2020

I think this is a reasonable way to do it, though it ends up combining the two concepts of a "value type" and the "type definition". One concern is that a type definition has an explicit index space, but value types don't. Since this model compresses all types together, you would need to have a separate mapping from a type index to your Type struct, or you would need to fully remove all type indexes when parsing.

You'll need to be a bit careful when doing this too, since types can be recursive in different ways and yet still need to be considered equivalent, e.g.

(module
  (type $t1 (func (param i32 (ref $t1))))
  (type $t2 (func (param i32 (ref $t3))))
  (type $t3 (func (param i32 (ref $t2))))
  ...
)

Here $t1, $t2, and $t3 are all equivalent.
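
For context, one standard way to decide such equivalences (not what Binaryen ended up doing, just a sketch with hypothetical names) is a coinductive check: while comparing two definitions, assume they are equal so that mutually recursive references like $t2/$t3 terminate.

#include <cstddef>
#include <set>
#include <utility>
#include <vector>

// Hypothetical minimal representation: a value type is either a basic type
// or a reference to another function type by index into `defs`.
struct ValType {
  enum Kind { I32, I64, F32, F64, Ref } kind;
  size_t index = 0;  // only meaningful when kind == Ref
};

struct FuncDef {
  std::vector<ValType> params;
  std::vector<ValType> results;
};

using Defs = std::vector<FuncDef>;
using Assumed = std::set<std::pair<size_t, size_t>>;

bool defsEqual(const Defs& defs, size_t a, size_t b, Assumed& assumed);

bool valsEqual(const Defs& defs, const ValType& x, const ValType& y,
               Assumed& assumed) {
  if (x.kind != y.kind) {
    return false;
  }
  // Basic types are equal if their kinds match; references recurse.
  return x.kind != ValType::Ref || defsEqual(defs, x.index, y.index, assumed);
}

bool defsEqual(const Defs& defs, size_t a, size_t b, Assumed& assumed) {
  if (a == b || assumed.count({a, b})) {
    return true;  // already identical, or assumed equal while recursing
  }
  assumed.insert({a, b});
  const auto& fa = defs[a];
  const auto& fb = defs[b];
  if (fa.params.size() != fb.params.size() ||
      fa.results.size() != fb.results.size()) {
    return false;
  }
  for (size_t i = 0; i < fa.params.size(); ++i) {
    if (!valsEqual(defs, fa.params[i], fb.params[i], assumed)) {
      return false;
    }
  }
  for (size_t i = 0; i < fa.results.size(); ++i) {
    if (!valsEqual(defs, fa.results[i], fb.results[i], assumed)) {
      return false;
    }
  }
  return true;
}

Encoding $t1, $t2 and $t3 from the example above as defs[0..2], defsEqual(defs, 0, 1, assumed) with an initially empty assumed set returns true, as expected.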

tlively (Member) commented Jul 30, 2020

Binaryen IR does not have a concept of type definitions or type indices, so I don't think this will be a problem. We reconstruct the type section and all type indices on demand when the module is emitted.

dcodeIO (Contributor, Author) commented Jul 30, 2020

So far I came up with these:

typedef std::vector<Type> Tuple;

struct Field {
  Type type;
  bool mutable_;
  Field(Type type, bool mutable_ = false) : type(type), mutable_(mutable_) {}
  bool operator==(const Field& other) const {
    return type == other.type && mutable_ == other.mutable_;
  }
  bool operator!=(const Field& other) const { return !(*this == other); }
};

typedef std::vector<Field> FieldList;

struct Struct {
  FieldList fields;
  bool nullable;
  Struct(FieldList fields, bool nullable = false)
    : fields(fields), nullable(nullable) {}
  bool operator==(const Struct& other) const {
    return fields == other.fields && nullable == other.nullable;
  }
  bool operator!=(const Struct& other) const { return !(*this == other); }
};

struct Array {
  Field element;
  bool nullable;
  Array(Field element, bool nullable = false)
    : element(element), nullable(nullable) {}
  bool operator==(const Array& other) const {
    return element == other.element && nullable == other.nullable;
  }
  bool operator!=(const Array& other) const { return !(*this == other); }
};

struct TypeDef { // move to wasm-type.cpp ?
  enum Kind { TupleKind, SignatureKind, StructKind, ArrayKind };

  Kind kind;
  union Def {
    Def(Tuple tuple) : tuple(tuple) {}
    Def(Signature signature) : signature(signature) {}
    Def(Struct struct_) : struct_(struct_) {}
    Def(Array array) : array(array) {}
    ~Def() {}
    Tuple tuple;
    Signature signature;
    Struct struct_;
    Array array;
  } def;

  TypeDef(Tuple tuple) : kind(TupleKind), def(tuple) {}
  TypeDef(Signature signature) : kind(SignatureKind), def(signature) {}
  TypeDef(Struct struct_) : kind(StructKind), def(struct_) {}
  TypeDef(Array array) : kind(ArrayKind), def(array) {}

  bool operator==(const TypeDef& other) const {
    if (kind != other.kind)
      return false;
    switch (kind) {
      case TupleKind:
        return def.tuple == other.def.tuple;
      case SignatureKind:
        return def.signature == other.def.signature;
      case StructKind:
        return def.struct_ == other.def.struct_;
      case ArrayKind:
        return def.array == other.def.array;
      default:
        WASM_UNREACHABLE("unexpected kind");
    }
  }
  bool operator!=(const TypeDef& other) const { return !(*this == other); }
};

reusing Signature the way it is and adding Tuple, Struct and Array descriptions. Looking at the hashing stuff etc., I'm worried, however, that my C++ fu isn't sufficient yet to make the changes to wasm-type.cpp without significant review overhead. I can try if you like, or I can leave this particular change to someone with more experience and continue from there?

tlively (Member) commented Jul 30, 2020

This looks great so far! I think it is fine to have all this in wasm-type.h so that client code can build up an arbitrarily complex TypeDef and then pass it off to a new Type constructor to make it into a Type. It would be great if you could continue on this :)

dcodeIO (Contributor, Author) commented Jul 31, 2020

Alright, going to dive into it :)

I've been playing with the hashing a bit now, and I've been wondering whether it can be simplified a bit by doing roughly what Boost does?

namespace wasm {

template<typename T> inline void rehash(std::size_t& s, const T& v) {
  std::hash<T> h;
  s ^= h(v) + 0x9e3779b9 + (s << 6) + (s >> 2);
}

}

From what I've learned so far, this approach would free us from confusion between uint32_t, uint64_t and size_t, and just sit on top of std::hash, giving us a size_t for any kind of input.

  size_t operator()(const vector<wasm::Type>& types) const {
    size_t res = std::hash<size_t>{}(types.size());
    for (auto t : types) {
      wasm::rehash<uint64_t>(res, t.getID());
    }
    return res;
  }

Or is there a particular reason for choosing the djb2 approach? Something with uint64_ts on 32-bit maybe? Or an explicit need for non-size_t hash values?

Might look like this then:

size_t hash<wasm::TypeDef>::operator()(const wasm::TypeDef& typeDef) const {
  size_t res = hash<uint32_t>{}(uint32_t(typeDef.kind));
  switch (typeDef.kind) {
    case wasm::TypeDef::Kind::TupleKind: {
      auto& tuple = typeDef.def.tuple;
      wasm::rehash_std(res, tuple.size());
      for (auto t : tuple) {
        wasm::rehash_std(res, t.getID());
      }
      return res;
    }
    case wasm::TypeDef::Kind::SignatureKind: {
      auto& signature = typeDef.def.signature;
      wasm::rehash_std(res, signature.params.getID());
      wasm::rehash_std(res, signature.results.getID());
      return res;
    }
    case wasm::TypeDef::Kind::StructKind: {
      auto& struct_ = typeDef.def.struct_;
      auto& fields = struct_.fields;
      wasm::rehash_std(res, fields.size());
      for (auto f : fields) {
        wasm::rehash_std(res, f.type.getID());
        wasm::rehash_std(res, f.mutable_);
      }
      wasm::rehash_std(res, struct_.nullable);
      return res;
    }
    case wasm::TypeDef::Kind::ArrayKind: {
      auto& array = typeDef.def.array;
      auto& element = array.element;
      wasm::rehash_std(res, element.type.getID());
      wasm::rehash_std(res, element.mutable_);
      wasm::rehash_std(res, array.nullable);
      return res;
    }
    default:
      WASM_UNREACHABLE("unexpected kined");
  }
}

dcodeIO (Contributor, Author) commented Jul 31, 2020

I also worry about essentially keeping all the types in a global cache. While the deduplication helps keep the memory footprint low for a single module, a long-running process that reads, modifies and re-emits modules will never reclaim the memory of types used in at least one module. The more distinct types there are, the more pressing the issue becomes. Currently the only way to fill up memory is to create lots of different combinations of tuple types, which is unlikely but possible; with structs, arrays and ultimately RTTs there will be many more ways to create distinct types.

What do you think of deduplicating only basic types globally, and deduplicating instances of complex types per module, so they are disposed of together with the module? Perhaps a concept of TypeRef analogous to ExpressionRef? Unfortunately, this looks like it would require a ton of refactoring all over the place :(

binji (Member) commented Jul 31, 2020

@dcodeIO I share your concern about global caches (especially for users that intend to use binaryen as a library), but even in the largest modules I doubt the cost of unique types is going to compare to the cost of expressions. Perhaps the simplest solution is to add a mechanism (if it doesn't already exist) to clear the global cache.

tlively (Member) commented Jul 31, 2020

I would be fine changing how we do hashing, especially if we can make it simpler and more generic, but perhaps @kripken knows some design constraint that would prevent this change.


In order to clean up interned type definitions, we need to ensure that there are no Types pointing to them anywhere in the system. A reference counting scheme for determining that wouldn't work because right now most Types will never have their destructors run. That's because the IR Expressions that contain them are arena-allocated on an arena belonging to the Module. When a Module is destroyed, the arena is cleared but no destructors are run for the objects inside it. So the only way I see to automatically manage the lifetime of type definitions is to arena allocate them as well and tie them to a specific Module. That would make inter-module type comparisons expensive, but perhaps that's ok. Whatever we end up doing, I think we would be able to hide it in the implementation of Type, so I don't think we would need a widespread refactoring. For now it's probably fine to use the existing interning mechanism, and we can treat upgrading it as a separate project.

tlively (Member) commented Jul 31, 2020

Yeah, a manual GC mechanism (e.g. given a set of Modules, clean up any type definitions not used by any of them) would also work, and probably be simpler than splitting up the cache.
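
A rough sketch of what such a mechanism could look like (entirely hypothetical names and layout, not a proposed Binaryen API): collect the type ids still referenced by any live module, then drop every interned description outside that set. Walking each Module to gather the ids it references would of course be the expensive part.

#include <cstdint>
#include <unordered_map>
#include <unordered_set>
#include <vector>

// Hypothetical interned description owned by the global cache.
struct TypeDesc { /* tagged union describing one type's structure */ };

// Hypothetical global intern table: id -> owned description.
static std::unordered_map<uintptr_t, TypeDesc*> internedTypes;

// Given one set of still-referenced type ids per live module, release every
// interned description that no module uses anymore.
void collectTypeGarbage(
  const std::vector<std::unordered_set<uintptr_t>>& usedPerModule) {
  std::unordered_set<uintptr_t> live;
  for (const auto& used : usedPerModule) {
    live.insert(used.begin(), used.end());
  }
  for (auto it = internedTypes.begin(); it != internedTypes.end();) {
    if (live.count(it->first)) {
      ++it;
    } else {
      delete it->second;           // reclaim the description
      it = internedTypes.erase(it);
    }
  }
}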

dcodeIO (Contributor, Author) commented Aug 3, 2020

Alright, going to leave it the way it is for now and look into cleanup later, thanks!

Meanwhile, I've also opened a draft PR with a type refactor following the comments above. Unfortunately, it raises additional questions regarding StorageType and a few internals, and I'd love to hear your thoughts on those :)

tlively (Member) commented Sep 9, 2021

Closing this issue because we have good support for the GC proposal now.

tlively closed this as completed Sep 9, 2021