`hugr-model` #1326

zrho · 2024-07-18T09:09:20Z

zrho
Jul 18, 2024
Collaborator

HUGR as a data format should be independent of its implementation in hugr-core.

I propose to create a crate hugr-model that describes the HUGR data model in plain data (see Passive Data Structure).
This data can be serialized and deserialized into JSON and s-expressions, as well as other formats that we might need in the future (MessagePack or Protobuf for gRPC?).
The hugr-core crate and potential different implementations may use hugr-model to store, communicate and print HUGR graphs, dialect specs, etc.
To ensure that the data model is independent of implementation details, hugr-model would not include any additional logic beyond serialisation and deserialisation, such as efficient data representations, caching, etc.
hugr-core is then free to convert the hugr-model types into whatever data representation is most efficient or convenient to perform optimisations, analyses or rewrites.

I am aware that this brings with it significant breaking changes in the serialisation format
and is a non-insignificant project. However, the longer we wait, the more difficult it would get.

┌───────────────┐                                                    
│ S-expression  │◄───────────────┐                                   
└───────────────┘                │                                   
                                 ▼                                   
┌───────────────┐         ┌─────────────┐              ┌────────────┐
│ JSON          │◄───────►│ hugr-model  │◄────────────►│ hugr-core  │
└───────────────┘         └─────────────┘              └────────────┘
                                 ▲                                   
┌───────────────┐                │                                   
│ MessagePack?  │◄───────────────┘                                   
└───────────────┘

Draft code for this can be found in the feat/model branch.

Advantages

Simplicity

By keeping hugr-model simple and focused on data representation, it allows hugr-core and other implementations to handle complex logic and optimizations.
The hugr-model data model is an overapproximation: for example it can represent graphs that are not connected properly, or types which are ill-kinded.
This is so that hugr-model remains simple and the validation step can be separated out from the parsing/deserialization step.
This separation also allows us to represent partially invalid structures, as would be useful e.g. for a language server.

hugr-core or any downstream implementation can use encoding tricks like #1138 to make invalid types or operation arguments unrepresentable.
The conversion and validation step from hugr-model to hugr-core then ensures that the data is in the correct form.

Versioning

An independent hugr-model crate makes it easier to manage versioning and ensure backward compatibility.
Changes in hugr-core won’t necessitate immediate changes in the data model, reducing the risk of breaking changes once we care about that.
We can also maintain multiple versions of the format at once, if necessary, allowing for tools to migrate between them.
Since the types are just plain data types, the amount of duplication this would bring would be manageable.

By having a uniform representation between builtin operations and extension operations, at least in the model, we also open the door for versioning the core operations independently of the format.

Different Implementations

By separating the data format from the hugr-core implementation, we enable different implementations to form.
This is desirable since it allows us to make different tradeoffs: it is quite hard to build a single implementation that is fit for all purposes, and the result will be quite complex.

A fast but not neccessarily convenient implementation for when speed matters. This might even restrict to a subset with fixed operation types.
A fast-enough but convenient library to drive experimentation or tooling where performance is not a priority. Here we might get away with more stringly-typed but fully extensible data structures.
HUGR as a database, with the graph stored as relations and analysis/rewriting driven by datalog.

An implementation independent specification like this would prevent tying the spec too close to any particular implementation detail, such as Rust, serde or any other incidentals of hugr-core.
The current V2 format is already quite at risk in that regard; the JSON schema is to the largest part derived from what serde generates from the structs in hugr-core.
Even the declarative parts of MLIR are inseperably linked to C++; we should avoid the same fate.

Once declarative specification of operations and dialects is included into the format, a code or docs generator can be built on top of it without having to be strongly tied to hugr-core.

Proposed High-Level Changes to the Format

The current JSON format is quite tied to the data representation in the hugr-core implementation.
While the Hugr graph structure itself is different, pretty much every other type
(including especially the representation of operations and types) is serialized via derived serde::Serialize instances.

To keep conversion reasonable, the JSON and s-expression format should be based on the same data model.
Since the s-expression format should be convenient for human manipulation,
having the same data model for both puts some constraints on the JSON format as well.
As far as I can tell, this is perfectly fine and does not make the JSON format unnatural.
However it brings along some major changes in how the format encodes the graph.

Edge representation

Currently the JSON format represents the hyperedges between ports as a vector of pairs Vec<[(Node, Option<u16>); 2]>.
A representation like this is not viable for a human readable format.
I therefore propose to change to a different representation of edges as follows:
Each port takes a variable name.
When two ports have the same variable name, they should be connected.
Encoded in the s-expression format, this leads to a reasonably intuitive notation and mirrors familiar conventions from LLVM/MLIR.
Consider for example the following code, which expresses the composite of two multiplications:

; (instruction-name [input ports] [output ports])
(@core-f32/mul [%a %b] [%ab])
(@core-f32/mul [%ab %c] [%abc])

This encoding is also reminiscent of the datalog program which implements a pattern match.

Hierarchy

The V2 JSON format assigns numerical ids to nodes.
The node hierarchy is encoded by making child nodes refer to the id of their parent.
I propose to encode the structure as a tree instead.
This matches up with how a hierarchical graph would be written in a usual programming language.
Acyclicity of the tree structure is also immediately apparent without validation.

(hugr my-hugr-module
  (@core/module [] []
    ((@core/func-defn circuit) [] []
      (@core/input [] [%0])
      (@logic/not [%0] [%1])
      (@core/output [%1] []))))

Topologically Sorted

The nodes in a region should be topologically sorted.
This avoids a reader to have to jump back and forth to follow the flow of data.
Requiring the nodes to be in topological order also simplifies deserialisation and cycle checking.

The order of nodes should not matter beyond the dependencies that are recorded in the edges between the nodes' ports.
This is a major advantage of the graph representation over more classical IR representations,
which have to pick a total order on instructions, obscuring that some instructions are independent and can be safely reordered.
We may therefore safely reorder the nodes into topological order on serialisation.
To avoid nodes jumping around non-deterministically, we could perform a stable topological reodering on serialisation.

This requires to rethink the requirement that the input and output node must be the first and second node in a region.
Instead we could arrange for the output node to always be the last node.

Caveat: What about control flow graphs?

Operation Properties

For the purposes of hugr-model there should be no difference between "builtin operations" and "extension operations".
This is in contrast to the V2 JSON schema which has specific affordances for the builtins of hugr-core.
By picking a uniform representation for all operations, a consumer of a hugr graph can decide for itself which operations merit special treatment.
For hugr-core this might be the current set of builtins; for an llvm backend, a datalog-based optimiser or a generic visualiser tool these might be different.
Further, a uniform representation allows to evolve the set of "builtin operations" without having to change the data format.

At the moment, the builtin operations are serialized by serde through their representation as a struct in hugr-core.
In principle the format for a builtin op is arbitrary.
By convention the operations already follow a regular shape, with every builtin op in hugr-core being a struct with fields of one of the following types:

String for function name
String for type alias name
PolyFuncType
Type
TypeRow
Value
FunctionType
Vec<TypeArg>
usize for tag name
ExtensionId
ExtensionSet

Parameters to custom ops are implemented by taking a list of TypeArgs, which (as far as I can tell) can already contain all of the above.
A similar approach could be taken in hugr-model also for the buildins; where any of the types passed to the builtins is not yet representable in that way, it should be made to be.
The conversion step from hugr-model to hugr-core can then "parse" the type args into the data expected by the operation, as we already impose on custom ops.
Note that this can be done without (significantly) touching the current separation of builtins and extension operations in hugr-core.
In particular I think we may keep the way OpType works mostly the same (for this purpose).

This encoding of operation parameters further allows nice integration with a type system and declarative specification of operations. Since work on these aspects is on hold for the time being, this point isn’t elaborated further here but I am happy to discuss the details if there is interest.

Rows vs Lists

The term row has a specific meaning in PL literature: a collection of types equipped with labels (as e.g. in Tierkreis).
The "rows" in HUGR are not rows in this sense, which tripped me up a little in the beginning.
I suggest renaming hugr rows to lists to better conform to conventional terminology.
This also opens up introducing actual rows if there is need without having to make up a new name for them.

The Data Model

A draft of the data model can be expressed as a Rust data structure as follows:

pub struct File {
    pub version: String,
    pub items: Vec<FileItem>,
}

pub enum FileItem {
    Hugr(Hugr),
    // Dialect specs, imports, etc.
}

pub struct Hugr {
    pub name: String,
    pub root: Box<Node>,
}

pub struct Node {
    pub operation: Operation,
    pub inputs: Vec<VarName>,
    pub outputs: Vec<VarName>,
    pub children: Vec<Node>,
}

pub struct Operation {
    pub name: OpName,
    pub args: Vec<Type>,
}

pub enum Type {
    Ctr(TypeCtr),
    Var(TypeVar),
    List(TypeList),
    Row(TypeRow),
    Scheme(TypeScheme),
    LitStr(String),
    LitNat(u64),
    Label(Label)
}

pub struct TypeCtr {
    pub name: Symbol,
    pub args: Vec<Type>
}

pub struct TypeList {
    pub items: Vec<Type>,
    pub tail: Option<Box<Type>>,
}

pub struct TypeRow {
    pub entries: Vec<TypeRowEntry>,
    pub tail: Option<Box<Type>>,
}

pub struct TypeRowEntry {
    pub name: Label,
    pub r#type: Box<Type>,
}

pub struct TypeScheme {
    pub vars: Vec<TypeVar>,
    pub constraints: Vec<TypeConstraint>,
    pub r#type: Box<Type>,
}

pub struct TypeConstraint {
    pub name: Symbol,
    pub args: Vec<Type>,
}

pub struct Symbol(pub String);
pub struct VarName(pub String);
pub struct TypeVar(pub String);
pub struct Label(pub String);

As far as I can see, current HUGR is expressible within this framework.

TypeBounds can be expressed using TypeConstraints on the type variable.
SumTypes can be expressed using a type constructor and some type lists.
Extension types can be simulated with a type constructor and a Label: (@ext :quantum/h2)
ExtensionSets become a type constructor and a row: (@ext-set {:quantum/h2 ()})

Some discussion and concrete experimentation is needed to see where there are mismatches.

The type language is rich enough to express some quite interesting types:

(@tensor @f32 [512 128 128])
(@struct {:x @u64 :y @u64})
(@enum {:ok %a :err @parse-error})
(@fn [@f32 @f32] [@f32])

One potential mismatch is edge kinds. I don't quite know how they would fit in since I don't quite understand why any but the Value edge are necessary in the first place. That is perhaps a different discussion to be had.

The structs above would also be extended with fields that can hold metadata.

Open Questions

Types are used both directly as Types but also to encode general information, like operation arguments and metadata. Looking at Haskel's data kinds and dependently typed languages, this is not entirely weird. However, the name feels a bit off and might be confusing. Is there a better name that feels natural both when used as a type directly but also for operation args and metadata?

ss2165 · 2024-07-22T10:39:06Z

ss2165
Jul 22, 2024
Maintainer

I agree strongly with the motivations and high level proposal. It is a good time to decouple codebase breaking changes from data model breaking changes.
Some comments/questions:

Requiring unique string ids for ports in a machine-friendly model seems quite onerous, can that requirement be moved only to human-readable components like s-expressions?
Top sorted nodes seems too stringent to me, it's useful to output in this format, but given it is just "plain" data I think the model should accept child nodes in any order.
Given the model is just plain data, implementing " nice integration with a type system " would seem to imply a lot of the validation logic that you would need to get it in to hugr-core form (or equivalent). Is there a good story here about avoiding duplication of validation when reading out from model in to business-logic types?
It's not obvious to me what type constructor is doing, could you expand a bit on that?
Type -> StaticParam?

3 replies

zrho Jul 22, 2024
Collaborator Author

We could do this like LLVM/MLIR do: you can use string ids for variables, but that just becomes integers internally (the %0, %1, etc that you see when you export LLVM/MLIR).
I'm also torn on the topsort, especially since control flow regions have backedges. It might be useful for a human-readable format in order to allow reusing variable names with shadowing, but that might not be that important.
For a first step, we would have validation logic in the conversion between hugr-model and hugr-core. We could then turn to move a lot of the validation logic into the type system.
It's a named type. Basically a Rust tuple struct (except that we mix the type and value level). This would include types like @int or (@hash-map @int @str) or (@fn [@f32 @f32] [@f32]) or (@some 4).
Hm, not quite convinced. Term would be in the tradition of the name used when mixing types and values into one data type. Perhaps Attr?

aborgna-q Jul 22, 2024
Maintainer

Do port names have a scope?
- If no, requiring human-written code to have unique names everywhere can be tedious/error prone.
- If yes, how does that play with external edges?
I don't see the benefit of being strict about it in the model, the children order does not change the semantics of the graph (we could even move the in/out restriction to be hugr-core-specific, and put them there on the translation step).
If the s-expression target wants to reorder them for better readability then it's fine to do so, but machine-readable targets won't care about it.

zrho Jul 22, 2024
Collaborator Author

Variables may be scoped, but that needs some thought. We could make a variable in an input port match up with the latest output port that has the same variable, also going up the hierarchy if need be. For this the node order is relevant of course. We can also defer this mechanism for now: require variables to be unique at first but then weaken that requirement once we find a good scoping mechanism. (Not breaking existing files once we introduce scoping was the motivation for the topsort).

doug-q · 2024-07-22T12:27:26Z

doug-q
Jul 22, 2024
Maintainer

| Even the declarative parts of MLIR are inseperably linked to C++; we should avoid the same fate.
I agree, but only because C++ is annoying to call from other languages. Being insepearbly linked to rust is much less objectionable.

I think you are using prefixes to indicate the "kind" of strings, i.e. Symbols are prefixed with "@", VarNames with "%", Labels with ":". What are TypeVars prefixed with?

I agree with you that the current edge representation is not human readable, but I don't think you've justified why the model should be human readable. My reading is that the primary goal is ease and flexibility of serialisation?

I don't understand what the proposed model is "optimised" for. You have made several choices that could go another way, (e.g. edges share variable names, hierarchy), and the only justification I see is human readability. I don't think human readability should be privileged here, that is what the s-expression (isomorphic?) representation is for. I would expect the model to be optimised for serialisation, and for interaction(iteration, query, mutation) via rust.
If one were optimising for serialisation one would put kind in it's own array (i.e. an Array of Types, an Array of nodes, ab Array of Labels etc) and entities would point by index to their component entities.. This is very nice when you have many duplicate entities, as we do.

We would still have to write migrations when the validation rules in hugr-core change. I suppose that we should pattern match and rewrite the hugr-model?

4 replies

ss2165 Jul 22, 2024
Maintainer

+1 to rethinking the model to be serialisation optimised rather than human readable

zrho Jul 22, 2024
Collaborator Author

Fair enough, we can have two. The proposal is currently optimised for having a single source of truth for the format shared between the human readable and serialised form, without making tradeoffs for the serialisation that are worse than what we currently do in the json format. If we intend to have a serialisation format that is more efficient (e.g. tables), we can think about that as well, of course.

aborgna-q Jul 22, 2024
Maintainer

A machine-optimised format would be quite useful for things that exchange many hugrs around.

If I can add to the format wishlist:
tket2's default rewrite set takes >1s to load on CI, mostly due to decoding multiple hugrs.
A table-based definition could improve that if we let it encode multiple hugrs together.

acl-cqc Jul 22, 2024
Collaborator

"We can have two" (models). Interesting numbers: 0,1, infinity....
But human-readable doesn't need a new model, does it? (I'm ok with a second "model" as a private implementation detail of some particular serialization format.) This is just a core data-structure (but a passive one, unlike what we have now). Heck, you could define a serialization format by annotating the model with serde... if we are producing human-readable s-expressions, that's still a function from the one true core model to, erm, s-expression, right....

doug-q · 2024-07-22T12:29:06Z

doug-q
Jul 22, 2024
Maintainer

| One potential mismatch is edge kinds. I don't quite know how they would fit in since I don't quite understand why any but the Value edge are necessary in the first place. That is perhaps a different discussion to be had.

EdgeKinds are not serialised. The semantics of each op (and its properties) determine the EdgeKind of each of its ports.

0 replies

aborgna-q · 2024-07-22T13:20:02Z

aborgna-q
Jul 22, 2024
Maintainer

I'm curious about the perf cost of adding the extra step in machine-readable encodings (esp. since this requires extra steps for reorganising the data, resolving port name identifiers, etc.).

The current model is not really optimised beyond "trust in serde", but as doug says this model isn't either.

Probably a first step to check this should be writing some encoding/decoding benchmarks...

1 reply

zrho Jul 22, 2024
Collaborator Author

For that we would likely want a more column oriented representation. To put that to disk, we could try messagepack to stay with serde. Alternatively protobuf, or capnproto/flatbuffers for more performance still. Or roll our own, like mlir did, but not sure if I’d be a fan of that.

cqc-alec · 2024-07-22T14:33:32Z

cqc-alec
Jul 22, 2024
Maintainer

Having a data model that is independent of the implementation and doesn't break when e.g. a rust field is renamed is definitely a good thing.
Also not convinced we need topological ordering.

0 replies

acl-cqc · 2024-07-22T14:42:35Z

acl-cqc
Jul 22, 2024
Collaborator

Yes, the variable naming thing is very nifty for human-readability, but numbered nodes (and edges being pairs of endpoints) is simple, less prescriptive about how tools might deal with it, and we can have tooling. I have better feelings about the "proper" representation of hierarchy, nobody really wants to deal with explicit hierarchy edges (=> our Rust code e.g. HugrView/Mut wraps hierarchy-edges into an easier-to-use actual "tree" already).

A human-readable version of numbered nodes is stringly-named ones. Ugh, I know.....

And good to maintain a single edge representation for data and control flow, I think, although the variable-name system doesn't forbid cycles (you just have to use a variable before you've defined it, mehh).

0 replies

doug-q · 2024-07-23T08:29:13Z

doug-q
Jul 23, 2024
Maintainer

Perhaps I have missed this, but I do not think you are explicit on how scoping of VarNames works. can edges in sibling DFGs use the same VarNames?

I do think CFGs are important, could you show an example of a cyclic one?

I think this proposal would benefit from explicit requirements on the model, so we can argue about those separately from how the proposed model satisfies them.

Consider:

Allows encoding all of current hugr-core
Is cheap to validate (is no validation possible? currently topological sorting requires validation)
Allows local reasoning: to reason about a subtree of the hierarchy does not require looking outside the subtree.
Is cheap to serialise/deserialise.
Allows de-duplication

5 replies

zrho Jul 23, 2024
Collaborator Author

The original proposal doesn’t have any scoping yet, deferring it to later. We might want to introduce scoping in the beginning as well.

I’d add to the requirements that a lot of validation should be possible through type checking, such that declarative extensions can specify most of their requirements declaratively in the type system.

No validation isn’t really possible: an addition operation is not allowed to have two outputs, an if must have two regions, etc. Unless you mean something different?

doug-q Jul 23, 2024
Maintainer

My understanding is that the model does not have any semantics? i.e. purely syntax.

At that level, type checking does not make sense. Any op (including "addition") doesn't have semantics, it may end up being a divmod with a weird name.

However requiring nodes to be topologically sorted does require validation. Requiring VarNames of inputs to exist as the VarName of some output does require validation.

zrho Jul 23, 2024
Collaborator Author

It's purely syntax yes. Via the structure of the format, we can exclude some things that would otherwise require validation, but likely not all. For example, by nesting nodes instead of having a parent pointer, the hierarchy is automatically a tree, and so no check for acyclicity would be necessary. For port connectivity, we can a priori in the syntax allow cyclic edges, ports that aren't connected to any other port, two output ports that are connected, etc. Having these is also nice for error cases or partially constructed data. Using variable names over binary edges excludes weird situations like (a -> c, a -> d, b -> c) that don't form a hyperedge.

Type checking comes in only later, on top of the syntax. But type checking concerns aren't irrelevant to designing the "syntax": we can make choices for the encoding that do not or do not lend themselves well to expressing an operation's validation criteria via types.

aborgna-q Jul 23, 2024
Maintainer

For port connectivity, we can a priori in the syntax allow cyclic edges, ports that aren't connected to any other port, two output ports that are connected, etc.

That's strongly linked with the operation semantics; Ports may be non-linear value ports (so unlinked / multiports are ok) or even non-value kinds, failing on edge cycles is only valid if you know it should be a dataflow region, etc.

doug-q Jul 23, 2024
Maintainer

This is a good argument for nesting nodes.

| but type checking concerns aren't irrelevant to designing the "syntax"
indeed, I think this is near the idea of "locality" I was talking about above.

zrho · 2024-07-23T09:33:38Z

zrho
Jul 23, 2024
Collaborator Author

Regarding deduplication: we will likely have a lot of types, metadata, etc. that will be duplicated a lot throughout a file. Storing and deserialising this duplicate data is wasted. By having a table format and referring to terms by indices, we can share common subterms by simply reusing their indices.

MLIR deduplicates attributes via hashconsing. In the interest of hugr-model being a passive data structure, we shouldn't do that in hugr-model itself: there may be multiple equal terms with different indices. However, we might produce a deduplicated representation from hugr-core by using hashconsing there.

0 replies

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

`hugr-model` #1326

{{title}}

{{editor}}'s edit

{{editor}}'s edit

Replies: 8 comments 13 replies

{{title}}

{{title}}

{{title}}

{{title}}

{{title}}

{{title}}

{{title}}

{{title}}

{{title}}

{{title}}

{{title}}

{{title}}

{{title}}

{{title}}

{{title}}

{{title}}

{{editor}}'s edit

{{editor}}'s edit

{{title}}

{{title}}

{{title}}

{{editor}}'s edit

{{editor}}'s edit

{{title}}

{{title}}

Select a reply

hugr-model #1326

zrho Jul 18, 2024 Collaborator

Advantages

Simplicity

Versioning

Different Implementations

Proposed High-Level Changes to the Format

Edge representation

Hierarchy

Topologically Sorted

Operation Properties

Rows vs Lists

The Data Model

Open Questions

Replies: 8 comments · 13 replies

ss2165 Jul 22, 2024 Maintainer

zrho Jul 22, 2024 Collaborator Author

aborgna-q Jul 22, 2024 Maintainer

zrho Jul 22, 2024 Collaborator Author

doug-q Jul 22, 2024 Maintainer

ss2165 Jul 22, 2024 Maintainer

zrho Jul 22, 2024 Collaborator Author

aborgna-q Jul 22, 2024 Maintainer

acl-cqc Jul 22, 2024 Collaborator

doug-q Jul 22, 2024 Maintainer

aborgna-q Jul 22, 2024 Maintainer

zrho Jul 22, 2024 Collaborator Author

cqc-alec Jul 22, 2024 Maintainer

acl-cqc Jul 22, 2024 Collaborator

doug-q Jul 23, 2024 Maintainer

zrho Jul 23, 2024 Collaborator Author

doug-q Jul 23, 2024 Maintainer

zrho Jul 23, 2024 Collaborator Author

aborgna-q Jul 23, 2024 Maintainer

doug-q Jul 23, 2024 Maintainer

zrho Jul 23, 2024 Collaborator Author

`hugr-model` #1326

zrho
Jul 18, 2024
Collaborator

Replies: 8 comments 13 replies

ss2165
Jul 22, 2024
Maintainer

zrho Jul 22, 2024
Collaborator Author

aborgna-q Jul 22, 2024
Maintainer

zrho Jul 22, 2024
Collaborator Author

doug-q
Jul 22, 2024
Maintainer

ss2165 Jul 22, 2024
Maintainer

zrho Jul 22, 2024
Collaborator Author

aborgna-q Jul 22, 2024
Maintainer

acl-cqc Jul 22, 2024
Collaborator

doug-q
Jul 22, 2024
Maintainer

aborgna-q
Jul 22, 2024
Maintainer

zrho Jul 22, 2024
Collaborator Author

cqc-alec
Jul 22, 2024
Maintainer

acl-cqc
Jul 22, 2024
Collaborator

doug-q
Jul 23, 2024
Maintainer

zrho Jul 23, 2024
Collaborator Author

doug-q Jul 23, 2024
Maintainer

zrho Jul 23, 2024
Collaborator Author

aborgna-q Jul 23, 2024
Maintainer

doug-q Jul 23, 2024
Maintainer

zrho
Jul 23, 2024
Collaborator Author