Replies: 8 comments 13 replies
-
I agree strongly with the motivations and high level proposal. It is a good time to decouple codebase breaking changes from data model breaking changes.
-
Even the declarative parts of MLIR are inseparably linked to C++; we should avoid the same fate. I think you are using prefixes to indicate the "kind" of strings, i.e. Symbols are prefixed with "@", VarNames with "%", Labels with ":". What are TypeVars prefixed with? I agree with you that the current edge representation is not human-readable, but I don't think you've justified why the model should be human-readable. My reading is that the primary goal is ease and flexibility of serialisation? I don't understand what the proposed model is "optimised" for. You have made several choices that could go another way (e.g. edges share variable names, hierarchy), and the only justification I see is human readability. I don't think human readability should be privileged here; that is what the (isomorphic?) s-expression representation is for. I would expect the model to be optimised for serialisation, and for interaction (iteration, query, mutation) via Rust. We would still have to write migrations when the validation rules in hugr-core change. I suppose that we should pattern match and rewrite the hugr-model?
-
One potential mismatch is edge kinds. I don't quite know how they would fit in, since I don't quite understand why any but the Value edge are necessary in the first place. That is perhaps a different discussion to be had. EdgeKinds are not serialised; the semantics of each op (and its properties) determine the EdgeKind of each of its ports.
-
I'm curious about the perf cost of adding the extra step in machine-readable encodings (esp. since this requires extra steps for reorganising the data, resolving port name identifiers, etc.). The current model is not really optimised beyond "trust in serde", but as doug says this model isn't either. A first step to check this should probably be writing some encoding/decoding benchmarks...
-
Having a data model that is independent of the implementation and doesn't break when e.g. a Rust field is renamed is definitely a good thing.
-
Yes, the variable-naming scheme is very nifty for human readability, but numbered nodes (with edges as pairs of endpoints) are simpler and less prescriptive about how tools might deal with them, and we can have tooling. I have better feelings about the "proper" representation of hierarchy: nobody really wants to deal with explicit hierarchy edges (our Rust code, e.g. HugrView/Mut, already wraps hierarchy edges into an easier-to-use actual "tree"). A human-readable version of numbered nodes is stringly-named ones. Ugh, I know..... And it is good to maintain a single edge representation for data and control flow, I think, although the variable-name system doesn't forbid cycles (you just have to use a variable before you've defined it, mehh).
-
Perhaps I have missed this, but I do not think you are explicit on how scoping of VarNames works. Can edges in sibling DFGs use the same VarNames? I do think CFGs are important; could you show an example of a cyclic one? I think this proposal would benefit from explicit requirements on the model, so we can argue about those separately from how the proposed model satisfies them. Consider:
-
Regarding deduplication: we will likely have a lot of types, metadata, etc. that are duplicated many times throughout a file. Storing and deserialising this duplicate data is wasted work. By having a table format and referring to terms by indices, we can share common subterms by simply reusing their indices. MLIR deduplicates attributes via hashconsing. In the interest of …
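To make the table idea concrete, here is a minimal sketch of interning terms into a table so that structurally equal subterms share one index. All names and the `Term` shape are made up for illustration, not the actual hugr-model types:

```rust
use std::collections::HashMap;

// Illustrative term type: children are referenced by table index,
// so sharing a subterm is just reusing its index.
#[derive(Debug, Clone, PartialEq, Eq, Hash)]
enum Term {
    Symbol(String),
    Apply(String, Vec<usize>),
}

#[derive(Default)]
struct TermTable {
    terms: Vec<Term>,
    index: HashMap<Term, usize>,
}

impl TermTable {
    /// Insert a term, reusing the existing index if an equal term is present.
    fn intern(&mut self, t: Term) -> usize {
        if let Some(&i) = self.index.get(&t) {
            return i;
        }
        let i = self.terms.len();
        self.index.insert(t.clone(), i);
        self.terms.push(t);
        i
    }
}
```

This is essentially hashconsing restricted to a single file's term table.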
-
HUGR as a data format should be independent of its implementation in `hugr-core`. I propose to create a crate `hugr-model` that describes the HUGR data model in plain data (see Passive Data Structure). This data can be serialized and deserialized into JSON and s-expressions, as well as other formats that we might need in the future (MessagePack or Protobuf for gRPC?). The `hugr-core` crate and potential different implementations may use `hugr-model` to store, communicate and print HUGR graphs, dialect specs, etc.

To ensure that the data model is independent of implementation details, `hugr-model` would not include any additional logic beyond serialisation and deserialisation, such as efficient data representations, caching, etc. `hugr-core` is then free to convert the `hugr-model` types into whatever data representation is most efficient or convenient to perform optimisations, analyses or rewrites.

I am aware that this brings with it significant breaking changes in the serialisation format and is a non-trivial project. However, the longer we wait, the more difficult it will get.
Advantages
Simplicity
By keeping hugr-model simple and focused on data representation, we leave hugr-core and other implementations free to handle the complex logic and optimisations.
The `hugr-model` data model is an overapproximation: for example, it can represent graphs that are not connected properly, or types which are ill-kinded. This is so that `hugr-model` remains simple and the validation step can be separated out from the parsing/deserialization step. This separation also allows us to represent partially invalid structures, as would be useful e.g. for a language server. `hugr-core` or any downstream implementation can use encoding tricks like #1138 to make invalid types or operation arguments unrepresentable. The conversion and validation step from `hugr-model` to `hugr-core` then ensures that the data is in the correct form.
Versioning
An independent `hugr-model` crate makes it easier to manage versioning and ensure backward compatibility. Changes in `hugr-core` won't necessitate immediate changes in the data model, reducing the risk of breaking changes once we care about that. We can also maintain multiple versions of the format at once, if necessary, allowing tools to migrate between them. Since the types are just plain data types, the amount of duplication this would bring would be manageable. By having a uniform representation between builtin operations and extension operations, at least in the model, we also open the door to versioning the core operations independently of the format.
Different Implementations
By separating the data format from the `hugr-core` implementation, we enable different implementations to form. This is desirable since it allows us to make different tradeoffs: it is quite hard to build a single implementation that is fit for all purposes, and the result would be quite complex. An implementation-independent specification like this would prevent tying the spec too closely to any particular implementation detail, such as Rust, serde or any other incidentals of `hugr-core`. The current V2 format is already quite at risk in that regard; the JSON schema is for the largest part derived from what serde generates from the structs in `hugr-core`. Even the declarative parts of MLIR are inseparably linked to C++; we should avoid the same fate. Once declarative specification of operations and dialects is included in the format, a code or docs generator can be built on top of it without having to be strongly tied to `hugr-core`.
Proposed High-Level Changes to the Format
The current JSON format is quite tied to the data representation in the `hugr-core` implementation. While the `Hugr` graph structure itself is different, pretty much every other type (including especially the representation of operations and types) is serialized via derived `serde::Serialize` instances.

To keep conversion reasonable, the JSON and s-expression formats should be based on the same data model. Since the s-expression format should be convenient for human manipulation, having the same data model for both puts some constraints on the JSON format as well. As far as I can tell, this is perfectly fine and does not make the JSON format unnatural. However, it brings along some major changes in how the format encodes the graph.
Edge representation
Currently the JSON format represents the hyperedges between ports as a vector of pairs `Vec<[(Node, Option<u16>); 2]>`. A representation like this is not viable for a human-readable format. I therefore propose a different representation of edges: each port takes a variable name, and when two ports have the same variable name, they are connected. Encoded in the s-expression format, this leads to a reasonably intuitive notation and mirrors familiar conventions from LLVM/MLIR.
Consider for example the following code, which expresses the composite of two multiplications:
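The example itself is not preserved in this copy of the thread; the following is a hedged sketch of what such an s-expression might look like under the proposed conventions (`%`-prefixed variable names for ports, `@`-prefixed symbols for operations — the `dfg` and `@arith/mul` names are made up for illustration):

```
(dfg [%a %b %c] [%r]
  (@arith/mul [%a %b] [%t])
  (@arith/mul [%t %c] [%r]))
```

The output port of the first multiplication and the first input port of the second share the variable `%t`, and that shared name is what connects them.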
This encoding is also reminiscent of the datalog program which implements a pattern match.
Hierarchy
The V2 JSON format assigns numerical ids to nodes.
The node hierarchy is encoded by making child nodes refer to the id of their parent.
I propose to encode the structure as a tree instead.
This matches up with how a hierarchical graph would be written in a usual programming language.
Acyclicity of the tree structure is also immediately apparent without validation.
Topologically Sorted
The nodes in a region should be topologically sorted.
This saves a reader from having to jump back and forth to follow the flow of data.
Requiring the nodes to be in topological order also simplifies deserialisation and cycle checking.
The order of nodes should not matter beyond the dependencies that are recorded in the edges between the nodes' ports.
This is a major advantage of the graph representation over more classical IR representations,
which have to pick a total order on instructions, obscuring that some instructions are independent and can be safely reordered.
We may therefore safely reorder the nodes into topological order on serialisation.
To avoid nodes jumping around non-deterministically, we could perform a stable topological reordering on serialisation.
This requires rethinking the requirement that the input and output nodes must be the first and second nodes in a region.
Instead, we could arrange for the output node to always be the last node.
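A stable, deterministic topological reordering can be done with Kahn's algorithm, breaking ties among ready nodes by original index. This is a minimal sketch of the idea, not the actual hugr implementation:

```rust
use std::cmp::Reverse;
use std::collections::BinaryHeap;

/// Deterministic topological order over `n` nodes: among all ready nodes,
/// always emit the one with the smallest original index.
/// `edges` lists (source, target) pairs. Returns None if there is a cycle.
fn stable_topo_sort(n: usize, edges: &[(usize, usize)]) -> Option<Vec<usize>> {
    let mut indegree = vec![0usize; n];
    let mut succ = vec![Vec::new(); n];
    for &(a, b) in edges {
        succ[a].push(b);
        indegree[b] += 1;
    }
    // Min-heap of ready nodes (Reverse turns the max-heap into a min-heap).
    let mut ready: BinaryHeap<Reverse<usize>> =
        (0..n).filter(|&i| indegree[i] == 0).map(Reverse).collect();
    let mut order = Vec::with_capacity(n);
    while let Some(Reverse(i)) = ready.pop() {
        order.push(i);
        for &j in &succ[i] {
            indegree[j] -= 1;
            if indegree[j] == 0 {
                ready.push(Reverse(j));
            }
        }
    }
    // If not every node was emitted, the remaining nodes form a cycle.
    (order.len() == n).then_some(order)
}
```

As a bonus, the `None` case gives exactly the cycle check mentioned above: a serialised region in this order can be deserialised in a single forward pass.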
Operation Properties
For the purposes of `hugr-model` there should be no difference between "builtin operations" and "extension operations". This is in contrast to the V2 JSON schema, which has specific affordances for the builtins of `hugr-core`. By picking a uniform representation for all operations, a consumer of a hugr graph can decide for itself which operations merit special treatment. For `hugr-core` this might be the current set of builtins; for an LLVM backend, a datalog-based optimiser or a generic visualiser tool these might be different. Further, a uniform representation allows evolving the set of "builtin operations" without having to change the data format.
At the moment, the builtin operations are serialized by serde through their representation as a struct in `hugr-core`. In principle the format for a builtin op is arbitrary. By convention the operations already follow a regular shape, with every builtin op in `hugr-core` being a struct with fields of one of the following types:

- `String` for function name
- `String` for type alias name
- `PolyFuncType`
- `Type`
- `TypeRow`
- `Value`
- `FunctionType`
- `Vec<TypeArg>`
- `usize` for tag name
- `ExtensionId`
- `ExtensionSet`
Parameters to custom ops are implemented by taking a list of `TypeArg`s, which (as far as I can tell) can already contain all of the above. A similar approach could be taken in `hugr-model` also for the builtins; where any of the types passed to the builtins is not yet representable in that way, it should be made to be. The conversion step from `hugr-model` to `hugr-core` can then "parse" the type args into the data expected by the operation, as we already impose on custom ops.

Note that this can be done without (significantly) touching the current separation of builtins and extension operations in `hugr-core`. In particular, I think we may keep the way `OpType` works mostly the same (for this purpose). This encoding of operation parameters further allows nice integration with a type system and declarative specification of operations. Since work on these aspects is on hold for the time being, this point isn't elaborated further here, but I am happy to discuss the details if there is interest.
Rows vs Lists
The term row has a specific meaning in PL literature: a collection of types equipped with labels (as e.g. in Tierkreis). The "rows" in HUGR are not rows in this sense, which tripped me up a little in the beginning. I suggest renaming hugr rows to lists to better conform to conventional terminology. This also opens up introducing actual rows, should the need arise, without having to make up a new name for them.
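The distinction in type form (placeholder type names, for illustration only):

```rust
#[derive(Debug, Clone, PartialEq)]
struct Type(String); // placeholder for a real type term
#[derive(Debug, Clone, PartialEq)]
struct Label(String);

/// What HUGR currently calls a "row": just an ordered list of types.
type TypeList = Vec<Type>;

/// A row in the PL-literature sense: types equipped with labels
/// (cf. Tierkreis).
type Row = Vec<(Label, Type)>;
```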
The Data Model
A draft of the data model can be expressed as a Rust data structure as follows:
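The original draft did not survive in this copy of the thread; the following is a hedged reconstruction of what such plain-data types might look like, assembled from the conventions described elsewhere in the proposal (variable-named ports, tree-shaped hierarchy, a shared term language for types and static arguments). Every name here is illustrative, not the actual draft:

```rust
#[derive(Debug, Clone, PartialEq)]
pub struct Symbol(pub String);  // "@"-prefixed, e.g. "@arith/mul"
#[derive(Debug, Clone, PartialEq)]
pub struct VarName(pub String); // "%"-prefixed, e.g. "%x"
#[derive(Debug, Clone, PartialEq)]
pub struct Label(pub String);   // ":"-prefixed, e.g. ":quantum/h2"

/// A node; the hierarchy is the tree of `children`, and ports that
/// share a `VarName` are connected.
#[derive(Debug, Clone, PartialEq)]
pub struct Node {
    pub operation: Symbol,
    pub params: Vec<Term>,      // static operation arguments
    pub inputs: Vec<VarName>,
    pub outputs: Vec<VarName>,
    pub children: Vec<Node>,    // topologically sorted child region
}

/// One term language for types, operation arguments and metadata.
#[derive(Debug, Clone, PartialEq)]
pub enum Term {
    Var(VarName),                 // "%a"
    Apply(Symbol, Vec<Term>),     // "(@tensor @f32 [512 128 128])"
    List(Vec<Term>),              // "[@f32 @f32]"
    Labelled(Vec<(Label, Term)>), // "{:x @u64 :y @u64}"
    Nat(u64),
    Str(String),
}
```

A real draft would also carry metadata fields and likely use table indices rather than boxed recursion, as discussed under deduplication.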
As far as I can see, current HUGR is expressible within this framework:

- `TypeBound`s can be expressed using `TypeConstraint`s on the type variable.
- `SumType`s can be expressed using a type constructor and some type lists.
- `Extension` types can be simulated with a type constructor and a `Label`: `(@ext :quantum/h2)`
- `ExtensionSet`s become a type constructor and a row: `(@ext-set {:quantum/h2 ()})`
Some discussion and concrete experimentation is needed to see where there are mismatches.
The type language is rich enough to express some quite interesting types:

- `(@tensor @f32 [512 128 128])`
- `(@struct {:x @u64 :y @u64})`
- `(@enum {:ok %a :err @parse-error})`
- `(@fn [@f32 @f32] [@f32])`
One potential mismatch is edge kinds. I don't quite know how they would fit in, since I don't quite understand why any but the `Value` edge are necessary in the first place. That is perhaps a different discussion to be had.

The structs above would also be extended with fields that can hold metadata.
Open Questions
`Type`s are used both directly as `Type`s but also to encode general information, like operation arguments and metadata. Looking at Haskell's data kinds and dependently typed languages, this is not entirely weird. However, the name feels a bit off and might be confusing. Is there a better name that feels natural both when used as a type directly but also for operation args and metadata?