ONNX import/export #10
Here is my takeaway from the meeting: splitting up ONNXmutable has lower priority than making sure it can load state-of-the-art models. Is this correct? The quickest way to proceed is to just test loading models and file issues (and preferably PRs 😊) when they fail.
If so, I can probably take the functionality from BaseOnnx and add it to ONNXmutable instead, along with (protobuf 0.9 compliant) protos, in order to get rid of the warning-generating dependency on ONNX.jl and release a 0.1 version very soon. If there is a need to split it up for reuse purposes, this can always be done later of course.
I will add examples of loading a model and replacing the head, as well as a pruning example, here. Please let me know if there is anything else you'd like to see.
Load model and replace classification head
Here is how to load a model and a few of the ways NaiveNASflux lets you explore the model:
(ImportOnnxExample) pkg> add NaiveNASflux, MLDatasets, https://github.com/DrChainsaw/ONNXmutable.jl
julia> using ONNXmutable, NaiveNASflux, MLDatasets
# Lots of warnings from ONNX.jl :(
julia> resnetfile = download("https://github.com/onnx/models/raw/master/vision/classification/resnet/model/resnet18-v1-7.onnx");
julia> model = CompGraph(resnetfile);
julia> name.(vertices(model)) # Query names like this
julia> nout.(vertices(model)) # List output sizes like this
Ok, let's scrape off the classification head and replace it with a new dense layer.
julia> vs = vertices(model);
julia> remove!(vs[end], RemoveStrategy(NoSizeChange())); # Normally we want NaiveNASflux to align the sizes between the input to the removed layer and the output, but in this case we explicitly want to change the output size without touching anything else
julia> insert!(vs[end-1], v -> mutable("newhead", Dense(nout(v), 10), v));
julia> newmodel = CompGraph(vs[begin], outputs(vs[end-1]));
julia> vertices(newmodel) |> last |> nout
10
julia> vertices(newmodel) |> last |> name
"newhead"
Unfortunately, removing the very last layer is slightly more inconvenient than removing any other because 1) the CompGraph has a reference to the output layer and 2) NaiveNASflux will by default think that it needs to change the size of the previous layer to have the same output size (which is a pretty reasonable default assumption, to be fair). In normal cases one can just do
Anyways, now one can just (re-)train the model as normal.
julia> Flux.adapt(T, x::Flux.Zeros) = x # Workaround for Flux issue #1332
julia> Flux.adapt(T, x::Base.ReinterpretArray) = T(x) # ONNX.jl turns TensorProtos into `ReinterpretArray`s and Zygote does not like that. BaseOnnx makes them normal Arrays, so this is a temporary nuisance
julia> newmodel = newmodel |> cpu; # Same as above (this should ofc be gpu if training on gpu)
julia> loss(x,y) = NaiveNASflux.Flux.Losses.logitcrossentropy(newmodel(x), y); # forgot to add Flux, but it is available through NaiveNASflux
julia> x,y = MLDatasets.CIFAR10.traindata();
julia> xpadded = zeros(Float32, 224,224, 3, 8); # Need to pad excessively to match size for this constructed use case
julia> xpadded[97:128,97:128,:,:] .= x[:,:,:,1:8];
julia> yhot = NaiveNASflux.Flux.onehotbatch(y[1:8], 0:9);
julia> pshead = NaiveNASflux.Flux.params(vertices(newmodel) |> last |> layer); # Only train the parameters of the last layer
julia> length(pshead)
2
julia> NaiveNASflux.Flux.train!(loss, pshead, [(xpadded, yhot)], ADAM()) # I don’t have a GPU on this computer, so I’ll only do one batch
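The padding step in the session above hard-codes the index range 97:128. As a minimal sketch of where that range comes from, here is a small helper written only against Base Julia. The name `centerpad` is made up for this illustration and is not part of NaiveNASflux or Flux; the random array merely stands in for a CIFAR10 mini-batch.

```julia
# Hypothetical helper (not from any package): zero-pad a WHCN batch so that
# each image sits in the center of a larger canvas.
function centerpad(x::AbstractArray{T,4}, width::Integer, height::Integer) where T
    w, h, c, n = size(x)
    @assert w <= width && h <= height "target canvas must not be smaller than the input"
    padded = zeros(T, width, height, c, n)
    # Number of zero rows/columns placed before the image in each spatial dimension
    woff = (width - w) ÷ 2
    hoff = (height - h) ÷ 2
    padded[woff+1:woff+w, hoff+1:hoff+h, :, :] .= x
    return padded
end

x = rand(Float32, 32, 32, 3, 8)    # stand-in for eight 32x32 RGB images
xpadded = centerpad(x, 224, 224)   # (224 - 32) ÷ 2 = 96, so the images land in 97:128
```

With a 224x224 canvas and 32x32 inputs this reproduces the `xpadded[97:128,97:128,:,:]` assignment used above.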
Here is the pruning use case
julia> model = CompGraph(resnetfile);
julia> numneurons(m) = mapreduce(nout, +, vertices(m)); # Not needed, just to show that something happened
julia> numneurons(model)
15531
julia> function pruning_metric(v, offs)
           val = neuron_value(v) # neuron_value defaults to magnitude of parameters along activation dimension
           ismissing(val) && return fill(offs, nout_org(v)) # Layers with no parameters return missing by default
           return val .- min(offs, 0.8*maximum(val)) # min is crude safeguard to prevent that models get size 0
       end
julia> allvals = mapreduce(neuron_value, vcat, vertices(model)) |> skipmissing |> collect;
julia> cutoff = partialsort(allvals, round(Int, 0.3*length(allvals))) # Cutoff is bigger than (approx) 30% of all values
julia> for v in vertices(model) # It is currently a limitation that one must first reduce the size of each individual vertex. If you want to use NaiveNASflux for pruning I think I can fix this limitation
           metric = pruning_metric(v, cutoff)
           nprune = sum(<(0), metric) - length(metric) + nout(v)
           nprune <= 0 && continue
           Δnout(v, -nprune)
       end
julia> Δoutputs(model, v -> pruning_metric(v, cutoff)); # Given that we have new sizes for all layers, which neurons do we decide to keep
julia> apply_mutation(model);
julia> model(ones(Float32, 224,224,3,2)) |> size # Model is still internally consistent
(1000, 2)
julia> numneurons(model) # About 35% fewer neurons
10042
Now, this turned out to prune a lot more than 30% of all parameters, so chances are the model accuracy suffered a lot. It also seems like the layers with more parameters tend to have lower magnitude of their parameters, so the strategy of having a global cutoff is probably not the best. It should however be straightforward to change the above example to do a comparison per layer instead (i.e. prune 30% of each layer instead of 30% in total), so I'll leave that exercise to the reader :). Note that to train this model one needs to do that annoying (and soon to be fixed) dance with overloading a few Flux methods and mapping to
Obviously all the above examples could (and perhaps should) be wrapped in much nicer APIs. NaiveNASflux tries to be more of a library than a user-facing package and I have put flexibility before ease of use. I'm thinking this can be used as a start to figure out what a nice looking API for the above use cases could look like.
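To illustrate the difference between a global cutoff and a per-layer cutoff mentioned above, here is a rough sketch using only Base Julia and made-up numbers. The layer names, the values, and the helpers `cutoff` and `npruned` are all assumptions for illustration; nothing here calls NaiveNASflux, and unlike the session above it counts values at or below the cutoff as pruned.

```julia
# Synthetic stand-ins for per-layer neuron values (e.g. parameter magnitudes).
# The wide layer has lower magnitudes, mimicking the observation above.
layervalues = Dict(
    "small_layer" => [0.9, 1.1, 1.3, 1.5],
    "wide_layer"  => [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0],
)

# The k-th smallest value, where k is (approximately) the given fraction of all values
function cutoff(vals, fraction)
    k = clamp(round(Int, fraction * length(vals)), 1, length(vals))
    return partialsort(vals, k)
end

# Number of neurons that would be pruned if values at or below c are dropped
npruned(vals, c) = count(<=(c), vals)

# Global strategy: one cutoff computed from the values of all layers
globalcut = cutoff(reduce(vcat, collect(values(layervalues))), 0.3)
globalpruned = Dict(name => npruned(vals, globalcut) for (name, vals) in layervalues)

# Per-layer strategy: each layer gets its own cutoff
layerpruned = Dict(name => npruned(vals, cutoff(vals, 0.3)) for (name, vals) in layervalues)
```

In this toy setup the global cutoff leaves the small layer untouched and takes everything from the wide layer, while the per-layer cutoff spreads the pruning across both.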
Maybe a naive question, but would it be possible to convert the model imported by ONNXmutable back into a regular Flux chain? For example, following:
julia> model = CompGraph(resnetfile);
If the model could be recomposed into a stack of regular Flux building blocks (Conv, Dense, SkipConnection), then I think it would make manipulation of the model, such as for image transfer learning, very easy and intuitive, just like what is done in the model-zoo tutorial: https://github.com/FluxML/model-zoo/blob/master/tutorials/transfer_learning/transfer_learning.jl.
It certainly could, at least for the type of models which can be expressed as a chain. It is a bit of a mental exercise since the graph format in ONNX is very different from how (non-linear) graphs are expressed with chains. The current deserialization is just a very simple recursion through the graph and I don't think one can do the same to create a chain. Splitting ONNXmutable into multiple packages would however allow for creating a simpler importer which makes use of the same primitives, but it seems like the interest in this is somewhat low. Another somewhat annoying but certainly surmountable issue is that one probably needs to resort to wrapping layers in closures, and that requires some side mechanism to provide the parameters. I also think that the example looks easy because the graph is very linear. The same mechanism would not be so nice in the non-linear case. I guess that the "scrape off the classification head" use case would be fine for almost all models though.
Btw, if the graph is linear then one can just do this:
julia> chain = Chain(layer.(vertices(compgraph)[2:end])...)
Thanks a lot for the clarifications! The graph = CompGraph("data/resnet34-v1-7.onnx")
julia> m = Chain(layer.(vertices(graph)[2:end])...)
ERROR: MethodError: no method matching layer(::NaiveNASlib.var"#225#226"{typeof(+)})
Closest candidates are:
layer(::ONNXmutable.Flatten) at C:\Users\jerem\.julia\packages\ONNXmutable\kxK8z\src\deserialize\constraints.jl:179
layer(::CompVertex) at C:\Users\jerem\.julia\packages\NaiveNASflux\0lGnm\src\vertex.jl:114
layer(::NaiveNASflux.InputShapeVertex) at C:\Users\jerem\.julia\packages\NaiveNASflux\0lGnm\src\vertex.jl:18
Also, with the ResNets v2, it fails at the import step:
julia> graph = CompGraph("data/resnet34-v2-7.onnx")
ERROR: MethodError: no method matching (::ONNXmutable.var"#144#145")(::Dict{Symbol,Any})
Closest candidates are:
#144(::Any, ::Any) at C:\Users\jerem\.julia\packages\ONNXmutable\kxK8z\src\deserialize\ops.jl:214
Stacktrace:
[1] wrapfrom(::ONNXmutable.OnnxNode, ::ONNXmutable.OnnxNode, ::ONNXmutable.CompGraphBuilder, ::Symbol, ::Dict{Symbol,Any}) at C:\Users\jerem\.julia\packages\ONNXmutable\kxK8z\src\deserialize\combine.jl:47
...
Are the residual connections the cause of the difficulties here? (Let me know if it'd be preferable that I open an issue in the ONNXmutable repo or elsewhere.)
Thanks for showing interest @jeremiedb, sorry for wall-of-texting you as a response :)
I think this has been stated a few times, but outside of the group maintaining this tracker it does not seem like there is an overwhelming need, judging by the number of posts on discourse and the influx of issues in ONNX.jl and ONNXmutable.jl. There could certainly be a chicken-and-egg problem here, as people might silently turn back to Python when they can't find a canonical ONNX package for any of the Julia ML frameworks. I would love to discuss my proposal above some more. I'm however a bit hesitant to just go ahead and litter the general registry with ONNX packages without any wider support. Chances are that people will just go "hmmm, unknown single author and I don't understand the point of all of this so I'll just roll my own" if I proceed. I'm happy to do some work if we can work out a structure which we believe in. I will also be happy if the conclusion is that we should wait for some ONNX expert to catch interest instead :)
Yes, sorry for using confusing home-made graph terminology here. A ResNet would not fall into the category of "linear DAGs" as the elementwise summations are nodes in the graph which take input from more than one other node. The I can't say I have thought a lot about it, but creating an algorithm to transform an arbitrary graph using this representation to a chain does not seem like a trivial task. I'm sure the resnet can be done if one assumes that the only thing one needs to handle are things which fit the If you have a suggestion for the above I'd be happy to accept a PR (or even implement it if you tell me how) in ONNXmutable to have something like
This is definitely an issue with ONNXmutable! Please file an issue and I'll look into it. FYI the code which fails is the set of heuristics to combine multiple ONNX nodes into a single CompGraph node. One example of when one might want this is activation functions, as ONNX always has them as separate nodes in the graph while Flux allows them to be inside the layers. This is by no means a necessary step, but I thought it was nice to allow for things like CUDA optimizations as well as just generally trying to make sure that import -> export returns something which is as close to the same model as possible.
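To make the "linear DAG" terminology above concrete, here is a small sketch in Base Julia. The graph format (a Dict from node name to the names of its inputs) and the function `islinear` are invented for this illustration and have nothing to do with how CompGraph or ONNX actually store graphs.

```julia
# Hypothetical toy graph format: each node name maps to the names of its inputs.
linear = Dict(:in => Symbol[], :a => [:in], :b => [:a], :out => [:b])

# Residual block: :add takes input from both :b and the skip connection from :a
residual = Dict(:in => Symbol[], :a => [:in], :b => [:a], :add => [:a, :b], :out => [:add])

# A graph counts as "linear" here if no node has more than one input and
# no node feeds more than one other node.
function islinear(graph)
    fanout = Dict(k => 0 for k in keys(graph))
    for inputs in values(graph)
        length(inputs) > 1 && return false
        for i in inputs
            fanout[i] += 1
        end
    end
    return all(<=(1), values(fanout))
end

islinear(linear)     # every node has at most one input and one consumer
islinear(residual)   # :add has two inputs, so this is not a linear graph
```

Only graphs passing a check like this can be turned into a plain sequence of layers without extra structure for branches and merges.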
There are a couple of related issues here too. First, on the point of basic models/blocks like VGG/ResNet, we have a PR to Metalhead.jl that implements some of these natively in Flux. Ideally, I think we should skip ONNX completely and provide pre-trained models by just training those native Flux models (that was my intent anyways). Second, there would still be a need beyond basic models for the ability to read in arbitrary ONNX models into Flux models. I agree that this would be very useful functionality, but like @DrChainsaw mentioned, it is a tricky problem. Any solution would probably need to be well-documented so someone can reference what to expect from the translation. That being said, I think this PR will be helpful for expressing more complex graphs. I've already indicated in one of the ML committers calls that the PR should get some attention and be merged. |
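I'm not sure I would just blanket equate "Flux models" with "Chain". In my view, a CompGraph is also a Flux model, as is any Julia function which makes use of building blocks from Flux (i.e. what ONNX.jl tried to do).
I still don't fully understand the rationale behind that PR. Isn't the nice thing about the Chain that if your model is super simple, so that it only consists of unary operations strung (well, chained actually) together, you can get rid of a lot of complexity as you only need to store the layers as a tuple/array since the structure is implicit from the struct itself? Once you deviate from this simple assumption, wouldn't it be better to just use the same format as what is more or less tried and true when it comes to DAGs? I see many arguments that Flux should be more like the other frameworks and I guess that if model building deviates the same complaints will come for that too.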
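This is true, and it is certainly what makes Flux so powerful. But if I need to manipulate a model, then I'd much rather operate with the same layer building blocks in Flux. There are two reasons for this:
- Flux layers are pretty well designed for manipulation with standard Julia syntax
- Most people (myself included for my work) don't want to learn another library to interpret a model
I agree with this too. I don't think Flux should build graphs of the computation like other frameworks. I do think Chain should remain a simple sequentially executed list of functions. That PR doesn't necessarily invalidate that assumption. It would make representing models like Inception easier.
- Parallel is just a generic form of SkipConnection with more branches. The input and output are still a unary piece of data. In the Metalhead.jl PR that I linked, a closure is used to implement the many branches of an Inception module, since SkipConnection is limited to two branches. If you mostly want to execute the branching module, then there is little difference here. But if you want to manipulate or access the module's parts, then having a struct/layer like SkipConnection is useful.
- Join/Split are where it might appear like the unary structure is broken, but I would argue that the input and output are still "unary" because they are a single tuple. If you use a Split in a Chain, then the Chain only sees the one tuple as the output and not the individual pieces of the tuple. It is up to the person who used the Split in the Chain to make sure that the subsequent layer accepts the tuple as an input. We already do this when we x -> reshape(x, :) between convolution and fully-connected layers.
Generally speaking, I am not in favor of adding more layers to Flux. I think it is better to just use closures and arbitrary Julia functions whenever possible. But I think Parallel, Join, and Split have proven to be ubiquitous enough in ML that their addition is warranted. Given the assumption that the input and output to a DAG are truly unary, I think Chain + Parallel + Join + Split lets you include that DAG into a "native" Flux model. Like I mentioned, that translation is tricky and must be well-documented so people know exactly what gets interpreted as what, but the ability to make that translation would be useful for me (and I suspect many others).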
Hi, I am very sorry to join this conversation so late.
I come to Julia from an industrial side, where we are trying to make
products. From that point of view FastAI is a repository of best training
practices, and ONNX provides access to the best pretrained models. ONNX
also provides the critical link to other non-Julia eco-systems. For
example, one might want to suck in VGG from ONNX, fine tune it in Flux with
best practices from FastAI, save it to ONNX, and deliver it via some large
ONNX Runtime system.
So, for my purposes, ONNX does not have to support all DAGs-- just the ones
used in popular Vision and NLP pretrained models. I think if we get
reading and writing VGG, ResNet, BERT, GPT-2/3 and maybe a few more
working, that will make 90% of the developers happy.
FastAI should contain working examples of Transfer learning using ONNX. I
would try to match some of the popular Python examples, such as those from
Coursera's NLP Transformer class.
Finally, I also strongly agree that we should make it easy to extend ONNX
with new primitives so that researchers can come up with new designs, and
have them used outside of Julia.
…On Sun, Nov 15, 2020 at 11:41 AM Kyle Daruwalla ***@***.***> wrote:
I'm not sure I would just blanket equate "Flux models" with "Chain". In my
view, a CompGraph is also a Flux model, as is any julia function which
makes use of building blocks from Flux (i.e what ONNX.jl tried to do).
This is true, and it is certainly what makes Flux so powerful. But if I
need to manipulate a model, then I'd much rather operate with the same
layer building blocks in Flux. There are two reasons for this:
- Flux layers are pretty well designed for manipulation with standard
Julia syntax
- Most people (myself included for my work) don't want to learn
another library to interpret a model
Isn't the nice thing about the Chain that if your model is super simple, so
that it only consists of unary operations strung (well, chained actually)
together, you can get rid of a lot of complexity as you only need to store
the layers as a tuple/array since the structure is implicit from the struct itself?
I agree with this too. I don't think Flux should build graphs of the
computation like other frameworks. I do think Chain should remain a
simple sequentially executed list of functions. That PR doesn't necessarily
invalidate that assumption. It would make representing models like
Inception easier.
- Parallel is just a generic form of SkipConnection with more
branches. The input and output are still a unary piece of data. In the
Metalhead.jl PR that I linked, a closure is used to implement the many
branches of an Inception module, since SkipConnection is limited to
two branches. If you mostly want to execute the branching module, then
there is little difference here. But if you want to manipulate or access
the module's parts, then having a struct/layer like SkipConnection is
useful.
- Join/Split are where it might appear like the unary structure is
broken, but I would argue that the input and output are still "unary"
because they are a single tuple. If you use a Split in a Chain, then
the Chain only sees the one tuple as the output and not the individual
pieces of the tuple. It is up to the person who used the Split in the
Chain to make sure that the subsequent layer accepts the tuple as an
input. We already do this when we x -> reshape(x, :) between
convolution and fully-connected layers.
Generally speaking, I am not in favor of adding more layers to Flux. I
think it is better to just use closures and arbitrary Julia functions
whenever possible. But I think Parallel, Join, and Split have proven to
be ubiquitous enough in ML that their addition is warranted. Given the
assumption that the input and output to a DAG are truly unary, I think
Chain + Parallel + Join + Split lets you include that DAG into a "native"
Flux model. Like I mentioned, that translation is tricky and must be
well-documented so people know exactly what gets interpreted as what, but
the ability to make that translation would be useful for me (and I suspect
many others).
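The argument that Parallel and Split keep the "one input, one output" contract can be sketched in a few lines of Base Julia. The types `MiniParallel` and `MiniSplit` and the helpers `miniparallel`, `minisplit` and `chain` are hypothetical stand-ins written for this illustration; they are not the Flux layers or the ones proposed in the PR.

```julia
# Hypothetical minimal stand-ins, NOT the Flux implementations.
struct MiniParallel{F,T<:Tuple}
    connection::F   # combines the branch outputs, e.g. + or vcat
    branches::T     # each branch is a unary function applied to the same input
end
miniparallel(connection, branches...) = MiniParallel(connection, branches)
(p::MiniParallel)(x) = p.connection(map(f -> f(x), p.branches)...)

struct MiniSplit{T<:Tuple}
    branches::T
end
minisplit(branches...) = MiniSplit(branches)
(s::MiniSplit)(x) = map(f -> f(x), s.branches)   # one input in, one tuple out

# A chain is just sequential application of unary callables
chain(fs...) = x -> foldl((acc, f) -> f(acc), fs; init = x)

double(x) = 2 .* x
inc(x) = x .+ 1

# The branching module is a single unary step from the chain's point of view
model = chain(inc, miniparallel(+, double, inc), sum)

# After a split, the next step must accept the tuple, as noted above
splitmodel = chain(minisplit(double, inc), t -> t[1] .- t[2])
```

Because the branches are stored in a struct field rather than captured in a closure, they remain accessible for inspection or replacement, which is the manipulation argument made above.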
Yeah, the essence of this argument is certainly a wrench in Julia's narrative about everything being first class so people can just design their own extensions to other packages. I guess it has a pretty strong gravitational effect on adding peripheral stuff to existing packages, as otherwise one needs to create an agreement on what is the canonical package for it or forever suffer the Lisp curse (disclaimer: I'm not a Lisper, I saw someone mention this as being the same issue as Julia's ecosystem and thought it was spot on). I don't want to argue this too much because I'm really not that dug in as to what I believe is right, but:
Those layers are not part of Flux now, so learning to make use of them in the context of a Chain might be comparable to learning another library. Other popular frameworks I know of make use of the "traditional DAG" formulation to support them (or just leave it up to the user to define the function). Why make things more difficult than they have to be? Anyways, if you disagree but don't want to argue it further then no need to reply. As I stated below, I think it is a non-issue for this work if we go the eco-system route. I'm not so familiar with transformers. Given that BERT and GPT can be expressed as ONNX graphs the graph can also be expressed with the
Yeah, this is pretty much how I went about it as well (the "for my purposes" part that is) and I think it is the only scalable approach unless some big corp decides to throw a lot of people at this. I'm not sure if you mean anything special when you said "all DAGs"; I'd like to separate 'operations' from 'DAGs', where the latter just describes the sequence of 'operations' to apply. A missing operation is typically easier to fix the day you realize you need it compared to fixing a missing subset of DAGs. Conversely, supporting all 'DAGs', but not all 'operations', from day one is pretty straightforward if you just copy the DAG format from ONNX. I'm happy to merge PRs for the OPs you need in your business, and probably even implement a few of them if you file issues (unless there is anything ethically wrong about doing so). As I said above, I'm also happy to dismantle ONNXmutable into several packages so that someone can build the ONNX <-> Chain importer without having to either reimplement all the primitives or depend on the I mean, it's not that hard to write them from scratch, but at least I managed to mess up a double digit number of things, like not reversing/interleaving arrays along certain dimensions, in ways which did not even show up when running onnx's own test suite. That's why I ended up using onnxruntime as a test dependency so that the test suite checks that all OPs (and a handful of models with more OPs strung together) produce the same output for the same input. Again, the "Most people don't want to learn another library" is pretty much why I don't go ahead and make those packages I suggested in the OP. Again, I don't care what my own github looks like ofc, but I prefer to not litter the general registry with packages for a single person ONNX ecosystem and I don't want to deal with dependencies on non-registered packages as anything but an extremely temporary situation.
Although I'm happy there is some attention to this, I'm not sure if the comments in this issue mean that there is a strong disagreement with the original proposal or if we just ended up accidentally bikeshedding the DAG format (which is a non-issue if we go the eco-system route imo). Would it be possible to zoom back in on that and see if it might satisfy all requirements, or is there some other proposal on how to proceed?
Part of the difficulty with this discussion is that neither PyTorch nor TF handle importing from ONNX. That said, https://mxnet.apache.org/versions/1.7.0/api/python/docs/tutorials/packages/onnx/fine_tuning_gluon.html#Fine-Tuning-the-ONNX-model is a decent e2e transfer learning example, so perhaps we shouldn't be looking at those two frameworks at all! The MXNet implementation looks pretty straightforward, but it's operating at more of the TF 1.0/NNlib level. This seems to be a good sweet spot; the JAX repo has a similar (toy) implementation. What if
Note that the above was all about importing. WRT exporting, I suppose the only options are some kind of tracing framework (e.g. what ONNXmutable uses now) or some extra machinery on top of Functors.jl. Assuming the former wins out, why not trace all the way down to NNlib and Base functions? Yes, ONNX is conceptually a DAG, but it's also a full IR with control flow and everything. https://github.com/FluxML/XLA.jl was a good POC, so (given a less flaky Mjolnir equivalent) I think Flux and ONNX could be decoupled here as well.
Edit: I think this could also resolve the DAG format discussion, if only because there wouldn't need to be a shared DAG between Flux and ONNX(Mutable). Creating "a less flaky Mjolnir" may be the sticking point though.
I think that the actual way one is supposed to use ONNX is an excellent point which I did not address in the OP, partially because I don't have a satisfactory answer. I have also gotten the impression over time that ONNX is in practice less of an exchange between frameworks and that something like onnxruntime is the end goal for ONNX models. I guess the major frameworks rely on being big enough so that they don't need to exchange models, or do they use some other format when exchanging between frameworks? My current take on this is that doing an ONNX import/export is still more general than picking one framework and doing their export/import or trying to cover all. I have made a throw-away limited import/export from Keras (basically me being the stubborn guy who refused to use Python) and I don't think that format was much easier to import into Flux compared to ONNX. I pretty much made ONNXmutable mostly with long term storage of models in mind, and for this use case it is ofc very valuable that loading a model should be as close as possible to what was saved. I'm not sure if this requirement is in direct conflict with the transfer learning or more general import case, but let's explore it a bit. Right now,
Using NNlib operations for some export cases is certainly worth exploring to see if it simplifies and generalizes some aspect. I don't think it is ideal in all cases as ONNX defines "bigger" ops (e.g. Gemm vs MatMul) to facilitate optimization, but maybe good ONNX importers can do these optimizations themselves. Another issue is that I'm not certain that ONNX supports "free" parameters in the model, so things like the bias in conv layers which are not visible inside NNlib could cause issues. I also like to be able to use e.g. netron and be able to recognize the model I just exported, which may or may not be a requirement worth considering.
Just to be clear, ONNXmutable does trace down to Base primitives like
To start with the most primitive of the primitives certainly makes sense if one starts from scratch and aims to cover the whole spec. The approach I have taken so far is to add the stuff I need and just keep doing that for as long as I need something, which I think is a pretty decent pragmatic approach all in all. I don't think either approach excludes the other though. The import case is a bit less clear if it is generally valuable. Firstly, you need something to hold the parameters and I'm not sure it is the most user friendly thing in a transfer learning context to return opaque closures. I guess one way to go about it is to have a generic
I wanted to use Functors for the export rather than messing with tracing, but it just seemed to be designed for a completely different purpose. In particular, it does not tell you anything of the program flow. It also does not penetrate inside closures (e.g. the combine method in
I might not be enough of a computer scientist to accurately speak here, but the ONNX IR does not define control flow afaik. It defines node operations like
If your point was that the current tracing approach which only uses dispatch will choke on Julia control code then you are absolutely correct. Mjolnir as you point out seems like the perfect tracing tool and I have many times thought that I should give it a go, but never found the energy. One thing I don't think Mjolnir/XLA solves is the world age problem, which again is why some Julia-defined structure (like Chain or CompGraph) is needed to describe the DAG. As I stated in the OP, if one is content with a slim ONNX package which can only import into global scope it is straightforward to generate a Julia IR from any of the three formats (ONNX graph, CompGraph or Chain) and I think it can also be made functor compatible. I don't think Mjolnir is needed for this even. I fully agree that Flux and ONNX can be (and in fact are) fully decoupled. To build something like onnxruntime in Julia would not require one to depend on Flux. It seems from the discussion above though as if adherence to Flux is very desirable from a transfer learning perspective. Perhaps there does not need to be one package which fulfills all requirements though, and that leads back to the eco-system discussion (i.e. what packages should there be and what are their responsibilities) which I think is the right discussion to have.
Note to self: Practice being more concise in online conversations. I don't think anyone will read this post up to this line so I guess I can say whatever I want here. (can't think of anything to say since I have basically used all words in post above)
Yeah I think there is value in an ONNX representation. I am certainly not advocating for a system where all ONNX models are represented by Flux layers. I think the MXNet implementation looks more like what I was thinking. Some kind of ONNX2Flux.jl package that translates
I think multiple representations can exist at the same time. Some people will use CompGraph, and some people will prefer to switch to Flux layers. We don't have to put all functionality inside Flux or force everything outside of it. Being first class means the user gets to choose and not suffer for their choice. |
Too late now 😄. I think this kind of deep design discussion could definitely benefit from some synchronous communication. We should schedule a call some time soon or add an agenda item to one of the bi-weekly calls. In the meantime:
Yes and nope. That AIUI was the genesis of ONNX, NNEF etc. IIRC ONNX Runtime was a later development.
This is exactly the granularity NNlib and
If you're referring to autodiff, that's where Functors.jl (the mechanism behind
👍
Yup, compat with Knet for one. I don't see any problem with bootstrapping with Flux, but mid-long term NNlib+potentially NNcore would be a better common substrate to target. @darsnack I think this applies to your ONNX2Flux.jl point too.
Not really, based on your findings it may be a dead end. I'm personally more in camp tracing, but that's very much a personal bias towards the "ML is an interpreter/compiler problem" school of thought.
So does e.g. WASM, yet that is 100% Turing complete. I'm glossing over call et al. and ONNX IR is not Turing complete AFAIK, but that doesn't feel like a limitation for ML models specifically. Perhaps Mjolnir-style inlining and partial evaluation is mandatory for this though, I've not looked into it either.
I'm pretty hazy on world-age issues, but the global scope thing is not a problem with XLA. AIUI ONNX.jl literally generates and evals source code, whereas Mjolnir (really IRTools) is directly manipulating a partially lowered representation that the compiler middle end works with. There's certainly nothing stopping one from XLA'ing a function in another nested function and passing that to a third function in another module.
Does the following functionality constitute Flux adherence?
Is there anything else you think is necessary for transfer learning? My (limited) experience is that the pre-trained backbone is treated as a black box and not much interrogation of the internals happens beyond splitting and figuring out input/output sizes. |
Yes I agree! Of course, I definitely don't know as much about ONNX as everyone else in this thread.
I think this is the right list to base the design of an ONNX package on. Stuff like what I brought up (converting ONNX models to Flux) is a separate and orthogonal issue.
😅 I did too! |
Yes, perhaps we can take a shot at crystalizing the requirements (maybe they should be called wishes in this context) and see if some components can form from this.
I was mainly thinking about those. I agree that it's not beautiful that they are in the same opset. I saw the motivation somewhere being just performance reasons (e.g. CUDA). At least it makes me feel a little bit less stupid for dreading the task of trying to come up with an algorithm which recognizes subsets of the DAG as equivalent to members of some set of higher level operations. If one tries to keep the interoperability ambition alive it also makes sense that importing e.g. a recurrent model actually yields something recognizable as a recurrent model.
I was actually thinking in completely wrong terms here, no idea why. Of course ONNX supports what I called dangling parameters above. That is what the initializers do and that is the only mechanism to have parameters in the model at all. Just FYI,
That's great! I somehow thought only Flux made use of NNlib, which is the reason for my scepticism above. This is a strong argument for defining primitives on that level. I guess one package which should be made is the one with NNlib primitives for ONNX. Nnlib2Onnx?
This is outside of my understanding atm, but isn't there some limitation on how dynamic this can be? I can see how XLA:ing an existing function could end up confining everything so that the compiler can do its thing, but would it be possible to do this when e.g. passing a string (or a GraphProto) to a function which XLA:s whatever function that string (GraphProto) happens to represent and passes it on to a third function? I don't think this is something which needs to be resolved at this point. I guess I'm just greedily trying to obtain a small nugget of knowledge at a low cost here :) Anyways, given that my thought block on parameters has resolved, I certainly see how one could make a potentially very simple (as in simple to build) ONNX importer by just copying pretty much the exact ONNX IR format and function APIs. Fetching params is as you hinted probably 'just' a matter of labelling up all initializers based on whether they are labeled as differentiable in the OP they are input to. In the simplest form there might be a bit too much Dict:ing around with strings for performance to be satisfactory though.
I still think this is an important requirement to capture and I also believe this is in spirit with what ONNX tries to achieve. I do believe there is a set of reusable components (packages) which allow these things to coexist without an onerous amount of duplication. |
As much as I enjoy the simplicity and theoretical purity of a symbolic graph, practical usage seems to indicate it's a terrible representation for debugging, introspection and ergonomics in general. That said, modern frameworks like JAX and torch.script show that the two are not necessarily mutually exclusive. Don't have much of an opinion on where parallel/split/join should live, but they could make sense as decorators or combinators for the ONNX library to use. I'm specifically thinking of n>1-ary operators like
Warning: everything below is highly speculative!
Looking at this model again:
Because the structure is a DAG, it can be unrolled in topological order to a linear trace of sorts:
ins = [:in]
outs = [:e]
ENV = Dict(...)
ENV[:a] = conv(ENV[:in])
ENV[:b] = conv(ENV[:a])
ENV[:c] = conv(ENV[:a])
ENV[:d] = +(ENV[:b], ENV[:c])
ENV[:e] = conv(ENV[:d])
where Splitting a network for e.g. transfer learning could be done like so:
ins = [:in]
outs = [:b, :c] # this could be any variable/symbol that is not used by a later op, or explicitly specified during the splitting operation
ENV = Dict(...)
ENV[:a] = conv(ENV[:in])
ENV[:b] = conv(ENV[:a])
ENV[:c] = conv(ENV[:a])
# -- cut at this layer --
# ENV[:d] = +(ENV[:b], ENV[:c])
# ENV[:e] = conv(ENV[:d])
This presumes a good enough tracing and codegen apparatus, so I'm not sure how tenable it is with our current infrastructure. Julia is definitely a better fit than Python for such an approach though. |
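The unrolled trace and the cut above can be made runnable with a plain `Dict`, as a minimal sketch. Everything here is an assumption for illustration: `conv` is a stand-in that just doubles its input, and `run_trace` and `cut` are hypothetical helpers, not part of any package mentioned in this thread. The dictionary is called `env` because `ENV` is already Julia's environment-variable dictionary.

```julia
# Hypothetical stand-in for a layer; a real model would use e.g. a Flux layer.
conv(x) = 2 .* x

# Each step is (output symbol, function, input symbols), in topological order.
trace = [
    (:a, conv, (:in,)),
    (:b, conv, (:a,)),
    (:c, conv, (:a,)),
    (:d, +,    (:b, :c)),
    (:e, conv, (:d,)),
]

# Evaluate the trace step by step, memoizing every intermediate in `env`.
function run_trace(trace, inputs::Dict{Symbol}, outs)
    env = Dict{Symbol,Any}(inputs)
    for (out, f, args) in trace
        env[out] = f((env[a] for a in args)...)
    end
    return [env[o] for o in outs]
end

# Cut the network: walk backwards and keep only steps needed to compute `outs`.
function cut(trace, outs)
    needed = Set{Symbol}(outs)
    kept = empty(trace)
    for step in reverse(trace)
        out, _, args = step
        if out in needed
            union!(needed, args)
            pushfirst!(kept, step)
        end
    end
    return kept
end

full = run_trace(trace, Dict(:in => [1.0]), [:e])
head = run_trace(cut(trace, [:b, :c]), Dict(:in => [1.0]), [:b, :c])
```

Note that this only covers evaluation and dead-step removal; it does nothing to keep parameter sizes aligned after a structural change.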
Could this be a combination of a) bad experience with TensorFlow's opaque graph format and b) essential complexity due to general DAGs vs simple sequential models? I have spent quite a bit of effort to make REPL messing around with I think the differences compared to Chain are due to the higher degree of generality. I do think there is a certain degree of attractiveness in the strategy to make everything in the chain a unary op (i.e. split, join, parallel), but I think that approach can only take you so far, and when models become sufficiently complex I don't think it makes model manipulation easier.
This unrolling is basically what happens when executing the I have explored this a little bit and I have not found a way to get around the world age barrier. Here is a discourse post on that matter: https://discourse.julialang.org/t/improve-performance-of-computation-graph-evaluation/32873 which covers that. There exist GeneralizedGenerated and RuntimeGeneratedFunctions but they come with caveats. I have tried the former without success but haven't gotten around to the latter. In case you want to play around with it I can dig up my compile-CompGraph-as-a-function script to save you a few minutes. I did try the 'mutable references' tape approach suggested in the discourse thread and it did not perform better than dict memoization, but I didn't go out of my way to optimize it. I have the code for that too somewhere in case you'd like to give it a spin. Btw, the issues I mentioned in the thread seem to have been resolved in Zygote so I don't feel a pressing need to improve the CompGraph execution performance.
Yup, but why would this be considered easier to do with a Julia expression compared to a structure which is already designed to do exactly this? Fwiw, this is a pretty deep rabbit hole where simple cases make one believe that this is straightforward. The ENAS paper referred to this as "butterfly effects" and pretty much copped out by limiting the search space imo. This is something which is doable when you are in control of the search space (i.e. the user is not allowed to make arbitrary modifications to the graph, everything is pre-baked). In my mind, offering something which just fails (e.g. results in a corrupt/misaligned model) for things which look perfectly reasonable to do from the API is not desirable, especially when NaiveNASlib already takes care of this with pretty high confidence. It's much easier to modify or create a new API for NaiveNASlib than it is to build the thing that works for the simple cases and then spend the rest of eternity patching every single issue which pops up with special handling. I think this is a very good summary of the current state of model representation and I don't think I have any definite answers to the issues you raised. One way to defer this discussion is to build the ONNX ecosystem so that model representation can be built independently of the lion's share of the code (which is op conversion), one caveat being covered below. I need to put answering this topic aside as work calls, but I'll try to give a proper reply when I find the time.
I'm a bit torn on this one. I like Flux being bare bones and advocating to just write everything as a function. I can however see the difficulty of letting the ecosystem come up with many alternative model representations when that is not enough (programmatic model manipulation being one). As stated before, had there existed a canonical compgraph package in Julia when I started NaiveNASlib I would have tried to build on top of that. I started using LightGraphs, but it turned out to not be the right tool for the job for various reasons. |
I agree with this, especially from the perspective of NAS. If the goal involves lots of arbitrary model manipulation (e.g. programmatically directed manipulation), then it's better to work with a graph-based representation.
I would say the reasons are similar to working with Flux has one more limitation in this regard that @DrChainsaw touched on. While
Definitely, but functions are limited in that their lifetime is constrained around the function call, and they don't exist beyond that. So unless you are willing to pass in parameters to every function call, you need something more permanent to capture state. Closures can give you this, but they lack specificity (i.e. they are anonymous and their fields are not consistent). So, while a higher order function can capture some state and return a closure, that state is not easily referenced. But (in Julia) closures are just anonymous structs, so we come full circle to why structs are useful. I would push that we want everything as functions, but structs are just stateful, named functions. And in Julia, structs are not just collections of data, but also types. As I talked about in this comment, there is a huge step in ergonomics when something is lifted into the type system. Julia's core strengths are built around the type system.
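To make the closure-versus-struct point concrete, here is a small sketch. `make_affine`, `Affine` and `describe` are hypothetical names chosen for illustration; the point is only that the struct version has a named type that other code can dispatch on and documented fields to reference.

```julia
# A closure captures `w` and `b`, but its type is anonymous.
make_affine(w, b) = x -> w * x + b

# A callable struct: same behaviour, with a named type and named fields.
struct Affine{T}
    w::T
    b::T
end
(a::Affine)(x) = a.w * x + a.b

# Because `Affine` is a type, other code can dispatch on it.
describe(a::Affine) = "Affine(w=$(a.w), b=$(a.b))"
describe(f) = "some function"

f = make_affine(2.0, 1.0)
g = Affine(2.0, 1.0)
```

Both `f(3.0)` and `g(3.0)` compute the same value, but only `g` is recognized by the specialized `describe` method.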
I think this is the right approach too. A design call would be good, but we are getting too stuck in the mud on which (if any) type of graph should go into Flux as a layer. The ONNX side of things can exist completely independent of whether |
One possibility is adding Stack Semantics to Chain. This is the approach used by Google Brain, and was discussed in the topic "FluxTracks?" on Zulip. The idea is that operations pop inputs from the top of a stack, and push their outputs. To split values, one just duplicates them on the stack, and merging just requires popping more than one value. Splits, Residuals and Combinations can all be implemented this way. The ONNX API would then read and write Flux Chains with these new features. Here is a Python example of a multi-headed Transformer Decoder in Trax from Coursera's NLP Course
|
And here is some documentation on Trax https://trax-ml.readthedocs.io/en/latest/notebooks/layers_intro.html#2.-Inputs-and-Outputs |
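The stack semantics proposed for Chain above can be sketched in a few lines of Julia. This is not Trax's or Flux's API: `Op`, `dup`, `apply` and `run_stack` are hypothetical names, and the stack is modelled as a tuple whose first element is the top.

```julia
# Hypothetical stack combinators; none of these names exist in Flux or Trax.
struct Op{F}
    f::F
    nin::Int   # number of values popped from the top of the stack
end

# Duplicate the top of the stack (a "split").
dup(stack::Tuple) = (first(stack), stack...)

# Pop `nin` values, apply the function, push the single result.
function apply(op::Op, stack::Tuple)
    args = stack[1:op.nin]
    rest = stack[op.nin+1:end]
    return (op.f(args...), rest...)
end

# Run layers left to right, threading the stack through each of them.
run_stack(layers, stack::Tuple) =
    foldl((s, l) -> l isa Op ? apply(l, s) : l(s), layers; init=stack)

# A residual block f(x) + x: duplicate, transform the top copy, then merge.
residual = [dup, Op(x -> 2x, 1), Op(+, 2)]
result = run_stack(residual, (3,))
```

With input `3` the stack goes `(3,)`, `(3, 3)`, `(6, 3)`, `(9,)`, so the merge is just an operation that pops two values.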
Keeping this brief to incentivize a design call, but in short I was mistakenly assuming that ONNXMutable wanted to get rid of NaiveNASlib altogether because the graph representation was insufficient for certain use-cases. Since that's not the case and My only remaining question is whether operations like Also, just to wrap up my off-topic digression about symbolic graphs:
It was meant as more of a general statement/observation on gripes I've seen frequently expressed by ML framework users. Personally speaking, I was thinking not of |
Sorry if I seem indecisive about this, but there are certainly disadvantages which don't really have so much to do with the representation itself. For the automatic size alignment stuff to work, one needs to be able to formulate the size constraints as a MILP constraint. The vast majority of ONNX ops fit into one of the pre-baked types in NaiveNASlib and require no further effort, but there is a significant portion which doesn't, for example Reshape and Flatten. Just to be clear, the model works and can be trained as normal even without this, but if you try to modify the graph structure the "promise" of NaiveNASlib to keep the graph shape-aligned might be broken. While one can argue something like "well, what works works and that's better than nothing", it has the risk of providing an unpolished feel and a sense of uncertainty as to whether things will really work out or not. Just adding the constraints is of course an option and it is what I have been doing until now, but given that this might well be outside the comfort zone of potential contributors (it certainly is outside mine :) ) along with the possibility that some OPs might just not be possible to express in a MILP problem it seems like an unnecessary constraint (pun kinda intended) to put on development. Ideally imo the current ONNXmutable would be just one out of a handful of ONNX import packages, aimed at people who are serious about graph modification and are prepared to pay the price of adding constraints for new ops. Question is how to slice things to prevent basically the same stuff being reimplemented in each package. This is what I tried to break down in the OP. Making a super simple CompGraph similar to the one in NaiveNASlib but without the mutation stuff is not many lines of code, so having the implementation inside a more generic ONNX importer need not cost much. |
@DrChainsaw - is it possible to determine if the graph is linear (and if this transformation results in the same output)? (Either way, this would be worth documenting in |
Hehe, there really seems to be very little love for the CompGraph format.
I'm no graph expert and I have certainly been burned by how difficult it seems to be to write code that reasons about graphs, but I think that a DAG is "linear" if all nodes except the input and output nodes have exactly one input and one output edge. |
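The "linear" criterion above can be checked directly on an edge list. The sketch below is not NaiveNASlib API: `islinear` is a hypothetical helper, edges are assumed to be `(from, to)` pairs of symbols, and as an added assumption it also requires the input to have a single outgoing edge and the output a single incoming edge, so that the graph is a plain chain.

```julia
# Edges are (from, to) pairs; vertex names are arbitrary symbols.
function islinear(edges, input::Symbol, output::Symbol)
    nin, nout = Dict{Symbol,Int}(), Dict{Symbol,Int}()
    for (from, to) in edges
        nout[from] = get(nout, from, 0) + 1
        nin[to] = get(nin, to, 0) + 1
    end
    vs = union(Set(keys(nin)), Set(keys(nout)))
    # Inner vertices: exactly one incoming and one outgoing edge.
    inner_ok = all(vs) do v
        v in (input, output) || (get(nin, v, 0) == 1 && get(nout, v, 0) == 1)
    end
    # Added assumption: the endpoints must not branch or merge either.
    return inner_ok && get(nout, input, 0) == 1 && get(nin, output, 0) == 1
end

chain = [(:in, :a), (:a, :b), (:b, :out)]
diamond = [(:in, :a), (:a, :b), (:a, :c), (:b, :d), (:c, :d), (:d, :e)]
```

Here `diamond` is the branching model discussed earlier in the thread, where `:a` feeds both `:b` and `:c`, so it is not linear under this check.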
I think the reason why I wanted to see the 'underlying' FluxML representation was that I was trying to figure out what layers / ops I would have to enable to work with JuMP variables for my optimization problem (e.g. |
There should not be a need for any. Just like the Chain, the CompGraph vertices can have any function inside them and when evaluated the graph will just pass the outputs to the right nodes just like Chain does. Chain is really just a much simpler CompGraph which only supports what is in here called a 'linear' graph. If you want to be able to mutate the graph (change size or layers or remove/add nodes/edges) and have NaiveNASlib ensure sizes of parameters across the whole graph is still consistent then you also need to provide metadata for that or write the constraints yourself, but to just evaluate it nothing extra is needed. |
Why is there no mature package that can simply load an ONNX model, take the inputs the model requires, and return the predicted result? |
To just satisfy your wish without trying to interpret it, you have https://github.com/jw3126/ONNXRunTime.jl which wraps Microsoft's onnxruntime which afaik is the most complete implementation of the ONNX spec. This will allow you to do both inference and training from Julia, but ofc any Julia native AD will not work.
Opinion piece
Despite the fact that most popular deep learning operations look the same on the surface level, they don't have a global standard of how to implement them which everyone just follows. ONNX is an attempt to consolidate this, but it is really just another implementation with the classical standards problem. The ONNX spec is sprawling with a lot of operators (many of which are a bit fringe and situational), multiple versions of each operator and multiple flags and configuration parameters for most of the operators. Just from the sheer volume of things to implement (and test and maintain) "full support" is a huge task. If you browse the ONNX issues and discussion forums there are a number of requests to keep the standard implementable (which are met with sympathies but also the standard "but the market wants it" justification which is the predominant source of feature creep in all software imo). The ONNX.jl repo now has the goal to be pretty much a faithful implementation of the spec (rather than the doomed-to-hit-a-dead-end approach I went for in ONNXNaiveNASflux). It is however not backed by any megacorp which means that the amount of developer hours is going to be a bottleneck. I guess that no one in their right mind is going to sit and just implement all versions of all operators in the ONNX spec in their free time. Although I have no insight into onnxruntime, I'm 99.999% sure that the people who work on it are salaried by Microsoft to do so and that even they have a prioritized backlog of what is most useful to do. If ONNX.jl is going to fly, it will need individual contributors to get anywhere.
At least for me, adding support for an operator which would allow me to solve some problem I have right now is a lot more rewarding, and over time it ought to build up to some kind of support for the most used parts of the spec. It is a bit of a different mindset than in the Python ecosystem where the expectation is that everything you would ever need is written by some megacorp in C/C++. One is ofc justified to ask "why should I implement it in Julia when it is already available in Python?" and for this there are no good answers (if one is looking at it from a pure minimum effort to solve a problem). I think this is why e.g. Julia Computing seems to focus on the use cases where Python does not have a good story (e.g. the SciML stuff) to leverage Julia's strengths. |
We've started overhauling ONNX.jl since this issue, and we have tracking on that repo. So I am closing this. |
I’m willing to put some effort into the ONNX story if there is some interest.
TBH I don’t really know what fastAI is about so if there is some special meaning to ONNX + fastAI then please let me know.
Non exhaustive rundown of the current status Flux is that @opus111 has created BaseOnnx and I have made a branch of ONNXmutable which makes use of it (replacing ONNX.jl as the source of protos).
Afaik, ONNXmutable is fully functional and verified and the main gripe I have with it is that it is a bit of a monolith. It depends on NaiveNASflux for the model DAG which in turn has the somewhat big dependencies Flux (arguably not a big deal), LightGraphs and MetaGraphs (both of which could be removed with a little effort), JuMP and Cbc. ONNXmutable also has onnx and onnxruntime as test-only dependencies as well as PyCall and Conda to be able to use and depend on them.
As such, I’m mainly focused on breaking down that monolith into smaller and more reusable components which don’t force all those dependencies on to people. Please let me know if this is not what you think is needed here.
Here are some of my thoughts:
Import
In addition to the primitives discussed below, one needs some kind of runnable representation of the computation graph.
Julia’s fantastic autodiff capabilities have made it obsolete to have a special DAG format for the models (e.g. Tensorflow) as you can just write the DAG as a normal Julia function. While it is indeed possible to translate an ONNX model into a Julia expression which evaluates to a function representing the model (this is the approach taken by ONNX.jl) it has the limitation that the expression (at least in practice) has to be evaluated at the top level.
Flux today does not have a typical DAG. Flux’s built-in Chain can represent many DAGs through the usage of SkipConnection but I don’t think it can represent any DAG (without user written functions).
I think that the method used in ONNXmutable can be generalized without too much effort to work for any typical DAG and perhaps this is something which is useful to put in BaseOnnx. I opened an issue about it here.
If there is interest, I think I can make a functor compatible import-onnxmodel-as-a-function macro in BaseOnnx. Drawback is perhaps that it might give people a janky experience due to it only working from the top level, but perhaps this can be solved by generating a better error message than “incorrect world age”.
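For readers who have not run into the world age issue, here is a minimal sketch of it and of one common workaround. `build` is a hypothetical helper and the expression is a toy stand-in for code generated from an ONNX graph. Calling the freshly evaluated function directly from inside `build` would typically raise a `MethodError` saying the method is too new; `Base.invokelatest` avoids that at the cost of a dynamic call.

```julia
# Build a function from an expression at runtime, as an importer might do
# after translating a graph into Julia code.
function build(ex::Expr)
    f = Core.eval(@__MODULE__, ex)         # defined in a newer world age
    return x -> Base.invokelatest(f, x)    # dynamic call into the latest world
end

double = build(:(x -> 2x))
y = double(3)
```

The dynamic call is not free, so this is more of a way to get a usable error-free result than a route to good performance, which is the same trade-off hinted at by the caveats of GeneralizedGenerated and RuntimeGeneratedFunctions mentioned elsewhere in this thread.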
To reach the end goal of having a useable package, there still needs to be a DAG format to use. One option is that I extract the non-mutation stuff from NaiveNASlib into some AbstractMlDag package. It is already separated from the mutation stuff so it is trivial to move to a separate package.
Export
The method in ONNXmutable uses dispatch for tracing and I think this is good enough for most typical/traditional DL models. It will however fail if it encounters
function model(x::AbstractArray)
I think IRTools can be used without too much effort to circumvent 1, but I dread to think about how to make use of it for 2. Mjolnir seems to have the perfect abstraction but I don’t know if it is effectively maintained and ready for use.
For exporting, I’m not as certain that there exists a simple and universal enough solution which is worthwhile to put in BaseOnnx. Should we try to make a generic package or two for this or just mash it together with the framework specific stuff?
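As a rough illustration of what tracing by dispatch means, here is a sketch that records operations instead of computing them. It is not the ONNXmutable implementation: `Traced`, `record` and the tape layout are hypothetical, and only two ops are covered.

```julia
# Hypothetical tracer type: carries a name and a shared log of recorded ops.
struct Traced
    name::Symbol
    tape::Vector{Any}
end

# Record one op on the tape and return a new traced value for its output.
function record(op::String, inputs::Traced...)
    tape = first(inputs).tape
    out = Traced(Symbol("v", length(tape) + 1), tape)
    push!(tape, (op = op, inputs = map(t -> t.name, inputs), output = out.name))
    return out
end

# Dispatch does the tracing: these methods only fire for `Traced` arguments.
Base.:+(a::Traced, b::Traced) = record("Add", a, b)
Base.sin(a::Traced) = record("Sin", a)

model(x) = sin(x + x)

tape = Any[]
y = model(Traced(:x, tape))
```

A limitation of this style is visible right away: a method whose argument is annotated as `AbstractArray` would reject `Traced` with a `MethodError`, since the tracer is not an array, and value-dependent control flow cannot be recorded this way either.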
Primitives
Primitives are the functions which have a one to one correspondence with an operator defined in ONNX, for example Add, sin, Conv, RNN etc. In other words, this is the part which knows how to transform e.g. a Flux.Conv into an ONNX NodeProto and vice versa.
I don’t think this can be done without manually typing out the mapping for each OP so in any moderately well designed ONNX package this will be the by far biggest effort to create and maintain, especially considering opset versions. To me this makes it quite important that adding OPs can be done easily so that users of the package can contribute with the OPs they need or else it will be a thankless and soul crushing effort to support the whole spec.
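To give a feel for what such a hand-written mapping could look like, here is a small sketch. It is not the layout used by ONNXmutable or BaseOnnx: `Node` is a simplified, hypothetical stand-in for a NodeProto, and `importers`, `import_op` and `optype` are made-up names. The op names and the LeakyRelu default `alpha` of 0.01 follow the ONNX operator spec.

```julia
# Hypothetical, simplified stand-in for an ONNX NodeProto.
struct Node
    op_type::String
    attributes::Dict{Symbol,Any}
end

# Import primitives: ONNX op type => builder returning a Julia callable.
importers = Dict{String,Function}(
    "Add" => node -> ((a, b) -> a .+ b),
    "Sin" => node -> (x -> sin.(x)),
    "LeakyRelu" => node -> begin
        alpha = get(node.attributes, :alpha, 0.01)
        x -> ifelse.(x .< 0, alpha .* x, x)
    end,
)

function import_op(node::Node)
    haskey(importers, node.op_type) || error("Unsupported op: $(node.op_type)")
    return importers[node.op_type](node)
end

# Export primitives go the other way, here via dispatch on the function type.
optype(::typeof(+)) = "Add"
optype(::typeof(sin)) = "Sin"

add = import_op(Node("Add", Dict{Symbol,Any}()))
leaky = import_op(Node("LeakyRelu", Dict{Symbol,Any}(:alpha => 0.1)))
```

Adding an op in this layout means adding one dictionary entry and one `optype` method, which is the kind of low barrier that makes drive-by contributions plausible. Opset versions would need an extra key or dispatch level that is left out here.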
To me, this makes it pretty useful to have a package with only the primitives to attract contributions. Obviously one package per ML framework is needed as the primitive package has to depend on the ML framework package (e.g. Flux, Knet, Lilith etc).
Here we can start with the modest set of primitives from ONNXmutable as a start of the Flux package.
Furthermore, import and export have basically nothing to do with each other and I don’t see a way to make use of import primitives for export and vice versa. This opens up for having separate packages for import and export primitives, but it is also kinda nice to be able to test the import functionality with the export functionality and vice versa. Thoughts on this?
Another thought is whether it makes sense to have a package with OPs from Base (e.g. Add, tan, sin, max, reshape etc)?
Testing
Testing numerical computations is always annoying. In ONNXmutable I used 1) test vectors from onnx and 2) comparison with output from onnxruntime. Nothing says onnxruntime is the gold standard ofc, but it seems like a lot of work is being put into it so I think it makes for a pretty good reference.
There are a couple of lines of code to set all of this up and perhaps this can also be made a package which others can make use of (e.g. primitive packages for other ML frameworks).
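The comparison part of such a test setup might look roughly like the sketch below. `matches_reference` is a hypothetical helper and the tolerances are arbitrary assumptions; the reference array would come from an onnx test vector or from onnxruntime, neither of which is called here.

```julia
# Hypothetical helper: compare an output with a reference, elementwise,
# within relative and absolute tolerances.
function matches_reference(actual, expected; rtol=1e-5, atol=1e-6)
    size(actual) == size(expected) || return false
    return all(isapprox.(actual, expected; rtol=rtol, atol=atol))
end

ok = matches_reference([1.0, 2.0], [1.0, 2.0 + 1e-8])
bad = matches_reference([1.0], [1.1])
```

Checking elementwise rather than on the norm of the whole array is a design choice: it is stricter and makes a single badly wrong element fail the test even in a large output.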
Package homes
The packages which are specific to a particular ML framework are best served to sit in the same org as their parent frameworks, right? What about the remainders?
Assuming the above, the remainders are basically BaseOnnx, the testing package, the Base OPs and maybe a generic export package or two if it makes sense to create it.
I guess that creating a new org on GitHub is no effort, but perhaps it is good to try to put it into some more well known org like JuliaML.
Another option is to just give up on components which are not tied to any framework and perhaps just create another monolith. I think the advantage of this over what ONNXmutable offers today then is basically that one does not need to depend on JuMP and Cbc. I can't imagine that this is the reason why people state that Julia has no functional ONNX import/export though.