
[RFC] Add NNEF frontend #108

Merged — 3 commits, merged on May 31, 2024

Conversation


@agoston-mc (Contributor) commented Apr 11, 2024

An RFC to add a Neural Network Exchange Format (NNEF) frontend to TVM Relay.
Link to the discussion thread.

@agoston-mc changed the title from "[RFC] Add NNEF frontend #108" to "[RFC] Add NNEF frontend" on Apr 11, 2024

@tqchen (Member) commented Apr 13, 2024

Thanks for the proposal. As a community we have recently moved towards the Relax IR for the latest genAI workloads. Additionally, it is unclear how much adoption NNEF has as of now versus ONNX and other formats.

@gyenesvi commented

Hi,

> As a community we have recently moved towards the Relax IR for the latest genAI workloads

Thanks for directing us towards Relax. I guess that means that new frontends should convert their representations into Relax IR instead of Relay? The documentation on tvm.apache.org refers to Relay, but not Relax. Is that documentation obsolete in this area? Is Relay going to be superseded by Relax?

We only see frontend examples in tvm.relax that we can use as reference. Is there further documentation on tvm.relax?
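
For concreteness, the kind of entry point we have in mind would mirror the shape of the ONNX Relax frontend. A minimal sketch of what it could look like is below; `from_nnef` and its use of the `nnef` parser package are our proposal, not an existing TVM API:

```python
# Illustrative sketch of a proposed Relax entry point for NNEF; the name
# `from_nnef` does not exist in TVM yet, and the translation body is elided.
import tvm
from tvm import relax

import nnef  # NNEF parser package from KhronosGroup/NNEF-Tools


def from_nnef(model_path: str) -> tvm.IRModule:
    """Parse an NNEF model directory and translate it to a Relax IRModule."""
    graph = nnef.load_graph(model_path)  # parse the textual NNEF graph
    bb = relax.BlockBuilder()            # incrementally builds the Relax module
    # ... translate each NNEF operation to the corresponding relax.op call ...
    return bb.get()
```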

It is interesting to hear that there's more focus on dynamic graphs / shape inference, as one of the key goals of the next version of NNEF, under development, is support for dynamic graphs and shape inference.

> it is unclear how much adoption NNEF has as of now versus ONNX and other formats

One of the goals of integrating into compiler stacks like TVM is precisely to drive more adoption, as adoption requires public tooling that can demonstrate the capabilities and usage of NNEF in end-to-end workflows. As the next version of NNEF will focus on dynamic graphs, custom operations, and lowering to the tensor IR level, TVM seems like a good option for demonstrating its potential in compilation-based inference engines. But first we would like to start by integrating the currently publicly available version of NNEF.

Also, TVM has backends for multiple Khronos formats, such as SPIR-V (Vulkan) and OpenCL, which is why TVM could provide us with an end-to-end workflow starting from a Khronos-defined input format and resulting in Khronos-defined outputs. Furthermore, some Khronos members may be interested in implementing their own (proprietary) hardware backends for TVM, with which an NNEF frontend could also provide an end-to-end workflow.


@tqchen (Member) commented Apr 16, 2024

Thanks for the note. We are in the process of revamping the docs. The latest set of emerging model optimizations, such as those for LLMs, will be based on Relax, and most community development now also centers around it. Relay is mostly in maintenance mode per dev activity. https://github.com/apache/tvm/tree/main/python/tvm/relax/frontend/onnx is likely a good reference there.
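
A minimal sketch of that pattern (exact signatures may differ across revisions of main; check the linked source for the current API):

```python
# Minimal sketch of using the Relax ONNX frontend as a reference pattern.
import onnx
import tvm
from tvm import relax
from tvm.relax.frontend.onnx import from_onnx

onnx_model = onnx.load("model.onnx")
mod = from_onnx(onnx_model)               # translate ONNX -> Relax IRModule
ex = relax.build(mod, target="llvm")      # default build pipeline
vm = relax.VirtualMachine(ex, tvm.cpu())  # run via the Relax VM
```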

@agoston-mc (Contributor, Author) commented

We have updated the PR with a Relax frontend, but we have also kept the Relay one as an option, thinking it could be useful to have both, because we noticed performance differences during testing.

We observed that Relax with the default build pipeline is significantly slower than Relay: on CPU the runs were about two orders of magnitude slower, while on GPU they were 3-5x slower. The models we tested were MobileNet and ResNet variants, all static models. We observed the same with the ONNX Relax frontend, so we suspect the issue is with the compilation, not with the frontends. Is this a normal situation at the current state of development of Relax?
By using MetaSchedule with a custom pipeline (with only a ValidateOps transformation), we managed to match or surpass the speed of Relay, but in many cases using the 'zero' or 'default_build' pipelines did not improve the performance. A sketch of how we compared the two paths is below.
What is the recommended workflow to reliably reach the performance of Relay (at least on static models)?
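
Roughly how we measured the two paths (a sketch; the target string and the input setup are examples, not our exact benchmark harness):

```python
# Sketch of our Relay-vs-Relax comparison; `relay_mod`, `relax_mod`, and
# `params` stand for the modules produced by the two frontends from the
# same model.
import tvm
from tvm import relay, relax
from tvm.contrib import graph_executor

target = tvm.target.Target("llvm")
dev = tvm.cpu()

# Relay path: standard build, run through the graph executor.
with tvm.transform.PassContext(opt_level=3):
    lib = relay.build(relay_mod, target=target, params=params)
relay_rt = graph_executor.GraphModule(lib["default"](dev))

# Relax path: default build pipeline (no tuning), run through the Relax VM.
ex = relax.build(relax_mod, target=target)
vm = relax.VirtualMachine(ex, dev)
```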

In any case, this does not affect the frontend code in the PR, so we could move forward with that; the frontend is ready to be submitted to TVM. We are just curious for debugging/measurement reasons, as we were surprised by the results.


@tqchen (Member) commented May 9, 2024

I think the main reason here is that Relay incorporates autotuning by default, while Relax does not. The main rationale as of now is that we chose to decouple MetaSchedule tuning from the flow (as tuning is usually slow).

That does not mean MetaSchedule cannot be applied; we do encourage users to apply MetaSchedule for traditional applications. In the build flow, MetaSchedule can be applied by composing it together with the default flow, as sketched below.

The zero pipeline, as of now, mainly focuses on some extra out-of-the-box improvements for the latest LLM models, and can expand to more in the future.
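
A sketch of that composition (the work directory and trial budget are placeholders; APIs as on current main):

```python
# Sketch: compose MetaSchedule tuning with the default Relax build flow.
# `relax_mod` is the module from the frontend; paths/budgets are placeholders.
import tvm
from tvm import relax

target = tvm.target.Target("llvm -num-cores 8")
with target:
    # Tune the TIR functions in the module, recording results to a database...
    mod = relax.transform.MetaScheduleTuneIRMod(
        params={}, work_dir="./ms_work", max_trials_global=2000
    )(relax_mod)
    # ...then rewrite the module to use the best schedules found.
    mod = relax.transform.MetaScheduleApplyDatabase(work_dir="./ms_work")(mod)

ex = relax.build(mod, target=target)  # default build flow on the tuned module
```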


@gyenesvi commented May 9, 2024

Thanks for the info about the schedules and the differences; that makes sense.

As for moving on, what would be the next step now? Do you need any other info from us for reviewing?


@tqchen (Member) commented May 9, 2024

Leaving it open for another week in case others want to chime in, otherwise LGTM


@gyenesvi commented May 9, 2024

Great, thank you!
