
[Runtime] Extend Graph Runtime To Support Cuda Graph Launch #7573

Closed
wants to merge 51 commits into from

Conversation

zhuochenKIDD
Contributor

We are currently using the graph runtime to run some CTR models on NVIDIA GPUs. For our in-house model (around 100 nodes in the TVM JSON graph), cuGraphLaunch reduces latency by 5% to 10% compared to the original for-loop CUDA kernel launch.

So I wonder whether the extension might benefit other workloads; I haven't tested other model types.

This is a POC; I will supplement it with demos/docs.
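For readers unfamiliar with the mechanism, the following is a minimal sketch (not the PR's actual code) of how a CUDA-graph-enabled runtime can replace N individual kernel launches with a single graph launch via stream capture. It assumes CUDA 10 or later; the kernel is a placeholder and error handling is elided.

```cuda
#include <cuda_runtime.h>

// Placeholder for one of the many tiny kernels in the model.
__global__ void tiny_kernel(float* data, int n) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) data[i] += 1.0f;
}

void run_with_cuda_graph(float* d_data, int n, int num_kernels) {
  cudaStream_t stream;
  cudaStreamCreate(&stream);

  // 1. Capture: record the whole launch sequence once, instead of
  //    submitting each kernel individually on every inference call.
  cudaGraph_t graph;
  cudaStreamBeginCapture(stream, cudaStreamCaptureModeGlobal);
  for (int k = 0; k < num_kernels; ++k) {
    tiny_kernel<<<(n + 255) / 256, 256, 0, stream>>>(d_data, n);
  }
  cudaStreamEndCapture(stream, &graph);

  // 2. Instantiate once; CUDA does most of the setup work up front here.
  cudaGraphExec_t exec;
  cudaGraphInstantiate(&exec, graph, nullptr, nullptr, 0);

  // 3. Replay: one launch per inference replaces num_kernels launches.
  cudaGraphLaunch(exec, stream);
  cudaStreamSynchronize(stream);

  cudaGraphExecDestroy(exec);
  cudaGraphDestroy(graph);
  cudaStreamDestroy(stream);
}
```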

zhuochenKIDD and others added 4 commits March 2, 2021 20:44
* [RELAY] Modify some passes to not stack overflow on many lets.

Passes modified:
- inline primitives
- dead code
- lambda lift

* one fix

* small fix

* .at -> []

* fix
@@ -481,6 +481,15 @@ TVM_DLL int TVMStreamFree(int device_type, int device_id, TVMStreamHandle stream
*/
TVM_DLL int TVMSetStream(int device_type, int device_id, TVMStreamHandle handle);


TVM_DLL int TVMStreamBeginCapture(int device_type, int device_id, TVMStreamHandle stream);
Member

Given that cuGraph is specific to CUDA at the moment, let us not introduce it to the DeviceAPI for now; instead, just use the CUDA API in the cugraph runtime.

Contributor Author

Thanks, already removed.

@tqchen
Member

tqchen commented Mar 3, 2021

cc @trevor-m @antinucleon @merrymercy please help to review

leeexyz and others added 14 commits March 3, 2021 09:30
…scope (#7497)

* [Tensorize] Support conds depend on outer loop vars inside tensorize scope

* Reformat
* Update Vitis AI CI PyXIR version to v0.1.6

* Add --depth 1 to PyXIR clone command
* Add SPIR-V lowering for WhileNode

* test vulkan in while loop tests
* [RUNTIME] Move Map into runtime

This allows us to use Map to store parameters needed at runtime.

* node.{Array|Map} -> runtime.{Array|Map}

* missed some renames
* [AutoScheduler] Query in task extraction

* trigger ci
- Updated ethosn relay backend to support 20.11 api changes.
 - Removed legacy support for 20.05.
 - Added a mechanism to specify the ethosn driver stack version.
* Rewrite the Rust Module API and change some imports causing crashes.

This commit also updates the docs to remove outdated information.

* Renable Python test and remove warnings

* Python test still flaky

* Fix broken module test

* Fix broken test

* Reset test file
…add dynamic bug (#7562)

* Add segment sum Op

* Remove unnecessary

* Documentation

* Black

* Add GPU

* Uncomment

* Add documentation

* Add dynamic tests

* Add TF Op

* Add Sparse Segment Sum

* Add test coverage

* PR Comments

* Int64 tests

* Add SparseSegmentSqrtN

* Add SparseSegmentSqrtNOp

* Deduplicate code

* Add SparseSegmentMean

* Parametrize Tests

* Remove

* Modularize

* Black

* Modularize Code

* Pylint

* PR Comments

* Add scatter add tests

* Remove Test

Co-authored-by: Ubuntu <ubuntu@ip-172-31-42-251.us-east-2.compute.internal>
@tkonolige
Contributor

@zhuochenKIDD Do you have a guess as to why this is faster than the for-loop launch approach?

Trevor Morris and others added 2 commits March 4, 2021 14:10
…7581)

* Prevent TRT runtime crash for duplicate inputs and outputs

* Add empty subgraph unit test
@zhuochenKIDD
Contributor Author

@zhuochenKIDD Do you have a guess as to why this is faster than the for-loop launch approach?

@tkonolige I'm not sure why it's faster; it's based on testing and depends on the workload.
I guess that my model, which has many tiny kernels, is more kernel-launch bound, and CUDA graphs might reduce the kernel-launch overhead. I will do more profiling & analysis. Do you have any suggestions?

hgt312 and others added 7 commits March 5, 2021 13:47
#7539

Co-authored-by: guoweijun <guoweijun@baidu.com>
…7313)

* Add sparse dense tuning tutorial

* Add sparse input fusion

* Update the dag to support output fusion

* Update

* Add task input to search_task

* Update

* Add search_inputs to measure

* Lint fix

* Lint fix

* Update

* Update

* Update

* Update

* Add file save load support

* Update

* Update

* Update

* Remove add_task_inputs API

* Update

* Update

* Update

* Lint fix

* Lint fix

* Lint fix

* Lint fix

* Update

* Add example ci_log

* Update

* retrigger ci

* Update

* Update

* Update

* Lint fix

* Lint fix

* Lint fix
)

* Move SimplifyConvPad to a new pass and don't enable it by default

* rename pass

* move files

* fix lint

* adjust test tolerance
…ecutor (#7604)

* properly return and unflatten outputs from GraphExecutor

* lint

* cleaner approach, not sure what I was thinking before

* remove unused import

* forgot copyto cpu

* make solution even cleaner using iterator
* [Torch] support hardsigmoid

* qhswish first impl

* add qhardsigmoid but the result is not correct

* add qmv3 to test

* comment fix
@FrozenGene
Member

In terms of the interface, can we use GraphRuntimeFactory to dispatch the extended runtime so that it would be more straightforward for users?

cc @FrozenGene @zhiics @icemelon9 @vinx13

If I understand correctly, we have already done this for the graph runtime, but not for the VM or other runtimes.

@FrozenGene
Member

@zhuochenKIDD Besides requiring CUDA 10, does CUDA Graph require a specific generation of GPU hardware, like Tesla or Ampere?

@FrozenGene
Member

@zhuochenKIDD Do you have a guess as to why this is faster than the for-loop launch approach?

@tkonolige I'm not sure why it's faster; it's based on testing and depends on the workload.
I guess that my model, which has many tiny kernels, is more kernel-launch bound, and CUDA graphs might reduce the kernel-launch overhead. I will do more profiling & analysis. Do you have any suggestions?

I think the answer is here: https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#cuda-graphs

This allows a graph to be defined once and then launched repeatedly. Separating out the definition of a graph from its execution enables a number of optimizations: first, CPU launch costs are reduced compared to streams, because much of the setup is done in advance; second, presenting the whole workflow to CUDA enables optimizations which might not be possible with the piecewise work submission mechanism of streams.
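The quoted explanation can be made concrete with a back-of-envelope model: if each individual kernel launch carries a fixed CPU-side cost, a single graph launch amortizes that cost across the whole model. All numbers below are hypothetical, chosen only to illustrate how a 5-10% reduction could arise for a ~100-node model with tiny kernels.

```python
def end_to_end_latency_us(num_kernels, kernel_us, per_launch_us, graph=False):
    """Toy model: total latency = launch overhead + kernel execution time.

    With a CUDA graph, the per-kernel launch overhead is paid once for the
    whole graph; with a for-loop, it is paid once per kernel.
    """
    launch_overhead = per_launch_us if graph else num_kernels * per_launch_us
    return launch_overhead + num_kernels * kernel_us

# ~100 nodes like the CTR model in the PR description, tiny 5us kernels,
# and an assumed 0.5us CPU-side cost per individual kernel launch.
loop_latency = end_to_end_latency_us(100, 5.0, 0.5)               # 550.0
graph_latency = end_to_end_latency_us(100, 5.0, 0.5, graph=True)  # 500.5
reduction = (loop_latency - graph_latency) / loop_latency         # ~9%
```

The savings in this model grow as kernels get smaller relative to the launch overhead, which matches the observation that the benefit depends on the workload.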

@FrozenGene
Member

@zhuochenKIDD Besides requiring CUDA 10, does CUDA Graph require a specific generation of GPU hardware, like Tesla or Ampere?

I found the answer: https://docs.nvidia.com/cuda/cuda-samples/index.html#simple-cuda-graphs — no special GPU hardware is required; it is an optimization at the software level.

zhuochenKIDD and others added 2 commits March 8, 2021 20:51
…7602)

* [TE] Fix bug in AutoInlineElemWise and implement AutoInlineBroadcast

* [TE] Add AutoInlineBroadcast API to schedule_pass.h
@comaniac
Contributor

comaniac commented Mar 8, 2021

If I understand correctly, we have already done this for the graph runtime, but not for the VM or other runtimes.

This PR is for the graph runtime, so it's not an issue for other runtimes? My concern is whether we can reduce the API complexity for users. Right now users need to call tvm.contrib.cu_graph.cugraph_runtime.create(...) to make use of this runtime, but this is not recommended. We should make all graph runtimes created from graph_runtime.GraphModule.
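The dispatch being suggested here can be sketched as a simple factory pattern: one user-facing entry point that returns either the default graph runtime or the CUDA-graph-enabled one. The class and function names below are hypothetical, not TVM's actual API — this is only a minimal illustration of the pattern.

```python
class GraphModule:
    """Stand-in for the default graph runtime module."""
    kind = "graph_runtime"

    def run(self):
        return "launch kernels in a for-loop"


class CudaGraphModule(GraphModule):
    """Stand-in for the CUDA-graph-enabled runtime module."""
    kind = "cuda_graph_runtime"

    def run(self):
        return "replay captured CUDA graph"


def create_graph_module(device, use_cuda_graph=False):
    """Single entry point: users never import a cu_graph-specific module.

    The factory picks the right runtime based on the device and an option,
    so both runtimes present the same GraphModule interface.
    """
    if device == "cuda" and use_cuda_graph:
        return CudaGraphModule()
    return GraphModule()


mod = create_graph_module("cuda", use_cuda_graph=True)
print(mod.kind)  # cuda_graph_runtime
```

The point of the design is that callers only ever hold a GraphModule, so switching the launch strategy does not change user code.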

monklof and others added 4 commits March 8, 2021 10:50
* add ShapeFunc for tanh

* _schedule_dense_small_batch turn autotvm off when dense's inner dim is unknown

* fix CI pylint
* [Relay] Fix relay op strategy for cuda dense int8

* Remove uint8 && Add autotvm task extraction test for relay graph that contains dense op (int8 * int8 -> int32)

* Reformat the code of test case
* [Relay] add ShapeFunc for one_hot op

* fix pylint

* add test for shapefunc of one_hot op
@FrozenGene
Member

If I understand correctly, we have already done this for the graph runtime, but not for the VM or other runtimes.

This PR is for the graph runtime, so it's not an issue for other runtimes? My concern is whether we can reduce the API complexity for users. Right now users need to call tvm.contrib.cu_graph.cugraph_runtime.create(...) to make use of this runtime, but this is not recommended. We should make all graph runtimes created from graph_runtime.GraphModule.

No, that's not it. I just meant that our GraphRuntimeFactory was designed for the purpose you mentioned. However, GraphRuntimeFactory is currently applied only to GraphRuntime, not to the VM or other runtimes. But it could be used for this purpose.

@zhuochenKIDD
Contributor Author

I opened a cleaner PR; closing this one.
#7616
