
[Runtime] Extend Graph Runtime To Support Cuda Graph Launch #7573

Closed
wants to merge 51 commits into from

Conversation

zhuochenKIDD
Contributor

We are currently using the graph runtime to run some CTR models on NVIDIA GPUs. For our in-house model (around 100 nodes in the TVM JSON graph), cuGraphLaunch reduces latency by 5% to 10% compared to the original for-loop CUDA kernel launch.

So I wonder whether the extension might benefit other workloads; I haven't tested other model types.

This is a POC; I will supplement it with demos/docs.
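For readers unfamiliar with the mechanism, the following is a minimal sketch (not the PR's actual code) of how a CUDA-graph-enabled runtime can replace N individual kernel launches with a single graph launch via stream capture. It assumes CUDA 10 or later; the kernel is a placeholder and error handling is elided.

```cuda
#include <cuda_runtime.h>

// Placeholder for one of the many tiny kernels in the model.
__global__ void tiny_kernel(float* data, int n) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) data[i] += 1.0f;
}

void run_with_cuda_graph(float* d_data, int n, int num_kernels) {
  cudaStream_t stream;
  cudaStreamCreate(&stream);

  // 1. Capture: record the whole launch sequence once, instead of
  //    submitting each kernel individually on every inference call.
  cudaGraph_t graph;
  cudaStreamBeginCapture(stream, cudaStreamCaptureModeGlobal);
  for (int k = 0; k < num_kernels; ++k) {
    tiny_kernel<<<(n + 255) / 256, 256, 0, stream>>>(d_data, n);
  }
  cudaStreamEndCapture(stream, &graph);

  // 2. Instantiate once; CUDA does most of the setup work up front here.
  cudaGraphExec_t exec;
  cudaGraphInstantiate(&exec, graph, nullptr, nullptr, 0);

  // 3. Replay: one launch per inference replaces num_kernels launches.
  cudaGraphLaunch(exec, stream);
  cudaStreamSynchronize(stream);

  cudaGraphExecDestroy(exec);
  cudaGraphDestroy(graph);
  cudaStreamDestroy(stream);
}
```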

zhuochenKIDD and others added 4 commits March 2, 2021 20:44
* [RELAY] Modify some passes to not stack overflow on many lets.

Passes modified:
- inline primitives
- dead code
- lambda lift

* one fix

* small fix

* .at -> []

* fix
@@ -481,6 +481,15 @@ TVM_DLL int TVMStreamFree(int device_type, int device_id, TVMStreamHandle stream
*/
TVM_DLL int TVMSetStream(int device_type, int device_id, TVMStreamHandle handle);


TVM_DLL int TVMStreamBeginCapture(int device_type, int device_id, TVMStreamHandle stream);
Member

Given that cuGraph is specific to CUDA at the moment, let us not introduce it to the DeviceAPI for now; instead, just use the CUDA API in the cugraph runtime.

Contributor Author

Thanks, already removed.

@tqchen
Member

tqchen commented Mar 3, 2021

cc @trevor-m @antinucleon @merrymercy please help to review

leeexyz and others added 14 commits March 3, 2021 09:30
…scope (#7497)

* [Tensorize] Support conds depend on outer loop vars inside tensorize scope

* Reformat
* Update Vitis AI CI PyXIR version to v0.1.6

* Add --depth 1 to PyXIR clone command
* Add SPIR-V lowering for WhileNode

* test vulkan in while loop tests
* [RUNTIME] Move Map into runtime

This allows us to use Map to store parameters needed at runtime.

* node.{Array|Map} -> runtime.{Array|Map}

* missed some renames
* [AutoScheduler] Query in task extraction

* trigger ci
- Updated ethosn relay backend to support 20.11 api changes.
 - Removed legacy support for 20.05.
 - Added a mechanism to specify the ethosn driver stack version.
* Rewrite the Rust Module API and change some imports causing crashes.

This commit also updates the docs to remove outdated information.

* Renable Python test and remove warnings

* Python test still flaky

* Fix broken module test

* Fix broken test

* Reset test file
…add dynamic bug (#7562)

* Add segment sum Op

* Remove unnecessary

* Documentation

* Black

* Add GPU

* Uncomment

* Add documentation

* Add dynamic tests

* Add TF Op

* Add Sparse Segment Sum

* Add test coverage

* PR Comments

* Int64 tests

* Add SparseSegmentSqrtN

* Add SparseSegmentSqrtNOp

* Deduplicate code

* Add SparseSegmentMean

* Parametrize Tests

* Remove

* Modularize

* Black

* Modularize Code

* Pylint

* PR Comments

* Add scatter add tests

* Remove Test

Co-authored-by: Ubuntu <ubuntu@ip-172-31-42-251.us-east-2.compute.internal>
@tkonolige
Contributor

@zhuochenKIDD Do you have a guess as to why this is faster than the for-loop launch approach?

Trevor Morris and others added 2 commits March 4, 2021 14:10
…7581)

* Prevent TRT runtime crash for duplicate inputs and outputs

* Add empty subgraph unit test
@zhuochenKIDD
Contributor Author

@zhuochenKIDD Do you have a guess as to why this is faster than the for-loop launch approach?

@tkonolige I'm not sure why it's faster; it's based on testing and depends on the workload.
I guess that my model, which has many tiny kernels, is more kernel-launch bound, and CUDA graphs might reduce the kernel-launch overhead. I will do more profiling & analysis. Do you have any suggestions?

hgt312 and others added 7 commits March 5, 2021 13:47
#7539

Co-authored-by: guoweijun <guoweijun@baidu.com>
…7313)

* Add sparse dense tuning tutorial

* Add sparse input fusion

* Update the dag to support output fusion

* Update

* Add task input to search_task

* Update

* Add search_inputs to measure

* Lint fix

* Lint fix

* Update

* Update

* Update

* Update

* Add file save load support

* Update

* Update

* Update

* Remove add_task_inputs API

* Update

* Update

* Update

* Lint fix

* Lint fix

* Lint fix

* Lint fix

* Update

* Add example ci_log

* Update

* retrigger ci

* Update

* Update

* Update

* Lint fix

* Lint fix

* Lint fix
)

* Move SimplifyConvPad to a new pass and don't enable it by default

* rename pass

* move files

* fix lint

* adjust test tolerance
…ecutor (#7604)

* properly return and unflatten outputs from GraphExecutor

* lint

* cleaner approach, not sure what I was thinking before

* remove unused import

* forgot copyto cpu

* make solution even cleaner using iterator
* [Torch] support hardsigmoid

* qhswish first impl

* add qhardsigmoid but the result is not correct

* add qmv3 to test

* comment fix
@FrozenGene
Member

In terms of the interface, can we use GraphRuntimeFactory to dispatch the extended runtime so that it would be more straightforward for users?

cc @FrozenGene @zhiics @icemelon9 @vinx13

If I understand correctly, we have already done this for the graph runtime, but not for the VM or other runtimes.

@FrozenGene
Member

@zhuochenKIDD Besides requiring CUDA 10, does CUDA Graph require a specific generation of GPU hardware, like Tesla or Ampere?

@FrozenGene
Member

@zhuochenKIDD Do you have a guess as to why this is faster than the for-loop launch approach?

@tkonolige I'm not sure why it's faster; it's based on testing and depends on the workload.
I guess that my model, which has many tiny kernels, is more kernel-launch bound, and CUDA graphs might reduce the kernel-launch overhead. I will do more profiling & analysis. Do you have any suggestions?

I think the answer is here: https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#cuda-graphs

This allows a graph to be defined once and then launched repeatedly. Separating out the definition of a graph from its execution enables a number of optimizations: first, CPU launch costs are reduced compared to streams, because much of the setup is done in advance; second, presenting the whole workflow to CUDA enables optimizations which might not be possible with the piecewise work submission mechanism of streams.
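The quoted explanation can be made concrete with a back-of-envelope model: if each individual kernel launch carries a fixed CPU-side cost, a single graph launch amortizes that cost across the whole model. All numbers below are hypothetical, chosen only to illustrate how a 5-10% reduction could arise for a ~100-node model with tiny kernels.

```python
def end_to_end_latency_us(num_kernels, kernel_us, per_launch_us, graph=False):
    """Toy model: total latency = launch overhead + kernel execution time.

    With a CUDA graph, the per-kernel launch overhead is paid once for the
    whole graph; with a for-loop, it is paid once per kernel.
    """
    launch_overhead = per_launch_us if graph else num_kernels * per_launch_us
    return launch_overhead + num_kernels * kernel_us

# ~100 nodes like the CTR model in the PR description, tiny 5us kernels,
# and an assumed 0.5us CPU-side cost per individual kernel launch.
loop_latency = end_to_end_latency_us(100, 5.0, 0.5)               # 550.0
graph_latency = end_to_end_latency_us(100, 5.0, 0.5, graph=True)  # 500.5
reduction = (loop_latency - graph_latency) / loop_latency         # ~9%
```

The savings in this model grow as kernels get smaller relative to the launch overhead, which matches the observation that the benefit depends on the workload.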

@FrozenGene
Member

@zhuochenKIDD Besides requiring CUDA 10, does CUDA Graph require a specific generation of GPU hardware, like Tesla or Ampere?

I found the answer: https://docs.nvidia.com/cuda/cuda-samples/index.html#simple-cuda-graphs — no special GPU hardware is required; it is an optimization at the software level.

zhuochenKIDD and others added 2 commits March 8, 2021 20:51
…7602)

* [TE] Fix bug in AutoInlineElemWise and implement AutoInlineBroadcast

* [TE] Add AutoInlineBroadcast API to schedule_pass.h
@comaniac
Contributor

comaniac commented Mar 8, 2021

If I understand correctly, we have already done this for the graph runtime, but not for the VM or other runtimes.

This PR is for the graph runtime, so it's not an issue for other runtimes? My concern is whether we can reduce the API complexity for users. Right now users need to call tvm.contrib.cu_graph.cugraph_runtime.create(...) to make use of this runtime, but this is not recommended. We should make all graph runtimes created from graph_runtime.GraphModule.
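The dispatch being suggested here can be sketched as a simple factory pattern: one user-facing entry point that returns either the default graph runtime or the CUDA-graph-enabled one. The class and function names below are hypothetical, not TVM's actual API — this is only a minimal illustration of the pattern.

```python
class GraphModule:
    """Stand-in for the default graph runtime module."""
    kind = "graph_runtime"

    def run(self):
        return "launch kernels in a for-loop"


class CudaGraphModule(GraphModule):
    """Stand-in for the CUDA-graph-enabled runtime module."""
    kind = "cuda_graph_runtime"

    def run(self):
        return "replay captured CUDA graph"


def create_graph_module(device, use_cuda_graph=False):
    """Single entry point: users never import a cu_graph-specific module.

    The factory picks the right runtime based on the device and an option,
    so both runtimes present the same GraphModule interface.
    """
    if device == "cuda" and use_cuda_graph:
        return CudaGraphModule()
    return GraphModule()


mod = create_graph_module("cuda", use_cuda_graph=True)
print(mod.kind)  # cuda_graph_runtime
```

The point of the design is that callers only ever hold a GraphModule, so switching the launch strategy does not change user code.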

monklof and others added 4 commits March 8, 2021 10:50
* add ShapeFunc for tanh

* _schedule_dense_small_batch turn autotvm off when dense's inner dim is unknown

* fix CI pylint
* [Relay] Fix relay op strategy for cuda dense int8

* Remove uint8 && Add autotvm task extraction test for relay graph that contains dense op (int8 * int8 -> int32)

* Reformat the code of test case
* [Relay] add ShapeFunc for one_hot op

* fix pylint

* add test for shapefunc of one_hot op
@FrozenGene
Member

If I understand correctly, we have already done this for the graph runtime, but not for the VM or other runtimes.

This PR is for the graph runtime, so it's not an issue for other runtimes? My concern is whether we can reduce the API complexity for users. Right now users need to call tvm.contrib.cu_graph.cugraph_runtime.create(...) to make use of this runtime, but this is not recommended. We should make all graph runtimes created from graph_runtime.GraphModule.

No, that's not it. I just meant that our GraphRuntimeFactory was designed for the purpose you mentioned. However, GraphRuntimeFactory is currently applied only to GraphRuntime, not to the VM or other runtimes. But it could be used for this purpose.

@zhuochenKIDD
Contributor Author

I opened a cleaner PR; closing this one.
#7616
