[Runtime] Extend Graph Runtime To Support Cuda Graph Launch #7573
Conversation
include/tvm/runtime/c_runtime_api.h (Outdated)
@@ -481,6 +481,15 @@ TVM_DLL int TVMStreamFree(int device_type, int device_id, TVMStreamHandle stream
 */
TVM_DLL int TVMSetStream(int device_type, int device_id, TVMStreamHandle handle);

TVM_DLL int TVMStreamBeginCapture(int device_type, int device_id, TVMStreamHandle stream);
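For context, a minimal sketch of the CUDA stream-capture sequence that a `TVMStreamBeginCapture`-style API would wrap. This is illustrative only: it assumes CUDA 10+, error handling is elided, and `kernel_a`/`kernel_b`, `grid`, and `block` are placeholder names, not part of the PR.

```cuda
// Sketch of the capture/instantiate/launch sequence (CUDA 10+).
cudaStream_t stream;
cudaStreamCreate(&stream);

// 1. Begin capture: work issued to `stream` is recorded, not executed.
cudaStreamBeginCapture(stream, cudaStreamCaptureModeGlobal);
kernel_a<<<grid, block, 0, stream>>>(/* args */);
kernel_b<<<grid, block, 0, stream>>>(/* args */);

// 2. End capture: the recorded work becomes a cudaGraph_t.
cudaGraph_t graph;
cudaStreamEndCapture(stream, &graph);

// 3. Instantiate once, then replay the whole graph with a single launch.
cudaGraphExec_t graph_exec;
cudaGraphInstantiate(&graph_exec, graph, nullptr, nullptr, 0);
cudaGraphLaunch(graph_exec, stream);
cudaStreamSynchronize(stream);
```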
Given CUDA Graph is specific to CUDA at the moment, let us not introduce it into the DeviceAPI for now; instead, just use the CUDA API in the cugraph runtime.
Thanks, already removed.
cc @trevor-m @antinucleon @merrymercy please help to review
@zhuochenKIDD Do you have a guess as to why this is faster than the for-loop launch approach?
@tkonolige Not sure why it's faster; that's based on testing, and it depends on the workload.
If I understand correctly, we have done this for the graph runtime before, but not for the VM or other runtimes.
@zhuochenKIDD Besides requiring CUDA 10, does CUDA Graph require a particular generation of GPU hardware, like Tesla or Ampere?
I think the answer is here: https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#cuda-graphs
I found the answer: https://docs.nvidia.com/cuda/cuda-samples/index.html#simple-cuda-graphs. No special GPU hardware is required; it is an optimization at the software level.
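Since the requirement is only the CUDA version, not the device generation, a build can verify support with a runtime-version query. A minimal sketch using only the standard CUDA runtime API (error handling elided):

```cuda
// Sketch: CUDA Graphs need CUDA 10+, not a particular GPU generation,
// so checking the runtime version is sufficient.
int runtime_version = 0;
cudaRuntimeGetVersion(&runtime_version);  // e.g. 10020 for CUDA 10.2
bool cuda_graph_supported = runtime_version >= 10000;
```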
This PR is for the graph runtime, so it's not an issue for other runtimes? My concern is whether we can reduce the API complexity for users. Now users need to use
Not it. I just meant our
I opened a cleaner PR; closing this one.
We are currently using the graph runtime to run some CTR models on NVIDIA GPUs. For our in-house model (around 100 nodes in the TVM JSON graph), cuGraphLaunch reduces latency by 5% to 10% compared with the original for-loop CUDA kernel launch.
So I wonder whether the extension might benefit other workloads; I haven't tested other types of models.
This is a POC; I will supplement demos/docs.
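To illustrate where the 5-10% likely comes from: the baseline graph runtime issues one CPU-side launch call per node on every inference, while the CUDA Graph path pays that per-kernel overhead once at capture time. A hypothetical sketch of the two strategies (`node_kernels`, `grids`, `blocks`, `num_nodes`, and `graph_exec` are placeholder names, not TVM APIs; error handling elided):

```cuda
// Baseline: N launch calls per inference. With ~100 nodes, the fixed
// CPU cost of each kernel launch adds up.
for (int i = 0; i < num_nodes; ++i) {
  node_kernels[i]<<<grids[i], blocks[i], 0, stream>>>(/* node args */);
}

// CUDA Graph path: the same N kernels were captured once into
// `graph_exec`; each inference replays them with a single call,
// amortizing the per-kernel launch overhead.
cudaGraphLaunch(graph_exec, stream);
```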