[Runtime][PipelineExecutor] Tutorial of using pipeline executor. #11557
Conversation
# own splitting function logic.
import os

os.sys.path.append(os.path.abspath(os.environ["TVM_HOME"] + "/tests/python/relay"))
I think unfortunately right now this has to be done with relative paths. You can debug this with tests/scripts/ci.py docs, I believe.
@areusch, thanks for the follow-up. The path issue is fixed, but it seems the CI box does not have DNNL enabled (or does not have MKL-DNN installed), so the tutorial still cannot execute. To handle this, I put the BYOC part into a function and commented out the function call to avoid the DNNL execution error.
it should have whatever is enabled in ci_gpu, and that's determined partly by Dockerfile.ci_gpu and by tests/scripts/task_config_build_gpu.sh. you could propose a change there if you need something for your tutorial (just add to this PR).
@areusch, the Jenkinsfile uses a fixed Docker image, so CI is still running the tutorial file without applying the change in Dockerfile.ci_gpu.
I can see that the new GPU Docker image was uploaded to AWS ECR, but I cannot find it on tlcstaging on Docker Hub. Could you tell me the process to request uploading the new Docker image so I can fix my issue?
cc @driazati i think we need to set the ecr image repo to be public, or push those images to dockerhub. thoughts?
We could definitely do that and probably will soon. As a stopgap in the meantime, @huajsj, you can run the docker build locally and pass it to ci.py:
bash docker/build.sh ci_gpu --tag my_ci_gpu
python tests/scripts/ci.py docs --docker-image my_ci_gpu
blocked on #11774
blocked on #12020
Tutorial of using pipeline executor including the byoc use case.
this is a known issue of sphinx-gallery sphinx-gallery/sphinx-gallery#211
@masahi, please take a look.
pipe_config[mod0].target = "llvm"
pipe_config[mod0].dev = tvm.cpu(0)
###############################################################################
# Set the cpu afinity for control flow, for example using cpu 0 for control flow.
Please clarify what is meant by "control flow", and why we need to do this.
When we run a backend with an executor, for example CUTLASS, both CPU and GPU get involved in the execution. The CPU part is responsible for preparing data, pre/post processing, transferring data between layers, etc.; I call this part the control flow.
With multiple backends, for example LLVM + CUTLASS in this tutorial, the two control flows compete for CPU resources, causing many thread context switches or CPU migrations. This kind of resource contention slows down performance. By using the affinity setting, we associate each backend with a particular CPU group to avoid that overhead.
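The grouping idea described above can be sketched in plain Python. This is only an illustration, not the TVM API: the function name `partition_cpus` and the backend labels are hypothetical, and the output strings mimic the comma-separated affinity strings the pipeline config accepts.

```python
# Hypothetical sketch: partition the available CPU ids between backends so
# their host-side ("control flow") threads do not compete for the same cores.

def partition_cpus(cpu_ids, backends):
    """Split cpu_ids into contiguous, non-overlapping groups, one per backend,
    returning each group as a comma-separated affinity string."""
    n = len(backends)
    size = len(cpu_ids) // n
    groups = {}
    for i, name in enumerate(backends):
        start = i * size
        # The last backend takes any leftover CPUs.
        end = start + size if i < n - 1 else len(cpu_ids)
        groups[name] = ",".join(str(c) for c in cpu_ids[start:end])
    return groups

# On an 8-core machine, give LLVM and CUTLASS disjoint CPU groups.
print(partition_cpus(list(range(8)), ["llvm", "cutlass"]))
# → {'llvm': '0,1,2,3', 'cutlass': '4,5,6,7'}
```

Because the groups are disjoint, the two backends' host threads never migrate onto each other's cores, which is the contention the affinity setting is meant to avoid.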
"control flow" usually means if/else or for loop in TVM or in general. How about "host operations"?
This also doesn't sound like something most users should be concerned about. I suggest removing affinity stuff from the tutorial and set the default affinity inside some runtime function. If you require affinity control by users, please summarize and add what you said above to the tutorial with correct English.
###########################################
# Splitting the network into two subgraphs.
# -----------------------------------------
# It is an example that the graph splitting function comes from a unit test. User can create a
The first sentence is broken and makes no sense.
changed into “This function called 'graph_split' from a unit test is just an example. User can create a customized logic to split the graph.”
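To illustrate what "customized logic to split the graph" could mean, here is a toy sketch. The real graph_split in the TVM unit tests operates on Relay modules; this hypothetical `split_sequence` only conveys the idea of user-chosen split points over a linear sequence of layers.

```python
# Toy example of user-defined split logic: cut a linear sequence of layer
# names into subgraphs before each boundary index. A real splitter would
# partition a Relay module instead of a list of strings.

def split_sequence(layers, boundaries):
    """Return a list of subgraphs, cutting `layers` before each boundary index."""
    cuts = [0] + sorted(boundaries) + [len(layers)]
    return [layers[a:b] for a, b in zip(cuts, cuts[1:])]

net = ["conv1", "bn1", "relu1", "conv2", "bn2", "relu2"]
# Split into two subgraphs at layer index 3.
print(split_sequence(net, [3]))
# → [['conv1', 'bn1', 'relu1'], ['conv2', 'bn2', 'relu2']]
```

Each resulting piece would then become one module (mod0, mod1, ...) in the pipeline configuration.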
import inspect
import os

test_path = os.path.dirname(inspect.getfile(lambda: None))
I think you can simply use __file__ here instead of inspect. And rename test_path to tutorial_dir.
Replaced "test_path" with "tutorial_dir". The reason we use inspect instead of __file__ is that __file__ does not work with sphinx-gallery, which is used by the TVM docs:
huajsj@8d2bfc3
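The inspect trick mentioned above can be seen in isolation. This is a minimal sketch: `current_source_dir` is a hypothetical helper name, and the point is that `inspect.getfile` reads the filename from a code object, so it can work even in execution contexts (like sphinx-gallery's) where the module-level `__file__` binding is absent.

```python
# Resolve the directory of the currently executing source without __file__.
import inspect
import os

def current_source_dir():
    # getfile() reads co_filename from the lambda's code object, so it does
    # not depend on a __file__ name existing in the executing namespace.
    return os.path.dirname(os.path.abspath(inspect.getfile(lambda: None)))

print(current_source_dir())
```

When the script runs normally, this matches `os.path.dirname(os.path.abspath(__file__))`; under tools that exec the source as a string, it degrades to whatever filename the code was compiled with.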
pipe_config[mod1].export_cc = "nvcc"
#################################################################################
# Set the cpu afinity for control flow, for example using cpu 1 for control flow.
pipe_config[mod1].cpu_affinity = "1"
pipe_config[mod1].cpu_affinity is written twice, here and at L166.
removed.
pipe_config[mod1].build_func = cutlass_build
pipe_config[mod1].export_cc = "nvcc"
#################################################################################
# Set the cpu afinity for control flow, for example using cpu 1 for control flow.
typo: afinity
Removed the affinity setting and now use TVM's default threadpool affinity logic.
pipe_config[mod1].cpu_affinity = "1"
pipe_config["input"]["data"].connect(pipe_config[mod0]["input"]["data"])
pipe_config[mod0]["output"][0].connect(pipe_config[mod1]["input"]["data_n_0"])
pipe_config[mod1]["output"]["0"].connect(pipe_config["output"][0])
Are these three lines related to affinity control? You should have another ######## before them and explain what they do.
I have to say, this is not a good API. For example, where do the names "data" and "data_n_0" come from? What is pipe_config[mod0]["output"][0]? And why do you use "0" at L178?
These three lines connect the subgraphs to build the pipeline; they are not related to affinity. Added a detailed explanation.
"data" and "data_n_0" come from the subgraphs, which is a list of subgraphs; printing subgraph[0] and subgraph[1] shows those names. If a wrong name that does not exist is given here, the API throws an error.
pipe_config[mod0]["output"][0] means "the first output interface" of "mod0". The "0" at line 178 was a typo; fixed.
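The wiring those three connect() calls establish can be modeled as a small lookup table. This is only an illustrative sketch of the dataflow being configured, using the same interface names ("data", "data_n_0") discussed above; the tuple encoding and the `downstream` helper are hypothetical, not part of the TVM API.

```python
# Toy model of the pipeline wiring: each producer interface maps to the
# consumer interface it feeds. Tuples are (module, direction, name-or-index).
connections = {
    ("pipeline", "input", "data"): ("mod0", "input", "data"),      # global input -> mod0
    ("mod0", "output", 0): ("mod1", "input", "data_n_0"),          # mod0's first output -> mod1
    ("mod1", "output", 0): ("pipeline", "output", 0),              # mod1's first output -> global output
}

def downstream(producer):
    """Look up which interface consumes the given producer interface."""
    return connections.get(producer)

print(downstream(("mod0", "output", 0)))
# → ('mod1', 'input', 'data_n_0')
```

Reading the table top to bottom traces one inference request through the two-stage pipeline.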
okay I think this is the last typo fix.
…che#11557)

* [Runtime][PipelineExecutor] Tutorial of using pipeline executor. Tutorial of using pipeline executor including the byoc use case.
* fix ci issue
* document change.
* triger build
* fix doc issue
* fix ci issue
* doc issue
* fix ci issue
* fix ci issue.
* fix __file__ not found problem. this is a known issue of sphinx-gallery sphinx-gallery/sphinx-gallery#211
* fix byoc with dnnl issue
* enable dnnl and pipeline executor
* trigger build
* trigger build
* fix build issue
* trigger build
* oneflow cause crash, do test with change
* add sphinx skip
* plint
* remove from_oneflow change test.
* remove pipeline executor change for test
* plint
* enable DNNL and pipeline
* disable DNNL
* enable DNNL without pipeline
* remove dnnl and add cutlass
* use cutlass with byoc
* change into cutlass
* fix doc convention issue
* remove duplicate variable
* fix plint issue.
* address review comments.
* address review comments
* fix bug.
* polish the document
* fix plint issue
* address review comments.
* address review comments
* address review comments
RFC: https://github.com/apache/tvm-rfcs/blob/main/rfcs/0014-pipeline-executor.md
issue: #8596
Tutorial of using pipeline executor including the byoc use case.
This tutorial needs "USE_PIPELINE_EXECUTOR" and "USE_DNNL_CODEGEN" enabled in config.cmake, with MKL-DNN installed. Not sure if the "How To Guides" section is a better fit.
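For readers trying the tutorial, the two config.cmake lines look like this. This is a sketch: it appends the flags to a temporary file just to show the syntax; in a real build you would edit build/config.cmake in your TVM checkout and rebuild.

```shell
# Hypothetical setup: the two cmake cache entries this tutorial requires.
# Written to a temp file here only to demonstrate the flag syntax.
CONFIG="$(mktemp)"
echo 'set(USE_PIPELINE_EXECUTOR ON)' >> "$CONFIG"
echo 'set(USE_DNNL_CODEGEN ON)' >> "$CONFIG"
grep 'USE_' "$CONFIG"
```

MKL-DNN (oneDNN) must also be installed on the system for USE_DNNL_CODEGEN to build.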
cc @areusch, @masahi