Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Runtime][PipelineExecutor] Tutorial of using pipeline executor. #11557

Merged
merged 39 commits into from
Jul 22, 2022

Conversation

huajsj
Copy link
Contributor

@huajsj huajsj commented Jun 3, 2022

RFC:https://github.com/apache/tvm-rfcs/blob/main/rfcs/0014-pipeline-executor.md
issue: #8596
Tutorial of using pipeline executor including the byoc use case.

This tutorial need to enable "USE_PIPELINE_EXECUTOR","USE_DNNL_CODEGEN" on config.cmake with MKL-DNN installed, not sure if the "How To Guides" is a better fit.

cc @areusch, @masahi

# own splitting function logic.
import os

os.sys.path.append(os.path.abspath(os.environ["TVM_HOME"] + "/tests/python/relay"))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i think unfortunately right now this has to be done with relative paths. you can debug this with tests/scripts/ci.py docs i believe.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@areusch , thanks for the follow up, the path issue get fixed, but seems like the ci box not enabled dnnl or not installed mkldnn?, then the tutorial still can not execute, to handle such issue, I put the BYOC part into a function and comment the function execution to avoid the DNNL execution error issue.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it should have whatever is enabled in ci_gpu, and that's determined partly by Dockerfile.ci_gpu and by tests/scripts/task_config_build_gpu.sh. you could propose a change there if you need something for your tutorial (just add to this PR).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@areusch, the Jekins file used a fixed docker image, the ci still running the tutorial file without apply the change in Dockerfile.ci_gpu.
I can saw the new gpu docker image get uploaded into aws ecr, and can not found it on tlcstaging of docker hub, could I know what is process to request upload the new docker image to fix my issue?

Copy link
Contributor

@areusch areusch Jun 16, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

cc @driazati i think we need to set the ecr image repo to be public, or push those images to dockerhub. thoughts?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We could definitely do that and probably will soon. as a stop gap in the meantime @huajsj you can run the docker build locally and pass it to ci.py:

bash docker/build.sh ci_gpu --tag my_ci_gpu
python tests/scripts/ci.py docs --docker-image my_ci_gpu

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@driazati @areusch , thanks for the follow up, I tried the local test, and it work well, about this PR, to make it become green, do you think that is possible to merge PR 11744 first ? which did the dnnl installation change and is the dependency this PR.

@huajsj huajsj requested a review from areusch June 9, 2022 02:56
@huajsj huajsj force-pushed the pipeline-tutorial branch 4 times, most recently from f1187c0 to d10da4e Compare June 14, 2022 07:40
@huajsj huajsj mentioned this pull request Jun 17, 2022
7 tasks
@areusch
Copy link
Contributor

areusch commented Jun 21, 2022

blocked on #11774

@huajsj huajsj force-pushed the pipeline-tutorial branch 2 times, most recently from 2691730 to 81f2d49 Compare June 23, 2022 18:21
@huajsj huajsj force-pushed the pipeline-tutorial branch 2 times, most recently from 0ca864b to 58c7e85 Compare July 5, 2022 05:53
@huajsj
Copy link
Contributor Author

huajsj commented Jul 6, 2022

blocked on #12020

@huajsj
Copy link
Contributor Author

huajsj commented Jul 18, 2022

@areusch @masahi , @driazati , the CI is green now.
The latest changes is that we are using the CUTLASS to replace DNNL in the BYOC use case, and such change fixed the sphix crash issue. now all CI test passed. Please take a look.

@huajsj
Copy link
Contributor Author

huajsj commented Jul 19, 2022

@masahi, please take a look.

@masahi masahi self-assigned this Jul 20, 2022
pipe_config[mod0].target = "llvm"
pipe_config[mod0].dev = tvm.cpu(0)
###############################################################################
# Set the cpu afinity for control flow, for example using cpu 0 for control flow.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please clarify what is meant by "control flow", and why we need to do this.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

when we run backend with executor for example cutlass, both cpu and gpu would get involved for the execution, cpu part response for preparing data, pre/post processing, transfer data between layer etc, I call this part as control flow.

under multiple backend situation, for example in this tutorial that is LLVM + CUTLASS, the 2 control flow will compete the cpu resource, and cause a lot of thread context switch, or cpu migration. These type resource competing will slow down the performance. by using the affinity setting, we associate a backend to a particular cpu group to avoid the said overhead.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"control flow" usually means if/else or for loop in TVM or in general. How about "host operations"?

This also doesn't sound like something most users should be concerned about. I suggest removing affinity stuff from the tutorial and set the default affinity inside some runtime function. If you require affinity control by users, please summarize and add what you said above to the tutorial with correct English.

gallery/how_to/work_with_relay/using_pipeline_executor.py Outdated Show resolved Hide resolved
###########################################
# Splitting the network into two subgraphs.
# -----------------------------------------
# It is an example that the graph splitting function comes from a unit test. User can create a
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The first sentence is broken and makes no sense..

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

changed into “This function called 'graph_split' from a unit test is just an example. User can create a customized logic to split the graph.”

@huajsj huajsj requested a review from masahi July 21, 2022 14:59
import inspect
import os

test_path = os.path.dirname(inspect.getfile(lambda: None))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think you can simply use __file__ here instead of inspect. And rename test_path to tutorial_dir.

Copy link
Contributor Author

@huajsj huajsj Jul 22, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

replace "test_path" with "tutorial_dir",
the reason we use inspect instead of file is because that __file__ not work with sphinx-gallery which is used by tvm doc
huajsj@8d2bfc3

pipe_config[mod1].export_cc = "nvcc"
#################################################################################
# Set the cpu afinity for control flow, for example using cpu 1 for control flow.
pipe_config[mod1].cpu_affinity = "1"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

pipe_config[mod1].cpu_affinity is written twice, here and at L166.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

removed.

pipe_config[mod0].target = "llvm"
pipe_config[mod0].dev = tvm.cpu(0)
###############################################################################
# Set the cpu afinity for control flow, for example using cpu 0 for control flow.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"control flow" usually means if/else or for loop in TVM or in general. How about "host operations"?

This also doesn't sound like something most users should be concerned about. I suggest removing affinity stuff from the tutorial and set the default affinity inside some runtime function. If you require affinity control by users, please summarize and add what you said above to the tutorial with correct English.

pipe_config[mod1].build_func = cutlass_build
pipe_config[mod1].export_cc = "nvcc"
#################################################################################
# Set the cpu afinity for control flow, for example using cpu 1 for control flow.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

typo: afinity

Copy link
Contributor Author

@huajsj huajsj Jul 22, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

removed the affinity and use default tvm threadpoll default affinity logic.

pipe_config[mod1].cpu_affinity = "1"
pipe_config["input"]["data"].connect(pipe_config[mod0]["input"]["data"])
pipe_config[mod0]["output"][0].connect(pipe_config[mod1]["input"]["data_n_0"])
pipe_config[mod1]["output"]["0"].connect(pipe_config["output"][0])
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are these three lines related to affinity control? You should have another ######## before them and explain what they do.

I have to say, this is not a good API. For example, where the names "data" and "data_n_0" come from? What is pipe_config[mod0]["output"][0]? And why you use "0" at L178?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

these three line related connect subgraph to build pipeline instead of affinity, added detail explain.

"data" and "data_n_0" coming from subgraphs which is a list of subgraph, by print(subgraph[0]) , print(subgraph[1]) the said "data" and "data_n_0" will shown. if here give a wrong name which not exist , the API will throw a error.

pipe_config[mod0]["output"][0] means "the first output interface" of "mod0", line 178 "0" is typo , fixed.

Copy link
Member

@masahi masahi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

okay I think this is the last typo fix.

gallery/how_to/work_with_relay/using_pipeline_executor.py Outdated Show resolved Hide resolved
gallery/how_to/work_with_relay/using_pipeline_executor.py Outdated Show resolved Hide resolved
gallery/how_to/work_with_relay/using_pipeline_executor.py Outdated Show resolved Hide resolved
@masahi masahi merged commit ecd3c88 into apache:main Jul 22, 2022
xinetzone pushed a commit to daobook/tvm that referenced this pull request Nov 25, 2022
…che#11557)

* [Runtime][PipelineExecutor]  Tutorial of using pipeline executor.

Tutorial of using pipeline executor including the byoc use case.

* fix ci issue

* document change.

* triger build

* fix doc issue

* fix ci issue

* doc issue

* fix ci issue

* fix ci issue.

* fix __file__ not found problem.

this is a known issue of sphinx-gallery
sphinx-gallery/sphinx-gallery#211

* fix byoc with dnnl issue

* enable dnnl and pipeline executor

* trigger build

* trigger build

* fix build issue

* trigger build

* oneflow cause crash, do test with change

* add sphinx skip

* plint

* remove from_oneflow change test.

* remove pipeline executor change for test

* plint

* enable DNNL and pipeline

* disable DNNL

* enable DNNL without pipeline

* remove dnnl and add cutlass

* use cutlass with byoc

* change into cutlass

* fix doc convention issue

* remove duplicate variable

* fix plint issue.

* address review comments.

* address review comments

* fix bug.

* polish the document

* fix plint issue

* address review comments.

* address review comments

* address review comments
mikeseven pushed a commit to mikeseven/tvm that referenced this pull request Sep 27, 2023
…che#11557)

* [Runtime][PipelineExecutor]  Tutorial of using pipeline executor.

Tutorial of using pipeline executor including the byoc use case.

* fix ci issue

* document change.

* triger build

* fix doc issue

* fix ci issue

* doc issue

* fix ci issue

* fix ci issue.

* fix __file__ not found problem.

this is a known issue of sphinx-gallery
sphinx-gallery/sphinx-gallery#211

* fix byoc with dnnl issue

* enable dnnl and pipeline executor

* trigger build

* trigger build

* fix build issue

* trigger build

* oneflow cause crash, do test with change

* add sphinx skip

* plint

* remove from_oneflow change test.

* remove pipeline executor change for test

* plint

* enable DNNL and pipeline

* disable DNNL

* enable DNNL without pipeline

* remove dnnl and add cutlass

* use cutlass with byoc

* change into cutlass

* fix doc convention issue

* remove duplicate variable

* fix plint issue.

* address review comments.

* address review comments

* fix bug.

* polish the document

* fix plint issue

* address review comments.

* address review comments

* address review comments
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants