[VM] Per-input, data dependence specification for shape func #7210
Conversation
I like this, I think it makes a lot of sense, but I'll defer mainly to @zhiics and @icemelon9 since they implemented the infrastructure for heterogeneous shape functions.
cc @icemelon9 @zhiics Any thoughts on this issue? I think this is important to discuss.
I am okay with the change but I am not sure if there is a better solution. @icemelon9 can you take a look?
@icemelon9 it should be ready
LGTM
Thanks @masahi @zhiics @mbrookhart
…7210) * made TShapeDataDependant array * add stub * dyn strided slice working * reshape also working * remove log * works on maskrcnn * lint fix * fix cpp test * remove stale pop back * add more doc * dependant -> dependent * remove redundant check * remove data_dependent_
Motivation
Currently, shape functions are executed on the CPU even if the model is running on the GPU. Each shape function is declared with a data_dependent flag, specifying whether the shape function needs the input tensors themselves or only the shapes of the input tensors to compute the output shape:

tvm/python/tvm/relay/op/op.py, lines 359 to 368 in 9e766d9
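As a rough illustration (the op name and the shape function below are placeholders, not taken from op.py), a shape function is attached to an op like this:

```python
from tvm.relay.op import op as _reg

def _my_shape_func(attrs, inputs, out_ndims):
    # Would compute tensors describing the output shapes; body omitted in this sketch.
    raise NotImplementedError

# The second argument is the data_dependent flag for the whole op:
#   True  -> the VM passes the input tensors themselves to the shape function
#   False -> the VM passes only the shapes of the inputs
_reg.register_shape_func("my_op", False, _my_shape_func)
```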
When an op's data_dependent flag is true, the VM does a device-to-host memcpy of the entire input tensors before running that op's shape func on the CPU. In particular, since dyn.strided_slice has data_dependent set to True, the VM does a device-to-host memcpy of the to-be-sliced tensor on every dyn.strided_slice invocation, which can be very expensive if the tensor is big:

tvm/python/tvm/relay/op/dyn/_transform.py, line 195 in 9e766d9
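The registration referred to above looks roughly like this (a sketch of the pre-PR state, not copied verbatim from _transform.py):

```python
# dyn.strided_slice declares itself data-dependent as a whole, so the VM copies
# every input (including the large data tensor) from device to host, even though
# only begin/end/strides actually need their values.
_reg.register_shape_func("dyn.strided_slice", True, _strided_slice_shape_func)
```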
In fact, one of the bottlenecks of running PyTorch MaskRCNN on the GPU is this repeated device-to-host memcpy, as shown in the profiler output below; most of it is for the dyn.strided_slice shape func. CUDA memcpy HtoD is also very slow, but that one is necessary to send large parameters to the GPU once.

If we think about it, these expensive copies are completely useless: the shape func of dyn.strided_slice only needs the data shape. But since the other arguments begin, end, and strides require their tensor values, dyn.strided_slice has to declare its data_dependent flag as True. This issue can be resolved if we let each op declare its data dependence per input.

Proposed solution
Luckily, the decision to insert a device-to-host memcpy before the shape func is already made for each input separately, as shown below:

tvm/python/tvm/relay/transform/memory_alloc.py, lines 207 to 221 in 4c4888b
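Conceptually (a self-contained toy, not the actual memory_alloc.py code; the state names and helper parameters are invented for this sketch), the per-input decision looks like this:

```python
SHAPE_ONLY, TENSOR_NEEDED = 1, 2

def prepare_shape_func_inputs(args, input_states, arg_on_cpu, copy_to_cpu, shape_of):
    """Collect inputs for a shape function, copying to the host only when needed."""
    shape_func_ins = []
    for arg, state in zip(args, input_states):
        if state == SHAPE_ONLY:
            # Only the shape is needed: no device-to-host copy of the tensor data.
            shape_func_ins.append(shape_of(arg))
        else:
            # The tensor values are needed on the CPU, so copy if it lives on a device.
            if not arg_on_cpu(arg):
                arg = copy_to_cpu(arg)
            shape_func_ins.append(arg)
    return shape_func_ins
```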
So all we need is a way to specify per-input data dependence for each op and to pass this information to the ManifestAllocPass above. This PR enables such a specification, like this:
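(The snippet below is a sketch reconstructing the per-input registration; the exact flag list for dyn.strided_slice is my reading of the change, not copied from the diff.)

```python
# Per-input data dependence: one flag per input instead of a single flag for the op.
# For dyn.strided_slice only begin/end/strides need their values; the data tensor
# contributes only its shape, so no device-to-host copy is inserted for it.
_reg.register_shape_func(
    "dyn.strided_slice", [False, True, True, True], _strided_slice_shape_func
)
```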
I've also updated compile_engine.cc accordingly to send the per-input data dependence information to ManifestAllocPass. The change I made is the minimum necessary to achieve my goal, so it can be improved. With this PR, I was able to remove all of the expensive device-to-host memcpy, which cuts GPU MaskRCNN runtime by 14 milliseconds. More importantly, the purpose of this PR is to make people aware of this problem and to decide on the best solution.

please review @icemelon9 @zhiics @kevinthesun @mbrookhart