[VM] Per-input, data dependence specification for shape func #7210
Conversation
I like this, I think it makes a lot of sense, but I'll defer mainly to @zhiics and @icemelon9 since they implemented the infrastructure for heterogeneous shape functions.
cc @icemelon9 @zhiics Any thoughts on this issue? I think this is important to discuss.
I am okay with the change but I am not sure if there is a better solution. @icemelon9 can you take a look?
@icemelon9 it should be ready
LGTM
Thanks @masahi @zhiics @mbrookhart
…7210) * made TShapeDataDependant array * add stub * dyn strided slice working * reshape also working * remove log * works on maskrcnn * lint fix * fix cpp test * remove stale pop back * add more doc * dependant -> dependent * remove redundant check * remove data_dependent_
Motivation
Currently, shape functions are executed on the CPU even if the model is running on the GPU. Each shape function is declared with a data_dependent flag, specifying whether the shape function needs the input tensors themselves or only the shapes of the input tensors to compute the output shape:

tvm/python/tvm/relay/op/op.py, lines 359 to 368 in 9e766d9
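As a rough illustration (the op name and the shape function below are placeholders, not taken from op.py), a shape function is attached to an op like this:

```python
from tvm.relay.op import op as _reg

def _my_shape_func(attrs, inputs, out_ndims):
    # Would compute tensors describing the output shapes; body omitted in this sketch.
    raise NotImplementedError

# The second argument is the data_dependent flag for the whole op:
#   True  -> the VM passes the input tensors themselves to the shape function
#   False -> the VM passes only the shapes of the inputs
_reg.register_shape_func("my_op", False, _my_shape_func)
```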
When an op's data_dependent flag is true, the VM does a device-to-host memcpy of the entire input tensors before running that op's shape func on the CPU. In particular, since dyn.strided_slice has data_dependent set to True, the VM does a device-to-host memcpy of the to-be-sliced tensor on every dyn.strided_slice invocation, which can be very expensive if the tensor is big:

tvm/python/tvm/relay/op/dyn/_transform.py, line 195 in 9e766d9
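The registration referred to above looks roughly like this (a sketch of the pre-PR state, not copied verbatim from _transform.py):

```python
# dyn.strided_slice declares itself data-dependent as a whole, so the VM copies
# every input (including the large data tensor) from device to host, even though
# only begin/end/strides actually need their values.
_reg.register_shape_func("dyn.strided_slice", True, _strided_slice_shape_func)
```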
In fact, one of the bottlenecks of running PyTorch MaskRCNN on the GPU is this repeated device-to-host memcpy, as shown in the profiler output below; most of it is for the dyn.strided_slice shape func. CUDA memcpy HtoD is also very slow, but that one is necessary to send large parameters to the GPU once.

If we think about it, these expensive copies are completely useless: the shape func of dyn.strided_slice only needs the data shape. But since the other arguments begin, end, and strides require their tensor values, dyn.strided_slice has to declare its data_dependent flag as True. This issue can be resolved if we let each op declare its data dependence per input.

Proposed solution
Luckily, the decision to insert a device-to-host memcpy before the shape func is already made for each input separately, as shown below:

tvm/python/tvm/relay/transform/memory_alloc.py, lines 207 to 221 in 4c4888b
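Conceptually (a self-contained toy, not the actual memory_alloc.py code; the state names and helper parameters are invented for this sketch), the per-input decision looks like this:

```python
SHAPE_ONLY, TENSOR_NEEDED = 1, 2

def prepare_shape_func_inputs(args, input_states, arg_on_cpu, copy_to_cpu, shape_of):
    """Collect inputs for a shape function, copying to the host only when needed."""
    shape_func_ins = []
    for arg, state in zip(args, input_states):
        if state == SHAPE_ONLY:
            # Only the shape is needed: no device-to-host copy of the tensor data.
            shape_func_ins.append(shape_of(arg))
        else:
            # The tensor values are needed on the CPU, so copy if it lives on a device.
            if not arg_on_cpu(arg):
                arg = copy_to_cpu(arg)
            shape_func_ins.append(arg)
    return shape_func_ins
```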
So all we need is a way to specify per-input data dependence for each op and to pass this information to the ManifestAllocPass above. This PR enables such a specification, like this:
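(The snippet below is a sketch reconstructing the per-input registration; the exact flag list for dyn.strided_slice is my reading of the change, not copied from the diff.)

```python
# Per-input data dependence: one flag per input instead of a single flag for the op.
# For dyn.strided_slice only begin/end/strides need their values; the data tensor
# contributes only its shape, so no device-to-host copy is inserted for it.
_reg.register_shape_func(
    "dyn.strided_slice", [False, True, True, True], _strided_slice_shape_func
)
```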
I've also updated compile_engine.cc accordingly to send the per-input data dependence information to ManifestAllocPass. The change I made is the minimum necessary to achieve my goal, so it can be improved. With this PR, I was able to remove all of the expensive device-to-host memcpy, which cuts GPU MaskRCNN runtime by 14 milliseconds. More importantly, the purpose of this PR is to make people aware of this problem and to decide on the best solution.

please review @icemelon9 @zhiics @kevinthesun @mbrookhart