Large GPU memory usage with EXHAUSTIVE cuDNN search #7612

Open
cschreib-ibex opened this issue May 7, 2021 · 3 comments
Labels
ep:CUDA issues related to the CUDA execution provider

Comments

@cschreib-ibex (Contributor)
Describe the bug

Summary: On master with EXHAUSTIVE cuDNN search, our model uses 5GB of GPU memory, versus only 1.3GB with other setups (including TensorFlow). This memory usage cannot be reduced using gpu_mem_limit, even though the model can actually run when only 0.5GB of GPU memory is available.

Details: We have a model converted from TensorFlow, the same model as in #7578 (in all that follows, the fix for upsampling performance mentioned in that other issue was applied to onnxruntime). Here are some statistics I collected when running this model on various versions of onnxruntime (1.7.2, or master on commit 2f04797), with different cuDNN search settings, and compared against TensorFlow:

| lib  | branch | cudnn      | used mem | time per run |
|------|--------|------------|----------|--------------|
| onnx | master | exhaustive | 5.0GB    | 15ms         |
| onnx | master | default    | 0.6GB    | 31ms         |
| onnx | 1.7.2  | exhaustive | 1.3GB    | 27ms         |
| onnx | 1.7.2  | default    | 0.7GB    | 41ms         |
| TF   | 2.4.0  | n/a        | 1.2GB    | 23ms         |

The master branch with EXHAUSTIVE cuDNN search clearly has the best runtime performance of all combinations, and it is the only onnxruntime setup that runs faster than TensorFlow (great job on that!). Unfortunately, it also has a dramatically larger memory usage. This is wreaking havoc with our product, which needs that memory for other purposes (and we need to keep the onnxruntime session alive to avoid the cost of re-creating it the next time it is needed). In all cases I set the arena extend strategy to kSameAsRequested.
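
For context, the CUDA provider in our setup is configured roughly like the sketch below, using the C++ API's OrtCUDAProviderOptions (field and enum spellings follow recent headers and may differ slightly between versions; the 2GB gpu_mem_limit is just an illustrative value):

```cpp
#include <onnxruntime_cxx_api.h>

int main() {
  Ort::Env env(ORT_LOGGING_LEVEL_WARNING, "cudnn-search-example");
  Ort::SessionOptions session_options;

  OrtCUDAProviderOptions cuda_options{};
  cuda_options.device_id = 0;
  // Exhaustive cuDNN convolution algorithm search (spelled EXHAUSTIVE in older headers).
  cuda_options.cudnn_conv_algo_search = OrtCudnnConvAlgoSearchExhaustive;
  // 1 == kSameAsRequested arena extend strategy.
  cuda_options.arena_extend_strategy = 1;
  // Illustrative cap; see the observations below on how gpu_mem_limit behaves in practice.
  cuda_options.gpu_mem_limit = 2ULL * 1024 * 1024 * 1024;
  cuda_options.do_copy_in_default_stream = 1;

  session_options.AppendExecutionProvider_CUDA(cuda_options);
  Ort::Session session(env, ORT_TSTR("model.onnx"), session_options);
  return 0;
}
```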

I thought I could control this with the CUDA session settings. If I set gpu_mem_limit to any value between 4GB and 8GB, the session runs but keeps using 5GB of memory. If I set gpu_mem_limit to anything less than 4GB, the session refuses to run with the following message:

2021-05-07 08:39:51.9782447 [E:onnxruntime:, sequential_executor.cc:338 onnxruntime::SequentialExecutor::Execute] Non-zero status code returned while running FusedConv node. Name:'StatefulPartitionedCall/model/conv2d_1/BiasAdd_StatefulPartitionedCall/model/activation_1/Relu' Status Message: C:\onnxruntime\onnxruntime\core\framework\bfc_arena.cc:309 onnxruntime::BFCArena::AllocateRawInternal Available memory of 2010120192 is smaller than requested bytes of 4362633216

I then tried to artificially reduce the amount of available GPU memory by pre-allocating a big array with CUDA, while leaving the memory limit at the maximum. Surprisingly, the session was then able to run, with unchanged performance:

| available mem | used mem | time per run |
|---------------|----------|--------------|
| 7.4GB         | 5.0GB    | 15ms         |
| 5.3GB         | 3.4GB    | 16ms         |
| 3.3GB         | 2.0GB    | 16ms         |
| 1.7GB         | 0.9GB    | 18ms         |
| 1.0GB         | 0.6GB    | 16ms         |
| 0.8GB         | 0.5GB    | 17ms         |
| 0.6GB         | failed   | failed       |

This table suggests that the model should be able to run with gpu_mem_limit set as low as ~1GB, yet it is not the case.
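
For completeness, the artificial reduction of available memory was just a plain cudaMalloc done before creating the session, along the lines of this sketch (the helper name and error handling are mine, purely for illustration):

```cpp
#include <cuda_runtime.h>
#include <cstdio>

// Grab `reserve_bytes` of GPU memory up front so that only the remainder is
// visible as free to the onnxruntime allocator. Test-only; the block must be
// released with cudaFree() afterwards.
void* reserve_gpu_memory(size_t reserve_bytes) {
  void* blocker = nullptr;
  cudaError_t err = cudaMalloc(&blocker, reserve_bytes);
  if (err != cudaSuccess) {
    std::fprintf(stderr, "cudaMalloc failed: %s\n", cudaGetErrorString(err));
    return nullptr;
  }
  return blocker;
}
```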

nvprof reports:

onnx master exhaustive

            Type  Time(%)      Time     Calls       Avg       Min       Max  Name
 GPU activities:   71.03%  10.188ms        13  783.66us  249.11us  1.7104ms  maxwell_scudnn_winograd_128x128_ldg1_ldg4_mobile_relu_tile148t_nt_v0
                    9.36%  1.3429ms         5  268.57us  76.350us  593.24us  void onnxruntime::cuda::_UpampleNearestKernel<float, int=4>(onnxruntime::cuda::TArray<__int64, int=8>, onnxruntime::cuda::_UpampleNearestKernel<float, int=4, onnxruntime::cuda::fast_divmod, int=8>, onnxruntime::cuda::fast_divmod, float const *, onnxruntime::cuda::_UpampleNearestKernel<float, int=4, onnxruntime::cuda::fast_divmod, int=8>*, __int64)

onnx master default

            Type  Time(%)      Time     Calls       Avg       Min       Max  Name
 GPU activities:   35.21%  10.852ms         3  3.6174ms  2.1593ms  4.4071ms  maxwell_scudnn_128x128_relu_small_nn_v1
                   17.24%  5.3139ms         2  2.6569ms  2.6557ms  2.6582ms  maxwell_scudnn_128x128_relu_large_nn_v1
                   14.97%  4.6135ms         1  4.6135ms  4.6135ms  4.6135ms  maxwell_scudnn_128x32_relu_medium_nn_v1
                   14.27%  4.3997ms         4  1.0999ms  723.06us  1.5317ms  maxwell_scudnn_128x32_relu_small_nn_v1
                    5.67%  1.7466ms         3  582.22us  567.96us  593.43us  maxwell_scudnn_128x64_relu_large_nn_v1
                    5.31%  1.6360ms         5  327.21us  93.374us  723.38us  void onnxruntime::cuda::_UpampleNearestKernel<float, int=4>(onnxruntime::cuda::TArray<__int64, int=8>, onnxruntime::cuda::_UpampleNearestKernel<float, int=4, onnxruntime::cuda::fast_divmod, int=8>, onnxruntime::cuda::fast_divmod, float const *, onnxruntime::cuda::_UpampleNearestKernel<float, int=4, onnxruntime::cuda::fast_divmod, int=8>*, __int64)

onnx 1.7.2 exhaustive

            Type  Time(%)      Time     Calls       Avg       Min       Max  Name
 GPU activities:   64.29%  11.518ms        13  885.99us  282.59us  1.9184ms  maxwell_scudnn_winograd_128x128_ldg1_ldg4_mobile_relu_tile148t_nt_v0
                    8.38%  1.5005ms         5  300.10us  85.758us  662.83us  void onnxruntime::cuda::_UpampleNearestKernel<float, int=4>(onnxruntime::cuda::TArray<__int64, int=8>, onnxruntime::cuda::_UpampleNearestKernel<float, int=4, onnxruntime::cuda::fast_divmod, int=8>, onnxruntime::cuda::fast_divmod, float const *, onnxruntime::cuda::_UpampleNearestKernel<float, int=4, onnxruntime::cuda::fast_divmod, int=8>*, __int64)

onnx 1.7.2 default

            Type  Time(%)      Time     Calls       Avg       Min       Max  Name
 GPU activities:   33.47%  10.730ms         3  3.5767ms  2.1571ms  4.2870ms  maxwell_scudnn_128x128_relu_small_nn_v1
                   14.49%  4.6463ms         2  2.3231ms  2.3222ms  2.3241ms  maxwell_scudnn_128x128_relu_large_nn_v1
                   14.42%  4.6212ms         1  4.6212ms  4.6212ms  4.6212ms  maxwell_scudnn_128x32_relu_medium_nn_v1
                   13.41%  4.2974ms         4  1.0744ms  723.57us  1.4280ms  maxwell_scudnn_128x32_relu_small_nn_v1
                    5.45%  1.7480ms         3  582.67us  567.99us  592.47us  maxwell_scudnn_128x64_relu_large_nn_v1
                    5.11%  1.6378ms         5  327.55us  93.405us  724.14us  void onnxruntime::cuda::_UpampleNearestKernel<float, int=4>(onnxruntime::cuda::TArray<__int64, int=8>, onnxruntime::cuda::_UpampleNearestKernel<float, int=4, onnxruntime::cuda::fast_divmod, int=8>, onnxruntime::cuda::fast_divmod, float const *, onnxruntime::cuda::_UpampleNearestKernel<float, int=4, onnxruntime::cuda::fast_divmod, int=8>*, __int64)

Tensorflow

            Type  Time(%)      Time     Calls       Avg       Min       Max  Name
 GPU activities:   52.99%  10.185ms        13  783.49us  256.67us  1.8006ms  maxwell_scudnn_winograd_128x128_ldg1_ldg4_mobile_relu_tile148t_nt_v0
                    9.92%  1.9073ms        10  190.73us  24.480us  667.86us  void tensorflow::functor::SwapDimension1And2InTensor3UsingTiles<unsigned int, int=256, int=32, int=32, bool=0>(unsigned int const *, tensorflow::functor::Dimension<int=3>, tensorflow::functor::SwapDimension1And2InTensor3UsingTiles<unsigned int, int=256, int=32, int=32, bool=0>*)
                    9.24%  1.7757ms         5  355.15us  99.998us  786.58us  void tensorflow::_GLOBAL__N__52_resize_nearest_neighbor_op_gpu_cu_compute_86_cpp1_ii_ed679893_20864::ResizeNearestNeighborNHWC<float>(int, float const *, int, int, int, int, int, float, float, tensorflow::_GLOBAL__N__52_resize_nearest_neighbor_op_gpu_cu_compute_86_cpp1_ii_ed679893_20864::ResizeNearestNeighborNHWC<float>*)

Urgency
None, as we can keep running 1.7.2. Unfortunately, that means we cannot benefit from the improved performance on master, which may land in the next release still carrying this memory problem.

System information

  • OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Windows 10 Pro
  • ONNX Runtime installed from (source or binary): source
  • ONNX Runtime version: master and 1.7.2 (see above)
  • Python version: N/A
  • Visual Studio version (if applicable): 2019
  • GCC/Compiler version (if compiling from source): VS 2019
  • CUDA/cuDNN version: CUDA 11.2.1 + cuDNN 8.1.0.77
  • GPU model and memory: GeForce GTX 1070 8GB

To Reproduce

  • Unfortunately I cannot share our model, as it is proprietary.

Expected behavior
Memory used should be limited to what is truly needed. gpu_mem_limit should behave the same as truly reducing the available GPU memory.

Screenshots
N/A

Additional context
None

@jywu-msft added the ep:CUDA label on May 7, 2021
@ytaous (Contributor) commented May 10, 2021

@duli2012 - any thought on the memory issue?

@cschreib-ibex (Contributor, Author) commented Jun 7, 2021

I have just tested the (just released) 1.8.0 version, and the problem persists.

Here's an updated table from the original issue:

| lib  | branch | cudnn      | used mem | time per run |
|------|--------|------------|----------|--------------|
| onnx | 1.8.0  | exhaustive | 4.7GB    | 19ms         |
| onnx | 1.8.0  | default    | 0.2GB    | 41ms         |
| onnx | 1.7.2  | exhaustive | 1.3GB    | 27ms         |
| onnx | 1.7.2  | default    | 0.7GB    | 41ms         |
| TF   | 2.4.0  | n/a        | 1.2GB    | 23ms         |

@cschreib-ibex (Contributor, Author)

Thankfully, I saw #7284 added the ability to let the memory arena shrink after each call to Run. I changed the run options like so:

```cpp
#include <onnxruntime_cxx_api.h>
#include <cuda_runtime.h>
#include <sstream>

Ort::RunOptions options;

// Target the arena of the CUDA device the session runs on.
int deviceID = 0;
cudaGetDevice(&deviceID);

std::ostringstream stream;
stream << "gpu:" << deviceID;

// Shrink the GPU memory arena back down after each Run() (added in #7284).
options.AddConfigEntry("memory.enable_memory_arena_shrinkage", stream.str().c_str());

session.Run(options, ...);
```

... and that worked. All the memory used by Run was cleaned up automatically when the function returned. However, the performance loss was substantial:

| lib  | branch | cudnn      | used mem | time per run |
|------|--------|------------|----------|--------------|
| onnx | 1.8.0  | exhaustive | 0.0GB    | 40ms         |
| onnx | 1.8.0  | default    | 0.0GB    | 48ms         |
| onnx | 1.7.2  | exhaustive | 1.3GB    | 27ms         |
| onnx | 1.7.2  | default    | 0.7GB    | 41ms         |
| TF   | 2.4.0  | n/a        | 1.2GB    | 23ms         |

With these numbers, we are still better off using 1.7.2. Ideally, the root cause of the excessive memory usage would be fixed, so that we can get the best performance with reasonable memory usage.
