Large GPU memory usage with EXHAUSTIVE cuDNN search #7612

Open
cschreib-ibex opened this issue May 7, 2021 · 3 comments
Labels
ep:CUDA issues related to the CUDA execution provider

Comments

@cschreib-ibex (Contributor)
Describe the bug

Summary: On master with EXHAUSTIVE cuDNN search, our model uses 5GB of GPU memory, versus only 1.3GB with other setups (including TensorFlow). This memory usage cannot be reduced using gpu_mem_limit, even though the model can actually run when only 0.5GB of GPU memory is available.

Details: We have a model converted from TensorFlow, the same model as in #7578 (in all that follows, the fix for upsampling performance mentioned in that other issue was applied to onnxruntime). Here are some statistics I collected when running this model on various versions of onnxruntime (1.7.2, or master on commit 2f04797), with different cuDNN search settings, and compared against TensorFlow:

| lib  | branch | cudnn      | used mem | time per run |
|------|--------|------------|----------|--------------|
| onnx | master | exhaustive | 5.0GB    | 15ms         |
| onnx | master | default    | 0.6GB    | 31ms         |
| onnx | 1.7.2  | exhaustive | 1.3GB    | 27ms         |
| onnx | 1.7.2  | default    | 0.7GB    | 41ms         |
| TF   | 2.4.0  | n/a        | 1.2GB    | 23ms         |

The master branch with EXHAUSTIVE cuDNN search clearly has the best runtime performance of all combinations, and it is the only onnxruntime setup that runs faster than TensorFlow (great job on that!). Unfortunately, it also has a dramatically larger memory usage. This is wreaking havoc with our product, which needs that memory for other purposes (and we need to keep the onnxruntime session alive to avoid the cost of re-creating it the next time it is needed). In all cases I set the arena extend strategy to kSameAsRequested.
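
For context, the CUDA provider in our setup is configured roughly like the sketch below, using the C++ API's OrtCUDAProviderOptions (field and enum spellings follow recent headers and may differ slightly between versions; the 2GB gpu_mem_limit is just an illustrative value):

```cpp
#include <onnxruntime_cxx_api.h>

int main() {
  Ort::Env env(ORT_LOGGING_LEVEL_WARNING, "cudnn-search-example");
  Ort::SessionOptions session_options;

  OrtCUDAProviderOptions cuda_options{};
  cuda_options.device_id = 0;
  // Exhaustive cuDNN convolution algorithm search (spelled EXHAUSTIVE in older headers).
  cuda_options.cudnn_conv_algo_search = OrtCudnnConvAlgoSearchExhaustive;
  // 1 == kSameAsRequested arena extend strategy.
  cuda_options.arena_extend_strategy = 1;
  // Illustrative cap; see the observations below on how gpu_mem_limit behaves in practice.
  cuda_options.gpu_mem_limit = 2ULL * 1024 * 1024 * 1024;
  cuda_options.do_copy_in_default_stream = 1;

  session_options.AppendExecutionProvider_CUDA(cuda_options);
  Ort::Session session(env, ORT_TSTR("model.onnx"), session_options);
  return 0;
}
```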

I thought I could control this with the CUDA session settings. If I set gpu_mem_limit to any value between 4GB and 8GB, the session runs but keeps using 5GB of memory. If I set gpu_mem_limit to anything less than 4GB, the session refuses to run with the following message:

2021-05-07 08:39:51.9782447 [E:onnxruntime:, sequential_executor.cc:338 onnxruntime::SequentialExecutor::Execute] Non-zero status code returned while running FusedConv node. Name:'StatefulPartitionedCall/model/conv2d_1/BiasAdd_StatefulPartitionedCall/model/activation_1/Relu' Status Message: C:\onnxruntime\onnxruntime\core\framework\bfc_arena.cc:309 onnxruntime::BFCArena::AllocateRawInternal Available memory of 2010120192 is smaller than requested bytes of 4362633216

I then tried to artificially reduce the amount of available GPU memory by pre-allocating a big array with CUDA, while leaving the memory limit at the maximum. Surprisingly, the session was then able to run, with unchanged performance:

| available mem | used mem | time per run |
|---------------|----------|--------------|
| 7.4GB         | 5.0GB    | 15ms         |
| 5.3GB         | 3.4GB    | 16ms         |
| 3.3GB         | 2.0GB    | 16ms         |
| 1.7GB         | 0.9GB    | 18ms         |
| 1.0GB         | 0.6GB    | 16ms         |
| 0.8GB         | 0.5GB    | 17ms         |
| 0.6GB         | failed   | failed       |

This table suggests that the model should be able to run with gpu_mem_limit set as low as ~1GB, yet it is not the case.
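
For completeness, the artificial reduction of available memory was just a plain cudaMalloc done before creating the session, along the lines of this sketch (the helper name and error handling are mine, purely for illustration):

```cpp
#include <cuda_runtime.h>
#include <cstdio>

// Grab `reserve_bytes` of GPU memory up front so that only the remainder is
// visible as free to the onnxruntime allocator. Test-only; the block must be
// released with cudaFree() afterwards.
void* reserve_gpu_memory(size_t reserve_bytes) {
  void* blocker = nullptr;
  cudaError_t err = cudaMalloc(&blocker, reserve_bytes);
  if (err != cudaSuccess) {
    std::fprintf(stderr, "cudaMalloc failed: %s\n", cudaGetErrorString(err));
    return nullptr;
  }
  return blocker;
}
```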

nvprof reports:

onnx master exhaustive

            Type  Time(%)      Time     Calls       Avg       Min       Max  Name
 GPU activities:   71.03%  10.188ms        13  783.66us  249.11us  1.7104ms  maxwell_scudnn_winograd_128x128_ldg1_ldg4_mobile_relu_tile148t_nt_v0
                    9.36%  1.3429ms         5  268.57us  76.350us  593.24us  void onnxruntime::cuda::_UpampleNearestKernel<float, int=4>(onnxruntime::cuda::TArray<__int64, int=8>, onnxruntime::cuda::_UpampleNearestKernel<float, int=4, onnxruntime::cuda::fast_divmod, int=8>, onnxruntime::cuda::fast_divmod, float const *, onnxruntime::cuda::_UpampleNearestKernel<float, int=4, onnxruntime::cuda::fast_divmod, int=8>*, __int64)

onnx master default

            Type  Time(%)      Time     Calls       Avg       Min       Max  Name
 GPU activities:   35.21%  10.852ms         3  3.6174ms  2.1593ms  4.4071ms  maxwell_scudnn_128x128_relu_small_nn_v1
                   17.24%  5.3139ms         2  2.6569ms  2.6557ms  2.6582ms  maxwell_scudnn_128x128_relu_large_nn_v1
                   14.97%  4.6135ms         1  4.6135ms  4.6135ms  4.6135ms  maxwell_scudnn_128x32_relu_medium_nn_v1
                   14.27%  4.3997ms         4  1.0999ms  723.06us  1.5317ms  maxwell_scudnn_128x32_relu_small_nn_v1
                    5.67%  1.7466ms         3  582.22us  567.96us  593.43us  maxwell_scudnn_128x64_relu_large_nn_v1
                    5.31%  1.6360ms         5  327.21us  93.374us  723.38us  void onnxruntime::cuda::_UpampleNearestKernel<float, int=4>(onnxruntime::cuda::TArray<__int64, int=8>, onnxruntime::cuda::_UpampleNearestKernel<float, int=4, onnxruntime::cuda::fast_divmod, int=8>, onnxruntime::cuda::fast_divmod, float const *, onnxruntime::cuda::_UpampleNearestKernel<float, int=4, onnxruntime::cuda::fast_divmod, int=8>*, __int64)

onnx 1.7.2 exhaustive

            Type  Time(%)      Time     Calls       Avg       Min       Max  Name
 GPU activities:   64.29%  11.518ms        13  885.99us  282.59us  1.9184ms  maxwell_scudnn_winograd_128x128_ldg1_ldg4_mobile_relu_tile148t_nt_v0
                    8.38%  1.5005ms         5  300.10us  85.758us  662.83us  void onnxruntime::cuda::_UpampleNearestKernel<float, int=4>(onnxruntime::cuda::TArray<__int64, int=8>, onnxruntime::cuda::_UpampleNearestKernel<float, int=4, onnxruntime::cuda::fast_divmod, int=8>, onnxruntime::cuda::fast_divmod, float const *, onnxruntime::cuda::_UpampleNearestKernel<float, int=4, onnxruntime::cuda::fast_divmod, int=8>*, __int64)

onnx 1.7.2 default

            Type  Time(%)      Time     Calls       Avg       Min       Max  Name
 GPU activities:   33.47%  10.730ms         3  3.5767ms  2.1571ms  4.2870ms  maxwell_scudnn_128x128_relu_small_nn_v1
                   14.49%  4.6463ms         2  2.3231ms  2.3222ms  2.3241ms  maxwell_scudnn_128x128_relu_large_nn_v1
                   14.42%  4.6212ms         1  4.6212ms  4.6212ms  4.6212ms  maxwell_scudnn_128x32_relu_medium_nn_v1
                   13.41%  4.2974ms         4  1.0744ms  723.57us  1.4280ms  maxwell_scudnn_128x32_relu_small_nn_v1
                    5.45%  1.7480ms         3  582.67us  567.99us  592.47us  maxwell_scudnn_128x64_relu_large_nn_v1
                    5.11%  1.6378ms         5  327.55us  93.405us  724.14us  void onnxruntime::cuda::_UpampleNearestKernel<float, int=4>(onnxruntime::cuda::TArray<__int64, int=8>, onnxruntime::cuda::_UpampleNearestKernel<float, int=4, onnxruntime::cuda::fast_divmod, int=8>, onnxruntime::cuda::fast_divmod, float const *, onnxruntime::cuda::_UpampleNearestKernel<float, int=4, onnxruntime::cuda::fast_divmod, int=8>*, __int64)

Tensorflow

            Type  Time(%)      Time     Calls       Avg       Min       Max  Name
 GPU activities:   52.99%  10.185ms        13  783.49us  256.67us  1.8006ms  maxwell_scudnn_winograd_128x128_ldg1_ldg4_mobile_relu_tile148t_nt_v0
                    9.92%  1.9073ms        10  190.73us  24.480us  667.86us  void tensorflow::functor::SwapDimension1And2InTensor3UsingTiles<unsigned int, int=256, int=32, int=32, bool=0>(unsigned int const *, tensorflow::functor::Dimension<int=3>, tensorflow::functor::SwapDimension1And2InTensor3UsingTiles<unsigned int, int=256, int=32, int=32, bool=0>*)
                    9.24%  1.7757ms         5  355.15us  99.998us  786.58us  void tensorflow::_GLOBAL__N__52_resize_nearest_neighbor_op_gpu_cu_compute_86_cpp1_ii_ed679893_20864::ResizeNearestNeighborNHWC<float>(int, float const *, int, int, int, int, int, float, float, tensorflow::_GLOBAL__N__52_resize_nearest_neighbor_op_gpu_cu_compute_86_cpp1_ii_ed679893_20864::ResizeNearestNeighborNHWC<float>*)

Urgency
None, as we can keep running 1.7.2. Unfortunately, that means we cannot benefit from the improved performance on master, which may land in the next release still carrying this memory problem.

System information

  • OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Windows 10 Pro
  • ONNX Runtime installed from (source or binary): source
  • ONNX Runtime version: master and 1.7.2 (see above)
  • Python version: N/A
  • Visual Studio version (if applicable): 2019
  • GCC/Compiler version (if compiling from source): VS 2019
  • CUDA/cuDNN version: CUDA 11.2.1 + cuDNN 8.1.0.77
  • GPU model and memory: GeForce GTX 1070 8GB

To Reproduce

  • Unfortunately I cannot share our model, as it is proprietary.

Expected behavior
Memory used should be limited to what is truly needed. gpu_mem_limit should behave the same as truly reducing the available GPU memory.

Screenshots
N/A

Additional context
None

@jywu-msft added the ep:CUDA label on May 7, 2021
@ytaous (Contributor) commented May 10, 2021

@duli2012 - any thought on the memory issue?

@cschreib-ibex (Contributor, Author) commented Jun 7, 2021

I have just tested the (just released) 1.8.0 version, and the problem persists.

Here's an updated table from the original issue:

| lib  | branch | cudnn      | used mem | time per run |
|------|--------|------------|----------|--------------|
| onnx | 1.8.0  | exhaustive | 4.7GB    | 19ms         |
| onnx | 1.8.0  | default    | 0.2GB    | 41ms         |
| onnx | 1.7.2  | exhaustive | 1.3GB    | 27ms         |
| onnx | 1.7.2  | default    | 0.7GB    | 41ms         |
| TF   | 2.4.0  | n/a        | 1.2GB    | 23ms         |

@cschreib-ibex (Contributor, Author)

Thankfully, I saw #7284 added the ability to let the memory arena shrink after each call to Run. I changed the run options like so:

```cpp
#include <onnxruntime_cxx_api.h>
#include <cuda_runtime.h>
#include <sstream>

Ort::RunOptions options;

// Target the arena of the CUDA device the session runs on.
int deviceID = 0;
cudaGetDevice(&deviceID);

std::ostringstream stream;
stream << "gpu:" << deviceID;

// Shrink the GPU memory arena back down after each Run() (added in #7284).
options.AddConfigEntry("memory.enable_memory_arena_shrinkage", stream.str().c_str());

session.Run(options, ...);
```

... and that worked. All the memory used by Run was cleaned up automatically when the function returned. However, the performance loss was substantial:

| lib  | branch | cudnn      | used mem | time per run |
|------|--------|------------|----------|--------------|
| onnx | 1.8.0  | exhaustive | 0.0GB    | 40ms         |
| onnx | 1.8.0  | default    | 0.0GB    | 48ms         |
| onnx | 1.7.2  | exhaustive | 1.3GB    | 27ms         |
| onnx | 1.7.2  | default    | 0.7GB    | 41ms         |
| TF   | 2.4.0  | n/a        | 1.2GB    | 23ms         |

With these numbers, we are still better off using 1.7.2. Ideally, the root cause of the excessive memory usage would be fixed, so that we can get the best performance with reasonable memory usage.
