Describe the bug
Summary: On master with `EXHAUSTIVE` cuDNN search, our model uses 5GB of GPU memory, vs. only 1.3GB with other setups (including TensorFlow). This memory usage cannot be reduced using `gpu_mem_limit`, even though the model can actually run if there is only 0.5GB of GPU memory available.
Details: We have a model converted from TensorFlow, the same model as in #7578 (in all that follows, the fix for upsampling performance mentioned in that other issue was applied to onnxruntime). Here are some statistics I collected when running this model on various versions of onnxruntime (1.7.2, or master on commit 2f04797), with different cuDNN search settings, and compared against TensorFlow:
| lib  | branch | cudnn      | used mem | time per run |
|------|--------|------------|----------|--------------|
| onnx | master | exhaustive | 5.0GB    | 15ms         |
| onnx | master | default    | 0.6GB    | 31ms         |
| onnx | 1.7.2  | exhaustive | 1.3GB    | 27ms         |
| onnx | 1.7.2  | default    | 0.7GB    | 41ms         |
| TF   | 2.4.0  | n/a        | 1.2GB    | 23ms         |
The master branch with `EXHAUSTIVE` cuDNN search clearly has the best runtime performance of all combinations, and it is the only onnxruntime setup that runs faster than TensorFlow (great job on that!). Unfortunately, it also has a dramatically larger memory usage. This is wreaking havoc with our product, which needs that memory for other purposes (and we need to keep the onnxruntime session alive to avoid the cost of re-creating it the next time it is needed). In all cases I set the arena extend strategy to `kSameAsRequested`.
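For reference, a minimal sketch of how these CUDA provider settings (cuDNN algo search, `gpu_mem_limit`, arena extend strategy) can be configured through the C++ API. Exact field and enum names vary slightly between 1.7.2 and master (e.g. `cuda_mem_limit` vs. `gpu_mem_limit`, unprefixed `EXHAUSTIVE` vs. `OrtCudnnConvAlgoSearchExhaustive`), and the model path is a placeholder since the real model cannot be shared:

```cpp
#include <onnxruntime_cxx_api.h>

int main() {
  Ort::Env env(ORT_LOGGING_LEVEL_WARNING, "memtest");

  // CUDA execution provider options (names as in recent headers; older
  // releases use cuda_mem_limit and unprefixed enum values).
  OrtCUDAProviderOptions cuda_options{};
  cuda_options.device_id = 0;
  cuda_options.cudnn_conv_algo_search = OrtCudnnConvAlgoSearchExhaustive;        // EXHAUSTIVE search
  cuda_options.gpu_mem_limit = static_cast<size_t>(4) * 1024 * 1024 * 1024;      // e.g. 4GB arena cap
  cuda_options.arena_extend_strategy = 1;                                        // 1 == kSameAsRequested
  cuda_options.do_copy_in_default_stream = 1;

  Ort::SessionOptions session_options;
  session_options.AppendExecutionProvider_CUDA(cuda_options);

  // "model.onnx" is a placeholder path (wide string because this is Windows).
  Ort::Session session(env, L"model.onnx", session_options);

  // ... create input tensors and call session.Run(...) as usual ...
  return 0;
}
```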
I thought I could control this with the CUDA session settings. If I set `gpu_mem_limit` to any value between 4GB and 8GB, the session runs but keeps using 5GB of memory. If I set `gpu_mem_limit` to anything less than 4GB, the session refuses to run with the following message:
```
2021-05-07 08:39:51.9782447 [E:onnxruntime:, sequential_executor.cc:338 onnxruntime::SequentialExecutor::Execute] Non-zero status code returned while running FusedConv node. Name:'StatefulPartitionedCall/model/conv2d_1/BiasAdd_StatefulPartitionedCall/model/activation_1/Relu' Status Message: C:\onnxruntime\onnxruntime\core\framework\bfc_arena.cc:309 onnxruntime::BFCArena::AllocateRawInternal Available memory of 2010120192 is smaller than requested bytes of 4362633216
```
I then tried to artificially reduce the amount of available GPU memory by pre-allocating a big array with CUDA, while leaving the memory limit at the maximum. Surprisingly, the session was then able to run, with unchanged performance:
| available mem | used mem | time per run |
|---------------|----------|--------------|
| 7.4GB         | 5.0GB    | 15ms         |
| 5.3GB         | 3.4GB    | 16ms         |
| 3.3GB         | 2.0GB    | 16ms         |
| 1.7GB         | 0.9GB    | 18ms         |
| 1.0GB         | 0.6GB    | 16ms         |
| 0.8GB         | 0.5GB    | 17ms         |
| 0.6GB         | failed   | failed       |
This table suggests that the model should be able to run with `gpu_mem_limit` set as low as ~1GB, yet that is not the case.
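For completeness, one way to do the pre-allocation described above; a minimal sketch with the CUDA runtime API (the amount left free is only an illustrative value, not the exact procedure used for the table):

```cpp
#include <cuda_runtime.h>
#include <cstdio>

int main() {
  size_t free_bytes = 0, total_bytes = 0;
  cudaMemGetInfo(&free_bytes, &total_bytes);

  // Reserve a block so that only ~1GB remains available to the rest of the process.
  const size_t target_free = static_cast<size_t>(1) << 30;  // illustrative value
  void* reserved = nullptr;
  if (free_bytes > target_free) {
    // The block is never touched; it only makes the memory unavailable to the
    // onnxruntime session created afterwards in the same process.
    cudaMalloc(&reserved, free_bytes - target_free);
  }

  cudaMemGetInfo(&free_bytes, &total_bytes);
  std::printf("available GPU memory now: %.1f GB\n",
              free_bytes / (1024.0 * 1024.0 * 1024.0));

  // ... create the onnxruntime session and run the model here ...

  cudaFree(reserved);
  return 0;
}
```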
`nvprof` reports:

onnx master exhaustive:

```
           Type  Time(%)      Time  Calls       Avg       Min       Max  Name
GPU activities:   71.03%  10.188ms     13  783.66us  249.11us  1.7104ms  maxwell_scudnn_winograd_128x128_ldg1_ldg4_mobile_relu_tile148t_nt_v0
                   9.36%  1.3429ms      5  268.57us  76.350us  593.24us  void onnxruntime::cuda::_UpampleNearestKernel<float, int=4>(onnxruntime::cuda::TArray<__int64, int=8>, onnxruntime::cuda::_UpampleNearestKernel<float, int=4, onnxruntime::cuda::fast_divmod, int=8>, onnxruntime::cuda::fast_divmod, float const *, onnxruntime::cuda::_UpampleNearestKernel<float, int=4, onnxruntime::cuda::fast_divmod, int=8>*, __int64)
```

TensorFlow:

```
           Type  Time(%)      Time  Calls       Avg       Min       Max  Name
GPU activities:   52.99%  10.185ms     13  783.49us  256.67us  1.8006ms  maxwell_scudnn_winograd_128x128_ldg1_ldg4_mobile_relu_tile148t_nt_v0
                   9.92%  1.9073ms     10  190.73us  24.480us  667.86us  void tensorflow::functor::SwapDimension1And2InTensor3UsingTiles<unsigned int, int=256, int=32, int=32, bool=0>(unsigned int const *, tensorflow::functor::Dimension<int=3>, tensorflow::functor::SwapDimension1And2InTensor3UsingTiles<unsigned int, int=256, int=32, int=32, bool=0>*)
                   9.24%  1.7757ms      5  355.15us  99.998us  786.58us  void tensorflow::_GLOBAL__N__52_resize_nearest_neighbor_op_gpu_cu_compute_86_cpp1_ii_ed679893_20864::ResizeNearestNeighborNHWC<float>(int, float const *, int, int, int, int, int, float, float, tensorflow::_GLOBAL__N__52_resize_nearest_neighbor_op_gpu_cu_compute_86_cpp1_ii_ed679893_20864::ResizeNearestNeighborNHWC<float>*)
```
Urgency
None, as we can run with 1.7.2. Unfortunately, that means we cannot benefit from the improved performance on master, which may land in the next release together with this memory problem.
System information
OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Windows 10 Pro
ONNX Runtime installed from (source or binary): source
ONNX Runtime version: master and 1.7.2 (see above)
Python version: N/A
Visual Studio version (if applicable): 2019
GCC/Compiler version (if compiling from source): VS 2019
CUDA/cuDNN version: CUDA 11.2.1 + cuDNN 8.1.0.77
GPU model and memory: GeForce GTX 1070 8GB
To Reproduce
Unfortunately I cannot share our model, as it is proprietary.
Expected behavior
Memory used should be limited to what is truly needed. `gpu_mem_limit` should behave the same as truly reducing the available GPU memory.
Screenshots
N/A
Additional context
None
---
... and that worked. All the memory used by Run was cleaned up automatically when the function returned. However, the performance loss was substantial:
| lib  | branch | cudnn      | used mem | time per run |
|------|--------|------------|----------|--------------|
| onnx | 1.8.0  | exhaustive | 0.0GB    | 40ms         |
| onnx | 1.8.0  | default    | 0.0GB    | 48ms         |
| onnx | 1.7.2  | exhaustive | 1.3GB    | 27ms         |
| onnx | 1.7.2  | default    | 0.7GB    | 41ms         |
| TF   | 2.4.0  | n/a        | 1.2GB    | 23ms         |
With these numbers, we are still better off using 1.7.2. Ideally, the root cause of the excessive memory usage would be fixed, so that we can benefit from the best performance with reasonable memory usage.