Fix GPU CUDA out of memory error when workers_per_replica > 1 #853
Conversation
@RobertLucian thank you for looking into this! I updated the code a little and moved the documentation you added.

Also, I am still seeing GPU out of memory issues when I tried running it. The API did become ready without crashing, but when I hit it with concurrent requests, it seemed to crash due to GPU OOM. Perhaps the model, once loaded into the GPU, is too big? Or perhaps limiting the GPU growth isn't working for some reason? Or perhaps there is a GPU memory leak somehow? The error I saw was a `CUDA_ERROR_OUT_OF_MEMORY`.
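For reference, limiting GPU memory growth in TensorFlow 2.x is typically done along these lines. This is a minimal sketch of the general technique, not necessarily the exact change made in this PR:

```python
import tensorflow as tf

# List the physical GPUs visible to this worker process.
gpus = tf.config.list_physical_devices("GPU")
for gpu in gpus:
    # Allocate GPU memory on demand instead of pre-allocating the whole card.
    # This must run before the GPU is initialized (i.e. before any model is loaded).
    tf.config.experimental.set_memory_growth(gpu, True)
```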
@deliahu Apparently, the CRNN API models need cumulatively about 8119 MiB, and the T4 GPU only has 15079 MiB. This means it's not possible to fit 2 workers on a single T4 GPU. And there's no GPU memory leak, nor is there an issue with the GPU growth setting - we're okay in that regard. I looked into ways of reducing the memory needs of the models within Keras so that 2 workers could fit, and without a significant change to the models used (for instance, inside faustomorales/keras-ocr's source code), there isn't an easy way out of this.
I looked into this and found that the memory requirements of loading a model are lower than those of loading a model and running predictions. This explains the situation above.
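One rough way to observe the load-time vs. prediction-time difference is to poll `nvidia-smi` around each step. This is an assumed measurement sketch (not taken from the PR), and it presumes GPU memory growth is enabled so the numbers reflect actual usage rather than pre-allocation; the commented-out model calls are placeholders:

```python
import subprocess

def gpu_memory_used_mib() -> int:
    # Query the currently used memory (in MiB) of GPU 0 via nvidia-smi.
    out = subprocess.check_output(
        ["nvidia-smi", "--query-gpu=memory.used", "--format=csv,noheader,nounits"]
    )
    return int(out.decode().splitlines()[0])

print("before load:  ", gpu_memory_used_mib(), "MiB")
# pipeline = keras_ocr.pipeline.Pipeline()   # hypothetical: load the CRNN pipeline here
print("after load:   ", gpu_memory_used_mib(), "MiB")
# pipeline.recognize([some_image])           # hypothetical: run a single prediction
print("after predict:", gpu_memory_used_mib(), "MiB")
```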
Yes, I like this. Specifically the headline. It's succinct.
All sounds good, thank you for looking into this!
(cherry picked from commit c0f3d4b)
Fixes the original problem of #845.
When the following conditions are met:

- `workers_per_replica` is set to a value > 1

a `CUDA_ERROR_OUT_OF_MEMORY` error is thrown for all `workers_per_replica - 1` workers that didn't have a chance of "reserving" the GPU's memory. By default, when loading up a model, all of the GPU's memory is pre-allocated. To avoid that, the GPU's memory usage has to be limited - e.g. by limiting how much GPU memory each worker is allowed to allocate (see the sketch after the checklist).

Checklist:
- [ ] `make test` and `make lint`
- [ ] `summary.md` (view in gitbook after merging)
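As a complement to enabling memory growth, another common way to keep several workers on one GPU is to cap the memory each worker process may allocate. This is a hedged sketch using the TensorFlow 2.x API, not necessarily the mechanism this PR uses, and the 4096 MiB limit is an illustrative value only:

```python
import tensorflow as tf

gpus = tf.config.list_physical_devices("GPU")
if gpus:
    # Cap this worker's share of the GPU so multiple workers can coexist on one card.
    # 4096 MiB is an example figure, not a value taken from this PR.
    tf.config.set_logical_device_configuration(
        gpus[0],
        [tf.config.LogicalDeviceConfiguration(memory_limit=4096)],
    )
```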