catch tf.errors.CancelledError for OOM #2504

njzjz · 2023-05-02T23:55:20Z

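When the GPU runs out of memory, TensorFlow sometimes raises tf.errors.CancelledError instead of tf.errors.ResourceExhaustedError, so the OOM handling never sees the error it expects. Example log: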

2023-05-02 14:39:26.504134: W tensorflow/tsl/framework/bfc_allocator.cc:497] *____*****____****************__________*****************************************___________________
2023-05-02 14:39:26.504163: W tensorflow/core/framework/op_kernel.cc:1830] OP_REQUIRES failed at matmul_op_impl.h:730 : RESOURCE_EXHAUSTED: OOM when allocating tensor with shape[165000,100,100] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
2023-05-02 14:39:26.504208: I tensorflow/core/common_runtime/executor.cc:1197] [/job:localhost/replica:0/task:0/device:GPU:0] (DEBUG INFO) Executor start aborting (this does not indicate an error and you can ignore this message): RESOURCE_EXHAUSTED: OOM when allocating tensor with shape[165000,100,100] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
         [[{{node load/gradients/MatMul_grad/MatMul_1}}]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info. This isn't available when running in Eager mode.

2023-05-02 14:39:26.504253: I tensorflow/core/common_runtime/executor.cc:1197] [/job:localhost/replica:0/task:0/device:GPU:0] (DEBUG INFO) Executor start aborting (this does not indicate an error and you can ignore this message): CANCELLED: RecvAsync is cancelled.
         [[{{node load/ProdVirialSeA/_33}}]] [type.googleapis.com/tensorflow.DerivedStatus='']
2023-05-02 14:39:26.504299: I tensorflow/core/common_runtime/executor.cc:1197] [/job:localhost/replica:0/task:0/device:CPU:0] (DEBUG INFO) Executor start aborting (this does not indicate an error and you can ignore this message): CANCELLED: RecvAsync is cancelled.
         [[{{node load/ProdVirialSeA/_33}}]]
         [[cluster_6_1/merge_oidx_1/_35]] [type.googleapis.com/tensorflow.DerivedStatus='']
Traceback (most recent call last):
  File "/home/njzjz/anaconda3/envs/pip/lib/python3.10/site-packages/tensorflow/python/client/session.py", line 1378, in _do_call
    return fn(*args)
  File "/home/njzjz/anaconda3/envs/pip/lib/python3.10/site-packages/tensorflow/python/client/session.py", line 1361, in _run_fn
    return self._call_tf_sessionrun(options, feed_dict, fetch_list,
  File "/home/njzjz/anaconda3/envs/pip/lib/python3.10/site-packages/tensorflow/python/client/session.py", line 1454, in _call_tf_sessionrun
    return tf_session.TF_SessionRun_wrapper(self._session, options, feed_dict,
tensorflow.python.framework.errors_impl.CancelledError: RecvAsync is cancelled.
         [[{{node load/ProdVirialSeA/_33}}]]

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/njzjz/anaconda3/envs/pip/bin/dp", line 8, in <module>
    sys.exit(main())
  File "/home/njzjz/anaconda3/envs/pip/lib/python3.10/site-packages/deepmd/entrypoints/main.py", line 649, in main
    make_model_devi(**dict_args)
  File "/home/njzjz/anaconda3/envs/pip/lib/python3.10/site-packages/deepmd/infer/model_devi.py", line 259, in make_model_devi
    devi = calc_model_devi(coord, box, atype, dp_models)
  File "/home/njzjz/anaconda3/envs/pip/lib/python3.10/site-packages/deepmd/infer/model_devi.py", line 186, in calc_model_devi
    ret = dp.eval(
  File "/home/njzjz/anaconda3/envs/pip/lib/python3.10/site-packages/deepmd/infer/deep_pot.py", line 322, in eval
    output = self._eval_func(self._eval_inner, numb_test, natoms)(
  File "/home/njzjz/anaconda3/envs/pip/lib/python3.10/site-packages/deepmd/infer/deep_pot.py", line 237, in eval_func
    return self.auto_batch_size.execute_all(
  File "/home/njzjz/anaconda3/envs/pip/lib/python3.10/site-packages/deepmd/utils/batch_size.py", line 191, in execute_all
    n_batch, result = self.execute(execute_with_batch_size, index, natoms)
  File "/home/njzjz/anaconda3/envs/pip/lib/python3.10/site-packages/deepmd/utils/batch_size.py", line 103, in execute
    n_batch, result = callable(
  File "/home/njzjz/anaconda3/envs/pip/lib/python3.10/site-packages/deepmd/utils/batch_size.py", line 169, in execute_with_batch_size
    return (end_index - start_index), callable(
  File "/home/njzjz/anaconda3/envs/pip/lib/python3.10/site-packages/deepmd/infer/deep_pot.py", line 474, in _eval_inner
    v_out = run_sess(self.sess, t_out, feed_dict=feed_dict_test)
  File "/home/njzjz/anaconda3/envs/pip/lib/python3.10/site-packages/deepmd/utils/sess.py", line 30, in run_sess
    return sess.run(*args, **kwargs)
  File "/home/njzjz/anaconda3/envs/pip/lib/python3.10/site-packages/tensorflow/python/client/session.py", line 968, in run
    result = self._run(None, fetches, feed_dict, options_ptr,
  File "/home/njzjz/anaconda3/envs/pip/lib/python3.10/site-packages/tensorflow/python/client/session.py", line 1191, in _run
    results = self._do_run(handle, final_targets, final_fetches,
  File "/home/njzjz/anaconda3/envs/pip/lib/python3.10/site-packages/tensorflow/python/client/session.py", line 1371, in _do_run
    return self._do_call(_run_fn, feeds, fetches, targets, options,
  File "/home/njzjz/anaconda3/envs/pip/lib/python3.10/site-packages/tensorflow/python/client/session.py", line 1397, in _do_call
    raise type(e)(node_def, op, message)  # pylint: disable=no-value-for-parameter
tensorflow.python.framework.errors_impl.CancelledError: Graph execution error:

RecvAsync is cancelled.
         [[{{node load/ProdVirialSeA/_33}}]]
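
For context, here is a minimal sketch of the idea in the title: treat a CancelledError like a ResourceExhaustedError and retry with a smaller batch. It is not the code changed in this PR. The two exception classes are stand-ins for tf.errors.ResourceExhaustedError and tf.errors.CancelledError so that the snippet runs without TensorFlow, and `OOM_ERRORS`, `run_with_shrinking_batch` and `fake_eval` are hypothetical names used only for illustration.

```python
# Stand-ins for tf.errors.ResourceExhaustedError and tf.errors.CancelledError,
# defined here so the sketch runs without TensorFlow installed.
class ResourceExhaustedError(Exception):
    pass


class CancelledError(Exception):
    pass


# Exceptions treated as "out of memory" (an assumption of this sketch).
OOM_ERRORS = (ResourceExhaustedError, CancelledError)


def run_with_shrinking_batch(func, batch_size):
    """Call func(batch_size), halving batch_size after each OOM-like error."""
    while True:
        try:
            return batch_size, func(batch_size)
        except OOM_ERRORS as e:
            if batch_size <= 1:
                # Nothing left to shrink: surface the failure to the caller.
                raise RuntimeError("OOM-like error even at batch size 1") from e
            batch_size //= 2


def fake_eval(batch_size):
    """Hypothetical workload that only fits 4 frames in memory."""
    if batch_size > 8:
        raise ResourceExhaustedError("OOM when allocating tensor")
    if batch_size > 4:
        # The case from the log above: OOM surfaces as a cancellation.
        raise CancelledError("RecvAsync is cancelled.")
    return [0.0] * batch_size


used, result = run_with_shrinking_batch(fake_eval, 32)
print(used, len(result))
```

One caveat on the design: a CancelledError can in principle have causes other than OOM, so treating it as OOM is a heuristic. The retry loop therefore stops at batch size 1 and re-raises with the original exception chained, rather than looping forever.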

codecov · 2023-05-03T00:06:55Z

Codecov Report

Patch and project coverage have no change.

Comparison is base (16e9133) 74.39% compared to head (c172117) 74.40%.

Additional details and impacted files

@@           Coverage Diff           @@
##            devel    #2504   +/-   ##
=======================================
  Coverage   74.39%   74.40%           
=======================================
  Files         227      227           
  Lines       23401    23428   +27     
  Branches     1673     1680    +7     
=======================================
+ Hits        17410    17432   +22     
+ Misses       4895     4889    -6     
- Partials     1096     1107   +11

Impacted Files	Coverage Δ
deepmd/utils/sess.py	`54.54% <0.00%> (ø)`

... and 11 files with indirect coverage changes


catch tf.errors.CancelledError for OOM

c172117

github-actions bot added the Python label May 2, 2023

wanghan-iapcm approved these changes May 4, 2023

wanghan-iapcm merged commit 45a8d1e into deepmodeling:devel May 4, 2023
