catch tf.errors.CancelledError for OOM #2504

Merged: 1 commit, May 4, 2023

njzjz (Member) commented May 2, 2023

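When a GPU allocation fails, TensorFlow sometimes raises CancelledError instead of ResourceExhaustedError. In the log below, the matmul op fails with RESOURCE_EXHAUSTED, the executor starts aborting, and the pending RecvAsync is cancelled, so session.run surfaces CancelledError: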

2023-05-02 14:39:26.504134: W tensorflow/tsl/framework/bfc_allocator.cc:497] *____*****____****************__________*****************************************___________________
2023-05-02 14:39:26.504163: W tensorflow/core/framework/op_kernel.cc:1830] OP_REQUIRES failed at matmul_op_impl.h:730 : RESOURCE_EXHAUSTED: OOM when allocating tensor with shape[165000,100,100] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
2023-05-02 14:39:26.504208: I tensorflow/core/common_runtime/executor.cc:1197] [/job:localhost/replica:0/task:0/device:GPU:0] (DEBUG INFO) Executor start aborting (this does not indicate an error and you can ignore this message): RESOURCE_EXHAUSTED: OOM when allocating tensor with shape[165000,100,100] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
         [[{{node load/gradients/MatMul_grad/MatMul_1}}]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info. This isn't available when running in Eager mode.

2023-05-02 14:39:26.504253: I tensorflow/core/common_runtime/executor.cc:1197] [/job:localhost/replica:0/task:0/device:GPU:0] (DEBUG INFO) Executor start aborting (this does not indicate an error and you can ignore this message): CANCELLED: RecvAsync is cancelled.
         [[{{node load/ProdVirialSeA/_33}}]] [type.googleapis.com/tensorflow.DerivedStatus='']
2023-05-02 14:39:26.504299: I tensorflow/core/common_runtime/executor.cc:1197] [/job:localhost/replica:0/task:0/device:CPU:0] (DEBUG INFO) Executor start aborting (this does not indicate an error and you can ignore this message): CANCELLED: RecvAsync is cancelled.
         [[{{node load/ProdVirialSeA/_33}}]]
         [[cluster_6_1/merge_oidx_1/_35]] [type.googleapis.com/tensorflow.DerivedStatus='']
Traceback (most recent call last):
  File "/home/njzjz/anaconda3/envs/pip/lib/python3.10/site-packages/tensorflow/python/client/session.py", line 1378, in _do_call
    return fn(*args)
  File "/home/njzjz/anaconda3/envs/pip/lib/python3.10/site-packages/tensorflow/python/client/session.py", line 1361, in _run_fn
    return self._call_tf_sessionrun(options, feed_dict, fetch_list,
  File "/home/njzjz/anaconda3/envs/pip/lib/python3.10/site-packages/tensorflow/python/client/session.py", line 1454, in _call_tf_sessionrun
    return tf_session.TF_SessionRun_wrapper(self._session, options, feed_dict,
tensorflow.python.framework.errors_impl.CancelledError: RecvAsync is cancelled.
         [[{{node load/ProdVirialSeA/_33}}]]

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/njzjz/anaconda3/envs/pip/bin/dp", line 8, in <module>
    sys.exit(main())
  File "/home/njzjz/anaconda3/envs/pip/lib/python3.10/site-packages/deepmd/entrypoints/main.py", line 649, in main
    make_model_devi(**dict_args)
  File "/home/njzjz/anaconda3/envs/pip/lib/python3.10/site-packages/deepmd/infer/model_devi.py", line 259, in make_model_devi
    devi = calc_model_devi(coord, box, atype, dp_models)
  File "/home/njzjz/anaconda3/envs/pip/lib/python3.10/site-packages/deepmd/infer/model_devi.py", line 186, in calc_model_devi
    ret = dp.eval(
  File "/home/njzjz/anaconda3/envs/pip/lib/python3.10/site-packages/deepmd/infer/deep_pot.py", line 322, in eval
    output = self._eval_func(self._eval_inner, numb_test, natoms)(
  File "/home/njzjz/anaconda3/envs/pip/lib/python3.10/site-packages/deepmd/infer/deep_pot.py", line 237, in eval_func
    return self.auto_batch_size.execute_all(
  File "/home/njzjz/anaconda3/envs/pip/lib/python3.10/site-packages/deepmd/utils/batch_size.py", line 191, in execute_all
    n_batch, result = self.execute(execute_with_batch_size, index, natoms)
  File "/home/njzjz/anaconda3/envs/pip/lib/python3.10/site-packages/deepmd/utils/batch_size.py", line 103, in execute
    n_batch, result = callable(
  File "/home/njzjz/anaconda3/envs/pip/lib/python3.10/site-packages/deepmd/utils/batch_size.py", line 169, in execute_with_batch_size
    return (end_index - start_index), callable(
  File "/home/njzjz/anaconda3/envs/pip/lib/python3.10/site-packages/deepmd/infer/deep_pot.py", line 474, in _eval_inner
    v_out = run_sess(self.sess, t_out, feed_dict=feed_dict_test)
  File "/home/njzjz/anaconda3/envs/pip/lib/python3.10/site-packages/deepmd/utils/sess.py", line 30, in run_sess
    return sess.run(*args, **kwargs)
  File "/home/njzjz/anaconda3/envs/pip/lib/python3.10/site-packages/tensorflow/python/client/session.py", line 968, in run
    result = self._run(None, fetches, feed_dict, options_ptr,
  File "/home/njzjz/anaconda3/envs/pip/lib/python3.10/site-packages/tensorflow/python/client/session.py", line 1191, in _run
    results = self._do_run(handle, final_targets, final_fetches,
  File "/home/njzjz/anaconda3/envs/pip/lib/python3.10/site-packages/tensorflow/python/client/session.py", line 1371, in _do_run
    return self._do_call(_run_fn, feeds, fetches, targets, options,
  File "/home/njzjz/anaconda3/envs/pip/lib/python3.10/site-packages/tensorflow/python/client/session.py", line 1397, in _do_call
    raise type(e)(node_def, op, message)  # pylint: disable=no-value-for-parameter
tensorflow.python.framework.errors_impl.CancelledError: Graph execution error:

RecvAsync is cancelled.
         [[{{node load/ProdVirialSeA/_33}}]]
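
For readers who want the idea in code, here is a minimal sketch of the handling the title describes. It is not the actual diff of this PR. `OutOfMemoryError`, `run_guarded`, `run_with_shrinking_batch`, and `FakeCancelledError` are hypothetical names chosen for illustration. The exception classes are passed in as a tuple so the sketch runs without TensorFlow; with TensorFlow installed, the tuple would be `(tf.errors.ResourceExhaustedError, tf.errors.CancelledError)`. The batch-halving loop is an assumption about how a caller might react to the wrapped error, loosely modelled on the `auto_batch_size` frames in the traceback.

```python
class OutOfMemoryError(Exception):
    """Hypothetical wrapper: the run failed because the device ran out of memory."""


def run_guarded(run, oom_like, *args, **kwargs):
    """Call run(*args, **kwargs), mapping OOM-like exceptions to OutOfMemoryError.

    With TensorFlow, oom_like would be
    (tf.errors.ResourceExhaustedError, tf.errors.CancelledError).
    """
    try:
        return run(*args, **kwargs)
    except oom_like as e:
        # Keep the original exception as __cause__ so the TF message is not lost.
        raise OutOfMemoryError(
            "The run was aborted, most likely because memory was exhausted."
        ) from e


def run_with_shrinking_batch(evaluate, n_frames, batch_size):
    """Evaluate n_frames frames, halving batch_size whenever a batch hits OOM."""
    results = []
    start = 0
    while start < n_frames:
        end = min(start + batch_size, n_frames)
        try:
            results.append(evaluate(start, end))
        except OutOfMemoryError:
            if batch_size == 1:
                raise  # cannot shrink any further
            batch_size //= 2
            continue  # retry the same start index with a smaller batch
        start = end
    return results, batch_size


class FakeCancelledError(Exception):
    """Stand-in for tf.errors.CancelledError so the sketch runs without TensorFlow."""


def fake_session_run(start, end):
    # Pretend that batches of more than 4 frames do not fit in memory.
    if end - start > 4:
        raise FakeCancelledError("RecvAsync is cancelled.")
    return end - start


def evaluate(start, end):
    return run_guarded(fake_session_run, (FakeCancelledError,), start, end)


sizes, final_batch = run_with_shrinking_batch(evaluate, n_frames=10, batch_size=16)
print(sizes, final_batch)
```

One caveat on the design: CancelledError can also be raised for reasons unrelated to memory, so treating it as OOM is a heuristic. Chaining with `from e` keeps the original error visible if the guess turns out to be wrong.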

@github-actions github-actions bot added the Python label May 2, 2023
codecov bot commented May 3, 2023

Codecov Report

Patch and project coverage have no change.

Comparison is base (16e9133) 74.39% compared to head (c172117) 74.40%.

Additional details and impacted files
@@           Coverage Diff           @@
##            devel    #2504   +/-   ##
=======================================
  Coverage   74.39%   74.40%           
=======================================
  Files         227      227           
  Lines       23401    23428   +27     
  Branches     1673     1680    +7     
=======================================
+ Hits        17410    17432   +22     
+ Misses       4895     4889    -6     
- Partials     1096     1107   +11     
Impacted Files Coverage Δ
deepmd/utils/sess.py 54.54% <0.00%> (ø)

... and 11 files with indirect coverage changes


@wanghan-iapcm wanghan-iapcm merged commit 45a8d1e into deepmodeling:devel May 4, 2023