GRPC error #15

errorer-max · 2023-06-01T07:19:21Z

Hi @mrahtz, thanks for creating this repo! I think this algorithm is a milestone in deep reinforcement learning.
We installed all components according to the Pipfile and Pipfile.lock files, and a gRPC error occurred while training the predictor network after preference collection had finished. The first round of training ran without problems, but the error was reported on the second round.

Hardware resources: multi-core CPU, two 1080 Ti GPUs

Running environment: Python 3.7, TensorFlow 1.15

Pipenv command:
python3 run.py train_policy_with_preferences EnduroNoFrameskip-v4 --n_envs 16 --render_episodes --n_initial_prefers 10

Process Process-21:
Traceback (most recent call last):
File "/home/mxm/.local/share/virtualenvs/learning-from-human-preferences-master-b0FE2Hdz/lib/python3.7/site-packages/tensorflow_core/python/client/session.py", line 1365, in _do_call
return fn(*args)
File "/home/mxm/.local/share/virtualenvs/learning-from-human-preferences-master-b0FE2Hdz/lib/python3.7/site-packages/tensorflow_core/python/client/session.py", line 1350, in _run_fn
target_list, run_metadata)
File "/home/mxm/.local/share/virtualenvs/learning-from-human-preferences-master-b0FE2Hdz/lib/python3.7/site-packages/tensorflow_core/python/client/session.py", line 1443, in _call_tf_sessionrun
run_metadata)
tensorflow.python.framework.errors_impl.AbortedError: From /job:train/replica:0/task:0:
The same RecvTensor (GrpcWorker) request was received twice. step_id: 57349849272042118 rendezvous_key: "/job:ps/replica:0/task:0/device:GPU:0;c324b44e509a7e70;/job:train/replica:0/task:0/device:GPU:0;edge_106_pred_0/c2/bias/read;0:0" request_id: -5783289113748899051
Additional GRPC error information:
{"created":"@1685585397.670607084","description":"Error received from peer","file":"external/grpc/src/core/lib/surface/call.cc","file_line":1039,"grpc_message":"The same RecvTensor (GrpcWorker) request was received twice. step_id: 57349849272042118 rendezvous_key: "/job:ps/replica:0/task:0/device:GPU:0;c324b44e509a7e70;/job:train/replica:0/task:0/device:GPU:0;edge_106_pred_0/c2/bias/read;0:0" request_id: -5783289113748899051","grpc_status":10}
[[{{node pred_0/c2/bias/read}}]]
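
As a reading aid for the log above, the rendezvous key in the error message can be split into its fields. The sketch below assumes the usual TensorFlow layout of five semicolon-separated parts (source device, source incarnation, destination device, edge name, frame and iteration); the `RendezvousKey` type and `parse_rendezvous_key` helper are hypothetical names introduced only for illustration and are not part of TensorFlow or of this repo.

```python
from typing import NamedTuple


class RendezvousKey(NamedTuple):
    """Fields of a rendezvous key (assumed layout, see the note above)."""
    src_device: str       # device that holds the tensor
    src_incarnation: str  # hex identifier of the sending worker instance
    dst_device: str       # device that requested the tensor
    edge_name: str        # graph edge, ending in the tensor being read
    frame_iter: str       # "frame_id:iteration_id"


def parse_rendezvous_key(key: str) -> RendezvousKey:
    """Split a rendezvous key on ';' and check that it has five fields."""
    parts = key.split(";")
    if len(parts) != 5:
        raise ValueError(f"expected 5 fields, got {len(parts)}: {key!r}")
    return RendezvousKey(*parts)


# Key copied from the first traceback in this thread.
key = (
    "/job:ps/replica:0/task:0/device:GPU:0;c324b44e509a7e70;"
    "/job:train/replica:0/task:0/device:GPU:0;"
    "edge_106_pred_0/c2/bias/read;0:0"
)
parsed = parse_rendezvous_key(key)
for field, value in parsed._asdict().items():
    print(f"{field}: {value}")
```

Read this way, the failing transfer is the read of `pred_0/c2/bias` being sent from the `ps` job to the `train` job, which matches the node named on the last line of the traceback. The `grpc_status` of 10 is gRPC's ABORTED code, consistent with the `AbortedError` that TensorFlow raises.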

mrahtz · 2023-06-01T10:38:05Z

Hmm, I'm sorry, I've never seen an error like that before, and I'm not sure what it means. It looks like it's coming from within TensorFlow, so my best guess is that it's something to do with your TensorFlow, CUDA and cuDNN installations, or your GPU drivers. The only suggestion that comes to mind is to try installing NVIDIA's version of TensorFlow https://github.com/NVIDIA/tensorflow which seems to have better compatibility with newer GPU drivers.
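
When checking the TensorFlow and CUDA side of things, it can help to record what the active virtualenv actually sees. The following is a minimal sketch, not something from this repo: `describe_environment` is a hypothetical helper, and it assumes that `tf.test.is_built_with_cuda()` and `tensorflow.python.client.device_lib.list_local_devices()` are available in the installed TensorFlow build (they are in 1.15). If TensorFlow is not importable, it reports only the Python version.

```python
import importlib
import platform


def describe_environment():
    """Collect version facts relevant to a TensorFlow/CUDA mismatch."""
    info = {"python": platform.python_version(), "tensorflow": None}
    try:
        tf = importlib.import_module("tensorflow")
    except ImportError:
        # TensorFlow is not installed in this environment.
        return info
    info["tensorflow"] = tf.__version__
    info["built_with_cuda"] = tf.test.is_built_with_cuda()
    # Lists the CPU and GPU devices TensorFlow can actually use.
    from tensorflow.python.client import device_lib
    info["devices"] = [d.name for d in device_lib.list_local_devices()]
    return info


if __name__ == "__main__":
    for key, value in describe_environment().items():
        print(f"{key}: {value}")
```

Running it inside the Pipenv environment (for example via `pipenv run python`) and pasting the output into the issue makes it easier to see whether both GPUs are visible and which TensorFlow build is in use.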

errorer-max · 2023-06-02T09:26:52Z

Thank you very much for your suggestion. I tried installing https://github.com/NVIDIA/tensorflow/tree/r1.15.2+nv20.06, but unfortunately the error still occurred. Before installing this version of TensorFlow, I removed all CUDA versions from the server to avoid conflicts between them.

2023-06-02 17:01:36.278538: W tensorflow/core/distributed_runtime/rpc/grpc_worker_service.cc:510] RecvTensor cancelled for 73633454253398810
Process Process-22:
Traceback (most recent call last):
File "/home/mxm/.local/share/virtualenvs/rlhp-py38-ie9AYam1/lib/python3.8/site-packages/tensorflow_core/python/client/session.py", line 1365, in _do_call
return fn(*args)
File "/home/mxm/.local/share/virtualenvs/rlhp-py38-ie9AYam1/lib/python3.8/site-packages/tensorflow_core/python/client/session.py", line 1349, in _run_fn
return self._call_tf_sessionrun(options, feed_dict, fetch_list,
File "/home/mxm/.local/share/virtualenvs/rlhp-py38-ie9AYam1/lib/python3.8/site-packages/tensorflow_core/python/client/session.py", line 1441, in _call_tf_sessionrun
return tf_session.TF_SessionRun_wrapper(self._session, options, feed_dict,
tensorflow.python.framework.errors_impl.AbortedError: From /job:train/replica:0/task:0:
The same RecvTensor (GrpcWorker) request was received twice. step_id: 73633454253398810 rendezvous_key: "/job:ps/replica:0/task:0/device:GPU:0;972bc9779fbc3f86;/job:train/replica:0/task:0/device:GPU:0;edge_216_pred_0/d2/bias/read;0:0" request_id: -8191416709793270049
Additional GRPC error information:
{"created":"@1685696496.277802028","description":"Error received from peer","file":"external/grpc/src/core/lib/surface/call.cc","file_line":1039,"grpc_message":"The same RecvTensor (GrpcWorker) request was received twice. step_id: 73633454253398810 rendezvous_key: "/job:ps/replica:0/task:0/device:GPU:0;972bc9779fbc3f86;/job:train/replica:0/task:0/device:GPU:0;edge_216_pred_0/d2/bias/read;0:0" request_id: -8191416709793270049","grpc_status":10}
[[{{node pred_0/d2/bias/read}}]]

To finish the whole process as quickly as possible, I ran 'python3 run.py train_policy_with_preferences EnduroNoFrameskip-v4 --n_envs 16 --render_episodes --n_initial_prefs 15'. That is, I entered only 15 preferences before training of the reward prediction network started. Could this setting cause the problem?

mrahtz · 2023-06-10T11:20:48Z

Hey, sorry for the slow reply - busy week.

If that didn't work, sorry, I'm out of ideas. I don't think it should make any difference which order you run the commands in - this really sounds like some weird error in TensorFlow itself. My impression is that TensorFlow 1.x is really pretty unsupported these days, so it might just be that it's too old.
