GRPC error #15

Open
errorer-max opened this issue Jun 1, 2023 · 3 comments

Comments

errorer-max commented Jun 1, 2023

Hi @mrahtz, thanks for making this repo! I think this algorithm is a milestone in deep reinforcement learning.
We installed all components according to the Pipfile and Pipfile.lock files. After we finished collecting preferences, a gRPC error occurred while training the reward predictor network. The first round of training ran without problems, but the error was reported on the second round.

Hardware resources: multi-core CPU, two 1080 Ti GPUs

Running environment: Python 3.7, TensorFlow 1.15

Command run inside the Pipenv environment:
python3 run.py train_policy_with_preferences EnduroNoFrameskip-v4 --n_envs 16 --render_episodes --n_initial_prefs 10

Process Process-21:
Traceback (most recent call last):
  File "/home/mxm/.local/share/virtualenvs/learning-from-human-preferences-master-b0FE2Hdz/lib/python3.7/site-packages/tensorflow_core/python/client/session.py", line 1365, in _do_call
    return fn(*args)
  File "/home/mxm/.local/share/virtualenvs/learning-from-human-preferences-master-b0FE2Hdz/lib/python3.7/site-packages/tensorflow_core/python/client/session.py", line 1350, in _run_fn
    target_list, run_metadata)
  File "/home/mxm/.local/share/virtualenvs/learning-from-human-preferences-master-b0FE2Hdz/lib/python3.7/site-packages/tensorflow_core/python/client/session.py", line 1443, in _call_tf_sessionrun
    run_metadata)
tensorflow.python.framework.errors_impl.AbortedError: From /job:train/replica:0/task:0:
The same RecvTensor (GrpcWorker) request was received twice. step_id: 57349849272042118 rendezvous_key: "/job:ps/replica:0/task:0/device:GPU:0;c324b44e509a7e70;/job:train/replica:0/task:0/device:GPU:0;edge_106_pred_0/c2/bias/read;0:0" request_id: -5783289113748899051
Additional GRPC error information:
{"created":"@1685585397.670607084","description":"Error received from peer","file":"external/grpc/src/core/lib/surface/call.cc","file_line":1039,"grpc_message":"The same RecvTensor (GrpcWorker) request was received twice. step_id: 57349849272042118 rendezvous_key: "/job:ps/replica:0/task:0/device:GPU:0;c324b44e509a7e70;/job:train/replica:0/task:0/device:GPU:0;edge_106_pred_0/c2/bias/read;0:0" request_id: -5783289113748899051","grpc_status":10}
[[{{node pred_0/c2/bias/read}}]]
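
For anyone trying to read the error above: the rendezvous_key appears to follow TensorFlow's usual layout of five semicolon-separated fields (source device, source incarnation, destination device, edge name, frame and iteration), and grpc_status 10 is gRPC's ABORTED code. The sketch below splits the key from this log into those fields. The field names and the `parse_rendezvous_key` helper are illustrative labels, not part of TensorFlow's API, and this only helps with reading the message, not with fixing the error.

```python
# Sketch: split a TensorFlow rendezvous key into its fields.
# Assumed layout (field names are illustrative labels, not a TensorFlow API):
#   src_device;src_incarnation;dst_device;edge_name;frame_id:iter_id
from typing import NamedTuple


class RendezvousKey(NamedTuple):
    src_device: str
    src_incarnation: str
    dst_device: str
    edge_name: str
    frame_and_iter: str


def parse_rendezvous_key(key: str) -> RendezvousKey:
    """Parse a rendezvous key string as printed in the error message."""
    parts = key.strip('"').split(";")
    if len(parts) != 5:
        raise ValueError(f"expected 5 ';'-separated fields, got {len(parts)}")
    return RendezvousKey(*parts)


# Key copied from the first error report in this thread.
key = ("/job:ps/replica:0/task:0/device:GPU:0;c324b44e509a7e70;"
       "/job:train/replica:0/task:0/device:GPU:0;"
       "edge_106_pred_0/c2/bias/read;0:0")
parsed = parse_rendezvous_key(key)
print("from:", parsed.src_device)
print("to:  ", parsed.dst_device)
print("edge:", parsed.edge_name)
```

Read this way, the failing transfer is the variable read `pred_0/c2/bias/read`, sent from the `ps` job to the `train` job, both on GPU:0.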

Owner

mrahtz commented Jun 1, 2023

Hmm, I'm sorry, I've never seen an error like that before, and I'm not sure what it means. It looks like it's coming from within TensorFlow, so my best guess is that it's something to do with your TensorFlow, CUDA and cuDNN installations, or your GPU drivers. The only suggestion that comes to mind is to try installing NVIDIA's version of TensorFlow (https://github.com/NVIDIA/tensorflow), which seems to have better compatibility with newer GPU drivers.
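
One way to gather the installation details mentioned above before swapping TensorFlow builds is a small report script. This is only a sketch: `describe_environment` is a hypothetical helper, and it assumes `tf.test.is_built_with_cuda()` and `tf.test.gpu_device_name()` are available in the installed TensorFlow (they are in 1.15). It does not report the CUDA, cuDNN or driver versions themselves.

```python
# Sketch: report the Python and TensorFlow setup the process actually sees.
# describe_environment is a hypothetical helper, not part of this repo.
import importlib
import platform


def describe_environment():
    """Return a dict describing Python and, if importable, TensorFlow."""
    info = {"python": platform.python_version(), "tensorflow": None}
    try:
        tf = importlib.import_module("tensorflow")
    except ImportError:
        # TensorFlow is not installed in this environment.
        return info
    info["tensorflow"] = tf.__version__
    # True if this TensorFlow build was compiled with CUDA support.
    info["built_with_cuda"] = tf.test.is_built_with_cuda()
    # Name of the first visible GPU, or an empty string if none is found.
    info["gpu_device_name"] = tf.test.gpu_device_name()
    return info


if __name__ == "__main__":
    for name, value in describe_environment().items():
        print(f"{name}: {value}")
```

Running it inside the same Pipenv environment as `run.py` shows whether the build and GPU visibility match what you expect.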

@errorer-max
Author

Thank you very much for your suggestion. I tried installing it (https://github.com/NVIDIA/tensorflow/tree/r1.15.2+nv20.06), but unfortunately the error still occurs. Before installing this version of TensorFlow, I removed all CUDA versions from the server to avoid conflicts between them.

2023-06-02 17:01:36.278538: W tensorflow/core/distributed_runtime/rpc/grpc_worker_service.cc:510] RecvTensor cancelled for 73633454253398810
Process Process-22:
Traceback (most recent call last):
  File "/home/mxm/.local/share/virtualenvs/rlhp-py38-ie9AYam1/lib/python3.8/site-packages/tensorflow_core/python/client/session.py", line 1365, in _do_call
    return fn(*args)
  File "/home/mxm/.local/share/virtualenvs/rlhp-py38-ie9AYam1/lib/python3.8/site-packages/tensorflow_core/python/client/session.py", line 1349, in _run_fn
    return self._call_tf_sessionrun(options, feed_dict, fetch_list,
  File "/home/mxm/.local/share/virtualenvs/rlhp-py38-ie9AYam1/lib/python3.8/site-packages/tensorflow_core/python/client/session.py", line 1441, in _call_tf_sessionrun
    return tf_session.TF_SessionRun_wrapper(self._session, options, feed_dict,
tensorflow.python.framework.errors_impl.AbortedError: From /job:train/replica:0/task:0:
The same RecvTensor (GrpcWorker) request was received twice. step_id: 73633454253398810 rendezvous_key: "/job:ps/replica:0/task:0/device:GPU:0;972bc9779fbc3f86;/job:train/replica:0/task:0/device:GPU:0;edge_216_pred_0/d2/bias/read;0:0" request_id: -8191416709793270049
Additional GRPC error information:
{"created":"@1685696496.277802028","description":"Error received from peer","file":"external/grpc/src/core/lib/surface/call.cc","file_line":1039,"grpc_message":"The same RecvTensor (GrpcWorker) request was received twice. step_id: 73633454253398810 rendezvous_key: "/job:ps/replica:0/task:0/device:GPU:0;972bc9779fbc3f86;/job:train/replica:0/task:0/device:GPU:0;edge_216_pred_0/d2/bias/read;0:0" request_id: -8191416709793270049","grpc_status":10}
[[{{node pred_0/d2/bias/read}}]]

To get through the whole process as quickly as possible, I ran 'python3 run.py train_policy_with_preferences EnduroNoFrameskip-v4 --n_envs 16 --render_episodes --n_initial_prefs 15'. That is to say, I only entered 15 preferences before starting to train the reward predictor network. Could this setting be causing the problem?

Owner

mrahtz commented Jun 10, 2023

Hey, sorry for the slow reply - busy week.

If that didn't work, sorry, I'm out of ideas. I don't think it should make any difference which order you run the commands in - this really sounds like some weird error in TensorFlow itself. My impression is that TensorFlow 1.x is really pretty unsupported these days, so it might just be that it's too old.
