tl;dr: the code works fine without nGraph; with nGraph enabled, it dies with the errors shown below.
Details:
I've been trying to get nGraph working with Google's DeepLab v3+, without any luck. The code runs inside a Docker container (the nvcr.io/nvidia/tensorflow:18.12-py3 image) on an NVIDIA DGX-2 (16 GPUs).
Versions:

```
Python 3.5.2 (default, Nov 12 2018, 13:43:14)
[GCC 5.4.0 20160609] on linux
TensorFlow version installed: 1.12.0 (unknown)
nGraph bridge built with: 1.12.0 (v1.12.0-0-ga6d8ffa)
```
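(For context, the version info was gathered inside the container with roughly the check suggested in the ngraph-bridge README; the print formatting in this sketch is mine, not necessarily what produced the exact strings above.)

```python
import tensorflow as tf

# tf.VERSION / tf.GIT_VERSION are the TF 1.x version attributes.
print('TensorFlow version installed: %s (%s)' % (tf.VERSION, tf.GIT_VERSION))

# Importing the bridge is what actually enables nGraph for graphs
# built after this point.
import ngraph_bridge
print(ngraph_bridge.__version__)
```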
The Docker container was started with the following command line:

```
nvidia-docker run -it \
    --rm \
    --shm-size=1g \
    --ulimit memlock=-1 \
    --ulimit stack=67108864 \
    --privileged=true \
    -v /raid/wingated:/raid/wingated \
    -v /home/wingated:/home/wingated \
    -v /mnt/pccfs:/mnt/pccfs \
    nvcr.io/nvidia/tensorflow:18.12-py3
```
Here are the errors. I have no idea how to diagnose this. :)
```
[snip]
INFO:tensorflow:Restoring parameters from /raid/wingated/cancer/deeplab_data/init_models/xception/model.ckpt
INFO:tensorflow:Running local_init_op.
INFO:tensorflow:Done running local_init_op.
INFO:tensorflow:Starting Session.
INFO:tensorflow:Saving checkpoint to path /raid/wingated/cancer/deeplab_data/logs/model.ckpt
INFO:tensorflow:Starting Queues.
INFO:tensorflow:Error reported to Coordinator: Node ConstantFolding/clone_1/scaled_clone_loss_recip in cluster 1064 has assigned device /job:localhost/replica:0/task:0/device:GPU:1 but another node with assigned device /job:localhost/replica:0/task:0/device:CPU:0 has already been seen in the same cluster
Traceback (most recent call last):
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/client/session.py", line 1334, in _do_call
    return fn(*args)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/client/session.py", line 1319, in _run_fn
    options, feed_dict, fetch_list, target_list, run_metadata)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/client/session.py", line 1407, in _call_tf_sessionrun
    run_metadata)
tensorflow.python.framework.errors_impl.InternalError: Node ConstantFolding/clone_1/scaled_clone_loss_recip in cluster 1064 has assigned device /job:localhost/replica:0/task:0/device:GPU:1 but another node with assigned device /job:localhost/replica:0/task:0/device:CPU:0 has already been seen in the same cluster

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/training/coordinator.py", line 297, in stop_on_exception
    yield
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/training/coordinator.py", line 495, in run
    self.run_loop()
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/training/supervisor.py", line 1034, in run_loop
    self._sv.global_step])
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/client/session.py", line 929, in run
    run_metadata_ptr)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/client/session.py", line 1152, in _run
    feed_dict_tensor, options, run_metadata)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/client/session.py", line 1328, in _do_run
    run_metadata)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/client/session.py", line 1348, in _do_call
    raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.InternalError: Node ConstantFolding/clone_1/scaled_clone_loss_recip in cluster 1064 has assigned device /job:localhost/replica:0/task:0/device:GPU:1 but another node with assigned device /job:localhost/replica:0/task:0/device:CPU:0 has already been seen in the same cluster
```
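For what it's worth, the node named in the error, ConstantFolding/clone_1/scaled_clone_loss_recip, is generated by Grappler's constant-folding rewrite, and the message says the bridge's clustering pass ended up with that CPU-assigned node and a GPU:1-assigned node in the same cluster (1064). One diagnostic I can think of, sketched below under the assumption that the Grappler rewrite is what introduces the mixed-device node, is turning constant folding off in the session config and seeing whether the clustering error disappears:

```python
import tensorflow as tf
from tensorflow.core.protobuf import rewriter_config_pb2

# Speculative diagnostic, not a confirmed fix: disable Grappler's
# constant-folding pass, which creates the ConstantFolding/* nodes
# named in the error above.
config = tf.ConfigProto()
config.graph_options.rewrite_options.constant_folding = (
    rewriter_config_pb2.RewriterConfig.OFF)

with tf.Session(config=config) as sess:
    pass  # build and run the deeplab training graph as usual
```

If that makes the error go away, it would at least narrow the problem down to the interaction between Grappler's rewrites and nGraph's cluster/device assignment.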