This repository has been archived by the owner on Jan 3, 2023. It is now read-only.

Broken with deeplabv3+ #427

Open
wingated opened this issue Feb 10, 2019 · 1 comment

Comments

@wingated

tl;dr: the code works fine without ngraph; with ngraph enabled, it dies with the errors shown below.

Details:

Been trying to get ngraph working with Google's deeplab v3+, without any luck. The code is being run inside a docker container (the nvcr.io/nvidia/tensorflow:18.12-py3 image) on an nvidia dgx2 (16 GPUs).

Versions:

Python 3.5.2 (default, Nov 12 2018, 13:43:14)
[GCC 5.4.0 20160609] on linux
TensorFlow version installed: 1.12.0 (unknown)
nGraph bridge built with: 1.12.0 (v1.12.0-0-ga6d8ffa)

The docker container was started with the following command line:

nvidia-docker run -it \
  --rm \
  --shm-size=1g \
  --ulimit memlock=-1 \
  --ulimit stack=67108864 \
  --privileged=true \
  -v /raid/wingated:/raid/wingated \
  -v /home/wingated:/home/wingated \
  -v /mnt/pccfs:/mnt/pccfs \
  nvcr.io/nvidia/tensorflow:18.12-py3

Here are the errors. I have no idea how to diagnose this. :)

[snip]
INFO:tensorflow:Restoring parameters from /raid/wingated/cancer/deeplab_data/init_models/xception/model.ckpt
INFO:tensorflow:Running local_init_op.
INFO:tensorflow:Done running local_init_op.
INFO:tensorflow:Starting Session.
INFO:tensorflow:Saving checkpoint to path /raid/wingated/cancer/deeplab_data/logs/model.ckpt
INFO:tensorflow:Starting Queues.
INFO:tensorflow:Error reported to Coordinator: Node ConstantFolding/clone_1/scaled_clone_loss_recip in cluster 1064 has assigned device /job:localhost/replica:0/task:0/device:GPU:1 but another node with assigned device /job:localhost/replica:0/task:0/device:CPU:0 has already been seen in the same cluster
Traceback (most recent call last):
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/client/session.py", line 1334, in _do_call
    return fn(*args)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/client/session.py", line 1319, in _run_fn
    options, feed_dict, fetch_list, target_list, run_metadata)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/client/session.py", line 1407, in _call_tf_sessionrun
    run_metadata)
tensorflow.python.framework.errors_impl.InternalError: Node ConstantFolding/clone_1/scaled_clone_loss_recip in cluster 1064 has assigned device /job:localhost/replica:0/task:0/device:GPU:1 but another node with assigned device /job:localhost/replica:0/task:0/device:CPU:0 has already been seen in the same cluster

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/training/coordinator.py", line 297, in stop_on_exception
    yield
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/training/coordinator.py", line 495, in run
    self.run_loop()
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/training/supervisor.py", line 1034, in run_loop
    self._sv.global_step])
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/client/session.py", line 929, in run
    run_metadata_ptr)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/client/session.py", line 1152, in _run
    feed_dict_tensor, options, run_metadata)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/client/session.py", line 1328, in _do_run
    run_metadata)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/client/session.py", line 1348, in _do_call
    raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.InternalError: Node ConstantFolding/clone_1/scaled_clone_loss_recip in cluster 1064 has assigned device /job:localhost/replica:0/task:0/device:GPU:1 but another node with assigned device /job:localhost/replica:0/task:0/device:CPU:0 has already been seen in the same cluster

@avijit-nervana
Contributor

nGraph won't work with TensorFlow built for GPU. It will only work with TensorFlow built for CPU.
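
One way to catch this mismatch before starting a training run is to check whether the installed TensorFlow was built with CUDA, and to load the bridge only on a CPU-only build. The following is a minimal sketch, not part of nGraph itself. `tf.test.is_built_with_cuda()` is an existing TensorFlow call; the helper names `tensorflow_built_with_cuda`, `should_enable_ngraph`, and `maybe_enable_ngraph` are hypothetical, and the module name `ngraph_bridge` is an assumption that may differ between bridge releases.

```python
import importlib


def tensorflow_built_with_cuda():
    """Return True/False for an importable TensorFlow, or None if it is missing."""
    try:
        tf = importlib.import_module("tensorflow")
    except ImportError:
        return None
    return bool(tf.test.is_built_with_cuda())


def should_enable_ngraph(built_with_cuda):
    """Enable the bridge only when TensorFlow is known to be a CPU-only build."""
    return built_with_cuda is False


def maybe_enable_ngraph():
    """Import the bridge on CPU-only builds; return True if it was loaded."""
    if not should_enable_ngraph(tensorflow_built_with_cuda()):
        return False
    try:
        # Assumption: the bridge is importable as `ngraph_bridge`.
        importlib.import_module("ngraph_bridge")
    except ImportError:
        return False
    return True


if __name__ == "__main__":
    print("nGraph enabled:", maybe_enable_ngraph())
```

Calling `maybe_enable_ngraph()` at the top of a training script, in place of an unconditional bridge import, would skip the bridge whenever TensorFlow reports a CUDA build, which is presumably the case for the GPU container image used in the report above.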
