Broken with deeplabv3+ #427

wingated · 2019-02-10T00:07:42Z

tl;dr: the code works fine without ngraph; with ngraph enabled, it dies with the errors show below.

Details:

Been trying to get ngraph working with Google's deeplab v3+, without any luck. The code is being run inside a docker container (the nvcr.io/nvidia/tensorflow:18.12-py3 image) on an nvidia dgx2 (16 GPUs).

Versions:

Python 3.5.2 (default, Nov 12 2018, 13:43:14)
[GCC 5.4.0 20160609] on linux
TensorFlow version installed: 1.12.0 (unknown)
nGraph bridge built with: 1.12.0 (v1.12.0-0-ga6d8ffa)

The docker container was started with the following command line:

nvidia-docker run -it
--rm
--shm-size=1g
--ulimit memlock=-1
--ulimit stack=67108864
--privileged=true
-v /raid/wingated:/raid/wingated
-v /home/wingated:/home/wingated
-v /mnt/pccfs:/mnt/pccfs
nvcr.io/nvidia/tensorflow:18.12-py3

Here are the errors. I have no idea how to diagnose this. :)

[snip]
INFO:tensorflow:Restoring parameters from /raid/wingated/cancer/deeplab_data/init_models/xception/model.ckpt
INFO:tensorflow:Running local_init_op.
INFO:tensorflow:Done running local_init_op.
INFO:tensorflow:Starting Session.
INFO:tensorflow:Saving checkpoint to path /raid/wingated/cancer/deeplab_data/logs/model.ckpt
INFO:tensorflow:Starting Queues.
INFO:tensorflow:Error reported to Coordinator: Node ConstantFolding/clone_1/scaled_clone_loss_recip in cluster 1064 has assigned device /job:localhost/replica:0/task:0/device:GPU:1 but another node with assigned device /job:localhost/replica:0/task:0/device:CPU:0 has already been seen in the same cluster
Traceback (most recent call last):
File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/client/session.py", line 1334, in _do_call
return fn(*args)
File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/client/session.py", line 1319, in _run_fn
options, feed_dict, fetch_list, target_list, run_metadata)
File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/client/session.py", line 1407, in _call_tf_sessionrun
run_metadata)
tensorflow.python.framework.errors_impl.InternalError: Node ConstantFolding/clone_1/scaled_clone_loss_recip in cluster 1064 has assigned device /job:localhost/replica:0/task:0/device:GPU:1 but another node with assigned device /job:localhost/replica:0/task:0/device:CPU:0 has already been seen in the same cluster

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/training/coordinator.py", line 297, in stop_on_exception
yield
File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/training/coordinator.py", line 495, in run
self.run_loop()
File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/training/supervisor.py", line 1034, in run_loop
self._sv.global_step])
File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/client/session.py", line 929, in run
run_metadata_ptr)
File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/client/session.py", line 1152, in _run
feed_dict_tensor, options, run_metadata)
File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/client/session.py", line 1328, in _do_run
run_metadata)
File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/client/session.py", line 1348, in _do_call
raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.InternalError: Node ConstantFolding/clone_1/scaled_clone_loss_recip in cluster 1064 has assigned device /job:localhost/replica:0/task:0/device:GPU:1 but another node with assigned device /job:localhost/replica:0/task:0/device:CPU:0 has already been seen in the same cluster

avijit-nervana · 2019-02-13T13:56:43Z

nGraph won't work with TensorFlow built for GPU. It will only work with TensorFlow built for CPU.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Broken with deeplabv3+ #427

Broken with deeplabv3+ #427

wingated commented Feb 10, 2019

avijit-nervana commented Feb 13, 2019

Broken with deeplabv3+ #427

Broken with deeplabv3+ #427

Comments

wingated commented Feb 10, 2019

avijit-nervana commented Feb 13, 2019