tensorflow.python.framework.errors_impl.InvalidArgumentError: Cannot assign a device for operation #5

Closed
kaenkogashi opened this issue Sep 7, 2020 · 20 comments

@kaenkogashi

kaenkogashi commented Sep 7, 2020

Dear sir,

Thank you for your work!

I am trying to train VCL on V-COCO, following the instructions:

Train a VCL on V-COCO
python tools/Train_VCL_ResNet_VCOCO.py --model VCL_union_multi_ml1_l05_t3_rew_aug5_3_new_VCOCO_test --num_iteration 400000

I assigned only one GPU for training and got the error messages below. Could you help me solve this?
I also don't understand why the error mentions HICO when I am training on V-COCO.

Traceback (most recent call last):
File "/home/kogashi/miniconda3/envs/VCL/lib/python3.7/site-packages/tensorflow/python/client/session.py", line 1356, in _do_call
return fn(*args)
File "/home/kogashi/miniconda3/envs/VCL/lib/python3.7/site-packages/tensorflow/python/client/session.py", line 1339, in _run_fn
self._extend_graph()
File "/home/kogashi/miniconda3/envs/VCL/lib/python3.7/site-packages/tensorflow/python/client/session.py", line 1374, in _extend_graph
tf_session.ExtendSession(self._session)
tensorflow.python.framework.errors_impl.InvalidArgumentError: Cannot assign a device for operation HICO_0/MatMul: {{node HICO_0/MatMul}}was explicitly assigned to /device:GPU:0 but available devices are [ /job:localhost/replica:0/task:0/device:CPU:0, /job:localhost/replica:0/task:0/device:CPU:1, /job:localhost/replica:0/task:0/device:CPU:10, /job:localhost/replica:0/task:0/device:CPU:11, /job:localhost/replica:0/task:0/device:CPU:12, /job:localhost/replica:0/task:0/device:CPU:13, /job:localhost/replica:0/task:0/device:CPU:14, /job:localhost/replica:0/task:0/device:CPU:15, /job:localhost/replica:0/task:0/device:CPU:2, /job:localhost/replica:0/task:0/device:CPU:3, /job:localhost/replica:0/task:0/device:CPU:4, /job:localhost/replica:0/task:0/device:CPU:5, /job:localhost/replica:0/task:0/device:CPU:6, /job:localhost/replica:0/task:0/device:CPU:7, /job:localhost/replica:0/task:0/device:CPU:8, /job:localhost/replica:0/task:0/device:CPU:9, /job:localhost/replica:0/task:0/device:XLA_CPU:0, /job:localhost/replica:0/task:0/device:XLA_GPU:0 ]. Make sure the device specification refers to a valid device.
[[HICO_0/MatMul]]

During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "tools/Train_VCL_ResNet_VCOCO.py", line 109, in <module>
sw.train_model(sess, args.max_iters)
File "/home/kogashi/VCL/tools/../lib/models/train_Solver_VCOCO_MultiGPU.py", line 153, in train_model
self.from_snapshot(sess)
File "/home/kogashi/VCL/tools/../lib/models/train_Solver_VCOCO.py", line 134, in from_snapshot
sess.run(tf.global_variables_initializer())
File "/home/kogashi/miniconda3/envs/VCL/lib/python3.7/site-packages/tensorflow/python/client/session.py", line 950, in run
run_metadata_ptr)
File "/home/kogashi/miniconda3/envs/VCL/lib/python3.7/site-packages/tensorflow/python/client/session.py", line 1173, in _run
feed_dict_tensor, options, run_metadata)
File "/home/kogashi/miniconda3/envs/VCL/lib/python3.7/site-packages/tensorflow/python/client/session.py", line 1350, in _do_run
run_metadata)
File "/home/kogashi/miniconda3/envs/VCL/lib/python3.7/site-packages/tensorflow/python/client/session.py", line 1370, in _do_call
raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.InvalidArgumentError: Cannot assign a device for operation HICO_0/MatMul: node HICO_0/MatMul (defined at /home/kogashi/VCL/tools/../lib/networks/ResNet50_VCOCO.py:150) was explicitly assigned to /device:GPU:0 but available devices are [ /job:localhost/replica:0/task:0/device:CPU:0, /job:localhost/replica:0/task:0/device:CPU:1, /job:localhost/replica:0/task:0/device:CPU:10, /job:localhost/replica:0/task:0/device:CPU:11, /job:localhost/replica:0/task:0/device:CPU:12, /job:localhost/replica:0/task:0/device:CPU:13, /job:localhost/replica:0/task:0/device:CPU:14, /job:localhost/replica:0/task:0/device:CPU:15, /job:localhost/replica:0/task:0/device:CPU:2, /job:localhost/replica:0/task:0/device:CPU:3, /job:localhost/replica:0/task:0/device:CPU:4, /job:localhost/replica:0/task:0/device:CPU:5, /job:localhost/replica:0/task:0/device:CPU:6, /job:localhost/replica:0/task:0/device:CPU:7, /job:localhost/replica:0/task:0/device:CPU:8, /job:localhost/replica:0/task:0/device:CPU:9, /job:localhost/replica:0/task:0/device:XLA_CPU:0, /job:localhost/replica:0/task:0/device:XLA_GPU:0 ]. Make sure the device specification refers to a valid device.
[[HICO_0/MatMul]]

Errors may have originated from an input operation.
Input Source operations connected to node HICO_0/MatMul:
IteratorGetNext (defined at /home/kogashi/VCL/tools/../lib/ult/ult.py:884)
HICO_0/Const (defined at /home/kogashi/VCL/tools/../lib/networks/ResNet50_VCOCO.py:148)

@zhihou7
Owner

zhihou7 commented Sep 7, 2020

Thanks for your interest.

The "HICO" in the message is my mistake. I first evaluated VCL on the HICO-DET dataset and did not rename the variable "HICO" to "HOI". It is just the scope/variable name.

According to your log, I guess the cause is that your GPU device is named "XLA_GPU". The available devices are:

 [ /job:localhost/replica:0/task:0/device:CPU:0, /job:localhost/replica:0/task:0/device:CPU:1, /job:localhost/replica:0/task:0/device:CPU:10, /job:localhost/replica:0/task:0/device:CPU:11, /job:localhost/replica:0/task:0/device:CPU:12, /job:localhost/replica:0/task:0/device:CPU:13, /job:localhost/replica:0/task:0/device:CPU:14, /job:localhost/replica:0/task:0/device:CPU:15, /job:localhost/replica:0/task:0/device:CPU:2, /job:localhost/replica:0/task:0/device:CPU:3, /job:localhost/replica:0/task:0/device:CPU:4, /job:localhost/replica:0/task:0/device:CPU:5, /job:localhost/replica:0/task:0/device:CPU:6, /job:localhost/replica:0/task:0/device:CPU:7, /job:localhost/replica:0/task:0/device:CPU:8, /job:localhost/replica:0/task:0/device:CPU:9, /job:localhost/replica:0/task:0/device:XLA_CPU:0, /job:localhost/replica:0/task:0/device:XLA_GPU:0 ]

Thus, in your device list, there is no device "/device:GPU:0". I guess renaming the device name to "/device:XLA_GPU:0" might solve your problem.

Change line 66 in lib/models/train_Solver_VCOCO_MultiGPU.py from

with tf.device('/gpu:%d' % gpu_idx):

to

with tf.device('/XLA_GPU:%d' % gpu_idx):

or
with tf.device('/device:XLA_GPU:%d' % gpu_idx):

You can find this issue in the TensorFlow issue tracker.

It is also OK to remove all "tf.device()" calls in lib/models/train_Solver_VCOCO_MultiGPU.py if you use just one GPU; TensorFlow will then use the default device.

If you have further problems, feel free to discuss them.
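
A minimal sketch of the check described above, assuming only the device names printed in the error message. The helper `pick_device` is hypothetical (it is not part of VCL or TensorFlow) and works on plain strings, so it runs without TensorFlow installed. It returns the requested GPU device only if that exact device is listed, and otherwise falls back to the CPU instead of failing when the session starts.

```python
def pick_device(available, gpu_idx=0):
    """Return a device string that is actually present in `available`.

    `available` is a list of full device names as printed in the error,
    e.g. '/job:localhost/replica:0/task:0/device:CPU:0'.
    """
    wanted = '/device:GPU:%d' % gpu_idx
    for name in available:
        # 'device:XLA_GPU:0' does not end with '/device:GPU:0', so it is skipped.
        if name.endswith(wanted):
            return wanted
    # Hypothetical fallback: use the first CPU instead of raising.
    return '/device:CPU:0'


# A shortened copy of the device list from the log above.
devices_from_log = [
    '/job:localhost/replica:0/task:0/device:CPU:0',
    '/job:localhost/replica:0/task:0/device:XLA_CPU:0',
    '/job:localhost/replica:0/task:0/device:XLA_GPU:0',
]
print(pick_device(devices_from_log))
```

With the list from the log, the function falls back to the CPU because no plain GPU device is present. In TensorFlow 1.x the real list can be obtained from `tensorflow.python.client.device_lib.list_local_devices()`, and `tf.ConfigProto(allow_soft_placement=True)` is another way to let TensorFlow choose an existing device when the requested one is missing.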

@kaenkogashi
Author

kaenkogashi commented Sep 8, 2020

Thank you for your reply!

I removed all "tf.device()" in lib/models/train_Solver_VCOCO_MultiGPU.py because I use one GPU. (Actually, I first tried the other solution, changing line 66 in lib/models/train_Solver_VCOCO_MultiGPU.py from
with tf.device('/gpu:%d' % gpu_idx):
to
with tf.device('/XLA_GPU:%d' % gpu_idx):
or
with tf.device('/device:XLA_GPU:%d' % gpu_idx):
but there were still GPU allocation errors.)

This time, however, the GPU was not used; training ran on the CPU instead.
I installed the tensorflow-gpu version with pip.
I don't know what to do. Sorry for the basic questions; I am not familiar with TensorFlow (I always use PyTorch).
If you come up with any other solutions, please let me know. Thank you very much!

@zhihou7
Owner

zhihou7 commented Sep 8, 2020

I also met a similar problem, but I have forgotten the solution. I found that someone solved it like this in tensorflow/tensorflow#30748 (comment):

I met the same problem on ubuntu 18.04, cuda 10.1 and Tensorflow 1.14.0. However, I uninstalled the pip version tensorflow using pip uninstall tensorflow-gpu and then use conda install -c anaconda tensorflow-gpu to install conda version, and it works for me. You can have a try.

Hope this helps.

@zhihou7
Owner

zhihou7 commented Sep 8, 2020

By the way, did you also remove "tf.device('/cpu:0')" in line 44? If so, your TensorFlow installation possibly has some problems. Try installing tensorflow-gpu==1.14.0 with conda.

@kaenkogashi
Author

kaenkogashi commented Sep 9, 2020

Thank you for your help!

I installed tensorflow-gpu==1.14.0 with conda (after uninstalling the pip version) and used the original code without any changes.
Then a CUDA "not found" error appeared:

tensorflow/core/framework/op_kernel.cc:1502] OP_REQUIRES failed at xla_ops.cc:463 : Not found: ./libdevice.compute_20.10.bc not found
Traceback (most recent call last):
File "/home/kogashi/miniconda3/envs/VCL/lib/python3.7/site-packages/tensorflow/python/client/session.py", line 1356, in _do_call
return fn(*args)
File "/home/kogashi/miniconda3/envs/VCL/lib/python3.7/site-packages/tensorflow/python/client/session.py", line 1341, in _run_fn
options, feed_dict, fetch_list, target_list, run_metadata)
File "/home/kogashi/miniconda3/envs/VCL/lib/python3.7/site-packages/tensorflow/python/client/session.py", line 1429, in _call_tf_sessionrun
run_metadata)
tensorflow.python.framework.errors_impl.NotFoundError: 2 root error(s) found.
(0) Not found: ./libdevice.compute_20.10.bc not found
[[{{node cluster_4_1/xla_compile}}]]
[[cluster_1_1/merge_oidx_4/_873]]
(1) Not found: ./libdevice.compute_20.10.bc not found
[[{{node cluster_4_1/xla_compile}}]]
0 successful operations.
0 derived errors ignored.
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "tools/Train_VCL_ResNet_VCOCO.py", line 109, in <module>
sw.train_model(sess, args.max_iters)
File "/home/kogashi/VCL/tools/../lib/models/train_Solver_VCOCO_MultiGPU.py", line 171, in train_model
train_op])
File "/home/kogashi/miniconda3/envs/VCL/lib/python3.7/site-packages/tensorflow/python/client/session.py", line 950, in run
run_metadata_ptr)
File "/home/kogashi/miniconda3/envs/VCL/lib/python3.7/site-packages/tensorflow/python/client/session.py", line 1173, in _run
feed_dict_tensor, options, run_metadata)
File "/home/kogashi/miniconda3/envs/VCL/lib/python3.7/site-packages/tensorflow/python/client/session.py", line 1350, in _do_run
run_metadata)
File "/home/kogashi/miniconda3/envs/VCL/lib/python3.7/site-packages/tensorflow/python/client/session.py", line 1370, in _do_call
raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.NotFoundError: 2 root error(s) found.
(0) Not found: ./libdevice.compute_20.10.bc not found
[[{{node cluster_4_1/xla_compile}}]]
[[cluster_1_1/merge_oidx_4/_873]]
(1) Not found: ./libdevice.compute_20.10.bc not found
[[{{node cluster_4_1/xla_compile}}]]
0 successful operations.
0 derived errors ignored.

I renamed /home/kogashi/miniconda3/cuda-10.1/nvvm/libdevice/libdevice.10.bc
to
/home/kogashi/miniconda3/cuda-10.1/nvvm/libdevice/libdevice.compute_20.10.bc
but the NotFoundError remains. Where is TensorFlow looking for this file?
I searched several websites but still can't find the answer. (I am using cuda-10.1; the current error is NotFoundError, but if cuda-10.1 is the wrong version, please let me know.)
Thank you very much!

@zhihou7
Owner

zhihou7 commented Sep 9, 2020

Do you have multiple CUDA installations? Have you defined the CUDA_DIR environment variable? From the messages in jax-ml/jax#989, it seems that TensorFlow cannot find the CUDA directory. People there tried setting CUDA_DIR, adding a symlink (e.g. $ sudo ln -s /opt/cuda /usr/local/cuda-10.2), or setting "XLA_FLAGS=--xla_gpu_cuda_data_dir=conda-env-path/lib/".
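
A rough sketch of how one might locate a usable CUDA directory before choosing between those options. It assumes the `nvvm/libdevice/` layout seen in the path quoted earlier in this thread; the helper names and the candidate directories are hypothetical and need adjusting to the local install.

```python
import glob
import os


def find_cuda_data_dir(candidates):
    """Return the first directory containing nvvm/libdevice/libdevice*.bc, or None."""
    for root in candidates:
        pattern = os.path.join(root, 'nvvm', 'libdevice', 'libdevice*.bc')
        if glob.glob(pattern):
            return root
    return None


def xla_flags_for(cuda_dir):
    """Build the XLA_FLAGS value quoted in the comment above."""
    return '--xla_gpu_cuda_data_dir=%s' % cuda_dir


# Candidate roots are assumptions, not paths taken from the thread.
candidates = [
    os.environ.get('CUDA_DIR', ''),
    '/usr/local/cuda',
    '/usr/local/cuda-10.1',
]
cuda_dir = find_cuda_data_dir([c for c in candidates if c])
if cuda_dir is None:
    print('no libdevice found under the candidate directories')
else:
    print('XLA_FLAGS=' + xla_flags_for(cuda_dir))
```

The printed value would have to be exported in the shell before the training script starts. Creating a symlink from a path TensorFlow already searches to the directory found here is the alternative mentioned above.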

@kaenkogashi
Author

kaenkogashi commented Sep 9, 2020

Thank you very much! After I created a symbolic link to the CUDA directory, it finally worked!
Could you tell me how many hours training takes on the V-COCO and HICO datasets?
Also, on my GPU your model does not seem to use much GPU compute, but it needs a lot of memory and CPU power.

@zhihou7
Owner

zhihou7 commented Sep 9, 2020

V-COCO converges at around iteration 300000, and HICO at around iteration 500000. The time the model takes depends on your GPU.

V-COCO needs less than 24 hours on a 2080Ti, and HICO requires around 48 hours. If you decrease the learning rate on V-COCO more quickly, I guess it will converge earlier.

On a 2080Ti, each iteration takes 0.2 s on HICO. On a Titan XP, each iteration takes around 0.25-0.3 s on HICO. If your training speed is still too slow after 1000 iterations, I guess something might be wrong.

Yeah, it needs GPU memory because we input two images.

All numbers above are based on res50 backbone.
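
As a quick sanity check on these figures, the per-iteration speed can be turned into hours. This is only a sketch: `estimate_hours` is a hypothetical helper, and it counts steady-state iterations only, so it ignores startup, data loading stalls, and checkpointing.

```python
def estimate_hours(iterations, seconds_per_iter):
    """Training time in hours from iteration count and speed (iterations only)."""
    return iterations * seconds_per_iter / 3600.0


# Figures quoted above for HICO with the res50 backbone.
print('2080Ti:   %.1f h' % estimate_hours(500000, 0.2))
print('Titan XP: %.1f-%.1f h' % (estimate_hours(500000, 0.25),
                                 estimate_hours(500000, 0.3)))
```

For 500000 iterations at 0.2 s each this gives roughly 28 hours, which is below the "around 48 hours" quoted for HICO. The difference presumably reflects everything the per-iteration figure does not capture.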

@kaenkogashi
Author

kaenkogashi commented Sep 9, 2020

Thank you for your reply!

Currently I am training on the V-COCO dataset. It is slow: it still takes 2.107 s per iteration after 1000 iterations.
My GPU has 16 GB of memory (Tesla V100). So should I use multi-GPU rather than single-GPU?

iter: 4910 / 400000, im_id: 347655, total loss: nan, lr: 0.010000, speed: 2.107 s/iter/iter

Is this based on a single GPU? (I thought you probably used multiple GPUs.)

V-COCO needs less 24 hours on 2080Ti and HICO requires around 48 hours.

@zhihou7
Owner

zhihou7 commented Sep 9, 2020

Yes, all the experiments are based on a single GPU, because I found that using two GPUs has bugs and is slower. I also tested the code for issue #4 with a V100 last week. Here (https://github.com/zhihou7/VCL/files/5175383/test.txt) is the log. It is much faster than the experiment with the 2080Ti.

Did you install scikit-image?

scikit-image 0.14.2

I remember that the version of scikit-image seriously affects the speed. I use 0.14.2.

It might also be because the first run is slow. It is weird.

@kaenkogashi
Author

Thank you for your comment!

I had scikit-image installed, but in a different version, so I uninstalled the old one and installed scikit-image 0.14.2.
I am restarting the training and will let you know the result later. Thank you very much!

@zhihou7
Owner

zhihou7 commented Sep 9, 2020

In fact, I do not use scikit-image in my code; I just forgot to remove "import skimage". I'm not sure why the code runs slowly in some environments.

@kaenkogashi
Author

kaenkogashi commented Sep 10, 2020

Thank you for your comment!

I found the reason for the slow training: other people are using the CPU heavily on our server, and VCL's program also uses the CPU heavily.

iter: 4910 / 400000, im_id: 347655, total loss: nan, lr: 0.010000, speed: 2.107 s/iter/iter
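
The log line quoted above has a regular shape, so it can be parsed to track speed and remaining time. The sketch below assumes exactly the format of that one line; the pattern and the helper `parse_log_line` are hypothetical and not part of VCL. It also exposes the `total loss: nan` value in that line, which the thread does not comment on.

```python
import math
import re

# Pattern written against the single log line quoted in this thread.
LOG_PATTERN = re.compile(
    r'iter:\s*(?P<iter>\d+)\s*/\s*(?P<total>\d+),\s*'
    r'im_id:\s*(?P<im_id>\d+),\s*'
    r'total loss:\s*(?P<loss>\S+),\s*'
    r'lr:\s*(?P<lr>\S+),\s*'
    r'speed:\s*(?P<speed>[\d.]+)\s*s/iter'
)


def parse_log_line(line):
    """Return the numeric fields of a training log line, or None if it does not match."""
    m = LOG_PATTERN.search(line)
    if m is None:
        return None
    return {
        'iter': int(m.group('iter')),
        'total': int(m.group('total')),
        'loss': float(m.group('loss')),  # float() also accepts 'nan'
        'lr': float(m.group('lr')),
        'speed': float(m.group('speed')),
    }


line = ('iter: 4910 / 400000, im_id: 347655, total loss: nan, '
        'lr: 0.010000, speed: 2.107 s/iter/iter')
record = parse_log_line(line)
remaining_hours = (record['total'] - record['iter']) * record['speed'] / 3600.0
print('loss is NaN: %s' % math.isnan(record['loss']))
print('remaining: %.1f h' % remaining_hours)
```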

@zhihou7
Owner

zhihou7 commented Sep 10, 2020

Thanks for your comment! I have also seen VCL use a lot of CPU on some machines, while on other machines it looks normal. On our GPU cluster, I usually allocate 1 GPU and 4 CPUs, and VCL runs normally. It might also depend on the IO load.
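
One way to check whether a shared server is too busy before starting a run is to compare the load average with the number of cores. This is a sketch for Unix-like systems only; the helper `cpu_is_contended` and the threshold of 1.0 load per core are assumptions, not something taken from VCL.

```python
import os


def cpu_is_contended(load_1min, cpu_count, threshold=1.0):
    """True if the 1-minute load average per core exceeds `threshold`."""
    return load_1min / float(cpu_count) > threshold


try:
    load_1min = os.getloadavg()[0]  # not available on every platform
except (AttributeError, OSError):
    load_1min = None

if load_1min is None:
    print('load average is not available on this system')
else:
    cores = os.cpu_count() or 1
    print('load per core: %.2f, contended: %s'
          % (load_1min / cores, cpu_is_contended(load_1min, cores)))
```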

@kaenkogashi
Author

Unfortunately, our server's CPU is always busy. Maybe I can load the data into GPU memory rather than CPU memory?
I will write a PyTorch implementation of VCL! (But I need to read the whole code first, haha.)
Anyway, I think we can close this topic. Thank you for your help!

@zhihou7
Owner

zhihou7 commented Sep 10, 2020

OK, I also want to implement this in PyTorch, but I have not found suitable open-source PyTorch code, or I cannot reproduce the reported performance. Our core code is in VCL.py. Current implementation is worse.

@kaenkogashi
Author

I see. After I finish the PyTorch implementation (probably not based on iCAN), I will contact you. But from what you say, it may be hard work, haha!

@kaenkogashi
Author

kaenkogashi commented Sep 10, 2020

@zhihou7

Current implementation is worse.

I looked at the code. Some components come from TensorFlow libraries (like the Res5 blocks), and PyTorch doesn't have them. Maybe this is why we can't reproduce the performance.

Could you provide your core code VCL.py or other modules in a PyTorch version?
I am going to implement VCL. Even if your PyTorch code can't reproduce the performance, it would still be faster for me than writing from scratch.
I hope to hear from you soon!

@zhihou7
Owner

zhihou7 commented Sep 10, 2020

Well, I have not begun to implement VCL in PyTorch. I plan to implement it based on https://github.com/ASMIftekhar/VSGNet or https://github.com/vt-vl-lab/DRG. DRG is based on iCAN, which is the base code of my released code. If we simply want to reimplement VCL in PyTorch, DRG (appearance-only branch) is possibly a good choice. But DRG is an ensemble of three models (very weird). I can only obtain around 12% mAP with the appearance-only model in DRG, far worse than reported.

@kaenkogashi
Author

Thank you! I have some urgent tasks today; I will look at the code you provided tomorrow!
