tensorflow.python.framework.errors_impl.InvalidArgumentError: Cannot assign a device for operation #5
Thanks for your interest. The "HICO" in the messages is my mistake: I first evaluated VCL on the HICO-DET dataset and did not rename the variable "HICO" to "HOI". It is just a scope/variable name. According to your log, I guess the error occurs because your GPU device is named "XLA_GPU".
Thus, in your device list, there is no device "/device:GPU:0". I guess changing the device name to "/device:XLA_GPU:0" might solve your problem. Change line 66 in lib/models/train_Solver_VCOCO_MultiGPU.py
to
Or you can look up this issue in the TensorFlow repository. It is also OK to remove all "tf.device()" calls in lib/models/train_Solver_VCOCO_MultiGPU.py if you just use one GPU; it will then use the default device. If you have further problems, feel free to discuss them. |
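A minimal sketch of the device-name idea above, written in plain Python so that it does not depend on a particular TensorFlow build. `pick_device` is a hypothetical helper and not part of VCL. It assumes you already have the list of device names (in TensorFlow 1.x these could come from `device_lib.list_local_devices()` in `tensorflow.python.client`) and that names have the form "/device:TYPE:INDEX".

```python
def pick_device(available, preferred=("GPU", "XLA_GPU", "CPU")):
    """Return the first available device name, trying device types in `preferred` order."""
    for dev_type in preferred:
        for name in available:
            # Names look like "/device:GPU:0"; the device type is the middle field.
            parts = name.split(":")
            if len(parts) == 3 and parts[1] == dev_type:
                return name
    raise ValueError("no usable device in %r" % (available,))


# Example: a machine whose device list only exposes XLA devices.
names = ["/device:CPU:0", "/device:XLA_CPU:0", "/device:XLA_GPU:0"]
print(pick_device(names))  # prints /device:XLA_GPU:0
```

The returned string could then be passed to `tf.device(...)` instead of a hard-coded "/device:GPU:0". Whether training actually runs well on an "XLA_GPU" device is a separate question that this sketch does not answer.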
Thank you for your reply! I removed all "tf.device()" calls in lib/models/train_Solver_VCOCO_MultiGPU.py because I use one GPU. (Actually, I also tried other solutions, like changing line 66 in lib/models/train_Solver_VCOCO_MultiGPU.py.) But this time the GPU was not used; the CPU was used instead. |
I also met a similar problem, but I have forgotten the solution. I found that someone solved it like this in tensorflow/tensorflow#30748 (comment)
Hope this helps you. |
By the way, did you also remove "tf.device('/cpu:0')" in line 44? If so, your TensorFlow installation possibly has some problems. Try installing tensorflow-gpu==1.14.0 with conda. |
Thank you for your help! I installed tensorflow-gpu==1.14.0 with conda (I uninstalled the pip version) and used the original code without any changes.
I renamed /home/kogashi/miniconda3/cuda-10.1/nvvm/libdevice/libdevice.10.bc |
Do you have multiple CUDA installations? Have you defined the CUDA_DIR environment variable? From the messages in jax-ml/jax#989, it seems that TensorFlow cannot find the CUDA directory. Some people set CUDA_DIR, add a symlink (e.g. $ sudo ln -s /opt/cuda /usr/local/cuda-10.2), or set "XLA_FLAGS=--xla_gpu_cuda_data_dir=conda-env-path/lib/". |
Thank you very much! After I made a symbolic link for CUDA, it finally worked! |
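For the XLA_FLAGS option, a small sketch in plain Python of how the flag could be set from a launcher script. Both helpers are hypothetical and not part of VCL or TensorFlow. The check for `nvvm/libdevice` under the CUDA directory is an assumption based on the libdevice path mentioned earlier in this thread.

```python
import os


def with_cuda_data_dir(env, cuda_dir):
    """Return a copy of `env` whose XLA_FLAGS contains --xla_gpu_cuda_data_dir=cuda_dir."""
    new_env = dict(env)
    prefix = "--xla_gpu_cuda_data_dir="
    existing = new_env.get("XLA_FLAGS", "").split()
    # Drop any earlier setting of the same flag so that only the new value remains.
    kept = [flag for flag in existing if not flag.startswith(prefix)]
    new_env["XLA_FLAGS"] = " ".join(kept + [prefix + cuda_dir])
    return new_env


def has_libdevice(cuda_dir):
    """True if cuda_dir/nvvm/libdevice holds at least one libdevice*.bc file."""
    libdevice_dir = os.path.join(cuda_dir, "nvvm", "libdevice")
    if not os.path.isdir(libdevice_dir):
        return False
    return any(
        name.startswith("libdevice") and name.endswith(".bc")
        for name in os.listdir(libdevice_dir)
    )
```

A possible use is `os.environ.update(with_cuda_data_dir(os.environ, cuda_dir))` at the very top of the training script, presumably before TensorFlow is imported, after confirming that `has_libdevice(cuda_dir)` is true for the directory you pass.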
V-COCO converges at around iteration 300000 and HICO at around iteration 500000. The training time depends on your GPU. V-COCO needs less than 24 hours on a 2080Ti and HICO requires around 48 hours. If you decrease the learning rate on V-COCO more quickly, I guess it will converge earlier. On a 2080Ti, each iteration takes 0.2s on HICO. On a Titan XP, each iteration takes around 0.25-0.3s on HICO. If your training speed is still too slow after 1000 iterations, I guess there might be some problem. Yeah, it needs GPU memory because we input two images. All numbers above are based on the res50 backbone. |
Thank you for your reply! Currently I am training on the V-COCO dataset. It is slow: it still takes 2.107s per iteration after 1000 iterations.
Is this based on a single GPU? (I thought you probably used multiple GPUs.)
|
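To put the numbers above side by side, a quick back-of-the-envelope estimate in Python. It assumes the per-iteration time stays constant for the whole run and that 2.107 is in seconds; `training_hours` is just an illustrative helper.

```python
def training_hours(iterations, seconds_per_iteration):
    """Total wall-clock hours if every iteration takes the same time."""
    return iterations * seconds_per_iteration / 3600.0


# V-COCO: about 300000 iterations at the observed 2.107 s per iteration.
observed = training_hours(300000, 2.107)
# Per-iteration time implied by finishing 300000 iterations within 24 hours.
implied_seconds = 24 * 3600.0 / 300000

print("observed speed: %.1f hours" % observed)
print("24-hour budget: %.3f s per iteration" % implied_seconds)
```

At 2.107 s per iteration the run would take roughly 175 hours (about a week), whereas finishing within 24 hours implies roughly 0.29 s per iteration, so the observed speed is about seven times slower than the reported one.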
Yes, all the experiments are based on a single GPU, because I found that training with two GPUs has bugs and is slower. I also tested the code for issue #4 with a V100 last week. Here (https://github.com/zhihou7/VCL/files/5175383/test.txt) is the log. It is much faster than the experiment with the 2080Ti. Did you install scikit-image? I remember that the version of scikit-image seriously affects the speed. I use scikit-image 0.14.2. It might also be because the first run is slow. It is weird. |
Thank you for your comment! I had installed scikit-image, but the version was different, so I uninstalled the old one and installed scikit-image 0.14.2. |
In fact, I do not use scikit-image in my code. I just forgot to remove "import skimage". I'm not sure why the code runs slowly in some environments. |
Thank you for your comment! I found the reason for the slow training: other people are using the CPUs heavily on our server.
|
Thanks for your comment! I also face the problem that VCL uses a lot of CPU on some machines, while on other machines it looks normal. On our GPU cluster, I usually allocate 1 GPU and 4 CPUs, and VCL runs normally. It might also depend on the IO load. |
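A rough way to check whether a shared machine has CPU headroom before starting training, as a sketch using only the Python standard library. `free_cpus` is a hypothetical helper, and comparing its result against the 4 CPUs mentioned above is only a rule of thumb. `os.getloadavg()` is available on Unix only.

```python
import os


def free_cpus(load_1min, n_cpus):
    """Rough count of idle cores: total cores minus the 1-minute load average, floored at 0."""
    return max(0.0, n_cpus - load_1min)


if hasattr(os, "getloadavg"):  # not available on Windows
    load_1min = os.getloadavg()[0]
    n_cpus = os.cpu_count() or 1
    print("approx. idle cores: %.1f of %d" % (free_cpus(load_1min, n_cpus), n_cpus))
```

If the result is well below 4, slow iterations may come from CPU contention rather than from the model or the GPU.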
Unfortunately, our server's CPUs are always busy. I think I can probably load the data onto the GPU instead of the CPU? |
OK, I also want to implement this in PyTorch. But I cannot find suitable open-source PyTorch code, or I cannot reproduce the reported performance. Our core code is in VCL.py. The current implementation is worse. |
I see. After I finish a PyTorch implementation (probably not based on iCAN), I will contact you. But from what you say, it may be hard work. Haha! |
I looked at the code. Some components originate in TensorFlow (like the Res5 blocks), and PyTorch does not have those libraries. Maybe this is the reason why we can't reproduce the performance. Would you provide your core code VCL.py or other modules in a PyTorch version? |
Well, I have not begun to implement VCL in PyTorch. I plan to implement it based on https://github.com/ASMIftekhar/VSGNet or https://github.com/vt-vl-lab/DRG. DRG is based on iCAN, which is the base code of my released code. If we simply want to reimplement VCL in PyTorch, DRG (appearance-only branch) is possibly a good choice. But DRG is an ensemble of three models (very weird). I can only obtain around 12% mAP with the appearance-only model in DRG, far worse than reported. |
Thank you! I have some urgent tasks today, so I will look at the code you provided tomorrow! |
Dear sir,
Thank you for your work!
I tried to train VCL on V-COCO following the instructions.
I assigned only 1 GPU for training and got the error messages below. Would you help me solve this?
I don't know why: I am trying to train on V-COCO, but the error is about HICO.