
Error using nvidia-docker #11

Closed · danperazzo opened this issue Jul 26, 2019 · 11 comments

danperazzo (Author) commented Jul 26, 2019

Hello. I am trying to reproduce the results as shown, but when I execute sudo nvidia-docker run --rm --volume /:/host --workdir /host$PWD tf_colmap bash demo.sh, Docker returns an error. Below is the error I encountered:
docker: Error response from daemon: OCI runtime create failed: container_linux.go:345: starting container process caused "process_linux.go:430: container init caused \"process_linux.go:413: running prestart hook 1 caused \\\"error running hook: exit status 1, stdout: , stderr: exec command: [/usr/bin/nvidia-container-cli --load-kmods configure --ldconfig=@/sbin/ldconfig.real --device=all --compute --utility --require=cuda>=10.0 brand=tesla,driver>=384,driver<385 brand=tesla,driver>=410,driver<411 --pid=7506 /var/lib/docker/overlay2/74b368071c67140593255d9461eb525598dbbca0ab382047da530356351746c6/merged]\\\\nnvidia-container-cli: requirement error: unsatisfied condition: brand = tesla\\\\n\\\"\"": unknown.

bmild (Collaborator) commented Jul 26, 2019

Based on some quick googling, it sounds like it might be a driver error. What GPU are you using, and what version of the NVIDIA drivers (it should be shown in the top left when you run the nvidia-smi command)?

The Docker container is trying to run a CUDA 10.0 image. Based on the driver requirements for CUDA 10.0 in this table, it looks like you need NVIDIA drivers >= 410.48:
https://github.com/NVIDIA/nvidia-docker/wiki/CUDA#requirements
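
For a quick check, here is a hedged Python sketch (not from the original reply) that compares the installed driver version reported by nvidia-smi against the 410.48 minimum; it assumes nvidia-smi is on your PATH and supports the standard --query-gpu flags:

import subprocess

# Hedged sketch: read the installed NVIDIA driver version and compare it against
# the CUDA 10.0 minimum (410.48) from the table linked above.
MIN_DRIVER = (410, 48)
version = subprocess.check_output(
    ["nvidia-smi", "--query-gpu=driver_version", "--format=csv,noheader"],
    text=True,
).splitlines()[0].strip()
installed = tuple(int(p) for p in version.split(".")[:2])
print(version, "is OK for CUDA 10.0" if installed >= MIN_DRIVER else "is too old for CUDA 10.0")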

Your problem looks similar to these issues, if you need more information to debug:
NVIDIA/nvidia-docker#861
NVIDIA/nvidia-docker#931

Hope that helps!

danperazzo (Author) commented Jul 26, 2019

Thanks!! I have just updated the drivers and got rid of this error. However, I now get a segmentation fault; the error message is below. Again, thank you very much :))
demo.sh: line 24: 106 Segmentation fault (core dumped) cuda_renderer/cuda_renderer data/testscene/mpis_360 data/testscene/outputs/test_path.txt data/testscene/outputs/test_vid.mp4 360 .8 18

bmild (Collaborator) commented Jul 28, 2019

Hmm interesting. What GPU are you using?

Were the MPIs correctly generated and saved in the folder data/testscene/mpis_360? Check for the file data/testscene/mpis_360/mpi19/mpi.b. If it does not exist, the renderer will definitely segfault, and the issue was with generating the MPIs.

If the MPIs do exist, maybe there was not enough GPU memory for the renderer; it requires about 800MB. You could try ensuring this memory is available and then running the renderer command by itself in the Docker environment, like this (copy and paste as a single command in the terminal):
sudo nvidia-docker run --rm --volume /:/host --workdir /host$PWD tf_colmap cuda_renderer/cuda_renderer data/testscene/mpis_360 data/testscene/outputs/test_path.txt data/testscene/outputs/test_vid.mp4 360 .8 18
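
For reference, here is a hedged Python sketch (not part of the original reply) of those two checks, using the test-scene paths from this thread and assuming nvidia-smi is available:

import os
import subprocess

# Hedged sketch: confirm the MPI output exists and that roughly 800MB of GPU memory
# is free before rerunning the renderer.
mpi_file = "data/testscene/mpis_360/mpi19/mpi.b"
print("MPI file exists:", os.path.exists(mpi_file))

free_mib = int(subprocess.check_output(
    ["nvidia-smi", "--query-gpu=memory.free", "--format=csv,noheader,nounits"],
    text=True,
).splitlines()[0])
print("free GPU memory:", free_mib, "MiB",
      "(should be enough)" if free_mib >= 800 else "(likely too little for the ~800MB renderer)")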

danperazzo (Author) commented Jul 30, 2019

Hello, I have just checked and it did not render any MPIs. I am using an NVIDIA GeForce GTX 1050.

bmild (Collaborator) commented Jul 31, 2019

I saw in the error you posted:
IOError: File ./checkpoints/papermodel/checkpoint.meta does not exist.
Did you download the trained checkpoint (using bash download_data.sh)?

danperazzo (Author) commented:

I have just checked, and I do have this file (checkpoint.meta) in my project. I checked again and, apparently, there was an error with TensorFlow:

2019-07-31 14:24:09.655268: W tensorflow/core/framework/op_kernel.cc:1401] OP_REQUIRES failed at conv_ops_3d.cc:332 : Resource exhausted: OOM when allocating tensor with shape[1,32,360,480,32] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
Traceback (most recent call last):
File "imgs2mpis.py", line 82, in
args.numplanes, args.no_mpis, True, args.psvs)
File "imgs2mpis.py", line 53, in gen_mpis
mpis = run_inference(imgs, poses, mpi_bds, ibr_runner, num_planes, patched, disps=disps, psvs=psvs)
File "/host/home/daniel/LLFF/llff/inference/mpi_utils.py", line 156, in run_inference
mpi.generate(generator, num_planes)
File "/host/home/daniel/LLFF/llff/inference/mpi_utils.py", line 55, in generate

danperazzo (Author) commented:

Well, I have just discovered that there was insufficient GPU memory. My GPU has 4GB. How much memory did you have?

bmild (Collaborator) commented Jul 31, 2019

Ah yeah that's it - I always use GPUs with at least 8GB.

You should be able to make it run with 4GB by changing this line to patched = True (this will make the network compute the output in smaller patches).
And then by changing the argument valid=270 in this line to valid=120 (this controls the width/height of the patch computed by the network).
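
For illustration only, the two edits amount to something like the following; the exact files and lines are linked from the original comment and are not reproduced here, so everything except the names patched and valid and the two values is a hedged sketch:

# Hedged illustration, not the literal LLFF source:
patched = True   # was False: compute the network output patch-by-patch
valid = 120      # was valid=270: width/height of each patch the network computes

Smaller patches lower the peak GPU memory at the cost of extra computation time.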

These two changes make the demo.sh run using only about 2.4GB on my GPU.

danperazzo (Author) commented:

Alright, thanks a lot!!!! It worked :)) Do those changes affect only the processing time, or do they impact the final result?

bmild (Collaborator) commented Jul 31, 2019

Great! It will only slow down the processing time; the results should be the same.

danperazzo (Author) commented:

Alright, thanks!!!
