bazel GPU build error with fatal error: external/nccl_archive/src/nccl.h: No such file or directory #327
Comments
Same error here with CUDA 8.0: ERROR: /root/.cache/bazel/_bazel_root/f8d1071c69ea316497c31e40fe01608c/external/org_tensorflow/tensorflow/contrib/nccl/BUILD:23:1: C++ compilation of rule '@org_tensorflow//tensorflow/contrib/nccl:python/ops/_nccl_ops.so' failed: crosstool_wrapper_driver_is_not_gcc failed: error executing command external/local_config_cuda/crosstool/clang/bin/crosstool_wrapper_driver_is_not_gcc -U_FORTIFY_SOURCE '-D_FORTIFY_SOURCE=1' -fstack-protector -fPIE -Wall -Wunused-but-set-parameter ... (remaining 77 argument(s) skipped): com.google.devtools.build.lib.shell.BadExitStatusException: Process exited with status 1. Any solutions? |
@kirilg, can you take a quick look at this issue? Thank you. |
To get around it, you can comment out the nccl dependency in tensorflow/tensorflow/contrib/BUILD (line 42, IIRC). |
Thanks, @jlertle |
Thanks @jlertle. |
Which line in tensorflow/tensorflow/contrib/BUILD is the nccl dependency? I can't find it. Thanks. |
65: "//tensorflow/contrib/nccl:nccl_py", I believe... |
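The workaround above (commenting out the nccl dependency in contrib/BUILD) can be sketched as a small script. This is a minimal sketch, not an official tool: the function name `comment_out_dep` is hypothetical, and the label is the one quoted in this thread. Because the line number differs between TensorFlow versions (42 vs. 65 above), the sketch matches on the label instead of a line number.

```python
# Label quoted in this thread; adjust if your TensorFlow version differs.
NCCL_DEP = "//tensorflow/contrib/nccl:nccl_py"


def comment_out_dep(build_text: str, label: str = NCCL_DEP) -> str:
    """Return build_text with every uncommented line naming `label` commented out.

    `comment_out_dep` is a hypothetical helper, not part of Bazel or TensorFlow.
    """
    out = []
    for line in build_text.splitlines(keepends=True):
        stripped = line.lstrip()
        # Only touch lines that contain the quoted label and are not comments yet.
        if f'"{label}"' in line and not stripped.startswith("#"):
            indent = line[: len(line) - len(stripped)]
            line = f"{indent}# {stripped}"
        out.append(line)
    return "".join(out)
```

To apply it, read the contents of tensorflow/contrib/BUILD, pass them through `comment_out_dep`, and write the result back. Running it twice is harmless, since lines that are already commented are skipped.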
It was moved into a Windows check, but the referenced path still fails to resolve during the Serving build on Ubuntu. Bazel stuff. |
I tried the script provided in #318; it works fine. |
If you comment it out, the examples fail. I managed to build it as well but... I get Here is the task that fails:
I verified that nccl_archive is fetched and unzipped correctly under the .cache dir and from what I see |
I solved it by removing the prefix /external/nccl_archive. |
@skonto removing prefix /external/nccl_archive in files nccl_ops.cc and |
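The prefix removal described above can be sketched as a text rewrite over a source file. This is only an illustration under assumptions: the helper name `strip_nccl_prefix` is hypothetical, the include form `external/nccl_archive/src/nccl.h` is taken from the error message in the issue title, and whether the shortened include then resolves depends on the include paths of your build.

```python
import re

# Matches lines such as: #include "external/nccl_archive/src/nccl.h"
_NCCL_INCLUDE = re.compile(
    r'^([ \t]*#[ \t]*include[ \t]*")external/nccl_archive/([^"]+")',
    re.MULTILINE,
)


def strip_nccl_prefix(source: str) -> str:
    """Drop the external/nccl_archive/ prefix from quoted #include lines.

    `strip_nccl_prefix` is a hypothetical helper; other includes are untouched.
    """
    return _NCCL_INCLUDE.sub(r"\1\2", source)
```

Apply it to the contents of each file that includes the header (nccl_ops.cc is the one named above) and write the result back.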
git clone https://github.com/NVIDIA/nccl.git sudo make install |
I used @perdasilva's fix and got it to compile, but the last 10 or so tests fail. When I run the syntaxnet/demo.sh script, it appears to recognize the GPU (K80) but then dies with a segmentation fault. I did not comment out nccl_py; instead I downloaded nccl.git as above and ran the commands as listed. It compiles, but the tests fail. Any idea why I'm getting this segmentation fault? UPDATE: YOU CAN IGNORE THESE ERRORS, do bazel test ... and then the normal installation as per the guide.
|
These are the tests that fail:
|
UPDATE: Looks like you can ignore the above updates. The segmentation fault is caused by the GPU running out of memory and crashing. You can fix this by replacing the lines in the Main() function of models/syntaxnet/syntaxnet/parser_eval.py with this:
Thanks to @utkrist (tensorflow/models#173). I will post a little step-by-step for those who are looking at this. I spent probably 2 days recompiling this (it takes about 2-3 hrs each time you compile it); it is really ridiculous that Google has not provided instructions on GPU integration. That being said, one more question that hopefully someone more advanced can help with: even though it runs with the GPU, it looks like it may error out or crash after completion?
|
One final question for anyone who might be an expert user out there. The annotator seems to work correctly through the script, but I noticed that it takes a long time to load the models and not that long to evaluate the sentence itself. Is there a way to keep the model loaded and then pass new lines of text to it? I know that you can pass it a file with multiple lines, but I want to keep the thread running in the background and be able to pass strings into it. Not sure if this is possible, but I would really appreciate anyone's guidance on the matter. Thanks! |
I'm still getting crashes because of CUDA out-of-memory errors. |
@perdasilva |
Hi @perdasilva, I have successfully compiled TensorFlow 1.8 with NCCL2. The problem is that if you used the deb package to install it on your system, the package contents are split across different locations:
However, the TensorFlow configuration takes only one path as the root of this content, which is why the compilation fails. To solve this you can:
|
Hi, I'm using the new tf_serving 1.7: ERROR: /home/ubuntu/.cache/bazel/_bazel_ubuntu/8bd6e58495e54c8cdf1fb8b1ed15e742/external/org_tensorflow/tensorflow/contrib/nccl/BUILD:23:1: error while parsing .d file: /home/ubuntu/.cache/bazel/_bazel_ubuntu/8bd6e58495e54c8cdf1fb8b1ed15e742/execroot/tf_serving/bazel-out/k8-opt/bin/external/org_tensorflow/tensorflow/contrib/nccl/_objs/python/ops/_nccl_ops_gpu/external/org_tensorflow/tensorflow/contrib/nccl/kernels/nccl_manager.pic.d (No such file or directory) |
Hi. I may be wrong about your case, but during the TensorFlow configure step before the build, it asks for the folder of your nccl, right? At that point, as I explained above, you must have that structure.
If you installed nccl from the deb, its files will be scattered around your system and will not follow the structure TensorFlow needs. |
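Before running ./configure, it can help to check whether a candidate NCCL folder is a single self-contained root. The sketch below is a hedged illustration: the helper name `missing_nccl_files` is hypothetical, and the expected entries (`include/nccl.h` and `lib/libnccl.so*`) are an assumption about the layout, since the thread does not show the exact structure. Adjust both defaults to match your NCCL and TensorFlow versions.

```python
from pathlib import Path


def missing_nccl_files(root, header="include/nccl.h", lib_glob="lib/libnccl.so*"):
    """Return the expected entries that are NOT present under `root`.

    Hypothetical helper; the default layout is an assumption, not taken
    from TensorFlow's configure script.
    """
    root = Path(root)
    missing = []
    if not (root / header).is_file():
        missing.append(header)
    # Any file matching the glob counts, e.g. libnccl.so or libnccl.so.2
    if not any(root.glob(lib_glob)):
        missing.append(lib_glob)
    return missing
```

An empty list means the folder looks usable as the single NCCL path; a non-empty list names what is missing, which is the typical result when the files were installed from the deb into separate system directories.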
thanks cyber, |
I built yesterday without any problem. I will check again |
Is it the latest 1.7 Serving? |
Hi again, I haven't had time to build it yet to check your case, but I can tell you that on my machine I have this file.
The BUILD file is inside this folder, as well as a symlink to my preferred nccl installation folder (generated by the TensorFlow ./configure). To get that you have to build TensorFlow... (warning: it can take several hours to build) |
Thanks @cyberwillis for giving your time to this. |
Hi... I am checking it right now, and yes, I understand what you said. The thing is, the configure script from TensorFlow is used just to register the environment variables that the bazel build needs, as you can see in the version 1.3 README under Install Prerequisites
|
yeah.. |
Tell me which version of CUDA and which version of NCCL you have, and anything else I need to replicate your environment. |
CUDA 9, and the latest nccl |
How did you set up your NCCL2? |
@cyberwillis
|
@cjhkeep However, if you install "NCCL2" using the ".deb" version, your nccl.h will be far away from where TensorFlow expects to find it! That's why I suggested installing from the tar file here . I was trying to diagnose the problem of our colleague above by asking how he had installed nccl on his machine. [UPDATED] |
Wouldn't the right approach be for TensorFlow to just look in the right directories? |
NVIDIA changes the locations of its packages from time to time (because they think it's funny) :) |
Closing - please see the latest Docker examples for bringing up a build environment. The GPU build addresses the NCCL dependency. |
@gautamvasudevan This is still a bug when trying to do macOS GPU builds. Since your Docker example doesn't work on Mac, I think this issue should still be open. |
Hi @praeclarum, I am sorry to see your question only now. Sadly, I am not using TensorFlow anymore, but I believe that since the TensorFlow Docker setup relies on the abstract install from Ubuntu, you can adapt it to the exact problem you are having. Can you post exactly what problem you are getting on your Mac? Another question: if you are using Docker on your Mac to build TF Serving, how do you pass the GPU through to the Docker Engine? X11 forwarding does not exist on macOS (Apple uses Quartz instead), so your GPU will never be recognized inside Docker because Apple does not formally allow it. I believe your only strategy is to translate the commands from the Dockerfile into commands for Homebrew. |
In case you really want to try making the GPU available inside Docker (macOS only), you can try XQuartz. Take a look at this Gist: https://gist.github.com/cschiewek/246a244ba23da8b9f0e7b11a68bf3285 |
We don't have any official support for macOS and nccl builds currently, though feel free to file a new issue specifically for macOS, we welcome any community support here! |
seems that now there's a |
@gatoatigrado thanks for the tip, that's exactly what I was looking for |
We are trying to build TensorFlow Serving 0.5.1 with TensorFlow 1.0.0@07bb8ea
Based on CUDA 7.5, cuDNN 5.
Bazel 0.4.4
I'm able to find nccl.h, but it can't be found during bazel build. Any suggestions? Thanks in advance.