bazel GPU build error with fatal error: external/nccl_archive/src/nccl.h: No such file or directory #327

Closed
cheyang opened this issue Feb 19, 2017 · 44 comments

Comments

@cheyang
Contributor

cheyang commented Feb 19, 2017

We are trying to build TensorFlow Serving 0.5.1 with TensorFlow 1.0.0@07bb8ea.

Environment: CUDA 7.5, cuDNN 5.
Bazel 0.4.4

cd serving && bazel build -c opt --config=cuda tensorflow_serving/...
ERROR: /root/.cache/bazel/_bazel_root/f8d1071c69ea316497c31e40fe01608c/external/org_tensorflow/tensorflow/contrib/nccl/BUILD:23:1: C++ compilation of rule '@org_tensorflow//tensorflow/contrib/nccl:python/ops/_nccl_ops.so' failed: crosstool_wrapper_driver_is_not_gcc failed: error executing command external/local_config_cuda/crosstool/clang/bin/crosstool_wrapper_driver_is_not_gcc -U_FORTIFY_SOURCE '-D_FORTIFY_SOURCE=1' -fstack-protector -fPIE -Wall -Wunused-but-set-parameter ... (remaining 76 argument(s) skipped): com.google.devtools.build.lib.shell.BadExitStatusException: Process exited with status 1.
In file included from external/org_tensorflow/tensorflow/contrib/nccl/kernels/nccl_manager.cc:15:0:
external/org_tensorflow/tensorflow/contrib/nccl/kernels/nccl_manager.h:23:44: fatal error: external/nccl_archive/src/nccl.h: No such file or directory
 #include "external/nccl_archive/src/nccl.h"
                                            ^
compilation terminated.
INFO: Elapsed time: 147.378s, Critical Path: 107.11s

I can find nccl.h on disk, but the compiler can't find it during the bazel build. Any suggestions? Thanks in advance.

find / -name nccl.h
/root/.cache/bazel/_bazel_root/5071e8dca1385fb776f72b33971bf157/external/nccl_archive/src/nccl.h
/root/.cache/bazel/_bazel_root/f8d1071c69ea316497c31e40fe01608c/external/nccl_archive/src/nccl.h
@tvkpz

tvkpz commented Feb 19, 2017

Same error here.

cuda 8.0
cudnn 5.1
bazel 4.2

ERROR: /root/.cache/bazel/_bazel_root/f8d1071c69ea316497c31e40fe01608c/external/org_tensorflow/tensorflow/contrib/nccl/BUILD:23:1: C++ compilation of rule '@org_tensorflow//tensorflow/contrib/nccl:python/ops/_nccl_ops.so' failed: crosstool_wrapper_driver_is_not_gcc failed: error executing command external/local_config_cuda/crosstool/clang/bin/crosstool_wrapper_driver_is_not_gcc -U_FORTIFY_SOURCE '-D_FORTIFY_SOURCE=1' -fstack-protector -fPIE -Wall -Wunused-but-set-parameter ... (remaining 77 argument(s) skipped): com.google.devtools.build.lib.shell.BadExitStatusException: Process exited with status 1.
In file included from external/org_tensorflow/tensorflow/contrib/nccl/kernels/nccl_manager.cc:15:0:
external/org_tensorflow/tensorflow/contrib/nccl/kernels/nccl_manager.h:23:44: fatal error: external/nccl_archive/src/nccl.h: No such file or directory
compilation terminated.

Any solutions?

@cheyang
Contributor Author

cheyang commented Feb 23, 2017

@kirilg, can you help take a quick look at this issue? Thank you.

@kinhunt

kinhunt commented Feb 24, 2017

same here
[screenshot: 2017-02-24 1 05 02]

@jlertle

jlertle commented Feb 24, 2017

To get around it, you can comment out the nccl dep in tensorflow/tensorflow/contrib/BUILD.

Line 42, IIRC.

@cheyang
Contributor Author

cheyang commented Feb 25, 2017

Thanks, @jlertle

@sskgit

sskgit commented Feb 25, 2017

Thanks @jlertle.

@cosastro

Which line in tensorflow/tensorflow/contrib/BUILD is the dep for nccl? I can't find it. Thanks.

@perdasilva
Contributor

perdasilva commented Mar 24, 2017

65: "//tensorflow/contrib/nccl:nccl_py",

I believe...
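
For anyone who would rather script this workaround than hunt for the line number (it has been reported as 42 and as 65 in this thread, so it clearly moves between revisions), here is a minimal Python sketch. It assumes the dep is still spelled "//tensorflow/contrib/nccl:nccl_py" on a line of its own; the helper name `comment_out_nccl_dep` and the demo BUILD contents are made up for illustration. Note that a later comment reports `ImportError: cannot import name nccl` in the examples after commenting the dep out, so treat this as a stopgap.

```python
import tempfile
from pathlib import Path

# Label quoted in this thread; the line it sits on differs between revisions.
NCCL_DEP = '"//tensorflow/contrib/nccl:nccl_py"'


def comment_out_nccl_dep(build_file):
    """Prefix every active line mentioning NCCL_DEP with '# '.

    Returns the number of lines changed. Lines that are already
    commented out are left alone, so the function is safe to re-run.
    """
    build_file = Path(build_file)
    changed = 0
    out = []
    for line in build_file.read_text().splitlines(keepends=True):
        stripped = line.lstrip()
        if NCCL_DEP in line and not stripped.startswith("#"):
            indent = line[: len(line) - len(stripped)]
            line = indent + "# " + stripped
            changed += 1
        out.append(line)
    build_file.write_text("".join(out))
    return changed


# Demo on a throwaway file with hypothetical, trimmed-down BUILD contents.
demo_build = Path(tempfile.mkdtemp()) / "BUILD"
demo_build.write_text(
    'py_library(\n'
    '    name = "contrib_py",\n'
    '    deps = [\n'
    '        "//tensorflow/contrib/layers:layers_py",\n'
    '        "//tensorflow/contrib/nccl:nccl_py",\n'
    '    ],\n'
    ')\n'
)
print(comment_out_nccl_dep(demo_build))
print(demo_build.read_text())
```

To use it on a real checkout, pass the path to tensorflow/tensorflow/contrib/BUILD instead of the demo file.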

@jlertle

jlertle commented Mar 24, 2017

The dep was moved into a Windows check, but the referenced path still fails to resolve during the Serving build on Ubuntu. Bazel stuff.

@cosastro

cosastro commented Mar 27, 2017

I tried the script provided in #318; it works fine.

@skonto

skonto commented Apr 10, 2017

If you comment it out, the examples fail. I managed to build it as well, but I get
ImportError: cannot import name nccl with an MNIST example.

Here is the task that fails:

>>>>> # @org_tensorflow//tensorflow/contrib/nccl:python/ops/_nccl_ops.so [action 'Compiling external/org_tensorflow/tensorflow/contrib/nccl/kernels/nccl_manager.cc']
(cd /root/.cache/bazel/_bazel_root/f8d1071c69ea316497c31e40fe01608c/execroot/serving && \
  exec env - \
    PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin \
  external/local_config_cuda/crosstool/clang/bin/crosstool_wrapper_driver_is_not_gcc -U_FORTIFY_SOURCE 
  '-D_FORTIFY_SOURCE=1' -fstack-protector -fPIE -Wall -Wunused-but-set-parameter -Wno-free-nonheap-object
   -fno-omit-frame-pointer -g0 -O2 -DNDEBUG -ffunction-sections -fdata-sections '-std=c++11' -MD 
   -MF bazel-out/local_linux-opt/bin/external/org_tensorflow/tensorflow/contrib/nccl/_objs/python/ops/_nccl_ops.so/external/org_tensorflow/tensorflow/contrib/nccl/kernels/nccl_manager.pic.d
    '-frandom-seed=bazel-out/local_linux-opt/bin/external/org_tensorflow/tensorflow/contrib/nccl/_objs/python/ops/_nccl_ops.so/external/org_tensorflow/tensorflow/contrib/nccl/kernels/nccl_manager.pic.o' -fPIC -DEIGEN_MPL2_ONLY 
  -iquote external/org_tensorflow -iquote bazel-out/local_linux-opt/genfiles/external/org_tensorflow -iquote external/bazel_tools 
  -iquote bazel-out/local_linux-opt/genfiles/external/bazel_tools -iquote external/nccl_archive 
  -iquote bazel-out/local_linux-opt/genfiles/external/nccl_archive -iquote external/local_config_cuda 
  -iquote bazel-out/local_linux-opt/genfiles/external/local_config_cuda -iquote external/protobuf 
  -iquote bazel-out/local_linux-opt/genfiles/external/protobuf -iquote external/eigen_archive 
  -iquote bazel-out/local_linux-opt/genfiles/external/eigen_archive -iquote external/local_config_sycl
   -iquote bazel-out/local_linux-opt/genfiles/external/local_config_sycl -isystem external/bazel_tools/tools/cpp/gcc3 
   -isystem external/local_config_cuda/cuda -isystem bazel-out/local_linux-opt/genfiles/external/local_config_cuda/cuda
    -isystem external/local_config_cuda/cuda/include -isystem bazel-out/local_linux-opt/genfiles/external/local_config_cuda/cuda/include 
    -isystem external/protobuf/src -isystem bazel-out/local_linux-opt/genfiles/external/protobuf/src -isystem external/eigen_archive 
    -isystem bazel-out/local_linux-opt/genfiles/external/eigen_archive -DEIGEN_AVOID_STL_ARRAY -Iexternal/gemmlowp -Wno-sign-compare -fno-exceptions '-DGOOGLE_CUDA=1' -msse3 -pthread -no-canonical-prefixes -Wno-builtin-macro-redefined '-D__DATE__="redacted"' '-D__TIMESTAMP__="redacted"' '-D__TIME__="redacted"' -fno-canonical-system-headers -c external/org_tensorflow/tensorflow/contrib/nccl/kernels/nccl_manager.cc -o bazel-out/local_linux-opt/bin/external/org_tensorflow/tensorflow/contrib/nccl/_objs/python/ops/_nccl_ops.so/external/org_tensorflow/tensorflow/contrib/nccl/kernels/nccl_manager.pic.o)
ERROR: /root/.cache/bazel/_bazel_root/f8d1071c69ea316497c31e40fe01608c/external/org_tensorflow/tensorflow/contrib/nccl/BUILD:23:1: C++ compilation of rule '@org_tensorflow//tensorflow/contrib/nccl:python/ops/_nccl_ops.so' failed: crosstool_wrapper_driver_is_not_gcc failed: error executing command external/local_config_cuda/crosstool/clang/bin/crosstool_wrapper_driver_is_not_gcc -U_FORTIFY_SOURCE '-D_FORTIFY_SOURCE=1' -fstack-protector -fPIE -Wall -Wunused-but-set-parameter ... (remaining 77 argument(s) skipped): com.google.devtools.build.lib.shell.BadExitStatusException: Process exited with status 1.
In file included from external/org_tensorflow/tensorflow/contrib/nccl/kernels/nccl_manager.cc:15:0:
external/org_tensorflow/tensorflow/contrib/nccl/kernels/nccl_manager.h:23:44: fatal error: external/nccl_archive/src/nccl.h:
 No such file or directory
 #include "external/nccl_archive/src/nccl.h"
                                            ^

I verified that nccl_archive is fetched and unzipped correctly under the .cache dir, and from what I see,
-iquote external/nccl_archive should be enough to include everything needed.

@skonto

skonto commented Apr 11, 2017

I solved it by removing the prefix /external/nccl_archive.

@Lukeisme

@skonto Removing the prefix /external/nccl_archive in the files nccl_ops.cc and
nccl_manager.h, which are in the folder tensorflow/tensorflow/contrib/nccl/kernels, fixes the issue.
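
The edit described above can be sketched as a small Python script. This is only an illustration: it assumes the include line reads exactly as in the compiler error (#include "external/nccl_archive/src/nccl.h") and that the shortened form "src/nccl.h" then resolves through the -iquote external/nccl_archive flag visible in the compile command earlier in the thread. The helper name `strip_nccl_prefix` and the demo file contents are hypothetical.

```python
import re
import tempfile
from pathlib import Path

# Matches: #include "external/nccl_archive/src/nccl.h"
INCLUDE_RE = re.compile(r'(#include\s+")external/nccl_archive/(src/nccl\.h")')


def strip_nccl_prefix(source):
    """Rewrite the include to "src/nccl.h"; return True if the file changed."""
    source = Path(source)
    text = source.read_text()
    new_text = INCLUDE_RE.sub(r"\1\2", text)
    if new_text != text:
        source.write_text(new_text)
    return new_text != text


# Demo on a throwaway stand-in for nccl_manager.h.
demo = Path(tempfile.mkdtemp()) / "nccl_manager.h"
demo.write_text(
    '#include <vector>\n'
    '#include "external/nccl_archive/src/nccl.h"\n'
)
print(strip_nccl_prefix(demo))
print(demo.read_text())
```

On a real checkout you would call it once for nccl_ops.cc and once for nccl_manager.h in tensorflow/tensorflow/contrib/nccl/kernels.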

@perdasilva
Contributor

perdasilva commented Jun 2, 2017

git clone https://github.com/NVIDIA/nccl.git
cd nccl/
make CUDA_HOME=/usr/local/cuda

sudo make install
sudo mkdir -p /usr/local/include/external/nccl_archive/src
sudo ln -s /usr/local/include/nccl.h /usr/local/include/external/nccl_archive/src/nccl.h

@tomodachi21

tomodachi21 commented Jun 29, 2017

I used @perdasilva's fix and was able to get it to compile, but the last 10 or so tests fail. When I run the syntaxnet/demo.sh script, it looks like it recognizes the GPU (K80) but then dies with a segmentation fault. I did not comment out nccl_py; instead I cloned the nccl.git repository above and executed the lines as listed. It compiles, but tests fail. Any idea why I'm getting this segmentation fault?

UPDATE: YOU CAN IGNORE THESE ERRORS, do bazel test ... and then the normal installation as per the guide.

2017-06-29 13:40:28.569661: W external/org_tensorflow/tensorflow/stream_executor/stream.cc:1550] attempting to perform BLAS operation using StreamExecutor without BLAS support
Traceback (most recent call last):
  File "/home/ubuntu/models/syntaxnet/bazel-bin/syntaxnet/parser_eval.runfiles/__main__/syntaxnet/parser_eval.py", line 161, in <module>
    tf.app.run()
  File "/home/ubuntu/models/syntaxnet/bazel-bin/syntaxnet/parser_eval.runfiles/org_tensorflow/tensorflow/python/platform/app.py", line 48, in run
    _sys.exit(main(_sys.argv[:1] + flags_passthrough))
  File "/home/ubuntu/models/syntaxnet/bazel-bin/syntaxnet/parser_eval.runfiles/__main__/syntaxnet/parser_eval.py", line 157, in main
    Eval(sess)
  File "/home/ubuntu/models/syntaxnet/bazel-bin/syntaxnet/parser_eval.runfiles/__main__/syntaxnet/parser_eval.py", line 130, in Eval
    parser.evaluation['documents'],
  File "/home/ubuntu/models/syntaxnet/bazel-bin/syntaxnet/parser_eval.runfiles/org_tensorflow/tensorflow/python/client/session.py", line 778, in run
    run_metadata_ptr)
  File "/home/ubuntu/models/syntaxnet/bazel-bin/syntaxnet/parser_eval.runfiles/org_tensorflow/tensorflow/python/client/session.py", line 982, in _run
    feed_dict_string, options, run_metadata)
  File "/home/ubuntu/models/syntaxnet/bazel-bin/syntaxnet/parser_eval.runfiles/org_tensorflow/tensorflow/python/client/session.py", line 1032, in _do_run
    target_list, options, run_metadata)
  File "/home/ubuntu/models/syntaxnet/bazel-bin/syntaxnet/parser_eval.runfiles/org_tensorflow/tensorflow/python/client/session.py", line 1052, in _do_call
    raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.InternalError: Blas SGEMM launch failed : a.shape=(1, 1040), b.shape=(1040, 64), m=1, n=64, k=1040
	 [[Node: evaluation/while/layer_0/MatMul = MatMul[T=DT_FLOAT, transpose_a=false, transpose_b=false, _device="/job:localhost/replica:0/task:0/gpu:0"](evaluation/while/concat, evaluation/while/layer_0/MatMul/Enter)]]
	 [[Node: evaluation/while/Switch_8/_95 = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/cpu:0", send_device="/job:localhost/replica:0/task:0/gpu:0", send_device_incarnation=1, tensor_name="edge_156_evaluation/while/Switch_8", tensor_type=DT_STRING, _device="/job:localhost/replica:0/task:0/cpu:0"](^_cloopevaluation/while/feature_4/shape/_61)]]

Caused by op u'evaluation/while/layer_0/MatMul', defined at:
  File "/home/ubuntu/models/syntaxnet/bazel-bin/syntaxnet/parser_eval.runfiles/__main__/syntaxnet/parser_eval.py", line 161, in <module>
    tf.app.run()
  File "/home/ubuntu/models/syntaxnet/bazel-bin/syntaxnet/parser_eval.runfiles/org_tensorflow/tensorflow/python/platform/app.py", line 48, in run
    _sys.exit(main(_sys.argv[:1] + flags_passthrough))
  File "/home/ubuntu/models/syntaxnet/bazel-bin/syntaxnet/parser_eval.runfiles/__main__/syntaxnet/parser_eval.py", line 157, in main
    Eval(sess)
  File "/home/ubuntu/models/syntaxnet/bazel-bin/syntaxnet/parser_eval.runfiles/__main__/syntaxnet/parser_eval.py", line 111, in Eval
    evaluation_max_steps=FLAGS.max_steps)
  File "/home/ubuntu/models/syntaxnet/bazel-bin/syntaxnet/parser_eval.runfiles/__main__/syntaxnet/structured_graph_builder.py", line 240, in AddEvaluation
    'features'], n['state'], use_average=self._use_averaging))
  File "/home/ubuntu/models/syntaxnet/bazel-bin/syntaxnet/parser_eval.runfiles/__main__/syntaxnet/structured_graph_builder.py", line 128, in _BuildSequence
    parallel_iterations=100)
  File "/home/ubuntu/models/syntaxnet/bazel-bin/syntaxnet/parser_eval.runfiles/org_tensorflow/tensorflow/python/ops/control_flow_ops.py", line 2623, in while_loop
    result = context.BuildLoop(cond, body, loop_vars, shape_invariants)
  File "/home/ubuntu/models/syntaxnet/bazel-bin/syntaxnet/parser_eval.runfiles/org_tensorflow/tensorflow/python/ops/control_flow_ops.py", line 2456, in BuildLoop
    pred, body, original_loop_vars, loop_vars, shape_invariants)
  File "/home/ubuntu/models/syntaxnet/bazel-bin/syntaxnet/parser_eval.runfiles/org_tensorflow/tensorflow/python/ops/control_flow_ops.py", line 2406, in _BuildLoop
    body_result = body(*packed_vars_for_body)
  File "/home/ubuntu/models/syntaxnet/bazel-bin/syntaxnet/parser_eval.runfiles/__main__/syntaxnet/structured_graph_builder.py", line 106, in Advance
    return_average=use_average)['logits']
  File "/home/ubuntu/models/syntaxnet/bazel-bin/syntaxnet/parser_eval.runfiles/__main__/syntaxnet/graph_builder.py", line 353, in _BuildNetwork
    name='layer_%d' % i)
  File "/home/ubuntu/models/syntaxnet/bazel-bin/syntaxnet/parser_eval.runfiles/org_tensorflow/tensorflow/python/ops/nn_impl.py", line 263, in relu_layer
    xw_plus_b = nn_ops.bias_add(math_ops.matmul(x, weights), biases)
  File "/home/ubuntu/models/syntaxnet/bazel-bin/syntaxnet/parser_eval.runfiles/org_tensorflow/tensorflow/python/ops/math_ops.py", line 1801, in matmul
    a, b, transpose_a=transpose_a, transpose_b=transpose_b, name=name)
  File "/home/ubuntu/models/syntaxnet/bazel-bin/syntaxnet/parser_eval.runfiles/org_tensorflow/tensorflow/python/ops/gen_math_ops.py", line 1263, in _mat_mul
    transpose_b=transpose_b, name=name)
  File "/home/ubuntu/models/syntaxnet/bazel-bin/syntaxnet/parser_eval.runfiles/org_tensorflow/tensorflow/python/framework/op_def_library.py", line 768, in apply_op
    op_def=op_def)
  File "/home/ubuntu/models/syntaxnet/bazel-bin/syntaxnet/parser_eval.runfiles/org_tensorflow/tensorflow/python/framework/ops.py", line 2336, in create_op
    original_op=self._default_original_op, op_def=op_def)
  File "/home/ubuntu/models/syntaxnet/bazel-bin/syntaxnet/parser_eval.runfiles/org_tensorflow/tensorflow/python/framework/ops.py", line 1228, in __init__
    self._traceback = _extract_stack()

InternalError (see above for traceback): Blas SGEMM launch failed : a.shape=(1, 1040), b.shape=(1040, 64), m=1, n=64, k=1040
	 [[Node: evaluation/while/layer_0/MatMul = MatMul[T=DT_FLOAT, transpose_a=false, transpose_b=false, _device="/job:localhost/replica:0/task:0/gpu:0"](evaluation/while/concat, evaluation/while/layer_0/MatMul/Enter)]]
	 [[Node: evaluation/while/Switch_8/_95 = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/cpu:0", send_device="/job:localhost/replica:0/task:0/gpu:0", send_device_incarnation=1, tensor_name="edge_156_evaluation/while/Switch_8", tensor_type=DT_STRING, _device="/job:localhost/replica:0/task:0/cpu:0"](^_cloopevaluation/while/feature_4/shape/_61)]]

INFO:tensorflow:Total processed documents: 0
INFO:tensorflow:Read 0 documents
syntaxnet/demo.sh: line 56:  5721 Segmentation fault      (core dumped) $PARSER_EVAL --input=$INPUT_FORMAT --output=stdout-conll --hidden_layer_sizes=64 --arg_prefix=brain_tagger --graph_builder=structured --task_context=$MODEL_DIR/context.pbtxt --model_path=$MODEL_DIR/tagger-params --slim_model --batch_size=1024 --alsologtostderr
      5722                       (core dumped) | $PARSER_EVAL --input=stdin-conll --output=stdout-conll --hidden_layer_sizes=512,512 --arg_prefix=brain_parser --graph_builder=structured --task_context=$MODEL_DIR/context.pbtxt --model_path=$MODEL_DIR/parser-params --slim_model --batch_size=1024 --alsologtostderr
      5723                       (core dumped) | bazel-bin/syntaxnet/conll2tree --task_context=$MODEL_DIR/context.pbtxt --alsologtostderr

@tomodachi21

These are the tests that fail:

INFO: Elapsed time: 3887.699s, Critical Path: 336.62s
//syntaxnet:arc_standard_transitions_test                                PASSED in 0.1s
//syntaxnet:binary_segment_state_test                                    PASSED in 0.1s
//syntaxnet:binary_segment_transitions_test                              PASSED in 0.1s
//syntaxnet:char_ngram_string_extractor_test                             PASSED in 0.1s
//syntaxnet:char_properties_test                                         PASSED in 0.1s
//syntaxnet:char_shift_transitions_test                                  PASSED in 0.1s
//syntaxnet:head_transitions_test                                        PASSED in 0.1s
//syntaxnet:label_transitions_test                                       PASSED in 0.1s
//syntaxnet:morphology_label_set_test                                    PASSED in 0.1s
//syntaxnet:once_transitions_test                                        PASSED in 0.1s
//syntaxnet:parser_features_test                                         PASSED in 0.1s
//syntaxnet:segmenter_utils_test                                         PASSED in 0.0s
//syntaxnet:sentence_features_test                                       PASSED in 0.1s
//syntaxnet:shared_store_test                                            PASSED in 0.3s
//syntaxnet:tagger_transitions_test                                      PASSED in 0.1s
//syntaxnet/util:check_test                                              PASSED in 1.7s
//syntaxnet/util:registry_test                                           PASSED in 1.7s
//syntaxnet:whole_sentence_features_test                                 PASSED in 0.1s
//util/utf8:unicodetext_unittest                                         PASSED in 0.1s
//syntaxnet:beam_reader_ops_test                                         FAILED in 19.3s
  /tmp/bazeltemp/_bazel_root/496ef57d77987a9f471821c181b6cf0f/execroot/__main__/bazel-out/local_linux-opt/testlogs/syntaxnet/beam_reader_ops_test/test.log
//syntaxnet:graph_builder_test                                           FAILED in 14.9s
  /tmp/bazeltemp/_bazel_root/496ef57d77987a9f471821c181b6cf0f/execroot/__main__/bazel-out/local_linux-opt/testlogs/syntaxnet/graph_builder_test/test.log
//syntaxnet:lexicon_builder_test                                         FAILED in 3.3s
  /tmp/bazeltemp/_bazel_root/496ef57d77987a9f471821c181b6cf0f/execroot/__main__/bazel-out/local_linux-opt/testlogs/syntaxnet/lexicon_builder_test/test.log
//syntaxnet:parser_trainer_test                                          FAILED in 14.2s
  /tmp/bazeltemp/_bazel_root/496ef57d77987a9f471821c181b6cf0f/execroot/__main__/bazel-out/local_linux-opt/testlogs/syntaxnet/parser_trainer_test/test.log
//syntaxnet:reader_ops_test                                              FAILED in 8.5s
  /tmp/bazeltemp/_bazel_root/496ef57d77987a9f471821c181b6cf0f/execroot/__main__/bazel-out/local_linux-opt/testlogs/syntaxnet/reader_ops_test/test.log
//syntaxnet:text_formats_test                                            FAILED in 5.6s
  /tmp/bazeltemp/_bazel_root/496ef57d77987a9f471821c181b6cf0f/execroot/__main__/bazel-out/local_linux-opt/testlogs/syntaxnet/text_formats_test/test.log

Executed 25 out of 25 tests: 19 tests pass and 6 fail locally.
There were tests whose specified size is too big. Use the --test_verbose_timeout_warnings command line option to see which ones these are.

@tomodachi21

UPDATE: Looks like you can ignore the above updates. The segmentation fault is caused by the GPU running out of memory. You can fix this by replacing the lines in the main() function of models/syntaxnet/syntaxnet/parser_eval.py with this:

gpu_opt = tf.GPUOptions(allow_growth=True)
with tf.Session(config=tf.ConfigProto(gpu_options=gpu_opt)) as sess:
    Eval(sess)

Thanks to @utkrist tensorflow/models#173

I will post a short step-by-step for those who are looking at this. I spent probably 2 days recompiling this (it takes about 2-3 hours each time you compile it); it's really ridiculous that Google has not provided instructions on GPU integration. That being said, one more question that hopefully someone more advanced can help with.
tensorflow/models#173

Even though it runs with the GPU it looks like it may error out or crash after completion?

INFO:tensorflow:Processed 1 documents
INFO:tensorflow:Total processed documents: 1
INFO:tensorflow:num correct tokens: 0
INFO:tensorflow:total tokens: 6
INFO:tensorflow:Seconds elapsed in evaluation: 1.30, eval metric: 0.00%
INFO:tensorflow:Processed 1 documents
INFO:tensorflow:Total processed documents: 1
INFO:tensorflow:num correct tokens: 1
INFO:tensorflow:total tokens: 6
INFO:tensorflow:Seconds elapsed in evaluation: 1.59, eval metric: 16.67%
INFO:tensorflow:Read 1 documents
Input: why are you looking at me
Parse:
looking VBG ROOT
 +-- why WRB advmod
 +-- are VBP aux
 +-- you PRP nsubj
 +-- at IN prep
     +-- me PRP pobj
syntaxnet/demo.sh: line 56: 32245 Segmentation fault      (core dumped) $PARSER_EVAL --input=$INPUT_FORMAT --output=stdout-conll --hidden_layer_sizes=64 --arg_prefix=brain_tagger --graph_builder=structured --task_context=$MODEL_DIR/context.pbtxt --model_path=$MODEL_DIR/tagger-params --slim_model --batch_size=1024 --alsologtostderr
     32246                       (core dumped) | $PARSER_EVAL --input=stdin-conll --output=stdout-conll --hidden_layer_sizes=512,512 --arg_prefix=brain_parser --graph_builder=structured --task_context=$MODEL_DIR/context.pbtxt --model_path=$MODEL_DIR/parser-params --slim_model --batch_size=1024 --alsologtostderr
     32247                       (core dumped) | bazel-bin/syntaxnet/conll2tree --task_context=$MODEL_DIR/context.pbtxt --alsologtostderr

@tomodachi21

One final question - for anyone who might be an expert user out there. It looks like the annotator is working correctly through the script but I noticed that it takes a long time to actually load the models - not that long to actually evaluate the sentence itself.

Is there a way to keep the model loaded and then pass new lines of text to it? I know that you can pass it a file with multiple lines, but I want to keep the thread waiting in the background and be able to pass strings into it. I'm not sure if this is possible, but I would really appreciate anyone's guidance on the matter.

Thanks!

@zerodarkzone

I'm still getting crashes because of CUDA out-of-memory errors,
even after using: gpu_opt = tf.GPUOptions(allow_growth=True)
I'm compiling with bazel 0.5.4 and my GPU is an NVIDIA 1080 Ti.

@xiaoleihuang

@perdasilva
I tried to install NCCL 2.*, the latest version, but TensorFlow cannot locate NCCL. Have you tried version 2? Thanks!

@cyberwillis

cyberwillis commented May 1, 2018

Hi, @perdasilva

I have successfully compiled TensorFlow 1.8 with NCCL2. The problem is that if you used the deb package to install it on your system, the package is split across different locations:

  • lib content to /usr/lib/x86_64-linux-gnu/ folder
  • include content to /usr/include/ folder
  • NCCL-SLA (software license agreement) to python somewhere in site-packages folder

However, the TensorFlow configuration accepts only one path, the root of this content, which is why the compilation is not happy.

To solve this you can:

  1. Create a folder and put symlinks in it that reproduce this exact structure (for example, for the latest version, 2.1.15):
  nccl2 (or any name you like)
   ├── include
   │     └── nccl.h
   ├── lib
   │     ├── libnccl.so -> libnccl.so.2*
   │     ├── libnccl.so.2 -> libnccl.so.2.1.15*
   │     ├── libnccl.so.2.1.15*
   │     └── libnccl_static.a
   ├── NCCL-SLA.txt
   └── COPYRIGHT.txt
  2. Or download and extract somewhere one of these packages, according to your CUDA version. You can easily find them on the web:
  • nccl_2.1.15-1+cuda8.0_x86_64.txz
  • nccl_2.1.15-1+cuda9.0_x86_64.txz
  • nccl_2.1.15-1+cuda9.1_x86_64.txz
  3. Point to that path during the ./configure process when asked, or set the environment variables for it and you will not be asked:
    export TF_NCCL_VERSION='2.1.15'
    export NCCL_INSTALL_PATH=/usr/local/nccl2  # my preferred path
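
As a rough illustration of the symlink option, the following Python sketch builds the include/ and lib/ part of that layout from a deb-style install. The function name `link_nccl_layout` is made up, the demo runs against fake files in a temporary directory, and the two license files from the tree are not handled; copy or link them in as well if your build asks for them.

```python
import os
import tempfile
from pathlib import Path


def link_nccl_layout(include_dir, lib_dir, root):
    """Create root/include and root/lib with symlinks to an installed NCCL.

    Links nccl.h from include_dir and every libnccl* file from lib_dir.
    Returns the sorted relative paths of the links that were created.
    """
    include_dir, lib_dir, root = Path(include_dir), Path(lib_dir), Path(root)
    pairs = [(include_dir / "nccl.h", root / "include" / "nccl.h")]
    pairs += [(p, root / "lib" / p.name) for p in sorted(lib_dir.glob("libnccl*"))]

    created = []
    for src, dst in pairs:
        if not src.exists():
            raise FileNotFoundError(src)
        dst.parent.mkdir(parents=True, exist_ok=True)
        if dst.is_symlink() or dst.exists():
            dst.unlink()  # replace a stale link from an earlier run
        os.symlink(os.path.abspath(src), dst)
        created.append(dst.relative_to(root).as_posix())
    return sorted(created)


# Demo with fake files standing in for the deb-installed ones.
tmp = Path(tempfile.mkdtemp())
(tmp / "usr_include").mkdir()
(tmp / "usr_lib").mkdir()
(tmp / "usr_include" / "nccl.h").write_text("// fake header\n")
(tmp / "usr_lib" / "libnccl.so.2.1.15").write_text("")
(tmp / "usr_lib" / "libnccl_static.a").write_text("")
root = tmp / "nccl2"
created = link_nccl_layout(tmp / "usr_include", tmp / "usr_lib", root)
print(created)
```

With the locations listed above, the real call would presumably be link_nccl_layout("/usr/include", "/usr/lib/x86_64-linux-gnu", "/usr/local/nccl2"), run with enough privileges to write under /usr/local, followed by pointing NCCL_INSTALL_PATH at the new folder.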

@rickragv

rickragv commented May 9, 2018

Hi, I'm using the new tf_serving 1.7 and I get:
third_party/nccl/nccl.h: No such file or directory
I am not finding the BUILD file in the new directory. Does anybody know,
given that TensorFlow is currently built by Bazel?

ERROR: /home/ubuntu/.cache/bazel/_bazel_ubuntu/8bd6e58495e54c8cdf1fb8b1ed15e742/external/org_tensorflow/tensorflow/contrib/nccl/BUILD:23:1: error while parsing .d file: /home/ubuntu/.cache/bazel/_bazel_ubuntu/8bd6e58495e54c8cdf1fb8b1ed15e742/execroot/tf_serving/bazel-out/k8-opt/bin/external/org_tensorflow/tensorflow/contrib/nccl/_objs/python/ops/_nccl_ops_gpu/external/org_tensorflow/tensorflow/contrib/nccl/kernels/nccl_manager.pic.d (No such file or directory)
In file included from external/org_tensorflow/tensorflow/contrib/nccl/kernels/nccl_manager.cc:15:0:
external/org_tensorflow/tensorflow/contrib/nccl/kernels/nccl_manager.h:23:35: fatal error: third_party/nccl/nccl.h: No such file or directory
compilation terminated.
INFO: Elapsed time: 545.332s, Critical Path: 222.39s
FAILED: Build did NOT complete successfully

@cyberwillis

cyberwillis commented May 11, 2018

Hi ..

I may be wrong about your case, but here is the point.

During the configure step of TensorFlow, prior to the build, it asks for the folder of your nccl, right? At that point, as I explained above, you must have this structure:

  nccl2 (or the name you like it)
   ├── include
   │     └── nccl.h
   ├── lib
   │     ├── libnccl.so -> libnccl.so.2*
   │     ├── libnccl.so.2 -> libnccl.so.2.1.15*
   │     ├── libnccl.so.2.1.15*
   │     └── libnccl_static.a
   ├── NCCL-SLA.txt
   └── COPYRIGHT.txt

If you installed nccl from the deb package, the files will be scattered around your system and will not follow the structure TensorFlow needs.

@rickragv

Thanks cyber,
but the current release doesn't have a configure file. It's all Bazel based.

@cyberwillis

I built yesterday without any problem. I will check again

@rickragv

Is it the latest 1.7 Serving?

@cyberwillis

cyberwillis commented May 11, 2018

Hi again,

I haven't had time to build it yet to check your case, but I can tell you that on my machine I have this folder:

.cache/bazel/_bazel_ubuntu/ad1e09741bb4109fbc70ef8216b59ee2/external/local_config_nccl

The BUILD file is inside this folder, as well as a symlink to my preferred nccl installation folder (it was generated by TensorFlow's ./configure).

To get that, you have to build TensorFlow (warning: it can take several hours to build).

@rickragv

rickragv commented May 12, 2018

Thanks @cyberwillis for giving your time to this.
Prior to TensorFlow Serving 1.4 there was a configure file for TensorFlow, but they removed it in the latest one.
It's built in now, using bazel build.
https://github.com/tensorflow/serving

@cyberwillis

cyberwillis commented May 12, 2018

Hi...

I am checking it right now, and yes, I understand what you said. The thing is, the configure file from TensorFlow is only used to register the environment variables that bazel build needs.

You can see this in the README for version 1.3, under Install Prerequisites:

#Pre-requisite
cd tensorflow
./configure
cd ..

#Building
cd serving
bazel build -c opt tensorflow_serving/...

@rickragv

rickragv commented May 12, 2018

Yeah.
I tried using 1.4 and it succeeded,
but to be future-ready I need 1.7 TF Serving.
All the environment variables are set correctly, and I even built nccl locally and pointed to it.
I deleted the cache and re-ran; it still does not find nccl.h.
If you try building 1.7 or the latest in the future, just let us know.
Thanks @cyberwillis

@cyberwillis

Tell me which version of CUDA and which version of NCCL you have, and anything else I need to replicate your environment.

@rickragv

CUDA 9, and the latest nccl

@cyberwillis

How did you set up your nccl2?

@cjhkeep

cjhkeep commented May 14, 2018

@cyberwillis

cyberwillis commented May 14, 2018

@cjhkeep
I don't have any problem with nccl.h 1.3 or 2.xx at all. I don't need a guide.

However, if you install "NCCL2" using the ".deb" version, your nccl.h will be far away from where TensorFlow expects to find it! That's why I suggested installing from the tar file here.

I was trying to diagnose the problem of our colleague above by asking him how he had installed nccl on his machine.

[UPDATED]

@discordianfish

Wouldn't the right way be for TensorFlow to just look in the right directories? /usr/include/ is the place for header files on Linux; I don't get why it looks somewhere else.

@cyberwillis

NVIDIA changes the locations of its packages from time to time (because they think it's funny) :)
If you investigate a little, depending on your CUDA version some files go to some places and others go elsewhere... I believe NVIDIA doesn't have a stable idea of where exactly to put these things, and TensorFlow cannot follow them into that hell.

@gautamvasudevan
Collaborator

Closing - please see the latest Docker examples for bringing up a build environment. The GPU build addresses the NCCL dependency.

@praeclarum

@gautamvasudevan This is still a bug when trying to do macOS GPU builds. Since your Docker example doesn't work on Mac, I think this issue should still be open.

@cyberwillis

cyberwillis commented Sep 10, 2018

Hi @praeclarum, I am sorry to see your question only now. Sadly, I am not using TensorFlow anymore, but I believe that since the TensorFlow Docker image relies on the abstract install from Ubuntu, you can adapt it to the exact problem you are having. Can you post exactly what problem you are getting on your Mac?

Another question: if you are using Docker on your Mac to build TF Serving, how do you pass the GPU through to the Docker Engine? X11 forwarding does not exist on macOS (Apple uses Quartz instead), so your GPU will never be recognized inside Docker because Apple does not formally allow it.

I believe your only strategy is to translate the commands from the Dockerfile into commands for your Homebrew setup.

@cyberwillis

In case you really want to try making the GPU available inside Docker (macOS only), you can try using XQuartz. Take a look at this Gist: https://gist.github.com/cschiewek/246a244ba23da8b9f0e7b11a68bf3285

@gautamvasudevan
Collaborator

We don't have any official support for macOS and nccl builds currently, though feel free to file a new issue specifically for macOS, we welcome any community support here!

@gatoatigrado

It seems that there's now a --config=nonccl option you can add to a bazel command, e.g. bazel build --config=opt --config=cuda --config=nonccl //tensorflow/tools/pip_package:build_pip_package (dunno if this will work entirely, but it seems to get me past this error ...)

@praeclarum

@gatoatigrado thanks for the tip, that's exactly what I was looking for
