
undefined symbol in roi_pooling_layer #90

Open
chaochaow opened this issue Feb 24, 2017 · 25 comments

@chaochaow

chaochaow commented Feb 24, 2017

I followed the instructions and tried to run the training script on CPU, but got the following error message. It looks like there is a problem loading the roi_pooling.so library. Has anybody else encountered this problem? I am using TensorFlow 0.11.0.

./experiments/scripts/faster_rcnn_end2end.sh cpu 0 VGG16 pascal_voc

Traceback (most recent call last):
  File "./tools/train_net.py", line 16, in <module>
    from networks.factory import get_network
  File "/home/Projects/Faster-RCNN_TF/tools/../lib/networks/__init__.py", line 8, in <module>
    from .VGGnet_train import VGGnet_train
  File "/home/Projects/Faster-RCNN_TF/tools/../lib/networks/VGGnet_train.py", line 2, in <module>
    from networks.network import Network
  File "/home/Projects/Faster-RCNN_TF/tools/../lib/networks/network.py", line 3, in <module>
    import roi_pooling_layer.roi_pooling_op as roi_pool_op
  File "/home/Projects/Faster-RCNN_TF/tools/../lib/roi_pooling_layer/roi_pooling_op.py", line 5, in <module>
    _roi_pooling_module = tf.load_op_library(filename)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/load_library.py", line 63, in load_op_library
    raise errors._make_specific_exception(None, None, error_msg, error_code)
tensorflow.python.framework.errors.NotFoundError: /home/Projects/Faster-RCNN_TF/tools/../lib/roi_pooling_layer/roi_pooling.so: undefined symbol: _ZN10tensorflow7strings6StrCatB5cxx11ERKNS0_8AlphaNumE

@showkeyjar

The same error here.

@Peratham

Same here.

@Peratham

I set the CUDA path in lib/make.sh and it works for me.
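
For reference, that is the CUDA_PATH variable in the config section at the top of lib/make.sh; the value shown below is the usual default location and may differ on your system:

# lib/make.sh -- point this at your local CUDA install (default path shown)
CUDA_PATH=/usr/local/cuda/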

@chaochaow

Thanks for the information, @Peratham. Did you run the training script with the GPU version?

@chaochaow

@Peratham: I tried the GPU version today and it gave an error stating that roi_pooling.o cannot be found when trying to build the Cython modules in lib.

@Peratham

Peratham commented Mar 1, 2017

Yes, I ran it with the GPU version.

I also set the nvcc -arch argument based on the CUDA compute capability of the card I am using, but I think the default value works on most GPUs anyway, so that may not be the issue.
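
For reference, that is the -arch flag on the nvcc line in lib/make.sh. sm_37 is the script's default; the right value depends on your GPU's compute capability (for example, a GTX 1050 would be sm_61):

nvcc -std=c++11 -c -o roi_pooling_op.cu.o roi_pooling_op_gpu.cu.cc \
    -I $TF_INC -D GOOGLE_CUDA=1 -x cu -Xcompiler -fPIC \
    -arch=sm_37    # set to match your card's compute capability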

@Peratham

Peratham commented Mar 1, 2017

Also, do you have Eigen installed on your system and on the include path?

@XinliangZhu

@Peratham Hi Peratham, when I run demo.py I get "ImportError: No module named gpu_nms". I checked FRCN_ROOT/lib/nms and there is no gpu_nms.so, so I guess it may be a problem with building the Cython modules. I am using CUDA 8.0, Anaconda with Python 2.7, and TensorFlow 1.0. Do you have any idea? Thank you!

@Peratham

Peratham commented Mar 1, 2017

@XinliangZhu Have you run make.sh in the /lib directory? What is the output when you run it? Are there any errors?

@XinliangZhu

@Peratham Yes, I have run it. No error occurred.

@XinliangZhu

@Peratham I have solved the problem of generating gpu_nms.so by changing the cudaconfig dictionary in lib/setup.py so that its 'lib64' entry is pjoin(home, 'lib') instead of pjoin(home, 'lib64').
However, this leads to a new problem just like chaochaow's. By the way, I am using the GPU.
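
For clarity, the relevant part of lib/setup.py after that change looks roughly like this. The point is that the 'lib64' entry has to name the directory that actually contains the CUDA libraries (libcudart) on your machine, which is lib64 on most 64-bit Linux installs but lib on some setups:

# lib/setup.py -- cudaconfig after the change (sketch; adjust to wherever
# libcudart actually lives on your system)
cudaconfig = {'home': home,                       # CUDA install root
              'nvcc': nvcc,                       # path to the nvcc binary
              'include': pjoin(home, 'include'),  # CUDA headers
              'lib64': pjoin(home, 'lib')}        # directory holding libcudart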

@Peratham

Peratham commented Mar 1, 2017

@XinliangZhu Is your machine 32-bit or 64-bit?

@chaochaow

Found a similar discussion here: #50

I still have the problem with the CPU version after adding -D_GLIBCXX_USE_CXX11_ABI=0. I will try the GPU version tomorrow.

@XinliangZhu

@Peratham Thanks! Mine is 64-bit. I have completely solved the problem. To summarize: first I changed cudaconfig in /lib/setup.py as above, which guarantees that gpu_nms.so gets generated. Then I revised CUDA_PATH and added -D_GLIBCXX_USE_CXX11_ABI=0 in /lib/make.sh. I just followed your instructions and CharlesShang/TFFRCNN#2. Thanks!
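
For anyone following along, a sketch of what the GPU branch of /lib/make.sh looks like after those edits; the CUDA path shown is the common default and may differ on your system:

# /lib/make.sh (GPU branch) -- after adding the ABI flag
CUDA_PATH=/usr/local/cuda/   # adjust to your CUDA install

g++ -std=c++11 -shared -o roi_pooling.so roi_pooling_op.cc -D_GLIBCXX_USE_CXX11_ABI=0 \
    roi_pooling_op.cu.o -I $TF_INC -D GOOGLE_CUDA=1 -fPIC $CXXFLAGS \
    -lcudart -L $CUDA_PATH/lib64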

@chaochaow

chaochaow commented Mar 1, 2017

@Peratham @XinliangZhu Now, with the CPU version, I have the same problem that nsivaramakrishnan posted in CharlesShang/TFFRCNN#2.

Looks like there is no solution for the CPU version yet, according to #36.

@XinliangZhu

XinliangZhu commented Mar 1, 2017

@chaochaow Based on my experience, this may be caused by a mismatched g++ version. Is your g++ version 5.*, and where did you add -D_GLIBCXX_USE_CXX11_ABI=0? Since you are using CPU only, you should add it in the else branch of /lib/make.sh.

@chaochaow

@XinliangZhu @Peratham: when I ran the GPU version today, I got the following error after changing the g++ line to

g++ -std=c++11 -shared -o roi_pooling.so roi_pooling_op.cc \
    roi_pooling_op.cu.o -I $TF_INC -D GOOGLE_CUDA=1 -fPIC $CXXFLAGS -D_GLIBCXX_USE_CXX11_ABI=0 \
    -lcudart -L $CUDA_PATH/lib64

Is there anything else I need to do? It looks like my setup is able to locate nvcc.

skipping 'utils/nms.c' Cython extension (up-to-date)
skipping 'nms/cpu_nms.c' Cython extension (up-to-date)
rm -rf build
bash make.sh
I tensorflow/stream_executor/dso_loader.cc:135] successfully opened CUDA library libcublas.so.8.0 locally
I tensorflow/stream_executor/dso_loader.cc:135] successfully opened CUDA library libcudnn.so.5 locally
I tensorflow/stream_executor/dso_loader.cc:135] successfully opened CUDA library libcufft.so.8.0 locally
I tensorflow/stream_executor/dso_loader.cc:135] successfully opened CUDA library libcuda.so.1 locally
I tensorflow/stream_executor/dso_loader.cc:135] successfully opened CUDA library libcurand.so.8.0 locally
/usr/lib/gcc/x86_64-linux-gnu/5/include/mwaitxintrin.h(36): error: identifier "__builtin_ia32_monitorx" is undefined

/usr/lib/gcc/x86_64-linux-gnu/5/include/mwaitxintrin.h(42): error: identifier "__builtin_ia32_mwaitx" is undefined

2 errors detected in the compilation of "/tmp/tmpxft_000016e3_00000000-7_roi_pooling_op_gpu.cu.cpp1.ii".
g++: error: roi_pooling_op.cu.o: No such file or directory

@XinliangZhu

@chaochaow Since you are trying CPU-only mode, I guess you should put -D_GLIBCXX_USE_CXX11_ABI=0 in the else block of /lib/make.sh, like this:

g++ -std=c++11 -shared -o roi_pooling.so roi_pooling_op.cc -D_GLIBCXX_USE_CXX11_ABI=0 \
    -I $TF_INC -fPIC $CXXFLAGS

To solve your current error, please refer to tensorflow/tensorflow#1066.

@chaochaow

chaochaow commented Mar 2, 2017

@XinliangZhu Thanks a lot for the help! I added -D_MWAITXINTRIN_H_INCLUDED -D_FORCE_INLINES to the nvcc line and that fixed the compilation issue. Then I ran into the gpu_nms issue like yours and fixed it by changing cudaconfig in setup.py according to your previous post. Now the error I get is in the sparse_softmax_cross_entropy_with_logits call. Any suggestions?
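
For reference, the nvcc line in lib/make.sh with those two defines added would look roughly like this (the -arch value is just the script's default):

nvcc -std=c++11 -c -o roi_pooling_op.cu.o roi_pooling_op_gpu.cu.cc \
    -I $TF_INC -D GOOGLE_CUDA=1 -x cu -Xcompiler -fPIC \
    -D_MWAITXINTRIN_H_INCLUDED -D_FORCE_INLINES \
    -arch=sm_37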

I googled it and saw this exact same issue posted for TF 1.0 here:
tensorflow/models#857

Normalizing targets
done
Solving...
Traceback (most recent call last):
  File "./tools/train_net.py", line 96, in <module>
    max_iters=args.max_iters)
  File "/media/garmin/Data/Faster-RCNN_TF/tools/../lib/fast_rcnn/train.py", line 264, in train_net
    sw.train_model(sess, max_iters)
  File "/media/garmin/Data/Faster-RCNN_TF/tools/../lib/fast_rcnn/train.py", line 114, in train_model
    rpn_cross_entropy = tf.reduce_mean(tf.nn.sparse_softmax_cross_entropy_with_logits(rpn_cls_score, rpn_label))
  File "/home/garmin/.local/lib/python2.7/site-packages/tensorflow/python/ops/nn_ops.py", line 1684, in sparse_softmax_cross_entropy_with_logits
    labels, logits)
  File "/home/garmin/.local/lib/python2.7/site-packages/tensorflow/python/ops/nn_ops.py", line 1533, in _ensure_xent_args
    "named arguments (labels=..., logits=..., ...)" % name)
ValueError: Only call sparse_softmax_cross_entropy_with_logits with named arguments (labels=..., logits=..., ...)
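
For anyone hitting the same error: under TensorFlow 1.0 this call has to pass the tensors as keyword arguments, so the line from the traceback above would become something like this:

# lib/fast_rcnn/train.py -- TF 1.0 requires logits/labels as keyword arguments
rpn_cross_entropy = tf.reduce_mean(
    tf.nn.sparse_softmax_cross_entropy_with_logits(
        logits=rpn_cls_score, labels=rpn_label))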

@XinliangZhu

XinliangZhu commented Mar 2, 2017 via email

@chaochaow

Yeah, I updated all those math operators and it runs properly now. It looks like this code takes a lot of GPU memory, and my 1050 runs out of memory pretty quickly.
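
For readers doing the same update: besides the keyword-argument change above, TensorFlow 1.0 renamed several element-wise math ops. Which call sites need updating depends on the code, but the renames are along these lines:

import tensorflow as tf

a = tf.constant(2.0)
b = tf.constant(3.0)

# TF 0.x spellings such as tf.mul(a, b), tf.sub(a, b) and tf.neg(a) become:
prod = tf.multiply(a, b)
diff = tf.subtract(a, b)
neg  = tf.negative(a)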

@XinliangZhu

XinliangZhu commented Mar 2, 2017 via email

@nanyigongzi

@Peratham Hi, could you tell me where exactly to add the CUDA path in make.sh?

@TomHeaven

I'm using Mac OS X. The correct way to fix the problem there is to use clang++ instead of g++ in make.sh.

The updated make.sh looks like this:

TF_INC=$(python -c 'import tensorflow as tf; print(tf.sysconfig.get_include())')


# config
CXX=g++
CUDA_PATH=/usr/local/cuda/
NVCCFLAGS='--expt-relaxed-constexpr'
CXXFLAGS=''

if [[ "$OSTYPE" =~ ^darwin ]]; then
    CXX=clang++
    CXXFLAGS+=' -undefined dynamic_lookup '
    TF_INC=/Library/Python/2.7/site-packages/tensorflow/include
fi

cd roi_pooling_layer

if [ -d "$CUDA_PATH" ]; then
	nvcc -std=c++11 -c -o roi_pooling_op.cu.o roi_pooling_op_gpu.cu.cc  \
		-I $TF_INC -D GOOGLE_CUDA=1 -x cu -Xcompiler -fPIC $NVCCFLAGS \
		-arch=sm_37

	$CXX -std=c++11 -shared -o roi_pooling.so roi_pooling_op.cc -D_GLIBCXX_USE_CXX11_ABI=0 \
		roi_pooling_op.cu.o -I $TF_INC  -D GOOGLE_CUDA=1 -fPIC $CXXFLAGS \
		-lcudart -L $CUDA_PATH/lib64
else
	$CXX -std=c++11 -shared -o roi_pooling.so roi_pooling_op.cc -D_GLIBCXX_USE_CXX11_ABI=0 \
		-I $TF_INC -fPIC $CXXFLAGS
fi

cd ..
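
If it helps, the script is run the same way as before, from the repository's lib directory:

cd lib        # from the repository root (FRCN_ROOT)
bash make.sh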

@danaodai

danaodai commented Feb 1, 2018

My problem was that the CUDA path was wrong.
