Linking to both tensorflow and protobuf causes segmentation fault during static initializers #24976

Closed
matt-har-vey opened this issue Jan 16, 2019 · 14 comments
Labels: comp:eager (Eager related issues), type:bug (Bug)

matt-har-vey commented Jan 16, 2019

System information

  • Have I written custom code (as opposed to using a stock example script provided in TensorFlow): Yes
  • OS Platform and Distribution (e.g., Linux Ubuntu 16.04): 4.18.10-1rodete2-amd64 (Debian-derived)
  • Mobile device (e.g. iPhone 8, Pixel 2, Samsung Galaxy) if the issue happens on mobile device: N/A
  • TensorFlow installed from (source or binary): binary
  • TensorFlow version (use command below): nightly Jan 15, 2018 (protobuf built from HEAD Jan 15)
  • Python version: N/A
  • Bazel version (if compiling from source): N/A
  • GCC/Compiler version (if compiling from source): gcc 7.3.0
  • CUDA/cuDNN version: N/A
  • GPU model and memory: N/A

Describe the current behavior
Aborts on SIGSEGV

Describe the expected behavior
Exits cleanly

Details
I want to create an application that calls the C API but can also parse protocol buffers on its own behalf. For that, I want to link dynamically to tensorflow and statically to protobuf. When I do this, it seems that protobuf may be tricking libtensorflow.so into thinking that it has already run some static initializers that it has in fact not run (those for the static variables needed by its own internal copy of protobuf).

The segfault is only on Linux. Linking the same way on Windows works fine.

I have varied libtensorflow and protobuf versions, and it seems to happen with all of them. It also happens whether I choose static or dynamic linking for my binary's copy of protobuf.

I also tried building my own liba.so that itself statically links protobuf and then a binary that linked dynamically to "a" and statically to protobuf. This worked, which is pointing away from this being a purely protobuf issue.

Code to reproduce the issue

  • bash
c++ -o main \
  -L$TF_DIR/lib -I$TF_DIR/include \
  -L$PROTO_DIR/lib -I$PROTO_DIR/include \
  main.cc -l tensorflow -l protobuf

LD_LIBRARY_PATH=$TF_DIR/lib:$PROTO_DIR/lib ./main

Removing -lprotobuf from the above command will get rid of the segfault.

  • main.cc
int main(int argc, char** argv) {}

Other info / logs

Program received signal SIGSEGV, Segmentation fault.
0x00007fffed8f20b8 in tensorflow::kernel_factory::OpKernelRegistrar::InitInternal(tensorflow::KernelDef const*, absl::string_view, std::unique_ptr<tensorflow::kernel_factory::OpKernelFactory, std::default_delete<tensorflow::kernel_factory::OpKernelFactory> >) ()
from /usr/local/google/home/mattharvey/no_backup/libtensorflow/lib/libtensorflow_framework.so
(gdb) bt
#0 0x00007fffed8f20b8 in tensorflow::kernel_factory::OpKernelRegistrar::InitInternal(tensorflow::KernelDef const*, absl::string_view, std::unique_ptr<tensorflow::kernel_factory::OpKernelFactory, std::default_delete<tensorflow::kernel_factory::OpKernelFactory> >) ()
from /usr/local/google/home/mattharvey/no_backup/libtensorflow/lib/libtensorflow_framework.so
#1 0x00007fffed88336a in tensorflow::kernel_factory::OpKernelRegistrar::OpKernelRegistrar(tensorflow::KernelDef const*, absl::string_view, tensorflow::OpKernel* (*)(tensorflow::OpKernelConstruction*)) ()
from /usr/local/google/home/mattharvey/no_backup/libtensorflow/lib/libtensorflow_framework.so
#2 0x00007fffed85f806 in _GLOBAL__sub_I_dataset.cc ()
from /usr/local/google/home/mattharvey/no_backup/libtensorflow/lib/libtensorflow_framework.so
#3 0x00007ffff7de88aa in call_init (l=<optimized out>, argc=argc@entry=1, argv=argv@entry=0x7fffffffdc68, env=env@entry=0x7fffffffdc78) at dl-init.c:72
#4 0x00007ffff7de89bb in call_init (env=0x7fffffffdc78, argv=0x7fffffffdc68, argc=1, l=<optimized out>) at dl-init.c:30
#5 _dl_init (main_map=0x7ffff7ffe170, argc=1, argv=0x7fffffffdc68, env=0x7fffffffdc78) at dl-init.c:120
#6 0x00007ffff7dd9c5a in _dl_start_user () from /lib64/ld-linux-x86-64.so.2
#7 0x0000000000000001 in ?? ()
#8 0x00007fffffffdf2e in ?? ()
#9 0x0000000000000000 in ?? ()

0x00007fffed8f20a0 <+80>: mov 0x50(%r15),%rax
0x00007fffed8f20a4 <+84>: lea -0xa0(%rbp),%rbx
0x00007fffed8f20ab <+91>: mov %rbx,%rdi
0x00007fffed8f20ae <+94>: mov (%rax),%r8
0x00007fffed8f20b1 <+97>: mov 0x48(%r15),%rax
0x00007fffed8f20b5 <+101>: mov (%rax),%rsi
=> 0x00007fffed8f20b8 <+104>: mov -0x18(%r8),%r9

How did -0x18(%r8) get illegal?

(gdb) info register r8
r8 0x0 0

-0x18 is certainly illegal. Where did it come from? 0x50(%r15) if we trace through the above.

(gdb) info register r15
r15 0x555555768d10 93824994413840

(gdb) x/2 0x555555768d60
0x555555768d60: 0xee2c0bc0 0x00007fff

(gdb) x/2 0x00007fffee2c0bc0
0x7fffee2c0bc0 <google::protobuf::internal::fixed_address_empty_string>: 0x00000000 0x00000000

... the 0x0 that ended up in r8.

Zoom out to find lots of stuff uninitialized:

(gdb) x/64x 0x7fffee4ddb00
0x7fffee4ddb00 <google::protobuf::_DoubleValue_default_instance_>: 0x00000000 0x00000000 0x00000000 0x00000000
0x7fffee4ddb10 <google::protobuf::_DoubleValue_default_instance_+16>: 0x00000000 0x00000000 0x00000000 0x00000000
0x7fffee4ddb20 <_ZStL8__ioinit>: 0x00000000 0x00000000 0x00000000 0x00000000
0x7fffee4ddb30 <_ZStL8__ioinit>: 0x00000000 0x00000000 0x00000000 0x00000000
0x7fffee4ddb40 <google::protobuf::internal::RepeatedPrimitiveDefaults::default_instance()::instance>: 0x00000000 0x00000000 0x00000000 0x00000000
0x7fffee4ddb50 <guard variable for google::protobuf::internal::RepeatedStringTypeTraits::GetDefaultRepeatedField()::instance>: 0x00000000 0x00000000 0x00000000 0x00000000
0x7fffee4ddb60 <guard variable for google::protobuf::internal::(anonymous namespace)::Register(google::protobuf::MessageLite const*, int, google::protobuf::internal::ExtensionInfo)::local_static_registry>: 0x00000000 0x00000000 0x00000000 0x00000000
0x7fffee4ddb70 <_ZStL8__ioinit>: 0x00000000 0x00000000 0x00000000 0x00000000
0x7fffee4ddb80 <google::protobuf::internal::InitSCCImpl(google::protobuf::internal::SCCInfoBase*)::mu>: 0x00000000 0x00000000 0x00000000 0x00000000
0x7fffee4ddb90 <google::protobuf::internal::InitSCCImpl(google::protobuf::internal::SCCInfoBase*)::mu+16>: 0x00000000 0x00000000 0x00000000 0x00000000
0x7fffee4ddba0 <google::protobuf::internal::InitSCCImpl(google::protobuf::internal::SCCInfoBase*)::mu+32>: 0x00000000 0x00000000 0x00000000 0x00000000
0x7fffee4ddbb0 <guard variable for google::protobuf::internal::InitSCCImpl(google::protobuf::internal::SCCInfoBase*)::runner>: 0x00000000 0x00000000 0x00000000 0x00000000
0x7fffee4ddbc0 <google::protobuf::internal::fixed_address_empty_string>: 0x00000000 0x00000000 0x00000000 0x00000000
0x7fffee4ddbd0 <google::protobuf::internal::implicit_weak_message_default_instance>: 0x00000000 0x00000000 0x00000000 0x00000000
0x7fffee4ddbe0 <google::protobuf::internal::implicit_weak_message_default_instance+16>: 0x00000000 0x00000000 0x00000000 0x00000000
0x7fffee4ddbf0 <google::protobuf::ShutdownProtobufLibrary()::is_shutdown>: 0x00000000 0x00000000 0x00000000 0x00000000

@matt-har-vey
Author

I found a temporary workaround for myself, but it should still be possible to do this from released binaries without the need to rebuild.

Local opt build works from r1.12 at a6d8ffa

bazel build -c opt --copt=-mavx --define=grpc_no_ares=true //tensorflow/tools/lib_package:libtensorflow

tar zxvf ../tensorflow/bazel-bin/tensorflow/tools/lib_package/libtensorflow.tar.gz

However I get the segfault from

https://storage.googleapis.com/tensorflow/libtensorflow/libtensorflow-cpu-linux-x86_64-1.12.0.tar.gz

with protobuf built locally from

https://github.com/protocolbuffers/protobuf/releases/download/v3.6.0/protobuf-all-3.6.0.tar.gz

and also from

https://storage.googleapis.com/tensorflow-nightly/github/tensorflow/lib_package/libtensorflow-cpu-linux-x86_64.tar.gz # Wed Jan 16 22:33:29 PST 2019

with protobuf built locally from head (3.6.1) around the same time.

@TravisWhitaker

I just hit this with the libtegra_tensorflow.so that Nvidia provides on their Xavier board's rootfs. Is building from scratch really the only workaround?

@klapstoelpiloot

klapstoelpiloot commented Feb 7, 2019

I also stumbled upon this problem.

  • On Ubuntu 18.04.1, using GCC 7.3.0
  • Using libtensorflow-cpu-linux-x86_64-1.12.0.tar.gz from the same link @matth79 mentioned above.
  • Using protobuf 3.6.1 built locally.

The problem is easy to replicate:

#include <iostream>
#include "tensorflow/c/c_api.h"

// Enable this line to include protobuf
//#include "google/protobuf/message.h"

// Main program entry
int main(int argc, char* argv[])
{
    // TF_Version() comes from the TensorFlow C API and returns a C string.
    std::cout << "Tensorflow version: " << TF_Version();
    return 0;
}

Link with -ltensorflow and it works fine. Uncomment the line to include protobuf and link with both -ltensorflow and -lprotobuf and observe the segmentation fault on initialization.

@ymodak ymodak added comp:eager Eager related issues type:bug Bug labels Feb 8, 2019
@skye
Member

skye commented Feb 8, 2019

@gunan @allenlavoie can either of you comment?

@klapstoelpiloot

It has been over a month and we're still having issues with this. An update or fix would be very much appreciated!

@artificialbrains

We are also having this problem on NVIDIA's Xavier and would appreciate an update/fix. If there are no plans to fix the bug, we will try to build Tensorflow with the hints from matth79.

@allenlavoie
Member

Sounds like it must be a symbol conflict. And since it's the same library, it's not a case where we can just rename one of the symbols to avoid the conflict. The workarounds sound like (1) only load the second copy of protobuf in a .so that does not use TensorFlow, and you can use both that .so and TensorFlow's .so from your main program, (2) instead of linking normally, dlopen() TensorFlow with RTLD_DEEPBIND set so TensorFlow prefers its own symbols.

I'm not sure what TensorFlow can do. Putting something in the global symbol table which conflicts with TensorFlow's protobuf usage isn't something we can easily work around. Unless someone has a suggestion?
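
For readers who want to try workaround (2), here is a minimal sketch of loading a library with RTLD_DEEPBIND instead of linking it with -ltensorflow. The helper names OpenDeepBind and LookupTFVersion are made up for illustration, RTLD_DEEPBIND is a glibc extension, and nobody in this thread has confirmed that this removes the crash for their setup, so treat it as a starting point only.

```cpp
#include <dlfcn.h>
#include <cstdio>

// Hypothetical helper: opens a shared library so that it prefers its own
// definitions over same-named symbols already in the global scope.
// RTLD_DEEPBIND is glibc-specific; RTLD_LOCAL keeps the library's symbols
// out of the global scope as well.
void* OpenDeepBind(const char* path) {
  void* handle = dlopen(path, RTLD_NOW | RTLD_LOCAL | RTLD_DEEPBIND);
  if (handle == nullptr) {
    std::fprintf(stderr, "dlopen failed: %s\n", dlerror());
  }
  return handle;
}

// Matches the declaration of TF_Version() in tensorflow/c/c_api.h.
using TFVersionFn = const char* (*)();

// Hypothetical helper: looks up TF_Version in an opened library.
// Returns nullptr if the library does not export that symbol.
TFVersionFn LookupTFVersion(void* handle) {
  return reinterpret_cast<TFVersionFn>(dlsym(handle, "TF_Version"));
}
```

A caller would pass the path to libtensorflow.so, check both results for nullptr, and then call the returned function pointer. The program itself would then link against protobuf (and -ldl on older glibc) but not against tensorflow, and every C API function it needs would have to be looked up through dlsym in the same way.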

@HuaDongShiFanLX

Hello. I have the same problem; the output looks like this:
0x00007fffddef3058 in tensorflow::kernel_factory::OpKernelRegistrar::InitInternal(tensorflow::KernelDef const*, absl::string_view, std::unique_ptr<tensorflow::kernel_factory::OpKernelFactory, std::default_delete<tensorflow::kernel_factory::OpKernelFactory> >) ()
from /usr/local/lib/python3.5/dist-packages/tensorflow/python/../libtensorflow_framework.so

I am using C++ to call Python's tensorflow. Our own environment already uses the protobuf library, and the problem above occurs when we run Python's "import tensorflow as tf" from C++ in that environment. When the "import tensorflow as tf" is removed, the problem disappears. Do you know the reason?

I think the protobuf in my environment conflicts with the protobuf in tensorflow.

Can you help me? Thanks.

@TravisWhitaker

This is indeed a problem with protobuf; there's not much TF itself can do as @allenlavoie mentioned. We dealt with this by running TF operations in a separate process that talks over a UNIX socket, but @allenlavoie's solutions should work too.
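
A rough sketch of the separate-process idea, for anyone considering it: the parent sends a request over a UNIX socket pair and a worker process sends back the reply. Everything here is illustrative. HandleRequest is a placeholder for the code that would call TensorFlow, and RunInWorkerProcess is a made-up name. A plain fork() still shares the parent's loaded libraries, so a real setup would presumably exec a separate worker binary that links only tensorflow; that part is not shown.

```cpp
#include <sys/socket.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>
#include <string>

// Placeholder for the work that would call into libtensorflow in the worker.
std::string HandleRequest(const std::string& request) {
  return "handled: " + request;
}

// Reads from fd until the peer closes its sending side.
static std::string ReadAll(int fd) {
  std::string out;
  char buf[256];
  ssize_t n;
  while ((n = read(fd, buf, sizeof(buf))) > 0) {
    out.append(buf, static_cast<size_t>(n));
  }
  return out;
}

// Writes the whole buffer; returns false on error.
static bool WriteAll(int fd, const std::string& data) {
  size_t off = 0;
  while (off < data.size()) {
    ssize_t n = write(fd, data.data() + off, data.size() - off);
    if (n <= 0) return false;
    off += static_cast<size_t>(n);
  }
  return true;
}

// Sends one request to a worker process and returns its reply.
// Returns an empty string if the socket pair or fork fails.
std::string RunInWorkerProcess(const std::string& request) {
  int fds[2];
  if (socketpair(AF_UNIX, SOCK_STREAM, 0, fds) != 0) return "";
  pid_t pid = fork();
  if (pid < 0) {
    close(fds[0]);
    close(fds[1]);
    return "";
  }
  if (pid == 0) {
    // Worker side: read the full request, reply, and exit without
    // running the parent's atexit handlers.
    close(fds[0]);
    WriteAll(fds[1], HandleRequest(ReadAll(fds[1])));
    close(fds[1]);
    _exit(0);
  }
  // Parent side: send the request, signal end of input, read the reply.
  close(fds[1]);
  WriteAll(fds[0], request);
  shutdown(fds[0], SHUT_WR);
  std::string response = ReadAll(fds[0]);
  close(fds[0]);
  int status = 0;
  waitpid(pid, &status, 0);
  return response;
}
```

With the placeholder handler, RunInWorkerProcess("ping") returns "handled: ping". The cost of this design is serializing every request and reply, which is why the dlopen() route may be preferable when it works.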

@TravisWhitaker

I hope the readers have learned a valuable lesson about using static initializers in this way from this thread.

@abcdabcd987
Contributor

I also have this issue. Reproduced with libtensorflow-gpu-linux-x86_64-1.15.0.tar.gz.

@gunan
Contributor

gunan commented Jan 21, 2020

While I do not want to close this issue, as @allenlavoie wrote in #24976 (comment) , I am not sure what we can do.
TF is working on the slow path to hide all protobuf symbols from its API surface. Even then static initializers will be executed twice. I am not sure what will happen, as I am not sure how protobuf uses them.

So, unfortunately I can only offer #24976 (comment) , and we should close this as "Infeasible".

@gunan gunan closed this as completed Jan 21, 2020

@gittripley

I ran into a core dump when calling import tensorflow through the Python API from C++.

Thread 1 "tf" received signal SIGSEGV, Segmentation fault.
google::protobuf::internal::AddDescriptors(google::protobuf::internal::DescriptorTable const*) () from /usr/local/lib/python3.6/dist-packages/google/protobuf/pyext/_message.cpython-36m-x86_64-linux-gnu.so

In the end, I installed the Python protobuf version that matches Tensorflow's protobuf version, 3.7.1, and it works. I don't know how to check the protobuf version inside the tensorflow libraries libtensorflow_framework.so or _pywrap_tensorflow_internal.so.

Since Tensorflow 1.14 requires protobuf >= 3.6.1, I installed 3.6.1 first, and my program then failed with this error:
[libprotobuf FATAL external/protobuf_archive/src/google/protobuf/stubs/common.cc:86] This program was compiled against version 3.6.1 of the Protocol Buffer runtime library, which is not compatible with the installed version (3.7.1). Contact the program author for an update. If you compiled the program yourself, make sure that your headers are from the same version of Protocol Buffers as your link-time library. (Version verification failed in "google/protobuf/descriptor.pb.cc".)

However, if I install Python protobuf 3.11.3, I get the segfault.

So once I moved protobuf to 3.7.1, it worked.

rowillia pushed a commit to rowillia/protobuf that referenced this issue Feb 25, 2021
@gnossen gave a great overview in grpc/grpc#24992 of the overall problem.

A Python process that uses both protobuf _and_ another native library linking in libprotobuf
can frequently crash.  This seems to frequently affect tensorflow as well:

tensorflow/tensorflow#8394,
tensorflow/tensorflow#9525 (comment)
tensorflow/tensorflow#24976,
tensorflow/tensorflow#35573,
https://github.com/tensorflow/tensorflow/blob/v2.0.0/tensorflow/contrib/makefile/rename_protobuf.sh,
tensorflow/tensorflow#16104

Testing locally this fixes both crashes when linking in multiple versions of protobuf
and fixes `DescriptorPool` clashes as well (e.g. Python and Native code import different versions of the same message).
TeBoring pushed a commit to protocolbuffers/protobuf that referenced this issue Apr 24, 2021
acozzette pushed a commit to acozzette/protobuf that referenced this issue Jan 22, 2022
acozzette added a commit to protocolbuffers/protobuf that referenced this issue Jan 25, 2022

Co-authored-by: Roy Williams <roy.williams.iii@gmail.com>