Linking to both tensorflow and protobuf causes segmentation fault during static initializers #24976

matt-har-vey · 2019-01-16T23:05:31Z

System information

Have I written custom code (as opposed to using a stock example script provided in TensorFlow): Yes
OS Platform and Distribution (e.g., Linux Ubuntu 16.04): 4.18.10-1rodete2-amd64 (Debian-derived)
Mobile device (e.g. iPhone 8, Pixel 2, Samsung Galaxy) if the issue happens on mobile device: N/A
TensorFlow installed from (source or binary): binary
TensorFlow version (use command below): nightly Jan 15, 2019 (protobuf built from HEAD Jan 15)
Python version: N/A
Bazel version (if compiling from source): N/A
GCC/Compiler version (if compiling from source): gcc 7.3.0
CUDA/cuDNN version: N/A
GPU model and memory: N/A

Describe the current behavior
Aborts on SIGSEGV

Describe the expected behavior
Exits cleanly

Details
I want to create an application that calls the C API but can also parse protocol buffers on its own behalf. For that, I want to link dynamically to tensorflow and statically to protobuf. When I do this, it seems that protobuf may be tricking libtensorflow.so into thinking it has already run some static initializers that it in fact has not run (on the static variables needed by its own internal copy of protobuf).

The segfault is only on Linux. Linking the same way on Windows works fine.

I have varied libtensorflow and protobuf versions, and it seems to happen with all of them. It also happens whether I choose static or dynamic linking for my binary's copy of protobuf.

I also tried building my own liba.so that itself statically links protobuf, and then a binary that links dynamically to "a" and statically to protobuf. This worked, which points away from this being a purely protobuf issue.

Code to reproduce the issue

bash

c++ -o main \
  -L$TF_DIR/lib -I$TF_DIR/include \
  -L$PROTO_DIR/lib -I$PROTO_DIR/include \
  main.cc -l tensorflow -l protobuf

LD_LIBRARY_PATH=$TF_DIR/lib:$PROTO_DIR/lib ./main

Removing -lprotobuf from the above command will get rid of the segfault.

main.cc

int main(int argc, char** argv) {}

Other info / logs

Program received signal SIGSEGV, Segmentation fault.
0x00007fffed8f20b8 in tensorflow::kernel_factory::OpKernelRegistrar::InitInternal(tensorflow::KernelDef const*, absl::string_view, std::unique_ptr<tensorflow::kernel_factory::OpKernelFactory, std::default_delete<tensorflow::kernel_factory::OpKernelFactory> >) ()
from /usr/local/google/home/mattharvey/no_backup/libtensorflow/lib/libtensorflow_framework.so
(gdb) bt
#0 0x00007fffed8f20b8 in tensorflow::kernel_factory::OpKernelRegistrar::InitInternal(tensorflow::KernelDef const*, absl::string_view, std::unique_ptr<tensorflow::kernel_factory::OpKernelFactory, std::default_delete<tensorflow::kernel_factory::OpKernelFactory> >) ()
from /usr/local/google/home/mattharvey/no_backup/libtensorflow/lib/libtensorflow_framework.so
#1 0x00007fffed88336a in tensorflow::kernel_factory::OpKernelRegistrar::OpKernelRegistrar(tensorflow::KernelDef const*, absl::string_view, tensorflow::OpKernel* (*)(tensorflow::OpKernelConstruction*)) ()
from /usr/local/google/home/mattharvey/no_backup/libtensorflow/lib/libtensorflow_framework.so
#2 0x00007fffed85f806 in _GLOBAL__sub_I_dataset.cc ()
from /usr/local/google/home/mattharvey/no_backup/libtensorflow/lib/libtensorflow_framework.so
#3 0x00007ffff7de88aa in call_init (l=<optimized out>, argc=argc@entry=1, argv=argv@entry=0x7fffffffdc68, env=env@entry=0x7fffffffdc78) at dl-init.c:72
#4 0x00007ffff7de89bb in call_init (env=0x7fffffffdc78, argv=0x7fffffffdc68, argc=1, l=<optimized out>) at dl-init.c:30
#5 _dl_init (main_map=0x7ffff7ffe170, argc=1, argv=0x7fffffffdc68, env=0x7fffffffdc78) at dl-init.c:120
#6 0x00007ffff7dd9c5a in _dl_start_user () from /lib64/ld-linux-x86-64.so.2
#7 0x0000000000000001 in ?? ()
#8 0x00007fffffffdf2e in ?? ()
#9 0x0000000000000000 in ?? ()

0x00007fffed8f20a0 <+80>: mov 0x50(%r15),%rax
0x00007fffed8f20a4 <+84>: lea -0xa0(%rbp),%rbx
0x00007fffed8f20ab <+91>: mov %rbx,%rdi
0x00007fffed8f20ae <+94>: mov (%rax),%r8
0x00007fffed8f20b1 <+97>: mov 0x48(%r15),%rax
0x00007fffed8f20b5 <+101>: mov (%rax),%rsi
=> 0x00007fffed8f20b8 <+104>: mov -0x18(%r8),%r9

How did -0x18(%r8) get illegal?

(gdb) info register r8
r8 0x0 0

With r8 = 0, the address -0x18 is certainly illegal. Where did the 0 come from? From the pointer stored at 0x50(%r15), if we trace through the above.

(gdb) info register r15
r15 0x555555768d10 93824994413840

(gdb) x/2 0x555555768d60
0x555555768d60: 0xee2c0bc0 0x00007fff

(gdb) x/2 0x00007fffee2c0bc0
0x7fffee2c0bc0 <google::protobuf::internal::fixed_address_empty_string>: 0x00000000 0x00000000

... the 0x0 that ended up in r8.
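
The pointer chase above can be double-checked with plain integer arithmetic. The following is only an illustrative sketch: `effective_address` and `check_trace` are hypothetical helper names, and the constants are copied from the gdb session above.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper: the address that a `disp(%reg)` operand refers to,
// computed with the same 64-bit wraparound arithmetic the CPU uses.
constexpr std::uint64_t effective_address(std::uint64_t base, std::int64_t disp) {
    return base + static_cast<std::uint64_t>(disp);
}

// Replays the two interesting loads with the register values reported by gdb.
inline void check_trace() {
    constexpr std::uint64_t r15 = 0x555555768d10;  // from `info register r15`
    constexpr std::uint64_t r8 = 0x0;              // from `info register r8`

    // mov 0x50(%r15),%rax loads from r15 + 0x50, the address given to `x/2`.
    assert(effective_address(r15, 0x50) == 0x555555768d60);

    // mov -0x18(%r8),%r9 with r8 == 0 wraps around to the very top of the
    // 64-bit address space, which a user process cannot read: SIGSEGV.
    assert(effective_address(r8, -0x18) == 0xffffffffffffffe8ULL);
}
```

Calling `check_trace()` from any `main` passes silently when the arithmetic matches the session.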

Zoom out to find lots of stuff uninitialized:

(gdb) x/64x 0x7fffee4ddb00
0x7fffee4ddb00 <google::protobuf::_DoubleValue_default_instance_>: 0x00000000 0x00000000 0x00000000 0x00000000
0x7fffee4ddb10 <google::protobuf::_DoubleValue_default_instance_+16>: 0x00000000 0x00000000 0x00000000 0x00000000
0x7fffee4ddb20 <_ZStL8__ioinit>: 0x00000000 0x00000000 0x00000000 0x00000000
0x7fffee4ddb30 <_ZStL8__ioinit>: 0x00000000 0x00000000 0x00000000 0x00000000
0x7fffee4ddb40 <google::protobuf::internal::RepeatedPrimitiveDefaults::default_instance()::instance>: 0x00000000 0x00000000 0x00000000 0x00000000
0x7fffee4ddb50 <guard variable for google::protobuf::internal::RepeatedStringTypeTraits::GetDefaultRepeatedField()::instance>: 0x00000000 0x00000000 0x00000000 0x00000000
0x7fffee4ddb60 <guard variable for google::protobuf::internal::(anonymous namespace)::Register(google::protobuf::MessageLite const*, int, google::protobuf::internal::ExtensionInfo)::local_static_registry>: 0x00000000 0x00000000 0x00000000 0x00000000
0x7fffee4ddb70 <_ZStL8__ioinit>: 0x00000000 0x00000000 0x00000000 0x00000000
0x7fffee4ddb80 <google::protobuf::internal::InitSCCImpl(google::protobuf::internal::SCCInfoBase*)::mu>: 0x00000000 0x00000000 0x00000000 0x00000000
0x7fffee4ddb90 <google::protobuf::internal::InitSCCImpl(google::protobuf::internal::SCCInfoBase*)::mu+16>: 0x00000000 0x00000000 0x00000000 0x00000000
0x7fffee4ddba0 <google::protobuf::internal::InitSCCImpl(google::protobuf::internal::SCCInfoBase*)::mu+32>: 0x00000000 0x00000000 0x00000000 0x00000000
0x7fffee4ddbb0 <guard variable for google::protobuf::internal::InitSCCImpl(google::protobuf::internal::SCCInfoBase*)::runner>: 0x00000000 0x00000000 0x00000000 0x00000000
0x7fffee4ddbc0 <google::protobuf::internal::fixed_address_empty_string>: 0x00000000 0x00000000 0x00000000 0x00000000
0x7fffee4ddbd0 <google::protobuf::internal::implicit_weak_message_default_instance>: 0x00000000 0x00000000 0x00000000 0x00000000
0x7fffee4ddbe0 <google::protobuf::internal::implicit_weak_message_default_instance+16>: 0x00000000 0x00000000 0x00000000 0x00000000
0x7fffee4ddbf0 <google::protobuf::ShutdownProtobufLibrary()::is_shutdown>: 0x00000000 0x00000000 0x00000000 0x00000000


matt-har-vey · 2019-01-17T06:36:27Z

I found a temporary workaround for myself, but it should still be possible to do this from released binaries without the need to rebuild.

A local opt build from r1.12 at a6d8ffa works:

bazel build -c opt --copt=-mavx --define=grpc_no_ares=true //tensorflow/tools/lib_package:libtensorflow

tar zxvf ../tensorflow/bazel-bin/tensorflow/tools/lib_package/libtensorflow.tar.gz

However I get the segfault from

https://storage.googleapis.com/tensorflow/libtensorflow/libtensorflow-cpu-linux-x86_64-1.12.0.tar.gz

with protobuf built locally from

https://github.com/protocolbuffers/protobuf/releases/download/v3.6.0/protobuf-all-3.6.0.tar.gz

and also from

https://storage.googleapis.com/tensorflow-nightly/github/tensorflow/lib_package/libtensorflow-cpu-linux-x86_64.tar.gz # Wed Jan 16 22:33:29 PST 2019

with protobuf built locally from head (3.6.1) around the same time.

TravisWhitaker · 2019-02-06T00:40:44Z

I just hit this with the libtegra_tensorflow.so that Nvidia provides on their Xavier board's rootfs. Is building from scratch really the only workaround?

klapstoelpiloot · 2019-02-07T13:56:43Z

I also stumbled upon this problem.

On Ubuntu 18.04.1, using GCC 7.3.0
Using libtensorflow-cpu-linux-x86_64-1.12.0.tar.gz from the same link @matth79 mentioned above.
Using protobuf 3.6.1 built locally.

The problem is easy to replicate:

#include <iostream>
#include "tensorflow/c/c_api.h"

// Enable this line to include protobuf
//#include "google/protobuf/message.h"

// Main program entry
int main(int argc, char* argv[])
{
    std::cout << "Tensorflow version: " << TF_Version();
    return 0;
}

Link with -ltensorflow and it works fine. Uncomment the line to include protobuf and link with both -ltensorflow and -lprotobuf and observe the segmentation fault on initialization.

skye · 2019-02-08T19:30:28Z

@gunan @allenlavoie can either of you comment?

klapstoelpiloot · 2019-03-11T19:24:56Z

This was reported over a month ago and we're still having issues with it. An update or fix would be very much appreciated!

artificialbrains · 2019-03-11T21:51:40Z

We are also hitting this problem on NVidia's Xavier and would appreciate an update or fix. If there are no plans to fix the bug, we will try to build Tensorflow with the hints from matth79.

allenlavoie · 2019-03-11T22:14:57Z

Sounds like it must be a symbol conflict. And since it's the same library, it's not a case where we can just rename one of the symbols to avoid the conflict. The workarounds sound like (1) only load the second copy of protobuf in a .so that does not use TensorFlow, and you can use both that .so and TensorFlow's .so from your main program, (2) instead of linking normally, dlopen() TensorFlow with RTLD_DEEPBIND set so TensorFlow prefers its own symbols.

I'm not sure what TensorFlow can do. Putting something in the global symbol table which conflicts with TensorFlow's protobuf usage isn't something we can easily work around. Unless someone has a suggestion?
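
For readers who want to try workaround (2), here is a minimal sketch of what it could look like. It is untested against TensorFlow itself: `open_deepbind` and `tf_version_from` are hypothetical helper names, the library path is whatever your installation uses, and `RTLD_DEEPBIND` is a glibc extension, so this is Linux-specific. The only TensorFlow symbol assumed is `TF_Version()` from `tensorflow/c/c_api.h`.

```cpp
#ifndef _GNU_SOURCE
#define _GNU_SOURCE  // RTLD_DEEPBIND is a glibc extension
#endif
#include <dlfcn.h>   // dlopen, dlsym, dlerror

#include <cassert>
#include <string>

// Hypothetical helper: open a shared library so that it looks up symbols in
// its own dependencies before the global scope (RTLD_DEEPBIND).
// Returns nullptr on failure and, if `error` is given, stores the dlerror() text.
inline void* open_deepbind(const char* path, std::string* error) {
    void* handle = dlopen(path, RTLD_NOW | RTLD_LOCAL | RTLD_DEEPBIND);
    if (handle == nullptr && error != nullptr) {
        const char* msg = dlerror();
        *error = msg ? msg : "unknown dlopen failure";
    }
    return handle;
}

// Matches the declaration of TF_Version() in tensorflow/c/c_api.h.
using TfVersionFn = const char* (*)();

// Looks up TF_Version in a handle returned by open_deepbind().
// Returns an empty string if the symbol is not found.
inline std::string tf_version_from(void* handle) {
    assert(handle != nullptr);
    auto fn = reinterpret_cast<TfVersionFn>(dlsym(handle, "TF_Version"));
    return fn ? std::string(fn()) : std::string();
}
```

With this approach the binary would be linked with -lprotobuf but not with -ltensorflow (plus -ldl on older glibc), and every TensorFlow C API function it needs would be fetched through `dlsym` in the same way as `TF_Version`.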

HuaDongShiFanLX · 2019-03-14T06:26:11Z

Hello. I get the same problem, with output like this:
0x00007fffddef3058 in tensorflow::kernel_factory::OpKernelRegistrar::InitInternal(tensorflow::KernelDef const*, absl::string_view, std::unique_ptr<tensorflow::kernel_factory::OpKernelFactory, std::default_delete<tensorflow::kernel_factory::OpKernelFactory> >) ()
from /usr/local/lib/python3.5/dist-packages/tensorflow/python/../libtensorflow_framework.so

I am using C++ to call Python's TensorFlow. Our own environment already uses the protobuf library, and the above problem occurs when we run Python's "import tensorflow as tf" from C++ in that environment. When the "import tensorflow as tf" is removed, the problem disappears. Do you know the reason?

I think that the protobuf in my environment conflicts with the protobuf in TensorFlow.

Can you help me? Thanks.

TravisWhitaker · 2019-03-18T21:25:52Z

This is indeed a problem with protobuf; there's not much TF itself can do as @allenlavoie mentioned. We dealt with this by running TF operations in a separate process that talks over a UNIX socket, but @allenlavoie's solutions should work too.
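
As a rough illustration of the separate-process approach, the sketch below starts a worker executable and exchanges one request and one reply with it over a UNIX-domain socket pair. Everything here is an assumption made for illustration: `call_worker` is a hypothetical helper, the worker is expected to read the request on stdin and write the reply to stdout, and the framing (send everything, then half-close) is only suitable for small messages. In a real setup the worker would be a separate binary that links TensorFlow but not a second protobuf.

```cpp
#include <sys/socket.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

#include <cassert>
#include <string>

// Hypothetical helper: run `worker_path` as a child process, send it `request`
// over a UNIX-domain socket (wired to the child's stdin/stdout), and return
// everything the child writes back. Returns an empty string on failure.
inline std::string call_worker(const char* worker_path, const std::string& request) {
    assert(worker_path != nullptr);
    int fds[2];
    if (socketpair(AF_UNIX, SOCK_STREAM, 0, fds) != 0) return {};

    const pid_t pid = fork();
    if (pid < 0) {
        close(fds[0]);
        close(fds[1]);
        return {};
    }

    if (pid == 0) {
        // Child: use the socket as stdin/stdout, then replace the process
        // image so the worker has its own, separately loaded libraries.
        close(fds[0]);
        dup2(fds[1], STDIN_FILENO);
        dup2(fds[1], STDOUT_FILENO);
        close(fds[1]);
        execlp(worker_path, worker_path, static_cast<char*>(nullptr));
        _exit(127);  // only reached if exec failed
    }

    // Parent: send the request, then half-close to mark the end of input.
    close(fds[1]);
    std::size_t sent = 0;
    while (sent < request.size()) {
        const ssize_t n = send(fds[0], request.data() + sent,
                               request.size() - sent, MSG_NOSIGNAL);
        if (n <= 0) break;
        sent += static_cast<std::size_t>(n);
    }
    shutdown(fds[0], SHUT_WR);

    // Read the reply until the worker closes its end.
    std::string reply;
    char buf[4096];
    ssize_t n;
    while ((n = recv(fds[0], buf, sizeof buf, 0)) > 0) {
        reply.append(buf, static_cast<std::size_t>(n));
    }
    close(fds[0]);

    int status = 0;
    waitpid(pid, &status, 0);
    return reply;
}
```

Using `cat` as a stand-in worker, `call_worker("cat", "ping")` echoes the request back, which is a quick way to check the plumbing before wiring in a real TensorFlow worker.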

TravisWhitaker · 2019-03-18T21:27:14Z

I hope the readers have learned a valuable lesson about using static initializers in this way from this thread.

abcdabcd987 · 2020-01-13T03:59:48Z

I also have this issue. Reproduced with libtensorflow-gpu-linux-x86_64-1.15.0.tar.gz.

gunan · 2020-01-21T04:41:11Z

While I do not want to close this issue, as @allenlavoie wrote in #24976 (comment) , I am not sure what we can do.
TF is working on the slow path to hide all protobuf symbols from its API surface. Even then static initializers will be executed twice. I am not sure what will happen, as I am not sure how protobuf uses them.

So, unfortunately I can only offer #24976 (comment) , and we should close this as "Infeasible".

tensorflow-bot · 2020-01-21T04:41:14Z

Are you satisfied with the resolution of your issue?

gittripley · 2020-03-24T01:13:52Z

> Hello. I get the same problem, with output like this:
> 0x00007fffddef3058 in tensorflow::kernel_factory::OpKernelRegistrar::InitInternal(tensorflow::KernelDef const*, absl::string_view, std::unique_ptr<tensorflow::kernel_factory::OpKernelFactory, std::default_delete<tensorflow::kernel_factory::OpKernelFactory> >) ()
> from /usr/local/lib/python3.5/dist-packages/tensorflow/python/../libtensorflow_framework.so
>
> I am using C++ to call Python's TensorFlow. Our own environment already uses the protobuf library, and the above problem occurs when we run Python's "import tensorflow as tf" from C++ in that environment. When the "import tensorflow as tf" is removed, the problem disappears. Do you know the reason?
>
> I think that the protobuf in my environment conflicts with the protobuf in TensorFlow.
>
> Can you help me? Thanks.

I ran into a core dump when calling import tensorflow using the C++ Python API.

Thread 1 "tf" received signal SIGSEGV, Segmentation fault. google::protobuf::internal::AddDescriptors(google::protobuf::internal::DescriptorTable const*) () from /usr/local/lib/python3.6/dist-packages/google/protobuf/pyext/_message.cpython-36m-x86_64-linux-gnu.so

Finally, I installed the Python protobuf package that matches TensorFlow's protobuf version, 3.7.1, and it magically works. I don't know how to check the protobuf version inside the TensorFlow libraries libtensorflow_framework.so or _pywrap_tensorflow_internal.so.

Since TensorFlow 1.14 requires protobuf >= 3.6.1, I installed 3.6.1 first, and then my program threw this error:
[libprotobuf FATAL external/protobuf_archive/src/google/protobuf/stubs/common.cc:86] This program was compiled against version 3.6.1 of the Protocol Buffer runtime library, which is not compatible with the installed version (3.7.1). Contact the program author for an update. If you compiled the program yourself, make sure that your headers are from the same version of Protocol Buffers as your link-time library. (Version verification failed in "google/protobuf/descriptor.pb.cc".)

However, if I install Python protobuf 3.11.3, I get a segfault.

So once I switch protobuf to 3.7.1, it works.
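
The dotted versions in the error message above are printed from a single integer. The sketch below decodes such an integer, assuming the major*1000000 + minor*1000 + patch layout that, to my knowledge, the `GOOGLE_PROTOBUF_VERSION` macro uses in the 3.x headers. `decode_protobuf_version` is a hypothetical helper name. This does not answer how to inspect the copy embedded in libtensorflow_framework.so; it only helps compare the versions your own code was compiled against.

```cpp
#include <cassert>
#include <string>

// Hypothetical helper: turn an integer in the assumed
// major*1000000 + minor*1000 + patch layout into "major.minor.patch".
inline std::string decode_protobuf_version(int encoded) {
    assert(encoded >= 0);
    const int major = encoded / 1000000;
    const int minor = (encoded / 1000) % 1000;
    const int patch = encoded % 1000;
    return std::to_string(major) + "." + std::to_string(minor) + "." +
           std::to_string(patch);
}
```

Under that assumption, 3006001 decodes to "3.6.1" and 3007001 to "3.7.1", the two versions named in the error message. In a translation unit that includes the protobuf headers, passing `GOOGLE_PROTOBUF_VERSION` would report the header version that unit was built with.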

@gnossen gave a great overview in grpc/grpc#24992 of the overall problem: a Python process that uses both protobuf _and_ another native library linking in libprotobuf can frequently crash. This seems to frequently affect tensorflow as well: tensorflow/tensorflow#8394, tensorflow/tensorflow#9525 (comment), tensorflow/tensorflow#24976, tensorflow/tensorflow#35573, https://github.com/tensorflow/tensorflow/blob/v2.0.0/tensorflow/contrib/makefile/rename_protobuf.sh, tensorflow/tensorflow#16104. Testing locally, this fixes both crashes when linking in multiple versions of protobuf and `DescriptorPool` clashes (e.g. Python and native code importing different versions of the same message). Co-authored-by: Roy Williams <roy.williams.iii@gmail.com>

ymodak added the comp:eager (Eager related issues) and type:bug (Bug) labels Feb 8, 2019

ymodak assigned skye Feb 8, 2019

klapstoelpiloot mentioned this issue Mar 19, 2019

Static initializers prevent using libraries which also use Protobuf protocolbuffers/protobuf#5914

Closed

gunan closed this as completed Jan 21, 2020

suth1807 mentioned this issue Jul 6, 2020

Relink `/usr/local/lib/libtensorflow_framework.so.2' with `/lib/x86_64-linux-gnu/libz.so.1' for IFUNC symbol `crc32_z' #41080

Closed

This was referenced Aug 12, 2020

Tensorflow lib core dumped JohnSnowLabs/spark-nlp#996

Closed

Problem with spark-nlp JohnSnowLabs/spark-nlp#995

Closed

deadeyegoodwin mentioned this issue Nov 11, 2020

Protobuf version conflicts r19.10 triton-inference-server/server#2243

Closed

rowillia mentioned this issue Feb 12, 2021

Python Wheel using dynamic linking (e.g. built with --cpp_implementation not --cpp_implementation --compile_static_extension) protocolbuffers/protobuf#8291

Closed

rowillia mentioned this issue Feb 25, 2021

Make libprotobuf symbols local on OSX protocolbuffers/protobuf#8346

Merged

sighingnow mentioned this issue May 26, 2021

[Feature Request] Allow custom namespace to avoid protobuf version/linking conflicts protocolbuffers/protobuf#4004

Closed

sammymax mentioned this issue Aug 4, 2022

[ScaNN] SIGILL when installing PIP module google-research/google-research#1224

Open
