Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

TensorFlow now builds with Bazel 1.0 #33415

Merged
merged 4 commits into from
Oct 24, 2019

Conversation

laszlocsomor
Copy link
Contributor

@laszlocsomor laszlocsomor commented Oct 16, 2019

Fix cc_library targets by adding 'alwayslink = 1'
so targets depending on them build with
--incompatible_remove_legacy_whole_archive, which
is enabled by default in Bazel 1.0.

Inspired by c3619ce

Fixes #32835

Fix cc_library targets by adding 'alwayslink = 1'
so targets depending on them build with
--incompatible_remove_legacy_whole_archive, which
is enabled by default in Bazel 1.0.

Fixes tensorflow#32835
@tensorflow-bot tensorflow-bot bot added the size:XS CL Change Size: Extra Small label Oct 16, 2019
@gbaned gbaned self-assigned this Oct 16, 2019
@gbaned gbaned added the prtype:bugfix PR to fix a bug label Oct 16, 2019
@gbaned gbaned requested a review from av8ramit October 16, 2019 09:29
@mihaimaruseac mihaimaruseac requested a review from gunan October 16, 2019 17:08
@laszlocsomor
Copy link
Contributor Author

ping

@mihaimaruseac mihaimaruseac requested review from meteorcloudy and removed request for av8ramit and gunan October 18, 2019 15:42
@mihaimaruseac
Copy link
Collaborator

I don't think alwayslink = 1 is the right solution here. @meteorcloudy is working on enabling TF to build with Bazel 1.0.

@laszlocsomor
Copy link
Contributor Author

I don't think alwayslink = 1 is the right solution here.

Can you explain why?

@meteorcloudy
Copy link
Member

@mihaimaruseac I think specifying alwayslink = 1 for every cc_library that need to be force linked is the correct way to migrate for --incompatible_remove_legacy_whole_archive, @laszlocsomor is helping us to make TF build with 1.0. I'll help upgrading Bazel on CI later.

@tensorflow-bot tensorflow-bot bot added kokoro:force-run Tests on submitted change ready to pull PR ready for merge process labels Oct 21, 2019
@kokoro-team kokoro-team removed the kokoro:force-run Tests on submitted change label Oct 21, 2019
@mihaimaruseac
Copy link
Collaborator

In this case approving. I was thinking that alwayslink = 1 will result in more bloat in the binaries, but happy to be proven wrong and learn that it is useful. Thanks.

tensorflow-copybara pushed a commit that referenced this pull request Oct 24, 2019
PiperOrigin-RevId: 276476881
Change-Id: Ife776ebdc88038a99f064ac895a86889a2c01473
@tensorflow-copybara tensorflow-copybara merged commit 1119076 into tensorflow:master Oct 24, 2019
@meteorcloudy
Copy link
Member

@laszlocsomor , I did a rerun with TF@HEAD, but the build failure doesn't seem to be fixed.
https://buildkite.com/bazel/tensorflow/builds/3755

Are we missing anything?

@laszlocsomor laszlocsomor deleted the bazel-1.0 branch October 25, 2019 07:08
@laszlocsomor
Copy link
Contributor Author

Well this is embarrassing, but it seems I was actually using 0.26.1 because of the .bazelversion and because I used Bazelisk.

Looking into failures with the real Bazel 1.1.0 now, then again with Bazel 1.0.0. I'll keep this thread updated.

@laszlocsomor
Copy link
Contributor Author

Fortunately, @scentini has also been working on this same issue and made far more progress than I did.

@timokau
Copy link

timokau commented Oct 25, 2019

Fortunately, @scentini has also been working on this same issue and made far more progress than I did.

Is that progress publicly available?

@flokli
Copy link

flokli commented Oct 25, 2019

Also would be very interested in trying this out :-)

tensorflow-copybara pushed a commit that referenced this pull request Oct 25, 2019
bazelbuild/bazel#7362

This PR updates `cc_library`s to use `alwayslink=1` when needed. Currently, library archives and library object groups are wrapped in `--whole_archive`, `--no_whole_archive`, thus the whole library is always linked. This changes with Bazel 1.0, and libraries that export otherwise unused symbols must be tagged as `alwayslink=1`.

This PR fixes #32835 (once again, as it was fixed by #33415 but the codebase regressed in the meantime.)
It also fixes most of the TF tests.

PiperOrigin-RevId: 276772147
Change-Id: I05db02e84200a484df52f4f02b601dc486636912
sunway513 pushed a commit to ROCm/tensorflow-upstream that referenced this pull request Oct 26, 2019
bazelbuild/bazel#7362

This PR updates `cc_library`s to use `alwayslink=1` when needed. Currently, library archives and library object groups are wrapped in `--whole_archive`, `--no_whole_archive`, thus the whole library is always linked. This changes with Bazel 1.0, and libraries that export otherwise unused symbols must be tagged as `alwayslink=1`.

This PR fixes tensorflow#32835 (once again, as it was fixed by tensorflow#33415 but the codebase regressed in the meantime.)
It also fixes most of the TF tests.

PiperOrigin-RevId: 276772147
Change-Id: I05db02e84200a484df52f4f02b601dc486636912
@scentini
Copy link
Contributor

The commit in question is 52167ad. TF at HEAD currently builds with Bazel 1.0: https://buildkite.com/bazel/tensorflow/

There are still a couple of tests that are failing with Bazel 1.0, and I'm working on fixing them. Then, I will set --incompatible_remove_legacy_whole_archive to True in .bazelrc, to prevent regression.

@timokau
Copy link

timokau commented Oct 28, 2019

Is there some kind of hotfix one can apply to compile the current releases (1.15.0 and 2.0.0) with bazel 1.0?

@laszlocsomor
Copy link
Contributor Author

@timokau : Can you patch b4e6901 and 52167ad onto your source tree? I think that should do it.

@timokau
Copy link

timokau commented Oct 31, 2019

@laszlocsomor unfortunately neither of those apply on the 1.15.0 or 2.0.0 trees. Is there any other option?

@laszlocsomor
Copy link
Contributor Author

@timokau : Sorry, I'm not aware of any. Can you try resolving the merge conflicts? I just tried cherry-picking the commits onto r1.15: some conflicting targets don't yet exist (I deleted those), some files don't exist (deleted those too). There are some genuine conflicts too that I now cannot inspect closer, but the goal of these two cherry-picked commits was to add alwayslink = 1 to cc_library rules that contain source files that define methods that the linker would strip without the alwayslink. So if you try building TensorFlow and the linker complains about missing functions, find the .cc files where they are defined, find the cc_library where those .cc files are in the srcs, and add alwayslink = 1. Hope that helps -- if you're still stuck, I can help again on Monday next week.

timokau pushed a commit to timokau/tensorflow that referenced this pull request Nov 2, 2019
bazelbuild/bazel#7362

This PR updates `cc_library`s to use `alwayslink=1` when needed. Currently, library archives and library object groups are wrapped in `--whole_archive`, `--no_whole_archive`, thus the whole library is always linked. This changes with Bazel 1.0, and libraries that export otherwise unused symbols must be tagged as `alwayslink=1`.

This PR fixes tensorflow#32835 (once again, as it was fixed by tensorflow#33415 but the codebase regressed in the meantime.)
It also fixes most of the TF tests.

PiperOrigin-RevId: 276772147
Change-Id: I05db02e84200a484df52f4f02b601dc486636912
(cherry picked from commit 52167ad)
@timokau
Copy link

timokau commented Nov 2, 2019

Yeah, unfortunately I'm still stuck. I've gotten this far with cherry-picks: timokau@cec52a8

But the build still fails with

bazel-out/host/bin/tensorflow/contrib/ignite/gen_gen_igfs_ops_py_wrappers_cc: symbol lookup error: /build/output/execroot/org_tensorflow/bazel-out/host/bin/tensorflow/contrib/ignite/../../../_solib_k8/_U_S_Stensorflow_Scontrib_Signite_Cgen_Ugen_Uigfs_Uops_Upy_Uwrappers_Ucc___Utensorflow/libtensorflow_framework.so.1: undefined symbol: scc_info_DoubleValue_google_2fprotobuf_2fwrappers_2eproto

and I don't know where this symbol is coming from. Any help would be much appreciated from me and maybe also other distro maintainers.

@laszlocsomor
Copy link
Contributor Author

@scentini: does this error look familiar?

@kgizdov
Copy link

kgizdov commented Nov 7, 2019

@scentini @laszlocsomor any idea which package causes this:

ERROR: /build/tensorflow/src/tensorflow-2.0.0/tensorflow/python/BUILD:2158:1: Executing genrule //tensorflow/python:parsing_ops_pygenrule failed (Exit 127)
bazel-out/host/bin/tensorflow/python/gen_parsing_ops_py_wrappers_cc: symbol lookup error: /build/.cache/bazel/_bazel_builduser/539918b59a0d8a3726d6fc114e247083/execroot/org_tensorflow/bazel-out/host/bin/tensorflow/python/../../_solib_k8/_U_S_Stensorflow_Spython_Cgen_Uparsing_Uops_Upy_Uwrappers_Ucc___Utensorflow/libtensorflow_framework.so.2: undefined symbol: scc_info_DoubleValue_google_2fprotobuf_2fwrappers_2eproto

? from the log it points to something in tensorflow/python but not sure what.

@scentini
Copy link
Contributor

scentini commented Nov 7, 2019

Can you try patching 2d2d6cf ?
The missing symbol is a protobuf one.
I'll share a list of all the changes we made regarding --incompatible_remove_legacy_whole_archive later today. Note that we are still encountering issues though.

@scentini
Copy link
Contributor

scentini commented Nov 7, 2019

Here it is:
c3619ce
afb7639
9760bc2
b4e6901
52167ad
2d2d6cf

Also, see #34044, not yet merged.

@timokau
Copy link

timokau commented Nov 7, 2019

Thank you! Unfortunately with all those patches cherry-picked (to the best of my conflict-resolution ability), the build still fails:

ERROR: /build/source/tensorflow/compiler/tf2xla/ops/BUILD:25:1: Executing genrule //tensorflow/compiler/tf2xla/ops:gen_xla_ops_pygenrule failed (Exit 127)
[2,139 / 2,289] 3 actions running
    Compiling external/pcre/pcre_exec.c [for host]; 2s local
    Compiling external/pcre/pcre_compile.c [for host]; 0s local
    Compiling external/swig/Source/Modules/python.cxx [for host]; 0s local
bazel-out/host/bin/tensorflow/compiler/tf2xla/ops/gen_gen_xla_ops_py_wrappers_cc: symbol lookup error: /build/output/execroot/org_tensorflow/bazel-out/host/bin/tensorflow/compiler/tf2xla/ops/../../../../_solib_k8/_U_S_Stensorflow_Scompiler_Stf2xla_Sops_Cgen_Ugen_Uxla_Uops_Upy_Uwrappers_Ucc___Utensorflow/libtensorflow_framework.so.1: undefined symbol: _ZN10tensorflow16ProtoDebugStringB5cxx11ERKNS_16DeviceAttributesE
[2,139 / 2,289] 3 actions running

Should I give up on trying to get 1.15 to build?

@kgizdov
Copy link

kgizdov commented Nov 8, 2019

@scentini thanks. I managed to port the patches you listed. On v2.0.0 it still fails with the following error:

ERROR: /build/tensorflow/src/tensorflow-2.0.0/tensorflow/python/BUILD:2212:1: Executing genrule //tensorflow/python:state_ops_pygenrule failed (Exit 127)
bazel-out/host/bin/tensorflow/python/gen_state_ops_py_wrappers_cc: symbol lookup error: /build/.cache/bazel/_bazel_builduser/539918b59a0d8a3726d6fc114e247083/execroot/org_tens
orflow/bazel-out/host/bin/tensorflow/python/../../_solib_k8/_U_S_Stensorflow_Spython_Cgen_Ustate_Uops_Upy_Uwrappers_Ucc___Utensorflow/libtensorflow_framework.so.2: undefined s
ymbol: _ZN10tensorflow16ProtoDebugStringB5cxx11ERKNS_16DeviceAttributesE

Is there an easy way of identifying which cc_library is the culprit, so that I can put alwayslink = 1 accordingly? Cheers

@scentini
Copy link
Contributor

Sometimes you can just search where the method is defined, and then find the cc_library that contains it. This case was tricky though, we managed to track it to this library. I sent out a PR internally to add alwayslink=1.

tensorflow-copybara pushed a commit that referenced this pull request Nov 11, 2019
…default of --incompatible_remove_legacy_whole_archive in cl/277339372

This should fix the latest issue reported in #33415.
Also fixes an internally reported missing symbol.

Related to bazelbuild/bazel#7362

PiperOrigin-RevId: 279792794
Change-Id: I6f5d26ee37b9c886662df5e2daf9273c15cae865
timokau added a commit to timokau/tensorflow that referenced this pull request Nov 12, 2019
karimnosseir pushed a commit to karimnosseir/tensorflow that referenced this pull request Nov 13, 2019
…default of --incompatible_remove_legacy_whole_archive in cl/277339372

This should fix the latest issue reported in tensorflow#33415.
Also fixes an internally reported missing symbol.

Related to bazelbuild/bazel#7362

PiperOrigin-RevId: 279792794
Change-Id: I6f5d26ee37b9c886662df5e2daf9273c15cae865
@timokau
Copy link

timokau commented Nov 16, 2019

In case anybody else is interested in this, with this patch I got the 1.15.0 build to work with bazel 1.1. Thanks @scentini!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
cla: yes prtype:bugfix PR to fix a bug ready to pull PR ready for merge process size:XS CL Change Size: Extra Small
Projects
None yet
Development

Successfully merging this pull request may close these issues.

TF doesn't build anymore with Bazel's --incompatible_remove_legacy_whole_archive