Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

tf-head: Fix undefined symbols by linking libtensorflow_cc.so.2 #3755

Merged
merged 2 commits into from
Dec 9, 2022

Conversation

maxhgerlach
Copy link
Collaborator

@maxhgerlach maxhgerlach commented Oct 25, 2022

Checklist before submitting

  • Did you read the contributor guide?
  • Did you update the docs?
  • Did you write any tests to validate this change?
  • Did you update the CHANGELOG, if this change affects users?

Description

Fixes #3754

For context see tensorflow/tensorflow#58281

Review process to land

  1. All tests and other checks must succeed.
  2. At least one member of the technical steering committee must review and approve.
  3. If any member of the technical steering committee requests changes, they must be addressed.

@maxhgerlach maxhgerlach changed the title Draft: tf-head: Fix undefined symbols by linking libtensorflow_cc.so.2 tf-head: Fix undefined symbols by linking libtensorflow_cc.so.2 Oct 25, 2022
@maxhgerlach maxhgerlach marked this pull request as ready for review October 25, 2022 11:31
@maxhgerlach
Copy link
Collaborator Author

This fixes the problem for me. Verified locally:

$ pip install -U tf-nightly-cpu==2.12.0.dev20221024
$ pip install -U git+https://github.com/horovod/horovod.git@refs/pull/3755/head
$ python -c 'import horovod.tensorflow'

Copy link
Collaborator

@EnricoMi EnricoMi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for keeping HEADs green, LGTM!

@maxhgerlach
Copy link
Collaborator Author

New packages tf_nightly_{cpu,gpu}-2.12.0.dev20221025 have appeared by now and it looks like the libtensorflow_cc.so.2 has been removed again. I think the change will be intermittent, so I'll add some conditional to only link the file if it exists.

c++: error: /usr/local/lib/python3.8/dist-packages/tensorflow/libtensorflow_cc.so.2: No such file or directory

https://buildkite.com/horovod/horovod/builds/8583#01840f11-bd37-4c4e-a85f-4ef0e12838d9

@maxhgerlach
Copy link
Collaborator Author

maxhgerlach commented Oct 25, 2022

(I verified locally that this still works with tf_nightly_cpu-2.12.0.dev20221024, so chances are it will continue to work transparently once libtensorflow_cc.so.2 is re-introduced.)

@github-actions
Copy link

github-actions bot commented Oct 25, 2022

Unit Test Results

  1 191 files    1 191 suites   12h 48m 55s ⏱️
     842 tests      787 ✔️      53 💤 0    2 🔥
24 444 runs  17 416 ✔️ 7 012 💤 0  16 🔥

For more details on these errors, see this check.

Results for commit 241ff9a.

♻️ This comment has been updated with latest results.

@github-actions
Copy link

github-actions bot commented Oct 25, 2022

Unit Test Results (with flaky tests)

  1 603 files    1 603 suites   15h 11m 13s ⏱️
     842 tests      787 ✔️      53 💤 0    2 🔥
33 444 runs  23 446 ✔️ 9 950 💤 0  48 🔥

For more details on these errors, see this check.

Results for commit 241ff9a.

♻️ This comment has been updated with latest results.

@EnricoMi
Copy link
Collaborator

EnricoMi commented Oct 25, 2022

I am not sure if we should keep this while head version does not require this. We do not intent to support head versions older than today (head of head only if you will).

@@ -25,6 +25,11 @@ if (LEN EQUAL "4")
if (Tensorflow_VERSION VERSION_GREATER_EQUAL "2.6")
# XLA implementations and helpers for resource variables are in _pywrap_tensorflow_internal.so
set(Tensorflow_LIBRARIES "${Tensorflow_LIBRARIES} ${Tensorflow_LIB_PATH}/python/_pywrap_tensorflow_internal.so")
set(lib_tensorflow_cc_so_path "${Tensorflow_LIB_PATH}/libtensorflow_cc.so.2")
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what does the .2 stand for?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I hope that stands for TensorFlow 2.x, but I have not dived into the build system to confirm. The main library in that path is libtensorflow_framework.so.2. We get the latter via tf.sysconfig.get_link_flags(), but I couldn't find an API to resolve the new libtensorflow_cc.so.2.

@maxhgerlach
Copy link
Collaborator Author

I am not sure if we should keep this while head version does not require this. We do not intent to support head versions older than today (head of head only if you will).

Agreed. We can always merge this after the rollback of that TensorFlow PR is reverted (comments in tensorflow/tensorflow#55941).

@maxhgerlach
Copy link
Collaborator Author

maxhgerlach commented Dec 8, 2022

The TF refactoring that motivated the fix in this PR appears to be active again: tensorflow/tensorflow#58734

Presumably Horovod CI will soon fail for tf-head again. Then we should rebase this PR to master, and confirm that it fixes the problem.

Signed-off-by: Max H. Gerlach <git@maxgerlach.de>
Signed-off-by: Max H. Gerlach <git@maxgerlach.de>
@maxhgerlach maxhgerlach force-pushed the fix-tf-head-undefined-symbols branch from 2968c27 to 241ff9a Compare December 9, 2022 09:40
@maxhgerlach
Copy link
Collaborator Author

Undefined symbol errors returned to master for tf-head:

E   tensorflow.python.framework.errors_impl.NotFoundError: /usr/local/lib/python3.8/dist-packages/horovod/tensorflow/mpi_lib.cpython-38-x86_64-linux-gnu.so: undefined symbol: _ZTIN10tensorflow11XlaOpKernelE

https://github.com/horovod/horovod/actions/runs/3650378067/jobs/6171827788

@maxhgerlach maxhgerlach marked this pull request as ready for review December 9, 2022 09:44
@maxhgerlach
Copy link
Collaborator Author

maxhgerlach commented Dec 9, 2022

OK, this PR still fixes the unknown symbols errors.

There's another problem with tf-head / Keras though (edit: tracked at #3795). I'd suggest to merge this PR and address the other problem separately.

___________ ERROR collecting test/parallel/test_tensorflow_keras.py ____________
ImportError while importing test module '/horovod/test/parallel/test_tensorflow_keras.py'.
Hint: make sure your test modules/packages have valid Python names.
Traceback:
/usr/lib/python3.8/importlib/__init__.py:127: in import_module
    return _bootstrap._gcd_import(name[level:], package, level)
test_tensorflow_keras.py:36: in <module>
    from keras.optimizers.optimizer_v2 import optimizer_v2
E   ModuleNotFoundError: No module named 'keras.optimizers.optimizer_v2'
___________ ERROR collecting test/parallel/test_tensorflow2_keras.py ___________
ImportError while importing test module '/horovod/test/parallel/test_tensorflow2_keras.py'.
Hint: make sure your test modules/packages have valid Python names.
Traceback:
/usr/lib/python3.8/importlib/__init__.py:127: in import_module
    return _bootstrap._gcd_import(name[level:], package, level)
test_tensorflow2_keras.py:35: in <module>
    from keras.optimizers.optimizer_v2 import optimizer_v2
E   ModuleNotFoundError: No module named 'keras.optimizers.optimizer_v2'

@maxhgerlach maxhgerlach merged commit 35b27e9 into master Dec 9, 2022
@maxhgerlach maxhgerlach deleted the fix-tf-head-undefined-symbols branch September 11, 2023 16:09
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
2 participants