{lib}[foss/2022a] TensorFlow v2.9.1 w/ Python 3.10.4 #17092

Flamefire
Contributor

@Flamefire Flamefire commented Jan 12, 2023

(created using eb --new-pr)

Based on #16008 by @alinelena and #16620 by @VRehnberg

Note that the build will emit this warning:

WARNING: 1 TensorFlow dependencies have not been resolved by EasyBuild. Check the log for details.

This is because Abseil became a possible $TF_SYSTEM_LIBS entry in 2.9, but that support appears to have been broken between the time that PR was opened and the 2.9 release: tensorflow/tensorflow#53765 (comment)

It might be worth handling this in the EasyBlock.
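
Conceptually, the warning is a set difference: the libraries TensorFlow could take from the system, minus those EasyBuild provides as dependencies. A minimal sketch, assuming illustrative library identifiers and a hypothetical helper name (this is not the EasyBlock's actual code):

```python
def unresolved_system_libs(tf_system_libs, resolved_by_easybuild):
    """Return the TF system libs without a matching EasyBuild dependency, sorted."""
    return sorted(set(tf_system_libs) - set(resolved_by_easybuild))


# Illustrative identifiers only (assumption): the real lists would come from the
# TensorFlow sources and from the dependencies listed in the easyconfig.
tf_system_libs = ["com_google_absl", "flatbuffers", "zlib"]
resolved = ["flatbuffers", "zlib"]

unresolved = unresolved_system_libs(tf_system_libs, resolved)
print("Unresolved: %s" % ", ".join(unresolved))
```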

@jfgrimm
Member

jfgrimm commented Jan 12, 2023

Test report by @jfgrimm
Using easyblocks from PR(s) easybuilders/easybuild-easyblocks#2854
FAILED
Build succeeded for 8 out of 10 (2 easyconfigs in total)
himem04.pri.viking.alces.network - Linux CentOS Linux 7.9.2009, x86_64, Intel(R) Xeon(R) Gold 6230 CPU @ 2.10GHz (cascadelake), Python 3.6.8
See https://gist.github.com/c33aaf092e9972d5516ecaa5d2536ffe for a full test report.

@boegel boegel added the update label Jan 12, 2023
@boegel boegel added this to the next release (4.7.1?) milestone Jan 12, 2023
@Flamefire
Contributor Author

@jfgrimm I'm a bit surprised by this failure:

NodeFileWriterTest.test_simple
...
/python/framework/node_file_writer_test.py", line 142, in test_simple
    self.assertEqual(node_def1.op, 'MatMul')
AssertionError: 
- _MklMatMul
? ----
+ MatMul

Also, the source of the failing testGetMemoryInfoCPU contains a skip for when MKL is enabled, but it isn't triggered.

Are you using any toolchain modifications that pull in MKL, or do you have mkl-dnn loaded somewhere? Could you attach the full log?
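
For reading the assertion output above: when both arguments are strings, unittest's assertEqual renders the mismatch with difflib.ndiff, so the '-' line is the first argument (here node_def1.op) and the '+' line is the second (the expected 'MatMul'). A minimal sketch that produces the same kind of diff, using the two op names from the log:

```python
import difflib

actual = "_MklMatMul"   # value of node_def1.op in the failing run
expected = "MatMul"     # value the test expects

# ndiff works on lists of lines; '-' marks lines only in the first list,
# '+' marks lines only in the second, '?' lines hint at the changed characters.
diff_lines = list(difflib.ndiff([actual + "\n"], [expected + "\n"]))
print("".join(diff_lines), end="")
```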

@jfgrimm
Member

jfgrimm commented Jan 13, 2023

@Flamefire my site doesn't modify toolchains, and I really shouldn't have any MKL stuff loaded
TensorFlow-2.9.1-20230112-log.tar.gz

@Flamefire
Contributor Author

Test report by @Flamefire
Using easyblocks from PR(s) easybuilders/easybuild-easyblocks#2854
SUCCESS
Build succeeded for 2 out of 2 (2 easyconfigs in total)
taurusi8019 - Linux CentOS Linux 7.9.2009, x86_64, AMD EPYC 7352 24-Core Processor (zen2), 8 x NVIDIA NVIDIA A100-SXM4-40GB, 470.57.02, Python 2.7.5
See https://gist.github.com/a80859d5c82a34691118b7cda69da3d3 for a full test report.

@Flamefire
Contributor Author

Test report by @Flamefire
Using easyblocks from PR(s) easybuilders/easybuild-easyblocks#2854
SUCCESS
Build succeeded for 2 out of 2 (2 easyconfigs in total)
taurusi8021 - Linux CentOS Linux 7.9.2009, x86_64, AMD EPYC 7352 24-Core Processor (zen2), 8 x NVIDIA NVIDIA A100-SXM4-40GB, 470.57.02, Python 2.7.5
See https://gist.github.com/5454c70c2d5d7d4422a73c8e2fe1dac4 for a full test report.

@Flamefire
Contributor Author

Test report by @Flamefire
Using easyblocks from PR(s) easybuilders/easybuild-easyblocks#2854
SUCCESS
Build succeeded for 2 out of 2 (2 easyconfigs in total)
taurusml30 - Linux RHEL 7.6, POWER, 8335-GTX (power9le), 6 x NVIDIA Tesla V100-SXM2-32GB, 440.64.00, Python 2.7.5
See https://gist.github.com/e13895b3515f85c08ef1bdfc9dbe277e for a full test report.

@Flamefire
Contributor Author

Test report by @Flamefire
Using easyblocks from PR(s) easybuilders/easybuild-easyblocks#2854
SUCCESS
Build succeeded for 2 out of 2 (2 easyconfigs in total)
taurusa12 - Linux CentOS Linux 7.7.1908, x86_64, Intel(R) Xeon(R) CPU E5-2603 v4 @ 1.70GHz (broadwell), 3 x NVIDIA GeForce GTX 1080 Ti, 460.32.03, Python 2.7.5
See https://gist.github.com/4010143d07d606a185997ae46c937929 for a full test report.

@VRehnberg
Contributor

I'm a bit surprised by this failure:

NodeFileWriterTest.test_simple
...
/python/framework/node_file_writer_test.py", line 142, in test_simple
    self.assertEqual(node_def1.op, 'MatMul')
AssertionError: 
- _MklMatMul
? ----
+ MatMul

Also, the source of the failing testGetMemoryInfoCPU contains a skip for when MKL is enabled, but it isn't triggered.

Are you using any toolchain modifications that pull in MKL, or do you have mkl-dnn loaded somewhere? Could you attach the full log?

I'd just like to second that: I'm seeing this same failure on our system as well.

@Flamefire
Contributor Author

Test report by @Flamefire
Using easyblocks from PR(s) easybuilders/easybuild-easyblocks#2854
SUCCESS
Build succeeded for 2 out of 2 (2 easyconfigs in total)
taurusa12 - Linux CentOS Linux 7.7.1908, x86_64, Intel(R) Xeon(R) CPU E5-2603 v4 @ 1.70GHz (broadwell), 3 x NVIDIA GeForce GTX 1080 Ti, 460.32.03, Python 2.7.5
See https://gist.github.com/d5c1a8c8c0eee0548ab1c8eb626d771e for a full test report.

@SebastianAchilles
Member

Test report by @SebastianAchilles
Using easyblocks from PR(s) easybuilders/easybuild-easyblocks#2854
SUCCESS
Build succeeded for 2 out of 2 (2 easyconfigs in total)
skl-rockylinux-87 - Linux Rocky Linux 8.7, x86_64, Intel(R) Core(TM) i7-10700 CPU @ 2.90GHz (skylake), 1 x NVIDIA NVIDIA RTX A4000, 525.60.13, Python 3.6.8
See https://gist.github.com/6044f743a8e88aea0e2a9698d7707385 for a full test report.

@Flamefire
Contributor Author

I'd just like to second that: I'm seeing this same failure on our system as well.

I worked with @jfgrimm a bit to capture a log of the failing tests, but the build suddenly succeeded (no errors anymore, all tests green).

So if anyone (@VRehnberg?) is seeing those MKL-related failures consistently, maybe they can run a modified version with TF2.9_logging.patch.txt applied and send me the logs to have a look at.

I also opened PRs for TF 2.10.1 and TF 2.11.0.

The upstream issue I reported is tensorflow/tensorflow#59252

@jfgrimm
Member

jfgrimm commented Feb 2, 2023

yeah no idea what changed, but my builds (both CPU and CUDA) are happy now 🤷

@VRehnberg
Contributor

So if anyone (@VRehnberg?) is seeing those MKL-related failures consistently, maybe they can run a modified version with TF2.9_logging.patch.txt applied and send me the logs to have a look at.

I'm rerunning the builds now. They should be finished tomorrow, and I'll try to remember to upload the logs then.

@VRehnberg
Contributor

Still getting the MKL test failure:
easybuild-TensorFlow-2.9.1-20230206.161046.aaEEN.log.gz

I tried to include the logging patch with --amend, but I'm not sure if it worked.

@Flamefire
Contributor Author

@VRehnberg

I tried to include the logging patch with --amend, but I'm not sure if it worked.

No, that didn't work. I guess this is yet another instance of easybuilders/easybuild-framework#3358 / easybuilders/easybuild-framework#2222

The best approach is to download the files and modify them directly, e.g. via eb --from-pr 17092 --copy-ec /tmp/foo --stop (which places all files from the PR into /tmp/foo).
I don't have much hope that I can figure out why it doesn't work for you, but it's worth a shot.
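
Once the files are in /tmp/foo, the extra patch has to be listed in the easyconfig's patches list. A minimal sketch of that text edit, assuming the easyconfig uses a multi-line `patches = [` ... `]` list; the helper name add_patch and the sample easyconfig text are hypothetical, and any matching checksums entry would still need to be handled by hand:

```python
import re


def add_patch(ec_text, patch_name):
    """Append patch_name as the last entry of a multi-line `patches = [...]` list."""
    # Group 1 is the closing bracket that sits at the start of its own line.
    pattern = re.compile(r"^patches\s*=\s*\[\n.*?^(\])", re.MULTILINE | re.DOTALL)
    match = pattern.search(ec_text)
    if match is None:
        raise ValueError("no multi-line 'patches = [' list found")
    new_entry = "    '%s',\n" % patch_name
    return ec_text[:match.start(1)] + new_entry + ec_text[match.start(1):]


# Hypothetical, heavily shortened easyconfig text for illustration.
sample = "name = 'TensorFlow'\npatches = [\n    'first.patch',\n]\n"
print(add_patch(sample, "TF2.9_logging.patch.txt"))
```

Editing the copied file directly sidesteps the question of whether a command-line override such as --amend actually reached the patches list.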

@VRehnberg
Contributor

The second attempt at including the patch went better; thanks for the tip on easily fetching the relevant files.
easybuild-TensorFlow-2.9.1-20230207.161810.rDwiP.log.gz

…foss-2022a-CUDA-11.7.0.eb and patches: TensorFlow-2.9.1_fix-PPC-Eigen-build.patch, TensorFlow-2.9.1_remove-duplicate-gpu-tests.patch, TensorFlow-2.9.1_remove-libclang-and-io-gcs-deps.patch, TensorFlow-2.9.1_support_flatbuffers_2.0.patch, TensorFlow-2.8.4_exclude-xnnpack-on-ppc.patch, TensorFlow-2.8.4_fix-PPC-JIT.patch, TensorFlow-2.8.4_resolve-gcc-symlinks.patch
@Flamefire Flamefire force-pushed the 20230112164524_new_pr_TensorFlow291 branch from b0583d6 to 3317a0d Compare May 11, 2023 12:46
@Flamefire
Contributor Author

@VRehnberg I think I found the issue: TensorFlow checked for MKL inconsistently. One place used a runtime check for CPU features, while another only checked compile-time defines. This was fixed in TF 2.11 by tensorflow/tensorflow@5ec3d2e

I included the patch in the new commit, so this version should now build for you. Can you verify?
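
To illustrate how two such checks can disagree, here is a toy sketch. The define name, the CPU feature name, and the assignment of which side uses which check are assumptions made for illustration, not TensorFlow's actual logic:

```python
def enabled_by_define(build_defines):
    """Compile-time style check: looks only at build defines (illustrative name)."""
    return "INTEL_MKL" in build_defines


def enabled_at_runtime(cpu_features):
    """Runtime style check: looks only at detected CPU features (illustrative name)."""
    return "avx512_vnni" in cpu_features


def matmul_op(mkl_enabled):
    """Op name that a graph would contain, or that a test would expect."""
    return "_MklMatMul" if mkl_enabled else "MatMul"


def checks_agree(build_defines, cpu_features):
    """True if the op produced via the runtime check matches the expectation via defines."""
    produced = matmul_op(enabled_at_runtime(cpu_features))
    expected = matmul_op(enabled_by_define(build_defines))
    return produced == expected


# Same build, different hosts: the outcome depends on the CPU the tests run on.
print(checks_agree(set(), {"avx512_vnni"}))
print(checks_agree(set(), set()))
```

A mismatch of this kind would also explain why the same easyconfig passes on some hosts and fails on others.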

@VRehnberg
Contributor

A test report with the results will hopefully follow in a few hours.

@boegelbot
Collaborator

@Flamefire: Tests failed in GitHub Actions, see https://github.com/easybuilders/easybuild-easyconfigs/actions/runs/4948100242
Output from first failing test suite run:

FAIL: test_dep_versions_per_toolchain_generation (test.easyconfigs.easyconfigs.EasyConfigTest)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "test/easyconfigs/easyconfigs.py", line 884, in test_dep_versions_per_toolchain_generation
    self.assertFalse(multi_dep_vars, error_msg)
AssertionError: No multi-variant deps found for '^.*-(?P<tc_gen>20(1[89]|[2-9][0-9])[ab]).*\.eb$' easyconfigs:

found 2 variants of 'flatbuffers' dependency in easyconfigs using '2022a' toolchain generation
* version: 2.0.0; versionsuffix:  as dep for set(['TensorFlow-2.9.1-foss-2022a-CUDA-11.7.0.eb', 'TensorFlow-2.9.1-foss-2022a.eb'])
* version: 2.0.7; versionsuffix:  as dep for set(['AlphaFold-2.3.1-foss-2022a.eb', 'AlphaFold-2.3.4-foss-2022a-ColabFold.eb', 'ColabFold-1.5.2-foss-2022a-CUDA-11.7.0.eb', 'TensorFlow-2.11.0-foss-2022a-CUDA-11.7.0.eb', 'ColabFold-1.5.2-foss-2022a.eb', 'pod5-file-format-0.1.8-foss-2022a.eb', 'AlphaFold-2.3.4-foss-2022a-CUDA-11.7.0-ColabFold.eb', 'TensorFlow-2.11.0-foss-2022a.eb', 'M3GNet-0.2.4-foss-2022a.eb', 'AlphaFold-2.3.1-foss-2022a-CUDA-11.7.0.eb'])


----------------------------------------------------------------------
Ran 17002 tests in 748.353s

FAILED (failures=1)
ERROR: Not all tests were successful

bleep, bloop, I'm just a bot (boegelbot v20200716.01)
Please talk to my owner @boegel if you notice me acting stupid),
or submit a pull request to https://github.com/boegel/boegelbot fix the problem.
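
The failing check groups easyconfigs by toolchain generation and flags any dependency that appears in more than one version within a generation. A minimal sketch of that idea, reusing the regular expression from the error message; the function name and the input mapping are hypothetical, and this is not the actual test implementation:

```python
import re
from collections import defaultdict

# Regular expression copied from the error message above.
TC_GEN_REGEX = re.compile(r"^.*-(?P<tc_gen>20(1[89]|[2-9][0-9])[ab]).*\.eb$")


def find_multi_variant_deps(dep_versions_by_ec):
    """Map (toolchain generation, dependency) to {version: easyconfigs} where >1 version exists."""
    variants = defaultdict(lambda: defaultdict(set))
    for ec_name, deps in dep_versions_by_ec.items():
        match = TC_GEN_REGEX.match(ec_name)
        if match is None:
            continue  # file name carries no recognizable toolchain generation
        for dep_name, dep_version in deps.items():
            variants[(match.group("tc_gen"), dep_name)][dep_version].add(ec_name)
    return {key: dict(by_version) for key, by_version in variants.items() if len(by_version) > 1}


# Two of the easyconfigs named in the log, with the flatbuffers versions it reports.
deps = {
    "TensorFlow-2.9.1-foss-2022a.eb": {"flatbuffers": "2.0.0"},
    "TensorFlow-2.11.0-foss-2022a.eb": {"flatbuffers": "2.0.7"},
}
conflicts = find_multi_variant_deps(deps)
print(sorted(conflicts))
```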

@VRehnberg
Contributor

Test report by @VRehnberg
FAILED
Build succeeded for 1 out of 2 (2 easyconfigs in total)
alvis-c1 - Linux Rocky Linux 8.6, x86_64, Intel Xeon Processor (Skylake), Python 3.6.8
See https://gist.github.com/VRehnberg/274b5ff0ebe9db379fb9dc8ac8465cb8 for a full test report.

@Flamefire
Contributor Author

Flamefire commented May 12, 2023

@VRehnberg

fatal error: google/protobuf/port_def.inc: No such file or directory

That again? Are you using the latest easyblock, i.e. EasyBuild 4.7.1 or easybuilders/easybuild-easyblocks#2854?

@Flamefire
Contributor Author

It looks like some other ECs that use flatbuffers 2.0.7 were already merged, while TF 2.9 barely worked with 2.0.0. However, I was able to backport the changes, so it should work now.

@Flamefire
Contributor Author

Test report by @Flamefire
SUCCESS
Build succeeded for 2 out of 2 (2 easyconfigs in total)
taurusi7005 - Linux CentOS Linux 7.7.1908, x86_64, AMD EPYC 7702 64-Core Processor (zen2), Python 2.7.5
See https://gist.github.com/Flamefire/45148c23fc982e579e8d154a16f55d52 for a full test report.

@VRehnberg
Contributor

VRehnberg commented May 25, 2023

I'm still seeing the MKL failure on this one, though I don't think TensorFlow-2.11.0 had this problem. At least, I installed the CUDA version of 2.11.0 on our systems some time ago.

Log for latest failure:
easybuild-TensorFlow-2.9.1-20230522.170143.PAJqr.log.gz

@Flamefire
Contributor Author

@VRehnberg This PR hasn't been merged yet, and your log file suggests you didn't build from this PR, so the patch required to fix the MKL failure isn't included.

TF 2.11 already includes the fix upstream, but for 2.9.1 my backport patch is required.

@boegel Can this be merged?

@SebastianAchilles
Member

Test report by @SebastianAchilles
SUCCESS
Build succeeded for 2 out of 2 (2 easyconfigs in total)
skl-rockylinux-88 - Linux Rocky Linux 8.8, x86_64, Intel(R) Core(TM) i7-10700 CPU @ 2.90GHz (skylake), 1 x NVIDIA NVIDIA RTX A4000, 530.30.02, Python 3.6.8
See https://gist.github.com/SebastianAchilles/11e00f03f70b02af5a0707d2b2a499ee for a full test report.

Member

@SebastianAchilles SebastianAchilles left a comment

lgtm

@SebastianAchilles
Member

SebastianAchilles commented Jun 22, 2023

Going in, thanks @Flamefire!

Failed with HTTP Error

@SebastianAchilles
Member

Going in, thanks @Flamefire!

@SebastianAchilles SebastianAchilles merged commit 7d99e5f into easybuilders:develop Jun 22, 2023
5 checks passed
@Flamefire Flamefire deleted the 20230112164524_new_pr_TensorFlow291 branch June 22, 2023 10:32