{lib}[foss/2022a] TensorFlow v2.11.0 w/ Python 3.10.4 (+ CUDA 11.7.0) #17241

Merged


…orFlow-2.11.0-foss-2022a.eb and patches: TensorFlow-2.11.0_disable-avx512-extensions.patch, TensorFlow-2.11.0_fix-eigen-atan-on-PPC.patch, TensorFlow-2.11.0_fix-link-error.patch, TensorFlow-2.11.0_remove-libclang-and-io-gcs-deps.patch, TensorFlow-2.5.0_fix-arm-vector-intrinsics.patch
@Flamefire
Contributor Author

Test report by @Flamefire
SUCCESS
Build succeeded for 2 out of 2 (2 easyconfigs in total)
taurusi8009 - Linux CentOS Linux 7.9.2009, x86_64, AMD EPYC 7352 24-Core Processor (zen2), 8 x NVIDIA NVIDIA A100-SXM4-40GB, 470.57.02, Python 2.7.5
See https://gist.github.com/4649cbf110f9f4aca503ad2185f11064 for a full test report.

@Flamefire Flamefire force-pushed the 20230201141739_new_pr_TensorFlow2110 branch from ec8faa9 to ca2e0f3 Compare February 7, 2023 14:17
@surak
Contributor

surak commented Feb 7, 2023

I hope I am wrong, but downloading this MR and checking it against develop, I don't see the following patches anywhere:

            '%(name)s-2.8.4_exclude-xnnpack-on-ppc.patch',
            '%(name)s-2.8.4_resolve-gcc-symlinks.patch',
            '%(name)s-2.9.1_remove-duplicate-gpu-tests.patch',

@Flamefire
Contributor Author

Flamefire commented Feb 7, 2023

> I hope I am wrong, but downloading this MR and checking it against develop, I don't see the following patches anywhere:

You are right, they come from my other PRs and I forgot about them. The unmerged #17101 likely hides those missing files from the CI check and the bot's output.

I added the required PRs to the description.

See all my TF PRs here ;-)

@Flamefire
Contributor Author

Test report by @Flamefire
SUCCESS
Build succeeded for 4 out of 4 (2 easyconfigs in total)
taurusa12 - Linux CentOS Linux 7.7.1908, x86_64, Intel(R) Xeon(R) CPU E5-2603 v4 @ 1.70GHz (broadwell), 3 x NVIDIA GeForce GTX 1080 Ti, 460.32.03, Python 2.7.5
See https://gist.github.com/f06ae9888ad12b6fd73b8b874e495e08 for a full test report.

@Flamefire
Contributor Author

Test report by @Flamefire
SUCCESS
Build succeeded for 1 out of 1 (1 easyconfigs in total)
taurusml3 - Linux RHEL 7.6, POWER, 8335-GTX (power9le), 6 x NVIDIA Tesla V100-SXM2-32GB, 440.64.00, Python 2.7.5
See https://gist.github.com/a8fd86baf9a185e299d949837c4b8370 for a full test report.

@Flamefire
Contributor Author

@surak I added those patches to this PR. I hope that doesn't lead to conflicts when the other PRs are merged.

@Flamefire
Contributor Author

Test report by @Flamefire
SUCCESS
Build succeeded for 2 out of 2 (2 easyconfigs in total)
taurusml3 - Linux RHEL 7.6, POWER, 8335-GTX (power9le), 6 x NVIDIA Tesla V100-SXM2-32GB, 440.64.00, Python 2.7.5
See https://gist.github.com/258344e401ce05d7b0fa984c862fecc0 for a full test report.

@SebastianAchilles
Member

Test report by @SebastianAchilles
Using easyblocks from PR(s) easybuilders/easybuild-easyblocks#2854
SUCCESS
Build succeeded for 2 out of 2 (2 easyconfigs in total)
skl-rockylinux-87 - Linux Rocky Linux 8.7, x86_64, Intel(R) Core(TM) i7-10700 CPU @ 2.90GHz (skylake), 1 x NVIDIA NVIDIA RTX A4000, 525.85.12, Python 3.6.8
See https://gist.github.com/5f6f22a50b309756da09053949f5959c for a full test report.

@Flamefire Flamefire marked this pull request as draft February 10, 2023 07:53
@Flamefire
Contributor Author

After a discussion with the Eigen developers, I'm testing a variation of the patch for PPC, so I have converted this PR to draft in the meantime.

This only affects PPC, so any other architecture can test & use this already.

@Flamefire Flamefire marked this pull request as ready for review February 10, 2023 10:14
@Flamefire
Contributor Author

Test report by @Flamefire
SUCCESS
Build succeeded for 2 out of 2 (2 easyconfigs in total)
taurusml24 - Linux RHEL 7.6, POWER, 8335-GTX (power9le), 6 x NVIDIA Tesla V100-SXM2-32GB, 440.64.00, Python 2.7.5
See https://gist.github.com/4f0a8ce1acab5b728900d40b7aefef69 for a full test report.

@surak
Contributor

surak commented Feb 13, 2023

eb --from-pr=17241 

...
== 2023-02-13 13:08:43,750 robot.py:316 WARNING Missing dependencies (details): [{'full_mod_name': 'dill/0.3.6-foss-2022a', 'short_mod_name': 'dill/0.3.6-foss-2022a', 'name': 'dill', 'version': '0.3.6', 'versionsuffix': '', 'toolchain': {'name': 'foss', 'version': '2022a'}, 'toolchain_inherited': True, 'system': False, 'hidden': False, 'build_only': False, 'external_module': False, 'external_module_metadata': {}}, {'full_mod_name': 'flatbuffers/2.0.7-foss-2022a', 'short_mod_name': 'flatbuffers/2.0.7-foss-2022a', 'name': 'flatbuffers', 'version': '2.0.7', 'versionsuffix': '', 'toolchain': {'name': 'foss', 'version': '2022a'}, 'toolchain_inherited': True, 'system': False, 'hidden': False, 'build_only': False, 'external_module': False, 'external_module_metadata': {}}]
== 2023-02-13 13:08:43,750 robot.py:319 WARNING Missing dependencies (EasyBuild module names): dill/0.3.6-foss-2022a, flatbuffers/2.0.7-foss-2022a
== 2023-02-13 13:08:43,751 build_log.py:171 ERROR EasyBuild crashed with an error (at easybuild/2020/software/EasyBuild/4.7.0/lib/python3.8/site-packages/easybuild/base/exceptions.py:126 in __init__): Missing dependencies: dill/0.3.6-foss-2022a, flatbuffers/2.0.7-foss-2022a (no easyconfig file or existing module found) (at easybuild/2020/software/EasyBuild/4.7.0/lib/python3.8/site-packages/easybuild/tools/robot.py:326 in raise_error_missing_deps)
ERROR: Missing dependencies: dill/0.3.6-foss-2022a, flatbuffers/2.0.7-foss-2022a (no easyconfig file or existing module found)

@surak
Contributor

surak commented Feb 13, 2023

bazel-out/k8-opt-exec-50AE0418/bin/tensorflow/tsl/protobuf/error_codes.pb.h:10:10: fatal error: google/protobuf/port_def.inc: No such file or directory
   10 | #include <google/protobuf/port_def.inc>
      |          ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

I have this patch, TensorFlow-2.9.1_fix-protobuf-include-def.patch:

Fix an issue where google/protobuf/port_def.inc is not found.

diff -ruN tensorflow-2.9.1_old/third_party/systemlibs/protobuf.BUILD tensorflow-2.9.1/third_party/systemlibs/protobuf.BUILD
--- tensorflow-2.9.1_old/third_party/systemlibs/protobuf.BUILD	2022-11-10 16:57:13.649126750 +0100
+++ tensorflow-2.9.1/third_party/systemlibs/protobuf.BUILD	2022-11-10 17:00:42.548576599 +0100
@@ -43,4 +43,6 @@
         ],
     ),
     "wrappers": ("google/protobuf/wrappers.proto", []),
+    "port_def": ("google/protobuf/port_def.inc", []),
+    "coded_stream": ("google/protobuf/io/coded_stream.h", []),
 }

 RELATIVE_WELL_KNOWN_PROTOS = [proto[1][0] for proto in WELL_KNOWN_PROTO_MAP.items()]
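
As a rough illustration of what the two added map entries do, here is a minimal Python sketch (Starlark uses the same syntax for this expression). The map is a shortened, hypothetical stand-in for the real `WELL_KNOWN_PROTO_MAP`, and the helper name `relative_paths` is made up for this example. It only shows that the list derived from the map gains the two new paths. How the BUILD file uses that list afterwards is not shown here.

```python
# Shortened, hypothetical stand-in for WELL_KNOWN_PROTO_MAP in
# third_party/systemlibs/protobuf.BUILD; only the entry shape matters here.
WELL_KNOWN_PROTO_MAP = {
    "wrappers": ("google/protobuf/wrappers.proto", []),
}


def relative_paths(proto_map):
    # Same expression as in the BUILD file: each item is (name, (path, deps)),
    # so proto[1][0] is the path.
    return [proto[1][0] for proto in proto_map.items()]


before = relative_paths(WELL_KNOWN_PROTO_MAP)

# The two entries added by the patch above.
WELL_KNOWN_PROTO_MAP["port_def"] = ("google/protobuf/port_def.inc", [])
WELL_KNOWN_PROTO_MAP["coded_stream"] = ("google/protobuf/io/coded_stream.h", [])

after = relative_paths(WELL_KNOWN_PROTO_MAP)

# Paths that only appear once the patch is applied.
print(sorted(set(after) - set(before)))
```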

@surak
Contributor

surak commented Feb 13, 2023

If I apply the patch above, I end up here:

ERROR: /dev/shm/strube1/jusuf/TensorFlow/2.11.0/foss-2022a-CUDA-11.7/TensorFlow/tensorflow-2.11.0/tensorflow/cc/BUILD:824:11: Compiling tensorflow/cc/framework/cc_op_gen_main.cc failed: undeclared inclusion(s) in rule '//tensorflow/cc:cc_op_gen_main':
this rule is missing dependency declarations for the following files included by 'tensorflow/cc/framework/cc_op_gen_main.cc':
  'bazel-out/k8-opt/bin/external/com_google_protobuf/google/protobuf/io/coded_stream.h'
Target //tensorflow/tools/pip_package:build_pip_package failed to build

@Flamefire
Contributor Author

> bazel-out/k8-opt-exec-50AE0418/bin/tensorflow/tsl/protobuf/error_codes.pb.h:10:10: fatal error: google/protobuf/port_def.inc: No such file or directory
>    10 | #include <google/protobuf/port_def.inc>
>       |          ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Are you using the updated easyblock?

@surak
Contributor

surak commented Feb 15, 2023

> Are you using the updated easyblock?

I am using the one on the latest easybuild 4.7.0 - is there a newer one around?

@Flamefire
Contributor Author

> I am using the one on the latest easybuild 4.7.0 - is there a newer one around?

Yes, see the PR description: easybuilders/easybuild-easyblocks#2854
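
For readers who want to reproduce this, here is a small sketch that assembles an `eb` invocation pulling the easyconfigs from this PR and the easyblocks from the referenced easyblocks PR. It assumes the `--include-easyblocks-from-pr` option is available in the installed EasyBuild version (check `eb --help`). The helper name `eb_command` is hypothetical, and the sketch only builds and prints the command, it does not run it.

```python
import shlex


def eb_command(easyconfigs_pr, easyblocks_pr):
    # --from-pr fetches the easyconfigs of a PR;
    # --include-easyblocks-from-pr (assumed to be available in the installed
    # EasyBuild) fetches the easyblocks of an easybuild-easyblocks PR.
    return [
        "eb",
        f"--from-pr={easyconfigs_pr}",
        f"--include-easyblocks-from-pr={easyblocks_pr}",
    ]


# Print a copy-pasteable command line instead of executing it.
print(shlex.join(eb_command(17241, 2854)))
```

Any further options, such as dependency resolution, depend on the local setup and are left out here.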

@surak
Contributor

surak commented Feb 16, 2023

I see that flatbuffers is there, but it tells me this:

	/p/software/jusuf/stages/2023/software/Python/3.10.4-GCCcore-11.3.0/bin/python -m pip check
== 2023-02-16 19:44:27,913 run.py:236 INFO running cmd: /p/software/jusuf/stages/2023/software/Python/3.10.4-GCCcore-11.3.0/bin/python -m pip check 
tensorflow 2.11.0 requires flatbuffers, which is not installed.
  >> command completed: exit 1, ran in 00h00m02s

Can it be that we are missing the flatbuffers-python one?

@Flamefire
Contributor Author

> Can it be that we are missing the flatbuffers-python one?

I put that into the main flatbuffers, which is referenced in the description: #17114

So it no longer requires two dependencies.
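
A quick way to check whether a Python interpreter can see the flatbuffers Python distribution at all (as opposed to only the C++ library) is sketched below. It uses the standard-library `importlib.metadata` module. The helper name `installed_version` is made up for this example, and the check is only a rough approximation of what `pip check` looks at, not a replacement for it.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(dist_name):
    """Return the installed version of a distribution, or None if no metadata is found."""
    try:
        return version(dist_name)
    except PackageNotFoundError:
        return None


# None here would match the "requires flatbuffers, which is not installed" message.
print("flatbuffers:", installed_version("flatbuffers"))
```

Run it with the same `python` binary that the build uses, after loading the flatbuffers module.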

@surak
Contributor

surak commented Feb 18, 2023

It works for me, installed on Jülich's Juwels Booster, Jureca DC, Juwels Cluster, Jusuf and the HDFML machine.

@easybuilders easybuilders deleted a comment from boegelbot Mar 2, 2023
@easybuilders easybuilders deleted a comment from boegelbot Mar 2, 2023
@boegel boegel changed the title {lib}[foss/2022a] TensorFlow v2.11.0 w/ Python 3.10.4 {lib}[foss/2022a] TensorFlow v2.11.0 w/ Python 3.10.4 (+ CUDA 11.7.0) Mar 6, 2023
@VRehnberg
Contributor

Test report by @VRehnberg
Using easyblocks from PR(s) easybuilders/easybuild-easyblocks#2854
SUCCESS
Build succeeded for 1 out of 1 (1 easyconfigs in total)
alvis10-01 - Linux Rocky Linux 8.6, x86_64, Intel(R) Xeon(R) Gold 6338 CPU @ 2.00GHz, 4 x NVIDIA NVIDIA A40, 520.61.05, Python 3.6.8
See https://gist.github.com/5801037401d4e7c6341c26023dae0fc7 for a full test report.

@smoors
Contributor

smoors commented Mar 15, 2023

Test report by @smoors
Using easyblocks from PR(s) easybuilders/easybuild-easyblocks#2854
SUCCESS
Build succeeded for 1 out of 1 (1 easyconfigs in total)
node407.hydra.os - Linux CentOS Linux 7.9.2009, x86_64, AMD EPYC 7282 16-Core Processor, 1 x NVIDIA NVIDIA A100-PCIE-40GB, 515.48.07, Python 3.6.8
See https://gist.github.com/571d6d8dbf550d7d9261c99b8b4b1cd3 for a full test report.

branfosj
branfosj previously approved these changes Mar 17, 2023
@casparvl
Contributor

casparvl commented Mar 17, 2023

I'm running into

ERROR: /gpfs/nvme1/1/casparl/ebbuildpath/TensorFlow/2.11.0/foss-2022a-CUDA-11.7.0/TensorFlow/bazel-root/6b2d07a5bcce9224adb889dafb5acf77/external/llvm-project/llvm/BUILD.bazel:203:11: Compiling llvm/lib/Support/Watchdog.cpp failed: undeclared inclusion(s) in rule '@llvm-project//llvm:Support':
this rule is missing dependency declarations for the following files included by 'llvm/lib/Support/Watchdog.cpp':
  '/sw/arch/RHEL8/EB_production/2022/software/GCCcore/11.3.0/lib/gcc/x86_64-pc-linux-gnu/11.3.0/include/stddef.h'
Target //tensorflow/tools/pip_package:build_pip_package failed to build

which looks similar to bazelbuild/bazel#15359. Has anyone seen that before? Did I forget to rebuild something, include some patch, ...? (NB: I am running with --rpath, so maybe it's something with the RPATH wrappers?)

Update: I'm running with --disable-rpath now. It hasn't completed, but has been building for way longer than before. Does anyone know why the RPATH wrappers are a problem here, and how we could potentially fix that? (@Flamefire maybe?)

@branfosj
Member

Test report by @branfosj
SUCCESS
Build succeeded for 1 out of 1 (1 easyconfigs in total)
bear-pg0105u03a.bear.cluster - Linux RHEL 8.6, x86_64, Intel(R) Xeon(R) Platinum 8360Y CPU @ 2.40GHz (icelake), Python 3.6.8
See https://gist.github.com/45f97e8b0f9fe03e3c643a8087633d7b for a full test report.

@branfosj
Member

Test report by @branfosj
SUCCESS
Build succeeded for 1 out of 1 (1 easyconfigs in total)
bear-pg0103u11a.bear.cluster - Linux RHEL 8.6, x86_64, Intel(R) Xeon(R) Gold 6330 CPU @ 2.00GHz (icelake), 1 x NVIDIA NVIDIA A100-PCIE-40GB, 520.61.05, Python 3.6.8
See https://gist.github.com/3a48cec8074ec30cfcdbd12fa5c6b8e5 for a full test report.

@casparvl
Contributor

Test report by @casparvl
Using easyblocks from PR(s) easybuilders/easybuild-easyblocks#2854
SUCCESS
Build succeeded for 1 out of 1 (2 easyconfigs in total)
gcn28.local.snellius.surf.nl - Linux Rocky Linux 8.7, x86_64, Intel(R) Xeon(R) Platinum 8360Y CPU @ 2.40GHz, 1 x NVIDIA NVIDIA A100-SXM4-40GB, 515.86.01, Python 3.6.8
See https://gist.github.com/8d9a9d9f090e390e35ac5905407a143f for a full test report.

Contributor

@casparvl casparvl left a comment

Apart from the LLVM build dependency, which I think you still intended to remove, all looks good to me now. We also have plenty of successful tests now.

The only thing not working is --rpath support, but I don't want to block this PR over that.

@casparvl
Contributor

@boegelbot please test @ jsc-zen2

@boegelbot
Collaborator

@casparvl: Request for testing this PR well received on jsczen2l1.int.jsc-zen2.easybuild-test.cluster

PR test command 'EB_PR=17241 EB_ARGS= /opt/software/slurm/bin/sbatch --mem-per-cpu=4000M --job-name test_PR_17241 --ntasks=8 ~/boegelbot/eb_from_pr_upload_jsc-zen2.sh' executed!

  • exit code: 0
  • output:
Submitted batch job 2380

Test results coming soon (I hope)...

- notification for comment with ID 1474245051 processed

Message to humans: this is just bookkeeping information for me,
it is of no use to you (unless you think I have a bug, which I don't).

Contributor

@casparvl casparvl left a comment

Looks good to me. I'll give it one final test run just to make sure that the removal of LLVM as build dep doesn't affect anything. Once that completes, I'd consider this ready to be merged.

@casparvl
Contributor

@boegelbot please test @ generoso

@boegelbot
Collaborator

@casparvl: Request for testing this PR well received on login1

PR test command 'EB_PR=17241 EB_ARGS= EB_CONTAINER= /opt/software/slurm/bin/sbatch --job-name test_PR_17241 --ntasks=4 ~/boegelbot/eb_from_pr_upload_generoso.sh' executed!

  • exit code: 0
  • output:
Submitted batch job 10500

Test results coming soon (I hope)...

- notification for comment with ID 1474552675 processed

Message to humans: this is just bookkeeping information for me,
it is of no use to you (unless you think I have a bug, which I don't).

@boegelbot
Collaborator

Test report by @boegelbot
SUCCESS
Build succeeded for 2 out of 2 (2 easyconfigs in total)
jsczen2c1.int.jsc-zen2.easybuild-test.cluster - Linux Rocky Linux 8.5, x86_64, AMD EPYC 7742 64-Core Processor (zen2), Python 3.6.8
See https://gist.github.com/4695ea333c700d1eb8c8be5de97a54d2 for a full test report.

@casparvl
Contributor

Test report by @casparvl
Using easyblocks from PR(s) easybuilders/easybuild-easyblocks#2854
SUCCESS
Build succeeded for 2 out of 2 (2 easyconfigs in total)
gcn36.local.snellius.surf.nl - Linux Rocky Linux 8.7, x86_64, Intel(R) Xeon(R) Platinum 8360Y CPU @ 2.40GHz, 1 x NVIDIA NVIDIA A100-SXM4-40GB, 515.86.01, Python 3.6.8
See https://gist.github.com/58a87ab8b0764b4a0e5436fcb124dc8e for a full test report.

@casparvl
Contributor

Going in, thanks @Flamefire!

@casparvl casparvl merged commit 2b2579b into easybuilders:develop Mar 18, 2023
@casparvl casparvl modified the milestones: 4.x, release after 4.7.1 Mar 18, 2023
@Flamefire Flamefire deleted the 20230201141739_new_pr_TensorFlow2110 branch March 18, 2023 17:04
@boegel boegel modified the milestones: release after 4.7.1, 4.7.1 Mar 18, 2023
@boegelbot
Collaborator

Test report by @boegelbot
SUCCESS
Build succeeded for 2 out of 2 (2 easyconfigs in total)
cns1 - Linux Rocky Linux 8.5, x86_64, Intel(R) Xeon(R) CPU E5-2667 v3 @ 3.20GHz (haswell), Python 3.6.8
See https://gist.github.com/cf4c8d952d7f969ecd382acc73afb5b9 for a full test report.

9 participants