fix build of TensorFlow 2.5+ on aarch64 #17101

Flamefire · 2023-01-13T15:25:40Z

The patch is actually wrong and may yield wrong results. Use the upstream patch from TF 2.10

I misunderstood the comment when creating the original patch. I noticed this while checking the patches for TF 2.11, where the issue has been fixed in the TensorFlow repo.

@boegel Can you test this on your ARM machine and include this for the next release please?

Flamefire · 2023-01-13T17:13:32Z

Whoops, this is blocking CI for the other TF PRs: the ECs updated here are not updated in those PRs, but all ECs get parsed and checksum-checked for some reason?

boegel · 2023-01-14T20:32:44Z

> @boegel Can you test this on your ARM machine and include this for the next release please?

I'll look into that, but it'll take a while (lots of missing dependencies there currently)...

boegel · 2023-01-14T20:33:18Z

> Whoops, this is blocking CI for the other TF PRs: the ECs updated here are not updated in those PRs, but all ECs get parsed and checksum-checked for some reason?

What do you mean by "blocking CI"? Which other PRs are you referring to?

Flamefire · 2023-01-15T09:42:52Z

> What do you mean by "blocking CI"? Which other PRs are you referring to?

I thought the backlinks would work better. Anyway: #17092, #17058, #16795

boegel · 2023-01-16T17:21:28Z

Test report by @boegel
FAILED
Build succeeded for 0 out of 7 (7 easyconfigs in total)
fair-mastodon-c6g-8xlarge-0001 - Linux Rocky Linux 8.7, AArch64, ARM UNKNOWN (graviton2), Python 3.6.8
See https://gist.github.com/37a6f91ae810f6275b854c8dc8af5891 for a full test report.

boegel · 2023-01-16T17:33:31Z

@Flamefire Not sure how useful that test report is, since it's all failures (but at least all dependencies are in place now...)

Flamefire · 2023-01-17T16:27:44Z

One hit an old bug in TensorFlow which is fixed in newer versions; for the others, the failure is not contained in the log. I'll see if I can backport the patch for that old failure.

boegel · 2023-01-18T10:49:38Z

> One hit an old bug in TensorFlow which is fixed in newer versions; for the others, the failure is not contained in the log. I'll see if I can backport the patch for that old failure.

Does this help (full log for TensorFlow-2.8.4-foss-2021b.eb)?
easybuild-TensorFlow-2.8.4-20230118.090356.WfoKS.log.gz

Flamefire · 2023-01-19T12:24:56Z

Some have build failures:

TensorFlow-2.5.3-foss-2021a-CUDA-11.3.1.eb
TensorFlow-2.7.1-foss-2021b-CUDA-11.4.1.eb
TensorFlow-2.8.4-foss-2021b.eb

The first is similar to a failure I see in 2.11 on PPC, but it is hard to tell what's wrong. The other two build errors are not in the log. And in the log you attached, the build seemed to succeed and then fail in the test validation -.-

The others are failing tests which look like real bugs. Maybe try building the ones that do build without this PR; if they also fail, I'd say this PR is good. As far as I can tell the change is correct:

https://developer.arm.com/architectures/instruction-sets/intrinsics/#q=vdotq_lane_s32

vdotq_lane_s32(int32x4_t r, int8x16_t a, int8x8_t b, const int lane)

--> the 3rd argument must be an s8 vector (int8x8_t)
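To make the signature concrete, here is a rough pure-Python model of what the intrinsic computes, as I understand the Arm documentation linked above. It is only a sketch: the name `vdotq_lane_s32_model` is made up for illustration, plain Python lists stand in for the NEON vector types, and int32 wraparound is ignored. The point is that `b` holds only 8 signed bytes (two groups of 4), so `lane` can only be 0 or 1.

```python
def vdotq_lane_s32_model(r, a, b, lane):
    """Illustrative model of vdotq_lane_s32 (hypothetical helper, not the real intrinsic).

    r: 4 accumulators (int32x4_t), a: 16 signed bytes (int8x16_t),
    b: 8 signed bytes (int8x8_t), lane: which 4-byte group of b to use.
    """
    if len(r) != 4 or len(a) != 16:
        raise ValueError("r must have 4 lanes and a must have 16 lanes")
    if len(b) != 8:
        raise ValueError("b must be an int8x8_t, i.e. exactly 8 lanes")
    if lane not in (0, 1):
        raise ValueError("lane must be 0 or 1 (b holds two groups of 4 bytes)")

    # Select the 4-byte group of b addressed by `lane`.
    group = b[4 * lane:4 * lane + 4]

    # Each accumulator gets the dot product of one 4-byte group of a
    # with the selected group of b (int32 overflow is not modelled here).
    return [r[i] + sum(a[4 * i + j] * group[j] for j in range(4))
            for i in range(4)]


if __name__ == "__main__":
    acc = vdotq_lane_s32_model([0, 0, 0, 0], list(range(16)),
                               [1, 1, 1, 1, 2, 2, 2, 2], lane=1)
    print(acc)  # [12, 44, 76, 108]
```

Passing a 16-element vector as `b` is rejected by this model, which mirrors the type mismatch that the corrected patch avoids.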

Edit: I found a patch for the AARCH64 build: tensorflow/tensorflow@4933ada
It's from TF 2.6 and applies here to 2.5.x

boegel · 2023-01-21T13:43:56Z

Test report by @boegel
FAILED
Build succeeded for 0 out of 7 (7 easyconfigs in total)
fair-mastodon-c6g-8xlarge-0001 - Linux Rocky Linux 8.7, AArch64, ARM UNKNOWN (graviton2), Python 3.6.8
See https://gist.github.com/68d29f73abc0469f4b6fb142f07a0eef for a full test report.

Kappa078

easybuild/easyconfigs/t/TensorFlow/TensorFlow-2.8.4-foss-2021b.eb

boegel · 2023-01-26T16:55:35Z

Test report by @boegel
SUCCESS
Build succeeded for 10 out of 10 (10 easyconfigs in total)
node3300.joltik.os - Linux RHEL 8.6, x86_64, Intel(R) Xeon(R) Gold 6242 CPU @ 2.80GHz (cascadelake), 1 x NVIDIA Tesla V100-SXM2-32GB, 525.60.13, Python 3.6.8
See https://gist.github.com/d80a1ff22634db3ed5e99176294bd6e6 for a full test report.

Flamefire · 2023-01-30T11:22:28Z

@boegel I'd like to get this in to fix the CI failures on other TF PRs using the new checksum and/or patch version.

It would be good to know if the current ECs/patch also fail on aarch64. If the failures look similar then the issue is not introduced here.

boegel

lgtm

boegel · 2023-02-10T09:30:29Z

Although problems remain on aarch64, the test report confirms that no trouble was introduced on x86_64, and since this is a step in the right direction it can be included.

Thanks a lot @Flamefire!

boegel · 2023-02-10T09:30:38Z

Going in, thanks @Flamefire!

fix patch for TensorFlow 2.5+ on ARM

db2c3a3

The patch is actually wrong and may yield wrong results. Use the upstream patch from TF 2.10

jfgrimm added this to the next release (4.7.1?) milestone Jan 13, 2023

jfgrimm added the bug fix label Jan 13, 2023

Flamefire added a commit to Flamefire/easybuild-easyconfigs that referenced this pull request Jan 13, 2023

Fix ARM patch

dacfb36

Use updated patch from easybuilders#17101

Flamefire added a commit to Flamefire/easybuild-easyconfigs that referenced this pull request Jan 13, 2023

Fix ARM patch

eadb4b6

Use updated patch from easybuilders#17101

Flamefire added a commit to Flamefire/easybuild-easyconfigs that referenced this pull request Jan 13, 2023

Fix ARM patch

d589592

Use updated patch from easybuilders#17101

Fix AARCH64 build

4b1f3a7

Flamefire force-pushed the fix-tf-arm-patch branch from 4a3b335 to 4b1f3a7 Compare January 19, 2023 16:01

boegel changed the title ~~fix patch for TensorFlow 2.5+ on ARM~~ fix build of TensorFlow 2.5+ on aarch64 Jan 21, 2023

boegel added the aarch64 Related to Arm 64-bit (aarch64) label Jan 21, 2023

Kappa078 reviewed Jan 26, 2023

Flamefire mentioned this pull request Feb 7, 2023

{lib}[foss/2022a] TensorFlow v2.11.0 w/ Python 3.10.4 (+ CUDA 11.7.0) #17241

Merged

2 tasks

boegel approved these changes Feb 10, 2023

boegel merged commit dd437dd into easybuilders:develop Feb 10, 2023

Flamefire added a commit to Flamefire/easybuild-easyconfigs that referenced this pull request Feb 10, 2023

Fix ARM patch

627cab0

Use updated patch from easybuilders#17101

Flamefire deleted the fix-tf-arm-patch branch February 10, 2023 11:32
