Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix build of TensorFlow 2.5+ on aarch64 #17101

Merged
merged 2 commits into from
Feb 10, 2023

Conversation

Flamefire
Copy link
Contributor

The patch is actually wrong and may yield wrong results. Use the upstream patch from TF 2.10

I misunderstood the comment when creating the original patch which I noticed when checking the patches for TF 2.11 where the issue was fixed in the TensorFlow repo.

@boegel Can you test this on your ARM machine and include this for the next release please?

The patch is actually wrong and may yield wrong results.
Use the upstream patch from TF 2.10
@jfgrimm jfgrimm added this to the next release (4.7.1?) milestone Jan 13, 2023
Flamefire added a commit to Flamefire/easybuild-easyconfigs that referenced this pull request Jan 13, 2023
Use updated patch from easybuilders#17101
Flamefire added a commit to Flamefire/easybuild-easyconfigs that referenced this pull request Jan 13, 2023
Use updated patch from easybuilders#17101
Flamefire added a commit to Flamefire/easybuild-easyconfigs that referenced this pull request Jan 13, 2023
Use updated patch from easybuilders#17101
@Flamefire
Copy link
Contributor Author

Woops, this is blocking CI for the other TF PRs as the ECs updated here are not updated in the other PRs but all ECs get parsed and checksum-checked for some reason?

@boegel
Copy link
Member

boegel commented Jan 14, 2023

@boegel Can you test this on your ARM machine and include this for the next release please?

I'll look into that, but it'll take a while (lots of missing dependencies there currently)...

@boegel
Copy link
Member

boegel commented Jan 14, 2023

Woops, this is blocking CI for the other TF PRs as the ECs updated here are not updated in the other PRs but all ECs get parsed and checksum-checked for some reason?

What do you mean by "blocking CI"? Which other PRs are you referring to?

@Flamefire
Copy link
Contributor Author

What do you mean by "blocking CI"? Which other PRs are you referring to?

I thought the backlinks work better. Anyway: #17092 #17058 #16795

@boegel
Copy link
Member

boegel commented Jan 16, 2023

Test report by @boegel
FAILED
Build succeeded for 0 out of 7 (7 easyconfigs in total)
fair-mastodon-c6g-8xlarge-0001 - Linux Rocky Linux 8.7, AArch64, ARM UNKNOWN (graviton2), Python 3.6.8
See https://gist.github.com/37a6f91ae810f6275b854c8dc8af5891 for a full test report.

@boegel
Copy link
Member

boegel commented Jan 16, 2023

@Flamefire Not sure how useful that test report is, since it's all failures (but at least all dependencies are in place now...)

@Flamefire
Copy link
Contributor Author

One hit an old bug in TensorFlow which is fixed in newer versions for the others the failure is not contained in the log. I can see if I can backport the patch for that old failure.

@boegel
Copy link
Member

boegel commented Jan 18, 2023

One hit an old bug in TensorFlow which is fixed in newer versions for the others the failure is not contained in the log. I can see if I can backport the patch for that old failure.

Does this help (full log for TensorFlow-2.8.4-foss-2021b.eb)?
easybuild-TensorFlow-2.8.4-20230118.090356.WfoKS.log.gz

@Flamefire
Copy link
Contributor Author

Flamefire commented Jan 19, 2023

Some have build failures:

  • TensorFlow-2.5.3-foss-2021a-CUDA-11.3.1.eb
  • TensorFlow-2.7.1-foss-2021b-CUDA-11.4.1.eb
  • TensorFlow-2.8.4-foss-2021b.eb

The first is similar to one I see in 2.11 on PPC but hard to tell what's wrong. The other 2 build errors are not in the log. And in the one you attached it seemed to succeed and then fail in the test validation -.-

The others are failing tests which look like real bugs. Maybe try to build the ones that build without this PR and if they also fail I'd say this PR is good. From all I can tell the change is correct:

https://developer.arm.com/architectures/instruction-sets/intrinsics/#q=vdotq_lane_s32

vdotq_lane_s32 | (int32x4_t r, int8x16_t a, int8x8_t b, const int lane)

--> 3rd argument must be an s8

Edit: I found a patch for the AARCH64 build: tensorflow/tensorflow@4933ada
It's from TF 2.6 and applies here to 2.5.x

@boegel boegel changed the title fix patch for TensorFlow 2.5+ on ARM fix build of TensorFlow 2.5+ on aarch64 Jan 21, 2023
@boegel boegel added the aarch64 Related to Arm 64-bit (aarch64) label Jan 21, 2023
@boegel
Copy link
Member

boegel commented Jan 21, 2023

Test report by @boegel
FAILED
Build succeeded for 0 out of 7 (7 easyconfigs in total)
fair-mastodon-c6g-8xlarge-0001 - Linux Rocky Linux 8.7, AArch64, ARM UNKNOWN (graviton2), Python 3.6.8
See https://gist.github.com/68d29f73abc0469f4b6fb142f07a0eef for a full test report.

Copy link

@Kappa078 Kappa078 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@boegel
Copy link
Member

boegel commented Jan 26, 2023

Test report by @boegel
SUCCESS
Build succeeded for 10 out of 10 (10 easyconfigs in total)
node3300.joltik.os - Linux RHEL 8.6, x86_64, Intel(R) Xeon(R) Gold 6242 CPU @ 2.80GHz (cascadelake), 1 x NVIDIA Tesla V100-SXM2-32GB, 525.60.13, Python 3.6.8
See https://gist.github.com/d80a1ff22634db3ed5e99176294bd6e6 for a full test report.

@Flamefire
Copy link
Contributor Author

@boegel I'd like to get this in to fix the CI failures on other TF PRs using the new checksum and/or patch version.

It would be good to know if the current ECs/patch also fail on aarch64. If the failures look similar then the issue is not introduced here.

Copy link
Member

@boegel boegel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm

@boegel
Copy link
Member

boegel commented Feb 10, 2023

Although problems remains on aarch64, the test report confirm that no trouble was introduced on x86_64, and since this is a step in the right direction it can be included.

Thanks a lot @Flamefire!

@boegel
Copy link
Member

boegel commented Feb 10, 2023

Going in, thanks @Flamefire!

@boegel boegel merged commit dd437dd into easybuilders:develop Feb 10, 2023
Flamefire added a commit to Flamefire/easybuild-easyconfigs that referenced this pull request Feb 10, 2023
Use updated patch from easybuilders#17101
@Flamefire Flamefire deleted the fix-tf-arm-patch branch February 10, 2023 11:32
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
aarch64 Related to Arm 64-bit (aarch64) bug fix
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants