Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Functionality Clarification for BWOB #20748

Open
nickden3 opened this issue Jan 4, 2024 · 10 comments
Open

Functionality Clarification for BWOB #20748

nickden3 opened this issue Jan 4, 2024 · 10 comments
Assignees
Labels
P2 We'll consider working on this in future. (Assignee optional) team-Remote-Exec Issues and PRs for the Execution (Remote) team type: bug

Comments

@nickden3
Copy link

nickden3 commented Jan 4, 2024

We have a large build (200k+ actions) and have noticed that clean BWOB builds with low cache hit rates (<70%) actually perform worse than no-cache builds, by a significant magnitude (2x). When we change the remote_download_outputs to all, then the no-cache build is the same performance wise as the cache-enabled one.

We are populating our remote cache, which is a simple nginx server backed by AWS S3, from a CI build on every commit to our default branch. These populate builds are incremental and are using remote_download_minimal as well. Should also note, it seems we see a higher cache hit % for read-only builds when the populate builds are fully clean.

Is this expected behavior and what are the recommendations around using remote_download_minimal vs remote_download_all? I've looked through the source code and forums and have not found any answers. Thanks.

Bazel Version: 7.0.0

@coeuvre
Copy link
Member

coeuvre commented Jan 5, 2024

This is not expected. It would be helpful if you can share the profiles for BwoB builds and no-cache builds.

@sgowroji sgowroji added type: support / not a bug (process) untriaged team-Remote-Exec Issues and PRs for the Execution (Remote) team labels Jan 5, 2024
@nickden3

This comment was marked as outdated.

@coeuvre
Copy link
Member

coeuvre commented Jan 8, 2024

Thanks for the profiles!

Looking at the bwob_cache profile, for many CppCompile actions, there is a gap between "check cache hit" and "CppCompile" spans:

Screenshot 2024-01-08 at 11 31 39

For the other two, there aren't any gaps. That's probably why BwoB is slower in this case.

Now the question is what exactly did Bazel do within the gap? I looked at the code and it seems like the culprit is runfilesTreeUpdater.updateRunfiles. cc @tjgq

@nickden3: Can you try the BwoB build with flag --nobuild_runfile_links to check whether the build is faster?

@nickden3
Copy link
Author

nickden3 commented Jan 8, 2024

Hi @coeuvre , seems --nobuild_runefile_links was already being used. Should note, we're seeing this issue primarily on Windows builds.

Are there any other flags we should look into? Thanks.

copybara-service bot pushed a commit that referenced this issue Jan 9, 2024
To make it easier to debug #20748.

Closes #20783.

PiperOrigin-RevId: 596824058
Change-Id: I1dd6b940c30bd6c4b99904d19667f48a389b3a41
bazel-io pushed a commit to bazel-io/bazel that referenced this issue Jan 9, 2024
To make it easier to debug bazelbuild#20748.

Closes bazelbuild#20783.

PiperOrigin-RevId: 596824058
Change-Id: I1dd6b940c30bd6c4b99904d19667f48a389b3a41
github-merge-queue bot pushed a commit that referenced this issue Jan 9, 2024
To make it easier to debug
#20748.

Closes #20783.

Commit
18c8839

PiperOrigin-RevId: 596824058
Change-Id: I1dd6b940c30bd6c4b99904d19667f48a389b3a41

Co-authored-by: Chi Wang <chiwang@google.com>
@oquenchil oquenchil added P2 We'll consider working on this in future. (Assignee optional) and removed untriaged labels Jan 9, 2024
@coeuvre
Copy link
Member

coeuvre commented Jan 10, 2024

Can you run a build with Bazel binary that includes commit 18c8839? I want to make sure it's indeed caused by runfilesTreeUpdater.updateRunfiles.

(The commit is included in the last_green, if using bazelisk, you can do USE_BAZEL_VERSION=last_green bazelisk ...)

@alinkonSF
Copy link

Hello, I'm picking this up for nickden3 while he's on vacation.

@coeuvre, I assume you want me to run with that commit and give you the trace?

@coeuvre
Copy link
Member

coeuvre commented Jan 11, 2024

@alinkonSF That's true. Thanks!

@alinkonSF
Copy link

Sorry for the long delay on the response; we had other work come up. We will get you the requested data at the start of February

@nickden3
Copy link
Author

Hi @coeuvre , I reran our build using the bazel binary built from commit 18c8839. I sent the build profiles to you via email. Please let me know if you need any other info.

@coeuvre
Copy link
Member

coeuvre commented Jan 31, 2024

Thanks for sharing the profiles. It seems that the issue we saw from #20748 (comment) is gone, probably fixed by #20557.

With the new profiles:

  1. Wall time for download all is 1738s.
  2. Wall time for download minimal is 1826s, +5%.
  3. Wall time for no cache is 2176s, +25%.

I also noticed that 1) and 2) have very low cache hit rate (probably < 10%), so the wall time variance between them might just because of local action executions.

#20557 will be shipped with Bazel 7.1.0. I suggest you try Bazel 7.1.0 once it is out (or a custom built Bazel with the fix) and report back if the issue still exists.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
P2 We'll consider working on this in future. (Assignee optional) team-Remote-Exec Issues and PRs for the Execution (Remote) team type: bug
Projects
None yet
Development

No branches or pull requests

5 participants