Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Keep runfiles tree IDs in memory for tools and multiple test attempts #23703

Closed
wants to merge 2 commits into from

Conversation

fmeum
Copy link
Collaborator

@fmeum fmeum commented Sep 21, 2024

While writing the compact execution log, make use of the information in RunfilesTree on whether the tree is likely to be reused by multiple test spawns and always keep it in memory for non-test spawns.

@fmeum fmeum requested review from a team as code owners September 21, 2024 17:38
@fmeum fmeum requested review from katre and tjgq and removed request for a team and katre September 21, 2024 17:38
@github-actions github-actions bot added team-Performance Issues for Performance teams team-Configurability platforms, toolchains, cquery, select(), config transitions team-Remote-Exec Issues and PRs for the Execution (Remote) team awaiting-review PR is awaiting review from an assigned reviewer labels Sep 21, 2024
@tjgq tjgq added awaiting-PR-merge PR has been approved by a reviewer and is ready to be merge internally and removed awaiting-review PR is awaiting review from an assigned reviewer labels Sep 22, 2024
@fmeum
Copy link
Collaborator Author

fmeum commented Sep 23, 2024

@bazel-io fork 7.4.0

@fmeum
Copy link
Collaborator Author

fmeum commented Sep 23, 2024

@tjgq Taking another look at this, I'm not entirely sure I understand the logic around runfiles mapping caching: Isn't the main case of reuse that of a tool used in multiple actions? It looks like that usage is never cached in RunfilesTreeImpl, only multiple test attempts are. Without this commit, the execution log would reuse the runfiles tree entry, so I'm no longer sure this is an improvement as is.

@fmeum fmeum marked this pull request as draft September 23, 2024 10:29
@fmeum
Copy link
Collaborator Author

fmeum commented Sep 23, 2024

I have converted this back to draft. I will add some tests and then take a deeper look at what's really happening in the case of a shared tool.

@fmeum
Copy link
Collaborator Author

fmeum commented Sep 23, 2024

@justinhorvitz Could you perhaps share the bits of internal bug referenced by

As explained in b/322474776#comment23, using a weak reference still accomplishes the goal of unknown commit.

in 48c9255? Is there anything special about multiple test attempts that makes weak references work in this situation?

@justinhorvitz
Copy link
Contributor

The comment references some work I did to share a single runfiles mapping among concurrent executions of the same test, as in builds with high shard_count or --runs_per_test. The relevant commit is https://cs.opensource.google/bazel/bazel/+/1b2481d855f6a4ca5a938356024ddede36399b2d.

The point of the comment is to explain that using a weak reference is sufficient because it guarantees that at most 1 such runfiles mapping is retained (the active test execution will reference it, making it not GC-eligible). It goes on to say:

A soft reference would be helpful if we were trying to save cpu by reusing a mapping after a GC occurs while it's not strongly reachable. But that's not the goal.

The background is that with the soft reference, we observed elevated OOM rates. In general, we avoid soft references because they are not guaranteed to be collected prior to exceeding --gc_thrashing_limits.

@fmeum
Copy link
Collaborator Author

fmeum commented Sep 23, 2024

@justinhorvitz Thanks for the explanation. What do you think of extending the caching to any non-test targets? Since they are used as tools, they may end up being used concurrently and thus affect peak memory usage too.

@justinhorvitz
Copy link
Contributor

Do you have evidence of a specific problem that needs optimizing? I think that's a prerequisite for considering such an optimization.

Do non-test actions ever share the same RunfilesTreeImpl instance or would that require work?

@tjgq
Copy link
Contributor

tjgq commented Sep 23, 2024

@tjgq Taking another look at this, I'm not entirely sure I understand the logic around runfiles mapping caching: Isn't the main case of reuse that of a tool used in multiple actions? It looks like that usage is never cached in RunfilesTreeImpl, only multiple test attempts are. Without this commit, the execution log would reuse the runfiles tree entry, so I'm no longer sure this is an improvement as is.

I think RunfilesTreeImpl and CompactSpawnLogContext have different objectives: RunfilesTreeImpl is trying to oportunistically deduplicate objects for actions that are "in flight" at the same time, while CompactSpawnLogContext is trying to keep the log small by referencing earlier entries, even when the referencing actions have disjoint lifetimes. So even if we were to extend the logic in RunfilesTreeImpl to cover tools, I think we'd still want CompactSpawnLogContext to track every tool for the entire invocation (since we lack a better signal for "are there any remaining actions that consume this tool").

I agree that the current state of this PR is not what we want; sorry for reviewing it hastily.

@fmeum
Copy link
Collaborator Author

fmeum commented Sep 23, 2024

Do you have evidence of a specific problem that needs optimizing? I think that's a prerequisite for considering such an optimization.

I'm personally not in a good position to provide benchmarks, but certain actions in OSS Bazel can have very large runfiles mappings (e.g. js_binary tools). I could very well see them affecting peak memory in the same way as large tests.

@jbedard @alexeagle Do you happen to have numbers on the size of rules_js tools' (not tests') runfile trees?

@justinhorvitz Could you perhaps run a benchmark with return true in https://cs.opensource.google/bazel/bazel/+/7aa9e7726f4ebf934c9c317c7850e64d901a6141:src/main/java/com/google/devtools/build/lib/analysis/RunfilesSupport.java;l=243 to gather some data?

Do non-test actions ever share the same RunfilesTreeImpl instance or would that require work?

As far as I understand the situation, they would naturally as RunfilesTreeImpl lives in RunfilesSupport and is thus attached to the target providing a tool for an action.

While writing the compact execution log, make us of the information in `RunfilesTree` on whether the tree is likely to be reused by multiple spawns.
@fmeum
Copy link
Collaborator Author

fmeum commented Sep 24, 2024

@tjgq I updated the PR to get the desired behavior. Since tools are also logged as part of inputs, this requires checking the spawns mnemonic.

@fmeum fmeum changed the title Keep runfiles tree IDs in memory for multiple test attempts Keep runfiles tree IDs in memory for tools and multiple test attempts Sep 24, 2024
@fmeum fmeum marked this pull request as ready for review September 24, 2024 10:04
@fmeum fmeum requested a review from tjgq September 24, 2024 10:04
@github-actions github-actions bot added the awaiting-review PR is awaiting review from an assigned reviewer label Sep 24, 2024
@tjgq tjgq removed awaiting-PR-merge PR has been approved by a reviewer and is ready to be merge internally team-Configurability platforms, toolchains, cquery, select(), config transitions team-Remote-Exec Issues and PRs for the Execution (Remote) team labels Sep 24, 2024
@fmeum fmeum requested a review from tjgq September 24, 2024 14:33
@tjgq tjgq added awaiting-PR-merge PR has been approved by a reviewer and is ready to be merge internally and removed awaiting-review PR is awaiting review from an assigned reviewer labels Sep 24, 2024
@jbedard
Copy link

jbedard commented Sep 25, 2024

@jbedard @alexeagle Do you happen to have numbers on the size of rules_js tools' (not tests') runfile trees?

@fmeum no I don't think we have numbers for that. Do you have a specific command or anything to measure what you're looking for?

@fmeum
Copy link
Collaborator Author

fmeum commented Sep 25, 2024

@fmeum no I don't think we have numbers for that. Do you have a specific command or anything to measure what you're looking for?

A simple metric that would already be very helpful is the largest number of files in a .runfile tree you have encountered in a non-test action. I would expect JS/TS and Python tools to have a large footprint.

@justinhorvitz
Copy link
Contributor

@justinhorvitz Could you perhaps run a benchmark with return true in https://cs.opensource.google/bazel/bazel/+/7aa9e7726f4ebf934c9c317c7850e64d901a6141:src/main/java/com/google/devtools/build/lib/analysis/RunfilesSupport.java;l=243 to gather some data?

I added a counter for the number of times we call runfiles.getRunfilesInputs() from RunfilesTreeImpl and did see a significant reduction with this idea on a couple of our typical benchmark builds.

Build A

  • C++ heavy
  • 614k total actions
  • Baseline: 226,998 getRunfilesInputs() calls
  • With caching: 2,260 getRunfilesInputs() calls

Build B

  • Java & JavaScript heavy
  • 1.4M total actions
  • Baseline: 637,419 getRunfilesInputs() calls
  • With caching: 8,722 getRunfilesInputs() calls

Since it's a simple change, it seems worth pursuing. I will run formal benchmarks to see if the savings show up in e2e metrics.

@fmeum
Copy link
Collaborator Author

fmeum commented Sep 25, 2024

@justinhorvitz I guess it makes sense that the improvements are somewhat proportional to --jobs due to actions scheduled at similar times being likely to reference the same tools.

It might be worth exploring using SoftReference for tools (potentially even for CPU savings) and keeping WeakReference for multiple test attempts (for peak memory savings).

@justinhorvitz
Copy link
Contributor

@justinhorvitz I guess it makes sense that the improvements are somewhat proportional to --jobs due to actions scheduled at similar times being likely to reference the same tools.

Internally we use --jobs=600. With a low value of --jobs, there might not be as much peak heap reduction potential, but there could still be cpu savings from calling getRunfilesInputs() fewer times.

It might be worth exploring using SoftReference for tools (potentially even for CPU savings) and keeping WeakReference for multiple test attempts (for peak memory savings).

Using SoftReference would probably re-introduce the OOMs we saw when we tried that for sharded tests. Note that a WeakReference is not collected unless there is a GC, so even if becomes weakly reachable, it may still be reused.

@justinhorvitz
Copy link
Contributor

Benchmarks show a definite reduction of 1-4% in eden space garbage. There was a small cpu reduction but not statistically significant over 5 runs. Peak post-GC heap was not affected, probably because it was dominated by something else for these builds. I think we should proceed with the idea.

bazel-io pushed a commit to bazel-io/bazel that referenced this pull request Sep 30, 2024
Cherry-picks the following changes:

* Optimize representation of runfiles in compact execution log (bazelbuild#23321)
* Keep runfiles tree IDs in memory for multiple test attempts (bazelbuild#23703)
* Fix naming inconsistency in `spawn.proto` (bazelbuild#23706)
* Mark tool runfiles as such in expanded execution log (bazelbuild#23702)

The cherry-picks required introducing a `Map<Artifact, RunfilesTree>`
shim to `RunfilesSupplier` that matches the Bazel 8 way of obtaining a
`RunfilesTree` from a runfiles middleman via `InputMetadataProvider`.

Closes bazelbuild#23683 
Closes bazelbuild#23710
Closes bazelbuild#23711
Closes bazelbuild#23734
@fmeum
Copy link
Collaborator Author

fmeum commented Oct 1, 2024

@meisterT Could you take over the merge now that tjgq is OOO?

@tjgq
Copy link
Contributor

tjgq commented Oct 1, 2024

@meisterT this is cl/680944521 internally

Copy link
Member

@meisterT meisterT left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is getting merged right now.

@copybara-service copybara-service bot closed this in 81b9287 Oct 1, 2024
@github-actions github-actions bot removed the awaiting-PR-merge PR has been approved by a reviewer and is ready to be merge internally label Oct 1, 2024
@fmeum fmeum deleted the compact-execlog-caching branch October 2, 2024 18:46
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
team-Performance Issues for Performance teams
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants