Keep runfiles tree IDs in memory for tools and multiple test attempts #23703

fmeum · 2024-09-21T17:38:29Z

While writing the compact execution log, make use of the information in RunfilesTree on whether the tree is likely to be reused by multiple test spawns and always keep it in memory for non-test spawns.

fmeum · 2024-09-23T06:28:00Z

@bazel-io fork 7.4.0

fmeum · 2024-09-23T09:37:54Z

@tjgq Taking another look at this, I'm not entirely sure I understand the logic around runfiles mapping caching: Isn't the main case of reuse that of a tool used in multiple actions? It looks like that usage is never cached in RunfilesTreeImpl, only multiple test attempts are. Without this commit, the execution log would reuse the runfiles tree entry, so I'm no longer sure this is an improvement as is.

fmeum · 2024-09-23T10:29:50Z

I have converted this back to draft. I will add some tests and then take a deeper look at what's really happening in the case of a shared tool.

fmeum · 2024-09-23T12:05:55Z

@justinhorvitz Could you perhaps share the bits of internal bug referenced by

As explained in b/322474776#comment23, using a weak reference still accomplishes the goal of unknown commit.

in 48c9255? Is there anything special about multiple test attempts that makes weak references work in this situation?

justinhorvitz · 2024-09-23T17:27:16Z

The comment references some work I did to share a single runfiles mapping among concurrent executions of the same test, as in builds with high shard_count or --runs_per_test. The relevant commit is https://cs.opensource.google/bazel/bazel/+/1b2481d855f6a4ca5a938356024ddede36399b2d.

The point of the comment is to explain that using a weak reference is sufficient because it guarantees that at most 1 such runfiles mapping is retained (the active test execution will reference it, making it not GC-eligible). It goes on to say:

A soft reference would be helpful if we were trying to save cpu by reusing a mapping after a GC occurs while it's not strongly reachable. But that's not the goal.

The background is that with the soft reference, we observed elevated OOM rates. In general, we avoid soft references because they are not guaranteed to be collected prior to exceeding --gc_thrashing_limits.

fmeum · 2024-09-23T17:30:43Z

@justinhorvitz Thanks for the explanation. What do you think of extending the caching to any non-test targets? Since they are used as tools, they may end up being used concurrently and thus affect peak memory usage too.

justinhorvitz · 2024-09-23T17:38:34Z

Do you have evidence of a specific problem that needs optimizing? I think that's a prerequisite for considering such an optimization.

Do non-test actions ever share the same RunfilesTreeImpl instance or would that require work?

tjgq · 2024-09-23T17:40:31Z

@tjgq Taking another look at this, I'm not entirely sure I understand the logic around runfiles mapping caching: Isn't the main case of reuse that of a tool used in multiple actions? It looks like that usage is never cached in RunfilesTreeImpl, only multiple test attempts are. Without this commit, the execution log would reuse the runfiles tree entry, so I'm no longer sure this is an improvement as is.

I think RunfilesTreeImpl and CompactSpawnLogContext have different objectives: RunfilesTreeImpl is trying to oportunistically deduplicate objects for actions that are "in flight" at the same time, while CompactSpawnLogContext is trying to keep the log small by referencing earlier entries, even when the referencing actions have disjoint lifetimes. So even if we were to extend the logic in RunfilesTreeImpl to cover tools, I think we'd still want CompactSpawnLogContext to track every tool for the entire invocation (since we lack a better signal for "are there any remaining actions that consume this tool").

I agree that the current state of this PR is not what we want; sorry for reviewing it hastily.

fmeum · 2024-09-23T21:11:13Z

Do you have evidence of a specific problem that needs optimizing? I think that's a prerequisite for considering such an optimization.

I'm personally not in a good position to provide benchmarks, but certain actions in OSS Bazel can have very large runfiles mappings (e.g. js_binary tools). I could very well see them affecting peak memory in the same way as large tests.

@jbedard @alexeagle Do you happen to have numbers on the size of rules_js tools' (not tests') runfile trees?

@justinhorvitz Could you perhaps run a benchmark with return true in https://cs.opensource.google/bazel/bazel/+/7aa9e7726f4ebf934c9c317c7850e64d901a6141:src/main/java/com/google/devtools/build/lib/analysis/RunfilesSupport.java;l=243 to gather some data?

Do non-test actions ever share the same RunfilesTreeImpl instance or would that require work?

As far as I understand the situation, they would naturally as RunfilesTreeImpl lives in RunfilesSupport and is thus attached to the target providing a tool for an action.

While writing the compact execution log, make us of the information in `RunfilesTree` on whether the tree is likely to be reused by multiple spawns.

fmeum · 2024-09-24T06:24:51Z

@tjgq I updated the PR to get the desired behavior. Since tools are also logged as part of inputs, this requires checking the spawns mnemonic.

src/main/java/com/google/devtools/build/lib/exec/CompactSpawnLogContext.java

src/test/java/com/google/devtools/build/lib/exec/CompactSpawnLogContextTest.java

jbedard · 2024-09-25T04:59:45Z

@jbedard @alexeagle Do you happen to have numbers on the size of rules_js tools' (not tests') runfile trees?

@fmeum no I don't think we have numbers for that. Do you have a specific command or anything to measure what you're looking for?

fmeum · 2024-09-25T06:33:57Z

@fmeum no I don't think we have numbers for that. Do you have a specific command or anything to measure what you're looking for?

A simple metric that would already be very helpful is the largest number of files in a .runfile tree you have encountered in a non-test action. I would expect JS/TS and Python tools to have a large footprint.

justinhorvitz · 2024-09-25T13:15:10Z

@justinhorvitz Could you perhaps run a benchmark with return true in https://cs.opensource.google/bazel/bazel/+/7aa9e7726f4ebf934c9c317c7850e64d901a6141:src/main/java/com/google/devtools/build/lib/analysis/RunfilesSupport.java;l=243 to gather some data?

I added a counter for the number of times we call runfiles.getRunfilesInputs() from RunfilesTreeImpl and did see a significant reduction with this idea on a couple of our typical benchmark builds.

Build A

C++ heavy
614k total actions
Baseline: 226,998 getRunfilesInputs() calls
With caching: 2,260 getRunfilesInputs() calls

Build B

Java & JavaScript heavy
1.4M total actions
Baseline: 637,419 getRunfilesInputs() calls
With caching: 8,722 getRunfilesInputs() calls

Since it's a simple change, it seems worth pursuing. I will run formal benchmarks to see if the savings show up in e2e metrics.

fmeum · 2024-09-25T14:47:31Z

@justinhorvitz I guess it makes sense that the improvements are somewhat proportional to --jobs due to actions scheduled at similar times being likely to reference the same tools.

It might be worth exploring using SoftReference for tools (potentially even for CPU savings) and keeping WeakReference for multiple test attempts (for peak memory savings).

justinhorvitz · 2024-09-25T14:53:57Z

@justinhorvitz I guess it makes sense that the improvements are somewhat proportional to --jobs due to actions scheduled at similar times being likely to reference the same tools.

Internally we use --jobs=600. With a low value of --jobs, there might not be as much peak heap reduction potential, but there could still be cpu savings from calling getRunfilesInputs() fewer times.

It might be worth exploring using SoftReference for tools (potentially even for CPU savings) and keeping WeakReference for multiple test attempts (for peak memory savings).

Using SoftReference would probably re-introduce the OOMs we saw when we tried that for sharded tests. Note that a WeakReference is not collected unless there is a GC, so even if becomes weakly reachable, it may still be reused.

justinhorvitz · 2024-09-26T12:49:03Z

Benchmarks show a definite reduction of 1-4% in eden space garbage. There was a small cpu reduction but not statistically significant over 5 runs. Peak post-GC heap was not affected, probably because it was dominated by something else for these builds. I think we should proceed with the idea.

Cherry-picks the following changes: * Optimize representation of runfiles in compact execution log (bazelbuild#23321) * Keep runfiles tree IDs in memory for multiple test attempts (bazelbuild#23703) * Fix naming inconsistency in `spawn.proto` (bazelbuild#23706) * Mark tool runfiles as such in expanded execution log (bazelbuild#23702) The cherry-picks required introducing a `Map<Artifact, RunfilesTree>` shim to `RunfilesSupplier` that matches the Bazel 8 way of obtaining a `RunfilesTree` from a runfiles middleman via `InputMetadataProvider`. Closes bazelbuild#23683 Closes bazelbuild#23710 Closes bazelbuild#23711 Closes bazelbuild#23734

fmeum · 2024-10-01T08:54:02Z

@meisterT Could you take over the merge now that tjgq is OOO?

tjgq · 2024-10-01T10:49:24Z

@meisterT this is cl/680944521 internally

meisterT

This is getting merged right now.

fmeum requested review from a team as code owners September 21, 2024 17:38

fmeum requested review from katre and tjgq and removed request for a team and katre September 21, 2024 17:38

github-actions bot added team-Performance Issues for Performance teams team-Configurability platforms, toolchains, cquery, select(), config transitions team-Remote-Exec Issues and PRs for the Execution (Remote) team awaiting-review PR is awaiting review from an assigned reviewer labels Sep 21, 2024

tjgq approved these changes Sep 22, 2024

View reviewed changes

tjgq added awaiting-PR-merge PR has been approved by a reviewer and is ready to be merge internally and removed awaiting-review PR is awaiting review from an assigned reviewer labels Sep 22, 2024

bazel-io mentioned this pull request Sep 23, 2024

[7.4.0] Keep runfiles tree IDs in memory for multiple test attempts #23711

Closed

fmeum marked this pull request as draft September 23, 2024 10:29

Keep runfiles tree IDs in memory for multiple test attempts

6c813a2

While writing the compact execution log, make us of the information in `RunfilesTree` on whether the tree is likely to be reused by multiple spawns.

fmeum force-pushed the compact-execlog-caching branch from fbdfdff to 6c813a2 Compare September 24, 2024 06:23

fmeum changed the title ~~Keep runfiles tree IDs in memory for multiple test attempts~~ Keep runfiles tree IDs in memory for tools and multiple test attempts Sep 24, 2024

fmeum marked this pull request as ready for review September 24, 2024 10:04

fmeum requested a review from tjgq September 24, 2024 10:04

github-actions bot added the awaiting-review PR is awaiting review from an assigned reviewer label Sep 24, 2024

tjgq requested changes Sep 24, 2024

View reviewed changes

tjgq removed awaiting-PR-merge PR has been approved by a reviewer and is ready to be merge internally team-Configurability platforms, toolchains, cquery, select(), config transitions team-Remote-Exec Issues and PRs for the Execution (Remote) team labels Sep 24, 2024

Address comments

1edd352

fmeum requested a review from tjgq September 24, 2024 14:33

tjgq approved these changes Sep 24, 2024

View reviewed changes

tjgq added awaiting-PR-merge PR has been approved by a reviewer and is ready to be merge internally and removed awaiting-review PR is awaiting review from an assigned reviewer labels Sep 24, 2024

fmeum mentioned this pull request Sep 27, 2024

[7.4.0] Compact execution log improvements #23713

Merged

meisterT approved these changes Oct 1, 2024

View reviewed changes

copybara-service bot closed this in 81b9287 Oct 1, 2024

github-actions bot removed the awaiting-PR-merge PR has been approved by a reviewer and is ready to be merge internally label Oct 1, 2024

fmeum deleted the compact-execlog-caching branch October 2, 2024 18:46

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Keep runfiles tree IDs in memory for tools and multiple test attempts #23703

Keep runfiles tree IDs in memory for tools and multiple test attempts #23703

fmeum commented Sep 21, 2024 •

edited

Loading

fmeum commented Sep 23, 2024

fmeum commented Sep 23, 2024

fmeum commented Sep 23, 2024

fmeum commented Sep 23, 2024 •

edited

Loading

justinhorvitz commented Sep 23, 2024

fmeum commented Sep 23, 2024

justinhorvitz commented Sep 23, 2024

tjgq commented Sep 23, 2024 •

edited

Loading

fmeum commented Sep 23, 2024

fmeum commented Sep 24, 2024

jbedard commented Sep 25, 2024

fmeum commented Sep 25, 2024 •

edited

Loading

justinhorvitz commented Sep 25, 2024

fmeum commented Sep 25, 2024 •

edited

Loading

justinhorvitz commented Sep 25, 2024

justinhorvitz commented Sep 26, 2024

fmeum commented Oct 1, 2024

tjgq commented Oct 1, 2024

meisterT left a comment

Keep runfiles tree IDs in memory for tools and multiple test attempts #23703

Keep runfiles tree IDs in memory for tools and multiple test attempts #23703

Conversation

fmeum commented Sep 21, 2024 • edited Loading

fmeum commented Sep 23, 2024

fmeum commented Sep 23, 2024

fmeum commented Sep 23, 2024

fmeum commented Sep 23, 2024 • edited Loading

justinhorvitz commented Sep 23, 2024

fmeum commented Sep 23, 2024

justinhorvitz commented Sep 23, 2024

tjgq commented Sep 23, 2024 • edited Loading

fmeum commented Sep 23, 2024

fmeum commented Sep 24, 2024

jbedard commented Sep 25, 2024

fmeum commented Sep 25, 2024 • edited Loading

justinhorvitz commented Sep 25, 2024

Build A

Build B

fmeum commented Sep 25, 2024 • edited Loading

justinhorvitz commented Sep 25, 2024

justinhorvitz commented Sep 26, 2024

fmeum commented Oct 1, 2024

tjgq commented Oct 1, 2024

meisterT left a comment

Choose a reason for hiding this comment

fmeum commented Sep 21, 2024 •

edited

Loading

fmeum commented Sep 23, 2024 •

edited

Loading

tjgq commented Sep 23, 2024 •

edited

Loading

fmeum commented Sep 25, 2024 •

edited

Loading

fmeum commented Sep 25, 2024 •

edited

Loading