Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Include execution cache hit ratios #16043

Open
adam-singer opened this issue Aug 3, 2022 · 6 comments
Open

Include execution cache hit ratios #16043

adam-singer opened this issue Aug 3, 2022 · 6 comments
Assignees
Labels
P2 We'll consider working on this in future. (Assignee optional) team-Remote-Exec Issues and PRs for the Execution (Remote) team type: feature request

Comments

@adam-singer
Copy link
Contributor

adam-singer commented Aug 3, 2022

Description of the feature request:

It would be ideal if we could surface cache hit ratios as first class information. As it stands now I'm only aware of two ways to try and calculate cache hit ratios from a users point of view. One is to create the execution log and parse it for remoteCacheable and remoteCacheHit then calculate the standard hit ratio. The other is to use build event logs and pluck out the runner information, for example INFO: 1436 processes: 977 disk cache hit, 456 internal, 1 darwin-sandbox, 2 worker. As discussed in https://bazelbuild.slack.com/archives/C01E7TH8XK9/p1659330964512159?thread_ts=1659329231.807649&cid=C01E7TH8XK9 there are some heuristics we could use to assume some knowledge of cache hit ratio, but it is not concrete.

The request here is to surface cache hit ratio for different dimensions

  • Cache hit for overall execution and some indication that no execution was done or 100% cache hit because no execution was done, in the case where rebuild yields nothing new.
  • Cache hit for remote cached requests for different remote caches (remote/disk)
  • Ratio of where cache was found, remote/disk.

What underlying problem are you trying to solve with this feature?

We are trying to understand when we hit cache regressions from a client side point of view. Recently bazel was released with broken caching in 5.2.0 #15682 (comment), we (twitter) realized it when our remote build cache systems experienced a lower cache hit ratio on cache requests over the network. It would have been ideal for us to also log that kind of metrics from the clients point of view as the source of truth that the runner indeed did miss and what state it might have been in at that time. It would also allow us to put monitoring metrics around the health of the runners when executing on some branches, such as master.

Easy access to this information would greatly help us identify regressions sooner and give us more confidence in bazel and rules upgrades.

Which operating system are you running Bazel on?

osx,linux

What is the output of bazel info release?

No response

If bazel info release returns development version or (@non-git), tell us how you built Bazel.

No response

What's the output of git remote get-url origin; git rev-parse master; git rev-parse HEAD ?

No response

Have you found anything relevant by searching the web?

No response

Any other information, logs, or outputs that you want to share?

No response

@sgowroji sgowroji added type: feature request untriaged team-Remote-Exec Issues and PRs for the Execution (Remote) team labels Aug 4, 2022
@wilwell wilwell removed the untriaged label Aug 9, 2022
@meisterT
Copy link
Member

Consider trying --experimental_stats_summary which will show a line like this:

INFO: 104 processes: 96 remote cache hit, 1 internal, 7 remote.

If you want to have the info programmatically, you can use the BEP, e.g. by adding the flag --build_event_json_file=/tmp/bep.json and then doing this:

$ jq '.buildMetrics.actionSummary.runnerCount | select( . != null )' < /tmp/bep.json
[
  {
    "name": "total",
    "count": 104
  },
  {
    "name": "remote cache hit",
    "count": 96
  },
  {
    "name": "internal",
    "count": 1
  },
  {
    "name": "remote",
    "count": 7
  }
]

@chancila
Copy link
Contributor

I think one main issue we have is that it's impossible to get the ratio of cacheable actions / actual cache hits from the BEP...the only way as mentioned is by using the exec log.

it would be good to have the aggregate count of potentially cacheable actions in the BEP as well.

@meisterT
Copy link
Member

In what cases is the distinction remote/remote cache hit as seen above not good enough?

@adam-singer
Copy link
Contributor Author

In the use case of analyzing client performance. In our situations cache missing can take a heavy toll on infrastructure and user experience. Knowing how healthy or unhealthy our cache performance is from a clients point of view helps us understand if we have problems. We collect and monitor metrics from both bazel clients via BEP and remote cache servers. Also, if we had these metrics it would have been very clear that bazel 5.2.0 was released with broken client side caching, two executions would show what was expected to be cached and was was or was not found (in terms of counts) in cache from build events. I don't expect most folks look at execution logs between cold local cache builds. I also find it hard to reason about making calculations from runner counts because they don't always show up, and the naming might be a bit dynamic (such as the sandbox names), if the runners always initiated to 0, it would be better for using those counts from a machine parsed point of view, I do see from user experience point of view why printing a bunch of 0 counts for runners wouldn't be ideal.

@adam-singer
Copy link
Contributor Author

@chancila @meisterT we use BEP to extract runner counts, but it's not a very concrete set of information and you need to make assumptions on what the runner means in terms of its relation to caching. Execution logs provide concrete information about caching, the problem is on large targets the size of an execution log is huge, I've seen some of our executions logs in the size of GB (even when serialized into protobuf).

Our attempt at calculation of cache hit ratio is the following (its crude and I don't think its accurate)

total = "total" (default to -1 if does not exist in runner)
cache_hits = "disk cache hit" (default to 0 if does not exist in runner) + "remote cache hit" (default to 0 if does not exist in runner)
internal = "internal" (default to 0 if does not exist in runner)
total_cachable_actions = total - internal

cache_hit_ratio = 
if total_cachable_actions < 0:
  # error we don't have runner data in BEP
 None
else if total_cachable_actions == 0:
  # assume we don't have runner metrics for cached items because they are not returned, so must be empty and internal = 1 and total = 1
  100.00
else:
  (cache_hits / total_cachable_actions) * 100

return cache_hit_ratio

Consider trying --experimental_stats_summary which will show a line like this:

INFO: 104 processes: 96 remote cache hit, 1 internal, 7 remote.

If you want to have the info programmatically, you can use the BEP, e.g. by adding the flag --build_event_json_file=/tmp/bep.json and then doing this:

$ jq '.buildMetrics.actionSummary.runnerCount | select( . != null )' < /tmp/bep.json
[
  {
    "name": "total",
    "count": 104
  },
  {
    "name": "remote cache hit",
    "count": 96
  },
  {
    "name": "internal",
    "count": 1
  },
  {
    "name": "remote",
    "count": 7
  }
]

@chancila @meisterT we use BEP to extract runner counts, but it's not a very concrete set of information and you need to make assumptions on what the runner means in terms of its relation to caching. Execution logs provide concrete information about caching, the problem is on large targets the size of an execution log is huge, I've seen some of our executions logs in the size of GB (even when serialized into protobuf).

Our attempt at calculation of cache hit ratio is the following (its crude and I don't think its accurate)

total = "total" (default to -1 if does not exist in runner)
cache_hits = "disk cache hit" (default to 0 if does not exist in runner) + "remote cache hit" (default to 0 if does not exist in runner)
internal = "internal" (default to 0 if does not exist in runner)
total_cachable_actions = total - internal

cache_hit_ratio = 
if total_cachable_actions < 0:
  # error we don't have runner data in BEP
 None
else if total_cachable_actions == 0:
  # assume we don't have runner metrics for cached items because they are not returned, so must be empty and internal = 1 and total = 1
  100.00
else:
  (cache_hits / total_cachable_actions) * 100

return cache_hit_ratio

While this "works" its probably lossy at best if not completely wrong.

@chancila
Copy link
Contributor

In what cases is the distinction remote/remote cache hit as seen above not good enough?

isn't remote the remote exec executor? isn't that orthogonal?

the example could be the build has 1000 total actions, 500 of which are not remote cacheable, and 500 which are...technically in this build if i get 500 remote cache hits I have 100% cache hit...this number is currently impossible to derive without the execution log. Currently we use the total number of actions / cache hits which gives us a signal we can track for regression as long as the ratio of cacheable/not cacheable doesn't change drastically.

a significant portion of our build graph is not cacheable, this is on purpose because in some cases certain actions are faster to run locally

the size of GB

yea we have exec logs the size of 40gb+

but they compress down to a few 100 MBs with zstd and a couple of tweaks, mainly long range compression...it would be cool if bazel compressed the log as it writes it to avoid murdering IO...on macos our builds are significantly slower due to the IO hit of the exec log.

@coeuvre coeuvre added the P2 We'll consider working on this in future. (Assignee optional) label Nov 22, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
P2 We'll consider working on this in future. (Assignee optional) team-Remote-Exec Issues and PRs for the Execution (Remote) team type: feature request
Projects
None yet
Development

No branches or pull requests

6 participants