Bazel remote-caching does not properly treat volatile-status as if it never changes #10075
The behavior you are seeing is the expected behavior. Bazel's local cache will not rerun an action if only volatile-status.txt changed, but will rerun it otherwise. The remote cache does not know about volatile-status.txt and it shouldn't. volatile-status.txt is mostly used for timestamping (= this binary was built at XX.XX.XXXX XX:XX). What exactly is the use case you are trying to make work? |
For our CI builds that use Bazel, we have an issue where docker containers spun up for tests sometimes linger after the CI job is finished. We want to tag these containers with the CI build information (using volatile-status.txt) to determine which jobs are causing containers to linger and hopefully discover the root cause, or to allow each job to clean up its own containers without interfering with other jobs. We don't want the tests using these containers to rerun if only the build information has changed, and we want to cache across multiple workers that could potentially run a job, hence the use of remote caching. |
That sounds like your machines aren't recycled after each run? Would it be an option to keep the Bazel server running between builds? This would give you the volatile-status.txt behavior you are looking for. |
To clarify: I haven't thought deeply about this problem nor do I have any experience with such a use case. So I am neither for nor against having this feature. However, implementing this in Bazel would require protocol support and so would need discussions with the API community. @ulfjack @lberki @ericfelly might have more insights on how we handle this internally. |
I vote for "Will not fix; too much effort" here. What you are saying is certainly true, but this, coupled with the fact that fixing it would require changes to the cache protocol, makes the cost outweigh the benefits IMHO. One thing you could do is to read the identifier (non-hermetically) from a local file as opposed to through volatile-status.txt. |
We're very reluctant to change the behavior of volatile-status wrt. remote caching or execution, because doing so can potentially cause incorrect cross-user caching. Better to have remote caching / execution be strict and lose a bit of performance there. Now, if you need to tag actions in order to figure out where they're coming from, I think your best bet is to use a side channel, instead of trying to do it with action inputs / env variables. That is, whatever spins up those docker containers should add metadata to the container in a way that the action can't inspect, so it doesn't affect caching, but that you can inspect if you find a hanging container after the fact. I don't have much experience with containers, but I'm sure there's a way to do that. If you're using Bazel to spin up the container, we could add some code to Bazel to do that by default. |
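One concrete way to do this with Docker is container labels. A minimal sketch of the idea, assuming the containers are started by a script you control (the CI_JOB_ID variable and the image name are placeholders, not something from this thread):

# Attach the CI job ID as a label when the container is started; the label is not
# visible to the Bazel action, so it cannot affect the action key or caching.
docker run -d --label "ci_job_id=${CI_JOB_ID}" my-test-image

# After the fact, find containers left behind by a particular job:
docker ps -a --filter "label=ci_job_id=12345"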
Machines are recycled, but there's no guarantee about which of several worker machines a CI job will run on, so we want the cache to be shared between all of them.
This is how the official Bazel documentation recommends debugging remote-cache hits. If this would require large changes to the structure of remote caching though, I think reading from a local file or other side channel as @ulfjack and @lberki suggested would work for our use case. I'll close this, but would appreciate any suggestions on how best to use a side channel in this way; current ideas are reading from an expected absolute path on CI workers, or maybe using the |
@Panfilwk |
+1 for what @ulfjack said; suboptimal performance is not great, but cache poisoning is even worse. As for the side channel to use, as long as you can tag a container with some sort of unique ID you can use to find the container again, the easiest thing that comes to mind is exactly what you suggested: reading the ID from a file at some fixed path, then printing it on the stdout of the test. |
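A minimal sketch of that suggestion, assuming the CI worker writes its job ID to a fixed path such as /etc/ci_job_id before invoking Bazel (the path and variable names are placeholders):

# Inside the test (or a wrapper script around it): read the ID from a well-known
# location that is not a declared action input, so it cannot change the cache key.
CI_JOB_ID="$(cat /etc/ci_job_id 2>/dev/null || echo unknown)"
# Print it so it lands in the test's stdout / test.log and can be correlated later.
echo "ci_job_id=${CI_JOB_ID}"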
Not sure I understand something. Wouldn't the stamped output files be stored in the remote cache, and couldn't bazel just ignore volatile-status.txt when computing the action key for the remote cache lookup?
Or at the very least update the docs to state that bazel only pretends that the volatile file doesn't change in the context of a local incremental build. |
Yeah, I agree. It seems like unless the volatile status can actually be omitted from the action's inputs when computing the action digest for the remote cache, volatile keys are very misleading; they appear to work during local development of stamping rules, but then fail in one of the primary ways people expect to use them. I also ended up side-loading a status file for my stamping needs, but it seems like extra work that is prone to error and really should be handled by bazel. |
@Panfilwk Can we re-open this please? Once you add remote-execution, there is literally no way to "sneak" the version information to the remote builders in a way that avoids the action hash. This was the one and only way to have a remotely built target that embeds version info that doesn't cause it to rebuild every time. |
I submitted some PRs / issues to various rules (k8s, docker, etc.) that are affected by this to disable volatile status stamping, but it feels wrong to me. This really should be addressed here. I understand the concerns about correctness higher in the thread, but Bazel claims to be both Fast and Correct. This issue seriously impairs "Fast." |
@ulfjack @lberki Can you please reconsider fixing this. The listed work-around for this issue is to use |
I don't work at Google anymore. That said, there is no technical way to fix this with a dumb remote cache, or with the current remote execution protocol. Avoid stamping unless you absolutely positively must have stamped binaries - ideally, never enable stamping for developer builds, and require that everything that is deployed to production (or shipped to an app store or whatever) is built by an automated system that explicitly enables stamping. |
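In practice that policy can be as simple as never putting --stamp in your default flags and only passing it in the release pipeline; for example (a sketch, where the target and script names are placeholders):

# Developer builds: no stamping, so actions stay cacheable across commits.
bazel build //...

# Release automation only: enable stamping explicitly, with a real status script.
bazel build --stamp --workspace_status_command=tools/workspace_status.sh //release:artifact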
Are you suggesting that stamping should occur in an artifact pipeline after Bazel? How does Google perform its stamping? |
(Oh hello, @ulfjack!) At Google we: |
bazel remote cache does not respect the intent of volatile status file, so let's ensure that the status variables are constant in our tests. bazelbuild/bazel#10075
For all who are still struggling with this issue, consider just making your volatile status files constant through your workspace status command. You can use an environment variable to control this behavior, which can be set or unset depending on whether you are testing or deploying. If the status files themselves are constant, you don't have to worry about how they are handled by bazel and any language rules. You can override bazel's default values for status vars as in this example here. |
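A minimal sketch of such a workspace status command, assuming a shell script and an environment variable named STAMP_REAL_VALUES (both names are illustrative, not from the comment above):

#!/bin/bash
# workspace_status.sh: keys printed without the STABLE_ prefix end up in
# volatile-status.txt; keys with the STABLE_ prefix go to stable-status.txt.
if [ -n "${STAMP_REAL_VALUES:-}" ]; then
  # Deploy/release builds: emit real values.
  echo "BUILD_GIT_COMMIT $(git rev-parse HEAD)"
else
  # Test/CI builds: emit a constant, so volatile-status.txt always has the same
  # digest and remote cache hits are preserved.
  echo "BUILD_GIT_COMMIT 0000000000000000000000000000000000000000"
fi

It would then be wired in with something like bazel build --workspace_status_command=./workspace_status.sh ..., exporting STAMP_REAL_VALUES=1 only where real stamp values are needed.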
Hi, we use bazel and use the volatile-status to generate the Kubernetes YAML, adding the commit info to the Kubernetes object metadata. If the generated container image doesn't change on a newer commit, we don't expect the YAML info to change. It's important for us that volatile-status works as documented when using the remote cache. Even if we stamp very late, a change in the hash of volatile-status.txt that causes the target not to be read from the remote cache will trigger an unnecessary rollout. How do we work around this? |
Why do people want different caching behavior locally vs. remotely? What is the use case this is trying to solve? Seems like they should work similarly, no? |
When used with a remote build cache, these stamps (despite being marked as volatile), induce an unnecessary golink step when switching back and forth between branches. See bazelbuild/bazel#10075. Opting out here shaves a few precious seconds if the previously built binary is already available. Release note: None
I think this issue definitely needs to be reopened, unless there's a dupe already. All the workarounds to pass information to build actions that would not affect the action key are very annoying, and error-prone. |
79360: bazel: disable sandboxing and avoid stamping r=irfansharif a=irfansharif - Sandboxing on MacOS have measurable overhead. See bazelbuild/bazel#8230; it comes from symlinks being much slower to create. This commit disables sandboxing on local builds (still present in CI) to hopefully speed up builds. - If this proves to be buggy for whatever reason, we could opt out of sandboxing for go package compiles specifically using --strategy=GoCompilePkg=local. - In the future if we want local sandboxing, we could explore sandboxfs: https://bazel.build/docs/sandboxing - When used with a remote build cache, these stamps (despite being marked as volatile), induce an unnecessary golink step when switching back and forth between branches. See bazelbuild/bazel#10075. Opting out here shaves a few precious seconds if the previously built binary is already available. Release note: None Jira issue: CRDB-14748 Co-authored-by: irfan sharif <irfanmahmoudsharif@gmail.com>
We are also facing this issue, which impacts the efficiency of our continuous delivery and unnecessarily loads and fills our remote cache. I have just spent a little time finding the origin of the problem, and if it can help, here is a minimal example to reproduce the behaviour.
Example
A simple rule which copies the file (a sketch of such a rule is shown after the first build below).
Scenarios
First build
We start from an empty environment. The Bazel local cache and the remote cache are fully cleaned.
bazel build \
--stamp \
--remote_cache=grpc://localhost:9092 \
--experimental_remote_grpc_log=/tmp/grpc.log \
//:dummy
As expected, the file, not present in the local cache and in the remote cache, is built and the remote cache is filled.
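The rule itself did not survive in the comment; a minimal sketch of what a //:dummy target of that shape could look like (assuming a genrule that simply copies volatile-status.txt, which may differ from the actual example):

# BUILD (sketch)
genrule(
    name = "dummy",
    outs = ["dummy.txt"],
    # With stamping enabled, Bazel makes the status files available to genrules
    # at bazel-out/volatile-status.txt and bazel-out/stable-status.txt.
    cmd = "cp bazel-out/volatile-status.txt $@",
    stamp = 1,
)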
Second build without clean cache
bazel build \
--stamp \
--remote_cache=grpc://localhost:9092 \
--experimental_remote_grpc_log=/tmp/grpc.log \
//:dummy
Not surprisingly, the file, already in the local cache, is not rebuilt even though the volatile stamp file has changed.
Third build with a clean of the local cache
This time the file is rebuilt instead of being fetched from the remote cache: the remote action key is computed from a different hash of the volatile status file. By the definition of the volatile file (from the user manual, 5.1.1), changes to it should not cause actions to be re-executed, and this behaviour should be the same even if we use a remote cache.
Workaround
To avoid this issue, we are going to remove the use of the
If we can reopen this issue that would be really great and appreciated ❤️ ! |
Fix non-hermetic zipping of example zips by fixing the zip entry timestamps. I also hardcoded stamp values in stable-status.txt and volatile-status.txt using the workspace status command for the aio_local_deps config to improve cache performance. The Bazel remote cache appears to not ignore volatile-status.txt when there are no other changes, unlike the local Bazel cache: bazelbuild/bazel#10075 (comment)
Merge branch into 'master': [IDX-2760] Pass volatile status over a side channel. Passing it as dependency invalidates remote cache: bazelbuild/bazel#10075 See merge request dfinity-lab/public/ic!11580
~~This PR improves our remote caching by changing the way we use [`--workspace_status_command`](https://bazel.build/reference/command-line-reference#flag--workspace_status_command), which enables us to fetch 100% of actions from the remote cache when applicable.~~ ~~Previously I noticed CI jobs would always appear to rebuild the leaves of our repo (e.g. `environmentd` and `clusterd`), which unfortunately take the longest because of linking. This was because our workspace status command required generating some Rust code, which was breaking our remote caching. Now with this change I've observed the remote caching working exactly as a local cache, e.g. a change in the `sql` crate only requires re-building `environmentd`, whereas it used to also cause a recompile of `clusterd` and some other crates.~~ Edit: Unfortunately there is no free lunch and my testing was flawed. The original approach from this PR of tweaking the way we use `workspace_status_command` did not work; we were actually just providing a constant value for the git hash instead of the updated value. After investigating further, it turns out that Bazel treats local cache artifacts differently than remote cache artifacts when it comes to the workspace status, specifically the `volatile-status.txt` file we rely on here; see bazelbuild/bazel#10075. The only way to prevent `build_info!()` from causing unnecessary rebuilds _and_ update the git hash for anything that does get rebuilt is to side-channel the necessary information like we're doing here. The approach is to write the current git hash to a known location, in this case `/tmp/mz_git_hash.txt`, and then read it via Rust. `bin/bazel` has been updated to write this file, and I'll make the same change to `mzbuild`. When running with `--config=release-stamp` we fall back to the original behavior with `workspace_status_command` and formally "stamp" the builds, which causes everything to get rebuilt. This is definitely a hack, but from what I can tell there is no other way to get the same behavior. An alternative is that we always just return a constant git hash when not running with `--config=release-stamp`; not sure how important stamping development builds is.
Description of the problem / feature request:
When a target depends on the file volatile-status.txt as an input, that target will not be rebuilt if volatile-status.txt changes and the target is cached locally. However, the remote cache does not behave in the same way, and a change in the hash of volatile-status.txt will cause the target not to read from the remote cache.
Bugs: what's the simplest, easiest way to reproduce this bug? Please provide a minimal example if possible.
Clone the repo test_volatile_status and set up a bazel remote cache. Then run the following commands at the root of the repo:
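The command list itself was lost here; based on the example earlier in the thread and on the execution logs mentioned below (taken from the first and third builds, with a local cache clean in between), the sequence was presumably along these lines (a sketch, not the verbatim commands from the repo; the cache endpoint and target are placeholders):

# 1. Build with stamping against the remote cache; both caches get populated.
bazel build --stamp --remote_cache=grpc://<cache-host>:<port> //:target

# 2. Build again without cleaning: local cache hit even though volatile-status.txt changed.
bazel build --stamp --remote_cache=grpc://<cache-host>:<port> //:target

# 3. Clean only the local state, then rebuild: the changed volatile-status.txt hash
#    misses the remote cache and the action reruns.
bazel clean
bazel build --stamp --remote_cache=grpc://<cache-host>:<port> //:target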
What operating system are you running Bazel on?
Ubuntu 16.04.6 LTS
What's the output of bazel info release?
release 0.29.0
What's the output of git remote get-url origin ; git rev-parse master ; git rev-parse HEAD?
https://github.com/Panfilwk/test_volatile_status.git
da002118b6c79bd7e33b017221c772fbfba4f009
da002118b6c79bd7e33b017221c772fbfba4f009
Have you found anything relevant by searching the web?
This issue touches on this concept for local caching, but nothing about how the behavior of this file works with remote caching.
Any other information, logs, or outputs that you want to share?
exec1.log.txt and exec2.log.txt are the resulting parsed experimental execution logs from the first and third build commands following the debug instructions here