Allow rewinding to re-create lost inputs #14126
/cc @coeuvre A couple of open questions:
Force-pushed from 157e280 to c935218.
Sorry for the delay, but I don't have the capacity to look into this yet. Will get back after [...]. In the meantime, any thoughts? @bazelbuild/remote-execution
Force-pushed from 366c051 to 26e59fd.
@coeuvre I've just pushed a rebase, would appreciate any thoughts when you have some time :) |
Great that you're making 'Builds without the Bytes' more robust, @illicitonion! Can the solution be extended to also resolve the local build scenario in issue #10880?
Yes, I think so - that should definitely be a separate PR, but at a high level there are three changes needed to fix that issue:
|
This remedies the following sequence of events:

1. Build build_tool (e.g. the Go builder) from source with remote execution and `--remote_download_minimal`.
2. Use build_tool to build some_binary with remote execution.
3. Evict `build_tool` from the remote execution system.
4. Edit the sources to some_binary and attempt to build it again with remote execution.

Before this change, Bazel would give a FileNotFoundException complaining that build_tool couldn't be found (and so couldn't be uploaded).

After this change, Bazel will notice that it knows how to regenerate the missing file, and so rewind the graph and re-perform the actions it needs to be able to build some_binary.
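The sequence of events above can be sketched as a toy model. This is only an illustration of the rewinding idea, not Bazel's implementation: `ACTIONS`, `LostInput`, `run_action`, and `build` are hypothetical names, and the "remote store" is just a Python set of artifact names.

```python
class LostInput(Exception):
    """Raised when an action's input is missing from the simulated remote store."""

    def __init__(self, name):
        super().__init__(name)
        self.name = name


# Hypothetical action graph: output name -> names of the inputs it needs.
ACTIONS = {
    "build_tool": ["build_tool_src"],
    "some_binary": ["build_tool", "some_binary_src"],
}


def run_action(output, remote_store, log):
    """Run one action; fail with LostInput if any input is absent remotely."""
    for dep in ACTIONS[output]:
        if dep not in remote_store:
            raise LostInput(dep)
    remote_store.add(output)
    log.append(output)


def build(target, remote_store, log):
    """Build target, re-running generating actions for any lost inputs."""
    while True:
        try:
            run_action(target, remote_store, log)
            return
        except LostInput as lost:
            if lost.name not in ACTIONS:
                raise  # a source file is missing, so there is nothing to rewind to
            build(lost.name, remote_store, log)  # "rewind": regenerate the lost input


# Steps 1-2 already happened: build_tool exists remotely.
store = {"build_tool_src", "some_binary_src", "build_tool"}
store.discard("build_tool")  # step 3: eviction
log = []
build("some_binary", store, log)  # step 4: the rebuild triggers rewinding
print(log)
```

In this model the rebuild first fails on the missing `build_tool`, regenerates it, and then retries `some_binary`, instead of surfacing the missing-file error to the user.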
Force-pushed from 26e59fd to 8401737.
@coeuvre - I just rebased this onto HEAD, would you be able to take a look some time soon? I still have a couple of open questions around how to make it properly land-able, but I think the approach is hopefully not too controversial... |
Hi, I've been working on open sourcing more of the action rewinding code and just came across this thread. As of 68ffdd2, action rewinding is permitted for eligible builds (non-incremental, no action cache). However the running action still needs to throw a lost inputs exception to trigger it, which never happens right now in bazel. So your PR might get a bit simpler now. I'm also planning to add a flag for this. |
This sounds great, thanks @justinhorvitz!
Do you have plans to relax these restrictions? Is that what the flag would do? If not, where do these restrictions come from? I can happily pull out the "throw LostInputsExecException" piece into a standalone PR if that'd be useful? |
The flag is going to keep rewinding disabled by default even for eligible builds. I should have just included it originally. See https://bazel-review.googlesource.com/c/bazel/+/196650. The restrictions are:
It seems feasible to support the action cache by just having rewinding evict entries, but to this point we haven't needed to support it. Support for reverse deps is much harder since rdep tracking is a performance hotspot. This may be on our roadmap by the end of the year - we are currently weighing rewinding support for rdeps vs an alternative for an internal project.
Sounds like a plan, assuming the above restrictions aren't an issue for you? |
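The idea of supporting the action cache by having rewinding evict entries can be illustrated with a toy model. This is a minimal sketch under stated assumptions, not Bazel's action cache: `ToyActionCache`, `execute`, `rewind`, and the digest strings are all hypothetical.

```python
class ToyActionCache:
    """Hypothetical cache: action key -> digest of the output it produced."""

    def __init__(self):
        self._entries = {}

    def get(self, key):
        return self._entries.get(key)

    def put(self, key, digest):
        self._entries[key] = digest

    def evict(self, key):
        self._entries.pop(key, None)


def execute(key, cache, blob_store, executions):
    """Return the output digest, skipping execution on a cache hit."""
    hit = cache.get(key)
    if hit is not None:
        return hit  # may refer to a blob that no longer exists
    digest = "digest-of-" + key
    blob_store.add(digest)
    cache.put(key, digest)
    executions.append(key)
    return digest


def rewind(key, cache, blob_store, executions):
    """Evict the stale entry first so the action really re-executes."""
    cache.evict(key)
    return execute(key, cache, blob_store, executions)


cache, blobs, runs = ToyActionCache(), set(), []
digest = execute("compile_tool", cache, blobs, runs)
blobs.clear()  # the remote store loses the output
stale = execute("compile_tool", cache, blobs, runs)  # cache hit, blob still missing
fresh = rewind("compile_tool", cache, blobs, runs)   # eviction forces re-execution
```

Without the eviction, the cache hit keeps handing back a digest whose blob is gone, which is why a cached action cannot simply be "re-run" to recover a lost input in this model.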
Also, I'm planning to open source the rewinding integration test, but that's likely going to take a couple months since I will be on leave throughout May. |
I spun off #15345 for the
Unfortunately, they are... But I'm a little confused about where they come from... I added illicitonion@95c71f0 on top of the above PR, and rewinding seemed to work fine for me. Specifically I did the following:

With this BUILD file:

    load("@io_bazel_rules_go//go:def.bzl", "go_binary")

    go_binary(
        name = "main",
        srcs = ["main.go"],
    )

And this

    package main

    import "fmt"

    func main() {
        fmt.Println("Sup yo")
    }

I ran a build against a remote cluster with [...]. Then I flushed the storage on the remote cluster, modified the string in [...]

@justinhorvitz Do you have repro steps for your errors that caused you to restrict rewinding based on incrementality and action caching?
It's just from the rewinding integration test I mentioned, which unfortunately is not yet open source. I'm not very familiar with remote execution in bazel, so not sure exactly how your example is working. Since @anakanemison and I will both be away for some time, I added @ericfelly to coordinate and/or find someone who may be able to assist you in the meantime. Otherwise, I can take a closer look with you in June. |
I see - without more information, I suspect that may be an issue with the test setup, or some combination of options that doesn't manifest itself in real life, but we'll find out when we find out!
Thanks! Either way, I hope your time off goes well in the meantime :)
I'm going to try getting the rewinding tests open sourced in the next couple weeks so that we can have a better conversation on the limitations. |
I was making progress on moving the rewinding test infrastructure to open source (see my recent commits in June), but encountered some test flakiness when trying to move the actual tests. Upon further inspection, it's due to an actual bug with rewinding and an execution strategy that relies on the local output base to find inputs (say for example local execution, which I was hopeful to use for the test). With concurrent rewinding, we could attempt to rewind an action whose output is simultaneously being consumed. When the action goes to re-execute, it deletes its outputs as a standard "prepare" step. That could cause the concurrent consumer to fail due to a missing input. This means that rewinding is only correct with an execution strategy that does not need to read the local output base (i.e. one that refers to inputs by digest, or one that uses an in-memory action file system). Maybe I could use the bazel remote execution framework for this test, but that would take me quite some time to ramp up on. Another option is a custom strategy just for this test, but then it might not be as valuable. |
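The interleaving described above can be reproduced deterministically in a small sketch. It is an illustration only: `prepare`, `execute_producer`, `consume`, and `simulate_interleaving` are hypothetical names, a temporary directory stands in for the local output base, and the "concurrent" consumer is simulated by running it in the window right after the prepare step.

```python
import os
import tempfile


def prepare(output_path):
    """Standard 'prepare' step: delete stale outputs before (re-)executing."""
    if os.path.exists(output_path):
        os.remove(output_path)


def execute_producer(output_path):
    """Run the producing action: prepare, then write the output."""
    prepare(output_path)
    with open(output_path, "w") as f:
        f.write("tool-bytes")


def consume(input_path):
    """A consumer that reads its input from the local output base."""
    with open(input_path) as f:
        return f.read()


def simulate_interleaving():
    with tempfile.TemporaryDirectory() as output_base:
        tool = os.path.join(output_base, "build_tool")
        execute_producer(tool)  # initial execution succeeds
        consume(tool)           # a consumer can read the output

        # Rewinding re-executes the producer; its prepare step deletes the output...
        prepare(tool)
        try:
            # ...and a consumer reading in that window finds nothing.
            consume(tool)
            return "ok"
        except FileNotFoundError:
            return "missing input"


print(simulate_interleaving())
```

A strategy that refers to inputs by digest, or that uses an in-memory action file system, never reads the path that `prepare` deletes, which is why the problem does not arise there.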
Now that we can both look at the tests, here's what happens when blindly permitting rewinding without meeting prerequisites:
|
Because it's been a couple of months since Justin's last comment, and because he's out on leave for a few more weeks, I'd like to acknowledge our current state. The blockers Justin described above are still blocking: compatibility with 1) action cache, 2) incrementality, and 3) local execution. We haven't had the opportunity to work on them yet. What Justin said here about our possibly working on the incrementality problem later this year continues to be true. We don't yet have anything to share regarding our plans for addressing the other blockers, other than we're still talking about them, and we'd like to do something about them! |
I really hope this pull request gets merged; otherwise the requirements for Bw/oB are too strict for it to be usable.
I was rebasing this PR on HEAD to be able to run it against |
Hello @illicitonion, are you still looking to submit this PR? Could you please respond and share the latest update on it? Thanks!
I was finally able to run this PR with
|
Closing in favour of #16660 |