Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Allow rewinding to re-create lost inputs #14126

Conversation

illicitonion
Copy link
Contributor

This remedies the following sequence of events:

  1. Build build_tool (e.g. the go builder) from source with remote
    execution and --remote_download_minimal.
  2. Use build_tool to build some_binary with remote execution.
  3. Evict build_tool from the remote execution system.
  4. Edit the sources to some_binary and attempt to build it again with
    remote execution.

Before this change, Bazel would give an FileNotFoundException
complaining that build_tool couldn't be found (and so couldn't be
uploaded).

After this change, Bazel will notice that it knows how to regenerate the
missing file, and so rewind the graph and re-perform the actions it
needs to be able to build some_binary.

@illicitonion illicitonion requested a review from a team as a code owner October 18, 2021 17:12
@google-cla google-cla bot added the cla: yes label Oct 18, 2021
@illicitonion
Copy link
Contributor Author

/cc @coeuvre

A couple of open questions:

  1. Is there a way to make this configurable without adding a startup option from the launcher? I didn't love adding a startup option, but I also couldn't find any other options which were parsed by the time we're constructing the SkyframeExecutor...
  2. How should I test this? I don't know of any tests which are do anything quite as detailed as this in terms of interacting with a (mock) remote server...

@illicitonion illicitonion force-pushed the rewinding-bulk-upload-exceptions-5.0 branch 5 times, most recently from 157e280 to c935218 Compare October 19, 2021 11:04
@coeuvre
Copy link
Member

coeuvre commented Oct 28, 2021

Sorry for the delay but I don't have the capacity to look into this yet. Will get back after 5.0.0 is released.

In the meantime, any thoughts? @bazelbuild/remote-execution

@illicitonion
Copy link
Contributor Author

@coeuvre I've just pushed a rebase, would appreciate any thoughts when you have some time :)

@ulrfa
Copy link
Contributor

ulrfa commented Jan 24, 2022

Great @illicitonion, that you make ’Builds without the Bytes’ more robust!

Can the solution be extended to resolve also the local build scenario in issue #10880?

@illicitonion
Copy link
Contributor Author

Can the solution be extended to resolve also the local build scenario in issue #10880?

Yes, I think so - that should definitely be a separate PR, but at a high level there are three changes needed to fix that issue:

  1. Bazel needs generically enable-able support for rewinding - this PR adds this support.
  2. When an exception is thrown which should trigger rewinding, it needs to be a LostInputsExecException - that would be a small code change to
    try {
    RemoteCache.waitForBulkTransfer(
    downloadsToWaitFor.values(), /* cancelRemainingOnInterrupt=*/ true);
    } catch (BulkTransferException e) {
    if (e.onlyCausedByCacheNotFoundException()) {
    BulkTransferException bulkAnnotatedException = new BulkTransferException();
    for (Throwable t : e.getSuppressed()) {
    IOException annotatedException =
    new IOException(
    String.format(
    "Failed to fetch file with hash '%s' because it does not exist remotely."
    + " --remote_download_outputs=minimal does not work if"
    + " your remote cache evicts files during builds.",
    ((CacheNotFoundException) t).getMissingDigest().getHash()));
    bulkAnnotatedException.add(annotatedException);
    }
    e = bulkAnnotatedException;
    }
    throw e;
    }
    }
    to throw a different exception type. In particular, probably to call the new BulkTransferException.asLostInputsExecException() method added in this PR.
  3. When throwing the original exceptions which get aggregated into the BulkTransferException, you need to have enough information to be able to point at the ActionOwner of the missing file - this may require a bit of looking back through call-stacks (and across async calls) to make sure we store the generating ActionOwner alongside the digest, but that should be doable. Basically, for each CacheNotFoundException which is in the BulkTransferException, where it's thrown, instead of a CacheNotFoundException, we'd instead need to be throwing a LostInputsExecException (and have enough information to be able to construct one).

This remedies the following sequence of events:
1. Build build_tool (e.g. the go builder) from source with remote
   execution and `--remote_download_minimal`.
2. Use build_tool to build some_binary with remote execution.
3. Evict `build_tool` from the remote execution system.
4. Edit the sources to some_binary and attempt to build it again with
   remote execution.

Before this change, Bazel would give an FileNotFoundException
complaining that build_tool couldn't be found (and so couldn't be
uploaded).

After this change, Bazel will notice that it knows how to regenerate the
missing file, and so rewind the graph and re-perform the actions it
needs to be able to build some_binary.
@illicitonion illicitonion force-pushed the rewinding-bulk-upload-exceptions-5.0 branch from 26e59fd to 8401737 Compare April 4, 2022 14:33
@illicitonion
Copy link
Contributor Author

@coeuvre - I just rebased this onto HEAD, would you be able to take a look some time soon? I still have a couple of open questions around how to make it properly land-able, but I think the approach is hopefully not too controversial...

@justinhorvitz
Copy link
Contributor

Hi, I've been working on open sourcing more of the action rewinding code and just came across this thread. As of 68ffdd2, action rewinding is permitted for eligible builds (non-incremental, no action cache). However the running action still needs to throw a lost inputs exception to trigger it, which never happens right now in bazel. So your PR might get a bit simpler now.

I'm also planning to add a flag for this.

@illicitonion
Copy link
Contributor Author

This sounds great, thanks @justinhorvitz!

non-incremental, no action cache

Do you have plans to relax these restrictions? Is that what the flag would do? If not, where do these restrictions come from?

I can happily pull out the "throw LostInputsExecException" piece into a standalone PR if that'd be useful?

@justinhorvitz
Copy link
Contributor

Do you have plans to relax these restrictions? Is that what the flag would do? If not, where do these restrictions come from?

The flag is going to keep rewinding disabled by default even for eligible builds. I should have just included it originally. See https://bazel-review.googlesource.com/c/bazel/+/196650. The restrictions are:

It seems feasible to support the action cache by just having rewinding evict entries, but to this point we haven't needed to support it. Support for reverse deps is much harder since rdep tracking is a performance hotspot. This may be on our roadmap by the end of the year - we are currently weighing rewinding support for rdeps vs an alternative for an internal project.

I can happily pull out the "throw LostInputsExecException" piece into a standalone PR if that'd be useful?

Sounds like a plan, assuming the above restrictions aren't an issue for you?

@justinhorvitz
Copy link
Contributor

Also, I'm planning to open source the rewinding integration test, but that's likely going to take a couple months since I will be on leave throughout May.

@illicitonion
Copy link
Contributor Author

I can happily pull out the "throw LostInputsExecException" piece into a standalone PR if that'd be useful?

Sounds like a plan

I spun off #15345 for the LostInputsExecException change :)

assuming the above restrictions aren't an issue for you?

Unfortunately, they are... But I'm a little confused about where they come from... I added illicitonion@95c71f0 on top of the above PR, and rewinding seemed to work fine for me. Specifically I did the following:

With this BUILD file:

load("@io_bazel_rules_go//go:def.bzl", "go_binary")

go_binary(
    name = "main",
    srcs = ["main.go"],
)

And this main.go:

package main

import "fmt"

func main() {
    fmt.Println("Sup yo")
}

I ran a build against a remote cluster with --remote_download_minimal, and no other interesting flags (i.e. default caching, default incrementality, etc).

Then I flushed the storage on the remote cluster, modified the string in main.go, and re-built, and I saw the build rewind and work.

@justinhorvitz Do you have repro steps for your errors that caused you to restrict rewinding based on incrementality and action caching?

@justinhorvitz
Copy link
Contributor

Do you have repro steps for your errors that caused you to restrict rewinding based on incrementality and action caching?

It's just from the rewinding integration test I mentioned, which unfortunately is not yet open source.

I'm not very familiar with remote execution in bazel, so not sure exactly how your example is working. Since @anakanemison and I will both be away for some time, I added @ericfelly to coordinate and/or find someone who may be able to assist you in the meantime. Otherwise, I can take a closer look with you in June.

@illicitonion
Copy link
Contributor Author

It's just from the rewinding integration test I mentioned, which unfortunately is not yet open source.

I see - without more information, I suspect that may be an issue with the test setup, or some combination of options that doesn't manifest itself in real life, but we'll find out when we find out!

I'm not very familiar with remote execution in bazel, so not sure exactly how your example is working. Since @anakanemison and I will both be away for some time, I added @ericfelly to coordinate and/or find someone who may be able to assist you in the meantime. Otherwise, I can take a closer look with you in June.

Thanks! Either way, I hope your time off goes well in the mean time :)

@ericfelly ericfelly removed their assignment Jun 2, 2022
@justinhorvitz
Copy link
Contributor

I'm going to try getting the rewinding tests open sourced in the next couple weeks so that we can have a better conversation on the limitations.

@justinhorvitz
Copy link
Contributor

I was making progress on moving the rewinding test infrastructure to open source (see my recent commits in June), but encountered some test flakiness when trying to move the actual tests. Upon further inspection, it's due to an actual bug with rewinding and an execution strategy that relies on the local output base to find inputs (say for example local execution, which I was hopeful to use for the test). With concurrent rewinding, we could attempt to rewind an action whose output is simultaneously being consumed. When the action goes to re-execute, it deletes its outputs as a standard "prepare" step. That could cause the concurrent consumer to fail due to a missing input.

This means that rewinding is only correct with an execution strategy that does not need to read the local output base (i.e. one that refers to inputs by digest, or one that uses an in-memory action file system). Maybe I could use the bazel remote execution framework for this test, but that would take me quite some time to ramp up on. Another option is a custom strategy just for this test, but then it might not be as valuable.

@justinhorvitz
Copy link
Contributor

RewindingTest is now in the open source repository: 515382b. Please note the caveats mentioned in that commit description, including a workaround for the concern in my previous comment about actions racing with deletion of outputs from local disk.

Now that we can both look at the tests, here's what happens when blindly permitting rewinding without meeting prerequisites:

--use_action_cache

--track_incremental_state

@anakanemison
Copy link
Contributor

Because it's been a couple of months since Justin's last comment, and because he's out on leave for a few more weeks, I'd like to acknowledge our current state.

The blockers Justin described above are still blocking: compatibility with 1) action cache, 2) incrementality, and 3) local execution. We haven't had the opportunity to work on them yet. What Justin said here about our possibly working on the incrementality problem later this year continues to be true.

We don't yet have anything to share regarding our plans for addressing the other blockers, other than we're still talking about them, and we'd like to do something about them!

@liyuyun-lyy
Copy link

Really hope this pull request to be merged, otherwise the requirement for Bw/oB is too strict to use.

@k1nkreet
Copy link

k1nkreet commented Oct 7, 2022

I was rebasing this PR on HEAD to be able to run it against RewindingTest and I've found out Bazel is not throwing FileNotFoundException anymore in this case but hags infinitely instead. It looks like this PR is not applicable anymore until it fixed. So I've created an issue for that and PR fixing it.

@k1nkreet
Copy link

Since #16422 is fixed, I've rebased this PR on HEAD to include `RewindingTest.java' and be able to test this PR against it, the rebased version is #16470

@sgowroji
Copy link
Member

Hello @illicitonion, Are you still looking to submit this PR. Could you please respond and share us the latest update on it. Thanks!

@sgowroji sgowroji added awaiting-user-response Awaiting a response from the author team-Remote-Exec Issues and PRs for the Execution (Remote) team labels Oct 19, 2022
@illicitonion
Copy link
Contributor Author

@k1nkreet has kindly picked it up and moved it over to #16470 - it looks like there's a little more rebasing to do before it's working again.

@k1nkreet
Copy link

k1nkreet commented Oct 20, 2022

I was finally able to run this PR with RewindingTest after rebase and there is couple interesting findings probably unrelated to limitations outlined above:

  • by default Bazel crashes in all the testcases with:
java.lang.IllegalStateException: Unexpected inconsistency: ActionLookupData{actionLookupKey=ConfiguredTargetKey{label=//test:consume_1, config=BuildConfigurationKey[91b98a11150aa42ab36f3ce12d30291638db7840cb81a44c8c3b3c12fa499ffb]}, actionIndex=0}, [ActionLookupData{actionLookupKey=ConfiguredTargetKey{label=//test:rule1, config=BuildConfigurationKey[91b98a11150aa42ab36f3ce12d30291638db7840cb81a44c8c3b3c12fa499ffb]}, actionIndex=0}], PARENT_FORCE_REBUILD_OF_CHILD
    at com.google.devtools.build.skyframe.GraphInconsistencyReceiver.lambda$static$0(GraphInconsistencyReceiver.java:37)
    at com.google.devtools.build.skyframe.AbstractParallelEvaluator.maybeHandleRestart(AbstractParallelEvaluator.java:963)
    at com.google.devtools.build.skyframe.AbstractParallelEvaluator$Evaluate.run(AbstractParallelEvaluator.java:652)
    at com.google.devtools.build.lib.concurrent.AbstractQueueVisitor$WrappedRunnable.run(AbstractQueueVisitor.java:382)
    at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
    at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
    at java.base/java.lang.Thread.run(Thread.java:829)
  • If I add --experimental_rewind_missing_files to startupOptions in RewindingTest there is one test crashing:
1) buildingParentFoundUndoneChildNotToleratedWithoutRewinding[keepGoing=false](com.google.devtools.build.lib.skyframe.rewinding.RewindingTest)
value of           : throwable.getMessage()
expected to contain: Unexpected undone children
but was            : Unexpected inconsistency: ActionLookupData{actionLookupKey=ConfiguredTargetKey{label=//foo:top, config=BuildConfigurationKey[91b98a11150aa42ab36f3ce12d30291638db7840cb81a44c8c3b3c12fa499ffb]}, actionIndex=0}, [ActionLookupData{actionLookupKey=ConfiguredTargetKey{label=//foo:dep, config=BuildConfigurationKey[91b98a11150aa42ab36f3ce12d30291638db7840cb81a44c8c3b3c12fa499ffb]}, actionIndex=0}], BUILDING_PARENT_FOUND_UNDONE_CHILD
    at com.google.devtools.build.lib.skyframe.rewinding.RewindingTestsHelper.runBuildingParentFoundUndoneChildNotToleratedWithoutRewinding(RewindingTestsHelper.java:419)
    at com.google.devtools.build.lib.skyframe.rewinding.RewindingTest.buildingParentFoundUndoneChildNotToleratedWithoutRewinding(RewindingTest.java:90)
Caused by: java.lang.IllegalStateException: Unexpected inconsistency: ActionLookupData{actionLookupKey=ConfiguredTargetKey{label=//foo:top, config=BuildConfigurationKey[91b98a11150aa42ab36f3ce12d30291638db7840cb81a44c8c3b3c12fa499ffb]}, actionIndex=0}, [ActionLookupData{actionLookupKey=ConfiguredTargetKey{label=//foo:dep, config=BuildConfigurationKey[91b98a11150aa42ab36f3ce12d30291638db7840cb81a44c8c3b3c12fa499ffb]}, actionIndex=0}], BUILDING_PARENT_FOUND_UNDONE_CHILD
    at com.google.devtools.build.skyframe.RewindingGraphInconsistencyReceiver.noteInconsistencyAndMaybeThrow(RewindingGraphInconsistencyReceiver.java:19)
    at com.google.devtools.build.skyframe.SkyFunctionEnvironment.batchPrefetch(SkyFunctionEnvironment.java:231)
    at com.google.devtools.build.skyframe.SkyFunctionEnvironment.<init>(SkyFunctionEnvironment.java:203)
    at com.google.devtools.build.skyframe.SkyFunctionEnvironment.create(SkyFunctionEnvironment.java:154)
    at com.google.devtools.build.skyframe.AbstractParallelEvaluator$Evaluate.run(AbstractParallelEvaluator.java:544)
    at com.google.devtools.build.lib.concurrent.AbstractQueueVisitor$WrappedRunnable.run(AbstractQueueVisitor.java:382)
    at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
    at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
    at java.base/java.lang.Thread.run(Thread.java:829)

@illicitonion
Copy link
Contributor Author

Closing in favour of #16660

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
awaiting-user-response Awaiting a response from the author cla: yes team-Remote-Exec Issues and PRs for the Execution (Remote) team
Projects
None yet
Development

Successfully merging this pull request may close these issues.

9 participants