-
Notifications
You must be signed in to change notification settings - Fork 4.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Tests all fail with distributed caching enabled #1413
Comments
Thanks for the report. I'll look into this. |
Also cc @philwo |
I experimented a bit with this in the debugger. There's an exception being thrown in
This thrown exception causes the test to be marked as a failure.
This bug seems like a problem with the architecture of test actions. On the one hand, I would not expect it to be OK for test targets to declare outputs that their actions don't generate. On the other hand, the test process itself can't be trusted; the success/failure of the run is a fact we observe about running the runner, so I don't think we can/should trust the test process to manage its own cache status. (How does this ever work in Blaze?!) |
Blaze has a completely different implementation of TestStrategy, which doesn't go through SpawnStrategy, but instead talks to the remote execution system through another internal API (which is only used for this purpose). It manually overrides the set of output files, which implicitly removes test.cache_status (which is written by Blaze / Bazel after the test process finishes). I.e., test.cache_status is not cached remotely, but is reconstructed locally on cache hits. TBH, the code is a bit of a mess. My preferred solution would be to rewrite / refactor large chunks of this code to merge the (currently separate) implementations, and go through SpawnStrategy internally as well. However, this might be tricky. IIRC, we use the test.cache_status file to implement a custom caching policy, which is to cache test passes, but not failures. As a temporary workaround, maybe we could remove the test.cache_status from the output files of the spawn. That seems like it should fix this, even if it's not optimal. |
Ideally, we'd remove the test.cache_status files. I seem to remember discussing that with Han-wen back when he wrote the StandaloneTestStrategy. I think the proposal was to represent that in Skyframe / add a bit to the local action cache, but Skyframe wasn't fully rolled out at that time, so we would've had to add a workaround for the non-Skyframe code path. |
We need to fix that in the next quarter. |
I'm having the same issue. Is there any workaround other than not using distributed caching when running tests? |
Not that I know of, unfortunately. Philipp is having similar problems with sandboxing, and may cover this use case as well. |
I tried just removing the line to So instead I tried updating That's a significantly bigger issue; it specifically applies to this bug because the test jars are just runfiles for the test launcher script; when those runfiles are updated, the cache key isn't getting updated, and thus the "success" result isn't invalidated when the test inputs change. |
Figured out a fix for #1593, so here's a PR for #1413 (ignore missing output files) https://bazel-review.googlesource.com/4231 |
Fixes bazelbuild#1413 Change-Id: Ib863a6bab8ebe35b6076f8fbf780c022424c158b Conflicts: src/main/java/com/google/devtools/build/lib/remote/MemcacheActionCache.java
I'm using bazel 0.3.1 and ran into this recently. What's the status on the fix for this? |
I'm experiencing the same failure with golang tests and bazel 0.4.1 |
I've made some progress on refactoring the test strategies, and a stack of further changes to unify some code between all implementations, but the end of the tunnel is not in sight yet. |
I'm still working on this. More progress on the test strategies, and I'm preparing to open source some more of the code we're using internally. The code is incredibly subtle, and I've had to roll back several times because I overlooked some corner cases. I have a good idea of the model I'm shooting for, but it still requires significant refactoring work. |
All spawn strategies already treat all normal outputs as optional. Bazel checks at the action level whether all action outputs are created, but does not check at the spawn level. Spawn.getOptionalOutputs is therefore unnecessary, and removed in this change. The only place where this was set was in StandaloneTestStrategy, which now specifies the full set of outputs, which is now computed by TestRunnerAction. The internal test strategy implementations are also updated in this change. While I'm at it, also remove the use of BaseSpawn and use SimpleSpawn instead. This may go some way towards fixing #1413 and #942. -- PiperOrigin-RevId: 149397100 MOS_MIGRATED_REVID=149397100
I wonder if the change above fixed this. Can someone try with head? |
I think distributed test caching is just not working in head revision 413415c, so it "passes" by not using caching. e.g. when I add a 10s sleep to the test and run/re-run it with I've updated the https://github.com/dfabulich/hazelcast-junit-test test repo; it should reproduce out of the box on head like this:
|
The way it's implemented right now, it only looks for a cache hit if remote execution is enabled. It also does not write local results to the remote cache. |
There's simply no code to combine local execution with remote caching. It's not hard to implement, which I did in a few minutes. It doesn't work for tests yet, because we're not uploading stdout/stderr. I'm also reluctant to merge this change into master because it's not how I want this to work. My change is here: The failure shown in the first post on this bug is most likely because the code is swallowing certain exceptions. Note the debug code in my branch where I'm calling e.printStackTrace(). |
I think what happened is that we rewrote RemoteSpawnStrategy in a way that dropped the remote caching part. Previously it was failing because it was trying to upload non-existent files - it tried to upload all files declared on Spawn, but the files declared are all optional, and - for tests - several are rarely generated. When re-implemented remote caching, I got the same error and changed it to only upload existing files, which fixed that part. |
Fixes bazelbuild#1413 Change-Id: Ib863a6bab8ebe35b6076f8fbf780c022424c158b Conflicts: src/main/java/com/google/devtools/build/lib/remote/MemcacheActionCache.java
I've made some progress on the underlying infrastructure, and I just put a change together this morning to reimplement remote caching in the new infrastructure. However, I don't have tests yet, and the code also still needs to be hooked up in the right way. It's probably still a few weeks before this is fixed. |
The REST remote caching API will not change and after Ulf's changes, it will be called when building with Bazel. |
@steren @ulfjack can you please clarify? We're interested in the |
Yes, I was underlining the REST API of remote caching but I suppose that the other APIs will not be affected either by @ulfjack's work. He is fixing the underlying implementation. |
…exec The new class wraps an existing SpawnRunner and adds remote caching. Ideally, the wrapped runner should be local and sandboxed, but this is not currently enforced. The new class is not hooked up to anything yet. The added test indicates that the RemoteActionCache interface is still more complex than necessary - in particular, we should merge downloadAllResults and downloadBlobs (for stdout/stderr) into a single method, and also change the upload to a single combined method in a similar way instead of two calls. Doing so allows the RemoteActionCache implementation more leeway in how it wants to implement these, potentially improving parallelism and performance. One step towards #1413. PiperOrigin-RevId: 152245644
Also f5918bf. I've sent out a change to fix upload of stdout/stderr, which is necessary for test caching. I think that will allow us to close this issue, though there's still more refactoring happening. |
Only write a cache entry when the spawn executed successfully, and with a 0 exit code. In the test, we only check that uploadFileContents is called exactly twice. Progress on #1413. PiperOrigin-RevId: 153458240
We're also missing download of stdout/stderr in the old implementation, and there's some interaction with #2844. |
Sorry, I left in the 'Fixes #1413' in the previous commit description, even though it wasn't complete. I have the last fix for stdout/stderr pending, and then I'll close this. |
Consider this repository. https://github.com/dfabulich/hazelcast-junit-test
It has two targets,
app
andapptest
.The test is a trivial test that always passes.
If you run
bazel test :apptest
, the test succeeds.Now run it with distributed caching enabled, following the instructions here: https://raw.githubusercontent.com/bazelbuild/bazel/79adf59e2973754c8c0415fcab45cd58c7c34697/src/main/java/com/google/devtools/build/lib/remote/README.md
The test now fails when it should succeed. The test log shows no apparent problem.
The text was updated successfully, but these errors were encountered: