Tests all fail with distributed caching enabled #1413

dfabulich · 2016-06-16T03:19:17Z

Consider this repository. https://github.com/dfabulich/hazelcast-junit-test

It has two targets, app and apptest.

java_library(
    name='app',
    srcs=glob(['src/main/java/**']),
)

java_test(
    name='apptest',
    srcs=glob(['src/test/java/**']),
    test_class='com.redfin.AppTest',
    size="small",
    deps=['app', '@junit//jar', '@hamcrest_core//jar'],
)

The test is a trivial test that always passes.

package com.redfin;

public class App {
    public String go() {
        return "hello";
    }
}

package com.redfin;

import org.junit.Test;

public class AppTest {

    @Test
    public void testGo() throws Exception {
        new App().go();
    }
}

If you run bazel test :apptest, the test succeeds.

Now run it with distributed caching enabled, following the instructions here: https://raw.githubusercontent.com/bazelbuild/bazel/79adf59e2973754c8c0415fcab45cd58c7c34697/src/main/java/com/google/devtools/build/lib/remote/README.md

java -jar hazelcast-3.5.4.jar &

bazel clean && bazel test :apptest --hazelcast_node=127.0.0.1:5701 --spawn_strategy=remote

The test now fails when it should succeed. The test log shows no apparent problem.

$ bazel clean && bazel test :apptest --hazelcast_node=127.0.0.1:5701 --spawn_strategy=remote
INFO: Starting clean (this may take a while). Consider using --expunge_async if the clean takes more than several minutes.
INFO: Found 1 test target...
FAIL: //:apptest (see /private/var/tmp/_bazel_dfabulich/bc91e687c045d0f8011faa3e2c77b2b4/execroot/hazelcast-junit-test/bazel-out/local-fastbuild/testlogs/apptest/test.log).
Target //:apptest up-to-date:
  bazel-bin/apptest.jar
  bazel-bin/apptest
INFO: Elapsed time: 8.864s, Critical Path: 5.65s
//:apptest                                                               FAILED in 5.3s
  /private/var/tmp/_bazel_dfabulich/bc91e687c045d0f8011faa3e2c77b2b4/execroot/hazelcast-junit-test/bazel-out/local-fastbuild/testlogs/apptest/test.log

Executed 1 out of 1 test: 1 fails locally.
$ more /private/var/tmp/_bazel_dfabulich/bc91e687c045d0f8011faa3e2c77b2b4/execroot/hazelcast-junit-test/bazel-out/local-fastbuild/testlogs/apptest/test.log
exec ${PAGER:-/usr/bin/less} "$0" || exit 1
-----------------------------------------------------------------------------
JUnit4 Test Runner
.
Time: 5.053

OK (1 test)


BazelTestRunner exiting with a return value of 0
JVM shutdown hooks (if any) will run now.
The JVM will exit once they complete.

-- JVM shutdown starting at 2016-06-16 03:17:21 --

The text was updated successfully, but these errors were encountered:

hhclam · 2016-06-16T03:32:26Z

Thanks for the report. I'll look into this.

hermione521 · 2016-06-16T08:55:55Z

Also cc @philwo

dfabulich · 2016-06-20T23:41:07Z

I experimented a bit with this in the debugger.

There's an exception being thrown in RemoteSpawnStrategy at line 165, when it attempts to add the output files to the remoteActionCache.

java.io.FileNotFoundException: /private/var/tmp/_bazel_dfabulich/bc91e687c045d0f8011faa3e2c77b2b4/execroot/hazelcast-junit-test/bazel-out/local-fastbuild/testlogs/apptest/test.cache_status (No such file or directory)

java.io.FileNotFoundException: /private/var/tmp/_bazel_dfabulich/bc91e687c045d0f8011faa3e2c77b2b4/execroot/hazelcast-junit-test/bazel-out/local-fastbuild/testlogs/apptest/test.cache_status (No such file or directory)
    at com.google.devtools.build.lib.unix.NativePosixFiles.md5sumAsBytes(Native Method)
    at com.google.devtools.build.lib.unix.NativePosixFiles.md5sum(NativePosixFiles.java:442)
    at com.google.devtools.build.lib.vfs.UnixFileSystem.getMD5Digest(UnixFileSystem.java:609)
    at com.google.devtools.build.lib.vfs.Path.getMD5Digest(Path.java:982)
    at com.google.devtools.build.lib.remote.MemcacheActionCache.putFileIfNotExist(MemcacheActionCache.java:59)
    at com.google.devtools.build.lib.remote.MemcacheActionCache.addToActionOutput(MemcacheActionCache.java:161)
    at com.google.devtools.build.lib.remote.MemcacheActionCache.putActionOutput(MemcacheActionCache.java:136)
    at com.google.devtools.build.lib.remote.RemoteSpawnStrategy.exec(RemoteSpawnStrategy.java:165)
    at com.google.devtools.build.lib.rules.test.StandaloneTestStrategy.execute(StandaloneTestStrategy.java:185)
    at com.google.devtools.build.lib.rules.test.StandaloneTestStrategy.exec(StandaloneTestStrategy.java:135)
    at com.google.devtools.build.lib.rules.test.ExclusiveTestStrategy.exec(ExclusiveTestStrategy.java:42)
    at com.google.devtools.build.lib.rules.test.TestRunnerAction.execute(TestRunnerAction.java:529)
    at com.google.devtools.build.lib.skyframe.SkyframeActionExecutor.executeActionTask(SkyframeActionExecutor.java:760)
    at com.google.devtools.build.lib.skyframe.SkyframeActionExecutor.prepareScheduleExecuteAndCompleteAction(SkyframeActionExecutor.java:699)
    at com.google.devtools.build.lib.skyframe.SkyframeActionExecutor.access$800(SkyframeActionExecutor.java:102)
    at com.google.devtools.build.lib.skyframe.SkyframeActionExecutor$ActionRunner.call(SkyframeActionExecutor.java:589)
    at com.google.devtools.build.lib.skyframe.SkyframeActionExecutor$ActionRunner.call(SkyframeActionExecutor.java:551)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at com.google.devtools.build.lib.skyframe.SkyframeActionExecutor.executeAction(SkyframeActionExecutor.java:373)
    at com.google.devtools.build.lib.skyframe.ActionExecutionFunction.checkCacheAndExecuteIfNeeded(ActionExecutionFunction.java:387)
    at com.google.devtools.build.lib.skyframe.ActionExecutionFunction.compute(ActionExecutionFunction.java:174)
    at com.google.devtools.build.skyframe.ParallelEvaluator$Evaluate.run(ParallelEvaluator.java:972)
    at com.google.devtools.build.lib.concurrent.AbstractQueueVisitor$WrappedRunnable.run(AbstractQueueVisitor.java:532)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)

This thrown exception causes the test to be marked as a failure.

test.cache_status is one of the outputs declared in TestActionBuilder.java, but test.cache_status isn't built by the action itself; it's built by StandaloneTestStrategy on line 137, two lines after it executes the action, when it posts the test result.

      at com.google.devtools.build.lib.rules.test.TestRunnerAction.saveCacheStatus(TestRunnerAction.java:232)
      at com.google.devtools.build.lib.rules.test.TestStrategy.postTestResult(TestStrategy.java:287)
      at com.google.devtools.build.lib.rules.test.StandaloneTestStrategy.finalizeTest(StandaloneTestStrategy.java:256)
      at com.google.devtools.build.lib.rules.test.StandaloneTestStrategy.exec(StandaloneTestStrategy.java:137)

This bug seems like a problem with the architecture of test actions. On the one hand, I would not expect it to be OK for test targets to declare outputs that their actions don't generate. On the other hand, the test process itself can't be trusted; the success/failure of the run is a fact we observe about running the runner, so I don't think we can/should trust the test process to manage its own cache status.

(How does this ever work in Blaze?!)

ulfjack · 2016-06-22T07:42:36Z

Blaze has a completely different implementation of TestStrategy, which doesn't go through SpawnStrategy, but instead talks to the remote execution system through another internal API (which is only used for this purpose). It manually overrides the set of output files, which implicitly removes test.cache_status (which is written by Blaze / Bazel after the test process finishes). I.e., test.cache_status is not cached remotely, but is reconstructed locally on cache hits.

TBH, the code is a bit of a mess.

My preferred solution would be to rewrite / refactor large chunks of this code to merge the (currently separate) implementations, and go through SpawnStrategy internally as well. However, this might be tricky. IIRC, we use the test.cache_status file to implement a custom caching policy, which is to cache test passes, but not failures.

As a temporary workaround, maybe we could remove the test.cache_status from the output files of the spawn. That seems like it should fix this, even if it's not optimal.

ulfjack · 2016-06-22T07:45:14Z

Ideally, we'd remove the test.cache_status files. I seem to remember discussing that with Han-wen back when he wrote the StandaloneTestStrategy. I think the proposal was to represent that in Skyframe / add a bit to the local action cache, but Skyframe wasn't fully rolled out at that time, so we would've had to add a workaround for the non-Skyframe code path.

damienmg · 2016-06-22T07:46:00Z

We need to fix that in the next quarter.

marczych · 2016-07-07T21:28:25Z

I'm having the same issue. Is there any workaround other than not using distributed caching when running tests?

ulfjack · 2016-07-08T11:38:46Z

Not that I know of, unfortunately. Philipp is having similar problems with sandboxing, and may cover this use case as well.

dfabulich · 2016-08-01T22:39:12Z

I tried just removing the line to results.add(cacheStatus); in TestActionBuilder, but that caused tests to be marked "skipped" with NO STATUS.

So instead I tried updating ConcurrentMapActionCache to ignore missing files. That nearly worked, because RemoteSpawnStrategy already refuses to cache outputs when the action fails, but I ran into issue #1593, which means that runfiles aren't being incorporated into the cache key.

That's a significantly bigger issue; it specifically applies to this bug because the test jars are just runfiles for the test launcher script; when those runfiles are updated, the cache key isn't getting updated, and thus the "success" result isn't invalidated when the test inputs change.

dfabulich · 2016-08-01T23:41:47Z

Figured out a fix for #1593, so here's a PR for #1413 (ignore missing output files) https://bazel-review.googlesource.com/4231

Fixes bazelbuild#1413 Change-Id: Ib863a6bab8ebe35b6076f8fbf780c022424c158b Conflicts: src/main/java/com/google/devtools/build/lib/remote/MemcacheActionCache.java

trainman419 · 2016-10-26T23:44:45Z

I'm using bazel 0.3.1 and ran into this recently. What's the status on the fix for this?

kerinin · 2016-12-09T23:08:35Z

I'm experiencing the same failure with golang tests and bazel 0.4.1

ulfjack · 2017-01-13T13:46:10Z

I've made some progress on refactoring the test strategies, and a stack of further changes to unify some code between all implementations, but the end of the tunnel is not in sight yet.

ulfjack · 2017-03-01T13:55:35Z

I'm still working on this. More progress on the test strategies, and I'm preparing to open source some more of the code we're using internally. The code is incredibly subtle, and I've had to roll back several times because I overlooked some corner cases. I have a good idea of the model I'm shooting for, but it still requires significant refactoring work.

All spawn strategies already treat all normal outputs as optional. Bazel checks at the action level whether all action outputs are created, but does not check at the spawn level. Spawn.getOptionalOutputs is therefore unnecessary, and removed in this change. The only place where this was set was in StandaloneTestStrategy, which now specifies the full set of outputs, which is now computed by TestRunnerAction. The internal test strategy implementations are also updated in this change. While I'm at it, also remove the use of BaseSpawn and use SimpleSpawn instead. This may go some way towards fixing #1413 and #942. -- PiperOrigin-RevId: 149397100 MOS_MIGRATED_REVID=149397100

ulfjack · 2017-03-07T16:23:10Z

I wonder if the change above fixed this. Can someone try with head?

dfabulich · 2017-03-10T21:36:41Z

I think distributed test caching is just not working in head revision 413415c, so it "passes" by not using caching. e.g. when I add a 10s sleep to the test and run/re-run it with bazel clean && bazel test :apptest it takes 10s every time, when it should be using the cache.

I've updated the https://github.com/dfabulich/hazelcast-junit-test test repo; it should reproduce out of the box on head like this:

java -jar hazelcast-3.8.jar &
bazel clean && bazel test :apptest
bazel clean && bazel test :apptest

ulfjack · 2017-03-14T10:34:10Z

The way it's implemented right now, it only looks for a cache hit if remote execution is enabled. It also does not write local results to the remote cache.

ulfjack · 2017-03-14T11:01:10Z

There's simply no code to combine local execution with remote caching. It's not hard to implement, which I did in a few minutes. It doesn't work for tests yet, because we're not uploading stdout/stderr. I'm also reluctant to merge this change into master because it's not how I want this to work.

My change is here:
https://github.com/ulfjack/bazel/tree/remote-caching

The failure shown in the first post on this bug is most likely because the code is swallowing certain exceptions. Note the debug code in my branch where I'm calling e.printStackTrace().

ulfjack · 2017-03-14T11:06:43Z

I think what happened is that we rewrote RemoteSpawnStrategy in a way that dropped the remote caching part. Previously it was failing because it was trying to upload non-existent files - it tried to upload all files declared on Spawn, but the files declared are all optional, and - for tests - several are rarely generated. When re-implemented remote caching, I got the same error and changed it to only upload existing files, which fixed that part.

Fixes bazelbuild#1413 Change-Id: Ib863a6bab8ebe35b6076f8fbf780c022424c158b Conflicts: src/main/java/com/google/devtools/build/lib/remote/MemcacheActionCache.java

ulfjack · 2017-03-27T13:28:33Z

I've made some progress on the underlying infrastructure, and I just put a change together this morning to reimplement remote caching in the new infrastructure. However, I don't have tests yet, and the code also still needs to be hooked up in the right way. It's probably still a few weeks before this is fixed.

steren · 2017-03-27T14:26:51Z

The REST remote caching API will not change and after Ulf's changes, it will be called when building with Bazel.

mwitkow · 2017-03-27T14:44:04Z

@steren @ulfjack can you please clarify?

We're interested in the remote_protocol.proto interfaces. Is that what you consider REST APIs? Hopefully these are also affected by @ulfjack fixes to the caching, since this is one of the primary blockers for us.

steren · 2017-03-27T18:05:53Z

Yes, I was underlining the REST API of remote caching but I suppose that the other APIs will not be affected either by @ulfjack's work. He is fixing the underlying implementation.

…exec The new class wraps an existing SpawnRunner and adds remote caching. Ideally, the wrapped runner should be local and sandboxed, but this is not currently enforced. The new class is not hooked up to anything yet. The added test indicates that the RemoteActionCache interface is still more complex than necessary - in particular, we should merge downloadAllResults and downloadBlobs (for stdout/stderr) into a single method, and also change the upload to a single combined method in a similar way instead of two calls. Doing so allows the RemoteActionCache implementation more leeway in how it wants to implement these, potentially improving parallelism and performance. One step towards #1413. PiperOrigin-RevId: 152245644

ulfjack · 2017-04-18T14:39:11Z

Also f5918bf.

I've sent out a change to fix upload of stdout/stderr, which is necessary for test caching. I think that will allow us to close this issue, though there's still more refactoring happening.

Only write a cache entry when the spawn executed successfully, and with a 0 exit code. In the test, we only check that uploadFileContents is called exactly twice. Progress on #1413. PiperOrigin-RevId: 153458240

ulfjack · 2017-04-19T07:52:02Z

We're also missing download of stdout/stderr in the old implementation, and there's some interaction with #2844.

ulfjack · 2017-04-19T11:20:19Z

Sorry, I left in the 'Fixes #1413' in the previous commit description, even though it wasn't complete. I have the last fix for stdout/stderr pending, and then I'll close this.

dfabulich mentioned this issue Jun 16, 2016

Distributed caching: hazelcast_node argument prints WARNING, does nothing #1414

Closed

hermione521 added type: bug P3 We're not considering working on this, but happy to review a PR. (No assignee) category: performance labels Jun 16, 2016

hermione521 added under investigation and removed category: performance P3 We're not considering working on this, but happy to review a PR. (No assignee) type: bug labels Jun 16, 2016

damienmg added type: bug P2 We'll consider working on this in future. (Assignee optional) category: performance and removed under investigation labels Jun 22, 2016

damienmg added this to the 0.5 milestone Jun 22, 2016

ulfjack assigned philwo Jul 8, 2016

rjaquino pushed a commit to Asana/bazel that referenced this issue Aug 31, 2016

Ignore missing output files

1b6722a

Fixes bazelbuild#1413 Change-Id: Ib863a6bab8ebe35b6076f8fbf780c022424c158b Conflicts: src/main/java/com/google/devtools/build/lib/remote/MemcacheActionCache.java

philwo modified the milestones: 0.5, 0.6 Dec 21, 2016

damienmg added category: service APIs and removed category: performance labels Jan 24, 2017

hhclam mentioned this issue Feb 21, 2017

bug: --remote_cache causes test rules for C++ and Java fail even on first run #2560

Closed

steren added the Customer impact label Mar 1, 2017

ulfjack changed the title ~~Distributed caching: java_test targets all fail with distributed caching enabled~~ Tests all fail with distributed caching enabled Mar 14, 2017

ulfjack mentioned this issue Mar 14, 2017

All tests fail with --spawn_strategy=remote #942

Closed

markchua pushed a commit to Asana/bazel that referenced this issue Mar 24, 2017

Ignore missing output files

0657b26

Fixes bazelbuild#1413 Change-Id: Ib863a6bab8ebe35b6076f8fbf780c022424c158b Conflicts: src/main/java/com/google/devtools/build/lib/remote/MemcacheActionCache.java

mwitkow mentioned this issue Apr 10, 2017

Feature request: external dependency caching using remote gRPC protocol #2557

Closed

bazel-io closed this as completed in f4408f6 Apr 19, 2017

ulfjack reopened this Apr 19, 2017

bazel-io closed this as completed in dd99e4b Apr 20, 2017

PoncinMatthieu mentioned this issue Oct 27, 2017

Option --check_tests_up_to_date doesn't work with Remote cache server #3978

Closed

RNabel mentioned this issue Jan 22, 2020

.../test.cache_status (Permission denied) crashes Bazel server #10633

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Tests all fail with distributed caching enabled #1413

Tests all fail with distributed caching enabled #1413

dfabulich commented Jun 16, 2016

hhclam commented Jun 16, 2016

hermione521 commented Jun 16, 2016

dfabulich commented Jun 20, 2016

ulfjack commented Jun 22, 2016

ulfjack commented Jun 22, 2016

damienmg commented Jun 22, 2016

marczych commented Jul 7, 2016

ulfjack commented Jul 8, 2016

dfabulich commented Aug 1, 2016

dfabulich commented Aug 1, 2016

trainman419 commented Oct 26, 2016

kerinin commented Dec 9, 2016

ulfjack commented Jan 13, 2017 •

edited

Loading

ulfjack commented Mar 1, 2017

ulfjack commented Mar 7, 2017

dfabulich commented Mar 10, 2017

ulfjack commented Mar 14, 2017

ulfjack commented Mar 14, 2017 •

edited

Loading

ulfjack commented Mar 14, 2017

ulfjack commented Mar 27, 2017

steren commented Mar 27, 2017

mwitkow commented Mar 27, 2017

steren commented Mar 27, 2017

ulfjack commented Apr 18, 2017

ulfjack commented Apr 19, 2017

ulfjack commented Apr 19, 2017

Tests all fail with distributed caching enabled #1413

Tests all fail with distributed caching enabled #1413

Comments

dfabulich commented Jun 16, 2016

hhclam commented Jun 16, 2016

hermione521 commented Jun 16, 2016

dfabulich commented Jun 20, 2016

ulfjack commented Jun 22, 2016

ulfjack commented Jun 22, 2016

damienmg commented Jun 22, 2016

marczych commented Jul 7, 2016

ulfjack commented Jul 8, 2016

dfabulich commented Aug 1, 2016

dfabulich commented Aug 1, 2016

trainman419 commented Oct 26, 2016

kerinin commented Dec 9, 2016

ulfjack commented Jan 13, 2017 • edited Loading

ulfjack commented Mar 1, 2017

ulfjack commented Mar 7, 2017

dfabulich commented Mar 10, 2017

ulfjack commented Mar 14, 2017

ulfjack commented Mar 14, 2017 • edited Loading

ulfjack commented Mar 14, 2017

ulfjack commented Mar 27, 2017

steren commented Mar 27, 2017

mwitkow commented Mar 27, 2017

steren commented Mar 27, 2017

ulfjack commented Apr 18, 2017

ulfjack commented Apr 19, 2017

ulfjack commented Apr 19, 2017

ulfjack commented Jan 13, 2017 •

edited

Loading

ulfjack commented Mar 14, 2017 •

edited

Loading