Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Tests all fail with distributed caching enabled #1413

Closed
dfabulich opened this issue Jun 16, 2016 · 26 comments
Closed

Tests all fail with distributed caching enabled #1413

dfabulich opened this issue Jun 16, 2016 · 26 comments
Assignees
Labels
P2 We'll consider working on this in future. (Assignee optional) type: bug
Milestone

Comments

@dfabulich
Copy link
Contributor

Consider this repository. https://github.com/dfabulich/hazelcast-junit-test

It has two targets, app and apptest.

java_library(
    name='app',
    srcs=glob(['src/main/java/**']),
)

java_test(
    name='apptest',
    srcs=glob(['src/test/java/**']),
    test_class='com.redfin.AppTest',
    size="small",
    deps=['app', '@junit//jar', '@hamcrest_core//jar'],
)

The test is a trivial test that always passes.

package com.redfin;

public class App {
    public String go() {
        return "hello";
    }
}
package com.redfin;

import org.junit.Test;

public class AppTest {

    @Test
    public void testGo() throws Exception {
        new App().go();
    }
}

If you run bazel test :apptest, the test succeeds.

Now run it with distributed caching enabled, following the instructions here: https://raw.githubusercontent.com/bazelbuild/bazel/79adf59e2973754c8c0415fcab45cd58c7c34697/src/main/java/com/google/devtools/build/lib/remote/README.md

java -jar hazelcast-3.5.4.jar &

bazel clean && bazel test :apptest --hazelcast_node=127.0.0.1:5701 --spawn_strategy=remote

The test now fails when it should succeed. The test log shows no apparent problem.

$ bazel clean && bazel test :apptest --hazelcast_node=127.0.0.1:5701 --spawn_strategy=remote
INFO: Starting clean (this may take a while). Consider using --expunge_async if the clean takes more than several minutes.
INFO: Found 1 test target...
FAIL: //:apptest (see /private/var/tmp/_bazel_dfabulich/bc91e687c045d0f8011faa3e2c77b2b4/execroot/hazelcast-junit-test/bazel-out/local-fastbuild/testlogs/apptest/test.log).
Target //:apptest up-to-date:
  bazel-bin/apptest.jar
  bazel-bin/apptest
INFO: Elapsed time: 8.864s, Critical Path: 5.65s
//:apptest                                                               FAILED in 5.3s
  /private/var/tmp/_bazel_dfabulich/bc91e687c045d0f8011faa3e2c77b2b4/execroot/hazelcast-junit-test/bazel-out/local-fastbuild/testlogs/apptest/test.log

Executed 1 out of 1 test: 1 fails locally.
$ more /private/var/tmp/_bazel_dfabulich/bc91e687c045d0f8011faa3e2c77b2b4/execroot/hazelcast-junit-test/bazel-out/local-fastbuild/testlogs/apptest/test.log
exec ${PAGER:-/usr/bin/less} "$0" || exit 1
-----------------------------------------------------------------------------
JUnit4 Test Runner
.
Time: 5.053

OK (1 test)


BazelTestRunner exiting with a return value of 0
JVM shutdown hooks (if any) will run now.
The JVM will exit once they complete.

-- JVM shutdown starting at 2016-06-16 03:17:21 --
@hhclam
Copy link
Contributor

hhclam commented Jun 16, 2016

Thanks for the report. I'll look into this.

@hermione521 hermione521 added type: bug P3 We're not considering working on this, but happy to review a PR. (No assignee) category: performance labels Jun 16, 2016
@hermione521
Copy link
Contributor

Also cc @philwo

@hermione521 hermione521 added under investigation and removed category: performance P3 We're not considering working on this, but happy to review a PR. (No assignee) type: bug labels Jun 16, 2016
@dfabulich
Copy link
Contributor Author

I experimented a bit with this in the debugger.

There's an exception being thrown in RemoteSpawnStrategy at line 165, when it attempts to add the output files to the remoteActionCache.

java.io.FileNotFoundException: /private/var/tmp/_bazel_dfabulich/bc91e687c045d0f8011faa3e2c77b2b4/execroot/hazelcast-junit-test/bazel-out/local-fastbuild/testlogs/apptest/test.cache_status (No such file or directory)

java.io.FileNotFoundException: /private/var/tmp/_bazel_dfabulich/bc91e687c045d0f8011faa3e2c77b2b4/execroot/hazelcast-junit-test/bazel-out/local-fastbuild/testlogs/apptest/test.cache_status (No such file or directory)
    at com.google.devtools.build.lib.unix.NativePosixFiles.md5sumAsBytes(Native Method)
    at com.google.devtools.build.lib.unix.NativePosixFiles.md5sum(NativePosixFiles.java:442)
    at com.google.devtools.build.lib.vfs.UnixFileSystem.getMD5Digest(UnixFileSystem.java:609)
    at com.google.devtools.build.lib.vfs.Path.getMD5Digest(Path.java:982)
    at com.google.devtools.build.lib.remote.MemcacheActionCache.putFileIfNotExist(MemcacheActionCache.java:59)
    at com.google.devtools.build.lib.remote.MemcacheActionCache.addToActionOutput(MemcacheActionCache.java:161)
    at com.google.devtools.build.lib.remote.MemcacheActionCache.putActionOutput(MemcacheActionCache.java:136)
    at com.google.devtools.build.lib.remote.RemoteSpawnStrategy.exec(RemoteSpawnStrategy.java:165)
    at com.google.devtools.build.lib.rules.test.StandaloneTestStrategy.execute(StandaloneTestStrategy.java:185)
    at com.google.devtools.build.lib.rules.test.StandaloneTestStrategy.exec(StandaloneTestStrategy.java:135)
    at com.google.devtools.build.lib.rules.test.ExclusiveTestStrategy.exec(ExclusiveTestStrategy.java:42)
    at com.google.devtools.build.lib.rules.test.TestRunnerAction.execute(TestRunnerAction.java:529)
    at com.google.devtools.build.lib.skyframe.SkyframeActionExecutor.executeActionTask(SkyframeActionExecutor.java:760)
    at com.google.devtools.build.lib.skyframe.SkyframeActionExecutor.prepareScheduleExecuteAndCompleteAction(SkyframeActionExecutor.java:699)
    at com.google.devtools.build.lib.skyframe.SkyframeActionExecutor.access$800(SkyframeActionExecutor.java:102)
    at com.google.devtools.build.lib.skyframe.SkyframeActionExecutor$ActionRunner.call(SkyframeActionExecutor.java:589)
    at com.google.devtools.build.lib.skyframe.SkyframeActionExecutor$ActionRunner.call(SkyframeActionExecutor.java:551)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at com.google.devtools.build.lib.skyframe.SkyframeActionExecutor.executeAction(SkyframeActionExecutor.java:373)
    at com.google.devtools.build.lib.skyframe.ActionExecutionFunction.checkCacheAndExecuteIfNeeded(ActionExecutionFunction.java:387)
    at com.google.devtools.build.lib.skyframe.ActionExecutionFunction.compute(ActionExecutionFunction.java:174)
    at com.google.devtools.build.skyframe.ParallelEvaluator$Evaluate.run(ParallelEvaluator.java:972)
    at com.google.devtools.build.lib.concurrent.AbstractQueueVisitor$WrappedRunnable.run(AbstractQueueVisitor.java:532)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)

This thrown exception causes the test to be marked as a failure.

test.cache_status is one of the outputs declared in TestActionBuilder.java, but test.cache_status isn't built by the action itself; it's built by StandaloneTestStrategy on line 137, two lines after it executes the action, when it posts the test result.

      at com.google.devtools.build.lib.rules.test.TestRunnerAction.saveCacheStatus(TestRunnerAction.java:232)
      at com.google.devtools.build.lib.rules.test.TestStrategy.postTestResult(TestStrategy.java:287)
      at com.google.devtools.build.lib.rules.test.StandaloneTestStrategy.finalizeTest(StandaloneTestStrategy.java:256)
      at com.google.devtools.build.lib.rules.test.StandaloneTestStrategy.exec(StandaloneTestStrategy.java:137)

This bug seems like a problem with the architecture of test actions. On the one hand, I would not expect it to be OK for test targets to declare outputs that their actions don't generate. On the other hand, the test process itself can't be trusted; the success/failure of the run is a fact we observe about running the runner, so I don't think we can/should trust the test process to manage its own cache status.

(How does this ever work in Blaze?!)

@ulfjack
Copy link
Contributor

ulfjack commented Jun 22, 2016

Blaze has a completely different implementation of TestStrategy, which doesn't go through SpawnStrategy, but instead talks to the remote execution system through another internal API (which is only used for this purpose). It manually overrides the set of output files, which implicitly removes test.cache_status (which is written by Blaze / Bazel after the test process finishes). I.e., test.cache_status is not cached remotely, but is reconstructed locally on cache hits.

TBH, the code is a bit of a mess.

My preferred solution would be to rewrite / refactor large chunks of this code to merge the (currently separate) implementations, and go through SpawnStrategy internally as well. However, this might be tricky. IIRC, we use the test.cache_status file to implement a custom caching policy, which is to cache test passes, but not failures.

As a temporary workaround, maybe we could remove the test.cache_status from the output files of the spawn. That seems like it should fix this, even if it's not optimal.

@ulfjack
Copy link
Contributor

ulfjack commented Jun 22, 2016

Ideally, we'd remove the test.cache_status files. I seem to remember discussing that with Han-wen back when he wrote the StandaloneTestStrategy. I think the proposal was to represent that in Skyframe / add a bit to the local action cache, but Skyframe wasn't fully rolled out at that time, so we would've had to add a workaround for the non-Skyframe code path.

@damienmg damienmg added type: bug P2 We'll consider working on this in future. (Assignee optional) category: performance and removed under investigation labels Jun 22, 2016
@damienmg damienmg added this to the 0.5 milestone Jun 22, 2016
@damienmg
Copy link
Contributor

We need to fix that in the next quarter.

@marczych
Copy link
Contributor

marczych commented Jul 7, 2016

I'm having the same issue. Is there any workaround other than not using distributed caching when running tests?

@ulfjack
Copy link
Contributor

ulfjack commented Jul 8, 2016

Not that I know of, unfortunately. Philipp is having similar problems with sandboxing, and may cover this use case as well.

@dfabulich
Copy link
Contributor Author

I tried just removing the line to results.add(cacheStatus); in TestActionBuilder, but that caused tests to be marked "skipped" with NO STATUS.

So instead I tried updating ConcurrentMapActionCache to ignore missing files. That nearly worked, because RemoteSpawnStrategy already refuses to cache outputs when the action fails, but I ran into issue #1593, which means that runfiles aren't being incorporated into the cache key.

That's a significantly bigger issue; it specifically applies to this bug because the test jars are just runfiles for the test launcher script; when those runfiles are updated, the cache key isn't getting updated, and thus the "success" result isn't invalidated when the test inputs change.

@dfabulich
Copy link
Contributor Author

Figured out a fix for #1593, so here's a PR for #1413 (ignore missing output files) https://bazel-review.googlesource.com/4231

rjaquino pushed a commit to Asana/bazel that referenced this issue Aug 31, 2016
Fixes bazelbuild#1413

Change-Id: Ib863a6bab8ebe35b6076f8fbf780c022424c158b

Conflicts:
	src/main/java/com/google/devtools/build/lib/remote/MemcacheActionCache.java
@trainman419
Copy link

I'm using bazel 0.3.1 and ran into this recently. What's the status on the fix for this?

@kerinin
Copy link

kerinin commented Dec 9, 2016

I'm experiencing the same failure with golang tests and bazel 0.4.1

@philwo philwo modified the milestones: 0.5, 0.6 Dec 21, 2016
@ulfjack
Copy link
Contributor

ulfjack commented Jan 13, 2017

I've made some progress on refactoring the test strategies, and a stack of further changes to unify some code between all implementations, but the end of the tunnel is not in sight yet.

@ulfjack
Copy link
Contributor

ulfjack commented Mar 1, 2017

I'm still working on this. More progress on the test strategies, and I'm preparing to open source some more of the code we're using internally. The code is incredibly subtle, and I've had to roll back several times because I overlooked some corner cases. I have a good idea of the model I'm shooting for, but it still requires significant refactoring work.

bazel-io pushed a commit that referenced this issue Mar 7, 2017
All spawn strategies already treat all normal outputs as optional. Bazel checks
at the action level whether all action outputs are created, but does not check
at the spawn level. Spawn.getOptionalOutputs is therefore unnecessary, and
removed in this change.

The only place where this was set was in StandaloneTestStrategy, which now
specifies the full set of outputs, which is now computed by TestRunnerAction.
The internal test strategy implementations are also updated in this change.

While I'm at it, also remove the use of BaseSpawn and use SimpleSpawn instead.

This may go some way towards fixing #1413 and #942.

--
PiperOrigin-RevId: 149397100
MOS_MIGRATED_REVID=149397100
@ulfjack
Copy link
Contributor

ulfjack commented Mar 7, 2017

I wonder if the change above fixed this. Can someone try with head?

@dfabulich
Copy link
Contributor Author

I think distributed test caching is just not working in head revision 413415c, so it "passes" by not using caching. e.g. when I add a 10s sleep to the test and run/re-run it with bazel clean && bazel test :apptest it takes 10s every time, when it should be using the cache.

I've updated the https://github.com/dfabulich/hazelcast-junit-test test repo; it should reproduce out of the box on head like this:

java -jar hazelcast-3.8.jar &
bazel clean && bazel test :apptest
bazel clean && bazel test :apptest

@ulfjack
Copy link
Contributor

ulfjack commented Mar 14, 2017

The way it's implemented right now, it only looks for a cache hit if remote execution is enabled. It also does not write local results to the remote cache.

@ulfjack
Copy link
Contributor

ulfjack commented Mar 14, 2017

There's simply no code to combine local execution with remote caching. It's not hard to implement, which I did in a few minutes. It doesn't work for tests yet, because we're not uploading stdout/stderr. I'm also reluctant to merge this change into master because it's not how I want this to work.

My change is here:
https://github.com/ulfjack/bazel/tree/remote-caching

The failure shown in the first post on this bug is most likely because the code is swallowing certain exceptions. Note the debug code in my branch where I'm calling e.printStackTrace().

@ulfjack
Copy link
Contributor

ulfjack commented Mar 14, 2017

I think what happened is that we rewrote RemoteSpawnStrategy in a way that dropped the remote caching part. Previously it was failing because it was trying to upload non-existent files - it tried to upload all files declared on Spawn, but the files declared are all optional, and - for tests - several are rarely generated. When re-implemented remote caching, I got the same error and changed it to only upload existing files, which fixed that part.

@ulfjack ulfjack changed the title Distributed caching: java_test targets all fail with distributed caching enabled Tests all fail with distributed caching enabled Mar 14, 2017
markchua pushed a commit to Asana/bazel that referenced this issue Mar 24, 2017
Fixes bazelbuild#1413

Change-Id: Ib863a6bab8ebe35b6076f8fbf780c022424c158b

Conflicts:
	src/main/java/com/google/devtools/build/lib/remote/MemcacheActionCache.java
@ulfjack
Copy link
Contributor

ulfjack commented Mar 27, 2017

I've made some progress on the underlying infrastructure, and I just put a change together this morning to reimplement remote caching in the new infrastructure. However, I don't have tests yet, and the code also still needs to be hooked up in the right way. It's probably still a few weeks before this is fixed.

@steren
Copy link
Contributor

steren commented Mar 27, 2017

The REST remote caching API will not change and after Ulf's changes, it will be called when building with Bazel.

@mwitkow
Copy link

mwitkow commented Mar 27, 2017

@steren @ulfjack can you please clarify?

We're interested in the remote_protocol.proto interfaces. Is that what you consider REST APIs? Hopefully these are also affected by @ulfjack fixes to the caching, since this is one of the primary blockers for us.

@steren
Copy link
Contributor

steren commented Mar 27, 2017

Yes, I was underlining the REST API of remote caching but I suppose that the other APIs will not be affected either by @ulfjack's work. He is fixing the underlying implementation.

bazel-io pushed a commit that referenced this issue Apr 5, 2017
…exec

The new class wraps an existing SpawnRunner and adds remote caching. Ideally,
the wrapped runner should be local and sandboxed, but this is not currently
enforced. The new class is not hooked up to anything yet.

The added test indicates that the RemoteActionCache interface is still more
complex than necessary - in particular, we should merge downloadAllResults and
downloadBlobs (for stdout/stderr) into a single method, and also change the
upload to a single combined method in a similar way instead of two calls. Doing
so allows the RemoteActionCache implementation more leeway in how it wants to
implement these, potentially improving parallelism and performance.

One step towards #1413.

PiperOrigin-RevId: 152245644
@ulfjack
Copy link
Contributor

ulfjack commented Apr 18, 2017

Also f5918bf.

I've sent out a change to fix upload of stdout/stderr, which is necessary for test caching. I think that will allow us to close this issue, though there's still more refactoring happening.

bazel-io pushed a commit that referenced this issue Apr 18, 2017
Only write a cache entry when the spawn executed successfully, and with a 0
exit code. In the test, we only check that uploadFileContents is called exactly
twice.

Progress on #1413.

PiperOrigin-RevId: 153458240
@ulfjack
Copy link
Contributor

ulfjack commented Apr 19, 2017

We're also missing download of stdout/stderr in the old implementation, and there's some interaction with #2844.

@ulfjack ulfjack reopened this Apr 19, 2017
@ulfjack
Copy link
Contributor

ulfjack commented Apr 19, 2017

Sorry, I left in the 'Fixes #1413' in the previous commit description, even though it wasn't complete. I have the last fix for stdout/stderr pending, and then I'll close this.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
P2 We'll consider working on this in future. (Assignee optional) type: bug
Projects
None yet
Development

No branches or pull requests