Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Option --check_tests_up_to_date doesn't work with Remote cache server #3978

Closed
PoncinMatthieu opened this issue Oct 27, 2017 · 6 comments
Closed
Labels
P3 We're not considering working on this, but happy to review a PR. (No assignee) stale Issues or PRs that are stale (no activity for 30 days) team-Remote-Exec Issues and PRs for the Execution (Remote) team type: feature request

Comments

@PoncinMatthieu
Copy link

Description of the bug:

Bazel option --check_tests_up_to_date doesn't work with the remote cache server.
Also there is no easy way to know if the tests were retrieved from the cache server or if they actually run.
Is this a bug or an intended behavior? Is there any alternative?

Background to the question / context

I have been trying to setup a remote cache server for setting up bazel with jenkins. We have a mono repo with multiple BUILD projects/libs and what I am trying to do is to setup a main jenkins pipeline which would run on every open pull requests and every commits to master. This main pipeline is checking if tests are up to date, if they are not, then we execute a downstream pipeline specific to the project requiring tests. The downstream pipeline run the tests and automatically deploy to staging/dev environments depending on the branch and project.
We need to use the remote cache server because we want the ability to run this pipeline in parallel for every PR. So that we can open 2 PRs at the same time and one doesn't need to wait for the other to run tests. Eventually this will be done from multiple jenkins nodes but for now we simply checkout the code in a different directory for each branches.
This setup works really well with a single server and single branch/workspace since we only need the local cache and the --check_tests_up_to_date works as expected. However this doesn't work when using the remote cache server.

Minimal example to reproduce the problem:

Consider the following repository: https://github.com/dfabulich/hazelcast-junit-test
First start the hazelcast server:

java -jar hazelcast-3.8.jar &

Then run the tests for the first time

$> bazel clean && bazel test :apptest --hazelcast_node=127.0.0.1:5701 --spawn_strategy=remote
INFO: Starting clean (this may take a while). Consider using --async if the clean takes more than several minutes.
INFO: Analysed target //:apptest (19 packages loaded).
INFO: Found 1 test target...
Target //:apptest up-to-date:
  bazel-bin/apptest.jar
  bazel-bin/apptest
INFO: Elapsed time: 14.457s, Critical Path: 12.08s
INFO: Build completed successfully, 12 total actions
//:apptest                                                               PASSED in 10.4s

Executed 1 out of 1 test: 1 test passes.

All good: We can see that the tests take 10s seconds to execute (this is due to a sleep in the tests)

Then we run the tests for a second time, but do a bazel clean to remove the local cache. Checking out the repo on a different path would do the same:

$> bazel clean && bazel test :apptest --hazelcast_node=127.0.0.1:5701 --spawn_strategy=remote
INFO: Starting clean (this may take a while). Consider using --async if the clean takes more than several minutes.
INFO: Analysed target //:apptest (19 packages loaded).
INFO: Found 1 test target...
Target //:apptest up-to-date:
  bazel-bin/apptest.jar
  bazel-bin/apptest
INFO: Elapsed time: 4.531s, Critical Path: 2.44s
INFO: Build completed successfully, 12 total actions
//:apptest                                                               PASSED in 0.0s

Executed 1 out of 1 test: 1 test passes.

Good but could be better: We can see that the tests passed but didn't actually run from the fact that tests took 0 seconds. However it doesn't show the usual string (cached) PASSED

Then doing the same with the argument --check_tests_up_to_date:

$> bazel clean && bazel test :apptest --hazelcast_node=127.0.0.1:5701 --spawn_strategy=remote --check_tests_up_to_date
INFO: Reading 'startup' options from /Users/ovelincompany/Projects/test/hazelcast-junit-test/tools/bazel.rc: --host_jvm_args=-Dbazel.DigestFunction=SHA1
INFO: Reading 'startup' options from /Users/ovelincompany/.bazelrc: --host_jvm_args=-Dbazel.DigestFunction=SHA1
INFO: Found 1 test target...
ERROR: action 'Expanding template apptest' is not up-to-date.
ERROR: action 'Creating source manifest for //:apptest' is not up-to-date.
ERROR: action 'BazelWorkspaceStatusAction stable-status.txt' is not up-to-date.
Target //:apptest failed to build
Use --verbose_failures to see the command lines of failed build steps.
INFO: Elapsed time: 0.986s, Critical Path: 0.00s
//:apptest                                                            NO STATUS

Finished with 0 passing and 0 failing tests up to date, 1 out of date.

Not good: bazel doesn't use the remote cache server and shows NO STATUS .. forcing us to rebuild/retest all of our apps.

Environment info

  • Operating System:
    Linux ubuntu

  • Bazel version (output of bazel info release):
    release 0.7.0

Have you found anything relevant by searching the web?

This issue was useful for me to understand how the remote cache works: #1413

Small note:

Our actual project is mostly python code, but the behaviour is the same as for this java unit test.

@PoncinMatthieu
Copy link
Author

Not sure if it can help solve the issue, but I took latest master and tracked this back to the method result.getData().getRemotelyCached() returning false here:

if (!(result.isCached() || result.getData().getRemotelyCached())) {

I might be wrong but I believe the expected behaviour is for it to return true in the case the test was
taken from the remote cache.

I have zero knowledge on protobuf.. but my wild guess is that this method returns the value from this:

optional bool remotely_cached = 7;

Which is missing from the test result data and probably ought to be there.

@agoulti agoulti self-assigned this Oct 30, 2017
@katre katre added platform: other P2 We'll consider working on this in future. (Assignee optional) type: feature request labels Oct 31, 2017
@ulfjack
Copy link
Contributor

ulfjack commented Oct 31, 2017

  1. I think marking this as feature request is correct. The current design intentionally targets the local cache use case, so this is an intentional change in design.

  2. I'm not sure this is a P2 from our point of view. It certainly seems nice to have, but it's overall more of a performance optimization, isn't it? If you run this in order to decide whether to run the tests, why not just run the tests and get the cache hits for some and non-cache hits for others?

  3. This isn't going to be easy to implement. The current implementation is at an action level and requires local action cache cache hits. Applying it to the remote spawn cache will require a completely different implementation. (Note the difference between "actions" and "spawns".) In some cases, you can't even ask the remote cache for an entry without doing some local work first, e.g., in order to run the Java compiler, Bazel first may have to write a parameter file, and that happens locally (at this time).

Overall, I'm not sure the additional complexity is warranted. On the other hand, we've been considering removing the local action cache in favor of a local spawn cache (for which there is an implementation now), and we probably can't just remove the check_up_to_date and check_tests_up_to_date flags, so we'd need to find a way to implement these flags on top of spawn caches, in which case we might as well make it work in all cases.

@ulfjack ulfjack added P3 We're not considering working on this, but happy to review a PR. (No assignee) and removed P2 We'll consider working on this in future. (Assignee optional) labels Oct 31, 2017
@PoncinMatthieu
Copy link
Author

Hi @ulfjack ,
Thank you for looking into the issue!

  1. From my point of view, as a user, it is hard not to see it as a bug, so maybe there should be a mention on this in the documentation or command line reference. As a first time naive user, I have spent a good fair amount of time reading the doc and I expected this option to work on both local and remote cache. And again as a user, I can't see a good reason for such option to work only on the local cache.
  2. I understand this comes off as a performance optimisation, but from our use case I would argue that this feature is very important to us. Though I am quite surprised we seem to be the only ones trying to use this option with the remote cache on CI.. so maybe we are trying to use bazel with jenkins the wrong way? Maybe there are better ways? I guess Github is not the best place for discussing/explaining our use case so I will create a new thread to talk about this on bazel-discuss.
  3. That’s a fair point, I don’t understand very well the difference between spawn and action but it does sound like it would require a lot of changes.

I think anyone using those flags on the local cache will eventually want to be using them on the remote cache.. so either this should be implemented or the flags simply shouldn't exist to avoid this usage pattern?

I'm not sure if I could but with some guidance I would be very happy to try to contribute to implementing those flags for the remote cache.

@agoulti agoulti removed their assignment Nov 1, 2017
@PoncinMatthieu
Copy link
Author

In case anyone else is trying to do something similar and is blocked by this, I would recommend having a look at querying the graph instead of using the cache.
With the following script showing how to check for code changed between your branch and master and run only required tests: https://github.com/bazelbuild/bazel/blob/master/scripts/ci/ci.sh

For more info, read this discussion: https://groups.google.com/forum/#!starred/bazel-discuss/FVLLarcC5hA

@buchgr
Copy link
Contributor

buchgr commented Feb 5, 2019

I wasn't aware that we had this flag. The general idea seems useful for remote caching / execution to check which actions are not cached remotely (and maybe why). fyi @agoulti

@buchgr buchgr added team-Remote-Exec Issues and PRs for the Execution (Remote) team and removed category: remote execution / caching platform: other labels Feb 5, 2019
@sgowroji sgowroji added the stale Issues or PRs that are stale (no activity for 30 days) label Feb 17, 2023
@sgowroji
Copy link
Member

Hi there! We're doing a clean up of old issues and will be closing this one. Please reopen (or ping me to reopen) if you’d like to discuss anything further. We’ll respond as soon as we have the bandwidth/resources to do so.

@sgowroji sgowroji closed this as not planned Won't fix, can't repro, duplicate, stale Feb 17, 2023
avagin added a commit to google/gvisor that referenced this issue May 10, 2024
Bazel lacks a method to verify if test results are current in the cache.
Additional details can be found at:
bazelbuild/bazel#3978.

To address this limitation, let's change the test runner so that it
doesn't run tests and exits with a non-zero code immediately if the
$TEST_CHECK_UP_TO_DATE_FILE file exists. To check test results in the
cache, we need to create the $TEST_CHECK_UP_TO_DATE_FILE and run tests.
Successful test execution indicates that the results are up-to-date in
the cache.

Signed-off-by: Andrei Vagin <avagin@google.com>
avagin added a commit to google/gvisor that referenced this issue May 10, 2024
Bazel lacks a method to verify if test results are current in the cache.
Additional details can be found at:
bazelbuild/bazel#3978.

To address this limitation, let's change the test runner so that it
doesn't run tests and exits with a non-zero code immediately if the
$TEST_CHECK_UP_TO_DATE_FILE file exists. To check test results in the
cache, we need to create the $TEST_CHECK_UP_TO_DATE_FILE and run tests.
Successful test execution indicates that the results are up-to-date in
the cache.

Signed-off-by: Andrei Vagin <avagin@google.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
P3 We're not considering working on this, but happy to review a PR. (No assignee) stale Issues or PRs that are stale (no activity for 30 days) team-Remote-Exec Issues and PRs for the Execution (Remote) team type: feature request
Projects
None yet
Development

No branches or pull requests

7 participants