
Bazel hangs when using GRPC cache #10731

Closed

rmaz opened this issue Feb 7, 2020 · 10 comments

Labels: team-Remote-Exec (Issues and PRs for the Execution (Remote) team), untriaged

Comments

rmaz commented Feb 7, 2020

Description of the problem / feature request:

Bazel sometimes hangs when using a remote gRPC cache. The console / BEP output shows:

11:10:56 INFO: From Bundling, processing and signing RewardsRiderTests.__internal__.__test_bundle:
11:10:56 bazel-out/darwin-dbg/bin/apps/app/RewardsRider/RewardsRiderTests.__internal__.__test_bundle_archive/RewardsRiderTests.xctest/Frameworks/Cronet.framework: replacing existing signature
11:10:56 bazel-out/darwin-dbg/bin/apps/app/RewardsRider/RewardsRiderTests.__internal__.__test_bundle_archive/RewardsRiderTests.xctest/Frameworks/TensorFlowWrapper.framework: replacing existing signature
11:11:05 INFO: From Bundling, processing and signing TripInstructionsTests.__internal__.__test_bundle:
11:11:05 bazel-out/darwin-dbg/bin/apps/app/TripInstructions/TripInstructionsTests.__internal__.__test_bundle_archive/TripInstructionsTests.xctest/Frameworks/Cronet.framework: replacing existing signature
11:11:05 bazel-out/darwin-dbg/bin/apps/app/TripInstructions/TripInstructionsTests.__internal__.__test_bundle_archive/TripInstructionsTests.xctest/Frameworks/TensorFlowWrapper.framework: replacing existing signature
11:11:08 [32,749 / 32,939] Compiling Swift module Safety; 3036s local, remote-cache ... (3 actions running)
11:19:29 [32,750 / 32,939] Compiling Swift module Safety; 3537s local, remote-cache
11:29:05 [32,750 / 32,939] Compiling Swift module Safety; 4112s local, remote-cache
11:40:07 [32,750 / 32,939] Compiling Swift module Safety; 4775s local, remote-cache
11:52:49 [32,750 / 32,939] Compiling Swift module Safety; 5536s local, remote-cache
12:07:25 [32,750 / 32,939] Compiling Swift module Safety; 6412s local, remote-cache
12:24:12 [32,750 / 32,939] Compiling Swift module Safety; 7419s local, remote-cache
12:43:30 [32,750 / 32,939] Compiling Swift module Safety; 8578s local, remote-cache

Bugs: what's the simplest, easiest way to reproduce this bug? Please provide a minimal example if possible.

bazel build --remote_timeout=300 --remote_download_minimal --remote_upload_results=true --remote_cache=grpcs://remote.cache:443

What operating system are you running Bazel on?

macOS 10.14.6

What's the output of bazel info release?

release 2.0.0

Have you found anything relevant by searching the web?

#5112

Any other information, logs, or outputs that you want to share?

Stack trace of the bazel server while stuck:

stack.txt

dslomov added the team-Remote-Exec and untriaged labels on Feb 10, 2020
AlessandroPatti (Contributor) commented

/cc @buchgr

buchgr (Contributor) commented Feb 13, 2020

Does it hang "forever" or terminate after a long period of time?

buchgr (Contributor) commented Feb 13, 2020

There is a known problem: if the TCP connection dies, Bazel might not be able to detect it for ~20 minutes or so. We have a proposed solution but never got around to implementing it: https://docs.google.com/document/d/13vXJTFHJX_hnYUxKdGQ2puxbNM3uJdZcEO5oNGP-Nnc/edit
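
For illustration, here is a minimal sketch of the kind of client-side keepalive that document proposes, written against the gRPC-Java channel API. The endpoint, port, and intervals below are made-up example values, not Bazel's actual configuration:

import java.util.concurrent.TimeUnit;

import io.grpc.ManagedChannel;
import io.grpc.netty.NettyChannelBuilder;

public class KeepaliveSketch {
  public static void main(String[] args) throws InterruptedException {
    // Hypothetical cache endpoint; Bazel would use the --remote_cache host.
    ManagedChannel channel =
        NettyChannelBuilder.forAddress("remote.cache", 443)
            .useTransportSecurity()
            // Periodically send HTTP/2 PING frames so a silently dead TCP
            // connection is noticed after roughly keepAliveTime plus
            // keepAliveTimeout, instead of hanging until the OS-level TCP
            // timeout fires.
            .keepAliveTime(30, TimeUnit.SECONDS)
            .keepAliveTimeout(20, TimeUnit.SECONDS)
            .build();
    // ... issue cache RPCs over `channel` here ...
    channel.shutdown();
    channel.awaitTermination(5, TimeUnit.SECONDS);
  }
}

With keepalives like this, a cache write stuck on a dead connection would fail quickly and could be retried, rather than leaving the action pinned in the remote-cache state shown above.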

AlessandroPatti (Contributor) commented Feb 13, 2020

In our case it seems to be hanging forever, or at least far more than 20 minutes.

kastiglione (Contributor) commented

We now have evidence of this happening to us as well. It's not too frequent, thankfully. But when it hits, people cancel the build well before waiting 20min.

kastiglione (Contributor) commented

Here's a JVM dump for our case: https://gist.github.com/kastiglione/151a3b5723daef7e968610ee38c8d569

kastiglione (Contributor) commented

What is the reason that a timeout isn't happening for this?

kastiglione (Contributor) commented

cc @artem-zinnatullin

rmaz (Author) commented Mar 11, 2020

Small update: this only occurs for us with writes; if we set the client to read-only, we never end up in this deadlocked state. The frontend causing the issues is nginx; when we switch to a different frontend that uses HAProxy, we do not see any issues.
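
For reference, the read-only workaround presumably amounts to the repro invocation above with uploads disabled, e.g.:

bazel build --remote_timeout=300 --remote_download_minimal --remote_upload_results=false --remote_cache=grpcs://remote.cache:443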

coeuvre (Member) commented Dec 9, 2020

This issue should be fixed. Please check #11782 for more details. Closing.

coeuvre closed this as completed on Dec 9, 2020