Grpc update breaks with Bazel 3.4.0 in docker container #11756

Closed
katre opened this issue Jul 13, 2020 · 17 comments
Labels
P1 I'll work on this now. (Assignee required)

Comments

@katre
Member

katre commented Jul 13, 2020

Error log:
https://buildkite.com/bazel/google-bazel-presubmit/builds/36688#7500893b-351a-4549-86ae-19803c73cbd1

@katre katre self-assigned this Jul 13, 2020
@katre
Member Author

katre commented Jul 13, 2020

It looks like something changed so that the Bazel frontend is unable to communicate with the server inside the Docker container used for RBE autoconfig.

@keith
Member

keith commented Jul 13, 2020

We see this too with our projects and RBE and 3.4.0:


Starting local Bazel server and connecting to it...
... still trying to connect to local Bazel server after 10 seconds ...
... still trying to connect to local Bazel server after 20 seconds ...
... still trying to connect to local Bazel server after 30 seconds ...
... still trying to connect to local Bazel server after 40 seconds ...
... still trying to connect to local Bazel server after 50 seconds ...
... still trying to connect to local Bazel server after 60 seconds ...
... still trying to connect to local Bazel server after 70 seconds ...
... still trying to connect to local Bazel server after 80 seconds ...
... still trying to connect to local Bazel server after 90 seconds ...
... still trying to connect to local Bazel server after 100 seconds ...
... still trying to connect to local Bazel server after 110 seconds ...
FATAL: couldn't connect to server (2502) after 120 seconds.

bazel-io pushed a commit that referenced this issue Jul 13, 2020
Bazel 3.4.0 broke something and now rbe_autoconfig fails somewhere inside Docker where it tries to start Bazel: https://buildkite.com/bazel/google-bazel-presubmit/builds/36688#262e09df-19ea-47cc-96bf-3c5a4e2d7352

This CL should be rolled back as soon as we have figured out how to fix this.

The breakage is tracked in: #11756

RELNOTES: None.
PiperOrigin-RevId: 320977135
@katre
Member Author

katre commented Jul 13, 2020

Yes, we're trying to diagnose and get out a patch as fast as we can.

@nathanhleung

Same issue here in a Docker runner on CircleCI:

jobs:
  build:
    docker:
      - image: circleci/node:12.16.1
        environment:
          NODE_OPTIONS: --max_old_space_size=4096
    resource_class: medium+

Extracting Bazel installation...
Starting local Bazel server and connecting to it...
... still trying to connect to local Bazel server after 10 seconds ...
... still trying to connect to local Bazel server after 20 seconds ...
... still trying to connect to local Bazel server after 30 seconds ...
... still trying to connect to local Bazel server after 40 seconds ...
... still trying to connect to local Bazel server after 50 seconds ...
... still trying to connect to local Bazel server after 60 seconds ...
... still trying to connect to local Bazel server after 70 seconds ...
... still trying to connect to local Bazel server after 80 seconds ...
... still trying to connect to local Bazel server after 90 seconds ...
... still trying to connect to local Bazel server after 100 seconds ...
... still trying to connect to local Bazel server after 110 seconds ...
FATAL: couldn't connect to server (1521) after 120 seconds.
Makefile:5: recipe for target 'build' failed
make: *** [build] Error 37

@meisterT meisterT added the "release blocker" and "P1 I'll work on this now. (Assignee required)" labels Jul 13, 2020
@philwo
Member

philwo commented Jul 13, 2020

@nathanhleung Can you confirm that in your case no remote execution or rbe_autoconfig stuff is involved and this happens just when running Bazel 3.4.0 inside the Docker container on CircleCI?

@meisterT
Member

Repro without RBE:

docker run --rm -i -t l.gcr.io/google/rbe-ubuntu16-04@sha256:5464e3e83dc656fc6e4eae6a01f5c2645f1f7e95854b3802b85e86484132d90e bash

# wget https://github.com/bazelbuild/bazelisk/releases/download/v1.3.0/bazelisk-linux-amd64
# chmod +x bazelisk-linux-amd64
# touch WORKSPACE
# ./bazelisk-linux-amd64 info
Starting local Bazel server and connecting to it...
... still trying to connect to local Bazel server after 10 seconds ...
... still trying to connect to local Bazel server after 20 seconds ...
... still trying to connect to local Bazel server after 30 seconds ...
... still trying to connect to local Bazel server after 40 seconds ...
... still trying to connect to local Bazel server after 50 seconds ...
... still trying to connect to local Bazel server after 60 seconds ...
... still trying to connect to local Bazel server after 70 seconds ...

@meisterT
Member

culprit seems to be dfbf87c

@nathanhleung

> @nathanhleung Can you confirm that in your case no remote execution or rbe_autoconfig stuff is involved and this happens just when running Bazel 3.4.0 inside the Docker container on CircleCI?

Not sure how to confirm, but this is our Bazel install step:

install_bazel:
    # From https://docs.bazel.build/versions/master/install-ubuntu.html
    steps:
      - run: |
          sudo apt install curl gnupg apt-transport-https
          curl https://bazel.build/bazel-release.pub.gpg | sudo apt-key add -
          echo "deb [arch=amd64] https://storage.googleapis.com/bazel-apt \
            stable jdk1.8" | sudo tee /etc/apt/sources.list.d/bazel.list
      - run: sudo apt-get update && sudo apt-get -y install bazel

.bazelrc generated by this script:

#!/bin/bash

# Exit on error
set -e

echo "# Generated by ./scripts/generate_bazelrc.sh" > .bazelrc
echo "build --remote_cache=https://$BAZEL_CACHE_USER:$BAZEL_CACHE_PASSWORD@cache.company.xyz" >> .bazelrc

(nothing else)

And build command:

bazel build //src/server_bin_deploy.jar

@meisterT meisterT changed the title from "RBE Tests in CI fail with Bazel 3.4.0" to "Grpc update breaks with Bazel 3.4.0 in docker container" Jul 13, 2020
@meisterT
Member

We have confirmed that 0415511 fixes the issue

@meteorcloudy
Member

To avoid conflict, we should also cherry-pick 08bf906

@benjaminp
Collaborator

(My observation has been that enabling IPv6 in the container fixes this issue.)
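
A quick way to check whether a given container is in the affected state is to look for the IPv6 loopback address. The following is only a sketch, assuming a Linux container where `/proc/net/if_inet6` lists the configured IPv6 addresses; `has_ipv6_loopback` is a hypothetical helper name and not part of Bazel or Docker.

```shell
#!/bin/bash
# Sketch: report whether the IPv6 loopback address (::1) is configured.
# Assumption: a Linux container, where /proc/net/if_inet6 lists configured
# IPv6 addresses and is missing or has no ::1 entry when IPv6 is disabled.

# has_ipv6_loopback is a hypothetical helper, not part of Bazel or Docker.
# The optional argument overrides the table path, which is useful for testing.
has_ipv6_loopback() {
  local table="${1:-/proc/net/if_inet6}"
  # In this table ::1 is written as 31 zeros followed by a single 1.
  [ -r "$table" ] && grep -q '^00000000000000000000000000000001 ' "$table"
}

if has_ipv6_loopback; then
  echo "IPv6 loopback (::1) is available"
else
  echo "IPv6 loopback (::1) is not available"
fi
```

If the script reports that `::1` is not available, the container matches the condition described in the comment above, although that alone does not prove the Bazel hang has the same cause.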

@meisterT
Member

Patch release can be found here: https://releases.bazel.build/3.4.1/rc1/index.html

@philwo
Member

philwo commented Jul 13, 2020

Downstream pipeline for Bazel 3.4.1rc1 running here: https://buildkite.com/bazel/bazel-at-head-plus-downstream/builds/1564

@meisterT
Member

The bug is mitigated and Bazel 3.4.1 is released. Assigning to Yun who offered to look into creating a test (and probably has an incentive to roll forward the Grpc update).
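
For setups that launch Bazel through Bazelisk, as in the repro earlier in this thread, one way to make sure 3.4.0 is never selected is to pin the workspace to the patch release. This is a minimal sketch, assuming the script runs from the workspace root; `pin_bazel_version` is a hypothetical helper name, while the `.bazelversion` file itself is what Bazelisk reads.

```shell
#!/bin/bash
# Sketch: pin a workspace to a fixed Bazel release via Bazelisk's
# .bazelversion file, so that Bazelisk does not download 3.4.0.

# pin_bazel_version is a hypothetical helper name used for illustration.
# Arguments: the version to pin, and the workspace root (defaults to ".").
pin_bazel_version() {
  local version="$1"
  local workspace_dir="${2:-.}"
  printf '%s\n' "$version" > "$workspace_dir/.bazelversion"
}

# Pin the current directory, assumed to be the workspace root.
pin_bazel_version 3.4.1
cat .bazelversion
```

For a one-off run, setting the `USE_BAZEL_VERSION=3.4.1` environment variable for Bazelisk should have the same effect without touching the workspace. Neither approach helps when Bazel is installed directly from the apt repository, where upgrading the package is the fix.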

@meisterT meisterT assigned meteorcloudy and unassigned meisterT Jul 14, 2020
@meteorcloudy
Member

FYI @olekw

bazel-io pushed a commit that referenced this issue Jul 14, 2020
*** Reason for rollback ***

Bazel 3.4.1 is released and fixes the issue.

*** Original change description ***

Disable tests on RBE in Bazel's presubmit and postsubmit.

Bazel 3.4.0 broke something and now rbe_autoconfig fails somewhere inside Docker where it tries to start Bazel: https://buildkite.com/bazel/google-bazel-presubmit/builds/36688#262e09df-19ea-47cc-96bf-3c5a4e2d7352

This CL should be rolled back as soon as we have figured out how to fix this.

The breakage is tracked in: #11756

RELNOTES: None.
PiperOrigin-RevId: 321120981
@katre
Member Author

katre commented Jul 14, 2020

This is now fixed with the release of Bazel 3.4.1.

@katre katre closed this as completed Jul 14, 2020
@benjaminp
Collaborator

BTW, netty/netty#10402 is the underlying issue. It could be worked around by not attempting to bind [::1] if io.netty.util.NetUtil.isIpV4StackPreferred() returns true.

@katre katre reopened this Jul 14, 2020
bazel-io pushed a commit that referenced this issue Jul 15, 2020
This is a workaround for the Netty bug netty/netty#10402, which caused the rollback of the grpc-java upgrade to 1.26.0 (#11756).

Closes #11776.

PiperOrigin-RevId: 321342799
meteorcloudy added a commit to meteorcloudy/bazel that referenced this issue Jul 16, 2020
This will make io.netty.channel.unix.Socket.isIPv6Preferred()
available for fixing bazelbuild#11756
bazel-io pushed a commit that referenced this issue Jul 16, 2020
This will make io.netty.channel.unix.Socket.isIPv6Preferred()
available for fixing #11756
bazel-io pushed a commit that referenced this issue Jul 16, 2020
The underlying issue has been worked around in #11776

Fixes #11756
Closes #11792
UberOpenSourceBot pushed a commit to fusionjs/fusionjs that referenced this issue Jul 17, 2020
bazel 3.4.0 is basically broken in docker containers: bazelbuild/bazel#11756

After this lands, I will cut a jazelle release that uses bazel 3.4.1, and upgrade fusion CI to use that new jazelle release. I am hopeful that this will fix our bazel server startup issues.
coeuvre pushed a commit to coeuvre/bazel that referenced this issue Oct 22, 2020
This will make io.netty.channel.unix.Socket.isIPv6Preferred()
available for fixing bazelbuild#11756

# Conflicts:
#	third_party/BUILD
coeuvre pushed a commit to coeuvre/bazel that referenced this issue Oct 22, 2020
The underlying issue has been worked around in bazelbuild#11776

Fixes bazelbuild#11756
Closes bazelbuild#11792
rtsao pushed a commit to uber-web/jazelle that referenced this issue Nov 12, 2020
https://github.com/uber/fusionjs/pull/1120

bazel 3.4.0 is basically broken in docker containers: bazelbuild/bazel#11756

After this lands, I will cut a jazelle release that uses bazel 3.4.1, and upgrade fusion CI to use that new jazelle release. I am hopeful that this will fix our bazel server startup issues.
luca-digrazia pushed a commit to luca-digrazia/DatasetCommitsDiffSearch that referenced this issue Sep 4, 2022
7 participants