[Release-1.30] - Agent loadbalancer may deadlock when servers are removed #6321

Closed
brandond opened this issue Jul 15, 2024 · 2 comments

@brandond
Member

Backport fix for Agent loadbalancer may deadlock when servers are removed

@aganesh-suse

Sorry, I posted the k3s results here and closed this issue by mistake, hence re-opening (I have deleted the k3s results). I will update with the rke2 results and close next week.

@aganesh-suse

Validated on release-1.30 branch with version v1.30.3-rc4+rke2r1

Environment Details

Infrastructure

  • Cloud
  • Hosted

Node(s) CPU architecture, OS, and Version:

$ cat /etc/os-release
PRETTY_NAME="Ubuntu 22.04.2 LTS"

$ uname -m
x86_64

Cluster Configuration:

HA: 3 servers / 1 agent

Config.yaml:

token: xxxx
write-kubeconfig-mode: "0644"
node-external-ip: 1.1.1.1

Testing Steps

  1. Copy config.yaml
$ sudo mkdir -p /etc/rancher/rke2 && sudo cp config.yaml /etc/rancher/rke2
  2. Install RKE2
$ curl -sfL https://get.rke2.io | sudo INSTALL_RKE2_VERSION='v1.30.3-rc4+rke2r1' INSTALL_RKE2_TYPE='server' INSTALL_RKE2_METHOD=tar sh -
  3. Start the RKE2 service
$ sudo systemctl enable --now rke2-server
or
$ sudo systemctl enable --now rke2-agent
  4. Verify Cluster Status:
$ kubectl get nodes -o wide
$ kubectl get pods -A
  5. Refer to verification steps here: Fix loadbalancer reentrant rlock k3s-io/k3s#10511
    • Identify the server that the agent is connected to: netstat -na | grep 6443
    • Disconnect the network on that server: ip link set dev eth0 down (or whatever interface that node is using)
    • The failed server should get removed from the server list
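
For the "identify the server" part of the last step, a small filter over the netstat output can make the connected server easier to spot. This is only a sketch: the function name connected_servers is made up for illustration, and it assumes the Linux net-tools column layout (Proto, Recv-Q, Send-Q, Local Address, Foreign Address, State).

```shell
#!/bin/sh
# Hypothetical helper, not part of rke2.
# Reads `netstat -na`-style lines on stdin and prints the unique remote
# addresses of ESTABLISHED connections whose remote port is 6443.
# Assumes the net-tools layout: field 5 = foreign address, field 6 = state.
connected_servers() {
    awk '$6 == "ESTABLISHED" && $5 ~ /:6443$/ { print $5 }' | sort -u
}

# Usage on the agent node:
#   netstat -na | connected_servers
```

Filtering on the foreign address column avoids matching local listeners on port 6443, which a plain grep 6443 would also show.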

Replication Results:

  • rke2 version used for replication:
$ rke2 -v
rke2 version v1.30.2+rke2r1 (f01072ab2b9cf1a529ce188e4a8d8645813d4620)
go version go1.22.4 X:boringcrypto
level=info msg="Connecting to proxy" url="wss://<ip1>:9345/v1-rke2/connect"
level=error msg="Remotedialer proxy error; reconnecting..." error="dial tcp <ip1>:9345: connect: connection refused" url="wss://<ip1>:9345/v1-rke2/connect"
level=error msg="Remotedialer proxy error; reconnecting..." error="websocket: close 1006 (abnormal closure): unexpected EOF" url="wss://<ip1>:9345/v1-rke2/connect"
level=debug msg="Failed over to new server for load balancer rke2-api-server-agent-load-balancer: <ip1>:6443 -> <ip2>:6443"
.
.
level=info msg="Removing server from load balancer rke2-api-server-agent-load-balancer: <ip1>:6443"
level=info msg="Updated load balancer rke2-api-server-agent-load-balancer server addresses -> [<ip3>:6443 <ip2>:6443] [default: <ip1>:6443]"
level=info msg="Removing server from load balancer rke2-agent-load-balancer: <ip1>:9345"
level=info msg="Updated load balancer rke2-agent-load-balancer server addresses -> [<ip3>:9345 <ip2>:9345] [default: <ip1>:9345]"

Validation Results:

  • rke2 version used for validation:
$ rke2 -v
rke2 version v1.30.3-rc4+rke2r1 (370f53b3794863b6481adc9a7f5a838d1fca66ac)
go version go1.22.5 X:boringcrypto
level=info msg="Removing server from load balancer rke2-api-server-agent-load-balancer: <ip1>:6443"
level=info msg="Updated load balancer rke2-api-server-agent-load-balancer server addresses -> [<ip2>:6443 <ip3>:6443] [default: <ip1>:6443]"
level=info msg="Removing server from load balancer rke2-agent-load-balancer: <ip1>:9345"
level=info msg="Updated load balancer rke2-agent-load-balancer server addresses -> [<ip2>:9345 <ip3>:9345] [default: <ip1>:9345]"
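
To check for the removal messages without reading the full log, the log lines above can be filtered as sketched below. The function name removed_servers is hypothetical, and the pattern assumes the message format is exactly as shown in the logs in this comment.

```shell
#!/bin/sh
# Hypothetical helper, not part of rke2.
# Reads log lines on stdin and prints "<load balancer name> <address>" for
# each 'Removing server from load balancer' message. Other lines are dropped.
removed_servers() {
    sed -n 's/.*msg="Removing server from load balancer \([^:]*\): \([^"]*\)".*/\1 \2/p'
}

# Usage on the agent node (unit name as used in the testing steps):
#   journalctl -u rke2-agent --no-pager | removed_servers
```

With the validation logs above, this should print one line per load balancer (rke2-api-server-agent-load-balancer and rke2-agent-load-balancer), each followed by the address of the disconnected server.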
