e2e with VIP: resetting node fails with control plane endpoint being down #3500

Closed
smira opened this issue Apr 16, 2021 · 1 comment · Fixed by #3590
smira commented Apr 16, 2021

Bug Report

The e2e test fails to reset the node because the drain task fails: the cordon request to the control plane endpoint (the VIP, 172.20.1.50) times out.

Description

Logs

[   71.147344] [talos] boot sequence: done: 1m2.424736265s
[  673.551602] [talos] reset request received
[  673.554622] [talos] reset sequence: 10 phase(s)
[  673.555950] [talos] phase drain (1/10): 1 tasks(s)
[  673.557394] [talos] task cordonAndDrainNode (1/1): starting
[  703.564537] [talos] task cordonAndDrainNode (1/1): failed: failed to cordon node e2e-qemu-master-1: 2 error(s) occurred:
[  703.567016] 	Get "https://172.20.1.50:6443/api/v1/nodes/e2e-qemu-master-1?timeout=30s": net/http: TLS handshake timeout
[  703.569899] 	timeout
[  703.570485] [talos] phase drain (1/10): failed
[  703.571428] [talos] reset sequence: failed
[  703.572568] [talos] reset failed: error running phase 1 in reset sequence: task 1/1: failed, failed to cordon node e2e-qemu-master-1: 2 error(s) occurred:
[  703.576531] 	Get "https://172.20.1.50:6443/api/v1/nodes/e2e-qemu-master-1?timeout=30s": net/http: TLS handshake timeout
[  703.578763] 	timeout
[  703.579263] [talos] service[kubelet](Stopping): Sending SIGTERM to task kubelet (PID 2657, container kubelet)
[  703.581416] [talos] service[etcd](Stopping): Sending SIGTERM to task etcd (PID 2599, container etcd)
[  703.583382] [talos] service[trustd](Stopping): Sending SIGTERM to task trustd (PID 2526, container trustd)
[  703.586583] [talos] service[machined](Finished): Service finished successfully
[  703.588590] [talos] service[udevd](Stopping): Sending SIGTERM to Process(["/sbin/udevd" "--resolve-names=never"])
[  703.592240] [talos] service[apid](Stopping): Sending SIGTERM to task apid (PID 2525, container apid)
[  703.616619] [talos] service[udevd](Finished): Service finished successfully
[  703.669443] [talos] service[trustd](Finished): Service finished successfully
[  703.695612] [talos] service[apid](Finished): Service finished successfully
[  703.701732] [talos] service[containerd](Stopping): Sending SIGTERM to Process(["/bin/containerd" "--address" "/system/run/containerd/containerd.sock" "--state" "/system/run/containerd" "--root" "/system/var/lib/containerd"])
[  703.717190] [talos] service[containerd](Finished): Service finished successfully
[  703.768446] [talos] service[kubelet](Finished): Service finished successfully
[  704.710995] [talos] service[etcd](Finished): Service finished successfully
[  704.712550] [talos] service[cri](Stopping): Sending SIGTERM to Process(["/bin/containerd" "--address" "/run/containerd/containerd.sock" "--config" "/etc/cri/containerd.toml"])
[  704.732445] [talos] service[cri](Finished): Service finished successfully
INFO[0750] Broadcasting ARP update for 172.20.1.50 (f6:ec:b5:a4:5f:ab) via eth0
{"level":"warn","ts":"2021-04-16T18:05:18.702Z","caller":"v3@v3.5.0-alpha.0/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"etcd-endpoints://0xc000d0c700/#initially=[127.0.0.1:2379]","attempt":0,"error":"rpc error: code = Canceled desc = context canceled"}
{"level":"warn","ts":"2021-04-16T18:05:28.703Z","caller":"v3@v3.5.0-alpha.0/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"etcd-endpoints://0xc000d0c700/#initially=[127.0.0.1:2379]","attempt":0,"error":"rpc error: code = DeadlineExceeded desc = latest balancer error: last connection error: connection error: desc = \"transport: Error while dialing dial tcp 127.0.0.1:2379: connect: connection refused\""}
{"level":"warn","ts":"2021-04-16[  823.339433] [talos] service[networkd](Finished): Service finished successfully
T18:06:28.704Z",[  823.341096] [talos] fatal sequencer error in "reset" sequence: message:"sequence failed: error running phase 1 in reset sequence: task 1/1: failed, failed to cordon node e2e-qemu-master-1: 2 error(s) occurred:\n\tGet \"https://172.20.1.50:6443/api/v1/nodes/e2e-qemu-master-1?timeout=30s\": net/http: TLS handshake timeout\n\ttimeout"
"caller":"v3@v3.5.0-alpha.0/retr[  823.347335] [talos] controller runtime finished
y_interceptor.go:62","msg":"retrying of unary invoker failed","target":"etcd-endpoints://0xc000d0c700/#initially=[127.0.0.1:2379]","attempt":0,"error":"rpc error: code = DeadlineExceeded desc = latest balancer error: last connection error: connection error: desc = \"transport: Error while dialing dial tcp 127.0.0.1:2379: connect: connection refused\""}
{"level":"warn","ts":"2021-04-16T18:06:28.704Z","caller":"v3@v3.5.0-alpha.0/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"etcd-endpoints://0xc000d0c700/#initially=[127.0.0.1:2379]","attempt":0,"error":"rpc error: code = Canceled desc = latest balancer error: last connection error: connection error: desc = \"transport: Error while dialing dial tcp 127.0.0.1:2379: connect: connection refused\""}
[  823.528619] [talos] rebooting in 10 seconds
[  824.530207] [talos] rebooting in 9 seconds
[  825.532034] [talos] rebooting in 8 seconds
[  826.533074] [talos] rebooting in 7 seconds
[  827.535054] [talos] rebooting in 6 seconds
[  828.536059] [talos] rebooting in 5 seconds
[  829.538050] [talos] rebooting in 4 seconds
[  830.539164] [talos] rebooting in 3 seconds
[  831.540394] [talos] rebooting in 2 seconds
[  832.541437] [talos] rebooting in 1 seconds
[  833.543038] [talos] rebooting in 0 seconds
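
Cordoning a node is just a Kubernetes API call against the configured control plane endpoint, which in this e2e setup is the VIP (https://172.20.1.50:6443). The sketch below uses client-go to mark a node unschedulable; it is a hypothetical illustration, not the actual Talos `cordonAndDrainNode` task, and the kubeconfig path is a placeholder. It shows why the task can only fail with a timeout once the VIP is unassigned: every request first has to complete a TLS handshake with that endpoint.

```go
// Minimal cordon sketch using client-go; assumes a kubeconfig whose server is the
// control plane endpoint (the VIP). Hypothetical, not the Talos implementation.
package main

import (
	"context"
	"fmt"
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/types"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func cordon(ctx context.Context, kubeconfig, nodeName string) error {
	config, err := clientcmd.BuildConfigFromFlags("", kubeconfig)
	if err != nil {
		return err
	}

	// Every request goes to config.Host (e.g. https://172.20.1.50:6443); if the
	// VIP is not assigned to any node, the TLS handshake never completes and the
	// request only fails after this timeout.
	config.Timeout = 30 * time.Second

	client, err := kubernetes.NewForConfig(config)
	if err != nil {
		return err
	}

	// Mark the node unschedulable (the "cordon" half of cordon-and-drain).
	patch := []byte(`{"spec":{"unschedulable":true}}`)

	if _, err := client.CoreV1().Nodes().Patch(ctx, nodeName, types.StrategicMergePatchType, patch, metav1.PatchOptions{}); err != nil {
		return fmt.Errorf("failed to cordon node %s: %w", nodeName, err)
	}

	return nil
}

func main() {
	if err := cordon(context.Background(), "/path/to/kubeconfig", "e2e-qemu-master-1"); err != nil {
		fmt.Println(err)
	}
}
```

In the log above the handshake never completes ("net/http: TLS handshake timeout"), so the cordon fails, the drain phase fails, and the whole reset sequence is aborted.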

Environment

  • Talos version: [talosctl version --nodes <problematic nodes>]
  • Kubernetes version: [kubectl version --short]
  • Platform:
smira added this to the 0.11 milestone Apr 16, 2021
smira commented Apr 16, 2021

smira self-assigned this May 7, 2021
smira added a commit to smira/talos that referenced this issue May 7, 2021
The problem is with the VIP and the `reset` sequence: the order of operations was that `etcd` was stopped first while `networkd` was still running, so if the node owned the VIP at the time of the reset, the lease would be lost (as the client connection to `etcd` is gone) and the VIP would stay unassigned for a pretty long time.

This PR changes the order of operations: stop `networkd` and other pods first, and leave `etcd` for last, so that the VIP is released and, for example, `kube-apiserver` isn't left hanging on the node while `etcd` is already gone.

Fixes siderolabs#3500

Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
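
The VIP hand-off in this scenario hinges on an etcd lease: the commit message above notes that stopping `etcd` while the node still owns the VIP loses the lease, leaving the address unassigned for a long time. The sketch below is a hypothetical illustration of that pattern using the etcd clientv3 `concurrency` package (it is not the Talos VIP controller, and the election key, TTL, and node name are made up): a node campaigns for ownership of the shared IP and must resign while etcd is still reachable so another node can take the VIP over immediately.

```go
// Hypothetical VIP-ownership sketch built on an etcd election; not the Talos code.
package main

import (
	"context"
	"log"
	"os"
	"os/signal"
	"syscall"
	"time"

	clientv3 "go.etcd.io/etcd/client/v3"
	"go.etcd.io/etcd/client/v3/concurrency"
)

func holdVIP(ctx context.Context, endpoints []string, nodeName string) error {
	client, err := clientv3.New(clientv3.Config{
		Endpoints:   endpoints, // e.g. 127.0.0.1:2379, as in the log above
		DialTimeout: 5 * time.Second,
	})
	if err != nil {
		return err
	}
	defer client.Close()

	// The session is backed by an etcd lease with a TTL; if etcd goes away before
	// we resign, other candidates can only take over once that TTL expires.
	session, err := concurrency.NewSession(client, concurrency.WithTTL(15))
	if err != nil {
		return err
	}
	defer session.Close()

	election := concurrency.NewElection(session, "/vip/172.20.1.50") // hypothetical key

	// Block until this node owns the VIP; a real controller would then assign the
	// address and start answering ARP for it (as networkd does in the log above).
	if err := election.Campaign(ctx, nodeName); err != nil {
		return err
	}

	log.Printf("%s owns the VIP", nodeName)

	// Hold the VIP until shutdown is requested.
	<-ctx.Done()

	// Resign while etcd is still reachable so another node claims the VIP right
	// away; this only works if etcd outlives the network teardown.
	resignCtx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
	defer cancel()

	return election.Resign(resignCtx)
}

func main() {
	// Release the VIP cleanly when shutdown (e.g. a reset sequence) begins.
	ctx, stop := signal.NotifyContext(context.Background(), syscall.SIGTERM, os.Interrupt)
	defer stop()

	if err := holdVIP(ctx, []string{"127.0.0.1:2379"}, "e2e-qemu-master-1"); err != nil {
		log.Fatal(err)
	}
}
```

Under the old ordering, `etcd` was already stopped by the time the VIP holder went away, so no clean resign could happen and the control plane endpoint stayed unreachable; that is presumably the window in which the cordon request in the log above times out.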
talos-bot pushed a commit that referenced this issue May 7, 2021
smira added a commit to smira/talos that referenced this issue May 13, 2021
(cherry picked from commit 4ffd7c0)
smira added a commit that referenced this issue May 13, 2021
(cherry picked from commit 4ffd7c0)
github-actions bot locked as resolved and limited conversation to collaborators Jun 25, 2024