e2e with VIP: resetting node fails with control plane endpoint being down #3500

Closed
smira opened this issue Apr 16, 2021 · 1 comment · Fixed by #3590
smira commented Apr 16, 2021

Bug Report

The e2e test fails to reset the node because the drain task fails: the cordon request to the control plane endpoint (the VIP, 172.20.1.50) times out.

Description

Logs

[   71.147344] [talos] boot sequence: done: 1m2.424736265s
[  673.551602] [talos] reset request received
[  673.554622] [talos] reset sequence: 10 phase(s)
[  673.555950] [talos] phase drain (1/10): 1 tasks(s)
[  673.557394] [talos] task cordonAndDrainNode (1/1): starting
[  703.564537] [talos] task cordonAndDrainNode (1/1): failed: failed to cordon node e2e-qemu-master-1: 2 error(s) occurred:
[  703.567016] 	Get "https://172.20.1.50:6443/api/v1/nodes/e2e-qemu-master-1?timeout=30s": net/http: TLS handshake timeout
[  703.569899] 	timeout
[  703.570485] [talos] phase drain (1/10): failed
[  703.571428] [talos] reset sequence: failed
[  703.572568] [talos] reset failed: error running phase 1 in reset sequence: task 1/1: failed, failed to cordon node e2e-qemu-master-1: 2 error(s) occurred:
[  703.576531] 	Get "https://172.20.1.50:6443/api/v1/nodes/e2e-qemu-master-1?timeout=30s": net/http: TLS handshake timeout
[  703.578763] 	timeout
[  703.579263] [talos] service[kubelet](Stopping): Sending SIGTERM to task kubelet (PID 2657, container kubelet)
[  703.581416] [talos] service[etcd](Stopping): Sending SIGTERM to task etcd (PID 2599, container etcd)
[  703.583382] [talos] service[trustd](Stopping): Sending SIGTERM to task trustd (PID 2526, container trustd)
[  703.586583] [talos] service[machined](Finished): Service finished successfully
[  703.588590] [talos] service[udevd](Stopping): Sending SIGTERM to Process(["/sbin/udevd" "--resolve-names=never"])
[  703.592240] [talos] service[apid](Stopping): Sending SIGTERM to task apid (PID 2525, container apid)
[  703.616619] [talos] service[udevd](Finished): Service finished successfully
[  703.669443] [talos] service[trustd](Finished): Service finished successfully
[  703.695612] [talos] service[apid](Finished): Service finished successfully
[  703.701732] [talos] service[containerd](Stopping): Sending SIGTERM to Process(["/bin/containerd" "--address" "/system/run/containerd/containerd.sock" "--state" "/system/run/containerd" "--root" "/system/var/lib/containerd"])
[  703.717190] [talos] service[containerd](Finished): Service finished successfully
[  703.768446] [talos] service[kubelet](Finished): Service finished successfully
[  704.710995] [talos] service[etcd](Finished): Service finished successfully
[  704.712550] [talos] service[cri](Stopping): Sending SIGTERM to Process(["/bin/containerd" "--address" "/run/containerd/containerd.sock" "--config" "/etc/cri/containerd.toml"])
[  704.732445] [talos] service[cri](Finished): Service finished successfully
INFO[0750] Broadcasting ARP update for 172.20.1.50 (f6:ec:b5:a4:5f:ab) via eth0
{"level":"warn","ts":"2021-04-16T18:05:18.702Z","caller":"v3@v3.5.0-alpha.0/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"etcd-endpoints://0xc000d0c700/#initially=[127.0.0.1:2379]","attempt":0,"error":"rpc error: code = Canceled desc = context canceled"}
{"level":"warn","ts":"2021-04-16T18:05:28.703Z","caller":"v3@v3.5.0-alpha.0/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"etcd-endpoints://0xc000d0c700/#initially=[127.0.0.1:2379]","attempt":0,"error":"rpc error: code = DeadlineExceeded desc = latest balancer error: last connection error: connection error: desc = \"transport: Error while dialing dial tcp 127.0.0.1:2379: connect: connection refused\""}
{"level":"warn","ts":"2021-04-16[  823.339433] [talos] service[networkd](Finished): Service finished successfully
T18:06:28.704Z",[  823.341096] [talos] fatal sequencer error in "reset" sequence: message:"sequence failed: error running phase 1 in reset sequence: task 1/1: failed, failed to cordon node e2e-qemu-master-1: 2 error(s) occurred:\n\tGet \"https://172.20.1.50:6443/api/v1/nodes/e2e-qemu-master-1?timeout=30s\": net/http: TLS handshake timeout\n\ttimeout"
"caller":"v3@v3.5.0-alpha.0/retr[  823.347335] [talos] controller runtime finished
y_interceptor.go:62","msg":"retrying of unary invoker failed","target":"etcd-endpoints://0xc000d0c700/#initially=[127.0.0.1:2379]","attempt":0,"error":"rpc error: code = DeadlineExceeded desc = latest balancer error: last connection error: connection error: desc = \"transport: Error while dialing dial tcp 127.0.0.1:2379: connect: connection refused\""}
{"level":"warn","ts":"2021-04-16T18:06:28.704Z","caller":"v3@v3.5.0-alpha.0/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"etcd-endpoints://0xc000d0c700/#initially=[127.0.0.1:2379]","attempt":0,"error":"rpc error: code = Canceled desc = latest balancer error: last connection error: connection error: desc = \"transport: Error while dialing dial tcp 127.0.0.1:2379: connect: connection refused\""}
[  823.528619] [talos] rebooting in 10 seconds
[  824.530207] [talos] rebooting in 9 seconds
[  825.532034] [talos] rebooting in 8 seconds
[  826.533074] [talos] rebooting in 7 seconds
[  827.535054] [talos] rebooting in 6 seconds
[  828.536059] [talos] rebooting in 5 seconds
[  829.538050] [talos] rebooting in 4 seconds
[  830.539164] [talos] rebooting in 3 seconds
[  831.540394] [talos] rebooting in 2 seconds
[  832.541437] [talos] rebooting in 1 seconds
[  833.543038] [talos] rebooting in 0 seconds
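
Cordoning a node is just a Kubernetes API call against the configured control plane endpoint, which in this e2e setup is the VIP (https://172.20.1.50:6443). The sketch below uses client-go to mark a node unschedulable; it is a hypothetical illustration, not the actual Talos `cordonAndDrainNode` task, and the kubeconfig path is a placeholder. It shows why the task can only fail with a timeout once the VIP is unassigned: every request first has to complete a TLS handshake with that endpoint.

```go
// Minimal cordon sketch using client-go; assumes a kubeconfig whose server is the
// control plane endpoint (the VIP). Hypothetical, not the Talos implementation.
package main

import (
	"context"
	"fmt"
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/types"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func cordon(ctx context.Context, kubeconfig, nodeName string) error {
	config, err := clientcmd.BuildConfigFromFlags("", kubeconfig)
	if err != nil {
		return err
	}

	// Every request goes to config.Host (e.g. https://172.20.1.50:6443); if the
	// VIP is not assigned to any node, the TLS handshake never completes and the
	// request only fails after this timeout.
	config.Timeout = 30 * time.Second

	client, err := kubernetes.NewForConfig(config)
	if err != nil {
		return err
	}

	// Mark the node unschedulable (the "cordon" half of cordon-and-drain).
	patch := []byte(`{"spec":{"unschedulable":true}}`)

	if _, err := client.CoreV1().Nodes().Patch(ctx, nodeName, types.StrategicMergePatchType, patch, metav1.PatchOptions{}); err != nil {
		return fmt.Errorf("failed to cordon node %s: %w", nodeName, err)
	}

	return nil
}

func main() {
	if err := cordon(context.Background(), "/path/to/kubeconfig", "e2e-qemu-master-1"); err != nil {
		fmt.Println(err)
	}
}
```

In the log above the handshake never completes ("net/http: TLS handshake timeout"), so the cordon fails, the drain phase fails, and the whole reset sequence is aborted.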

Environment

  • Talos version: [talosctl version --nodes <problematic nodes>]
  • Kubernetes version: [kubectl version --short]
  • Platform:
smira added this to the 0.11 milestone Apr 16, 2021
smira commented Apr 16, 2021

smira self-assigned this May 7, 2021
smira added a commit to smira/talos that referenced this issue May 7, 2021
The problem is with the VIP and the `reset` sequence: the order of operations was that `etcd` was stopped first while `networkd` was still running, so if the node owned the VIP at the time of the reset, the lease would be lost (as the client connection to `etcd` is gone) and the VIP would stay unassigned for a pretty long time.

This PR changes the order of operations: stop `networkd` and other pods first, and leave `etcd` for last, so that the VIP is released and, for example, `kube-apiserver` isn't left hanging on the node while `etcd` is already gone.

Fixes siderolabs#3500

Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
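
The VIP hand-off in this scenario hinges on an etcd lease: the commit message above notes that stopping `etcd` while the node still owns the VIP loses the lease, leaving the address unassigned for a long time. The sketch below is a hypothetical illustration of that pattern using the etcd clientv3 `concurrency` package (it is not the Talos VIP controller, and the election key, TTL, and node name are made up): a node campaigns for ownership of the shared IP and must resign while etcd is still reachable so another node can take the VIP over immediately.

```go
// Hypothetical VIP-ownership sketch built on an etcd election; not the Talos code.
package main

import (
	"context"
	"log"
	"os"
	"os/signal"
	"syscall"
	"time"

	clientv3 "go.etcd.io/etcd/client/v3"
	"go.etcd.io/etcd/client/v3/concurrency"
)

func holdVIP(ctx context.Context, endpoints []string, nodeName string) error {
	client, err := clientv3.New(clientv3.Config{
		Endpoints:   endpoints, // e.g. 127.0.0.1:2379, as in the log above
		DialTimeout: 5 * time.Second,
	})
	if err != nil {
		return err
	}
	defer client.Close()

	// The session is backed by an etcd lease with a TTL; if etcd goes away before
	// we resign, other candidates can only take over once that TTL expires.
	session, err := concurrency.NewSession(client, concurrency.WithTTL(15))
	if err != nil {
		return err
	}
	defer session.Close()

	election := concurrency.NewElection(session, "/vip/172.20.1.50") // hypothetical key

	// Block until this node owns the VIP; a real controller would then assign the
	// address and start answering ARP for it (as networkd does in the log above).
	if err := election.Campaign(ctx, nodeName); err != nil {
		return err
	}

	log.Printf("%s owns the VIP", nodeName)

	// Hold the VIP until shutdown is requested.
	<-ctx.Done()

	// Resign while etcd is still reachable so another node claims the VIP right
	// away; this only works if etcd outlives the network teardown.
	resignCtx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
	defer cancel()

	return election.Resign(resignCtx)
}

func main() {
	// Release the VIP cleanly when shutdown (e.g. a reset sequence) begins.
	ctx, stop := signal.NotifyContext(context.Background(), syscall.SIGTERM, os.Interrupt)
	defer stop()

	if err := holdVIP(ctx, []string{"127.0.0.1:2379"}, "e2e-qemu-master-1"); err != nil {
		log.Fatal(err)
	}
}
```

Under the old ordering, `etcd` was already stopped by the time the VIP holder went away, so no clean resign could happen and the control plane endpoint stayed unreachable; that is presumably the window in which the cordon request in the log above times out.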
talos-bot pushed a commit that referenced this issue May 7, 2021
smira added a commit to smira/talos that referenced this issue May 13, 2021
(cherry picked from commit 4ffd7c0)
smira added a commit that referenced this issue May 13, 2021
(cherry picked from commit 4ffd7c0)
github-actions bot locked as resolved and limited conversation to collaborators Jun 25, 2024