e2e with VIP: resetting node fails with control plane endpoint being down #3500
smira added a commit to smira/talos that referenced this issue on May 7, 2021
The problem is with VIP and `reset` sequence: the order of operations was that `etcd` was stopped first while `networkd` was still running, and if the node owned the VIP at the time of the reset action, the lease will be lost (as client connection is gone), so VIP will be unassigned for a pretty long time.

This PR changes the order of operations: first, stop `networkd` and other pods, and leave `etcd` last, so that VIP is released, and `kube-apiserver` for example isn't left hanging on the node while `etcd` is gone.

Fixes siderolabs#3500

Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
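The reordering described in the commit message can be illustrated with a minimal Go sketch. This is not the actual Talos sequencer code: `stopOrder` and the service list are hypothetical, chosen only to show the idea that every other service is stopped first and `etcd` is stopped last.

```go
package main

import "fmt"

// stopOrder is a hypothetical helper (not part of Talos). Given the list of
// running services, it returns the order in which to stop them during reset:
// every service except etcd keeps its relative order, and etcd goes last.
// Stopping networkd before etcd lets the node release the VIP while etcd is
// still reachable.
func stopOrder(running []string) []string {
	ordered := make([]string, 0, len(running))
	hasEtcd := false

	for _, svc := range running {
		if svc == "etcd" {
			hasEtcd = true
			continue
		}
		ordered = append(ordered, svc)
	}

	if hasEtcd {
		ordered = append(ordered, "etcd")
	}

	return ordered
}

func main() {
	// Old behaviour stopped etcd first; the sketch moves it to the end.
	fmt.Println(stopOrder([]string{"etcd", "networkd", "kubelet"}))
	// prints: [networkd kubelet etcd]
}
```

In the real fix the ordering is defined by the reset sequence itself; the sketch only captures the constraint that `etcd` outlives `networkd` and the pods.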
talos-bot pushed a commit that referenced this issue on May 7, 2021
smira added a commit to smira/talos that referenced this issue on May 13, 2021 (cherry picked from commit 4ffd7c0)
smira added a commit that referenced this issue on May 13, 2021 (cherry picked from commit 4ffd7c0)
Bug Report
The e2e test fails to reset the node because the drain task fails.
Description
Logs
Environment
talosctl version --nodes <problematic nodes>
kubectl version --short