Send error response if member list cannot be retrieved #9722

brandond · 2024-03-11T22:26:14Z

Proposed Changes

Move error response generation code into util
Send error response if member list cannot be retrieved
Prevents joining nodes from being stuck with bad initial member list if there is a transient failure, or if they try to join themselves

I am assuming there was some reason we generated a dummy list instead of sending an error, but it's done this since the initial embedded etcd code was bulk-added by Darren in #1770 so I don't know why this was supposed to be a good idea.

Types of Changes

bugfix

Verification

See linked issue
Try joining a node that is behind an external load-balancer that includes the new node in its target poor. For example, using the following haproxy config, where SERVER1 is the existing node, and SERVER2 is the node attempting to join the cluster:

    global
        maxconn 256
        max-spread-checks 0

    defaults
        mode tcp
        timeout connect 10ms
        timeout client  60s
        timeout server  60s

    frontend listener
        bind *:6443
        default_backend servers

    backend servers
        retries 10
        retry-on conn-failure
        server server1 SERVER1:6443 check downinter 5ms inter 1s weight 1
        server server2 SERVER2:6443 check downinter 5ms inter 1s weight 100

This haproxy config will cause it to send traffic to the new server as soon as it is up, but before it is actually ready - which will result in it trying to join itself.

Testing

Linked Issues

Unrecoverable error when joining node attempts to retrieve etcd member list from itself #9661

User-Facing Change

Further Comments

codecov · 2024-03-11T23:10:25Z

Codecov Report

Attention: Patch coverage is 15.06849% with 62 lines in your changes are missing coverage. Please review.

Project coverage is 45.90%. Comparing base (fe2ca9e) to head (f9e74e9).
Report is 4 commits behind head on master.

Files	Patch %	Lines
pkg/util/apierrors.go	21.95%	30 Missing and 2 partials ⚠️
pkg/etcd/etcd.go	7.14%	13 Missing ⚠️
pkg/server/router.go	9.09%	10 Missing ⚠️
pkg/server/auth.go	0.00%	4 Missing ⚠️
pkg/daemons/control/tunnel.go	0.00%	3 Missing ⚠️

Additional details and impacted files

@@            Coverage Diff             @@
##           master    #9722      +/-   ##
==========================================
- Coverage   53.04%   45.90%   -7.14%     
==========================================
  Files         154      155       +1     
  Lines       13599    13588      -11     
==========================================
- Hits         7214     6238     -976     
- Misses       5020     6145    +1125     
+ Partials     1365     1205     -160

Flag	Coverage Δ
e2etests	`37.55% <1.36%> (-11.90%)`	⬇️
inttests	`39.44% <15.06%> (+0.08%)`	⬆️
unittests	`16.14% <1.81%> (-0.05%)`	⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

Signed-off-by: Brad Davidson <brad.davidson@rancher.com>

Prevents joining nodes from being stuck with bad initial member list if there is a transient failure, or if they try to join themselves Signed-off-by: Brad Davidson <brad.davidson@rancher.com>

brandond requested a review from a team as a code owner March 11, 2024 22:26

brandond added 2 commits March 12, 2024 00:53

Move error response generation code into util

5590699

Signed-off-by: Brad Davidson <brad.davidson@rancher.com>

Send error response if member list cannot be retrieved

f9e74e9

Prevents joining nodes from being stuck with bad initial member list if there is a transient failure, or if they try to join themselves Signed-off-by: Brad Davidson <brad.davidson@rancher.com>

brandond force-pushed the fix-etcd-self-join branch from 4b289a1 to f9e74e9 Compare March 12, 2024 01:06

dereknola approved these changes Mar 12, 2024

View reviewed changes

vitorsavian approved these changes Mar 26, 2024

View reviewed changes

brandond merged commit d7cdbb7 into k3s-io:master Mar 26, 2024
26 checks passed

This was referenced Apr 10, 2024

[release-1.28] Backports for 2024-04 release cycle #9911

Merged

[release-1.27] Backports for 2024-04 release cycle #9912

Merged

mdrahman-suse mentioned this pull request Apr 18, 2024

[Release-1.27] - Unrecoverable error when joining node attempts to retrieve etcd member list from itself rancher/rke2#5806

Closed

cguertin14 mentioned this pull request Apr 27, 2024

new release: k3s update from v1.29.3+k3s1 to v1.29.4+k3s1 cguertin14/k3s-ansible-ha#36

Closed

brandond deleted the fix-etcd-self-join branch June 6, 2024 21:18

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Send error response if member list cannot be retrieved #9722

Send error response if member list cannot be retrieved #9722

brandond commented Mar 11, 2024 •

edited

Loading

codecov bot commented Mar 11, 2024 •

edited

Loading

Send error response if member list cannot be retrieved #9722

Send error response if member list cannot be retrieved #9722

Conversation

brandond commented Mar 11, 2024 • edited Loading

Proposed Changes

Types of Changes

Verification

Testing

Linked Issues

User-Facing Change

Further Comments

codecov bot commented Mar 11, 2024 • edited Loading

Codecov Report

brandond commented Mar 11, 2024 •

edited

Loading

codecov bot commented Mar 11, 2024 •

edited

Loading