Improve join logic to handle unreachable nodes #560

bschimke95 · 2024-07-18T12:38:29Z

Fixes an issue where the join process would fail if a node in the join_address list is unreachable.

A join token includes the IP addresses of all nodes that are in the cluster at the time the token is created. If a node was removed after the token was created but before it was used to join, the join process failed because it tried to connect to a node that no longer existed.

Now, the client discovery logic will continue attempting to connect to other nodes in the join_address list instead of failing early. This ensures that the join process can succeed as long as at least one node is reachable.

This PR fixes the issue at the k8s-snap level for worker nodes. The integration test only passes together with the microcluster upgrade from #562.

src/k8s/pkg/k8sd/app/hooks_bootstrap.go

bschimke95 · 2024-07-23T11:21:31Z

Rebased on top of #562 to verify that this fix works.

tests/integration/tests/test_clustering_race.py

louiseschmidtgen

LGTM, thanks Ben!

berkayoz

LGTM

Co-authored-by: Mateo Florido <32885896+mateoflorido@users.noreply.github.com>

Co-authored-by: Louise K. Schmidtgen <louise.schmidtgen@canonical.com>

neoaggelos · 2024-07-31T08:33:38Z

src/k8s/pkg/k8sd/app/hooks_bootstrap.go

+	// Get remote certificate from the cluster member. We only need one node to be reachable for this.
+	// One might fail because the node is not part of the cluster anymore but was at the time the token was created.
+	var cert *x509.Certificate
+	var address string
+	var err error
+	for _, address = range token.JoinAddresses {
+		cert, err = utils.GetRemoteCertificate(address)
+		if err == nil {
+			break
+		}
+	}


In general, try to avoid mixing dependency upgrades with changes to the code logic, as that makes the change harder to backport or revert if required.

bschimke95 changed the title from "clustering race test" to "Improve join logic to handle unreachable nodes" on Jul 19, 2024

bschimke95 marked this pull request as ready for review July 19, 2024 15:42

bschimke95 requested a review from a team as a code owner July 19, 2024 15:42

bschimke95 force-pushed the bschimke95/debug-upgrade-race-condition branch from 766c1ea to 5d406dc on July 19, 2024 15:43

neoaggelos reviewed Jul 22, 2024

src/k8s/pkg/k8sd/app/hooks_bootstrap.go (outdated)

bschimke95 force-pushed the bschimke95/debug-upgrade-race-condition branch from 71a8fa4 to 55382de on July 23, 2024 08:01

bschimke95 changed the base branch from main to bschimke95/upgrade-microcluster July 23, 2024 09:28

bschimke95 mentioned this pull request Jul 24, 2024

Add rollout upgrade integration test canonical/cluster-api-k8s#29

Merged

louiseschmidtgen reviewed Jul 30, 2024

tests/integration/tests/test_clustering_race.py (outdated)

louiseschmidtgen approved these changes Jul 30, 2024

berkayoz approved these changes Jul 30, 2024

bschimke95 and others added 9 commits July 30, 2024 14:44

update microcluster version and adapt to API changes

5e78613

Address comments

9baed3b

refactor cleanup steps into remove hook

c64f39d

update diff

8b41089

wait until node has joined the cluster

bea4f0a

Update src/k8s/pkg/k8sd/app/hooks_remove.go

d760a8d

Co-authored-by: Mateo Florido <32885896+mateoflorido@users.noreply.github.com>

indicate that state is not used in signature

9812d98

rename file

7040f61

rebase fix

af9e3ac

bschimke95 force-pushed the bschimke95/upgrade-microcluster branch from 2ab601e to af9e3ac on July 30, 2024 12:45

bschimke95 and others added 8 commits July 30, 2024 15:47

do not stop k8s-dqlite as it is already stopped

b5afdbf

clustering race test

be13a9c

fix test

de1b4dc

Provide k8s-snap level fix

fe602e9

linting

a8eed66

Use join address in follow-up places

4219296

fix typo

9845d09

Update tests/integration/tests/test_clustering_race.py

bafed4c

Co-authored-by: Louise K. Schmidtgen <louise.schmidtgen@canonical.com>

bschimke95 force-pushed the bschimke95/debug-upgrade-race-condition branch from b172853 to bafed4c on July 30, 2024 13:51

fix linter errors

e60594e

Base automatically changed from bschimke95/upgrade-microcluster to main July 30, 2024 19:06

bschimke95 merged commit f7167b5 into main Jul 31, 2024
17 checks passed

bschimke95 deleted the bschimke95/debug-upgrade-race-condition branch July 31, 2024 07:21

neoaggelos reviewed Jul 31, 2024

bschimke95 added a commit that referenced this pull request Aug 5, 2024

Improve join logic to handle unreachable nodes (#560)

ff609c8

--------- Co-authored-by: Mateo Florido <32885896+mateoflorido@users.noreply.github.com> Co-authored-by: Louise K. Schmidtgen <louise.schmidtgen@canonical.com>
