Improve join logic to handle unreachable nodes #560

Merged Jul 31, 2024 (18 commits)

Conversation


@bschimke95 bschimke95 commented Jul 18, 2024

Fixes an issue where the join process would fail if a node in the join_address list is unreachable.

When a token is created, it includes the IP addresses of all nodes in the cluster. If a node is removed between the token creation and joining, the join process previously failed because it tried to connect to the non-existent node.

Now, the client discovery logic will continue attempting to connect to other nodes in the join_address list instead of failing early. This ensures that the join process can succeed as long as at least one node is reachable.
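
The "keep trying until one node answers" behaviour described above can be sketched roughly as follows. This is a minimal illustration, not the code from this PR: `firstReachable` and `probeFunc` are hypothetical names, and the probe stands in for whatever call contacts a cluster member (for example, fetching its certificate). It assumes Go 1.20 or later for `errors.Join`.

```go
package main

import (
	"errors"
	"fmt"
)

// probeFunc is a hypothetical stand-in for a call that contacts a cluster
// member. It returns nil if the address answered.
type probeFunc func(address string) error

// firstReachable tries each address in order and returns the first one for
// which probe succeeds. Failures are collected instead of returned right
// away, so a single stale address does not abort the join.
func firstReachable(addresses []string, probe probeFunc) (string, error) {
	if len(addresses) == 0 {
		return "", errors.New("no join addresses provided")
	}

	var errs []error
	for _, address := range addresses {
		if err := probe(address); err != nil {
			// Remember why this address failed and move on to the next one.
			errs = append(errs, fmt.Errorf("%s: %w", address, err))
			continue
		}
		return address, nil
	}

	// Every address failed: report all failures together.
	return "", fmt.Errorf("no reachable cluster member: %w", errors.Join(errs...))
}
```

Collecting the per-address errors is a design choice of this sketch: if every address fails, the caller sees why each one failed rather than only the last error.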

This PR fixes the issue at the k8s-snap level for worker nodes. The integration test only passes with the microcluster upgrade from #562.

@bschimke95 bschimke95 changed the title clustering race test Improve join logic to handle unreachable nodes Jul 19, 2024
@bschimke95 bschimke95 marked this pull request as ready for review July 19, 2024 15:42
@bschimke95 bschimke95 requested a review from a team as a code owner July 19, 2024 15:42
@bschimke95 bschimke95 force-pushed the bschimke95/debug-upgrade-race-condition branch from 766c1ea to 5d406dc Compare July 19, 2024 15:43
@bschimke95 bschimke95 force-pushed the bschimke95/debug-upgrade-race-condition branch from 71a8fa4 to 55382de Compare July 23, 2024 08:01
@bschimke95 bschimke95 changed the base branch from main to bschimke95/upgrade-microcluster July 23, 2024 09:28
@bschimke95

Rebased on top of #562 to verify that this fix works.


@louiseschmidtgen louiseschmidtgen left a comment


LGTM, thanks Ben!


@berkayoz berkayoz left a comment


LGTM

@bschimke95 bschimke95 force-pushed the bschimke95/upgrade-microcluster branch from 2ab601e to af9e3ac Compare July 30, 2024 12:45
@bschimke95 bschimke95 force-pushed the bschimke95/debug-upgrade-race-condition branch from b172853 to bafed4c Compare July 30, 2024 13:51
Base automatically changed from bschimke95/upgrade-microcluster to main July 30, 2024 19:06
@bschimke95 bschimke95 merged commit f7167b5 into main Jul 31, 2024
17 checks passed
@bschimke95 bschimke95 deleted the bschimke95/debug-upgrade-race-condition branch July 31, 2024 07:21
Comment on lines +73 to +83
// Get remote certificate from the cluster member. We only need one node to be reachable for this.
// One might fail because the node is not part of the cluster anymore but was at the time the token was created.
var cert *x509.Certificate
var address string
var err error
for _, address = range token.JoinAddresses {
	cert, err = utils.GetRemoteCertificate(address)
	if err == nil {
		break
	}
}

In general, try to avoid mixing dependency upgrades along with changes in the code logic, as it would be harder to backport and/or revert if required.

bschimke95 added a commit that referenced this pull request Aug 5, 2024
---------

Co-authored-by: Mateo Florido <32885896+mateoflorido@users.noreply.github.com>
Co-authored-by: Louise K. Schmidtgen <louise.schmidtgen@canonical.com>