Improve join logic to handle unreachable nodes #560
Rebased on top of #562 to verify that this fix works.
LGTM, thanks Ben!
LGTM
Co-authored-by: Mateo Florido <32885896+mateoflorido@users.noreply.github.com>
Co-authored-by: Louise K. Schmidtgen <louise.schmidtgen@canonical.com>
```go
// Get remote certificate from the cluster member. We only need one node to be reachable for this.
// One might fail because the node is not part of the cluster anymore but was at the time the token was created.
var cert *x509.Certificate
var address string
var err error
for _, address = range token.JoinAddresses {
	cert, err = utils.GetRemoteCertificate(address)
	if err == nil {
		break
	}
}
```
In general, try to avoid mixing dependency upgrades with changes to the code logic, as that makes the change harder to backport or revert if required.
Fixes an issue where the join process would fail if a node in the join_address list is unreachable.
When a token is created, it includes the IP addresses of all nodes in the cluster. If a node was removed between token creation and the join, the join process previously failed because it tried to connect to the node that no longer exists.
Now, the client discovery logic will continue attempting to connect to other nodes in the join_address list instead of failing early. This ensures that the join process can succeed as long as at least one node is reachable.
This PR fixes the issue at the k8s-snap level for worker nodes and requires the microcluster upgrades from #562 for the integration test to pass.