Fix leader election request timeout #3027

Closed
serathius opened this issue Nov 27, 2024 · 2 comments · Fixed by #3028

Comments

@serathius

controller-runtime configures leader election through the resourcelock.New function. A lock built this way is known to have an incorrectly configured request timeout: the request timeout is equal to the leader election deadline, so a single request timing out is enough to trigger a change in leadership.

Source:

    return resourcelock.New(options.LeaderElectionResourceLock,
        options.LeaderElectionNamespace,
        options.LeaderElectionID,
        corev1Client,
        coordinationClient,
        resourcelock.ResourceLockConfig{
            Identity:      id,
            EventRecorder: recorderProvider.GetEventRecorderFor(id),
        })

Impact:

This issue causes unnecessary leader changes, which can cause:

  • Lower availability - a new leader might need to reinitialize its informers, which can take tens of seconds in large clusters.
  • Waste of resources - concurrent re-initializations increase API server load, potentially triggering a KCP scale-up.

Fix:

Update controller-runtime to use resourcelock.NewFromKubeconfig for leader election. This will ensure that the request timeout is correctly configured and prevent unnecessary leadership changes due to transient network issues or API server unavailability. This change should involve approximately 10 lines of code.

Example:

kubernetes/kubernetes#98059

@alvaroaleman
Member

@serathius thanks for the report, I've opened #3028 to fix it

Why is resourcelock.New not deprecated if it has known issues?

@serathius
Author

> Why is resourcelock.New not deprecated if it has known issues?

No breaking change policy :P
The issue is just that resourcelock.New lets users configure the deadline and the timeout independently, so most users are not aware that the relationship between those two parameters can affect reliability.
