controller-runtime uses the resourcelock.New function to configure leader election. This function is known to leave the request timeout incorrectly configured: the request timeout ends up equal to the leader election deadline, so a single timed-out request can trigger a change in leadership.
This issue causes unnecessary leader changes, which can lead to:
Lower availability - a new leader might need to reinitialize its informers, which can take tens of seconds in large clusters.
Waste of resources - concurrent re-initializations increase API server load, potentially triggering a KCP scale-up.
Fix:
Update controller-runtime to use resourcelock.NewFromKubeconfig for leader election. This will ensure that the request timeout is correctly configured and prevent unnecessary leadership changes due to transient network issues or API server unavailability. This change should involve approximately 10 lines of code.
Why is resourcelock.New not deprecated if it has known issues?
No breaking change policy :P
The issue is just that resourcelock.New lets users configure the deadline and the timeout independently, so most users are not aware that the relation between those two parameters can impact reliability.
Source:
controller-runtime/pkg/leaderelection/leader_election.go
Lines 101 to 109 in 8e44a43
Example:
kubernetes/kubernetes#98059