fix: Prevent data race from global metrics round-tripper #13641
Merged
Fixes #13637
Motivation
This line introduces a data race by globally storing a round-tripper used by the Kubernetes client. If a new request starts before the first one completes, and the first request attempts to use the same round-tripper it originally created before the second finishes upgrading its connection, it will result in a panic due to a nil `net.Conn` in the underlying SPDY implementation. As a result, if the controller makes too many new connections to the API server in a short enough period of time, it will crash and restart.

While it is sensible to store a global handle to the metrics that this round-tripper records, storing the round-tripper itself is not.
Modifications
This patch retains the "context" of the metrics (i.e. the actual `ctx` value, plus the handle to the metrics themselves) as global, but scopes the round-tripper to each connection. It does so by refactoring the context into its own type and global variable, and by embedding a pointer to it in the round-tripper implementation to avoid needing downstream changes.
Verification
E2E functional tests were run locally. Additionally, the PR tests identified in the issue as failing as a result of this race now pass, and the controller logs from running without the race detector were visually inspected to confirm that no panics occurred after multiple re-runs of the originally failing test.