Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix: Prevent data race from global metrics round-tripper #13641

Merged
merged 1 commit into from
Sep 23, 2024

Conversation

meln5674
Copy link
Contributor

@meln5674 meln5674 commented Sep 21, 2024

Fixes #13637

Motivation

This line introduces a data race by globally storing a round-tripper used by the kubernetes client. If a new request starts before the first one completes, and the first request attempts to use the same round-tripper it originally created before the second finishes upgrading its connection, it will result in a panic due to a nil net.Conn in the underlying SPDY implementation. As a result, if the controller makes too many new connections to the API server in a short-enough period of time, it will crash and restart.

While it is sensible to store a global handle to the metrics that this round-tripper records, storing the round-tripper itself is not.

Modifications

This patch retains the "context" of the metrics (i.e. the actual ctx value, plus the handle to the metrics themselves) as global, but scopes the round-tripper to each connection by refactoring the context to its own type and global variable, and uses an embedded pointer to it in the round-tripper implementation to avoid needing downstream changes.

Verification

E2E functional tests were run locally. Additionally, the PR tests identified in the issue as failing as a result of this race now pass, and the controller logs from running without the race detector were visually inspected to confirm that no panics occurred after multiple re-runs of the originally failing test.

Signed-off-by: Andrew Melnick <meln5674@kettering.edu>
@meln5674
Copy link
Contributor Author

Can anyone comment on this test failure? Why is it expecting 404? That metric seems to be reporting healthily here. Was this maybe a hack added to work around something related to this same issue?

@Joibel
Copy link
Member

Joibel commented Sep 21, 2024

I believe this is a flakey test that does just sometimes fail. I don't remember introducing it when the roundtripper was added. I've kicked it to run again.

Anyway, that's for finding this and testing the RC. I will take a proper look when I have access to a proper computer.

@meln5674
Copy link
Contributor Author

meln5674 commented Sep 21, 2024

Regarding the test-executor failure, it looks like the executor image is going missing between sometime after its loaded at the start of the test, and it looks like the same thing is happening on a lot of recent runs of it. I've seen similar issues before in my own repo's actions when the runner is low on storage and kubelet tries to garbage collect images I need later in the test. You can mitigate this with the imageGC{High,Low}ThresholdPercent kubelet config fields, which are easy enough to set in KinD, but I've never done it in K3s.

Copy link
Member

@agilgur5 agilgur5 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for investigating and your detailed root cause analysis!
If you want to improve on anything else in the repo, like other data races, flakey tests (which are themselves sometimes due to races), race detecting tests, or anything else, that'd be great 🙂

This LGTM but I'll defer to Alan for final approval and merge since he wrote this code recently

@agilgur5 agilgur5 added area/controller Controller issues, panics area/metrics labels Sep 21, 2024
@agilgur5 agilgur5 added this to the v3.6.0 milestone Sep 21, 2024
Copy link
Member

@Joibel Joibel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice find, thank you for fixing it.

@Joibel Joibel enabled auto-merge (squash) September 23, 2024 07:29
@Joibel Joibel merged commit 5244064 into argoproj:main Sep 23, 2024
29 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area/controller Controller issues, panics area/metrics
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3.6.0-rc1: Controller panics due to metrics round-tripper data race
3 participants