
fix(controller): fix race condition in updating ephemeral metadata #3975

Merged


@y-rabie y-rabie commented Dec 3, 2024

Checklist:

  • Either (a) I've created an enhancement proposal and discussed it with the community, (b) this is a bug fix, or (c) this is a chore.
  • The title of the PR is (a) conventional with a list of types and scopes found here, (b) states what changed, and (c) suffixes the related issues number. E.g. "fix(controller): Updates such and such. Fixes #1234".
  • I've signed my commits with DCO
  • I have written unit and/or e2e tests for my change. PRs without these are unlikely to be merged.
  • My builds are green. Try syncing with master if they are not.
  • My organization is added to USERS.md.

Currently, the order of the two steps for updating ephemeral metadata can leave some stable pods with the old ephemeral metadata. As the comment in the code notes, if we first fetch the set of pods and update them, and only then update the ReplicaSet template spec, any pods created in the interim are created from the not-yet-updated template spec and therefore carry the old ephemeral metadata.

On top of that, the function begins with this check:

modifiedRS, modified := replicasetutil.SyncReplicaSetEphemeralPodMetadata(rs, podMetadata)
if !modified {
  return nil
}

As a result, the next sync won't go in and update those dangling pods: by then the ReplicaSet itself is already up to date, so modified is false and the function returns early.
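To make the ordering problem concrete, here is a minimal, self-contained simulation. It is only a sketch and not the controller code: the `pod` and `replicaSet` types, the single `role` string standing in for ephemeral metadata, and the `syncRole`, `stalePods`, and `simulate` helpers are all hypothetical names invented for illustration. It assumes the remedy is to update the template before the pods.

```go
package main

import "fmt"

// pod and replicaSet are toy stand-ins for the Kubernetes objects (hypothetical).
// A single "role" string stands in for the ephemeral metadata.
type pod struct{ role string }

type replicaSet struct {
	templateRole string
	pods         []*pod
}

// createPod mimics a new pod being stamped from the current template,
// for example after an eviction.
func (rs *replicaSet) createPod() {
	rs.pods = append(rs.pods, &pod{role: rs.templateRole})
}

// syncRole applies the desired role in one of two orders. interim runs
// between the two steps to simulate a pod being created mid-sync.
func syncRole(rs *replicaSet, desired string, templateFirst bool, interim func()) {
	// Mirrors the `if !modified { return nil }` guard: once the template
	// matches, the sync does nothing, including for stale pods.
	if rs.templateRole == desired {
		return
	}
	if templateFirst {
		rs.templateRole = desired
		interim() // the new pod already gets the desired role
		for _, p := range rs.pods {
			p.role = desired
		}
		return
	}
	for _, p := range rs.pods {
		p.role = desired
	}
	interim() // the new pod is created from the old template
	rs.templateRole = desired
}

// stalePods counts pods whose role differs from the desired one.
func stalePods(rs *replicaSet, desired string) int {
	n := 0
	for _, p := range rs.pods {
		if p.role != desired {
			n++
		}
	}
	return n
}

// simulate runs two syncs, with one pod created during the first, and
// returns how many pods are still stale after the second sync.
func simulate(templateFirst bool) int {
	rs := &replicaSet{templateRole: "canary"}
	rs.createPod()
	rs.createPod()
	syncRole(rs, "stable", templateFirst, rs.createPod)
	syncRole(rs, "stable", templateFirst, func() {})
	return stalePods(rs, "stable")
}

func main() {
	fmt.Println("pods first, stale pods:", simulate(false))
	fmt.Println("template first, stale pods:", simulate(true))
}
```

In this toy model the pods-first order leaves one stale pod that the second sync never repairs, because the early return fires. Updating the template first leaves none.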

I ran into this at a scale of 500 pods, with Karpenter consolidation actively evicting pods and causing new ones to be created, which makes this kind of race condition likely.


github-actions bot commented Dec 3, 2024

Published E2E Test Results

  4 files    4 suites   3h 15m 11s ⏱️
113 tests 104 ✅  7 💤 2 ❌
454 runs  424 ✅ 28 💤 2 ❌

For more details on these failures, see this check.

Results for commit 4a0dc71.

♻️ This comment has been updated with latest results.


github-actions bot commented Dec 3, 2024

Published Unit Test Results

2 280 tests   2 280 ✅  2m 59s ⏱️
  128 suites      0 💤
    1 files        0 ❌

Results for commit 4a0dc71.

♻️ This comment has been updated with latest results.

Signed-off-by: Youssef Rabie <youssef.rabie@procore.com>
@y-rabie y-rabie force-pushed the fix-ephemeral-metadata-race-condition branch from 3690743 to 4a0dc71 Compare December 3, 2024 20:58

sonarqubecloud bot commented Dec 3, 2024


codecov bot commented Dec 3, 2024

Codecov Report

Attention: Patch coverage is 73.33333% with 4 lines in your changes missing coverage. Please review.

Project coverage is 82.72%. Comparing base (5f59344) to head (4a0dc71).
Report is 13 commits behind head on master.

Files with missing lines | Patch % | Lines
rollout/ephemeralmetadata.go | 73.33% | 3 Missing and 1 partial ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##           master    #3975      +/-   ##
==========================================
+ Coverage   82.69%   82.72%   +0.02%     
==========================================
  Files         163      163              
  Lines       22895    22903       +8     
==========================================
+ Hits        18934    18947      +13     
+ Misses       3087     3084       -3     
+ Partials      874      872       -2     


@zachaller zachaller added this to the v1.8 milestone Dec 4, 2024
@zachaller zachaller merged commit 1bfcd0c into argoproj:master Dec 5, 2024
25 of 26 checks passed
@y-rabie y-rabie deleted the fix-ephemeral-metadata-race-condition branch December 6, 2024 11:49
Rizwana777 pushed a commit to Rizwana777/argo-rollouts that referenced this pull request Dec 12, 2024
…rgoproj#3975)

Signed-off-by: Youssef Rabie <youssef.rabie@procore.com>
meeech pushed a commit to CircleCI-Public/argo-rollouts that referenced this pull request Feb 10, 2025
…rgoproj#3975)

Signed-off-by: Youssef Rabie <youssef.rabie@procore.com>
2 participants