Fix HPA race condition by reading deployment replicas instead of HPA status #4214
Conversation
Codecov Report ✅ All modified and coverable lines are covered by tests.

Coverage Diff (main → #4214):

|            | main   | #4214  | +/-    |
|------------|--------|--------|--------|
| Coverage   | 86.10% | 86.12% | +0.02% |
| Files      | 131    | 131    |        |
| Lines      | 14162  | 14171  | +9     |
| Branches   | 35     | 35     |        |
| Hits       | 12194  | 12205  | +11    |
| Misses     | 1765   | 1764   | -1     |
| Partials   | 203    | 202    | -1     |

View full report in Codecov by Sentry.
bjee19 left a comment:
Have you verified that the situation described in the bug report was resolved?
Yeah, we should try to bottleneck the reconciliation/patch logic and see how that goes (stability eventually happens), since this was a bigger pain point for prod environments.
salonichf5 left a comment:
🚀
The comment above answered my question.
@bjee19 @salonichf5 To the best of my abilities, yes. I created an HPA with min replicas of 1 and max of 5, and set replicas in the NginxProxy to 2. I manually changed the replicas to 5 using …
Force-pushed from c25cdb5 to 164a501.

Fix HPA race condition by reading deployment replicas instead of HPA status (#4214)

When HPA is enabled, read the current Deployment.Spec.Replicas directly instead of HPA.Status.DesiredReplicas, which is eventually consistent and lags behind deployment changes. This prevents the controller from overwriting HPA's replica count with stale values, eliminating pod churn and connection drops.

Fixes the race condition where HPA scales down → NGF reads stale HPA status → NGF overwrites the deployment with the old replica count → pods restart.
Proposed changes
Problem: When autoscaling.enable: true is configured in the Helm chart, the NGF controller updates the deployment and modifies the spec.replicas field in conflict with the HPA. This causes the deployment to scale up and down in the same second, resulting in constant pod churn and preventing the HPA from scaling up or down consistently.
Solution: When HPA is enabled, read the current Deployment.Spec.Replicas directly instead of HPA.Status.DesiredReplicas, which is eventually consistent and lags behind deployment changes. This prevents the controller from overwriting HPA's replica count with stale values, eliminating pod churn and connection drops.
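To make the fix concrete, here is a minimal Go sketch of the replica-resolution rule described above. The package name, the `resolveReplicas` helper, and its signature are assumptions for illustration only, not NGF's actual code.

```go
package provisioner

import (
	appsv1 "k8s.io/api/apps/v1"
)

// resolveReplicas returns the replica count the controller should write into
// the desired Deployment spec. Illustrative sketch:
//
//   - hpaEnabled=false: use the value configured in the Helm values / NginxProxy.
//   - hpaEnabled=true:  keep whatever is already set on the live Deployment
//     (Deployment.Spec.Replicas), so the controller never overwrites the HPA's
//     decision with a stale or configured count.
func resolveReplicas(hpaEnabled bool, configured *int32, current *appsv1.Deployment) *int32 {
	if !hpaEnabled {
		return configured
	}
	if current != nil && current.Spec.Replicas != nil {
		// HPA owns scaling; echo back the live value instead of HPA.Status.DesiredReplicas.
		replicas := *current.Spec.Replicas
		return &replicas
	}
	// No live Deployment yet (first reconcile): fall back to the configured value.
	return configured
}
```

Reading the live Deployment spec avoids the lag of HPA.Status.DesiredReplicas, so the controller's patch is always a no-op with respect to replicas while the HPA is active.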
Testing: Unit and local testing
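A small illustrative unit test against the sketch above (not NGF's actual test suite), showing that the configured value no longer overrides what the HPA has set on the live Deployment:

```go
package provisioner

import (
	"testing"

	appsv1 "k8s.io/api/apps/v1"
)

func int32Ptr(v int32) *int32 { return &v }

func TestResolveReplicasWithHPA(t *testing.T) {
	live := &appsv1.Deployment{}
	live.Spec.Replicas = int32Ptr(5) // HPA has scaled the deployment to 5

	// With HPA enabled, the live replica count must be preserved.
	if got := resolveReplicas(true, int32Ptr(2), live); got == nil || *got != 5 {
		t.Fatalf("expected live replica count 5 to be preserved, got %v", got)
	}

	// With HPA disabled, the configured value wins.
	if got := resolveReplicas(false, int32Ptr(2), live); got == nil || *got != 2 {
		t.Fatalf("expected configured replica count 2, got %v", got)
	}
}
```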
Please focus on (optional): If you have any specific areas where you would like reviewers to focus their attention or provide specific feedback, add them here.
Closes #4007
Checklist
Before creating a PR, run through this checklist and mark each as complete.
Release notes
If this PR introduces a change that affects users and needs to be mentioned in the release notes,
please add a brief note that summarizes the change.