
Conversation

Contributor

@ciarams87 ciarams87 commented Nov 5, 2025

Proposed changes

Problem: When autoscaling.enable: true is configured in the Helm chart, the NGF controller updates the deployment and modifies the spec.replicas field in conflict with the HPA. This causes the deployment to scale up and down in the same second, resulting in constant pod churn and preventing the HPA from scaling up or down consistently.

Solution: When HPA is enabled, read the current Deployment.Spec.Replicas directly instead of HPA.Status.DesiredReplicas, which is eventually consistent and lags behind deployment changes. This prevents the controller from overwriting HPA's replica count with stale values, eliminating pod churn and connection drops.
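
For context, here is a minimal Go sketch of the intended behavior. It is an illustration only, not the actual NGF code: the package and function names (provisioner, desiredReplicas) are hypothetical, and the real change lives in the controller's provisioning logic.

package provisioner // hypothetical package name, for illustration only

import (
    appsv1 "k8s.io/api/apps/v1"
)

// desiredReplicas picks the replica count to write onto the nginx Deployment.
// When the HPA owns scaling, it preserves whatever kube-controller-manager has
// already applied to the live Deployment spec instead of reading
// HPA.Status.DesiredReplicas, which is eventually consistent and can lag
// behind the Deployment.
func desiredReplicas(autoscalingEnabled bool, current *appsv1.Deployment, configured *int32) *int32 {
    if autoscalingEnabled && current != nil && current.Spec.Replicas != nil {
        return current.Spec.Replicas
    }
    // Autoscaling is disabled, or the Deployment does not exist yet: fall back
    // to the replica count configured via the Helm chart / NginxProxy.
    return configured
}

The key design point is that Deployment.Spec.Replicas is the field the HPA actually mutates (via the scale subresource), so echoing it back is a no-op from the HPA's perspective, whereas HPA.Status.DesiredReplicas can briefly disagree with the Deployment and trigger the churn described above.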

Testing: Unit and local testing

Please focus on (optional): If you have any specific areas where you would like reviewers to focus their attention or provide specific feedback, add them here.

Closes #4007

Checklist

Before creating a PR, run through this checklist and mark each as complete.

  • I have read the CONTRIBUTING doc
  • I have added tests that prove my fix is effective or that my feature works
  • I have checked that all unit tests pass after adding my changes
  • I have updated necessary documentation
  • I have rebased my branch onto main
  • I will ensure my PR is targeting the main branch and pulling from my branch from my own fork

Release notes

If this PR introduces a change that affects users and needs to be mentioned in the release notes,
please add a brief note that summarizes the change.

Preserve HPA replicas on deployment

@ciarams87 ciarams87 requested a review from a team as a code owner November 5, 2025 09:56
@github-actions github-actions bot added the bug (Something isn't working) label Nov 5, 2025

codecov bot commented Nov 5, 2025

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 86.12%. Comparing base (a5a0f72) to head (164a501).
⚠️ Report is 1 commit behind head on main.

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #4214      +/-   ##
==========================================
+ Coverage   86.10%   86.12%   +0.02%     
==========================================
  Files         131      131              
  Lines       14162    14171       +9     
  Branches       35       35              
==========================================
+ Hits        12194    12205      +11     
+ Misses       1765     1764       -1     
+ Partials      203      202       -1     

☔ View full report in Codecov by Sentry.

@github-project-automation github-project-automation bot moved this from 🆕 New to 🏗 In Progress in NGINX Gateway Fabric Nov 5, 2025
Contributor

@bjee19 bjee19 left a comment


Have you verified that the situation described in the bug report was resolved?

@salonichf5
Contributor

Have you verified that the situation described in the bug report was resolved?

Yeah, we should try to bottleneck the reconciliation/patch logic and see how that goes (stability eventually happens), since this was a bigger pain point for prod environments

Contributor

@salonichf5 salonichf5 left a comment


🚀

@salonichf5
Contributor

Have you verified that the situation described in the bug report was resolved?

Yeah, we should try to bottleneck the reconciliation/patch logic and see how that goes (stability eventually happens), since this was a bigger pain point for prod environments

The comment above answered my question.

@salonichf5 salonichf5 moved this from 🏗 In Progress to 👀 In Review in NGINX Gateway Fabric Nov 6, 2025
@ciarams87
Contributor Author

ciarams87 commented Nov 7, 2025

Have you verified that the situation described in the bug report was resolved?

@bjee19 @salonichf5 To the best of my abilities, yes.

I created an HPA with min replicas of 1 and max of 5, and set replicas in the NginxProxy to 2. I manually changed the replicas to 5 using kubectl scale, verified that the HPA scaled back down to 1, and that the correct manager was managing the replica field (i.e., kube-controller-manager):

k get deployments.apps gateway-nginx --show-managed-fields -o json | \
  jq '.metadata.managedFields[] | select(.fieldsV1."f:spec"."f:replicas")'
{
  "apiVersion": "apps/v1",
  "fieldsType": "FieldsV1",
  "fieldsV1": {
    "f:spec": {
      "f:replicas": {}
    }
  },
  "manager": "kube-controller-manager",
  "operation": "Update",
  "subresource": "scale"
}

…status

When HPA is enabled, read the current Deployment.Spec.Replicas directly
instead of HPA.Status.DesiredReplicas, which is eventually consistent and
lags behind deployment changes. This prevents the controller from
overwriting HPA's replica count with stale values, eliminating pod churn
and connection drops.

Fixes race condition where HPA scales down → NGF reads stale HPA status
→ NGF overwrites deployment with old replica count → pods restart.
@ciarams87 ciarams87 force-pushed the fix/hpa-autoscaling branch from c25cdb5 to 164a501 Compare November 7, 2025 09:36
@ciarams87 ciarams87 enabled auto-merge (squash) November 7, 2025 09:36
@ciarams87 ciarams87 merged commit 96032ac into main Nov 7, 2025
101 of 102 checks passed
@ciarams87 ciarams87 deleted the fix/hpa-autoscaling branch November 7, 2025 10:24
@github-project-automation github-project-automation bot moved this from 👀 In Review to ✅ Done in NGINX Gateway Fabric Nov 7, 2025
ciarams87 added a commit that referenced this pull request Nov 7, 2025
…status (#4214)

When HPA is enabled, read the current Deployment.Spec.Replicas directly
instead of HPA.Status.DesiredReplicas, which is eventually consistent and
lags behind deployment changes. This prevents the controller from
overwriting HPA's replica count with stale values, eliminating pod churn
and connection drops.

Fixes race condition where HPA scales down → NGF reads stale HPA status
→ NGF overwrites deployment with old replica count → pods restart.

Labels

bug (Something isn't working), release-notes

Projects

Status: Done

Development

Successfully merging this pull request may close these issues.

HPA and NGF Controller Conflicting

5 participants