
Conversation

Contributor

@ciarams87 ciarams87 commented Nov 5, 2025

Proposed changes

Problem: When autoscaling.enable: true is configured in the Helm chart, the NGF controller updates the deployment and modifies the spec.replicas field in conflict with the HPA. This causes the deployment to scale up and down in the same second, resulting in constant pod churn and preventing the HPA from scaling up or down consistently.

Solution: When HPA is enabled, read the current Deployment.Spec.Replicas directly instead of HPA.Status.DesiredReplicas, which is eventually consistent and lags behind deployment changes. This prevents the controller from overwriting HPA's replica count with stale values, eliminating pod churn and connection drops.
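
For context, here is a minimal Go sketch of the intended behavior. It is an illustration only, not the actual NGF code: the package and function names (provisioner, desiredReplicas) are hypothetical, and the real change lives in the controller's provisioning logic.

package provisioner // hypothetical package name, for illustration only

import (
    appsv1 "k8s.io/api/apps/v1"
)

// desiredReplicas picks the replica count to write onto the nginx Deployment.
// When the HPA owns scaling, it preserves whatever kube-controller-manager has
// already applied to the live Deployment spec instead of reading
// HPA.Status.DesiredReplicas, which is eventually consistent and can lag
// behind the Deployment.
func desiredReplicas(autoscalingEnabled bool, current *appsv1.Deployment, configured *int32) *int32 {
    if autoscalingEnabled && current != nil && current.Spec.Replicas != nil {
        return current.Spec.Replicas
    }
    // Autoscaling is disabled, or the Deployment does not exist yet: fall back
    // to the replica count configured via the Helm chart / NginxProxy.
    return configured
}

The key design point is that Deployment.Spec.Replicas is the field the HPA actually mutates (via the scale subresource), so echoing it back is a no-op from the HPA's perspective, whereas HPA.Status.DesiredReplicas can briefly disagree with the Deployment and trigger the churn described above.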

Testing: Unit and local testing

Please focus on (optional): If you have any specific areas where you would like reviewers to focus their attention or provide specific feedback, add them here.

Closes #4007

Checklist

Before creating a PR, run through this checklist and mark each as complete.

  • I have read the CONTRIBUTING doc
  • I have added tests that prove my fix is effective or that my feature works
  • I have checked that all unit tests pass after adding my changes
  • I have updated necessary documentation
  • I have rebased my branch onto main
  • I will ensure my PR is targeting the main branch and pulling from my branch from my own fork

Release notes

If this PR introduces a change that affects users and needs to be mentioned in the release notes,
please add a brief note that summarizes the change.

Preserve HPA replicas on deployment

@ciarams87 ciarams87 requested a review from a team as a code owner November 5, 2025 09:56
@github-actions github-actions bot added the bug (Something isn't working) label Nov 5, 2025

codecov bot commented Nov 5, 2025

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 86.12%. Comparing base (a5a0f72) to head (164a501).
⚠️ Report is 1 commit behind head on main.

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #4214      +/-   ##
==========================================
+ Coverage   86.10%   86.12%   +0.02%     
==========================================
  Files         131      131              
  Lines       14162    14171       +9     
  Branches       35       35              
==========================================
+ Hits        12194    12205      +11     
+ Misses       1765     1764       -1     
+ Partials      203      202       -1     

☔ View full report in Codecov by Sentry.

@github-project-automation github-project-automation bot moved this from 🆕 New to 🏗 In Progress in NGINX Gateway Fabric Nov 5, 2025
Contributor

@bjee19 bjee19 left a comment


Have you verified that the situation described in the bug report was resolved?

@salonichf5
Contributor

Have you verified that the situation described in the bug report was resolved?

Yeah, we should try to bottleneck the reconciliation/patch logic and see how that goes (stability eventually happens), since this was a bigger pain point for prod environments

Contributor

@salonichf5 salonichf5 left a comment


🚀

@salonichf5
Contributor

Have you verified that the situation described in the bug report was resolved?

Yeah, we should try to bottleneck the reconciliation/patch logic and see how that goes (stability eventually happens), since this was a bigger pain point for prod environments

The comment above answered my question.

@salonichf5 salonichf5 moved this from 🏗 In Progress to 👀 In Review in NGINX Gateway Fabric Nov 6, 2025
@ciarams87
Contributor Author

ciarams87 commented Nov 7, 2025

Have you verified that the situation described in the bug report was resolved?

@bjee19 @salonichf5 To the best of my abilities, yes.

I created an HPA with min replicas of 1 and max of 5, and set replicas in the NginxProxy to 2. I manually changed the replicas to 5 using kubectl scale, verified that the HPA scaled back down to 1, and that the correct manager was managing the replica field (i.e., kube-controller-manager):

k get deployments.apps gateway-nginx --show-managed-fields -o json | \
  jq '.metadata.managedFields[] | select(.fieldsV1."f:spec"."f:replicas")'
{
  "apiVersion": "apps/v1",
  "fieldsType": "FieldsV1",
  "fieldsV1": {
    "f:spec": {
      "f:replicas": {}
    }
  },
  "manager": "kube-controller-manager",
  "operation": "Update",
  "subresource": "scale"
}

…status

When HPA is enabled, read the current Deployment.Spec.Replicas directly
instead of HPA.Status.DesiredReplicas, which is eventually consistent and
lags behind deployment changes. This prevents the controller from
overwriting HPA's replica count with stale values, eliminating pod churn
and connection drops.

Fixes race condition where HPA scales down → NGF reads stale HPA status
→ NGF overwrites deployment with old replica count → pods restart.
@ciarams87 ciarams87 force-pushed the fix/hpa-autoscaling branch from c25cdb5 to 164a501 Compare November 7, 2025 09:36
@ciarams87 ciarams87 enabled auto-merge (squash) November 7, 2025 09:36
@ciarams87 ciarams87 merged commit 96032ac into main Nov 7, 2025
101 of 102 checks passed
@ciarams87 ciarams87 deleted the fix/hpa-autoscaling branch November 7, 2025 10:24
@github-project-automation github-project-automation bot moved this from 👀 In Review to ✅ Done in NGINX Gateway Fabric Nov 7, 2025
ciarams87 added a commit that referenced this pull request Nov 7, 2025
…status (#4214)

When HPA is enabled, read the current Deployment.Spec.Replicas directly
instead of HPA.Status.DesiredReplicas, which is eventually consistent and
lags behind deployment changes. This prevents the controller from
overwriting HPA's replica count with stale values, eliminating pod churn
and connection drops.

Fixes race condition where HPA scales down → NGF reads stale HPA status
→ NGF overwrites deployment with old replica count → pods restart.

Labels

bug (Something isn't working), release-notes

Projects

Status: Done

Development

Successfully merging this pull request may close these issues.

HPA and NGF Controller Conflicting

5 participants