
Fix scale test errors caused by upstream server count #2439

Merged
4 commits merged into nginxinc:main on Aug 23, 2024

Conversation

bjee19 (Contributor) commented Aug 21, 2024

Proposed changes

Problem: When the scale test runs with NGINX Plus and 648 upstream servers, it reports both NGF and NGINX Plus errors: at some point the upstream zone is no longer large enough to hold all of the upstream servers, so NGF fails to update NGINX Plus.

Solution: Reduce the scale test's upstream server count from 648 to 556 when it runs with NGINX Plus.

Testing: Scale test passed 5 consecutive times.
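
A minimal sketch of what the change above amounts to in scale_test.go, assuming the test can tell whether it is running against NGINX Plus; the flag and helper names here are hypothetical, not the actual diff:

	// Hypothetical helper: choose the upstream server count for the scale test.
	// NGINX Plus stores upstream servers in a fixed-size shared-memory zone,
	// so the Plus run uses the largest count that passed 5 consecutive runs.
	func scaleUpstreamServerCount(plusEnabled bool) int {
		if plusEnabled {
			return 556
		}
		return 648
	}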

Closes #2023

Checklist

Before creating a PR, run through this checklist and mark each as complete.

  • I have read the CONTRIBUTING doc
  • I have added tests that prove my fix is effective or that my feature works
  • I have checked that all unit tests pass after adding my changes
  • I have updated necessary documentation
  • I have rebased my branch onto main
  • I will ensure my PR is targeting the main branch and pulling from my branch from my own fork

Release notes

If this PR introduces a change that affects users and needs to be mentioned in the release notes,
please add a brief note that summarizes the change.

None

@github-actions github-actions bot added the documentation and tests labels Aug 21, 2024
codecov bot commented Aug 21, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 88.79%. Comparing base (1a9cea0) to head (957568f).
Report is 1 commit behind head on main.

Additional details and impacted files
@@           Coverage Diff           @@
##             main    #2439   +/-   ##
=======================================
  Coverage   88.79%   88.79%           
=======================================
  Files         100      100           
  Lines        7531     7531           
  Branches       50       50           
=======================================
  Hits         6687     6687           
  Misses        788      788           
  Partials       56       56           


bjee19 (Contributor, Author) commented Aug 21, 2024

I've added the change to scale_test.go such that when running with NGINX Plus, it only uses 556 upstream servers.

However, I encountered some issues where this fix had flaky results. Sometimes the test would pass and sometimes it would error.

I've recorded the results from multiple runs of the test. The files named big-node are runs on n2d-standard-16, which is the correct node size; I made a mistake and used a default e2-medium node for the other results.

Regardless of node size, I found that with an upstream server count of 557 (one above the recommended limit), the test always failed and reported errors in both the logs and the writeup. With 556, it sometimes passed and sometimes failed; when it did fail, the test writeup reported no errors even though the error was present in the logs.

Does anyone have any thoughts on how to proceed with this?

kate-osborn (Contributor) commented:

@bjee19

> With 556, it sometimes passed and sometimes failed; when it did fail, the test writeup reported no errors even though the error was present in the logs.

What's the error in the logs? Is it the upstream server full error?

bjee19 (Contributor, Author) commented Aug 22, 2024

@kate-osborn Yep, from the nginx logs it's:

2024/08/21 20:52:25 [crit] 121#121: ngx_slab_alloc() failed: no memory in upstream zone "scale_backend_80"
2024/08/21 20:52:25 [notice] 28#28: signal 1 (SIGHUP) received from 7, reconfiguring
2024/08/21 20:52:25 [notice] 28#28: reconfiguring
2024/08/21 20:52:25 [crit] 28#28: ngx_slab_alloc() failed: no memory in upstream zone "scale_backend_80"

And from the NGF logs it's:

{"level":"error","ts":"2024-08-21T20:52:25Z","logger":"eventLoop.eventHandler","msg":"couldn't update upstream via the API, reloading configuration instead","batchID":287,"upstreamName":"scale_backend_80","error":"failed to update servers of scale_backend_80 upstream: failed to add 10.8.0.124:8080 server to scale_backend_80 upstream: expected 201 response, got 500. error.status=500; error.text=upstream memory exhausted; error.code=UpstreamOutOfMemory; request_id=6b9527946bd39bb7c95829698260544d; href=https://nginx.org/en/docs/http/ngx_http_api_module.html","stacktrace":"github.com/nginxinc/nginx-gateway-fabric/internal/mode/static.(*eventHandlerImpl).updateUpstreamServers\n\tgithub.com/nginxinc/nginx-gateway-fabric/internal/mode/static/handler.go:374\ngithub.com/nginxinc/nginx-gateway-fabric/internal/mode/static.(*eventHandlerImpl).HandleEventBatch\n\tgithub.com/nginxinc/nginx-gateway-fabric/internal/mode/static/handler.go:200\ngithub.com/nginxinc/nginx-gateway-fabric/internal/framework/events.(*EventLoop).Start.func1.1\n\tgithub.com/nginxinc/nginx-gateway-fabric/internal/framework/events/loop.go:74"}

kate-osborn (Contributor) commented:

> [quotes the nginx and NGF logs above]

Ok, so we have two problems:

  1. Our test isn't collecting errors consistently
  2. Our upper limit on N+ upstream servers isn't accurate enough.

Let's focus on number 2. I think we need to lower the N+ upstream count until we get 5 consecutive passing runs. You can iterate more quickly by running only the scale upstreams test, by putting an F in front of the It:

	FIt(fmt.Sprintf("scales upstream servers to %d", upstreamServerCount), func() {
		const testName = "TestScale_UpstreamServers"

		testResultsDir := filepath.Join(resultsDir, testName)
		Expect(os.MkdirAll(testResultsDir, 0o755)).To(Succeed())

		runTestWithMetricsAndLogs(
			testName,
			testResultsDir,
			func() {
				runScaleUpstreams()
			},
		)
	})
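
(FIt is Ginkgo's focused-spec marker: when any focused specs are present, Ginkgo runs only those, which makes iterating on a single test much faster. The F needs to be removed again before merging.)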

@github-actions github-actions bot removed the documentation label Aug 22, 2024
@bjee19 bjee19 marked this pull request as ready for review August 22, 2024 20:51
@bjee19 bjee19 requested a review from a team as a code owner August 22, 2024 20:51
bjee19 (Contributor, Author) commented Aug 22, 2024

Update to the above: there was a testing error on my part that produced false negatives for these changes. After fixing my testing procedure, setting the upstream server count to 556 (the number we suspected would work, from the original manual test) lets the scale test pass correctly.

@bjee19 bjee19 force-pushed the tests/fix-scale-upstream-nginx-plus branch from cf96dc8 to 77ded16 on August 22, 2024
@github-actions github-actions bot added the documentation label Aug 22, 2024
@bjee19 bjee19 force-pushed the tests/fix-scale-upstream-nginx-plus branch from 77ded16 to 247bead on August 22, 2024
@github-actions github-actions bot removed the documentation label Aug 22, 2024
kate-osborn (Contributor) left a comment:

LGTM

@bjee19 bjee19 force-pushed the tests/fix-scale-upstream-nginx-plus branch from 6682fce to 957568f on August 23, 2024
@bjee19 bjee19 merged commit 6136715 into nginxinc:main Aug 23, 2024
37 checks passed
Labels: tests (Pull requests that update tests)
Status: Done
Linked issue (closed by this PR): Scale upstreams test reports errors because zone size for NGINX Plus upstream (#2023)