
Fix scale test errors caused by upstream server count #2439

Merged
4 commits merged into nginxinc:main on Aug 23, 2024

Conversation

bjee19 (Contributor) commented Aug 21, 2024

Proposed changes

Problem: When the scale test runs with NGINX Plus and 648 upstream servers, it reports both NGF and NGINX Plus errors: at some point the upstream zone is no longer large enough to hold all of the upstream servers, so NGF fails to update NGINX Plus.

Solution: Reduce the scale test's upstream server count from 648 to 556 when it runs with NGINX Plus.

Testing: Scale test passed 5 consecutive times.
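
A minimal sketch of what the change above amounts to in scale_test.go, assuming the test can tell whether it is running against NGINX Plus; the flag and helper names here are hypothetical, not the actual diff:

	// Hypothetical helper: choose the upstream server count for the scale test.
	// NGINX Plus stores upstream servers in a fixed-size shared-memory zone,
	// so the Plus run uses the largest count that passed 5 consecutive runs.
	func scaleUpstreamServerCount(plusEnabled bool) int {
		if plusEnabled {
			return 556
		}
		return 648
	}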

Closes #2023

Checklist

Before creating a PR, run through this checklist and mark each as complete.

  • I have read the CONTRIBUTING doc
  • I have added tests that prove my fix is effective or that my feature works
  • I have checked that all unit tests pass after adding my changes
  • I have updated necessary documentation
  • I have rebased my branch onto main
  • I will ensure my PR is targeting the main branch and pulling from my branch from my own fork

Release notes

If this PR introduces a change that affects users and needs to be mentioned in the release notes,
please add a brief note that summarizes the change.

None

@github-actions github-actions bot added the documentation and tests labels Aug 21, 2024
codecov bot commented Aug 21, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 88.79%. Comparing base (1a9cea0) to head (957568f).
Report is 1 commit behind head on main.

Additional details and impacted files
@@           Coverage Diff           @@
##             main    #2439   +/-   ##
=======================================
  Coverage   88.79%   88.79%           
=======================================
  Files         100      100           
  Lines        7531     7531           
  Branches       50       50           
=======================================
  Hits         6687     6687           
  Misses        788      788           
  Partials       56       56           


bjee19 (Contributor, Author) commented Aug 21, 2024

I've added the change to scale_test.go such that when running with NGINX Plus, it only uses 556 upstream servers.

However, I encountered some issues where this fix had flaky results. Sometimes the test would pass and sometimes it would error.

I've recorded the results from multiple runs of the test. The files named big-node are runs on n2d-standard-16, which is the correct node size; I made a mistake and used a default e2-medium node for the other results.

Regardless of node size, I found that with an upstream server count of 557 (one above the recommended limit), the test always failed and reported errors in both the logs and the writeup. With 556, it sometimes passed and sometimes failed; when it did fail, the test writeup reported no errors even though the error was present in the logs.

Does anyone have any thoughts on how to proceed with this?

kate-osborn (Contributor) commented:

@bjee19

> With 556, it sometimes passed and sometimes failed; when it did fail, the test writeup reported no errors even though the error was present in the logs.

What's the error in the logs? Is it the upstream server full error?

bjee19 (Contributor, Author) commented Aug 22, 2024

@kate-osborn Yep, from the nginx logs it's:

2024/08/21 20:52:25 [crit] 121#121: ngx_slab_alloc() failed: no memory in upstream zone "scale_backend_80"
2024/08/21 20:52:25 [notice] 28#28: signal 1 (SIGHUP) received from 7, reconfiguring
2024/08/21 20:52:25 [notice] 28#28: reconfiguring
2024/08/21 20:52:25 [crit] 28#28: ngx_slab_alloc() failed: no memory in upstream zone "scale_backend_80"

And from the NGF logs it's:

{"level":"error","ts":"2024-08-21T20:52:25Z","logger":"eventLoop.eventHandler","msg":"couldn't update upstream via the API, reloading configuration instead","batchID":287,"upstreamName":"scale_backend_80","error":"failed to update servers of scale_backend_80 upstream: failed to add 10.8.0.124:8080 server to scale_backend_80 upstream: expected 201 response, got 500. error.status=500; error.text=upstream memory exhausted; error.code=UpstreamOutOfMemory; request_id=6b9527946bd39bb7c95829698260544d; href=https://nginx.org/en/docs/http/ngx_http_api_module.html","stacktrace":"github.com/nginxinc/nginx-gateway-fabric/internal/mode/static.(*eventHandlerImpl).updateUpstreamServers\n\tgithub.com/nginxinc/nginx-gateway-fabric/internal/mode/static/handler.go:374\ngithub.com/nginxinc/nginx-gateway-fabric/internal/mode/static.(*eventHandlerImpl).HandleEventBatch\n\tgithub.com/nginxinc/nginx-gateway-fabric/internal/mode/static/handler.go:200\ngithub.com/nginxinc/nginx-gateway-fabric/internal/framework/events.(*EventLoop).Start.func1.1\n\tgithub.com/nginxinc/nginx-gateway-fabric/internal/framework/events/loop.go:74"}

kate-osborn (Contributor) commented:

> [quotes the nginx and NGF logs above]

Ok, so we have two problems:

  1. Our test isn't collecting errors consistently
  2. Our upper limit on N+ upstream servers isn't accurate enough.

Let's focus on number 2. I think we need to lower the N+ upstream count until we get 5 consecutive passing runs. You can iterate more quickly by running only the scale upstreams test, by putting an F in front of the It:

	FIt(fmt.Sprintf("scales upstream servers to %d", upstreamServerCount), func() {
		const testName = "TestScale_UpstreamServers"

		testResultsDir := filepath.Join(resultsDir, testName)
		Expect(os.MkdirAll(testResultsDir, 0o755)).To(Succeed())

		runTestWithMetricsAndLogs(
			testName,
			testResultsDir,
			func() {
				runScaleUpstreams()
			},
		)
	})
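
(FIt is Ginkgo's focused-spec marker: when any focused specs are present, Ginkgo runs only those, which makes iterating on a single test much faster. The F needs to be removed again before merging.)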

@github-actions github-actions bot removed the documentation label Aug 22, 2024
@bjee19 bjee19 marked this pull request as ready for review August 22, 2024 20:51
@bjee19 bjee19 requested a review from a team as a code owner August 22, 2024 20:51
bjee19 (Contributor, Author) commented Aug 22, 2024

Update to the above: there was a testing error on my part that produced false negatives for these changes. After fixing my testing procedure, setting the upstream server count to 556 (the number we suspected would work, from the original manual test) lets the scale test pass correctly.

@bjee19 bjee19 force-pushed the tests/fix-scale-upstream-nginx-plus branch from cf96dc8 to 77ded16 on August 22, 2024
@github-actions github-actions bot added the documentation label Aug 22, 2024
@bjee19 bjee19 force-pushed the tests/fix-scale-upstream-nginx-plus branch from 77ded16 to 247bead on August 22, 2024
@github-actions github-actions bot removed the documentation label Aug 22, 2024
kate-osborn (Contributor) left a comment:

LGTM

@bjee19 bjee19 force-pushed the tests/fix-scale-upstream-nginx-plus branch from 6682fce to 957568f on August 23, 2024
@bjee19 bjee19 merged commit 6136715 into nginxinc:main Aug 23, 2024
37 checks passed
Labels: tests (Pull requests that update tests)
Status: Done
Linked issue (closed by this PR): Scale upstreams test reports errors because zone size for NGINX Plus upstream (#2023)