Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix(api) support HTTP/2 status and metrics requests #8687

Closed
wants to merge 2 commits into from

Conversation

rainest
Copy link
Contributor

@rainest rainest commented Apr 14, 2022

Summary

Rework the status and metrics endpoint implementations to make them not break on HTTP/2.

Full changelog

  • Add a new Unix socket listen that runs NGINX's stub_status on requests for /nginx_status at $PREFIX/nginx_status.sock.
  • Use lua-resty-http to access the new status socket listen in the status endpoint instead of using ngx.location.capture (which breaks if invoked during an HTTP/2 request) on the existing internal locations.
  • Ditto for Prometheus /metrics.

Issue reference

Discovered that Kong/kubernetes-ingress-controller#2343 did not work with HTTPS during post-release testing. We'd prefer to change the Kong end rather than disable HTTP/2 across the board on the controller end.

Implements @dndx's suggestion from #8528

Draft problems

Edit: 671e15e works around this: #8687 (comment)

This exposes a possible issue in spec/02-integration/02-cmd/02-start_stop_spec.lua:

cat: /home/runner/work/kong/kong/servroot/pids/nginx.pid: No such file or directory
1146.04   OK 635: kong start/stop #postgres deprecated properties prints a warning to stderr client_max_body_size
        __________
        FAILED  spec/02-integration/02-cmd/02-start_stop_spec.lua:641: kong start/stop #postgres deprecated properties prints a warning to stderr client_body_buffer_size
spec/02-integration/02-cmd/02-start_stop_spec.lua:608: 2022/04/14 20:33:48 [warn] the 'client_body_buffer_size' configuration property is deprecated, use 'nginx_http_client_body_buffer_size' instead
Error: ./kong/cmd/start.lua:64: nginx: [emerg] bind() to unix:/home/runner/work/kong/kong/servroot/nginx_status.sock failed (98: Address already in use)
nginx: [emerg] bind() to unix:/home/runner/work/kong/kong/servroot/nginx_status.sock failed (98: Address already in use)

This change adds a unix socket that is almost always (either admin or status listen enabled) present. I think I'm hitting something in those tests that has multiple overlapping Kong instances. The cat error just before the conflict suggests we were trying to check if an instance was still up and/or kill it but failed due to a missing pidfile, but I'm not familiar enough with the tests to say for sure. I can't reproduce it when testing locally--I get no PID conflicts, just complaints about Cassandra and Postgres availability in 3 out of 39 tests. The issue flakes/does not occur consistently.

Manual testing (do we have HTTP/2 test client stuff yet?): example.txt

Add a new Unix socket listen that runs NGINX's stub_status on requests
for /nginx_status at $PREFIX/nginx_status.sock.

Use lua-resty-http to access the new status socket listen in the status
endpoint instead of using ngx.location.capture (which breaks if invoked
during an HTTP/2 request) on the existing internal locations.
@rainest rainest force-pushed the fix/status-sock branch 5 times, most recently from 099c947 to fa46f3a Compare April 18, 2022 21:52
Wait for the nginx_status.sock socket to clear before starting a Kong
instance.
@rainest
Copy link
Contributor Author

rainest commented Apr 18, 2022

Alright, well, workaround is less than ideal for using a specific socket path, but does apparently avoid whatever the issue was. Guessing that since kill_all only sends a signal there's no guarantee when NGINX actually stops, but since the function's done the after_each() exits and continues with the next test.

Also tried:

  • pl.delete(<prefix>/*.sock), which I'm guessing didn't work because either it doesn't expand globs or doesn't work on socket files.
  • Moving the clean_prefix() call to after_each. This broke because some tests get unhappy if the prefix isn't around yet--the nginx_main_daemon = off can't deal with the extra logs about creating it again.

There's now an unrelated flake with one of the Cassandra tests that appears to just be "Cassandra slow", but I don't have permissions to trigger re-tests.

@rainest rainest marked this pull request as ready for review April 18, 2022 22:03
@rainest rainest requested a review from a team as a code owner April 18, 2022 22:03
@rainest
Copy link
Contributor Author

rainest commented Apr 18, 2022

#8690 probably obsoletes this, so closing on the assumption we'll use that.

@rainest rainest closed this Apr 19, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

1 participant