-
Notifications
You must be signed in to change notification settings - Fork 1.5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Introduce HTTP2 health check transport options #9022
Merged
Merged
Conversation
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Codecov ReportAttention:
Additional details and impacted files@@ Coverage Diff @@
## main #9022 +/- ##
==========================================
- Coverage 91.57% 91.54% -0.04%
==========================================
Files 316 316
Lines 17147 17154 +7
==========================================
+ Hits 15702 15703 +1
- Misses 1150 1154 +4
- Partials 295 297 +2 ☔ View full report in Codecov by Sentry. |
atoulme
reviewed
Nov 30, 2023
atoulme
reviewed
Nov 30, 2023
rmfitzpatrick
approved these changes
Nov 30, 2023
dloucasfx
force-pushed
the
http2ping
branch
2 times, most recently
from
November 30, 2023 22:36
a43ca0c
to
0a049f4
Compare
dmitryax
reviewed
Dec 7, 2023
Signed-off-by: Dani Louca <dlouca@splunk.com>
dmitryax
approved these changes
Dec 7, 2023
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
pantuza
pushed a commit
to pantuza/opentelemetry-collector
that referenced
this pull request
Dec 8, 2023
**Description:** This PR introduces options to enable `http/2` health check by exposing `HTTP2ReadIdleTimeout` and `HTTP2PingTimeout` The golang issues are: golang/go#59690 golang/go#36026 In summary, if due to environmental issue the underlying tcp connection used by the http/2 client in the exporter became unstable/unusable/unreachable, unlike http/1, the http/2 client does not forcibly close the connection and redial a new one, instead it keeps using it for 15 minutes (default value of OS `tcp_retries2` ) until the OS cleans it up and a new tcp connection gets established. From OTEL user perspective, one will see a spike in export failures/timeouts for ~15 minutes, this will happen for every connection that got into a bad state, after 15 minutes things will recover until next time the tcp connection gets into a bad state. **Testing:** - Run OTEL with one of the exporters that uses HTTP/2 client, example `signalfx` exporter - For simplicity use a single pipeline/exporter - In a different shell, run this to watch the tcp state of the established connection ``` while (true); do echo date; sudo netstat -anp | grep -E '<endpoin_ip_address(es)>' | sort -k 5; sleep 2; done ``` - From the netstat, take a note of the source port and the source IP address - replace <> from previous step `sudo iptables -A OUTPUT -s <source_IP> -p tcp --sport <source_Port> -j DROP` - Note how the OTEL exporter export starts timing out Expected Result: - A new connection should be established, similarly to http/1 and exports should succeed Actual Result: - The exports keep failing for ~ 15 minutes or for whatever the OS `tcp_retries2` is configured to - After 15 minutes, a new tcp connection is created and exports start working **Documentation:** Readme is updated with new settings Signed-off-by: Dani Louca <dlouca@splunk.com>
codeboten
pushed a commit
to open-telemetry/opentelemetry-collector-contrib
that referenced
this pull request
Dec 11, 2023
Same description as in open-telemetry/opentelemetry-collector#9022 This PR enables the HTTP2 health check to workaround the issue described here open-telemetry/opentelemetry-collector#9022 As to why I chose 10 seconds for `HTTP2ReadIdleTimeout` and 10 seconds for `HTTP2PingTimeout` Those values have been tested in production and they will result, in an active env (with default http timeout of 10 seconds and default retry settings), of a single export failure or (2 max) before the health check detects the corrupted tcp connection and closes it. The only drawback is if the connection was not used for over 10 seconds, we might end up sending unnecessary ping frames, which should not be an issue and if it became an issue, then we can tune those settings. The SFX exporter has multiples http clients: - Metric client, Trace client and Event client . Those client will have the http2 health check enabled by default as they share the same default config - Correlation client and Dimension client will NOT have the http2 health check enabled. We can revisit this if needed. **Link to tracking Issue:** <Issue number if applicable> **Testing:** <Describe what testing was performed and which tests were added.> - Run OTEL with one of the exporters that uses HTTP/2 client, example `signalfx` exporter - For simplicity use a single pipeline/exporter - In a different shell, run this to watch the tcp state of the established connection ``` while (true); do echo date; sudo netstat -anp | grep -E '<endpoin_ip_address(es)>' | sort -k 5; sleep 2; done ``` - From the netstat, take a note of the source port and the source IP address - replace <> from previous step `sudo iptables -A OUTPUT -s <source_IP> -p tcp --sport <source_Port> -j DROP` - Note how the OTEL exporter export starts timing out Expected Result: - A new connection should be established, similarly to http/1 and exports should succeed Actual Result: - The exports keep failing for ~ 15 minutes or for whatever the OS `tcp_retries2` is configured to - After 15 minutes, a new tcp connection is created and exports start working **Documentation:** <Describe the documentation added.> Readme is updated Signed-off-by: Dani Louca <dlouca@splunk.com>
dmitryax
pushed a commit
to open-telemetry/opentelemetry-collector-contrib
that referenced
this pull request
Dec 11, 2023
**Description:** This PR enables the HTTP2 health check to workaround the issue described here open-telemetry/opentelemetry-collector#9022 As to why I chose 10 seconds for `HTTP2ReadIdleTimeout` and ~~5 seconds~~ 10 seconds (see review comment) for `HTTP2PingTimeout` Those values have been tested in production and they will result, in an active env (with default http timeout of 10 seconds and default retry settings), of a single export failure at max before the health check detects the corrupted tcp connection and closes it. The only drawback is if the connection was not used for over 10 seconds, we might end up sending unnecessary ping frames, which should not be an issue and if it became an issue, then we can tune those settings. The SFX exporter has multiples http clients: - Metric client, Trace client and Event client . Those client will have the http2 health check enabled by default as they share the same default config - Correlation client and Dimension client will NOT have the http2 health check enabled. We can revisit this if needed. **Testing:** - Run OTEL with one of the exporters that uses HTTP/2 client, example `signalfx` exporter - For simplicity use a single pipeline/exporter - In a different shell, run this to watch the tcp state of the established connection ``` while (true); do echo date; sudo netstat -anp | grep -E '<endpoin_ip_address(es)>' | sort -k 5; sleep 2; done ``` - From the netstat, take a note of the source port and the source IP address - replace <> from previous step `sudo iptables -A OUTPUT -s <source_IP> -p tcp --sport <source_Port> -j DROP` - Note how the OTEL exporter export starts timing out Expected Result: - A new connection should be established, similarly to http/1 and exports should succeed Actual Result: - The exports keep failing for ~ 15 minutes or for whatever the OS `tcp_retries2` is configured to - After 15 minutes, a new tcp connection is created and exports start working **Documentation:** <Describe the documentation added.> Readme is updated **Disclaimer:** Not all HTTP/2 servers support H2 Ping, however, this should not be a concern as our ingest servers do support H2 ping. But if you are routing you can check if H2 ping is supported using this script golang/go#60818 (comment) Signed-off-by: Dani Louca <dlouca@splunk.com>
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
Description:
This PR introduces options to enable
http/2
health check by exposingHTTP2ReadIdleTimeout
andHTTP2PingTimeout
The golang issues are:
golang/go#59690
golang/go#36026
In summary, if due to environmental issue the underlying tcp connection used by the http/2 client in the exporter became unstable/unusable/unreachable, unlike http/1, the http/2 client does not forcibly close the connection and redial a new one, instead it keeps using it for 15 minutes (default value of OS
tcp_retries2
) until the OS cleans it up and a new tcp connection gets established.From OTEL user perspective, one will see a spike in export failures/timeouts for ~15 minutes, this will happen for every connection that got into a bad state, after 15 minutes things will recover until next time the tcp connection gets into a bad state.
Testing:
signalfx
exportersudo iptables -A OUTPUT -s <source_IP> -p tcp --sport <source_Port> -j DROP
Expected Result:
Actual Result:
tcp_retries2
is configured toDocumentation:
Readme is updated with new settings