Introduce HTTP2 health check transport options #9022

Merged (1 commit into open-telemetry:main on Dec 8, 2023)

Conversation

@dloucasfx (Contributor) commented Nov 30, 2023

Description:
This PR introduces options to enable the HTTP/2 health check by exposing `HTTP2ReadIdleTimeout` and `HTTP2PingTimeout`.

The related Go issues are:
golang/go#59690
golang/go#36026

In summary, if the underlying TCP connection used by an exporter's HTTP/2 client becomes unstable, unusable, or unreachable due to an environmental issue, the HTTP/2 client, unlike HTTP/1, does not forcibly close the connection and redial a new one. Instead, it keeps using the broken connection for about 15 minutes (the default value of the OS `tcp_retries2` setting) until the OS cleans it up and a new TCP connection gets established.
From an OTEL user's perspective, this shows up as a spike in export failures/timeouts lasting ~15 minutes. It happens for every connection that gets into a bad state; after 15 minutes things recover, until the next time a TCP connection goes bad.
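
For context, these options map onto the health-check fields of Go's `golang.org/x/net/http2` transport: with a non-zero `ReadIdleTimeout`, the client sends a PING frame on a connection that has seen no frames for that long, and closes the connection if no reply arrives within `PingTimeout`. A minimal sketch of enabling this directly on a standard client (the helper name and values here are illustrative, not the collector's actual API):

```go
package main

import (
	"log"
	"net/http"
	"time"

	"golang.org/x/net/http2"
)

// newHTTP2HealthCheckedClient returns an *http.Client whose HTTP/2
// transport pings the server after readIdleTimeout of read inactivity
// and closes the connection if no reply arrives within pingTimeout.
// The function name is illustrative, not the collector's API.
func newHTTP2HealthCheckedClient(readIdleTimeout, pingTimeout time.Duration) (*http.Client, error) {
	t1 := &http.Transport{}
	// Upgrade the HTTP/1 transport to also speak HTTP/2 and get a handle
	// on the underlying http2.Transport.
	t2, err := http2.ConfigureTransports(t1)
	if err != nil {
		return nil, err
	}
	// A zero ReadIdleTimeout (the Go default) disables the health check,
	// which is what allows the ~15 minute stall described above.
	t2.ReadIdleTimeout = readIdleTimeout
	t2.PingTimeout = pingTimeout
	return &http.Client{Transport: t1}, nil
}

func main() {
	client, err := newHTTP2HealthCheckedClient(10*time.Second, 10*time.Second)
	if err != nil {
		log.Fatal(err)
	}
	_ = client // use the client for exports as usual
}
```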

Testing:

  • Run OTEL with one of the exporters that uses an HTTP/2 client, for example the `signalfx` exporter
  • For simplicity, use a single pipeline/exporter
  • In a different shell, run this to watch the TCP state of the established connection:
    while true; do date; sudo netstat -anp | grep -E '<endpoint_ip_address(es)>' | sort -k 5; sleep 2; done
  • From the netstat output, take note of the source IP address and the source port
  • Substitute them into the placeholders and drop the connection's outgoing packets:
    sudo iptables -A OUTPUT -s <source_IP> -p tcp --sport <source_Port> -j DROP
  • Note how the OTEL exporter's exports start timing out

Expected Result:

  • A new connection should be established, similar to HTTP/1, and exports should succeed

Actual Result:

  • Exports keep failing for ~15 minutes, or for however long the OS `tcp_retries2` is configured
  • After 15 minutes, a new TCP connection is created and exports start working

Documentation:
The README is updated with the new settings.
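
As a rough sketch of what the exposed options look like from Go, assuming the settings struct carried the name `confighttp.HTTPClientSettings` at the time of this PR (it has since been renamed) and that the README documents the corresponding YAML keys as `http2_read_idle_timeout` and `http2_ping_timeout`; the endpoint here is a placeholder:

```go
package main

import (
	"fmt"
	"time"

	"go.opentelemetry.io/collector/config/confighttp"
)

func main() {
	// Hedged sketch: struct name as of this PR's release; the two HTTP2
	// fields are the ones this PR exposes.
	settings := confighttp.HTTPClientSettings{
		Endpoint:             "https://ingest.example.com", // placeholder endpoint
		Timeout:              10 * time.Second,
		HTTP2ReadIdleTimeout: 10 * time.Second, // ping after 10s of read inactivity
		HTTP2PingTimeout:     10 * time.Second, // close conn if no ping reply within 10s
	}
	fmt.Printf("%+v\n", settings)
}
```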


codecov bot commented Nov 30, 2023

Codecov Report

Attention: 3 lines in your changes are missing coverage. Please review.

Comparison: base (111037d) 91.57% vs. head (4718e11) 91.54%.
Report is 2 commits behind head on main.

| Files | Patch % | Lines |
| --- | --- | --- |
| config/confighttp/confighttp.go | 57.14% | 2 Missing and 1 partial ⚠️ |
Additional details and impacted files
```
@@            Coverage Diff             @@
##             main    #9022      +/-   ##
==========================================
- Coverage   91.57%   91.54%   -0.04%
==========================================
  Files         316      316
  Lines       17147    17154       +7
==========================================
+ Hits        15702    15703       +1
- Misses       1150     1154       +4
- Partials      295      297       +2
```


Signed-off-by: Dani Louca <dlouca@splunk.com>
@dmitryax (Member) left a comment:

LGTM

@dmitryax merged commit f9a38b2 into open-telemetry:main on Dec 8, 2023
31 of 32 checks passed
@github-actions bot added this to the next release milestone on Dec 8, 2023
pantuza pushed a commit to pantuza/opentelemetry-collector that referenced this pull request Dec 8, 2023
codeboten pushed a commit to open-telemetry/opentelemetry-collector-contrib that referenced this pull request Dec 11, 2023
Same description as in open-telemetry/opentelemetry-collector#9022.

Signed-off-by: Dani Louca <dlouca@splunk.com>
dmitryax pushed a commit to open-telemetry/opentelemetry-collector-contrib that referenced this pull request Dec 11, 2023
**Description:**
This PR enables the HTTP/2 health check to work around the issue described in open-telemetry/opentelemetry-collector#9022.

As to why I chose 10 seconds for `HTTP2ReadIdleTimeout` and ~~5 seconds~~ 10 seconds (see review comment) for `HTTP2PingTimeout`: these values have been tested in production, and in an active environment (with the default HTTP timeout of 10 seconds and default retry settings) they result in a single export failure (two at most) before the health check detects the corrupted TCP connection and closes it. The only drawback is that if the connection sits unused for more than 10 seconds, we may send unnecessary ping frames. That should not be an issue, and if it becomes one, these settings can be tuned. A back-of-the-envelope check follows below.
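
A quick sanity check of those numbers, under the assumption that the health check fires after `HTTP2ReadIdleTimeout` of read inactivity and then waits `HTTP2PingTimeout` for a reply before closing the connection:

```go
package main

import (
	"fmt"
	"time"
)

func main() {
	readIdle := 10 * time.Second    // HTTP2ReadIdleTimeout
	pingWait := 10 * time.Second    // HTTP2PingTimeout
	httpTimeout := 10 * time.Second // default HTTP client timeout

	// Worst case: the connection goes bad right after the last read, so
	// detection takes the full idle window plus the full ping wait.
	detection := readIdle + pingWait
	maxFailedExports := int(detection / httpTimeout)
	fmt.Printf("bad connection closed within ~%v => at most %d timed-out exports\n",
		detection, maxFailedExports) // ~20s => at most 2
}
```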

The SFX exporter has multiple HTTP clients:
- The Metric, Trace, and Event clients will have the HTTP/2 health check enabled by default, as they share the same default config
- The Correlation and Dimension clients will NOT have the HTTP/2 health check enabled. We can revisit this if needed.

**Testing:**
Same steps and results as in open-telemetry/opentelemetry-collector#9022.

**Documentation:**
The README is updated.

**Disclaimer:**
Not all HTTP/2 servers support the H2 PING frame. This should not be a concern here, as our ingest servers do support it. But if you are routing through another server, you can check whether H2 ping is supported using the script in golang/go#60818 (comment); a minimal probe is sketched below.
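
For convenience, a minimal probe along the same lines, a sketch rather than the script from the linked comment; the endpoint is a placeholder:

```go
package main

import (
	"context"
	"crypto/tls"
	"fmt"
	"log"
	"time"

	"golang.org/x/net/http2"
)

func main() {
	// Placeholder endpoint; replace with the server you route through.
	addr := "ingest.example.com:443"

	// Negotiate HTTP/2 via ALPN.
	conn, err := tls.Dial("tcp", addr, &tls.Config{NextProtos: []string{"h2"}})
	if err != nil {
		log.Fatalf("dial: %v", err)
	}
	defer conn.Close()
	if p := conn.ConnectionState().NegotiatedProtocol; p != "h2" {
		log.Fatalf("server did not negotiate h2 (got %q)", p)
	}

	// Wrap the raw connection in an HTTP/2 client conn and send one PING.
	cc, err := (&http2.Transport{}).NewClientConn(conn)
	if err != nil {
		log.Fatalf("client conn: %v", err)
	}
	ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
	defer cancel()
	if err := cc.Ping(ctx); err != nil {
		log.Fatalf("H2 ping failed: %v", err)
	}
	fmt.Println("H2 ping supported")
}
```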

Signed-off-by: Dani Louca <dlouca@splunk.com>