-
Notifications
You must be signed in to change notification settings - Fork 849
Fix issue where origins could be unintentionally marked as down #12729
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix issue where origins could be unintentionally marked as down #12729
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull request overview
This PR fixes a critical bug where origin servers were incorrectly marked as down when server session reuse was enabled. The issue was introduced in PR #9181 and affected versions 9.2.11, 10.1.0, and master.
Key Changes:
- Moved the pre-emptive
set_connect_fail(EIO)call from the generalORIGIN_SERVER_OPENstate action to the specificNET_EVENT_OPENhandler, ensuring it's only set when establishing a new connection - Added
clear_connect_fail()call in theCONNECT_EVENT_TXNhandler to properly clear connection failures for session reuse and multiplexed origin scenarios - Preserved existing
clear_connect_fail()behavior for successful connection handshake events
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
|
|
||
| // Pre-emptively set a server connect failure that will be cleared once a WRITE_READY is received from origin or | ||
| // bytes are received back | ||
| t_state.set_connect_fail(EIO); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I guess you're just moving existing code. However, can we set EIO only if we observe errors? IMO, we easily make a mistake with current approach.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the review.
I also thought it would be better to set EIO only when an actual error is detected, but due to time constraints I applied the current workaround-like approach.
I'll look into whether that approach is feasible, so please give me some more time.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let me know if you find it's tough. This approach is not ideal but fixes the bug, so I don't want to block for long time. We can clean it up later.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry for the delay.
I investigated the points you raised and implemented some changes.
Below is a summary of the investigation performed during the modification.
First, I found that the master branch already contains logic that invokes the set_connect_fail method when a connection attempt fails:
| (*entry)->t_state.set_connect_fail(lerrno); |
trafficserver/src/proxy/http/HttpSM.cc
Line 1874 in 90dbc21
| t_state.set_connect_fail(_netvc->lerrno); |
trafficserver/src/proxy/http/HttpSM.cc
Lines 1130 to 1136 in 90dbc21
| if (t_state.cause_of_death_errno == -UNKNOWN_INTERNAL_ERROR) { | |
| if (event == VC_EVENT_EOS) { | |
| t_state.set_connect_fail(EPIPE); | |
| } else { | |
| t_state.set_connect_fail(EIO); | |
| } | |
| } |
Since set_connect_fail is executed when a connection actually fails, it seemed unnecessary to pre-set EIO before the connection is made, so I looked into the reasoning.
My assumption is that the pre-set EIO exists to avoid updating connect_result when set_connect_fail is invoked after the connection has already succeeded.
There are several locations where set_connect_fail may run after the connection is established, but due to the following logic, only cause_of_death_errno is updated while connect_result is left unchanged.
This is because connect_result is set to 0 after the connection succeeds.
trafficserver/include/proxy/http/HttpTransact.h
Lines 931 to 936 in 90dbc21
| } else if (e == EIO || this->current.server->connect_result == EIO) { | |
| this->current.server->connect_result = e; | |
| } | |
| if (e != EIO) { | |
| this->cause_of_death_errno = e; | |
| } |
Based on these findings, I identified the following issues:
- The
set_connect_failmethod is used not only for connection-related failures but also for other types of errors (its name no longer matches its actual behavior). - The logic determines whether the error occurred during connection by relying on a pre-set
EIO, which is not ideal.
I was not able to resolve these issues perfectly cleanly, but I believe improvements are possible, and I applied the following changes:
- Renamed
set_connect_failtoset_fail. - Changed the initial value of
connect_resultto-UNKNOWN_INTERNAL_ERROR(instead ofEIO), matching the initial value ofcause_of_death_errnoand making the intent clearer. - Updated
set_failso that it determines whether the failure occurred during connection by checkingnext_actionand the value ofconnect_result. - Removed the logic that pre-sets
EIObefore a connection attempt.
Additionally, although slightly outside the main issue, I made the following related improvements:
- Added a
set_successmethod to reset bothconnect_resultandcause_of_death_errno, since callers ofclear_connect_failwere not resettingcause_of_death_errno. - Renamed
clear_connect_failtoset_connect_success. - Ensured that
set_successis always invoked upon successful connection by calling it withinhandle_http_server_open.
587a6cf to
bd1c192
Compare
src/proxy/http/ConnectingEntry.cc
Outdated
| if (lerrno != -UNKNOWN_INTERNAL_ERROR) { | ||
| (*entry)->t_state.set_fail(lerrno); | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In cases where a connection timeout occurs, the code path temporarily assigned EIO (5) before ultimately setting ETIMEDOUT (110), as shown in the logs below.
Assigning an incorrect intermediate errno is undesirable, so this change ensures that an inappropriate errno is not set during the transition.
[Dec 16 15:23:26.800] [ET_NET 0] DIAG: <ConnectingEntry.cc:48 (state_http_server_open)> (http_connect) entered inside ConnectingEntry::state_http_server_open
[Dec 16 15:23:26.800] [ET_NET 0] DIAG: <ConnectingEntry.cc:130 (state_http_server_open)> (http_connect) Stop 1 state machines waiting for failed origin
[Dec 16 15:23:26.800] [ET_NET 0] DIAG: <TLSEventSupport.cc:153 (callHooks)> (ssl) sslHandshakeHookState=TS_SSL_HOOK_PRE_CONNECT eventID=110
[Dec 16 15:23:26.800] [ET_NET 0] DIAG: <TLSEventSupport.cc:271 (callHooks)> (ssl) iterated to curHook=(nil)
[Dec 16 15:23:26.800] [ET_NET 0] DIAG: <HttpTransact.h:933 (set_fail)> (http) Setting upstream connection failure -19999 to 5
[Dec 16 15:23:26.800] [ET_NET 0] DIAG: <HttpSM.cc:2645 (main_handler)> (http) [0] VC_EVENT_INACTIVITY_TIMEOUT/TS_EVENT_VCONN_INACTIVITY_TIMEOUT, 105
[Dec 16 15:23:26.800] [ET_NET 0] DIAG: <HttpSM.cc:1796 (state_http_server_open)> (http_track) [0] entered inside state_http_server_open: VC_EVENT_INACTIVITY_TIMEOUT/TS_EVENT_VCONN_INACTIVITY_TIMEOUT
[Dec 16 15:23:26.800] [ET_NET 0] DIAG: <HttpSM.cc:1797 (state_http_server_open)> (http) [0] [&HttpSM::state_http_server_open, VC_EVENT_INACTIVITY_TIMEOUT/TS_EVENT_VCONN_INACTIVITY_TIMEOUT]
[Dec 16 15:23:26.800] [ET_NET 0] DIAG: <HttpTransact.h:933 (set_fail)> (http) Setting upstream connection failure 5 to 110
[Dec 16 15:23:26.800] [ET_NET 0] DIAG: <HttpTransact.cc:3432 (HandleResponse)> (http_trans) [0] Entering HttpTransact::HandleResponse
|
Thank you for cleaning up code. However, it's getting big than expected. Let's make another PR for cleanup and focus on bug fix on this PR. IMO, touching |
masaori335
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good.
|
[approve ci] |
Problem
#9181 introduced an issue where an origin server was marked as down even though a connection had been successfully established.
This issue occurs under the following conditions:
proxy.config.http.server_session_sharing.matchis set to a value other thannone(i.e., server session reuse is enabled).proxy.config.http.connect_attempts_rr_retries.The issue has been confirmed in the following branches/versions (other versions not tested):
Cause
When ATS begins processing an origin connection, it executes
t_state.set_connect_fail(EIO)to tentatively setconnect_resulttoEIO:trafficserver/src/proxy/http/HttpSM.cc
Line 8054 in 90dbc21
trafficserver/include/proxy/http/HttpTransact.h
Line 932 in 90dbc21
If server session reuse is not possible,
connect_resultis cleared once the connection is established:trafficserver/src/proxy/http/HttpSM.cc
Line 1860 in 90dbc21
However, when a server session is reused,
connect_resultis not cleared and remains set toEIO.This regression was triggered by the change introduced in #9181 .
Before the PR was merged,
t_state.set_connect_fail(EIO)was not executed when a server session was reused.After the PR, it is executed regardless of whether a server session is reused or not.
With
connect_resultincorrectly left asEIO, if the connection is closed after sending a request to the origin, the following call chain leads to execution ofHttpSM::mark_host_failure, causing thefail_countto be incremented:trafficserver/src/proxy/http/HttpTransact.cc
Line 3466 in 90dbc21
trafficserver/src/proxy/http/HttpTransact.cc
Line 3786 in 90dbc21
trafficserver/src/proxy/http/HttpTransact.cc
Line 3884 in 90dbc21
trafficserver/src/proxy/http/HttpSM.cc
Line 4630 in 90dbc21
trafficserver/src/proxy/http/HttpSM.cc
Line 5876 in 90dbc21
If this happens repeatedly and reaches the threshold defined by
proxy.config.http.connect_attempts_rr_retries, the origin server is incorrectly marked as down:trafficserver/src/proxy/http/HttpSM.cc
Lines 5876 to 5885 in 90dbc21
Since the connection to the origin is actually successful, marking it as down is incorrect.
Fix
Update the logic so that
t_state.set_connect_fail(EIO)is executed only when establishing a new connection to the origin (i.e., when a server session is not reused), and ensure thatconnect_resultis cleared once the connection succeeds.Additionally, when
multiplexed_originis true,connect_resultwas also not being cleared after a successful connection.In this case, although
t_state.set_connect_fail(EIO)is executed (see below), the lack of a corresponding clear operation results inconnect_resultremainingEIO:trafficserver/src/proxy/http/HttpSM.cc
Lines 5706 to 5723 in 90dbc21
This patch ensures that
connect_resultis cleared whenever the connection succeeds, regardless of whethermultiplexed_originis enabled.