Conversation

@yknoya
Contributor

@yknoya yknoya commented Dec 3, 2025

Problem

#9181 introduced an issue where an origin server was marked as down even though a connection had been successfully established.

This issue occurs under the following conditions:

  1. proxy.config.http.server_session_sharing.match is set to a value other than none (i.e., server session reuse is enabled).
  2. A server session is reused when connecting to the origin.
  3. The connection is closed after sending a request to the origin.
  4. Condition 3 occurs repeatedly until it reaches the threshold defined by proxy.config.http.connect_attempts_rr_retries.
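
For reference, a records.config-style sketch of a setup under which the issue can reproduce (values are illustrative; any match mode other than none qualifies, and newer ATS releases express the same settings in records.yaml):

CONFIG proxy.config.http.server_session_sharing.match STRING both
CONFIG proxy.config.http.connect_attempts_rr_retries INT 3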

The issue has been confirmed on master, 10.1.0, and 9.2.11 (other versions not tested).

Cause

When ATS begins processing an origin connection, it executes t_state.set_connect_fail(EIO) to tentatively set connect_result to EIO:

t_state.set_connect_fail(EIO);

this->current.server->connect_result = e;

If server session reuse is not possible, connect_result is cleared once the connection is established:

t_state.current.server->clear_connect_fail();

However, when a server session is reused, connect_result is not cleared and remains set to EIO.
This regression was triggered by the change introduced in #9181.

Before the PR was merged, t_state.set_connect_fail(EIO) was not executed when a server session was reused.
After the PR, it is executed regardless of whether a server session is reused or not.
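
To make the regression concrete, here is a standalone sketch (stand-in State and helpers, not the actual ATS source) contrasting the flow before and after #9181:

#include <cerrno>
#include <cstdio>

struct State {
  int  connect_result = 0;
  void set_connect_fail(int e) { connect_result = e; }
  void clear_connect_fail() { connect_result = 0; }
};

// Before #9181 (sketch): the tentative EIO was scoped to the new-connection
// path, so a reused session never saw it.
int connect_before(State &t_state, bool session_reused) {
  if (!session_reused) {
    t_state.set_connect_fail(EIO); // tentative failure
    t_state.clear_connect_fail();  // cleared once the connect succeeds
  }
  return t_state.connect_result;   // 0 either way
}

// After #9181 (sketch): EIO is pre-set unconditionally, but only the
// new-connection path clears it.
int connect_after(State &t_state, bool session_reused) {
  t_state.set_connect_fail(EIO);   // tentative failure
  if (!session_reused) {
    t_state.clear_connect_fail();  // new connection: cleared on success
  }                                // reused session: stays EIO (the bug)
  return t_state.connect_result;
}

int main() {
  State s;
  std::printf("after #9181, reused session: connect_result=%d (EIO)\n", connect_after(s, true));
  std::printf("after #9181, new connection: connect_result=%d\n", connect_after(s, false));
  std::printf("before #9181, reused session: connect_result=%d\n", connect_before(s, true));
}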

With connect_result incorrectly left as EIO, if the connection is closed after sending a request to the origin, the following call chain leads to execution of HttpSM::mark_host_failure, causing the fail_count to be incremented:

  1. handle_response_from_server(s);
  2. handle_server_connection_not_open(s);
  3. s->state_machine->do_hostdb_update_if_necessary();
  4. this->mark_host_failure(&t_state.dns_info, ts_clock::from_time_t(t_state.client_request_time));
  5. if (auto [down, fail_count] = info->active->increment_fail_count(time_down, t_state.txn_conf->connect_attempts_rr_retries);

If this happens repeatedly and reaches the threshold defined by proxy.config.http.connect_attempts_rr_retries, the origin server is incorrectly marked as down:

if (auto [down, fail_count] = info->active->increment_fail_count(time_down, t_state.txn_conf->connect_attempts_rr_retries);
    down) {
  char *url_str = t_state.hdr_info.client_request.url_string_get_ref(nullptr);
  std::string_view host_name{t_state.unmapped_url.host_get()};
  swoc::bwprint(error_bw_buffer, "CONNECT : {::s} connecting to {} for host='{}' url='{}' fail_count='{}' marking down",
                swoc::bwf::Errno(t_state.current.server->connect_result), t_state.current.server->dst_addr, host_name,
                swoc::bwf::FirstOf(url_str, "<none>"), fail_count);
  Log::error("%s", error_bw_buffer.c_str());
  SMDbg(dbg_ctl_http, "hostdb update marking IP: %s as down", addrbuf);
  ATS_PROBE2(hostdb_mark_ip_as_down, sm_id, addrbuf);

Since the connection to the origin is actually successful, marking it as down is incorrect.
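
To make the threshold arithmetic concrete, here is a minimal, self-contained sketch of the counting described above (not the actual HostDB code; the real increment_fail_count also takes a down-time argument):

#include <cstdio>
#include <utility>

struct HostRecord {
  int fail_count = 0;
  // Returns {marked_down, new_fail_count}, mirroring the structured binding
  // used in the snippet above.
  std::pair<bool, int> increment_fail_count(int rr_retries) {
    ++fail_count;
    return {fail_count >= rr_retries, fail_count};
  }
};

int main() {
  HostRecord host;
  const int  rr_retries = 3; // proxy.config.http.connect_attempts_rr_retries
  for (int i = 0; i < rr_retries; ++i) {
    // With connect_result stuck at EIO, every closed connection counts as a
    // failure even though the connect itself succeeded.
    if (auto [down, count] = host.increment_fail_count(rr_retries); down) {
      std::printf("fail_count=%d: host marked down\n", count);
    }
  }
}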

Fix

Update the logic so that t_state.set_connect_fail(EIO) is executed only when establishing a new connection to the origin (i.e., when a server session is not reused), and ensure that connect_result is cleared once the connection succeeds.
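
In outline, the fixed flow looks like the following standalone sketch (stand-in helpers, not the actual patch):

#include <cerrno>

struct State {
  int  connect_result = 0;
  void set_connect_fail(int e) { connect_result = e; }
  void clear_connect_fail() { connect_result = 0; }
};

int connect_fixed(State &t_state, bool session_reused) {
  if (!session_reused) {
    t_state.set_connect_fail(EIO);      // tentative, only for a brand-new connection
  }
  bool connection_established = true;   // assume the connect succeeds here
  if (connection_established) {
    t_state.clear_connect_fail();       // cleared on success, reused or not
  }
  return t_state.connect_result;        // 0 in both cases
}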

Additionally, when multiplexed_origin is true, connect_result was also not being cleared after a successful connection.
In this case, although t_state.set_connect_fail(EIO) is executed (see below), the lack of a corresponding clear operation results in connect_result remaining EIO:

if (multiplexed_origin) {
  EThread *ethread = this_ethread();
  if (nullptr != ethread->connecting_pool) {
    SMDbg(dbg_ctl_http_ss, "Queue multiplexed request");
    new_entry                      = new ConnectingEntry();
    new_entry->mutex               = this->mutex;
    new_entry->ua_txn              = _ua.get_txn();
    new_entry->handler             = (ContinuationHandler)&ConnectingEntry::state_http_server_open;
    new_entry->ipaddr.assign(&t_state.current.server->dst_addr.sa);
    new_entry->hostname            = t_state.current.server->name;
    new_entry->sni                 = this->get_outbound_sni();
    new_entry->cert_name           = this->get_outbound_cert();
    new_entry->is_no_plugin_tunnel = plugin_tunnel_type == HttpPluginTunnel_t::NONE;
    this->t_state.set_connect_fail(EIO);
    new_entry->connect_sms.insert(this);
    ethread->connecting_pool->m_ip_pool.insert(std::make_pair(new_entry->ipaddr, new_entry));
  }
}

This patch ensures that connect_result is cleared whenever the connection succeeds, regardless of whether multiplexed_origin is enabled.

@yknoya yknoya marked this pull request as ready for review December 4, 2025 00:55
@bryancall bryancall requested a review from Copilot December 4, 2025 02:49
@bryancall bryancall added this to the 10.2.0 milestone Dec 4, 2025
Contributor

Copilot AI left a comment


Pull request overview

This PR fixes a critical bug where origin servers were incorrectly marked as down when server session reuse was enabled. The issue was introduced in PR #9181 and affected versions 9.2.11, 10.1.0, and master.

Key Changes:

  • Moved the pre-emptive set_connect_fail(EIO) call from the general ORIGIN_SERVER_OPEN state action to the specific NET_EVENT_OPEN handler, ensuring it's only set when establishing a new connection
  • Added clear_connect_fail() call in the CONNECT_EVENT_TXN handler to properly clear connection failures for session reuse and multiplexed origin scenarios
  • Preserved existing clear_connect_fail() behavior for successful connection handshake events


@ezelkow1 ezelkow1 modified the milestones: 10.2.0, 9.2.12 Dec 8, 2025
@bryancall bryancall requested a review from masaori335 December 8, 2025 23:06

// Pre-emptively set a server connect failure that will be cleared once a WRITE_READY is received from origin or
// bytes are received back
t_state.set_connect_fail(EIO);
Contributor

@masaori335 masaori335 Dec 9, 2025


I guess you're just moving existing code. However, can we set EIO only if we observe errors? IMO, we can easily make a mistake with the current approach.

Contributor Author

@yknoya yknoya Dec 10, 2025


Thanks for the review.
I also thought it would be better to set EIO only when an actual error is detected, but due to time constraints I applied the current workaround-like approach.
I'll look into whether that approach is feasible, so please give me some more time.

Contributor

@masaori335 masaori335 Dec 11, 2025


Let me know if you find it's tough. This approach is not ideal, but it fixes the bug, so I don't want to block it for a long time. We can clean it up later.

Contributor Author (@yknoya)


Sorry for the delay.
I investigated the points you raised and implemented some changes.
Below is a summary of the investigation behind these changes.

First, I found that the master branch already contains logic that invokes the set_connect_fail method when a connection attempt fails:

(*entry)->t_state.set_connect_fail(lerrno);

t_state.set_connect_fail(_netvc->lerrno);

if (t_state.cause_of_death_errno == -UNKNOWN_INTERNAL_ERROR) {
  if (event == VC_EVENT_EOS) {
    t_state.set_connect_fail(EPIPE);
  } else {
    t_state.set_connect_fail(EIO);
  }
}

Since set_connect_fail is executed when a connection actually fails, it seemed unnecessary to pre-set EIO before the connection is made, so I looked into the reasoning.
My assumption is that the pre-set EIO exists to avoid updating connect_result when set_connect_fail is invoked after the connection has already succeeded.
There are several locations where set_connect_fail may run after the connection is established, but due to the following logic, only cause_of_death_errno is updated while connect_result is left unchanged, because connect_result has already been set to 0 once the connection succeeds.

} else if (e == EIO || this->current.server->connect_result == EIO) {
  this->current.server->connect_result = e;
}
if (e != EIO) {
  this->cause_of_death_errno = e;
}
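
A worked trace of that guard (a standalone sketch, not ATS code): once connect_result has been cleared to 0, a later failure such as EPIPE updates only cause_of_death_errno:

#include <cerrno>
#include <cstdio>

int main() {
  int connect_result       = 0;      // already cleared by the successful connect
  int cause_of_death_errno = -20000; // stand-in for -UNKNOWN_INTERNAL_ERROR
  int e                    = EPIPE;  // error arriving after the connect

  if (e == EIO || connect_result == EIO) {
    connect_result = e;              // not taken: e != EIO and result != EIO
  }
  if (e != EIO) {
    cause_of_death_errno = e;        // taken: records EPIPE
  }
  // connect_result stays 0; cause_of_death_errno becomes EPIPE (32 on Linux).
  std::printf("connect_result=%d cause_of_death_errno=%d\n", connect_result, cause_of_death_errno);
}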

Based on these findings, I identified the following issues:

  1. The set_connect_fail method is used not only for connection-related failures but also for other types of errors (its name no longer matches its actual behavior).
  2. The logic determines whether the error occurred during connection by relying on a pre-set EIO, which is not ideal.

I was not able to resolve these issues entirely, but I believe they can be improved, and I applied the following changes:

  • Renamed set_connect_fail to set_fail.
  • Changed the initial value of connect_result to -UNKNOWN_INTERNAL_ERROR (instead of EIO), matching the initial value of cause_of_death_errno and making the intent clearer.
  • Updated set_fail so that it determines whether the failure occurred during connection by checking next_action and the value of connect_result.
  • Removed the logic that pre-sets EIO before a connection attempt.

Additionally, although slightly outside the main issue, I made the following related improvements:

  • Added a set_success method to reset both connect_result and cause_of_death_errno, since callers of clear_connect_fail were not resetting cause_of_death_errno.
  • Renamed clear_connect_fail to set_connect_success.
  • Ensured that set_success is always invoked upon successful connection by calling it within handle_http_server_open.
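
As a rough sketch of the reorganized helpers (names taken from the summary above; the numeric placeholder for UNKNOWN_INTERNAL_ERROR is illustrative, and these cleanup changes were later split into a separate PR):

static constexpr int UNKNOWN_INTERNAL_ERROR = 20000; // placeholder value

struct UpstreamState {
  int connect_result       = -UNKNOWN_INTERNAL_ERROR; // no longer pre-set to EIO
  int cause_of_death_errno = -UNKNOWN_INTERNAL_ERROR;

  void set_connect_success() { connect_result = 0; } // was clear_connect_fail

  // New helper: reset both fields on a successful connection, since callers
  // of the old clear_connect_fail did not reset cause_of_death_errno.
  void set_success() {
    set_connect_success();
    cause_of_death_errno = -UNKNOWN_INTERNAL_ERROR;
  }
};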

@yknoya yknoya force-pushed the fix-unintentionally-marked-as-down branch from 587a6cf to bd1c192 on December 15, 2025 04:35
Comment on lines 142 to 144
if (lerrno != -UNKNOWN_INTERNAL_ERROR) {
  (*entry)->t_state.set_fail(lerrno);
}
Contributor Author (@yknoya)


In cases where a connection timeout occurs, the code path temporarily assigned EIO (5) before ultimately setting ETIMEDOUT (110), as shown in the logs below.
Assigning an incorrect intermediate errno is undesirable, so this change ensures that an inappropriate errno is not set during the transition.

[Dec 16 15:23:26.800] [ET_NET 0] DIAG: <ConnectingEntry.cc:48 (state_http_server_open)> (http_connect) entered inside ConnectingEntry::state_http_server_open
[Dec 16 15:23:26.800] [ET_NET 0] DIAG: <ConnectingEntry.cc:130 (state_http_server_open)> (http_connect) Stop 1 state machines waiting for failed origin
[Dec 16 15:23:26.800] [ET_NET 0] DIAG: <TLSEventSupport.cc:153 (callHooks)> (ssl) sslHandshakeHookState=TS_SSL_HOOK_PRE_CONNECT eventID=110
[Dec 16 15:23:26.800] [ET_NET 0] DIAG: <TLSEventSupport.cc:271 (callHooks)> (ssl) iterated to curHook=(nil)
[Dec 16 15:23:26.800] [ET_NET 0] DIAG: <HttpTransact.h:933 (set_fail)> (http) Setting upstream connection failure -19999 to 5
[Dec 16 15:23:26.800] [ET_NET 0] DIAG: <HttpSM.cc:2645 (main_handler)> (http) [0] VC_EVENT_INACTIVITY_TIMEOUT/TS_EVENT_VCONN_INACTIVITY_TIMEOUT, 105
[Dec 16 15:23:26.800] [ET_NET 0] DIAG: <HttpSM.cc:1796 (state_http_server_open)> (http_track) [0] entered inside state_http_server_open: VC_EVENT_INACTIVITY_TIMEOUT/TS_EVENT_VCONN_INACTIVITY_TIMEOUT
[Dec 16 15:23:26.800] [ET_NET 0] DIAG: <HttpSM.cc:1797 (state_http_server_open)> (http) [0] [&HttpSM::state_http_server_open, VC_EVENT_INACTIVITY_TIMEOUT/TS_EVENT_VCONN_INACTIVITY_TIMEOUT]
[Dec 16 15:23:26.800] [ET_NET 0] DIAG: <HttpTransact.h:933 (set_fail)> (http) Setting upstream connection failure 5 to 110
[Dec 16 15:23:26.800] [ET_NET 0] DIAG: <HttpTransact.cc:3432 (HandleResponse)> (http_trans) [0] Entering HttpTransact::HandleResponse

@masaori335
Contributor

Thank you for cleaning up the code. However, it's getting bigger than expected. Let's make another PR for the cleanup and keep this PR focused on the bug fix.

IMO, touching connect_result and cause_of_death_errno in the set_connect_fail() function is making things complicated. Anyway, let's discuss this further in another PR.

@yknoya
Contributor Author

yknoya commented Dec 19, 2025

We decided to split this PR into two parts: a bug fix and a cleanup.
The following commits will be reverted and moved to a separate PR:

Contributor

@masaori335 masaori335 left a comment


Looks good.

@bneradt
Contributor

bneradt commented Dec 19, 2025

[approve ci]

@masaori335 masaori335 merged commit f2e959f into apache:master Dec 24, 2025
15 checks passed
@yknoya yknoya deleted the fix-unintentionally-marked-as-down branch December 25, 2025 03:50