Conversation

@yknoya
Contributor

@yknoya yknoya commented Dec 3, 2025

Problem

#9181 introduced an issue where an origin server was marked as down even though a connection had been successfully established.

This issue occurs under the following conditions:

  1. proxy.config.http.server_session_sharing.match is set to a value other than none (i.e., server session reuse is enabled).
  2. A server session is reused when connecting to the origin.
  3. The connection is closed after sending a request to the origin.
  4. Condition 3 occurs repeatedly until it reaches the threshold defined by proxy.config.http.connect_attempts_rr_retries.
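
For reference, a records.config-style sketch of a setup under which the issue can reproduce (values are illustrative; any match mode other than none qualifies, and newer ATS releases express the same settings in records.yaml):

CONFIG proxy.config.http.server_session_sharing.match STRING both
CONFIG proxy.config.http.connect_attempts_rr_retries INT 3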

The issue has been confirmed on master, 10.1.0, and 9.2.11 (other versions not tested).

Cause

When ATS begins processing an origin connection, it executes t_state.set_connect_fail(EIO) to tentatively set connect_result to EIO:

t_state.set_connect_fail(EIO);

this->current.server->connect_result = e;

If server session reuse is not possible, connect_result is cleared once the connection is established:

t_state.current.server->clear_connect_fail();

However, when a server session is reused, connect_result is not cleared and remains set to EIO.
This regression was triggered by the change introduced in #9181.

Before the PR was merged, t_state.set_connect_fail(EIO) was not executed when a server session was reused.
After the PR, it is executed regardless of whether a server session is reused or not.
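
To make the regression concrete, here is a standalone sketch (stand-in State and helpers, not the actual ATS source) contrasting the flow before and after #9181:

#include <cerrno>
#include <cstdio>

struct State {
  int  connect_result = 0;
  void set_connect_fail(int e) { connect_result = e; }
  void clear_connect_fail() { connect_result = 0; }
};

// Before #9181 (sketch): the tentative EIO was scoped to the new-connection
// path, so a reused session never saw it.
int connect_before(State &t_state, bool session_reused) {
  if (!session_reused) {
    t_state.set_connect_fail(EIO); // tentative failure
    t_state.clear_connect_fail();  // cleared once the connect succeeds
  }
  return t_state.connect_result;   // 0 either way
}

// After #9181 (sketch): EIO is pre-set unconditionally, but only the
// new-connection path clears it.
int connect_after(State &t_state, bool session_reused) {
  t_state.set_connect_fail(EIO);   // tentative failure
  if (!session_reused) {
    t_state.clear_connect_fail();  // new connection: cleared on success
  }                                // reused session: stays EIO (the bug)
  return t_state.connect_result;
}

int main() {
  State s;
  std::printf("after #9181, reused session: connect_result=%d (EIO)\n", connect_after(s, true));
  std::printf("after #9181, new connection: connect_result=%d\n", connect_after(s, false));
  std::printf("before #9181, reused session: connect_result=%d\n", connect_before(s, true));
}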

With connect_result incorrectly left as EIO, if the connection is closed after sending a request to the origin, the following call chain leads to execution of HttpSM::mark_host_failure, causing the fail_count to be incremented:

  1. handle_response_from_server(s);
  2. handle_server_connection_not_open(s);
  3. s->state_machine->do_hostdb_update_if_necessary();
  4. this->mark_host_failure(&t_state.dns_info, ts_clock::from_time_t(t_state.client_request_time));
  5. if (auto [down, fail_count] = info->active->increment_fail_count(time_down, t_state.txn_conf->connect_attempts_rr_retries);

If this happens repeatedly and reaches the threshold defined by proxy.config.http.connect_attempts_rr_retries, the origin server is incorrectly marked as down:

if (auto [down, fail_count] = info->active->increment_fail_count(time_down, t_state.txn_conf->connect_attempts_rr_retries);
    down) {
  char *url_str = t_state.hdr_info.client_request.url_string_get_ref(nullptr);
  std::string_view host_name{t_state.unmapped_url.host_get()};
  swoc::bwprint(error_bw_buffer, "CONNECT : {::s} connecting to {} for host='{}' url='{}' fail_count='{}' marking down",
                swoc::bwf::Errno(t_state.current.server->connect_result), t_state.current.server->dst_addr, host_name,
                swoc::bwf::FirstOf(url_str, "<none>"), fail_count);
  Log::error("%s", error_bw_buffer.c_str());
  SMDbg(dbg_ctl_http, "hostdb update marking IP: %s as down", addrbuf);
  ATS_PROBE2(hostdb_mark_ip_as_down, sm_id, addrbuf);

Since the connection to the origin is actually successful, marking it as down is incorrect.
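
To make the threshold arithmetic concrete, here is a minimal, self-contained sketch of the counting described above (not the actual HostDB code; the real increment_fail_count also takes a down-time argument):

#include <cstdio>
#include <utility>

struct HostRecord {
  int fail_count = 0;
  // Returns {marked_down, new_fail_count}, mirroring the structured binding
  // used in the snippet above.
  std::pair<bool, int> increment_fail_count(int rr_retries) {
    ++fail_count;
    return {fail_count >= rr_retries, fail_count};
  }
};

int main() {
  HostRecord host;
  const int  rr_retries = 3; // proxy.config.http.connect_attempts_rr_retries
  for (int i = 0; i < rr_retries; ++i) {
    // With connect_result stuck at EIO, every closed connection counts as a
    // failure even though the connect itself succeeded.
    if (auto [down, count] = host.increment_fail_count(rr_retries); down) {
      std::printf("fail_count=%d: host marked down\n", count);
    }
  }
}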

Fix

Update the logic so that t_state.set_connect_fail(EIO) is executed only when establishing a new connection to the origin (i.e., when a server session is not reused), and ensure that connect_result is cleared once the connection succeeds.
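
In outline, the fixed flow looks like the following standalone sketch (stand-in helpers, not the actual patch):

#include <cerrno>

struct State {
  int  connect_result = 0;
  void set_connect_fail(int e) { connect_result = e; }
  void clear_connect_fail() { connect_result = 0; }
};

int connect_fixed(State &t_state, bool session_reused) {
  if (!session_reused) {
    t_state.set_connect_fail(EIO);      // tentative, only for a brand-new connection
  }
  bool connection_established = true;   // assume the connect succeeds here
  if (connection_established) {
    t_state.clear_connect_fail();       // cleared on success, reused or not
  }
  return t_state.connect_result;        // 0 in both cases
}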

Additionally, when multiplexed_origin is true, connect_result was also not being cleared after a successful connection.
In this case, although t_state.set_connect_fail(EIO) is executed (see below), the lack of a corresponding clear operation results in connect_result remaining EIO:

if (multiplexed_origin) {
  EThread *ethread = this_ethread();
  if (nullptr != ethread->connecting_pool) {
    SMDbg(dbg_ctl_http_ss, "Queue multiplexed request");
    new_entry                      = new ConnectingEntry();
    new_entry->mutex               = this->mutex;
    new_entry->ua_txn              = _ua.get_txn();
    new_entry->handler             = (ContinuationHandler)&ConnectingEntry::state_http_server_open;
    new_entry->ipaddr.assign(&t_state.current.server->dst_addr.sa);
    new_entry->hostname            = t_state.current.server->name;
    new_entry->sni                 = this->get_outbound_sni();
    new_entry->cert_name           = this->get_outbound_cert();
    new_entry->is_no_plugin_tunnel = plugin_tunnel_type == HttpPluginTunnel_t::NONE;
    this->t_state.set_connect_fail(EIO);
    new_entry->connect_sms.insert(this);
    ethread->connecting_pool->m_ip_pool.insert(std::make_pair(new_entry->ipaddr, new_entry));
  }
}

This patch ensures that connect_result is cleared whenever the connection succeeds, regardless of whether multiplexed_origin is enabled.

@yknoya yknoya marked this pull request as ready for review December 4, 2025 00:55
@bryancall bryancall requested a review from Copilot December 4, 2025 02:49
@bryancall bryancall added this to the 10.2.0 milestone Dec 4, 2025
Contributor

Copilot AI left a comment


Pull request overview

This PR fixes a critical bug where origin servers were incorrectly marked as down when server session reuse was enabled. The issue was introduced in PR #9181 and affected versions 9.2.11, 10.1.0, and master.

Key Changes:

  • Moved the pre-emptive set_connect_fail(EIO) call from the general ORIGIN_SERVER_OPEN state action to the specific NET_EVENT_OPEN handler, ensuring it's only set when establishing a new connection
  • Added clear_connect_fail() call in the CONNECT_EVENT_TXN handler to properly clear connection failures for session reuse and multiplexed origin scenarios
  • Preserved existing clear_connect_fail() behavior for successful connection handshake events


@ezelkow1 ezelkow1 modified the milestones: 10.2.0, 9.2.12 Dec 8, 2025
@bryancall bryancall requested a review from masaori335 December 8, 2025 23:06

// Pre-emptively set a server connect failure that will be cleared once a WRITE_READY is received from origin or
// bytes are received back
t_state.set_connect_fail(EIO);
Contributor

@masaori335 masaori335 Dec 9, 2025


I guess you're just moving existing code. However, can we set EIO only if we observe errors? IMO, we can easily make a mistake with the current approach.

Contributor Author

@yknoya yknoya Dec 10, 2025


Thanks for the review.
I also thought it would be better to set EIO only when an actual error is detected, but due to time constraints I applied the current workaround-like approach.
I'll look into whether that approach is feasible, so please give me some more time.

Contributor

@masaori335 masaori335 Dec 11, 2025


Let me know if you find it's tough. This approach is not ideal, but it fixes the bug, so I don't want to block it for a long time. We can clean it up later.

Contributor Author (@yknoya)


Sorry for the delay.
I investigated the points you raised and implemented some changes.
Below is a summary of the investigation behind these changes.

First, I found that the master branch already contains logic that invokes the set_connect_fail method when a connection attempt fails:

(*entry)->t_state.set_connect_fail(lerrno);

t_state.set_connect_fail(_netvc->lerrno);

if (t_state.cause_of_death_errno == -UNKNOWN_INTERNAL_ERROR) {
  if (event == VC_EVENT_EOS) {
    t_state.set_connect_fail(EPIPE);
  } else {
    t_state.set_connect_fail(EIO);
  }
}

Since set_connect_fail is executed when a connection actually fails, it seemed unnecessary to pre-set EIO before the connection is made, so I looked into the reasoning.
My assumption is that the pre-set EIO exists to avoid updating connect_result when set_connect_fail is invoked after the connection has already succeeded.
There are several locations where set_connect_fail may run after the connection is established, but due to the following logic, only cause_of_death_errno is updated while connect_result is left unchanged, because connect_result has already been set to 0 once the connection succeeds.

} else if (e == EIO || this->current.server->connect_result == EIO) {
  this->current.server->connect_result = e;
}
if (e != EIO) {
  this->cause_of_death_errno = e;
}
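
A worked trace of that guard (a standalone sketch, not ATS code): once connect_result has been cleared to 0, a later failure such as EPIPE updates only cause_of_death_errno:

#include <cerrno>
#include <cstdio>

int main() {
  int connect_result       = 0;      // already cleared by the successful connect
  int cause_of_death_errno = -20000; // stand-in for -UNKNOWN_INTERNAL_ERROR
  int e                    = EPIPE;  // error arriving after the connect

  if (e == EIO || connect_result == EIO) {
    connect_result = e;              // not taken: e != EIO and result != EIO
  }
  if (e != EIO) {
    cause_of_death_errno = e;        // taken: records EPIPE
  }
  // connect_result stays 0; cause_of_death_errno becomes EPIPE (32 on Linux).
  std::printf("connect_result=%d cause_of_death_errno=%d\n", connect_result, cause_of_death_errno);
}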

Based on these findings, I identified the following issues:

  1. The set_connect_fail method is used not only for connection-related failures but also for other types of errors (its name no longer matches its actual behavior).
  2. The logic determines whether the error occurred during connection by relying on a pre-set EIO, which is not ideal.

I was not able to resolve these issues entirely, but I believe they can be improved, and I applied the following changes:

  • Renamed set_connect_fail to set_fail.
  • Changed the initial value of connect_result to -UNKNOWN_INTERNAL_ERROR (instead of EIO), matching the initial value of cause_of_death_errno and making the intent clearer.
  • Updated set_fail so that it determines whether the failure occurred during connection by checking next_action and the value of connect_result.
  • Removed the logic that pre-sets EIO before a connection attempt.

Additionally, although slightly outside the main issue, I made the following related improvements:

  • Added a set_success method to reset both connect_result and cause_of_death_errno, since callers of clear_connect_fail were not resetting cause_of_death_errno.
  • Renamed clear_connect_fail to set_connect_success.
  • Ensured that set_success is always invoked upon successful connection by calling it within handle_http_server_open.
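
As a rough sketch of the reorganized helpers (names taken from the summary above; the numeric placeholder for UNKNOWN_INTERNAL_ERROR is illustrative, and these cleanup changes were later split into a separate PR):

static constexpr int UNKNOWN_INTERNAL_ERROR = 20000; // placeholder value

struct UpstreamState {
  int connect_result       = -UNKNOWN_INTERNAL_ERROR; // no longer pre-set to EIO
  int cause_of_death_errno = -UNKNOWN_INTERNAL_ERROR;

  void set_connect_success() { connect_result = 0; } // was clear_connect_fail

  // New helper: reset both fields on a successful connection, since callers
  // of the old clear_connect_fail did not reset cause_of_death_errno.
  void set_success() {
    set_connect_success();
    cause_of_death_errno = -UNKNOWN_INTERNAL_ERROR;
  }
};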

@yknoya yknoya force-pushed the fix-unintentionally-marked-as-down branch from 587a6cf to bd1c192 on December 15, 2025 04:35
Comment on lines 142 to 144
if (lerrno != -UNKNOWN_INTERNAL_ERROR) {
  (*entry)->t_state.set_fail(lerrno);
}
Contributor Author (@yknoya)


In cases where a connection timeout occurs, the code path temporarily assigned EIO (5) before ultimately setting ETIMEDOUT (110), as shown in the logs below.
Assigning an incorrect intermediate errno is undesirable, so this change ensures that an inappropriate errno is not set during the transition.

[Dec 16 15:23:26.800] [ET_NET 0] DIAG: <ConnectingEntry.cc:48 (state_http_server_open)> (http_connect) entered inside ConnectingEntry::state_http_server_open
[Dec 16 15:23:26.800] [ET_NET 0] DIAG: <ConnectingEntry.cc:130 (state_http_server_open)> (http_connect) Stop 1 state machines waiting for failed origin
[Dec 16 15:23:26.800] [ET_NET 0] DIAG: <TLSEventSupport.cc:153 (callHooks)> (ssl) sslHandshakeHookState=TS_SSL_HOOK_PRE_CONNECT eventID=110
[Dec 16 15:23:26.800] [ET_NET 0] DIAG: <TLSEventSupport.cc:271 (callHooks)> (ssl) iterated to curHook=(nil)
[Dec 16 15:23:26.800] [ET_NET 0] DIAG: <HttpTransact.h:933 (set_fail)> (http) Setting upstream connection failure -19999 to 5
[Dec 16 15:23:26.800] [ET_NET 0] DIAG: <HttpSM.cc:2645 (main_handler)> (http) [0] VC_EVENT_INACTIVITY_TIMEOUT/TS_EVENT_VCONN_INACTIVITY_TIMEOUT, 105
[Dec 16 15:23:26.800] [ET_NET 0] DIAG: <HttpSM.cc:1796 (state_http_server_open)> (http_track) [0] entered inside state_http_server_open: VC_EVENT_INACTIVITY_TIMEOUT/TS_EVENT_VCONN_INACTIVITY_TIMEOUT
[Dec 16 15:23:26.800] [ET_NET 0] DIAG: <HttpSM.cc:1797 (state_http_server_open)> (http) [0] [&HttpSM::state_http_server_open, VC_EVENT_INACTIVITY_TIMEOUT/TS_EVENT_VCONN_INACTIVITY_TIMEOUT]
[Dec 16 15:23:26.800] [ET_NET 0] DIAG: <HttpTransact.h:933 (set_fail)> (http) Setting upstream connection failure 5 to 110
[Dec 16 15:23:26.800] [ET_NET 0] DIAG: <HttpTransact.cc:3432 (HandleResponse)> (http_trans) [0] Entering HttpTransact::HandleResponse

@masaori335
Contributor

Thank you for cleaning up the code. However, it's getting bigger than expected. Let's make another PR for the cleanup and keep this PR focused on the bug fix.

IMO, touching connect_result and cause_of_death_errno in the set_connect_fail() function is making things complicated. Anyway, let's discuss this further in another PR.

@yknoya
Contributor Author

yknoya commented Dec 19, 2025

We decided to split this PR into two parts: a bug fix and a cleanup.
The following commits will be reverted and moved to a separate PR:

Contributor

@masaori335 masaori335 left a comment


Looks good.

@bneradt
Contributor

bneradt commented Dec 19, 2025

[approve ci]

@masaori335 masaori335 merged commit f2e959f into apache:master Dec 24, 2025
15 checks passed
@yknoya yknoya deleted the fix-unintentionally-marked-as-down branch December 25, 2025 03:50