Cluster related metrics, user-agent #973

Merged: 14 commits, Aug 8, 2023
16 changes: 8 additions & 8 deletions doc/debugging_strategies.md
@@ -131,15 +131,15 @@ unresponsive backend server).

 The `sozu.http.errors` counter is the sum of failed requests. It contains the following:
 
-* `sozu.http.front_parse_errors`: sozu received some invalid traffic
+* `sozu.http.frontend_parse_errors`: sozu received some invalid traffic
 * `sozu.http.400.errors`: cannot parse hostname
 * `sozu.http.404.errors`: unknown hostname and/or path
 * `sozu.http.413.errors`: request too large
 * `sozu.http.503.errors`: could not connect to backend server, or no backend server available for the corresponding cluster
 
 Going further, backend connections issues are tracked by the following metrics:
 
-* `sozu.backend.connections.errors`: could not connect to a backend server
+* `sozu.backend.connections.error`: could not connect to a backend server
 * `sozu.backend.down`: the retry policy triggered and marked the backend server as down
 
 The `sozu.http.503.errors` metric is incremented after a request sent back a 503 error, and a 503 error is sent
@@ -180,10 +180,10 @@ active requests (an inactive session can keep a backend connection around).

 These metrics are closely linked to resource usage, which is tracked by the following:
 
-* `sozu.slab.count`: number of slots used in the slab allocator. Typically, there's one slot per listener socket,
+* `sozu.slab.entries`: number of slots used in the slab allocator. Typically, there's one slot per listener socket,
 one for the connection to the main process, one for the metrics socket, then one per frontend connection and
 one per backend connection. So the number of connections should always be close to (but lower than) the slab count.
-* `sozu.buffer.count`: number of buffers used in the buffer pool. Inactive sessions and requests for which we send
+* `sozu.buffer.number`: number of buffers used in the buffer pool. Inactive sessions and requests for which we send
 a default answer (400, 404, 413, 503 HTTP errors) do not use buffers. Active HTTP sessions use one buffer (except
 in pipelining mode), WebSocket sessions use two buffers. So the number of buffers should always be lower than the
 slab count, and lower than the number of connections.
@@ -194,7 +194,7 @@ is incremented for each zombie session that gets deleted.
 New connections are put into a queue, and wait until the session is created (if we have available resources),
 or until a configurable timeout has elapsed. The following metrics observe the accept queue usage:
 
-* `sozu.accept_queue.count`: number of sockets in the accept queue
+* `sozu.accept_queue.connections`: number of sockets in the accept queue
 * `sozu.accept_queue.timeout`: incremented every time a socket stayed too long in the queue and is closed
 * `sozu.accept_queue.wait_time`: every time a session is created, this metric records how long the socket had to wait in the accept queue
 
@@ -243,7 +243,7 @@ sozu -c /etc/config.toml query clusters -i cluster_id

 ### Backend server unavailable
 
-`sozu.http.503.errors` increases, lots of `sozu.backend.connections.errors` and a
+`sozu.http.503.errors` increases, lots of `sozu.backend.connections.error` and a
 `sozu.backend.down` record: a backend server is down.
 Check the logs for `error connecting to backend, trying again` and `no more available backends for cluster <cluster_id>`
 to find out which cluster is affected
@@ -254,7 +254,7 @@ if the `sozu.zombies` metric triggers, this means there's an event loop or proto
 bug. The logs should contain the internal state of the sessions that were killed. Please copy those
 logs and open an issue to sozu.
 
-It usually comes with the `sozu.slab.count` increasing while the number of connections or active requests stays
+It usually comes with the `sozu.slab.entries` increasing while the number of connections or active requests stays
 the same. The slab count will then drop when the zombie checker activates.
 
 ### Invalid session close
@@ -265,7 +265,7 @@ are increasing, it means sessions are not properly closed by sozu, please open a

 ### accept queue filling up
 
-if `sozu.accept_queue.count` is increasing, that means the accept queue is filling up because sozu is under
+if `sozu.accept_queue.connections` is increasing, that means the accept queue is filling up because sozu is under
 heavy load (in a healthy load, this queue is almost always empty). `sozu.accept_queue.wait_time` should increase
 as well. If `sozu.accept_queue.timeout` is higher than zero, sozu cannot accept sessions fast enough and
 is rejecting traffic.
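The renamed gauges in this file encode an invariant worth keeping in mind when reading dashboards: buffers are only held by active sessions, so `sozu.buffer.number` stays at or below the number of connections, which itself stays below `sozu.slab.entries`. A self-contained sketch of that relationship, with a hypothetical snapshot type standing in for real worker state (the field names mirror the metric keys; nothing here is sozu's actual API):

```rust
// Hypothetical snapshot of one worker's resource metrics; field names
// mirror the metric keys documented above.
struct WorkerSnapshot {
    slab_entries: usize,  // sozu.slab.entries
    buffer_number: usize, // sozu.buffer.number
    frontend_connections: usize,
    backend_connections: usize,
}

fn sanity_check(s: &WorkerSnapshot) -> bool {
    let connections = s.frontend_connections + s.backend_connections;
    // the slab also holds a few fixed slots (listener sockets, the channel
    // to the main process, the metrics socket), so connections < slab entries
    connections < s.slab_entries
        // at most one buffer per HTTP session, two per WebSocket session,
        // none for default answers: buffers stay at or below connections
        && s.buffer_number <= connections
        && s.buffer_number < s.slab_entries
}

fn main() {
    // 1 listener + 1 main-process channel + 1 metrics socket + 4 connections
    let snapshot = WorkerSnapshot {
        slab_entries: 7,
        buffer_number: 2,
        frontend_connections: 2,
        backend_connections: 2,
    };
    assert!(sanity_check(&snapshot));
}
```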
21 changes: 19 additions & 2 deletions lib/src/http.rs
@@ -420,6 +420,7 @@ impl L7ListenerHandler for HttpListener {
         uri: &str,
         method: &Method,
     ) -> Result<Route, FrontendFromRequestError> {
+        let start = Instant::now();
         let (remaining_input, (hostname, _)) = match hostname_and_port(host.as_bytes()) {
             Ok(tuple) => tuple,
             Err(parse_error) => {
@@ -447,9 +448,25 @@ impl L7ListenerHandler for HttpListener {
         */
         let host = unsafe { from_utf8_unchecked(hostname) };
 
-        self.fronts
+        let route = self
+            .fronts
             .lookup(host.as_bytes(), uri.as_bytes(), method)
-            .ok_or(FrontendFromRequestError::NoClusterFound)
+            .ok_or_else(|| {
+                incr!("http.failed_backend_matching");
+                FrontendFromRequestError::NoClusterFound
+            })?;
+
+        let now = Instant::now();
+
+        if let Route::ClusterId(cluster) = &route {
+            time!(
+                "frontend_matching_time",
+                cluster,
+                (now - start).whole_milliseconds()
+            );
+        }
+
+        Ok(route)
     }
 }

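The hunk above isolates a pattern worth naming: time the route lookup, count failed matches, and only emit a per-cluster timing when the lookup succeeds. A runnable sketch of the same control flow under stand-in names (`record_time`, `lookup_route`, and the routing rule are hypothetical; the real code goes through sozu's `incr!` and `time!` macros and its `Route` type):

```rust
use std::time::Instant;

// Stand-in for sozu's `time!` macro, which labels the timing with a cluster id.
fn record_time(key: &str, cluster: &str, ms: u128) {
    println!("{key}{{cluster={cluster}}}: {ms}ms");
}

// Hypothetical router: one host/path prefix mapped to one cluster.
fn lookup_route(host: &str, uri: &str) -> Option<String> {
    (host == "example.com" && uri.starts_with("/api")).then(|| "cluster_0".to_string())
}

fn frontend_from_request(host: &str, uri: &str) -> Result<String, &'static str> {
    let start = Instant::now();
    let cluster = lookup_route(host, uri).ok_or_else(|| {
        // failed matches are counted, not timed: there is no cluster to label
        println!("incr http.failed_backend_matching");
        "NoClusterFound"
    })?;
    // only successful, cluster-bound lookups produce a timing entry
    record_time("frontend_matching_time", &cluster, start.elapsed().as_millis());
    Ok(cluster)
}

fn main() {
    assert!(frontend_from_request("example.com", "/api/v1").is_ok());
    assert!(frontend_from_request("example.com", "/static").is_err());
}
```

The https.rs hunks below apply the identical change to the TLS listener.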
21 changes: 19 additions & 2 deletions lib/src/https.rs
@@ -562,6 +562,7 @@ impl L7ListenerHandler for HttpsListener {
         uri: &str,
         method: &Method,
     ) -> Result<Route, FrontendFromRequestError> {
+        let start = Instant::now();
         let (remaining_input, (hostname, _)) = match hostname_and_port(host.as_bytes()) {
             Ok(tuple) => tuple,
             Err(parse_error) => {
@@ -584,9 +585,25 @@
         // chars in there
         let host = unsafe { from_utf8_unchecked(hostname) };
 
-        self.fronts
+        let route = self
+            .fronts
             .lookup(host.as_bytes(), uri.as_bytes(), method)
-            .ok_or(FrontendFromRequestError::NoClusterFound)
+            .ok_or_else(|| {
+                incr!("http.failed_backend_matching");
+                FrontendFromRequestError::NoClusterFound
+            })?;
+
+        let now = Instant::now();
+
+        if let Route::ClusterId(cluster) = &route {
+            time!(
+                "frontend_matching_time",
+                cluster,
+                (now - start).whole_milliseconds()
+            );
+        }
+
+        Ok(route)
     }
 }

30 changes: 25 additions & 5 deletions lib/src/logs.rs
@@ -138,6 +138,7 @@ pub struct RequestRecord<'a> {
     pub client_rtt: Option<Duration>,
     pub server_rtt: Option<Duration>,
     pub metrics: &'a SessionMetrics,
+    pub user_agent: Option<&'a str>,
 }
 
 impl RequestRecord<'_> {
@@ -150,6 +151,7 @@
         let session_address = self.session_address;
         let backend_address = self.backend_address;
         let endpoint = &self.endpoint;
+        let user_agent = &self.user_agent;
 
         let metrics = self.metrics;
         // let backend_response_time = metrics.backend_response_time();
@@ -193,7 +195,7 @@ impl RequestRecord<'_> {
         match self.error {
             None => {
                 info_access!(
-                    "{}{} -> {} \t{}/{}/{}/{} \t{} -> {} \t {} {} {}",
+                    "{}{} -> {} \t{}/{}/{}/{} \t{} -> {} \t {}{} {} {}",
                     context,
                     session_address.as_str_or("X"),
                     backend_address.as_str_or("X"),
@@ -203,14 +205,25 @@
                     LogDuration(server_rtt),
                     metrics.bin,
                     metrics.bout,
-                    tags.as_str_or("-"),
+                    match user_agent {
+                        Some(_) => tags.as_str_or(""),
+                        None => tags.as_str_or("-"),
+                    },
+                    match tags {
+                        Some(tags) if !tags.is_empty() => user_agent.map(|ua| format!(", user-agent={ua}")).unwrap_or_default(),
+                        Some(_) | None => user_agent.map(|ua| format!("user-agent={ua}")).unwrap_or_default(),
+                    },
                     protocol,
                     endpoint
                 );
-                incr!("access_logs.count");
+                incr!(
+                    "access_logs.count",
+                    self.context.cluster_id,
+                    self.context.backend_id
+                );
             }
             Some(message) => error_access!(
-                "{}{} -> {} \t{}/{}/{}/{} \t{} -> {} \t {} {} {} | {}",
+                "{}{} -> {} \t{}/{}/{}/{} \t{} -> {} \t {}{} {} {} | {}",
                 context,
                 session_address.as_str_or("X"),
                 backend_address.as_str_or("X"),
@@ -220,7 +233,14 @@
                 LogDuration(server_rtt),
                 metrics.bin,
                 metrics.bout,
-                tags.as_str_or("-"),
+                match user_agent {
+                    Some(_) => tags.as_str_or(""),
+                    None => tags.as_str_or("-"),
+                },
+                match tags {
+                    Some(tags) if !tags.is_empty() => user_agent.map(|ua| format!(", user-agent={ua}")).unwrap_or_default(),
+                    Some(_) | None => user_agent.map(|ua| format!("user-agent={ua}")).unwrap_or_default(),
+                },
                 protocol,
                 endpoint,
                 message
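The two format-string changes are easier to audit as one helper: the tags field keeps its `-` placeholder only when no user-agent follows, and the user-agent is prefixed with a comma only when non-empty tags precede it. A runnable sketch mirroring that logic (`tags_field` is a hypothetical name, and `Option::unwrap_or` stands in for sozu's `as_str_or`):

```rust
// Mirrors the new access-log field: "<tags><maybe user-agent>".
fn tags_field(tags: Option<&str>, user_agent: Option<&str>) -> String {
    // placeholder "-" only when nothing at all will be printed there
    let tags_part = match user_agent {
        Some(_) => tags.unwrap_or(""),
        None => tags.unwrap_or("-"),
    };
    // comma-separate the user-agent only when non-empty tags precede it
    let ua_part = match tags {
        Some(t) if !t.is_empty() => user_agent
            .map(|ua| format!(", user-agent={ua}"))
            .unwrap_or_default(),
        _ => user_agent
            .map(|ua| format!("user-agent={ua}"))
            .unwrap_or_default(),
    };
    format!("{tags_part}{ua_part}")
}

fn main() {
    assert_eq!(
        tags_field(Some("key=value"), Some("curl/8.0")),
        "key=value, user-agent=curl/8.0"
    );
    assert_eq!(tags_field(None, Some("curl/8.0")), "user-agent=curl/8.0");
    assert_eq!(tags_field(None, None), "-");
}
```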
22 changes: 13 additions & 9 deletions lib/src/metrics/mod.rs
@@ -321,11 +321,13 @@ macro_rules! count (
 macro_rules! incr (
     ($key:expr) => (count!($key, 1));
     ($key:expr, $cluster_id:expr, $backend_id:expr) => {
-        use $crate::metrics::Subscriber;
+        {
+            use $crate::metrics::Subscriber;
 
-        $crate::metrics::METRICS.with(|metrics| {
-            (*metrics.borrow_mut()).receive_metric($key, $cluster_id, $backend_id, $crate::metrics::MetricValue::Count(1));
-        });
+            $crate::metrics::METRICS.with(|metrics| {
+                (*metrics.borrow_mut()).receive_metric($key, $cluster_id, $backend_id, $crate::metrics::MetricValue::Count(1));
+            });
+        }
     }
 );

@@ -353,12 +355,14 @@ macro_rules! gauge_add (
         });
     });
     ($key:expr, $value:expr, $cluster_id:expr, $backend_id:expr) => {
-        use $crate::metrics::Subscriber;
-        let v = $value;
+        {
+            use $crate::metrics::Subscriber;
+            let v = $value;
 
-        $crate::metrics::METRICS.with(|metrics| {
-            (*metrics.borrow_mut()).receive_metric($key, $cluster_id, $backend_id, $crate::metrics::MetricValue::GaugeAdd(v));
-        });
+            $crate::metrics::METRICS.with(|metrics| {
+                (*metrics.borrow_mut()).receive_metric($key, $cluster_id, $backend_id, $crate::metrics::MetricValue::GaugeAdd(v));
+            });
+        }
     }
 );

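The added braces are load-bearing: a `use` item expanded by a `macro_rules!` macro lands in the caller's scope, so two labeled `incr!` or `gauge_add!` invocations in the same block would expand to two identical `use $crate::metrics::Subscriber;` items and fail to compile. A minimal reproduction with toy macros (not sozu's actual ones):

```rust
// Each expansion emits a `use` item directly into the calling scope.
macro_rules! incr_unscoped {
    ($key:expr) => {
        use std::collections::HashMap as Sink; // leaks into the caller's scope
        let _ = Sink::<&str, u64>::from([($key, 1)]);
    };
}

// Wrapping the expansion in `{ ... }` confines the import to one block.
macro_rules! incr_scoped {
    ($key:expr) => {{
        use std::collections::HashMap as Sink;
        let _ = Sink::<&str, u64>::from([($key, 1)]);
    }};
}

fn main() {
    incr_unscoped!("a");
    // incr_unscoped!("b"); // error[E0252]: the name `Sink` is defined multiple times
    incr_scoped!("a");
    incr_scoped!("b"); // fine: each expansion opens its own scope
}
```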
4 changes: 2 additions & 2 deletions lib/src/pool.rs
@@ -51,7 +51,7 @@ impl Pool {
             })
             .map(|c| {
                 let old_buffer_count = BUFFER_COUNT.fetch_add(1, Ordering::SeqCst);
-                gauge!("buffer.count", old_buffer_count + 1);
+                gauge!("buffer.number", old_buffer_count + 1);
                 Checkout { inner: c }
             })
     }
@@ -115,7 +115,7 @@ impl ops::DerefMut for Checkout {
 impl Drop for Checkout {
     fn drop(&mut self) {
         let old_buffer_count = BUFFER_COUNT.fetch_sub(1, Ordering::SeqCst);
-        gauge!("buffer.count", old_buffer_count - 1);
+        gauge!("buffer.number", old_buffer_count - 1);
     }
 }
 
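pool.rs only renames the gauge key, but the surrounding pattern is the interesting part: the buffer count is maintained by RAII, incremented when a buffer is checked out and decremented in `Drop`, so the gauge cannot drift on early returns. A self-contained sketch of the pattern (the `gauge` function is a stand-in for sozu's `gauge!` macro):

```rust
use std::sync::atomic::{AtomicUsize, Ordering};

static BUFFER_COUNT: AtomicUsize = AtomicUsize::new(0);

// Stand-in for sozu's `gauge!` macro.
fn gauge(key: &str, value: usize) {
    println!("{key} = {value}");
}

struct Checkout;

impl Checkout {
    fn new() -> Self {
        let old = BUFFER_COUNT.fetch_add(1, Ordering::SeqCst);
        gauge("buffer.number", old + 1);
        Checkout
    }
}

impl Drop for Checkout {
    fn drop(&mut self) {
        let old = BUFFER_COUNT.fetch_sub(1, Ordering::SeqCst);
        gauge("buffer.number", old - 1);
    }
}

fn main() {
    let a = Checkout::new(); // buffer.number = 1
    {
        let _b = Checkout::new(); // buffer.number = 2
    } // _b dropped here: buffer.number = 1
    drop(a); // buffer.number = 0
}
```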
63 changes: 46 additions & 17 deletions lib/src/protocol/kawa_h1/mod.rs
@@ -424,7 +424,7 @@ impl<Front: SocketHandler, L: ListenerHandler + L7ListenerHandler> Http<Front, L
         }
 
         if self.response_stream.is_terminated() && self.response_stream.is_completed() {
-            save_http_status_metric(self.context.status);
+            save_http_status_metric(self.context.status, self.log_context());
 
             self.log_request_success(metrics);
             metrics.reset();
@@ -765,6 +765,7 @@ impl<Front: SocketHandler, L: ListenerHandler + L7ListenerHandler> Http<Front, L
             client_rtt: socket_rtt(self.front_socket()),
             server_rtt: self.backend_socket.as_ref().and_then(socket_rtt),
             metrics,
+            user_agent: self.context.user_agent.as_deref(),
         }
         .log();
     }
@@ -788,15 +789,43 @@
             );
         } else {
             match answer {
-                DefaultAnswerStatus::Answer301 => incr!("http.301.redirection"),
+                DefaultAnswerStatus::Answer301 => incr!(
+                    "http.301.redirection",
+                    self.cluster_id.as_deref(),
+                    self.backend_id.as_deref()
+                ),
                 DefaultAnswerStatus::Answer400 => incr!("http.400.errors"),
-                DefaultAnswerStatus::Answer401 => incr!("http.401.errors"),
+                DefaultAnswerStatus::Answer401 => incr!(
+                    "http.401.errors",
+                    self.cluster_id.as_deref(),
+                    self.backend_id.as_deref()
+                ),
                 DefaultAnswerStatus::Answer404 => incr!("http.404.errors"),
-                DefaultAnswerStatus::Answer408 => incr!("http.408.errors"),
-                DefaultAnswerStatus::Answer413 => incr!("http.413.errors"),
-                DefaultAnswerStatus::Answer502 => incr!("http.502.errors"),
-                DefaultAnswerStatus::Answer503 => incr!("http.503.errors"),
-                DefaultAnswerStatus::Answer504 => incr!("http.504.errors"),
+                DefaultAnswerStatus::Answer408 => incr!(
+                    "http.408.errors",
+                    self.cluster_id.as_deref(),
+                    self.backend_id.as_deref()
+                ),
+                DefaultAnswerStatus::Answer413 => incr!(
+                    "http.413.errors",
+                    self.cluster_id.as_deref(),
+                    self.backend_id.as_deref()
+                ),
+                DefaultAnswerStatus::Answer502 => incr!(
+                    "http.502.errors",
+                    self.cluster_id.as_deref(),
+                    self.backend_id.as_deref()
+                ),
+                DefaultAnswerStatus::Answer503 => incr!(
+                    "http.503.errors",
+                    self.cluster_id.as_deref(),
+                    self.backend_id.as_deref()
+                ),
+                DefaultAnswerStatus::Answer504 => incr!(
+                    "http.504.errors",
+                    self.cluster_id.as_deref(),
+                    self.backend_id.as_deref()
+                ),
             };
         }
 
@@ -1275,7 +1304,7 @@ impl<Front: SocketHandler, L: ListenerHandler + L7ListenerHandler> Http<Front, L

         if backend.retry_policy.is_down() {
             incr!(
-                "up",
+                "backend.up",
                 self.cluster_id.as_deref(),
                 metrics.backend_id.as_deref()
             );
@@ -1313,7 +1342,7 @@ impl<Front: SocketHandler, L: ListenerHandler + L7ListenerHandler> Http<Front, L
         let already_unavailable = backend.retry_policy.is_down();
         backend.retry_policy.fail();
         incr!(
-            "connections.error",
+            "backend.connections.error",
             self.cluster_id.as_deref(),
             metrics.backend_id.as_deref()
         );
@@ -1323,7 +1352,7 @@
                 backend.backend_id, backend.address
             );
             incr!(
-                "down",
+                "backend.down",
                 self.cluster_id.as_deref(),
                 metrics.backend_id.as_deref()
             );
@@ -1698,23 +1727,23 @@ impl<Front: SocketHandler, L: ListenerHandler + L7ListenerHandler> SessionState
 }
 
 /// Save the HTTP status code of the backend response
-fn save_http_status_metric(status: Option<u16>) {
+fn save_http_status_metric(status: Option<u16>, context: LogContext) {
     if let Some(status) = status {
         match status {
             100..=199 => {
-                incr!("http.status.1xx");
+                incr!("http.status.1xx", context.cluster_id, context.backend_id);
             }
             200..=299 => {
-                incr!("http.status.2xx");
+                incr!("http.status.2xx", context.cluster_id, context.backend_id);
             }
             300..=399 => {
-                incr!("http.status.3xx");
+                incr!("http.status.3xx", context.cluster_id, context.backend_id);
             }
             400..=499 => {
-                incr!("http.status.4xx");
+                incr!("http.status.4xx", context.cluster_id, context.backend_id);
             }
             500..=599 => {
-                incr!("http.status.5xx");
+                incr!("http.status.5xx", context.cluster_id, context.backend_id);
             }
             _ => {
                 // http responses with other codes (protocol error)
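With `save_http_status_metric` now taking a `LogContext`, each status-class counter carries cluster and backend labels instead of being aggregated globally. The bucketing itself is unchanged; a sketch of it in isolation (`status_class_key` is a hypothetical helper, not part of the PR):

```rust
// Map a response status to the counter key it increments, if any.
fn status_class_key(status: u16) -> Option<&'static str> {
    match status {
        100..=199 => Some("http.status.1xx"),
        200..=299 => Some("http.status.2xx"),
        300..=399 => Some("http.status.3xx"),
        400..=499 => Some("http.status.4xx"),
        500..=599 => Some("http.status.5xx"),
        // anything outside 100..=599 is a protocol error and is not counted
        _ => None,
    }
}

fn main() {
    assert_eq!(status_class_key(204), Some("http.status.2xx"));
    assert_eq!(status_class_key(503), Some("http.status.5xx"));
    assert_eq!(status_class_key(99), None);
}
```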
1 change: 1 addition & 0 deletions lib/src/protocol/pipe.rs
@@ -219,6 +219,7 @@ impl<Front: SocketHandler, L: ListenerHandler> Pipe<Front, L> {
             client_rtt: socket_rtt(self.front_socket()),
             server_rtt: self.backend_socket.as_ref().and_then(socket_rtt),
             metrics,
+            user_agent: None,
         }
         .log();
     }