Merge pull request #973 from hcaumeil/metrics
Cluster related metrics, user-agent
FlorentinDUBOIS authored Aug 8, 2023
2 parents 5429d58 + 0ecf4cc commit 2a856d8
Showing 11 changed files with 154 additions and 53 deletions.
16 changes: 8 additions & 8 deletions doc/debugging_strategies.md
@@ -131,15 +131,15 @@ unresponsive backend server).

The `sozu.http.errors` counter is the sum of failed requests. It contains the following:

* `sozu.http.front_parse_errors`: sozu received some invalid traffic
* `sozu.http.frontend_parse_errors`: sozu received some invalid traffic
* `sozu.http.400.errors`: cannot parse hostname
* `sozu.http.404.errors`: unknown hostname and/or path
* `sozu.http.413.errors`: request too large
* `sozu.http.503.errors`: could not connect to backend server, or no backend server available for the corresponding cluster

Going further, backend connection issues are tracked by the following metrics (see the sketch after this list):

* `sozu.backend.connections.errors`: could not connect to a backend server
* `sozu.backend.connections.error`: could not connect to a backend server
* `sozu.backend.down`: the retry policy triggered and marked the backend server as down
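
To see how these two counters relate, here is a deliberately simplified, hypothetical retry policy (not sozu's actual implementation; the threshold and the `println!` stand-ins for `incr!` are made up for illustration). A connection error is counted on every failed attempt, while `backend.down` fires only once, on the up-to-down transition:

```rust
/// Hypothetical circuit-breaker-style retry policy, for illustration only.
struct RetryPolicy {
    max_failures: usize,
    failures: usize,
}

impl RetryPolicy {
    fn is_down(&self) -> bool {
        self.failures >= self.max_failures
    }

    /// Called on every failed connection attempt to a backend.
    fn fail(&mut self) {
        let was_down = self.is_down();
        self.failures += 1;
        // counted for every failed connect (sozu.backend.connections.error)
        println!("incr backend.connections.error");
        if !was_down && self.is_down() {
            // counted only when the backend transitions to "down" (sozu.backend.down)
            println!("incr backend.down");
        }
    }

    /// Called when a connection to the backend succeeds again.
    fn succeed(&mut self) {
        if self.is_down() {
            // the opposite transition (sozu.backend.up)
            println!("incr backend.up");
        }
        self.failures = 0;
    }
}

fn main() {
    let mut policy = RetryPolicy { max_failures: 3, failures: 0 };
    for _ in 0..4 {
        policy.fail(); // four connection errors, but only one `backend.down`
    }
    policy.succeed(); // one `backend.up`
}
```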

The `sozu.http.503.errors` metric is incremented after a request sent back a 503 error, and a 503 error is sent
@@ -180,10 +180,10 @@ active requests (an inactive session can keep a backend connection around).

These metrics are closely linked to resource usage, which is tracked by the following:

* `sozu.slab.count`: number of slots used in the slab allocator. Typically, there's one slot per listener socket,
* `sozu.slab.entries`: number of slots used in the slab allocator. Typically, there's one slot per listener socket,
one for the connection to the main process, one for the metrics socket, then one per frontend connection and
one per backend connection. So the number of connections should always be close to (but lower than) the slab count.
* `sozu.buffer.count`: number of buffers used in the buffer pool. Inactive sessions and requests for which we send
* `sozu.buffer.number`: number of buffers used in the buffer pool. Inactive sessions and requests for which we send
a default answer (400, 404, 413, 503 HTTP errors) do not use buffers. Active HTTP sessions use one buffer (except
in pipelining mode), WebSocket sessions use two buffers. So the number of buffers should always be lower than the
slab count, and lower than the number of connections.
@@ -194,7 +194,7 @@ is incremented for each zombie session that gets deleted.
New connections are put into a queue, and wait until the session is created (if we have available resources),
or until a configurable timeout has elapsed. The following metrics observe the accept queue usage:

* `sozu.accept_queue.count`: number of sockets in the accept queue
* `sozu.accept_queue.connections`: number of sockets in the accept queue
* `sozu.accept_queue.timeout`: incremented every time a socket stayed too long in the queue and is closed
* `sozu.accept_queue.wait_time`: every time a session is created, this metric records how long the socket had to wait in the accept queue
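
The queueing behaviour described above can be summarised with a small, self-contained sketch (a simplified model, not sozu's accept loop; the types and the metric names in comments are only illustrative): a socket leaves the queue either because a session is created for it, or because it exceeded the timeout and is closed.

```rust
use std::collections::VecDeque;
use std::time::{Duration, Instant};

/// Stand-in for an accepted socket waiting for a session to be created.
struct Pending {
    accepted_at: Instant,
}

/// Toy accept queue; `sozu.accept_queue.connections` would be its length.
struct AcceptQueue {
    queue: VecDeque<Pending>,
    timeout: Duration,
}

impl AcceptQueue {
    fn push(&mut self, socket: Pending) {
        self.queue.push_back(socket);
    }

    /// Turn queued sockets into sessions while resources are available.
    fn drain(&mut self, mut free_slots: usize) {
        let now = Instant::now();
        while let Some(pending) = self.queue.pop_front() {
            let waited = now - pending.accepted_at;
            if waited > self.timeout {
                // stayed too long: `accept_queue.timeout` += 1, socket is closed
                continue;
            }
            if free_slots == 0 {
                // no resources yet: keep it queued for the next iteration
                self.queue.push_front(pending);
                break;
            }
            // session created: `accept_queue.wait_time` records `waited`
            free_slots -= 1;
        }
    }
}

fn main() {
    let mut queue = AcceptQueue {
        queue: VecDeque::new(),
        timeout: Duration::from_secs(60),
    };
    queue.push(Pending { accepted_at: Instant::now() });
    queue.drain(1);
}
```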

@@ -243,7 +243,7 @@ sozu -c /etc/config.toml query clusters -i cluster_id

### Backend server unavailable

`sozu.http.503.errors` increases, lots of `sozu.backend.connections.errors` and a
`sozu.http.503.errors` increases, lots of `sozu.backend.connections.error` and a
`sozu.backend.down` record: a backend server is down.
Check the logs for `error connecting to backend, trying again` and `no more available backends for cluster <cluster_id>`
to find out which cluster is affected.
@@ -254,7 +254,7 @@ if the `sozu.zombies` metric triggers, this means there's an event loop or protocol
bug. The logs should contain the internal state of the sessions that were killed. Please copy those
logs and open an issue on the sozu repository.

It usually comes with the `sozu.slab.count` increasing while the number of connections or active requests stays
It usually comes with the `sozu.slab.entries` increasing while the number of connections or active requests stays
the same. The slab count will then drop when the zombie checker activates.

### Invalid session close
@@ -265,7 +265,7 @@ are increasing, it means sessions are not properly closed by sozu, please open a

### accept queue filling up

if `sozu.accept_queue.count` is increasing, that means the accept queue is filling up because sozu is under
if `sozu.accept_queue.connections` is increasing, that means the accept queue is filling up because sozu is under
heavy load (under normal load, this queue is almost always empty). `sozu.accept_queue.wait_time` should increase
as well. If `sozu.accept_queue.timeout` is higher than zero, sozu cannot accept sessions fast enough and
is rejecting traffic.
21 changes: 19 additions & 2 deletions lib/src/http.rs
@@ -419,6 +419,7 @@ impl L7ListenerHandler for HttpListener {
uri: &str,
method: &Method,
) -> Result<Route, FrontendFromRequestError> {
let start = Instant::now();
let (remaining_input, (hostname, _)) = match hostname_and_port(host.as_bytes()) {
Ok(tuple) => tuple,
Err(parse_error) => {
@@ -446,9 +447,25 @@
*/
let host = unsafe { from_utf8_unchecked(hostname) };

self.fronts
let route = self
.fronts
.lookup(host.as_bytes(), uri.as_bytes(), method)
.ok_or(FrontendFromRequestError::NoClusterFound)
.ok_or_else(|| {
incr!("http.failed_backend_matching");
FrontendFromRequestError::NoClusterFound
})?;

let now = Instant::now();

if let Route::ClusterId(cluster) = &route {
time!(
"frontend_matching_time",
cluster,
(now - start).whole_milliseconds()
);
}

Ok(route)
}
}

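The pattern added here (and mirrored in `lib/src/https.rs` below) is measure-around-the-lookup: start a clock, count a miss in `http.failed_backend_matching` when no frontend matches, and report `frontend_matching_time` tagged with the cluster on success. Below is a standalone sketch of the same idea, using `std::time` and a plain `HashMap` in place of sozu's route lookup (all names here are illustrative):

```rust
use std::collections::HashMap;
use std::time::Instant;

/// Toy router standing in for sozu's frontend lookup.
struct Router {
    routes: HashMap<String, String>, // hostname -> cluster id
}

impl Router {
    fn lookup(&self, host: &str) -> Result<&str, &'static str> {
        let start = Instant::now();

        let cluster = self.routes.get(host).map(String::as_str).ok_or_else(|| {
            // no frontend matched: `http.failed_backend_matching` += 1
            "no cluster found"
        })?;

        // only successful matches are timed, tagged with the cluster id,
        // like the `time!("frontend_matching_time", cluster, ...)` call above
        let elapsed_ms = start.elapsed().as_millis();
        println!("frontend_matching_time cluster={cluster}: {elapsed_ms}ms");

        Ok(cluster)
    }
}

fn main() {
    let mut routes = HashMap::new();
    routes.insert("example.com".to_string(), "cluster_0".to_string());
    let router = Router { routes };

    assert_eq!(router.lookup("example.com"), Ok("cluster_0"));
    assert!(router.lookup("unknown.test").is_err());
}
```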
21 changes: 19 additions & 2 deletions lib/src/https.rs
@@ -562,6 +562,7 @@ impl L7ListenerHandler for HttpsListener {
uri: &str,
method: &Method,
) -> Result<Route, FrontendFromRequestError> {
let start = Instant::now();
let (remaining_input, (hostname, _)) = match hostname_and_port(host.as_bytes()) {
Ok(tuple) => tuple,
Err(parse_error) => {
@@ -584,9 +585,25 @@
// chars in there
let host = unsafe { from_utf8_unchecked(hostname) };

self.fronts
let route = self
.fronts
.lookup(host.as_bytes(), uri.as_bytes(), method)
.ok_or(FrontendFromRequestError::NoClusterFound)
.ok_or_else(|| {
incr!("http.failed_backend_matching");
FrontendFromRequestError::NoClusterFound
})?;

let now = Instant::now();

if let Route::ClusterId(cluster) = &route {
time!(
"frontend_matching_time",
cluster,
(now - start).whole_milliseconds()
);
}

Ok(route)
}
}

30 changes: 25 additions & 5 deletions lib/src/logs.rs
@@ -138,6 +138,7 @@ pub struct RequestRecord<'a> {
pub client_rtt: Option<Duration>,
pub server_rtt: Option<Duration>,
pub metrics: &'a SessionMetrics,
pub user_agent: Option<&'a str>,
}

impl RequestRecord<'_> {
@@ -150,6 +151,7 @@ impl RequestRecord<'_> {
let session_address = self.session_address;
let backend_address = self.backend_address;
let endpoint = &self.endpoint;
let user_agent = &self.user_agent;

let metrics = self.metrics;
// let backend_response_time = metrics.backend_response_time();
@@ -193,7 +195,7 @@ impl RequestRecord<'_> {
match self.error {
None => {
info_access!(
"{}{} -> {} \t{}/{}/{}/{} \t{} -> {} \t {} {} {}",
"{}{} -> {} \t{}/{}/{}/{} \t{} -> {} \t {}{} {} {}",
context,
session_address.as_str_or("X"),
backend_address.as_str_or("X"),
@@ -203,14 +205,25 @@
LogDuration(server_rtt),
metrics.bin,
metrics.bout,
tags.as_str_or("-"),
match user_agent {
Some(_) => tags.as_str_or(""),
None => tags.as_str_or("-"),
},
match tags {
Some(tags) if !tags.is_empty() => user_agent.map(|ua| format!(", user-agent={ua}")).unwrap_or_default(),
Some(_) | None => user_agent.map(|ua| format!("user-agent={ua}")).unwrap_or_default(),
},
protocol,
endpoint
);
incr!("access_logs.count");
incr!(
"access_logs.count",
self.context.cluster_id,
self.context.backend_id
);
}
Some(message) => error_access!(
"{}{} -> {} \t{}/{}/{}/{} \t{} -> {} \t {} {} {} | {}",
"{}{} -> {} \t{}/{}/{}/{} \t{} -> {} \t {}{} {} {} | {}",
context,
session_address.as_str_or("X"),
backend_address.as_str_or("X"),
@@ -220,7 +233,14 @@
LogDuration(server_rtt),
metrics.bin,
metrics.bout,
tags.as_str_or("-"),
match user_agent {
Some(_) => tags.as_str_or(""),
None => tags.as_str_or("-"),
},
match tags {
Some(tags) if !tags.is_empty() => user_agent.map(|ua| format!(", user-agent={ua}")).unwrap_or_default(),
Some(_) | None => user_agent.map(|ua| format!("user-agent={ua}")).unwrap_or_default(),
},
protocol,
endpoint,
message
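The two nested `match` expressions added to both access-log call sites implement one small rule for the new `{}{}` pair in the format string: print `-` when there are neither tags nor a user-agent, print whichever one exists otherwise, and join them with `, ` when both are present. A hypothetical standalone helper (not part of sozu) expressing the same rule:

```rust
/// Illustrative equivalent of the tags / user-agent fields in the access log.
fn tags_and_user_agent(tags: Option<&str>, user_agent: Option<&str>) -> String {
    match (tags, user_agent) {
        // neither: keep the old `-` placeholder
        (None, None) => "-".to_string(),
        // tags only: unchanged behaviour
        (Some(tags), None) => tags.to_string(),
        // user-agent only (or empty tags): no leading separator
        (None, Some(ua)) | (Some(""), Some(ua)) => format!("user-agent={ua}"),
        // both present: tags first, then `, user-agent=...`
        (Some(tags), Some(ua)) => format!("{tags}, user-agent={ua}"),
    }
}

fn main() {
    assert_eq!(tags_and_user_agent(None, None), "-");
    assert_eq!(tags_and_user_agent(Some("owner=me"), None), "owner=me");
    assert_eq!(tags_and_user_agent(None, Some("curl/8.0")), "user-agent=curl/8.0");
    assert_eq!(
        tags_and_user_agent(Some("owner=me"), Some("curl/8.0")),
        "owner=me, user-agent=curl/8.0"
    );
}
```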
22 changes: 13 additions & 9 deletions lib/src/metrics/mod.rs
@@ -321,11 +321,13 @@ macro_rules! count (
macro_rules! incr (
($key:expr) => (count!($key, 1));
($key:expr, $cluster_id:expr, $backend_id:expr) => {
use $crate::metrics::Subscriber;
{
use $crate::metrics::Subscriber;

$crate::metrics::METRICS.with(|metrics| {
(*metrics.borrow_mut()).receive_metric($key, $cluster_id, $backend_id, $crate::metrics::MetricValue::Count(1));
});
$crate::metrics::METRICS.with(|metrics| {
(*metrics.borrow_mut()).receive_metric($key, $cluster_id, $backend_id, $crate::metrics::MetricValue::Count(1));
});
}
}
);

@@ -353,12 +355,14 @@ macro_rules! gauge_add (
});
});
($key:expr, $value:expr, $cluster_id:expr, $backend_id:expr) => {
use $crate::metrics::Subscriber;
let v = $value;
{
use $crate::metrics::Subscriber;
let v = $value;

$crate::metrics::METRICS.with(|metrics| {
(*metrics.borrow_mut()).receive_metric($key, $cluster_id, $backend_id, $crate::metrics::MetricValue::GaugeAdd(v));
});
$crate::metrics::METRICS.with(|metrics| {
(*metrics.borrow_mut()).receive_metric($key, $cluster_id, $backend_id, $crate::metrics::MetricValue::GaugeAdd(v));
});
}
}
);

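The change to `incr!` and `gauge_add!` is about macro hygiene rather than metrics: wrapping the expansion in an extra block keeps the `use $crate::metrics::Subscriber;` import (and, for `gauge_add!`, the `let v = $value;` binding) local to one invocation instead of spilling into the caller's scope. A minimal, self-contained reproduction of the pattern, with toy types rather than the real sozu macros:

```rust
// Toy metrics module standing in for sozu's `crate::metrics`.
mod metrics {
    pub trait Subscriber {
        fn receive(&mut self, key: &str, value: i64);
    }

    pub struct Aggregator;

    impl Subscriber for Aggregator {
        fn receive(&mut self, key: &str, value: i64) {
            println!("{key} += {value}");
        }
    }
}

macro_rules! incr {
    ($subscriber:expr, $key:expr) => {{
        // The doubled braces make the expansion a single block expression:
        // the `use` below stays scoped to this one invocation instead of
        // being injected into the caller's scope on every call.
        use crate::metrics::Subscriber;
        $subscriber.receive($key, 1);
    }};
}

fn main() {
    let mut aggregator = metrics::Aggregator;
    // Two calls in the same scope are fine; without the inner block each
    // expansion would place its own `use` statement directly in `main`.
    incr!(aggregator, "http.status.2xx");
    incr!(aggregator, "access_logs.count");
}
```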
4 changes: 2 additions & 2 deletions lib/src/pool.rs
@@ -51,7 +51,7 @@ impl Pool {
})
.map(|c| {
let old_buffer_count = BUFFER_COUNT.fetch_add(1, Ordering::SeqCst);
gauge!("buffer.count", old_buffer_count + 1);
gauge!("buffer.number", old_buffer_count + 1);
Checkout { inner: c }
})
}
@@ -115,7 +115,7 @@ impl ops::DerefMut for Checkout {
impl Drop for Checkout {
fn drop(&mut self) {
let old_buffer_count = BUFFER_COUNT.fetch_sub(1, Ordering::SeqCst);
gauge!("buffer.count", old_buffer_count - 1);
gauge!("buffer.number", old_buffer_count - 1);
}
}

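Both call sites keep the global `BUFFER_COUNT` atomic in step with the renamed `buffer.number` gauge: `fetch_add`/`fetch_sub` return the previous value, so the value reported is `old ± 1`. A reduced sketch of that checkout/drop pattern (illustrative only, with `println!` standing in for `gauge!`):

```rust
use std::sync::atomic::{AtomicUsize, Ordering};

/// Global number of checked-out buffers (stand-in for sozu's BUFFER_COUNT).
static BUFFER_COUNT: AtomicUsize = AtomicUsize::new(0);

/// Stand-in for a pooled buffer; dropping it returns the buffer to the pool.
struct Checkout;

fn checkout() -> Checkout {
    // fetch_add returns the value *before* the increment
    let old = BUFFER_COUNT.fetch_add(1, Ordering::SeqCst);
    println!("buffer.number = {}", old + 1); // gauge!("buffer.number", old + 1)
    Checkout
}

impl Drop for Checkout {
    fn drop(&mut self) {
        let old = BUFFER_COUNT.fetch_sub(1, Ordering::SeqCst);
        println!("buffer.number = {}", old - 1); // gauge!("buffer.number", old - 1)
    }
}

fn main() {
    let a = checkout(); // buffer.number = 1
    let b = checkout(); // buffer.number = 2
    drop(a);            // buffer.number = 1
    drop(b);            // buffer.number = 0
}
```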
63 changes: 46 additions & 17 deletions lib/src/protocol/kawa_h1/mod.rs
@@ -424,7 +424,7 @@ impl<Front: SocketHandler, L: ListenerHandler + L7ListenerHandler> Http<Front, L>
}

if self.response_stream.is_terminated() && self.response_stream.is_completed() {
save_http_status_metric(self.context.status);
save_http_status_metric(self.context.status, self.log_context());

self.log_request_success(metrics);
metrics.reset();
@@ -765,6 +765,7 @@ impl<Front: SocketHandler, L: ListenerHandler + L7ListenerHandler> Http<Front, L>
client_rtt: socket_rtt(self.front_socket()),
server_rtt: self.backend_socket.as_ref().and_then(socket_rtt),
metrics,
user_agent: self.context.user_agent.as_deref(),
}
.log();
}
@@ -788,15 +789,43 @@
);
} else {
match answer {
DefaultAnswerStatus::Answer301 => incr!("http.301.redirection"),
DefaultAnswerStatus::Answer301 => incr!(
"http.301.redirection",
self.cluster_id.as_deref(),
self.backend_id.as_deref()
),
DefaultAnswerStatus::Answer400 => incr!("http.400.errors"),
DefaultAnswerStatus::Answer401 => incr!("http.401.errors"),
DefaultAnswerStatus::Answer401 => incr!(
"http.401.errors",
self.cluster_id.as_deref(),
self.backend_id.as_deref()
),
DefaultAnswerStatus::Answer404 => incr!("http.404.errors"),
DefaultAnswerStatus::Answer408 => incr!("http.408.errors"),
DefaultAnswerStatus::Answer413 => incr!("http.413.errors"),
DefaultAnswerStatus::Answer502 => incr!("http.502.errors"),
DefaultAnswerStatus::Answer503 => incr!("http.503.errors"),
DefaultAnswerStatus::Answer504 => incr!("http.504.errors"),
DefaultAnswerStatus::Answer408 => incr!(
"http.408.errors",
self.cluster_id.as_deref(),
self.backend_id.as_deref()
),
DefaultAnswerStatus::Answer413 => incr!(
"http.413.errors",
self.cluster_id.as_deref(),
self.backend_id.as_deref()
),
DefaultAnswerStatus::Answer502 => incr!(
"http.502.errors",
self.cluster_id.as_deref(),
self.backend_id.as_deref()
),
DefaultAnswerStatus::Answer503 => incr!(
"http.503.errors",
self.cluster_id.as_deref(),
self.backend_id.as_deref()
),
DefaultAnswerStatus::Answer504 => incr!(
"http.504.errors",
self.cluster_id.as_deref(),
self.backend_id.as_deref()
),
};
}

@@ -1275,7 +1304,7 @@ impl<Front: SocketHandler, L: ListenerHandler + L7ListenerHandler> Http<Front, L>

if backend.retry_policy.is_down() {
incr!(
"up",
"backend.up",
self.cluster_id.as_deref(),
metrics.backend_id.as_deref()
);
@@ -1313,7 +1342,7 @@ impl<Front: SocketHandler, L: ListenerHandler + L7ListenerHandler> Http<Front, L>
let already_unavailable = backend.retry_policy.is_down();
backend.retry_policy.fail();
incr!(
"connections.error",
"backend.connections.error",
self.cluster_id.as_deref(),
metrics.backend_id.as_deref()
);
@@ -1323,7 +1352,7 @@ impl<Front: SocketHandler, L: ListenerHandler + L7ListenerHandler> Http<Front, L>
backend.backend_id, backend.address
);
incr!(
"down",
"backend.down",
self.cluster_id.as_deref(),
metrics.backend_id.as_deref()
);
@@ -1698,23 +1727,23 @@ impl<Front: SocketHandler, L: ListenerHandler + L7ListenerHandler> SessionState
}

/// Save the HTTP status code of the backend response
fn save_http_status_metric(status: Option<u16>) {
fn save_http_status_metric(status: Option<u16>, context: LogContext) {
if let Some(status) = status {
match status {
100..=199 => {
incr!("http.status.1xx");
incr!("http.status.1xx", context.cluster_id, context.backend_id);
}
200..=299 => {
incr!("http.status.2xx");
incr!("http.status.2xx", context.cluster_id, context.backend_id);
}
300..=399 => {
incr!("http.status.3xx");
incr!("http.status.3xx", context.cluster_id, context.backend_id);
}
400..=499 => {
incr!("http.status.4xx");
incr!("http.status.4xx", context.cluster_id, context.backend_id);
}
500..=599 => {
incr!("http.status.5xx");
incr!("http.status.5xx", context.cluster_id, context.backend_id);
}
_ => {
// http responses with other codes (protocol error)
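`save_http_status_metric` now receives the log context so the per-class counters carry cluster and backend ids. The bucketing itself is a plain range match; a hypothetical standalone version that returns the counter name instead of calling `incr!` looks like this:

```rust
/// Illustrative helper mapping a backend status code to the counter it feeds;
/// codes outside 100..=599 are protocol errors and are not counted.
fn http_status_metric(status: Option<u16>) -> Option<&'static str> {
    match status? {
        100..=199 => Some("http.status.1xx"),
        200..=299 => Some("http.status.2xx"),
        300..=399 => Some("http.status.3xx"),
        400..=499 => Some("http.status.4xx"),
        500..=599 => Some("http.status.5xx"),
        _ => None,
    }
}

fn main() {
    assert_eq!(http_status_metric(Some(204)), Some("http.status.2xx"));
    assert_eq!(http_status_metric(Some(503)), Some("http.status.5xx"));
    assert_eq!(http_status_metric(Some(42)), None);
    assert_eq!(http_status_metric(None), None);
}
```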
1 change: 1 addition & 0 deletions lib/src/protocol/pipe.rs
@@ -211,6 +211,7 @@ impl<Front: SocketHandler, L: ListenerHandler> Pipe<Front, L> {
client_rtt: socket_rtt(self.front_socket()),
server_rtt: self.backend_socket.as_ref().and_then(socket_rtt),
metrics,
user_agent: None,
}
.log();
}