feat(system): Log service back pressure as metric [INGEST-1630] #1583
Conversation
I'll add unit tests that assert emitted metrics; otherwise this is ready for review.
Overall it looks great to me. Just left a small thought
```rust
tokio::select! {
    biased;

    _ = self.interval.tick() => {
```
I'm not sure about this select now. We will be missing some metrics, since once we get the message we exit this receive method. I'm thinking maybe it would be better to spawn a tokio task and just report the metrics from there.
`recv` is called in a loop by the service, so this will run continuously. All that this inner loop and `select!` do is inject into the existing external loop and slide some metrics in.
What this means is that the metric cannot be logged if the service does not make any progress. On the other hand, as long as the service makes slow progress, we'll get updates.
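To make the shape of that loop concrete, here is a hedged sketch of such a `recv` (the names `interval`, `rx`, and the message type are illustrative, not the actual Relay code):

```rust
// Sketch: slip a metrics tick into the service's existing receive
// loop. `biased` makes select! poll branches in declaration order,
// so a due interval tick is handled before the next message.
async fn recv(
    interval: &mut tokio::time::Interval,
    rx: &mut tokio::sync::mpsc::UnboundedReceiver<String>,
) -> Option<String> {
    loop {
        tokio::select! {
            biased;
            _ = interval.tick() => {
                // Hypothetical metric emission. This only runs while
                // the outer service loop keeps polling recv().
                println!("gauge: service.back_pressure");
            }
            msg = rx.recv() => return msg,
        }
    }
}
```

The inner `loop` is what keeps metrics flowing between messages: a tick does not return from `recv`, only a received message does.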
relay-system/src/service.rs
```diff
@@ -305,12 +314,14 @@ pub trait FromMessage<M>: Interface {
 /// long as the service is running. It can be freely cloned.
 pub struct Addr<I: Interface> {
     tx: mpsc::UnboundedSender<I>,
+    backlog: Arc<AtomicU64>,
```
nit: To me, "backlog" refers to the actual list/queue of backlogged items, while this is merely a counter, so I would name it something like `backlog_size` or `queue_size`.
```rust
    .unwrap();

let _guard = rt.enter();
tokio::time::pause();
```
How does this work? Is time still paused after calling `tokio::time::sleep`? That is, does calling `sleep(X)` advance the paused `now` time by `X`?
The relevant information is in the "Auto-advance" section of the docs for `pause`: the runtime skips ahead to the next waiting timer, in our case a sleep. We can think of it as skipping idle time. Also, the timer does not advance in between, so multiple calls to `Instant::now()` will always return the same instant. There is also `tokio::time::advance()`, but that does not run all pending timers, which we want in this case.
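A hedged sketch of that auto-advance behavior (assuming a current-thread tokio runtime, as in the test above):

```rust
#[tokio::main(flavor = "current_thread")]
async fn main() {
    // Freeze tokio's clock; timers now only move via auto-advance.
    tokio::time::pause();
    let start = tokio::time::Instant::now();

    // No wall-clock wait happens here: with time paused and no other
    // work pending, the runtime jumps straight to this timer's deadline.
    tokio::time::sleep(std::time::Duration::from_secs(60)).await;

    // The paused clock has advanced by the full sleep duration.
    assert!(start.elapsed() >= std::time::Duration::from_secs(60));
}
```

This is why the test completes instantly even though it "sleeps" for long intervals.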
This adds central back pressure monitoring via statsd metrics to all services that use the new tokio-based service framework. Channels track the number of pending messages and emit a gauge metric with that number.

The metric is `service.back_pressure` and is tagged with a `service` tag to identify the specific service. The metric is debounced to once per second. Between these intervals, the back pressure number is not checked, so this metric is not suitable for detecting very short-lived spikes. Additionally, the current implementation requires that the service makes progress by polling `recv`.

The purpose of this metric is to detect lasting increases in backlogs that indicate bottlenecks or degraded system behavior.
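The once-per-second debounce described above can be sketched with a small std-only helper (the type name and the `println!` standing in for statsd emission are illustrative assumptions):

```rust
use std::time::{Duration, Instant};

// Emits a gauge at most once per interval. Calls in between are
// dropped, which is why short-lived spikes can go unobserved.
struct DebouncedGauge {
    interval: Duration,
    last_emit: Option<Instant>,
}

impl DebouncedGauge {
    fn new(interval: Duration) -> Self {
        Self { interval, last_emit: None }
    }

    /// Reports `value` if the debounce interval has elapsed.
    /// Returns true when the value was actually emitted.
    fn report(&mut self, now: Instant, value: u64) -> bool {
        match self.last_emit {
            Some(last) if now.duration_since(last) < self.interval => false,
            _ => {
                self.last_emit = Some(now);
                println!("service.back_pressure: {value}");
                true
            }
        }
    }
}
```

Passing `now` in explicitly keeps the debounce logic deterministic and easy to test without pausing any clocks.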