feat(system): Log service back pressure as metric [INGEST-1630] #1583

Merged: 9 commits merged into master from feat/system-backpressure on Nov 16, 2022

Conversation

@jan-auer (Member) commented Nov 15, 2022

This adds central back pressure monitoring via statsd metrics to all
services that use the new tokio-based service framework. Channels track
the number of pending messages and emit a gauge metric with that number.

The metric is service.back_pressure and is tagged with a service tag to
identify the specific service. It is emitted at most once per second
(debounced). Between these intervals, the back pressure number is not
checked, so this metric is not suitable for detecting very short-lived
spikes. Additionally, the current implementation requires that the service
makes progress by polling recv.

The purpose of this metric is to detect lasting increases in backlogs
that indicate bottlenecks or degraded system behavior.
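
As a rough sketch of how this could be wired up (hypothetical names; println! stands in for the statsd gauge, and the real implementation lives in relay-system and relay-statsd):

```rust
use std::sync::atomic::{AtomicU64, Ordering};
use std::sync::Arc;
use tokio::sync::mpsc;
use tokio::time::{Duration, Interval};

/// Hypothetical sender handle: counts messages as they are queued.
pub struct Addr<M> {
    tx: mpsc::UnboundedSender<M>,
    queue_size: Arc<AtomicU64>,
}

impl<M> Addr<M> {
    pub fn send(&self, message: M) {
        self.queue_size.fetch_add(1, Ordering::SeqCst);
        let _ = self.tx.send(message);
    }
}

/// Hypothetical receiver: decrements the counter and reports the gauge
/// at most once per second while the service keeps polling `recv`.
pub struct Receiver<M> {
    rx: mpsc::UnboundedReceiver<M>,
    queue_size: Arc<AtomicU64>,
    interval: Interval,
}

impl<M> Receiver<M> {
    pub async fn recv(&mut self) -> Option<M> {
        loop {
            tokio::select! {
                biased;

                // Debounce: the interval fires at most once per second.
                _ = self.interval.tick() => {
                    let size = self.queue_size.load(Ordering::SeqCst);
                    // Stand-in for emitting the `service.back_pressure` gauge.
                    println!("service.back_pressure: {size}");
                }
                message = self.rx.recv() => {
                    if message.is_some() {
                        self.queue_size.fetch_sub(1, Ordering::SeqCst);
                    }
                    return message;
                }
            }
        }
    }
}

/// Hypothetical constructor; must run inside a Tokio runtime because it
/// creates the reporting interval.
pub fn channel<M>() -> (Addr<M>, Receiver<M>) {
    let (tx, rx) = mpsc::unbounded_channel();
    let queue_size = Arc::new(AtomicU64::new(0));
    let addr = Addr { tx, queue_size: Arc::clone(&queue_size) };
    let receiver = Receiver {
        rx,
        queue_size,
        interval: tokio::time::interval(Duration::from_secs(1)),
    };
    (addr, receiver)
}
```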

@jan-auer (Member, Author) commented:
I'll add unit tests that assert emitted metrics; otherwise this is ready for review.

@olksdr (Contributor) left a comment

Overall it looks great to me. Just left a small thought.

tokio::select! {
    biased;

    _ = self.interval.tick() => {
@olksdr (Contributor) commented on the diff:

I'm not sure about this select now. We will miss some metrics, since once we get a message we exit this receive method.

I'm thinking it might be better to spawn a tokio task and report the metrics from there.
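
A rough sketch of that alternative, assuming the counter handle can be cloned into a background task (hypothetical names, println! as a stand-in for the gauge):

```rust
use std::sync::atomic::{AtomicU64, Ordering};
use std::sync::Arc;
use std::time::Duration;

/// Hypothetical: report the backlog from a dedicated task so the gauge is
/// emitted every second even while the service is not polling `recv`.
fn spawn_back_pressure_reporter(queue_size: Arc<AtomicU64>) -> tokio::task::JoinHandle<()> {
    tokio::spawn(async move {
        let mut interval = tokio::time::interval(Duration::from_secs(1));
        loop {
            interval.tick().await;
            let size = queue_size.load(Ordering::Relaxed);
            // Stand-in for emitting the `service.back_pressure` gauge.
            println!("service.back_pressure: {size}");
        }
    })
}
```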

@jan-auer (Member, Author) replied on Nov 16, 2022:

recv is called in a loop by the service, so this will run continuously. All this inner loop and select! does is hook into that existing external loop and emit the metric in between messages.

This means the metric cannot be logged if the service does not make any progress. On the other hand, as long as the service makes even slow progress, we'll get updates.
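
For illustration, the external loop is roughly the following (hypothetical driver, reusing the Receiver type from the sketch in the description):

```rust
/// Hypothetical driver: the service awaits `recv` in a loop, so the inner
/// `select!` runs on every iteration and can emit the gauge whenever the
/// service makes progress.
async fn run<M>(mut rx: Receiver<M>, mut handle: impl FnMut(M)) {
    while let Some(message) = rx.recv().await {
        handle(message);
    }
}
```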

@jan-auer marked this pull request as ready for review November 16, 2022 10:52
@jan-auer requested a review from a team November 16, 2022 10:52
relay-statsd/src/lib.rs (resolved review thread)
@@ -305,12 +314,14 @@ pub trait FromMessage<M>: Interface {
/// long as the service is running. It can be freely cloned.
pub struct Addr<I: Interface> {
tx: mpsc::UnboundedSender<I>,
backlog: Arc<AtomicU64>,
Reviewer (Member) commented:

nit: To me, "backlog" refers to the actual list/queue of backlogged items, while this is merely a counter, so I would name it something like backlog_size or queue_size.

relay-system/src/service.rs (resolved review thread)
.unwrap();

let _guard = rt.enter();
tokio::time::pause();
Reviewer (Member) commented:

How does this work? Is time still paused after calling tokio::time::sleep? That is, does calling sleep(X) advance the paused now time by X?

@jan-auer (Member, Author) replied:

The relevant information is in the "Auto-advance" section of the docs for pause: the runtime skips ahead to the next waiting timer, in our case a sleep. We can think of it as skipping idle time. Also, the clock does not advance in between, so multiple calls to Instant::now() will always return the same instant.

There is also tokio::time::advance(), but that does not run all paused timers, which is what we want in this case.
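
For illustration, a small test sketching that auto-advance behavior (assumes tokio's macros and test-util features; the start_paused attribute replaces the manual pause() call from the diff):

```rust
use std::time::Duration;
use tokio::time::Instant;

#[tokio::test(start_paused = true)]
async fn paused_time_auto_advances() {
    let start = Instant::now();

    // With time paused, awaiting the sleep does not block for real time;
    // the runtime auto-advances the clock to the sleep's deadline.
    tokio::time::sleep(Duration::from_secs(5)).await;
    assert!(Instant::now() - start >= Duration::from_secs(5));

    // Between awaits the clock is frozen, so repeated reads are identical.
    assert_eq!(Instant::now(), Instant::now());
}
```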

@jan-auer self-assigned this Nov 16, 2022
@jan-auer enabled auto-merge (squash) November 16, 2022 14:01
@jan-auer merged commit 73b4cea into master Nov 16, 2022
@jan-auer deleted the feat/system-backpressure branch November 16, 2022 14:20