feat(system): Log service back pressure as metric [INGEST-1630] #1583

Merged: 9 commits merged into master from feat/system-backpressure on Nov 16, 2022

Conversation

@jan-auer (Member) commented Nov 15, 2022

This adds central back pressure monitoring via statsd metrics to all
services that use the new tokio-based service framework. Channels track
the number of pending messages and emit a gauge metric with that number.

The metric is service.back_pressure and is tagged with a service tag to
identify the specific service. It is emitted at most once per second
(debounced). Between these intervals, the back pressure number is not
checked, so this metric is not suitable for detecting very short-lived
spikes. Additionally, the current implementation requires that the service
makes progress by polling recv.

The purpose of this metric is to detect lasting increases in backlogs
that indicate bottlenecks or degraded system behavior.
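
As a rough sketch of how this could be wired up (hypothetical names; println! stands in for the statsd gauge, and the real implementation lives in relay-system and relay-statsd):

```rust
use std::sync::atomic::{AtomicU64, Ordering};
use std::sync::Arc;
use tokio::sync::mpsc;
use tokio::time::{Duration, Interval};

/// Hypothetical sender handle: counts messages as they are queued.
pub struct Addr<M> {
    tx: mpsc::UnboundedSender<M>,
    queue_size: Arc<AtomicU64>,
}

impl<M> Addr<M> {
    pub fn send(&self, message: M) {
        self.queue_size.fetch_add(1, Ordering::SeqCst);
        let _ = self.tx.send(message);
    }
}

/// Hypothetical receiver: decrements the counter and reports the gauge
/// at most once per second while the service keeps polling `recv`.
pub struct Receiver<M> {
    rx: mpsc::UnboundedReceiver<M>,
    queue_size: Arc<AtomicU64>,
    interval: Interval,
}

impl<M> Receiver<M> {
    pub async fn recv(&mut self) -> Option<M> {
        loop {
            tokio::select! {
                biased;

                // Debounce: the interval fires at most once per second.
                _ = self.interval.tick() => {
                    let size = self.queue_size.load(Ordering::SeqCst);
                    // Stand-in for emitting the `service.back_pressure` gauge.
                    println!("service.back_pressure: {size}");
                }
                message = self.rx.recv() => {
                    if message.is_some() {
                        self.queue_size.fetch_sub(1, Ordering::SeqCst);
                    }
                    return message;
                }
            }
        }
    }
}

/// Hypothetical constructor; must run inside a Tokio runtime because it
/// creates the reporting interval.
pub fn channel<M>() -> (Addr<M>, Receiver<M>) {
    let (tx, rx) = mpsc::unbounded_channel();
    let queue_size = Arc::new(AtomicU64::new(0));
    let addr = Addr { tx, queue_size: Arc::clone(&queue_size) };
    let receiver = Receiver {
        rx,
        queue_size,
        interval: tokio::time::interval(Duration::from_secs(1)),
    };
    (addr, receiver)
}
```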

@jan-auer (Member, Author) commented:
I'll add unit tests that assert emitted metrics; otherwise this is ready for review.

@olksdr (Contributor) left a comment

Overall it looks great to me. Just left a small thought.

tokio::select! {
    biased;

    _ = self.interval.tick() => {
@olksdr (Contributor) commented on the diff:

I'm not sure about this select now. We will miss some metrics, since once we get a message we exit this receive method.

I'm thinking it might be better to spawn a tokio task and report the metrics from there.
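
A rough sketch of that alternative, assuming the counter handle can be cloned into a background task (hypothetical names, println! as a stand-in for the gauge):

```rust
use std::sync::atomic::{AtomicU64, Ordering};
use std::sync::Arc;
use std::time::Duration;

/// Hypothetical: report the backlog from a dedicated task so the gauge is
/// emitted every second even while the service is not polling `recv`.
fn spawn_back_pressure_reporter(queue_size: Arc<AtomicU64>) -> tokio::task::JoinHandle<()> {
    tokio::spawn(async move {
        let mut interval = tokio::time::interval(Duration::from_secs(1));
        loop {
            interval.tick().await;
            let size = queue_size.load(Ordering::Relaxed);
            // Stand-in for emitting the `service.back_pressure` gauge.
            println!("service.back_pressure: {size}");
        }
    })
}
```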

@jan-auer (Member, Author) replied on Nov 16, 2022:

recv is called in a loop by the service, so this will run continuously. All this inner loop and select! does is hook into that existing external loop and emit the metric in between messages.

This means the metric cannot be logged if the service does not make any progress. On the other hand, as long as the service makes even slow progress, we'll get updates.
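
For illustration, the external loop is roughly the following (hypothetical driver, reusing the Receiver type from the sketch in the description):

```rust
/// Hypothetical driver: the service awaits `recv` in a loop, so the inner
/// `select!` runs on every iteration and can emit the gauge whenever the
/// service makes progress.
async fn run<M>(mut rx: Receiver<M>, mut handle: impl FnMut(M)) {
    while let Some(message) = rx.recv().await {
        handle(message);
    }
}
```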

@jan-auer marked this pull request as ready for review November 16, 2022 10:52
@jan-auer requested a review from a team November 16, 2022 10:52
relay-statsd/src/lib.rs (resolved review thread)
@@ -305,12 +314,14 @@ pub trait FromMessage<M>: Interface {
/// long as the service is running. It can be freely cloned.
pub struct Addr<I: Interface> {
tx: mpsc::UnboundedSender<I>,
backlog: Arc<AtomicU64>,
Reviewer (Member) commented:

nit: To me, "backlog" refers to the actual list/queue of backlogged items, while this is merely a counter, so I would name it something like backlog_size or queue_size.

relay-system/src/service.rs (resolved review thread)
.unwrap();

let _guard = rt.enter();
tokio::time::pause();
Reviewer (Member) commented:

How does this work? Is time still paused after calling tokio::time::sleep? That is, does calling sleep(X) advance the paused now time by X?

@jan-auer (Member, Author) replied:

The relevant information is in the "Auto-advance" section of the docs for pause: the runtime skips ahead to the next waiting timer, in our case a sleep. We can think of it as skipping idle time. Also, the clock does not advance in between, so multiple calls to Instant::now() will always return the same instant.

There is also tokio::time::advance(), but that does not run all paused timers, which is what we want in this case.
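
For illustration, a small test sketching that auto-advance behavior (assumes tokio's macros and test-util features; the start_paused attribute replaces the manual pause() call from the diff):

```rust
use std::time::Duration;
use tokio::time::Instant;

#[tokio::test(start_paused = true)]
async fn paused_time_auto_advances() {
    let start = Instant::now();

    // With time paused, awaiting the sleep does not block for real time;
    // the runtime auto-advances the clock to the sleep's deadline.
    tokio::time::sleep(Duration::from_secs(5)).await;
    assert!(Instant::now() - start >= Duration::from_secs(5));

    // Between awaits the clock is frozen, so repeated reads are identical.
    assert_eq!(Instant::now(), Instant::now());
}
```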

@jan-auer self-assigned this Nov 16, 2022
@jan-auer enabled auto-merge (squash) November 16, 2022 14:01
@jan-auer merged commit 73b4cea into master Nov 16, 2022
@jan-auer deleted the feat/system-backpressure branch November 16, 2022 14:20