
feat: add specification for messaging latencies #895

Conversation

kjschnei001

Fixes #792

Changes

This PR introduces a metric for messaging systems to report the time difference from when a message was produced to when it was consumed. This is a typical business metric used to indicate the health of an asynchronous messaging system.
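For illustration only (not part of the proposed changes, and with hypothetical meter and attribute names), recording such a time difference with the OpenTelemetry Python metrics API could look roughly like this:

```python
# Illustrative sketch only: records the produced-to-consumed time difference
# as a histogram. Names and attributes here are placeholders, not part of this PR.
import time
from opentelemetry import metrics

meter = metrics.get_meter("example.messaging.instrumentation")  # hypothetical name

latency_histogram = meter.create_histogram(
    name="messaging.consumer.latency.duration",
    unit="s",
    description="Duration between message production and consumption.",
)

def on_message_consumed(produce_timestamp_s: float) -> None:
    # produce_timestamp_s would come from the message (creation or enqueue time);
    # "now" is taken on the consumer, so clock skew can make this negative.
    latency = time.time() - produce_timestamp_s
    latency_histogram.record(latency, attributes={"messaging.system": "kafka"})
```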

Merge requirement checklist

@kjschnei001 requested review from a team on April 5, 2024 17:31
Contributor

@pyohannes left a comment


Thanks @kjschnei001 for starting this, I'm looking forward to discussions around this topic.

<!-- semconv metric.messaging.consumer.latency.duration(metric_table) -->
| Name | Instrument Type | Unit (UCUM) | Description | Stability |
| -------- | --------------- | ----------- | -------------- | --------- |
| `messaging.consumer.latency.duration` | Histogram | `s` | Measures the duration between message production and consumption. | ![Experimental](https://img.shields.io/badge/-experimental-blue) |
Contributor


> the duration between message production and consumption

I wonder if we'd need to define this more exactly. Some messaging systems provide you with the time a message was created ("message creation time" via the client API), while others provide you with the time a message was written into a partition ("message enqueue time" set by the broker).

I'm not sure if it's feasible to specify exact semantics that are applicable across all messaging systems; however, I think it's beneficial to have consistent directions for each specific messaging system (to have consistent semantics and implementations across different client library implementations of the same system).
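For illustration (a sketch only, not part of this PR, assuming the confluent-kafka Python client): Kafka surfaces either the producer-set creation time or the broker's log-append (enqueue) time on the same record field, depending on topic configuration, so an instrumentation has to know which of the two it is actually reading.

```python
# Illustrative sketch: Kafka reports either the producer-set creation time or the
# broker's log-append (enqueue) time on the same record field, depending on the
# topic's message.timestamp.type configuration.
import time
from confluent_kafka import TIMESTAMP_CREATE_TIME, TIMESTAMP_LOG_APPEND_TIME

def produced_to_consumed_seconds(msg):
    ts_type, ts_ms = msg.timestamp()
    if ts_type == TIMESTAMP_CREATE_TIME:
        kind = "creation time"      # set by the producing client
    elif ts_type == TIMESTAMP_LOG_APPEND_TIME:
        kind = "enqueue time"       # set by the broker
    else:
        return None, None           # timestamp not available
    return time.time() - ts_ms / 1000.0, kind
```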

Author


This came up in a few comments. I tried to address this in my latest changes by declaring two different latencies that represent "creation to processing" time and "enqueued to processing" time. I proposed specific names, messaging.latency.duration and messaging.buffering.duration respectively, but I'm open to suggestions.

component: messaging

# A brief description of the change. Surround your text with quotes ("") if it needs to start with a backtick (`).
note: "Add `messaging.consumer.latency.duration` to capture latency between production and consumption."
Contributor

@lmolkova Apr 8, 2024


I suggest calling it `messaging.consumer.lag`.

Suggested change:
- note: "Add `messaging.consumer.latency.duration` to capture latency between production and consumption."
+ note: "Add `messaging.consumer.lag` to capture the time difference between when a message was published and when it was consumed."

(and correct the brief in the yaml)

Contributor


> I suggest calling it `messaging.consumer.lag`.

Let's be careful with that. Consumer lag is commonly defined as a difference of offsets (producer's end offset minus consumer's last committed offset); see the example documentation for Kafka or Azure Event Hubs. Latency is a difference of timestamps.

Both measurements are very important to have; however, to avoid confusion with established terminology, what's proposed in this PR shouldn't be called "lag".
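A minimal illustration of the distinction, with made-up numbers:

```python
# Lag: difference of offsets. Latency: difference of timestamps.
end_offset = 1_250          # producer's end offset for the partition
committed_offset = 1_100    # consumer's last committed offset
consumer_lag = end_offset - committed_offset            # 150 messages behind

produce_timestamp_s = 1_712_563_200.0   # when the message was published
consume_timestamp_s = 1_712_563_202.5   # when it was consumed
latency_s = consume_timestamp_s - produce_timestamp_s   # 2.5 seconds
```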

Contributor


Agreed, so we need to find a different name. I still think that latency.duration or latency would not work, and I'd prefer a name that emphasizes that it's the time a message spent on the broker before being consumed rather than an in-process latency/duration.

It's common to call it time-in-queue, but we're avoiding queue/topic terminology.

Contributor


> but I still think that latency.duration or latency would not work

I agree. Basically, we're measuring the duration from the end of the "publish" operation to the beginning of the "process" operation. We don't yet have a satisfying name for this.

The term "enqueued" seems to be a possible candidate (although it clashes with our intention to avoid the term). One could then have messaging.enqueued.duration for the latency (time difference), and messaging.enqueued.count for the number of unsettled messages in the topic/queue (offset difference).

Author


This came up in a few comments. I tried to address this in my latest changes by declaring two different latencies that represent "creation to processing" time and "enqueued to processing" time. I proposed specific names, messaging.latency.duration and messaging.buffering.duration respectively, but I'm open to suggestions.

@@ -179,6 +180,20 @@ _Note: The need to report `messaging.process.messages` depends on the messaging
| `messaging.process.messages` | Counter | `{message}` | Measures the number of processed messages. | ![Experimental](https://img.shields.io/badge/-experimental-blue) |
<!-- endsemconv -->

### Metric: `messaging.consumer.latency.duration`

This metric is [recommended][MetricRecommended] for any consumer with the capability to extract these timings.
Contributor

@lmolkova Apr 8, 2024


Related to @pyohannes' comment below - I believe we only need the creation time, and the consumer can use "now". We however need to clarify if "now" means the time when the message is delivered to the application vs the time it's prefetched (i.e. it arrives on the consumer, but may stay in internal client library queues for a while).

Contributor


Another thing we need to call out is that the time is recorded on different machines and is skewed.
We should pick a strategy for how to record negative differences.

Contributor


> Another thing we need to call out is that the time is recorded on different machines and is skewed.

Yes, I encountered such cases. It was especially critical because we used the metric for defining alerts. We ended up skipping negative latencies and recording the count of negative latencies in a different metric.

> I believe we only need the creation time

We'd need some basic prototyping before settling on this. Popular systems allow customizations around either recording creation time (a user-specified timestamp) or enqueue time (a property set by the broker); see the documentation for Kafka and RabbitMQ. I know that Azure Event Hubs messages know their enqueued time (see here), but I'm not sure one can obtain the creation time.

> We however need to clarify if "now" means the time when the message is delivered to the application vs the time it's prefetched

The first seems more intuitive to me: the "consumer latency duration" (or whatever we end up calling it) would end when the messaging.process.duration starts.
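To make the negative-latency strategy mentioned above concrete, a rough sketch (instrument names here are hypothetical, not something this PR proposes):

```python
# Sketch of one possible clock-skew strategy: skip negative latencies and count
# them in a separate counter instead of recording them in the histogram.
from opentelemetry import metrics

meter = metrics.get_meter("example.messaging.instrumentation")
latency_histogram = meter.create_histogram("messaging.latency.duration", unit="s")
negative_counter = meter.create_counter("messaging.latency.negative")  # hypothetical name

def record_latency(latency_s: float) -> None:
    if latency_s < 0:
        negative_counter.add(1)
    else:
        latency_histogram.record(latency_s)
```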

Author


This came up in a few comments. I tried to address this in my latest changes by declaring two different latencies that represent "creation to processing" time and "enqueued to processing" time. I proposed specific names, messaging.latency.duration and messaging.buffering.duration respectively, but I'm open to suggestions.


This metric SHOULD be specified with
[`ExplicitBucketBoundaries`](https://github.com/open-telemetry/opentelemetry-specification/tree/v1.31.0/specification/metrics/api.md#instrument-advice)
of `[ 0.005, 0.01, 0.025, 0.05, 0.075, 0.1, 0.25, 0.5, 0.75, 1, 2.5, 5, 7.5, 10, 30, 60, 300, 600, 1800 ]`.
Contributor

@lmolkova Apr 8, 2024


I suggest reducing the number of buckets. I wonder if we should start with exponential boundaries.

Given time skew, we probably can't guarantee precision below several seconds, but I guess starting with 0.005 (5 ms) is fine since some systems can attempt to minimize the skew. I'd end in the hours range (3600), though.

I'd pick 14 points to match the count on other metrics, but would make it steeper.

Author


I took your suggestion and updated the buckets to be [ 0.01, 0.05, 0.1, 0.5, 1, 5, 10, 30, 60, 300, 600, 1800, 3600, 14400 ]. Let me know what you think.
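For reference, a sketch of how those boundaries could be applied with the Python SDK via a View (illustrative only; instrumentations would pass them as ExplicitBucketBoundaries advice instead):

```python
# Sketch: applying the proposed boundaries with an SDK View so the exported
# histogram uses exactly these buckets.
from opentelemetry.sdk.metrics import MeterProvider
from opentelemetry.sdk.metrics.export import ConsoleMetricExporter, PeriodicExportingMetricReader
from opentelemetry.sdk.metrics.view import ExplicitBucketHistogramAggregation, View

boundaries = [0.01, 0.05, 0.1, 0.5, 1, 5, 10, 30, 60, 300, 600, 1800, 3600, 14400]

provider = MeterProvider(
    metric_readers=[PeriodicExportingMetricReader(ConsoleMetricExporter())],
    views=[View(
        instrument_name="messaging.latency.duration",
        aggregation=ExplicitBucketHistogramAggregation(boundaries=boundaries),
    )],
)
```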

@kjschnei001
Author

I appreciate all the wonderful collaboration going on here. Quick logistical note that I will continue to be afk for the next week and a half, but I look forward to incorporating all the feedback upon my return.

@pyohannes
Contributor

Given the current messaging metric naming convention of messaging.<operation type>.[duration|message], I wondered whether it would be appropriate to have a "synthetic" operation type called "enqueued". Then we could have the metric messaging.enqueued.duration for the latency (the duration a message was enqueued) and messaging.enqueued.messages for the lag (this would be an UpDownCounter).
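A rough sketch of what that pair could look like with the OpenTelemetry Python API (names are not final, purely illustrative):

```python
# Sketch of the proposed pair of instruments (names not final): a histogram for
# the time a message spent enqueued, and an UpDownCounter for the number of
# currently enqueued (unsettled) messages.
from opentelemetry import metrics

meter = metrics.get_meter("example.messaging.instrumentation")

enqueued_duration = meter.create_histogram(
    "messaging.enqueued.duration", unit="s",
    description="Time a message spent enqueued before being consumed.",
)
enqueued_messages = meter.create_up_down_counter(
    "messaging.enqueued.messages", unit="{message}",
    description="Number of messages currently enqueued.",
)
```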


github-actions bot commented May 4, 2024

This PR was marked stale due to lack of activity. It will be closed in 7 days.

@github-actions bot added the Stale label May 4, 2024
@kjschnei001 changed the title from "feat: add specification for messaging.consumer.latency.duration" to "feat: add specification for messaging latencies" May 4, 2024
@github-actions bot removed the Stale label May 5, 2024
| `messaging.latency.duration` | Histogram | `s` | Measures the observed duration between when a message was created and when it began being processed. | ![Experimental](https://img.shields.io/badge/-experimental-blue) |
<!-- endsemconv -->

### Metric: `metric.messaging.buffering.duration`

@tednaleid May 16, 2024


I've read the descriptions of metric.messaging.latency.duration and metric.messaging.buffering.duration twice and I'm still not 100% sure I understand the difference between them (or why this one is called "buffering").

I think buffering is the time between when the event was written to the Kafka topic and the time that it was read by the consumer.

And latency is buffering plus the time that it took for the producer to write the message to the topic, right?

If this is correct, there are terms that Confluent uses that I think I prefer (from: https://www.confluent.io/blog/configure-kafka-to-minimize-latency/).

The term end-to-end latency instead of just latency:

"End-to-end latency" is the time between when the application logic produces a record via KafkaProducer.send() to when the record can be consumed by the application logic via KafkaConsumer.poll().

They don't have as clear of a name for what we're calling metric.messaging.latency.duration, but they refer to the period I think we're trying to measure here as "catch up latency".

That feels more descriptive to me than just buffering as it's how far the consumer needs to go to "catch up" to current messages on the topic.

So maybe:

- `messaging.endtoend.latency.duration`
- `messaging.catchup.latency.duration`

Here's the specific diagram from Confluent that I'm referencing:

[Screenshot: Confluent's end-to-end latency diagram]
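In other words, per my reading of the two metrics (made-up timestamps, just to illustrate the decomposition):

```python
# Decomposition per the reading above (made-up timestamps, in seconds).
send_called_at   = 100.0   # application calls producer.send()
appended_at      = 100.4   # record written to the topic/partition
poll_returned_at = 103.0   # record handed to the application by poll()

buffering_or_catchup = poll_returned_at - appended_at     # 2.6 s ("buffering" / "catch-up")
end_to_end_latency   = poll_returned_at - send_called_at  # 3.0 s (includes produce time)
```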


github-actions bot commented Jun 1, 2024

This PR was marked stale due to lack of activity. It will be closed in 7 days.

@github-actions bot added the Stale label Jun 1, 2024

github-actions bot commented Jun 9, 2024

Closed as inactive. Feel free to reopen if this PR is still being worked on.

@github-actions bot closed this Jun 9, 2024

Successfully merging this pull request may close these issues.

Messaging: support for something like messaging.record.queue.duration metric