
feat: add specification for messaging latencies #895

Conversation

kjschnei001

Fixes #792

Changes

This PR introduces a metric for messaging systems to report the time difference from when a message was produced to when it was consumed. This is a typical business metric used to indicate the health of an asynchronous messaging system.
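For illustration only (not part of the proposed changes, and with hypothetical meter and attribute names), recording such a time difference with the OpenTelemetry Python metrics API could look roughly like this:

```python
# Illustrative sketch only: records the produced-to-consumed time difference
# as a histogram. Names and attributes here are placeholders, not part of this PR.
import time
from opentelemetry import metrics

meter = metrics.get_meter("example.messaging.instrumentation")  # hypothetical name

latency_histogram = meter.create_histogram(
    name="messaging.consumer.latency.duration",
    unit="s",
    description="Duration between message production and consumption.",
)

def on_message_consumed(produce_timestamp_s: float) -> None:
    # produce_timestamp_s would come from the message (creation or enqueue time);
    # "now" is taken on the consumer, so clock skew can make this negative.
    latency = time.time() - produce_timestamp_s
    latency_histogram.record(latency, attributes={"messaging.system": "kafka"})
```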

Merge requirement checklist

@kjschnei001 requested review from a team on April 5, 2024 17:31
Contributor

@pyohannes left a comment


Thanks @kjschnei001 for starting this, I'm looking forward to discussions around this topic.

<!-- semconv metric.messaging.consumer.latency.duration(metric_table) -->
| Name | Instrument Type | Unit (UCUM) | Description | Stability |
| -------- | --------------- | ----------- | -------------- | --------- |
| `messaging.consumer.latency.duration` | Histogram | `s` | Measures the duration between message production and consumption. | ![Experimental](https://img.shields.io/badge/-experimental-blue) |
Contributor


> the duration between message production and consumption

I wonder if we'd need to define this more exactly. Some messaging systems provide you with the time a message was created ("message creation time" via the client API), while others provide you with the time a message was written into a partition ("message enqueue time" set by the broker).

I'm not sure if it's feasible to specify exact semantics that are applicable across all messaging systems; however, I think it's beneficial to have consistent directions for each specific messaging system (to have consistent semantics and implementations across different client library implementations of the same system).
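For illustration (a sketch only, not part of this PR, assuming the confluent-kafka Python client): Kafka surfaces either the producer-set creation time or the broker's log-append (enqueue) time on the same record field, depending on topic configuration, so an instrumentation has to know which of the two it is actually reading.

```python
# Illustrative sketch: Kafka reports either the producer-set creation time or the
# broker's log-append (enqueue) time on the same record field, depending on the
# topic's message.timestamp.type configuration.
import time
from confluent_kafka import TIMESTAMP_CREATE_TIME, TIMESTAMP_LOG_APPEND_TIME

def produced_to_consumed_seconds(msg):
    ts_type, ts_ms = msg.timestamp()
    if ts_type == TIMESTAMP_CREATE_TIME:
        kind = "creation time"      # set by the producing client
    elif ts_type == TIMESTAMP_LOG_APPEND_TIME:
        kind = "enqueue time"       # set by the broker
    else:
        return None, None           # timestamp not available
    return time.time() - ts_ms / 1000.0, kind
```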

Author


This came up in a few comments. I tried to address this in my latest changes by declaring two different latencies that represent "creation to processing" time and "enqueued to processing" time. I proposed specific names, messaging.latency.duration and messaging.buffering.duration respectively, but I'm open to suggestions.

component: messaging

# A brief description of the change. Surround your text with quotes ("") if it needs to start with a backtick (`).
note: "Add `messaging.consumer.latency.duration` to capture latency between production and consumption."
Contributor

@lmolkova Apr 8, 2024


I suggest calling it `messaging.consumer.lag`.

Suggested change:
- note: "Add `messaging.consumer.latency.duration` to capture latency between production and consumption."
+ note: "Add `messaging.consumer.lag` to capture the time difference between when a message was published and when it was consumed."

(and correct the brief in the yaml)

Contributor


> I suggest calling it `messaging.consumer.lag`.

Let's be careful with that. Consumer lag is commonly defined as a difference of offsets (producer's end offset minus consumer's last committed offset); see the example documentation for Kafka or Azure Event Hubs. Latency is a difference of timestamps.

Both measurements are very important to have; however, to avoid confusion with established terminology, what's proposed in this PR shouldn't be called "lag".
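A minimal illustration of the distinction, with made-up numbers:

```python
# Lag: difference of offsets. Latency: difference of timestamps.
end_offset = 1_250          # producer's end offset for the partition
committed_offset = 1_100    # consumer's last committed offset
consumer_lag = end_offset - committed_offset            # 150 messages behind

produce_timestamp_s = 1_712_563_200.0   # when the message was published
consume_timestamp_s = 1_712_563_202.5   # when it was consumed
latency_s = consume_timestamp_s - produce_timestamp_s   # 2.5 seconds
```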

Contributor


Agreed, so we need to find a different name. I still think that latency.duration or latency would not work, and I'd prefer a name that emphasizes that it's the time a message spent on the broker before being consumed rather than an in-process latency/duration.

It's common to call it time-in-queue, but we're avoiding queue/topic terminology.

Contributor


> but I still think that latency.duration or latency would not work

I agree. Basically, we're measuring the duration from the end of the "publish" operation to the beginning of the "process" operation. We don't yet have a satisfying name for this.

The term "enqueued" seems to be a possible candidate (although it clashes with our intention to avoid the term). One could then have messaging.enqueued.duration for the latency (time difference), and messaging.enqueued.count for the number of unsettled messages in the topic/queue (offset difference).

Author


This came up in a few comments. I tried to address this in my latest changes by declaring two different latencies that represent "creation to processing" time and "enqueued to processing" time. I proposed specific names, messaging.latency.duration and messaging.buffering.duration respectively, but I'm open to suggestions.

@@ -179,6 +180,20 @@ _Note: The need to report `messaging.process.messages` depends on the messaging
| `messaging.process.messages` | Counter | `{message}` | Measures the number of processed messages. | ![Experimental](https://img.shields.io/badge/-experimental-blue) |
<!-- endsemconv -->

### Metric: `messaging.consumer.latency.duration`

This metric is [recommended][MetricRecommended] for any consumer with the capability to extract these timings.
Contributor

@lmolkova Apr 8, 2024


Related to @pyohannes' comment below - I believe we only need the creation time, and the consumer can use "now". We however need to clarify if "now" means the time when the message is delivered to the application vs the time it's prefetched (i.e. it arrives on the consumer, but may stay in internal client library queues for a while).

Contributor


Another thing we need to call out is that the time is recorded on different machines and is skewed.
We should pick a strategy for how to record negative differences.

Contributor


> Another thing we need to call out is that the time is recorded on different machines and is skewed.

Yes, I encountered such cases. It was especially critical because we used the metric for defining alerts. We ended up skipping negative latencies and recording the count of negative latencies in a different metric.

> I believe we only need the creation time

We'd need some basic prototyping before settling on this. Popular systems allow customizations around either recording creation time (a user-specified timestamp) or enqueue time (a property set by the broker); see the documentation for Kafka and RabbitMQ. I know that Azure Event Hubs messages know their enqueued time (see here), but I'm not sure one can obtain the creation time.

> We however need to clarify if "now" means the time when the message is delivered to the application vs the time it's prefetched

The first seems more intuitive to me: the "consumer latency duration" (or whatever we end up calling it) would end when the messaging.process.duration starts.
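To make the negative-latency strategy mentioned above concrete, a rough sketch (instrument names here are hypothetical, not something this PR proposes):

```python
# Sketch of one possible clock-skew strategy: skip negative latencies and count
# them in a separate counter instead of recording them in the histogram.
from opentelemetry import metrics

meter = metrics.get_meter("example.messaging.instrumentation")
latency_histogram = meter.create_histogram("messaging.latency.duration", unit="s")
negative_counter = meter.create_counter("messaging.latency.negative")  # hypothetical name

def record_latency(latency_s: float) -> None:
    if latency_s < 0:
        negative_counter.add(1)
    else:
        latency_histogram.record(latency_s)
```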

Author


This came up in a few comments. I tried to address this in my latest changes by declaring two different latencies that represent "creation to processing" time and "enqueued to processing" time. I proposed specific names, messaging.latency.duration and messaging.buffering.duration respectively, but I'm open to suggestions.


This metric SHOULD be specified with
[`ExplicitBucketBoundaries`](https://github.com/open-telemetry/opentelemetry-specification/tree/v1.31.0/specification/metrics/api.md#instrument-advice)
of `[ 0.005, 0.01, 0.025, 0.05, 0.075, 0.1, 0.25, 0.5, 0.75, 1, 2.5, 5, 7.5, 10, 30, 60, 300, 600, 1800 ]`.
Contributor

@lmolkova Apr 8, 2024


I suggest reducing the number of buckets. I wonder if we should start with exponential boundaries.

Given time skew, we probably can't guarantee precision below several seconds, but I guess starting with 0.005 (5 ms) is fine since some systems can attempt to minimize the skew. I'd end in the hours range (3600), though.

I'd pick 14 points to match the count on other metrics, but would make it steeper.

Author


I took your suggestion and updated the buckets to be [ 0.01, 0.05, 0.1, 0.5, 1, 5, 10, 30, 60, 300, 600, 1800, 3600, 14400 ]. Let me know what you think.
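For reference, a sketch of how those boundaries could be applied with the Python SDK via a View (illustrative only; instrumentations would pass them as ExplicitBucketBoundaries advice instead):

```python
# Sketch: applying the proposed boundaries with an SDK View so the exported
# histogram uses exactly these buckets.
from opentelemetry.sdk.metrics import MeterProvider
from opentelemetry.sdk.metrics.export import ConsoleMetricExporter, PeriodicExportingMetricReader
from opentelemetry.sdk.metrics.view import ExplicitBucketHistogramAggregation, View

boundaries = [0.01, 0.05, 0.1, 0.5, 1, 5, 10, 30, 60, 300, 600, 1800, 3600, 14400]

provider = MeterProvider(
    metric_readers=[PeriodicExportingMetricReader(ConsoleMetricExporter())],
    views=[View(
        instrument_name="messaging.latency.duration",
        aggregation=ExplicitBucketHistogramAggregation(boundaries=boundaries),
    )],
)
```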

@kjschnei001
Author

I appreciate all the wonderful collaboration going on here. Quick logistical note that I will continue to be afk for the next week and a half, but I look forward to incorporating all the feedback upon my return.

@pyohannes
Contributor

Given the current messaging metric naming convention of messaging.<operation type>.[duration|message], I wondered whether it would be appropriate to have a "synthetic" operation type called "enqueued". Then we could have the metric messaging.enqueued.duration for the latency (the duration a message was enqueued) and messaging.enqueued.messages for the lag (this would be an UpDownCounter).
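A rough sketch of what that pair could look like with the OpenTelemetry Python API (names are not final, purely illustrative):

```python
# Sketch of the proposed pair of instruments (names not final): a histogram for
# the time a message spent enqueued, and an UpDownCounter for the number of
# currently enqueued (unsettled) messages.
from opentelemetry import metrics

meter = metrics.get_meter("example.messaging.instrumentation")

enqueued_duration = meter.create_histogram(
    "messaging.enqueued.duration", unit="s",
    description="Time a message spent enqueued before being consumed.",
)
enqueued_messages = meter.create_up_down_counter(
    "messaging.enqueued.messages", unit="{message}",
    description="Number of messages currently enqueued.",
)
```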


github-actions bot commented May 4, 2024

This PR was marked stale due to lack of activity. It will be closed in 7 days.

@github-actions bot added the Stale label May 4, 2024
@kjschnei001 changed the title from "feat: add specification for messaging.consumer.latency.duration" to "feat: add specification for messaging latencies" May 4, 2024
@github-actions bot removed the Stale label May 5, 2024
| `messaging.latency.duration` | Histogram | `s` | Measures the observed duration between when a message was created and when it began being processed. | ![Experimental](https://img.shields.io/badge/-experimental-blue) |
<!-- endsemconv -->

### Metric: `metric.messaging.buffering.duration`

@tednaleid May 16, 2024


I've read the descriptions of metric.messaging.latency.duration and metric.messaging.buffering.duration twice and I'm still not 100% sure I understand the difference between them (or why this one is called "buffering").

I think buffering is the time between when the event was written to the Kafka topic and the time that it was read by the consumer.

And latency is buffering plus the time that it took for the producer to write the message to the topic, right?

If this is correct, there are terms that Confluent uses that I think I prefer (from: https://www.confluent.io/blog/configure-kafka-to-minimize-latency/).

The term end-to-end latency instead of just latency:

"End-to-end latency" is the time between when the application logic produces a record via KafkaProducer.send() to when the record can be consumed by the application logic via KafkaConsumer.poll().

They don't have as clear of a name for what we're calling metric.messaging.latency.duration, but they refer to the period I think we're trying to measure here as "catch up latency".

That feels more descriptive to me than just buffering as it's how far the consumer needs to go to "catch up" to current messages on the topic.

So maybe:

- `messaging.endtoend.latency.duration`
- `messaging.catchup.latency.duration`

Here's the specific diagram from Confluent that I'm referencing:

[Screenshot: Confluent's end-to-end latency diagram]
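In other words, per my reading of the two metrics (made-up timestamps, just to illustrate the decomposition):

```python
# Decomposition per the reading above (made-up timestamps, in seconds).
send_called_at   = 100.0   # application calls producer.send()
appended_at      = 100.4   # record written to the topic/partition
poll_returned_at = 103.0   # record handed to the application by poll()

buffering_or_catchup = poll_returned_at - appended_at     # 2.6 s ("buffering" / "catch-up")
end_to_end_latency   = poll_returned_at - send_called_at  # 3.0 s (includes produce time)
```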


github-actions bot commented Jun 1, 2024

This PR was marked stale due to lack of activity. It will be closed in 7 days.

@github-actions bot added the Stale label Jun 1, 2024

github-actions bot commented Jun 9, 2024

Closed as inactive. Feel free to reopen if this PR is still being worked on.

@github-actions bot closed this Jun 9, 2024

Successfully merging this pull request may close these issues.

Messaging: support for something like messaging.record.queue.duration metric