Register Kafka "records-lag" metric per topic-partition #878

Closed
alexanderabramov opened this issue Sep 26, 2018 · 17 comments
Labels
bug A general bug
Comments

@alexanderabramov

Affects: master version of micrometer-core (SNAPSHOT-1.1.0-20180921.212031-73).

This is in a way related to #877.

I've been trying to get the "records-lag" metric (or "records-lag-max", which would have been fine too) per topic-partition to show up in /actuator/prometheus/.
I'm not sure whether the current KafkaConsumerMetrics is expected to register it; KafkaConsumerMetrics.java#L236 hints that it is interested in per-topic metrics, but currently it does not register them. I think this is a very valuable metric for applications consuming more than one topic.

After digging around in the code, I believe there are some issues with how the current implementation of KafkaConsumerMetrics registers the metrics.

  1. It tries to register all metrics with all combinations of tags, instead of only the combinations that actually exist.
    At KafkaConsumerMetrics.java#L201 the JMX NotificationListener is called first with only client-id (many times, actually, but maybe that's a kafka-clients issue), then with [client-id, topic], then with [client-id, topic, partition], and each time it goes through the whole list at KafkaConsumerMetrics.java#L76.

For the list of metric/tag combinations actually coming from Kafka via JMX, see:
kafka FetcherMetricsRegistry.java#L66
kafka Fetcher.java#L1326

  2. It does add a "topic" tag, but not a "partition" tag, whereas e.g. "records-lag-max" is only exported by Kafka per topic-partition, not per topic. So it's not clear how meaningful "records-lag-max" per topic would be even if it did get added to the Prometheus registry.
    See nameTag(), KafkaConsumerMetrics.java#L240

The way to address this, I believe, is to have registerMetricsEventually take three callbacks, one for each combination of tags coming from Kafka:

  • client-id
  • client-id, topic
  • client-id, topic, partition

Each callback would then register only the metrics defined for that tag combination, roughly as sketched below.
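
A rough sketch of that dispatch (assuming a JMX NotificationListener already subscribed to MBean registration notifications; the registerXxxLevelMetrics methods are placeholders, not existing micrometer API):

    import javax.management.MBeanServerNotification;
    import javax.management.Notification;
    import javax.management.NotificationListener;
    import javax.management.ObjectName;

    class FetchManagerDispatchSketch implements NotificationListener {
        @Override
        public void handleNotification(Notification notification, Object handback) {
            if (!(notification instanceof MBeanServerNotification)
                    || !MBeanServerNotification.REGISTRATION_NOTIFICATION.equals(notification.getType())) {
                return;
            }
            ObjectName name = ((MBeanServerNotification) notification).getMBeanName();
            if (!"consumer-fetch-manager-metrics".equals(name.getKeyProperty("type"))) {
                return;
            }
            // Dispatch on the key properties the newly registered MBean actually carries,
            // instead of trying every metric/tag combination on every notification.
            if (name.getKeyProperty("partition") != null) {
                registerPartitionLevelMetrics(name);   // e.g. records-lag, records-lag-max, records-lag-avg
            } else if (name.getKeyProperty("topic") != null) {
                registerTopicLevelMetrics(name);       // e.g. records-consumed-rate, bytes-consumed-rate
            } else {
                registerClientLevelMetrics(name);      // e.g. fetch-rate, fetch-latency-avg
            }
        }

        // Placeholders: each registers only the meters defined for its tag combination.
        private void registerPartitionLevelMetrics(ObjectName name) { }
        private void registerTopicLevelMetrics(ObjectName name) { }
        private void registerClientLevelMetrics(ObjectName name) { }
    }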

If KafkaConsumerMetrics wants to support "records-lag-max" at two levels, then #877 needs to be solved too.

Another approach is to add it (or "records-lag") at the topic-partition level only, not at the client-id level, and then aggregate in Prometheus if desired. Personally, I've implemented a MeterBinder that does just this.
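
For reference, a minimal sketch of such a binder (illustrative only, not anything micrometer ships; it assumes a 2.x kafka-clients where the per-partition lag carries topic and partition tags rather than encoding them in the metric name, and it only sees partitions already assigned at bind time, so in practice it would need to be re-run after rebalances or combined with the notification-based approach sketched above):

    import java.util.Map;

    import io.micrometer.core.instrument.Gauge;
    import io.micrometer.core.instrument.MeterRegistry;
    import io.micrometer.core.instrument.binder.MeterBinder;
    import org.apache.kafka.clients.consumer.Consumer;
    import org.apache.kafka.common.Metric;
    import org.apache.kafka.common.MetricName;

    public class ConsumerPartitionLagMetrics implements MeterBinder {
        private final Consumer<?, ?> consumer;

        public ConsumerPartitionLagMetrics(Consumer<?, ?> consumer) {
            this.consumer = consumer;
        }

        @Override
        public void bindTo(MeterRegistry registry) {
            for (Map.Entry<MetricName, ? extends Metric> entry : consumer.metrics().entrySet()) {
                MetricName name = entry.getKey();
                Metric metric = entry.getValue();
                // Only the per-partition lag gauge from the fetch manager group.
                if ("consumer-fetch-manager-metrics".equals(name.group())
                        && "records-lag".equals(name.name())
                        && name.tags().containsKey("partition")) {
                    Gauge.builder("kafka.consumer.records.lag", metric, ConsumerPartitionLagMetrics::lagValue)
                            .tag("client.id", name.tags().getOrDefault("client-id", "unknown"))
                            .tag("topic", name.tags().getOrDefault("topic", "unknown"))
                            .tag("partition", name.tags().getOrDefault("partition", "unknown"))
                            .register(registry);
                }
            }
        }

        private static double lagValue(Metric metric) {
            Object v = metric.metricValue();
            return v instanceof Number ? ((Number) v).doubleValue() : Double.NaN;
        }
    }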

@jkschneider jkschneider added this to the 1.1.0-rc.1 milestone Sep 27, 2018
@jkschneider jkschneider modified the milestones: 1.1.0-rc.1, 1.1.0-rc.2, 1.1.0 Oct 15, 2018
@jkschneider jkschneider added the bug A general bug label Oct 26, 2018
@jkschneider
Contributor

jkschneider commented Oct 26, 2018

I'm no Kafka expert, so I could really use some help in the form of additional tests that demonstrate what you're talking about.

On (1), when I test it with KafkaConsumerMetricsTest, I only ever see the records-lag-max meter created with the client tag. I don't see a second iteration adding topic or a third with partition. What am I doing wrong?

On (2), again, I only see client-id metadata in JMX, and nothing about partition. What am I doing wrong?

[screenshot]

@jkschneider jkschneider modified the milestones: 1.1.0, 1.x Oct 27, 2018
@wardhapk

wardhapk commented Oct 29, 2018

@jkschneider I don't see per-partition consumer metrics exposed by Kafka via the MBeans. Please refer to https://docs.confluent.io/current/kafka/monitoring.html
All the consumer-fetch-manager-metrics are per consumer thread.

MBean: kafka.consumer:type=consumer-fetch-manager-metrics,client-id=([-.\w]+)
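
For reference, a quick way to dump the fetch-manager MBeans a running consumer actually registers (a throwaway sketch to call from inside the consumer application; the class name is arbitrary):

    import java.lang.management.ManagementFactory;

    import javax.management.MBeanServer;
    import javax.management.MalformedObjectNameException;
    import javax.management.ObjectName;

    final class FetchManagerMBeanDump {
        // Call this once the consumer has been polling for a while.
        static void print() throws MalformedObjectNameException {
            MBeanServer server = ManagementFactory.getPlatformMBeanServer();
            // Matches every consumer-fetch-manager-metrics MBean, whatever additional
            // key properties (client-id, topic, partition) it carries.
            ObjectName pattern = new ObjectName("kafka.consumer:type=consumer-fetch-manager-metrics,*");
            server.queryNames(pattern, null).forEach(System.out::println);
        }
    }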

@jkschneider
Contributor

@wardhapk I think we are saying the same thing. I also don't see per-partition consumer metrics, and Micrometer isn't attempting to tag on partition.

@wardhapk

@jkschneider Just figured out that newer versions of kafka-clients expose a per-partition records-lag metric in addition to records-lag-avg and records-lag-max (tested in version 1.0.2).

[screenshot]

    /***** Partition level *****/
    this.partitionRecordsLag = new MetricNameTemplate("{topic}-{partition}.records-lag", groupName,
            "The latest lag of the partition", tags);
    this.partitionRecordsLagMax = new MetricNameTemplate("{topic}-{partition}.records-lag-max", groupName,
            "The max lag of the partition", tags);
    this.partitionRecordsLagAvg = new MetricNameTemplate("{topic}-{partition}.records-lag-avg", groupName,
            "The average lag of the partition", tags);

@jkschneider
Contributor

Ah, so you are saying the partition is embedded in the MBean name. Now I think I understand.

@larsduelfer
Contributor

If possible it would also be great to have the consumer group as an additional tag in the metrics.

@jkschneider
Contributor

jkschneider commented Nov 29, 2018

I took a stab at correcting the reason why some Kafka metrics were reporting NaN in e9b3efe, but am still pretty confused about exactly what is coming out of Kafka. I get MBean notifications for partition-level metrics but they always seem to be NaN and I don't see them in jvisualvm. Totally confused about what is going on. Anybody else willing to take a shot at this?

@wardhapk

wardhapk commented Nov 29, 2018

I can give it a shot

@jkschneider jkschneider modified the milestones: 1.1.1, 1.1.2 Nov 29, 2018
@larsduelfer
Contributor

larsduelfer commented Dec 10, 2018

Hi @jkschneider

We updated to version 1.1.1 of Micrometer. It is better than before, but still not correct: we now get metrics per consumer, but since a consumer can have multiple topics/partitions assigned, the figures are still wrong. It only takes the first assigned topic/partition of a consumer.

Example:

A consumer works on 2 topic partitions. The lag on the first partition is 10 and the lag on the second partition is 20. The reported value for that consumer is 10 and not 30.

Actually, the values should not be aggregated at all, but reported independently. To be able to do that, the topic name and partition number must be added as additional labels; then the reporting works as expected.

Regards, Lars

@larsduelfer
Contributor

@jkschneider
Here is a snapshot of jconsole showing an MBean while a consumer application is running:

[screenshot: jconsole MBean view]

As you can see, the consumer-fetch-manager-metrics MBean has metrics per partition of a topic when the consumer is assigned to more than one partition. These should be reported by Micrometer.

@jkschneider jkschneider reopened this Dec 12, 2018
@larsduelfer
Contributor

One more comment: the metric "records-lag", which represents the latest lag of a consumer for a certain topic and partition, is the relevant one. "records-lag-max" can also be calculated downstream by the monitoring solution (such as Prometheus), although it does not hurt if it is exposed as well.

fkoehler added a commit to fkoehler/micrometer that referenced this issue Jan 8, 2019
…ith topic and partition name

I verified that this works as intended by creating a lag locally. I received the correct metrics via JMX and micrometer.

Caveat:
- I was not able to formulate a test for this. I also tried running the current tests for the KafkaConsumer and I believe they are not really testing anything: most of the tests have no assertions, and no real Kafka is used when they run. But it might be that I am missing some important parts.
@fkoehler
Contributor

fkoehler commented Jan 9, 2019

If possible it would also be great to have the consumer group as an additional tag in the metrics.

The consumer group is not exposed via the Kafka MBean metrics. One option might be to add the consumer group to your client.id.
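
A possible sketch of that workaround (group and client names below are made-up placeholders):

    import java.util.Properties;

    import org.apache.kafka.clients.consumer.ConsumerConfig;

    class ConsumerConfigExample {
        static Properties consumerProperties() {
            Properties props = new Properties();
            props.put(ConsumerConfig.GROUP_ID_CONFIG, "order-processing");
            // Encode the group in client.id so it shows up in the client-id tag
            // of the consumer-fetch-manager-metrics MBeans.
            props.put(ConsumerConfig.CLIENT_ID_CONFIG, "order-processing-consumer-1");
            return props;
        }
    }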

@jkschneider jkschneider modified the milestones: 1.1.2, 1.1.3 Jan 10, 2019
jkschneider pushed a commit that referenced this issue Jan 10, 2019
…ition name

I verified that this works as intended by creating a lag locally. I received the correct metrics via JMX and micrometer.

Caveat:
- I was not able to formulate a test for this. I also tried running the current tests for the KafkaConsumer and I believe they are not really testing anything: most of the tests have no assertions, and no real Kafka is used when they run. But it might be that I am missing some important parts.
@larsduelfer
Contributor

As mentioned, this is addressed in #1136. However, I left a comment there since we still see the same error as before and could not get this working even with 1.1.2, which was supposed to fix this issue.

@larsduelfer
Contributor

See #1136: We will provide a pull request to fix this.

larsduelfer pushed a commit to larsduelfer/micrometer that referenced this issue Feb 1, 2019
… partition (micrometer-metrics#878)

The Kafka consumer fetch manager metrics were registered too many times, and therefore also for the wrong MBeans. Hence, when using Prometheus as a registry, for example, only the metrics registered for the wrong MBean were shown, with NaN as values. With this fix the metrics are registered for the right MBeans, and when they are provided at different levels (per consumer and topic, as well as per consumer, topic and partition) they are registered only once, at the lowest level of detail. Aggregation can be done with the monitoring tools.
jkschneider pushed a commit that referenced this issue Feb 6, 2019
… partition (#878)

The Kafka consumer fetch manager metrics were registered too many times, and therefore also for the wrong MBeans. Hence, when using Prometheus as a registry, for example, only the metrics registered for the wrong MBean were shown, with NaN as values. With this fix the metrics are registered for the right MBeans, and when they are provided at different levels (per consumer and topic, as well as per consumer, topic and partition) they are registered only once, at the lowest level of detail. Aggregation can be done with the monitoring tools.
@shakuzen
Member

shakuzen commented Feb 7, 2019

This should be fixed by #1166. Can someone try the 1.1.3 snapshots and confirm?

@larsduelfer
Copy link
Contributor

@shakuzen: Since I provided the mentioned pull request because of this ticket (which I also faced, as you can see in the discussion further up), I can confirm that it works. If you need somebody else, maybe @fkoehler can confirm this as well.

@shakuzen
Member

Thanks @larsduelfer for the fix and the confirmation. I'll close this then. We're planning on releasing 1.1.3 this week with the fix. If anyone finds any additional issue, let us know.
