Register Kafka "records-lag" metric per topic-partition #878

Closed
alexanderabramov opened this issue Sep 26, 2018 · 17 comments
Labels
bug A general bug
Comments

@alexanderabramov

Affects: master version of micrometer-core (SNAPSHOT-1.1.0-20180921.212031-73).

This is in a way related to #877.

I've been trying to get the "records-lag" metric (or "records-lag-max", which would have been fine too) per topic-partition to show up in /actuator/prometheus/.
I'm not sure whether the current KafkaConsumerMetrics is expected to register it; KafkaConsumerMetrics.java#L236 hints that it is interested in per-topic metrics, but currently it does not register them. I think this is a very valuable metric for applications consuming more than one topic.

After digging around in the code, I believe there are some issues with how the current implementation of KafkaConsumerMetrics registers the metrics.

  1. It tries to register all metrics with all combinations of tags, instead of only the combinations that actually exist.
    At KafkaConsumerMetrics.java#L201 the JMX NotificationListener is called first with only client-id (many times, actually, but maybe that's a kafka-clients issue), then with [client-id, topic], then with [client-id, topic, partition], and each time it goes through the whole list at KafkaConsumerMetrics.java#L76.

For the list of metric/tag combinations actually coming from Kafka via JMX, see:
kafka FetcherMetricsRegistry.java#L66
kafka Fetcher.java#L1326

  2. It does add a "topic" tag, but not a "partition" tag, whereas e.g. "records-lag-max" is only exported by Kafka per topic-partition, not per topic. So it's not clear how meaningful "records-lag-max" per topic would be even if it did get added to the Prometheus registry.
    See nameTag(), KafkaConsumerMetrics.java#L240

The way to address this, I believe, is to have registerMetricsEventually take three callbacks, one for each combination of tags coming from Kafka:

  • client-id
  • client-id, topic
  • client-id, topic, partition

Each callback would then register only the metrics defined for that tag combination, roughly as sketched below.
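
A rough sketch of that dispatch (assuming a JMX NotificationListener already subscribed to MBean registration notifications; the registerXxxLevelMetrics methods are placeholders, not existing micrometer API):

    import javax.management.MBeanServerNotification;
    import javax.management.Notification;
    import javax.management.NotificationListener;
    import javax.management.ObjectName;

    class FetchManagerDispatchSketch implements NotificationListener {
        @Override
        public void handleNotification(Notification notification, Object handback) {
            if (!(notification instanceof MBeanServerNotification)
                    || !MBeanServerNotification.REGISTRATION_NOTIFICATION.equals(notification.getType())) {
                return;
            }
            ObjectName name = ((MBeanServerNotification) notification).getMBeanName();
            if (!"consumer-fetch-manager-metrics".equals(name.getKeyProperty("type"))) {
                return;
            }
            // Dispatch on the key properties the newly registered MBean actually carries,
            // instead of trying every metric/tag combination on every notification.
            if (name.getKeyProperty("partition") != null) {
                registerPartitionLevelMetrics(name);   // e.g. records-lag, records-lag-max, records-lag-avg
            } else if (name.getKeyProperty("topic") != null) {
                registerTopicLevelMetrics(name);       // e.g. records-consumed-rate, bytes-consumed-rate
            } else {
                registerClientLevelMetrics(name);      // e.g. fetch-rate, fetch-latency-avg
            }
        }

        // Placeholders: each registers only the meters defined for its tag combination.
        private void registerPartitionLevelMetrics(ObjectName name) { }
        private void registerTopicLevelMetrics(ObjectName name) { }
        private void registerClientLevelMetrics(ObjectName name) { }
    }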

If KafkaConsumerMetrics wants to support "records-lag-max" at two levels, then #877 needs to be solved too.

Another approach is to add it (or "records-lag") at the topic-partition level only, not at the client-id level, and then aggregate in Prometheus if desired. Personally, I've implemented a MeterBinder that does just this.
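
For reference, a minimal sketch of such a binder (illustrative only, not anything micrometer ships; it assumes a 2.x kafka-clients where the per-partition lag carries topic and partition tags rather than encoding them in the metric name, and it only sees partitions already assigned at bind time, so in practice it would need to be re-run after rebalances or combined with the notification-based approach sketched above):

    import java.util.Map;

    import io.micrometer.core.instrument.Gauge;
    import io.micrometer.core.instrument.MeterRegistry;
    import io.micrometer.core.instrument.binder.MeterBinder;
    import org.apache.kafka.clients.consumer.Consumer;
    import org.apache.kafka.common.Metric;
    import org.apache.kafka.common.MetricName;

    public class ConsumerPartitionLagMetrics implements MeterBinder {
        private final Consumer<?, ?> consumer;

        public ConsumerPartitionLagMetrics(Consumer<?, ?> consumer) {
            this.consumer = consumer;
        }

        @Override
        public void bindTo(MeterRegistry registry) {
            for (Map.Entry<MetricName, ? extends Metric> entry : consumer.metrics().entrySet()) {
                MetricName name = entry.getKey();
                Metric metric = entry.getValue();
                // Only the per-partition lag gauge from the fetch manager group.
                if ("consumer-fetch-manager-metrics".equals(name.group())
                        && "records-lag".equals(name.name())
                        && name.tags().containsKey("partition")) {
                    Gauge.builder("kafka.consumer.records.lag", metric, ConsumerPartitionLagMetrics::lagValue)
                            .tag("client.id", name.tags().getOrDefault("client-id", "unknown"))
                            .tag("topic", name.tags().getOrDefault("topic", "unknown"))
                            .tag("partition", name.tags().getOrDefault("partition", "unknown"))
                            .register(registry);
                }
            }
        }

        private static double lagValue(Metric metric) {
            Object v = metric.metricValue();
            return v instanceof Number ? ((Number) v).doubleValue() : Double.NaN;
        }
    }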

@jkschneider jkschneider added this to the 1.1.0-rc.1 milestone Sep 27, 2018
@jkschneider jkschneider modified the milestones: 1.1.0-rc.1, 1.1.0-rc.2, 1.1.0 Oct 15, 2018
@jkschneider jkschneider added the bug A general bug label Oct 26, 2018
@jkschneider
Contributor

jkschneider commented Oct 26, 2018

I'm no Kafka expert, so I could really use some help in the form of additional tests that demonstrate what you're talking about.

On (1), when I test it with KafkaConsumerMetricsTest, I only ever see the records-lag-max meter created with the client tag. I don't see a second iteration adding topic or a third with partition. What am I doing wrong?

On (2), again, I only see client-id metadata in JMX, and nothing about partition. What am I doing wrong?

[screenshot]

@jkschneider jkschneider modified the milestones: 1.1.0, 1.x Oct 27, 2018
@wardhapk

wardhapk commented Oct 29, 2018

@jkschneider I don't see per-partition consumer metrics exposed by Kafka via the MBeans. Please refer to https://docs.confluent.io/current/kafka/monitoring.html
All the consumer-fetch-manager-metrics are per consumer thread.

MBean: kafka.consumer:type=consumer-fetch-manager-metrics,client-id=([-.\w]+)
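
For reference, a quick way to dump the fetch-manager MBeans a running consumer actually registers (a throwaway sketch to call from inside the consumer application; the class name is arbitrary):

    import java.lang.management.ManagementFactory;

    import javax.management.MBeanServer;
    import javax.management.MalformedObjectNameException;
    import javax.management.ObjectName;

    final class FetchManagerMBeanDump {
        // Call this once the consumer has been polling for a while.
        static void print() throws MalformedObjectNameException {
            MBeanServer server = ManagementFactory.getPlatformMBeanServer();
            // Matches every consumer-fetch-manager-metrics MBean, whatever additional
            // key properties (client-id, topic, partition) it carries.
            ObjectName pattern = new ObjectName("kafka.consumer:type=consumer-fetch-manager-metrics,*");
            server.queryNames(pattern, null).forEach(System.out::println);
        }
    }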

@jkschneider
Contributor

@wardhapk I think we are saying the same thing. I also don't see per-partition consumer metrics, and Micrometer isn't attempting to tag on partition.

@wardhapk

@jkschneider Just figured out that newer versions of kafka-clients expose a per-partition records-lag metric in addition to records-lag-avg and records-lag-max (tested in version 1.0.2).

[screenshot]

    /***** Partition level *****/
    this.partitionRecordsLag = new MetricNameTemplate("{topic}-{partition}.records-lag", groupName,
            "The latest lag of the partition", tags);
    this.partitionRecordsLagMax = new MetricNameTemplate("{topic}-{partition}.records-lag-max", groupName,
            "The max lag of the partition", tags);
    this.partitionRecordsLagAvg = new MetricNameTemplate("{topic}-{partition}.records-lag-avg", groupName,
            "The average lag of the partition", tags);

@jkschneider
Contributor

Ah, so you are saying the partition is embedded in the MBean name. Now I think I understand.

@larsduelfer
Contributor

If possible it would also be great to have the consumer group as an additional tag in the metrics.

@jkschneider
Contributor

jkschneider commented Nov 29, 2018

I took a stab at correcting the reason why some Kafka metrics were reporting NaN in e9b3efe, but am still pretty confused about exactly what is coming out of Kafka. I get MBean notifications for partition-level metrics but they always seem to be NaN and I don't see them in jvisualvm. Totally confused about what is going on. Anybody else willing to take a shot at this?

@wardhapk

wardhapk commented Nov 29, 2018

I can give it a shot

@jkschneider jkschneider modified the milestones: 1.1.1, 1.1.2 Nov 29, 2018
@larsduelfer
Contributor

larsduelfer commented Dec 10, 2018

Hi @jkschneider

We updated to version 1.1.1 of Micrometer. It is better than before, but still not correct: we now get metrics per consumer, but since a consumer can have multiple topics/partitions assigned, the figures are still wrong. It only takes the first assigned topic/partition of a consumer.

Example:

A consumer works on 2 topic partitions. The lag on the first partition is 10 and the lag on the second partition is 20. The reported value for that consumer is 10 and not 30.

Actually, the values should not be aggregated at all, but reported independently. To be able to do that, the topic name and partition number must be added as additional labels; then the reporting works as expected.

Regards, Lars

@larsduelfer
Contributor

@jkschneider
Here is a snapshot of jconsole showing an MBean while a consumer application is running:

[screenshot: jconsole MBean view]

As you can see, the consumer-fetch-manager-metrics MBean has metrics per partition of a topic when the consumer is assigned to more than one partition. These should be reported by Micrometer.

@jkschneider jkschneider reopened this Dec 12, 2018
@larsduelfer
Contributor

One more comment: the metric "records-lag", which represents the latest lag of a consumer for a certain topic and partition, is the relevant one. "records-lag-max" can also be calculated downstream by the monitoring solution (such as Prometheus), although it does not hurt if it is exposed as well.

fkoehler added a commit to fkoehler/micrometer that referenced this issue Jan 8, 2019
…ith topic and partition name

I verified that this works as intended by creating a lag locally. I received the correct metrics via JMX and micrometer.

Caveat:
- I was not able to formulate a test for this. I also tried running the current tests for the KafkaConsumer and I believe they are not really testing anything: most of the tests have no assertions, and no real Kafka is used when they run. But it might be that I am missing some important parts.
@fkoehler
Contributor

fkoehler commented Jan 9, 2019

If possible it would also be great to have the consumer group as an additional tag in the metrics.

The consumer group is not exposed via the Kafka MBean metrics. One option might be to add the consumer group to your client.id.
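
A possible sketch of that workaround (group and client names below are made-up placeholders):

    import java.util.Properties;

    import org.apache.kafka.clients.consumer.ConsumerConfig;

    class ConsumerConfigExample {
        static Properties consumerProperties() {
            Properties props = new Properties();
            props.put(ConsumerConfig.GROUP_ID_CONFIG, "order-processing");
            // Encode the group in client.id so it shows up in the client-id tag
            // of the consumer-fetch-manager-metrics MBeans.
            props.put(ConsumerConfig.CLIENT_ID_CONFIG, "order-processing-consumer-1");
            return props;
        }
    }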

@jkschneider jkschneider modified the milestones: 1.1.2, 1.1.3 Jan 10, 2019
jkschneider pushed a commit that referenced this issue Jan 10, 2019
…ition name

I verified that this works as intended by creating a lag locally. I received the correct metrics via JMX and micrometer.

Caveat:
- I was not able to formulate a test for this. I also tried running the current tests for the KafkaConsumer and I believe they are not really testing anything: most of the tests have no assertions, and no real Kafka is used when they run. But it might be that I am missing some important parts.
@larsduelfer
Contributor

As mentioned, this is addressed in #1136. However, I left a comment there since we still see the same error as before and could not get this working even with 1.1.2, which was supposed to fix this issue.

@larsduelfer
Contributor

See #1136: We will provide a pull request to fix this.

larsduelfer pushed a commit to larsduelfer/micrometer that referenced this issue Feb 1, 2019
… partition (micrometer-metrics#878)

The Kafka consumer fetch manager metrics were registered too many times, and therefore also for the wrong MBeans. Hence, when using Prometheus as a registry, for example, only the metrics registered for the wrong MBean were shown, with NaN as values. With this fix the metrics are registered for the right MBeans, and when they are provided at different levels (per consumer and topic, as well as per consumer, topic and partition) they are registered only once, at the lowest level of detail. Aggregation can be done with the monitoring tools.
jkschneider pushed a commit that referenced this issue Feb 6, 2019
… partition (#878)

The Kafka consumer fetch manager metrics were registered too many times, and therefore also for the wrong MBeans. Hence, when using Prometheus as a registry, for example, only the metrics registered for the wrong MBean were shown, with NaN as values. With this fix the metrics are registered for the right MBeans, and when they are provided at different levels (per consumer and topic, as well as per consumer, topic and partition) they are registered only once, at the lowest level of detail. Aggregation can be done with the monitoring tools.
@shakuzen
Member

shakuzen commented Feb 7, 2019

This should be fixed by #1166. Can someone try the 1.1.3 snapshots and confirm?

@larsduelfer
Copy link
Contributor

@shakuzen: Since I provided the mentioned pull request because of this ticket (which I also faced, as you can see in the discussion further up), I can confirm that it works. If you need somebody else, maybe @fkoehler can confirm this as well.

@shakuzen
Member

Thanks @larsduelfer for the fix and the confirmation. I'll close this then. We're planning on releasing 1.1.3 this week with the fix. If anyone finds any additional issue, let us know.
