Register Kafka "records-lag" metric per topic-partition #878
Comments
I'm no Kafka expert, so I could really use some help in the form of additional tests that demonstrate what you're talking about. On (1), when I test it with On (2), again, I only see client-id metadata in JMX, and nothing about partition. What am I doing wrong?
@jkschneider I don't see per-partition consumer metrics exposed by Kafka via the MBeans. Please refer to https://docs.confluent.io/current/kafka/monitoring.html
@wardhapk I think we are saying the same thing. I also don't see per-partition consumer metrics, and Micrometer isn't attempting to tag on partition.
@jkschneider Just figured out that newer versions of kafka-clients expose per-partition metrics.
Ah, so you are saying the partition is embedded in the MBean name. Now I think I understand.
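For anyone who wants to verify this locally, here is a minimal sketch (plain JMX, not Micrometer code) that lists the consumer fetch-manager MBeans and their key properties so the topic and partition show up in the ObjectName. It assumes the default kafka.consumer domain used by the client's JMX reporter and must run inside a JVM that hosts an active KafkaConsumer.

```java
import java.lang.management.ManagementFactory;

import javax.management.MBeanServer;
import javax.management.ObjectName;

public class ListFetchManagerMBeans {

    public static void main(String[] args) throws Exception {
        MBeanServer server = ManagementFactory.getPlatformMBeanServer();
        // Matches e.g. kafka.consumer:type=consumer-fetch-manager-metrics,client-id=c1,topic=t,partition=0
        ObjectName pattern = new ObjectName("kafka.consumer:type=consumer-fetch-manager-metrics,*");
        for (ObjectName name : server.queryNames(pattern, null)) {
            // Key properties include client-id and, on newer clients, topic and partition
            System.out.println(name.getKeyPropertyList());
        }
    }
}
```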
If possible, it would also be great to have the consumer group as an additional tag on the metrics.
I took a stab at fixing the cause of some Kafka metrics reporting NaN in e9b3efe, but I'm still pretty confused about exactly what is coming out of Kafka. I get MBean notifications for partition-level metrics, but they always seem to be NaN and I don't see them in jvisualvm. Anybody else willing to take a shot at this?
I can give it a shot.
Hi @jkschneider, we updated to Micrometer 1.1.1. It is better than before but still not correct. We now get metrics per consumer, but since a consumer can have multiple topics/partitions assigned, the figures are still wrong: only the first assigned topic/partition of a consumer is taken into account. Example: a consumer works on two topic-partitions, the lag on the first partition is 10 and the lag on the second partition is 20; the reported value for that consumer is 10, not 30. Actually, the values should not be aggregated at all but reported independently. To be able to do this, the topic name and partition number must be added as additional labels. Then the reporting works as expected. Regards, Lars
@jkschneider As you can see, the consumer-fetch-manager-metrics group has metrics per partition of a topic when the consumer is assigned to more than one partition. These should be reported by Micrometer.
One more comment: the metric "records-lag", which represents the latest lag of a consumer for a certain topic and partition, is the relevant one. "records-lag-max" can also be calculated downstream by the corresponding monitoring solution (like Prometheus), although it does not hurt if it is exposed as well.
…ith topic and partition name I verified that this works as intended by creating a lag locally. I received the correct metrics via JMX and Micrometer. Caveat: I was not able to formulate a test for this. I also tried running the current tests for the KafkaConsumer and I believe they are not really testing anything; most of the tests have no assertions and no real Kafka is being used when they run. But it might be that I am missing some important parts.
The consumer group is not exposed via the Kafka MBean metrics. One option might be to add the consumer group to your client id.
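As a rough illustration of that workaround (the group and client names here are made up), the group can simply be embedded in client.id through the standard consumer config, so it becomes recoverable from the client-id tag:

```java
import java.util.Properties;

import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;

public class GroupInClientId {

    // Builds a consumer whose client.id carries the consumer group name.
    public static KafkaConsumer<String, String> buildConsumer() {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "order-processing");
        // client.id = "<group>-<instance>", purely a naming convention
        props.put(ConsumerConfig.CLIENT_ID_CONFIG, "order-processing-consumer-1");
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        return new KafkaConsumer<>(props);
    }
}
```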
As mentioned, this is addressed in #1136. However, I left a comment there since we still see the same error as before and could not get this working even with 1.1.2, which should fix this issue.
See #1136: We will provide a pull request to fix this.
… partition (micrometer-metrics#878) The Kafka consumer fetch-manager metrics have been registered too many times, and therefore also for the wrong MBeans. Hence, when using Prometheus as a registry, for example, only the metrics registered for the wrong MBean were shown, with NaN as values. With this fix the metrics are registered for the right MBeans, and in case they are provided at different levels, such as per consumer and topic as well as per consumer, topic and partition, they are only registered once, at the lowest level of detail. Aggregations can be done with the monitoring tools.
This should be fixed by #1166. Can someone try the 1.1.3 snapshots and confirm?
Thanks @larsduelfer for the fix and the confirmation. I'll close this then. We're planning on releasing 1.1.3 this week with the fix. If anyone finds any additional issues, let us know.
Affects: master version of micrometer-core (SNAPSHOT-1.1.0-20180921.212031-73).
This is in a way related to #877.
I've been trying to get the "records-lag" metric (or "records-lag-max" would have been fine too) per topic-partition to show up in /actuator/prometheus/.
I'm not sure whether the current KafkaConsumerMetrics is expected to register it; KafkaConsumerMetrics.java#L236 at least hints that it is interested in per-topic metrics, but currently it does not register them, and I think this is a very valuable metric for applications consuming more than one topic.
After digging around in the code, I believe there are some issues with how the current implementation of KafkaConsumerMetrics registers the metrics.
At KafkaConsumerMetrics.java#L201 the JMX NotificationListener is called first with only client-id (many times actually, but maybe that's a kafka-clients issue), then with [client-id, topic], then with [client-id, topic, partition], and each time it goes through the whole list at KafkaConsumerMetrics.java#L76.
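To make that registration sequence visible, here is a minimal sketch (plain JMX, independent of Micrometer's internals) that logs each consumer-fetch-manager-metrics MBean as it is registered, so you can watch the same metric group appear first with only client-id and later with topic and partition key properties:

```java
import java.lang.management.ManagementFactory;

import javax.management.MBeanServer;
import javax.management.MBeanServerDelegate;
import javax.management.MBeanServerNotification;
import javax.management.NotificationListener;
import javax.management.ObjectName;

public class FetchManagerRegistrationLogger {

    public static void install() throws Exception {
        MBeanServer server = ManagementFactory.getPlatformMBeanServer();
        NotificationListener listener = (notification, handback) -> {
            if (!(notification instanceof MBeanServerNotification)) {
                return;
            }
            MBeanServerNotification mbs = (MBeanServerNotification) notification;
            if (!MBeanServerNotification.REGISTRATION_NOTIFICATION.equals(mbs.getType())) {
                return;
            }
            ObjectName name = mbs.getMBeanName();
            if ("consumer-fetch-manager-metrics".equals(name.getKeyProperty("type"))) {
                // Prints the tag combination this particular registration carries
                System.out.println("registered: " + name.getKeyPropertyList());
            }
        };
        // The MBean server delegate emits registration/unregistration notifications
        server.addNotificationListener(MBeanServerDelegate.DELEGATE_NAME, listener, null, null);
    }
}
```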
For the list of metric-tags combinations actually coming from Kafka via JMX, see:
kafka FetcherMetricsRegistry.java#L66
kafka Fetcher.java#L1326
nameTag(), KafkaConsumerMetrics.java#L240
The way to address this, I believe, is to have registerMetricsEventually take three callbacks, one for each combination of tags coming from Kafka. Each callback would then only register the metrics defined for that tag combination; a rough sketch follows.
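The dispatch could look something like the sketch below (a hypothetical helper, not actual Micrometer code); it decides which callback to run based on the key properties present in the registered MBean name:

```java
import java.util.function.Consumer;

import javax.management.ObjectName;

public class TagLevelDispatcher {

    public static void dispatch(ObjectName name,
                                Consumer<ObjectName> clientIdOnly,
                                Consumer<ObjectName> clientIdTopic,
                                Consumer<ObjectName> clientIdTopicPartition) {
        boolean hasTopic = name.getKeyProperty("topic") != null;
        boolean hasPartition = name.getKeyProperty("partition") != null;
        if (hasTopic && hasPartition) {
            // e.g. records-lag tagged with client-id, topic and partition
            clientIdTopicPartition.accept(name);
        } else if (hasTopic) {
            // e.g. per-topic fetch metrics tagged with client-id and topic
            clientIdTopic.accept(name);
        } else {
            // e.g. fetch-latency metrics tagged with client-id only
            clientIdOnly.accept(name);
        }
    }
}
```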
If KafkaConsumerMetrics wants to support "records-lag-max" on 2 levels, then #877 needs to be solved too.
Another approach is to add it (or "records-lag") at the topic-partition level only, not at the client-id level, and then aggregate in Prometheus if desired. Personally I've implemented a MeterBinder to do just this.
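A minimal sketch along those lines (illustrative names, not the exact implementation; it assumes a kafka-clients version that exposes records-lag with topic and partition tags, and it reads the values straight from Consumer#metrics()):

```java
import java.util.Map;

import io.micrometer.core.instrument.Gauge;
import io.micrometer.core.instrument.MeterRegistry;
import io.micrometer.core.instrument.binder.MeterBinder;
import org.apache.kafka.clients.consumer.Consumer;
import org.apache.kafka.common.Metric;
import org.apache.kafka.common.MetricName;

public class ConsumerLagMetrics implements MeterBinder {

    private final Consumer<?, ?> consumer;

    public ConsumerLagMetrics(Consumer<?, ?> consumer) {
        this.consumer = consumer;
    }

    @Override
    public void bindTo(MeterRegistry registry) {
        for (Map.Entry<MetricName, ? extends Metric> entry : consumer.metrics().entrySet()) {
            MetricName name = entry.getKey();
            Map<String, String> tags = name.tags();
            // Only the per topic-partition variant of records-lag carries both tags
            if ("records-lag".equals(name.name()) && tags.containsKey("topic") && tags.containsKey("partition")) {
                Metric metric = entry.getValue();
                Gauge.builder("kafka.consumer.records.lag", metric, m -> ((Number) m.metricValue()).doubleValue())
                        .tag("client.id", tags.get("client-id"))
                        .tag("topic", tags.get("topic"))
                        .tag("partition", tags.get("partition"))
                        .register(registry);
            }
        }
    }
}
```

One caveat: gauges are only created for the topic-partitions assigned when bindTo runs, so partitions assigned later would need a re-bind or a more dynamic variant.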