
Cannot see metric "kafka.consumer_lag" for consumers using "offsets.storage=kafka" #2611

Closed
sbrnunes opened this issue Jun 20, 2016 · 15 comments


@sbrnunes

Hi

We're trying to set up the integration between Datadog and Kafka and report metrics for a few consumers that commit their offsets into Kafka (we use "offsets.storage=kafka").

We're able to see metrics for consumers using "offsets.storage=zookeeper", but not for the ones committing their offsets into Kafka.

We're particularly interested in knowing the consumer lag, which, as far as I know, is reported as "kafka.consumer_lag".

In the logs we can only see the following warning:

2016-06-20 17:03:48 UTC | WARNING | dd.collector | checks.kafka_consumer(kafka_consumer.py:58) | No zookeeper node at /kafka/consumers/<consumer_group>/offsets/<topic>/<partition>

Any idea what we might be doing wrong?

@degemer
Member

degemer commented Jun 23, 2016

Hi @sbrnunes !

It's probably a configuration issue in kafka_consumer.yaml. We have a blog post that explains in detail how to configure our Kafka check: https://www.datadoghq.com/blog/monitor-kafka-with-datadog/.
Would you mind checking it out to see if it helps you solve your issue?

@degemer degemer added this to the Triage milestone Jun 24, 2016
@jalaziz
Contributor

jalaziz commented Jun 28, 2016

Actually, I believe this is an issue with the check. Unless something changed recently, the agent only checks consumer offsets from ZK and not Kafka. We have the same issue. I've been meaning to fix the check to support Kafka-based offsets, but haven't gotten around to it yet.

@degemer
Member

degemer commented Jun 28, 2016

My bad, you're right @jalaziz. Sorry @sbrnunes, I misunderstood the issue. Our kafka_consumer check was written back when consumer offsets were only available in Zookeeper, and we never updated it to support offsets stored in Kafka.
I added it to our backlog, but feel free to open a PR!

@sbrnunes
Author

Oh, that makes sense now @degemer. I was getting confused because I couldn't see any code in the codebase pulling the offsets from Kafka.

@seeingidog

Using the combination of Burrow to monitor offsets and this plugin is working for me.

@jeffwidman
Contributor

jeffwidman commented Sep 26, 2016

The easiest solution would be for upstream kafka-python to expose an API for fetching consumer offsets, as tracked in dpkp/kafka-python#819, which this check could then hook into.

See also dpkp/kafka-python#421 and dpkp/kafka-python#509

@alexef

alexef commented Oct 31, 2016

Also having this problem.

@jeffwidman
Contributor

I did a bunch of research into this as part of #2880. The concise summary is that it'll be much simpler to wait until KIP-88 lands before working on this.

The basic problem is that Datadog's consumer lag check tries to grab all consumer offsets from a single place, whereas in the Java Kafka consumer (and most other consumer implementations) the consumer itself knows its own offset and can report it somewhere as part of its poll() loop. Unfortunately, from a $ perspective that approach can be a lot more expensive, because you now have to instrument every server that runs consumers (and Datadog charges per server) instead of just running this check on one of your existing Kafka brokers.
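For anyone unfamiliar with what the check actually computes: per-partition consumer lag is just the broker's latest (high-watermark) offset minus the group's last committed offset. A minimal sketch of that arithmetic (the names and data shapes here are illustrative, not the check's real API):

```python
def consumer_lag(highwater_offsets, committed_offsets):
    """Return per-partition lag for a consumer group.

    Both arguments map (topic, partition) -> offset. Partitions the
    group has never committed for are skipped, since there is no
    committed offset to subtract from the high watermark.
    """
    lag = {}
    for tp, highwater in highwater_offsets.items():
        committed = committed_offsets.get(tp)
        if committed is None:
            continue  # no commit recorded for this partition
        # Clamp at 0 in case a stale high watermark races a fresh commit.
        lag[tp] = max(highwater - committed, 0)
    return lag

highwater = {("orders", 0): 120, ("orders", 1): 80}
committed = {("orders", 0): 100, ("orders", 1): 80}
print(consumer_lag(highwater, committed))
# {('orders', 0): 20, ('orders', 1): 0}
```

The hard part the thread is discussing isn't this subtraction; it's where the committed offsets come from (Zookeeper vs the Kafka-internal offsets topic).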

As KIP-88 explains, there is a workaround that involves creating a dummy consumer, joining the consumer group, then calling committed() (which calls the Offset API) on every partition that the consumer group watches. It's a bit of a painful hack. So in KIP-88, they're talking about extending the Offset API so that you can send it a consumer group without a list of partitions and it will return offsets for all partitions tracked by that group. Unfortunately, this protocol change won't land until 0.10.2.0 at the earliest.

Additionally, no matter how this is implemented, the upstream Python library will need to support the call: either kafka-python gets patched to handle the admin call that fetches all offsets for a consumer group, or the check switches to confluent-kafka-python, since Magnus will most likely support it very quickly.

@jeffwidman
Contributor

I just submitted a PR adding support for this:
DataDog/integrations-core#423

Please try it out and let me know if you hit any issues. We run it at my day job against 6 production Kafka clusters.

Note that the source is still littered with TODOs, as I still need to flesh out some of the error handling and support for time-based offsets.

@masci
Contributor

masci commented Jun 16, 2017

Related issue in integrations-core repo: DataDog/integrations-core#457

@StoneCypher

Bumping for attention

@yangou

yangou commented Aug 8, 2019

2019 still not fixed

@jgerman

jgerman commented Nov 5, 2019

This still doesn't work?

We have consumer groups reporting lag etc via the kafka cli, but the DD agent doesn't seem to pick anything up...

    kafka_consumer (2.1.0)
    ----------------------
      Instance ID: kafka_consumer:1316b518dd19351b [OK]
      Configuration Source: file:/etc/datadog-agent/conf.d/kafka_consumer.d/kafka_consumer.yml
      Total Runs: 15
      Metric Samples: Last Run: 0, Total: 0
      Events: Last Run: 0, Total: 0
      Service Checks: Last Run: 0, Total: 0
      Average Execution Time : 13ms

@jeffwidman
Contributor

This ticket should be closed. cc @ofek

The fix in #423 / #654 was merged two years ago.

Additionally, I added support in DataDog/integrations-core#3957 for monitoring unlisted consumer groups.

So you'll want to make sure this check is upgraded to the latest version, then read the updated config file comments to make sure you have the options set properly.
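For reference, a minimal kafka_consumer.yaml for the updated check might look something like this. This is a sketch: the exact option names and defaults depend on your integrations-core version, so check the conf.yaml.example bundled with your agent.

```yaml
init_config:

instances:
  - kafka_connect_str: localhost:9092
    # Either list the groups/topics/partitions to monitor explicitly...
    consumer_groups:
      my_consumer_group:
        my_topic: [0, 1, 2]
    # ...or, on newer versions of the check, monitor every group the
    # brokers know about instead of listing them.
    monitor_unlisted_consumer_groups: true
```

If your consumers use "offsets.storage=kafka", the key point is that the check must be a version that reads offsets from the brokers rather than from Zookeeper.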

If you're still having issues, probably best to open a new ticket rather than re-use this one.

@ofek
Contributor

ofek commented Nov 6, 2019

Yes, anyone that is still experiencing issues should contact support.

Thanks!

@ofek ofek closed this as completed Nov 6, 2019