mqtt_consumer stops gathering metrics #921
Comments
From this output it appears that nothing is being gathered from your mqtt_consumer input, so there is nothing to write to cloudwatch |
You can re-open this as a different issue if the problem is that the mqtt_consumer is not gathering metrics. Do you expect it to be gathering metrics? |
if you think it's an issue with the mqtt_consumer, can you provide the commit of the paho mqtt client that you're using? If it doesn't match the commit in Godeps I would suggest rebuilding and using gdm. |
the commit is matching |
okay, can you provide more details? How long does it take for this to happen? do you have any relevant mosquitto logs? the mqtt_consumer section of the config file would help as well. |
Let me see if I can do a tcpdump, now that I know it's an MQTT problem. |
If I stop and restart the telegraf process, it works again. |
tshark shows no PUBLISH coming from mosquitto, so maybe the problem is mosquitto, or the auto-reconnect feature of the paho MQTT client, which doesn't resubscribe when reconnecting on non-persistent sessions. |
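To make the hypothesis above concrete, here is a minimal sketch of how a broker treats subscriptions across a reconnect. It is a toy model, not mosquitto or paho code: the `broker` type, its methods, and the topic `sensors/#` are all hypothetical, and wildcard matching is not modelled (the topic is just a map key).

```go
package main

import "fmt"

// broker is a toy stand-in for an MQTT broker's session store.
// Hypothetical type for illustration only; not mosquitto or paho code.
type broker struct {
	subs map[string]map[string]bool // client ID -> set of subscribed topics
}

func newBroker() *broker {
	return &broker{subs: make(map[string]map[string]bool)}
}

// connect models CONNECT: with cleanSession set, any subscriptions
// stored for the client are discarded; otherwise they are kept.
func (b *broker) connect(clientID string, cleanSession bool) {
	if cleanSession || b.subs[clientID] == nil {
		b.subs[clientID] = make(map[string]bool)
	}
}

// subscribe models SUBSCRIBE for a client that is already connected.
func (b *broker) subscribe(clientID, topic string) {
	b.subs[clientID][topic] = true
}

// delivers reports whether a publish on topic would reach the client.
func (b *broker) delivers(clientID, topic string) bool {
	return b.subs[clientID][topic]
}

// survivesReconnect subscribes once, reconnects without sending a new
// SUBSCRIBE, and reports whether the subscription is still in place.
func survivesReconnect(cleanSession bool) bool {
	b := newBroker()
	b.connect("telegraf", cleanSession)
	b.subscribe("telegraf", "sensors/#")
	b.connect("telegraf", cleanSession) // automatic reconnect, no resubscribe
	return b.delivers("telegraf", "sensors/#")
}

func main() {
	fmt.Println("clean session:", survivesReconnect(true))
	fmt.Println("persistent session:", survivesReconnect(false))
}
```

In this model only the persistent session still receives publishes after the reconnect, which would match a client that reconnects but never sends a new SUBSCRIBE.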
This has not been consistently reliable. This is likely a bug with the paho MQTT client library. It was supposed to be fixed, but will need to be revisited in the future. fixes #921
So with persistent session, no problem during the last 15 hours. |
@jvermillard could you provide your OS and config file? I'd like to try to reproduce so I can file a bug on the paho mqtt client project |
I'd also like to reproduce so I can test if this library does better: https://github.com/yosssi/gmq it certainly has better go style and unit test coverage |
I'm using 1.4.8-0mosquitto1 from debian jessie in a docker container (docker 1.8.3) running on CoreOS 835.13.0 |
Also I use the default mosquitto config |
could you provide the telegraf config? I haven't been able to reproduce so far |
|
but you don't see the problem with |
oups sorry... yes with |
so if you want to test the bug use this config without persistent_session |
Oh, got it, I was able to reproduce within 3 gathers this time. The reason is that your collection interval is larger than the default MQTT KeepAlive timer (60s). This is not supposed to be a problem at all, but the client library doesn't appear to be sending the proper keepalive heartbeat messages and is then disconnecting. I'm not sure of the best solution; we could just set the keepalive to a large number that users would likely never hit (2 hours?). |
I've also just realized that they've moved the mqtt library off gerrit onto https://github.com/eclipse/paho.mqtt.golang, and there are some new changes and fixes there. I'll test with that updated library and see if it fixes the problem. |
@jvermillard in my testing I was getting the disconnect errors within 2-10 minutes using master. After updating to the updated github paho client I wasn't able to reproduce at all, and had one instance running for more than 6 hours without disconnecting. I've pushed this change into master (see #941). Please feel free to re-open if you still see the issue. |
I pulled the repo today and I got the issue again... paho was pulled from Restoring /root/work/src/github.com/eclipse/paho.mqtt.golang to 4ab3e867810d1ec5f35157c59e965054dbf43a0d |
@chaton78 can you provide any relevant mqtt or telegraf logs? and your telegraf config? |
Sorry for the delay. I've redone my test with 0.12-1 and still have the same issue. Please note that I have a bridged mosquitto broker connecting to the same rabbitmq server, and it shows no errors at the same time. This is what I have in journalctl (CentOS 7):
Last write was at 8:13:10... nothing after... In rabbitmq, at the same time and after:
|
My config for telegraf:
|
@chaton78 can you provide more details? What is your telegraf agent config? How long does it take for this problem to occur? |
@sparrc I think I have the same issue. I need to restart Telegraf on Fedora22/ARM every couple of days. I suspect it is related to network connection drops: when one causes the MQTT client's connection to the broker to drop as well, Telegraf can't recover it. As far as I understand the MQTT specs, it is the client's responsibility to recover the connection to the broker. |
It can take a few hours to a full day....
Running on Linux 3.10.0-327.13.1.el7.x86_64. I am using the public IP of my interface. Let me switch to the loopback. It is my ideal config but we will see.... |
thanks all for the details, https://github.com/eclipse/paho.mqtt.golang needs to be patched. As far as I can tell, the client reconnect isn't working properly. I'll need to figure out a way to reproduce more reliably and then can push up a fix for that. |
It actually looks like this might be fixed as of a few days ago: eclipse-paho/paho.mqtt.golang#32, I'll update the telegraf Godeps file to grab this fix. If anyone is able to build from source and test that would be much appreciated :) |
didn't mean to close this yet |
I built it this morning. It is running. |
So far, so good... |
It happened again, but from what I can see in the rabbitmq log, the client reconnected just after. Is it possible that the reconnection works but the subscription is lost? |
I did a fix last night and am testing it. Any idea why make test_short doesn't work on Go 1.6? Something about calling internal packages. I'm not very familiar with Go. |
@chaton78 do you have your telegraf directory in |
@sparrc I am running from my fork github/chaton78/telegraf... I have also influxdata/telegraf |
Building in your fork directly won't quite work correctly on Go 1.6. This is because the import paths still point to influxdata/telegraf, which your chaton78/telegraf package sees as an external package whose internal packages it is not allowed to import. The best way to solve this is to rm -rf chaton78/telegraf and build from influxdata/telegraf. You can then go into the influxdata/telegraf directory and change your git origin remote (or set up a different remote, such as |
Understood. Good idea.. Thank you! |
Hello.
A Sonoff with Tasmota firmware pushes MQTT metrics every 60 seconds, and I can view them with
After telegraf stops pushing MQTT metrics, it continues to push local system metrics normally. In this situation, if I try to stop telegraf, it takes a long time (2-3 min).
|
@Dees7 I wonder if your issue is because your |
I tried (client publish 30, interval 30, flush 60). How can I view debug output like in the first message?
|
Do you mind opening a new issue for this? Thanks. |
Using master (commit f543dbb).
Debug output:
After some time, nothing is pushed to CloudWatch.