MQTT connection failures with Mosquitto #7648
Comments
I'm not sure what the issue is here, but I found this issue that seems like it might be related: eclipse-paho/paho.mqtt.golang#263 (comment). The person who opened that issue recreated the Docker network and the error was no longer reproducible. Perhaps you could collect a packet capture of both Telegraf and mosquitto_sub:
|
I’ve also tried VerneMQ as the MQTT broker tonight and see the same issue. In terms of networks, the network is destroyed and recreated as part of the docker-compose down/up commands, so that doesn’t seem to be the cause. I’m thinking it may have to do with whether Mosquitto is fully available when Docker starts Telegraf. I’ve set up the dependency, but it may not allow sufficient time. That could explain why the external Things Network MQTT servers don’t show an issue. For the network log, which container should I run the capture in, or should I try to connect to the IoT-net externally? |
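One way to rule out the startup-ordering theory above is to block until the broker's TCP port accepts connections before Telegraf is launched. The following is only a minimal sketch: `wait_for_port` is a hypothetical helper, not part of Telegraf, Mosquitto, or Docker, and the host name and port in the usage note are assumptions about a typical compose setup.

```python
import socket
import time


def wait_for_port(host: str, port: int, timeout: float = 30.0, interval: float = 0.5) -> bool:
    """Return True once a TCP connection to host:port succeeds, False on timeout.

    Hypothetical helper for illustration only.
    """
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            # A successful connect only proves that something is listening;
            # it does not prove that MQTT authentication will succeed.
            with socket.create_connection((host, port), timeout=interval):
                return True
        except OSError:
            # Connection refused, unreachable, or timed out: retry after a pause.
            time.sleep(interval)
    return False
```

For example, `wait_for_port("mosquitto", 1883)` could be called from an entrypoint script before starting Telegraf, assuming the broker's service name is `mosquitto` and it listens on the default MQTT port. If the disconnects still occur after a successful wait, startup ordering is probably not the cause.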
Try to collect the packet capture from the same container that runs Telegraf. |
Got the capture, and the ping requests and responses look OK. I've not included the full file here as there are some passwords etc., but the screenshot should be enough - let me know if you want anything else pulled. Interestingly, it seems that after the ping response, one more message (an MQTT publish) is sent before the reconnect happens. I'm running a longer capture to try to prove this. |
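When reading a capture like this without a dissector, the MQTT packet type can be recovered from the first byte of each control packet: in MQTT 3.1.1 the upper four bits of the fixed header hold the packet type. The sketch below shows that decoding; the function and table names are made up for illustration, and it assumes you have already extracted the first byte of each MQTT packet from the TCP payload.

```python
# MQTT 3.1.1 control packet types (upper 4 bits of the fixed header's first byte).
MQTT_PACKET_TYPES = {
    1: "CONNECT", 2: "CONNACK", 3: "PUBLISH", 4: "PUBACK",
    5: "PUBREC", 6: "PUBREL", 7: "PUBCOMP", 8: "SUBSCRIBE",
    9: "SUBACK", 10: "UNSUBSCRIBE", 11: "UNSUBACK",
    12: "PINGREQ", 13: "PINGRESP", 14: "DISCONNECT",
}


def packet_type(first_byte: int) -> str:
    """Name the MQTT 3.1.1 packet type encoded in a fixed header's first byte."""
    # The lower 4 bits are flags (e.g. DUP/QoS/RETAIN for PUBLISH) and are ignored here.
    return MQTT_PACKET_TYPES.get((first_byte & 0xFF) >> 4, "RESERVED")


def classify(first_bytes):
    """Map a sequence of first bytes to packet type names, preserving order."""
    return [packet_type(b) for b in first_bytes]
```

A PINGREQ is the two bytes `0xC0 0x00` and a PINGRESP is `0xD0 0x00`, so `classify([0xC0, 0xD0, 0x30])` yields the ping request, ping response, then publish ordering described above.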
I can confirm that there is a Publish Message after every Ping Response before the reconnect happens. I suspect Telegraf doesn't act on the Ping Response and drops its connection; Mosquitto then sees the dropped connection when it tries to publish, which generates the socket error. For reference, I built tcpdump for ARM (this is on a Raspberry Pi 4) using the following Dockerfile:
You can then connect to the container and dump a capture file with:
Don't connect with --net=host, or you'll see lots of spurious TCP transmissions because every communication is logged twice! |
I've now also removed the other MQTT input plugins to see if there was any conflict. The same problem is visible even when this is the only MQTT input. |
Dropped into the container and ran telegraf --test-wait 120 to see if data was being pulled ok. It seems it connects fine without pingresp problems and I get
|
Am I understanding correctly that --test-wait works but normal operation does not? |
@danielnelson Seems to be the case although I'd caveat that I've not tested it robustly
I did look at whether it could be ticker related - e.g. wait for the pingresp but the connection isn't made for 20s to confirm. I changed the timing down to a few seconds for that plugin, but it made no difference. Any thoughts? |
@danielnelson Can absolutely confirm that all is OK with --test-wait but not in normal operation. I added "command: --test-wait 600" to the docker-compose file to ensure only one copy of Telegraf was running and that it ran for sufficient time. No errors on the Mosquitto side, and I could see Telegraf happily receiving data from all MQTT servers. |
With
|
@danielnelson I don't see that message with --test-wait. Logs of all containers are below. For reference, the disconnect on the Mosquitto device beer fridge was intentional, to see if it recovered.
|
I'm not sure then. Try to reduce the docker-compose file to the minimal case with just Mosquitto and Telegraf, then add the compose file here and I'll try to duplicate. |
@danielnelson OK, I've got it down to Mosquitto and Telegraf only. Note this is running on a Raspberry Pi 4 with 4GB RAM and the latest Raspbian Buster. Telegraf is set to only receive from Mosquitto and output to a file, to avoid the need for InfluxDB. Interestingly, the problem occurs slightly less often - I'm wondering if it's a timeout while the processor is shared, for instance? (Note none of the containers are anywhere near CPU or RAM limits.) Log File:
Docker File
Telegraf.conf (note the Mosquitto username and password must be set - replaced with ****** - and remember to create directories in dockerfile)
Mosquitto.conf (you'll need to create a mosquitto.passwd configuration file to match telegraf config and mqtt device. Remember to create directories in dockerfile)
Sample IoT Device Data
|
To follow up a little more, I rebuilt the setup on a different host, this time using Docker running on a Synology NAS. No problems seen connecting to MQTT at first - it maintained the server connection just fine. I then reprogrammed my thermometer sensor to use the new MQTT host and started sending data. I immediately got pingresp problems as before. So it seems that the error relates somehow to data being sent or consumed - probably related to the MQTT data from the sensor being sent after the pingresp we saw in the network logs. I’m wondering - could the sensor be publishing too quickly, so that Telegraf sees the data before it thinks it’s got a pingresp, or something similar? |
@danielnelson Did you see the response on eclipse-paho/paho.mqtt.golang#430 ? I'm getting some data through, but I suspect something is blocking while Telegraf processes that data, causing the pingresp error. It's probably why I see it less often when sending to a file. What are your thoughts on how to configure the flush intervals and QoS settings for best performance to avoid this? |
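To make the "something is blocking" suspicion concrete, here is a toy model of a client that handles inbound packets one at a time, so that a slow publish handler delays the moment a PINGRESP is noticed. This is purely an illustrative assumption: it is not how Telegraf or paho.mqtt.golang is actually implemented, and all names and numbers in it are invented.

```python
from typing import List, Optional, Tuple

# Each packet is (arrival_time_s, kind, handling_cost_s). Toy data model only.
Packet = Tuple[float, str, float]


def pingresp_seen_at(packets: List[Packet]) -> Optional[float]:
    """Time at which a single-threaded consumer gets around to the first PINGRESP."""
    clock = 0.0
    for arrival, kind, cost in sorted(packets, key=lambda p: p[0]):
        # A packet cannot be handled before it arrives or before the
        # previous handler has finished.
        start = max(clock, arrival)
        if kind == "PINGRESP":
            return start
        clock = start + cost
    return None


def ping_times_out(packets: List[Packet], ping_sent_at: float, timeout: float) -> bool:
    """True if the PINGRESP is missing or noticed later than `timeout` after the ping."""
    seen = pingresp_seen_at(packets)
    return seen is None or (seen - ping_sent_at) > timeout
```

In this model a publish that takes 10 s to handle, followed by a PINGRESP that arrives on the wire after 1 s, still trips a 5 s ping timeout, even though the broker answered promptly. That would be consistent with a capture showing healthy ping responses while the client reports a pingresp error, but it remains a hypothesis about the cause.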
Try increasing
|
Same problem here with VerneMQ and HAProxy in cluster mode:
|
Note that I have just released v1.3.0 of |
So I have also been seeing an issue when using Telegraf with The Things Network broker. I suspect it is related: `Jan 09 11:42:05 adelvps telegraf[2490]: 2021-01-09T00:42:05Z E! [inputs.mqtt_consumer] Error in plugin: connection lost: EOF` BTW I have two MQTT consumers connected to the same server. The problem occurs even if I only have one. |
I'm seeing issues with communication between Telegraf and Mosquitto in a Docker setup where the connection fails and does not recover. This may be related to issue #4594, as the behaviour is similar, although I believe that was resolved.
Relevant telegraf.conf:
System info:
Telegraf 1.14.3 with Mosquitto 1.6.10
Docker
Steps to reproduce:
Mosquitto set up with basic password authentication. Connected OK for the first 10 hours or so, then failed. No config changes at the time. Fails on restart even when cleaned with docker-compose down.
I'm pulling data from the local Mosquitto broker.
The problem seems to be related to pingresp - I see an error on both sides.
Expected behavior:
Expect to see Telegraf pulling the MQTT data into InfluxDB. Sensor data arrives approximately every 15 seconds.
Actual behavior:
No data transferred to InfluxDB. Repeated reconnects.
Additional info:
Can subscribe to the topic from a desktop without an issue:
C:\Program Files\mosquitto>mosquitto_sub -h grafana.local -t home/devices/+/+/up -u ***** -P ***** -v
home/devices/garage/beerfridge/up {"temperature_fridge_air_c":4.4375}
home/devices/garage/beerfridge/up {"temperature_fridge_air_c":4.375}
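For anyone checking that the subscription filter itself is not the problem, the sketch below shows how the `+` (single level) and `#` (multi level) wildcards match topics under the MQTT 3.1.1 rules. `topic_matches` is a hypothetical helper written for this illustration, not a function from Telegraf or Mosquitto, and it leaves out the special handling of topics that start with `$`.

```python
def topic_matches(topic_filter: str, topic: str) -> bool:
    """Return True if `topic` matches the MQTT `topic_filter` (with + and # wildcards)."""
    filter_levels = topic_filter.split("/")
    topic_levels = topic.split("/")
    for i, level in enumerate(filter_levels):
        if level == "#":
            # Multi-level wildcard: matches the parent and any number of child levels.
            return True
        if i >= len(topic_levels):
            # The filter is longer than the topic.
            return False
        if level != "+" and level != topic_levels[i]:
            # A literal level must match exactly; "+" accepts any single level.
            return False
    # Without "#", filter and topic must have the same number of levels.
    return len(filter_levels) == len(topic_levels)
```

With this, `topic_matches("home/devices/+/+/up", "home/devices/garage/beerfridge/up")` is true, which agrees with the messages received by mosquitto_sub above.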
I'm also OK pulling MQTT data from two other remote sources (The Things Network).
I've tried
Mosquitto log:
Telegraf log: