Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

MQTT sometimes does not notice broker shutdown #1759

Open
mcspr opened this issue Jun 5, 2019 · 1 comment
Open

MQTT sometimes does not notice broker shutdown #1759

mcspr opened this issue Jun 5, 2019 · 1 comment

Comments

@mcspr
Copy link
Collaborator

mcspr commented Jun 5, 2019

Bug description

When MQTT broker server is physically disconnected from the network (for example, network cable is removed), device still thinks that connection is OK. Neither onDisconnect or onTimeout firing for MQTT module.
mqtt.info shows state as connected, mqtt.reset finally resets it

While this seems like a bug in async mqtt client / asynctcp, still recording it here as it may be important "feature" to know. And it may be fixed by manual keep-alive, tracking last successful mqttSend() call

Steps to reproduce

  1. mqttQoS=1, mqttKeep=300, hbInterval 5 reset, wait until mqtt is connected on the device
  2. remove ethernet cable from the server
  3. debug logs still show that mqtt client is sending data, incrementing qos. after some time qos becomes 0 (ref: Stubborn HASS MQTT Discovery #1370), meaning that network layer does not accept any more data.

Additional context

Using async-mqtt-client master (or v0.8.1, same thing), Core 2.3.0 & 2.5.2 lwip2
Network is Linux server, Router that is directly wired to the server and ESP connected to the Router's AP. Both on the same LAN network bridged together.

I waited for about 10 min and reconnected the cable, after which MQTT reconnection finally happened.

tcpdump on the router shows that device is still sending data, but it does look like it just tries to retransmit not acked data:

08:18:33.534970 IP ESPURNA-35A259.lan.51606 > orion.lan.1883: Flags [P.], seq 1749:1785, ack 181, win 1924, length 36
08:18:36.559839 IP ESPURNA-35A259.lan.51606 > orion.lan.1883: Flags [P.], seq 1749:1785, ack 181, win 1924, length 36
08:18:42.584820 IP ESPURNA-35A259.lan.51606 > orion.lan.1883: Flags [P.], seq 1749:1785, ack 181, win 1924, length 36
08:18:54.784220 IP ESPURNA-35A259.lan.51606 > orion.lan.1883: Flags [P.], seq 1749:1785, ack 181, win 1924, length 36
08:19:19.333148 IP ESPURNA-35A259.lan.51606 > orion.lan.1883: Flags [P.], seq 1749:1785, ack 181, win 1924, length 36
08:20:08.005749 IP ESPURNA-35A259.lan.51606 > orion.lan.1883: Flags [P.], seq 1749:1785, ack 181, win 1924, length 36
...
08:21:45.550983 ARP, Request who-has orion.lan tell ESPURNA-35A259.lan, length 28

Small patch to the dev tree enables lwip debugging and shows that pcb is still ESTABLISHED
dev...mcspr:lwip/pcb-debug

[050328] [MQTT] Sending home/ESPURNA-35A259/status => 1 (PID 0)
[050333] [MQTT] Sending home/ESPURNA-35A259/loadavg => 0 (PID 0)
bs
lwip.pcbs
[050807] Active PCB states:
[050807] state=ESTABLISHED local=ESPURNA-35A259:63823 remote=orion.lan:1883 snd_nxt=11326 rcv_nxt=2086424742
[050808] TIME-WAIT PCB states:
[050811] BOUND PCB states:

One would expect that keep-alive would sort this out, but it does not.
(and as quick google search answers suggest, regarding lwip tcp /app keepalive)

Device information

  • Arduino Core version 2.3.0, 2.5.2
  • ESPurna version: c57a0ab
  • IDE & version PlatformIO
@mcspr
Copy link
Collaborator Author

mcspr commented Jun 6, 2019

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

1 participant