Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Discussion: #develop branch #219

Closed
bertmelis opened this issue Feb 8, 2021 · 87 comments
Closed

Discussion: #develop branch #219

bertmelis opened this issue Feb 8, 2021 · 87 comments
Labels
discussion Converstation

Comments

@bertmelis
Copy link
Contributor

I'm been doing some work on the https://github.com/marvinroger/async-mqtt-client/tree/develop branch lately.

It should solve lots of the current issues. The main improvement is that outgoing messages are queued and the memory is only freed after the TCP ack allowes to do so.

  • ❗ Sending in QoS 0, 1 and 2 is supported for messages as large as memory permits.
  • Receiving is supported in all three QoS levels. Messages are still directly passed to the API, in chuncks (you get an index, length and total). QoS 2 messages will be delivered "once completely", meaning that it is possible that you get parts of QoS 2 messages more than once. (OTA over MQTT is still possible)
  • ❗ You don't have to wait for messages to be sent before disconnecting. The disconnect signal will be queued and the client will only disconnect once the previously queued messages are sent.

❗ : although the API remained unchanged, the exlamation mark denotes a change in behaviour.

I'm looking forward to feedback. I do like positive feedback 😄, but I'm more interested in the negative ones. I still have to go through the list of mandatory statements of the MQTT specification. TLS is also untested but there has been no change in that part of the code.

@bertmelis bertmelis added the discussion Converstation label Feb 8, 2021
@bertmelis
Copy link
Contributor Author

bertmelis commented Feb 8, 2021

A sample sketch with the development code:

RAM:   [====      ]  36.5% (used 29936 bytes from 81920 bytes)
Flash: [===       ]  28.1% (used 292984 bytes from 1044464 bytes)

Master branch:

RAM:   [====      ]  36.7% (used 30104 bytes from 81920 bytes)
Flash: [===       ]  27.6% (used 288428 bytes from 1044464 bytes)

A similar library whose owner is quite patronizing (API differs a bit, so sketch is not 100% the same):

RAM:   [====      ]  36.5% (used 29880 bytes from 81920 bytes)
Flash: [===       ]  28.2% (used 294636 bytes from 1044464 bytes)

@luebbe
Copy link
Collaborator

luebbe commented Feb 8, 2021

I added https://github.com/marvinroger/async-mqtt-client.git#develop to the platformio.ini of my https://github.com/luebbe/homie-node-collection compiled and uploaded the sensor-nodes demo to a d1 mini.

This one contains a lot of nodes and many mqtt messages (84 topics) have to be sent upon startup. I'm running mosquitto on a raspberry Pi 1 and MQTT Explorer on Windows for monitoring.

The only notable difference when the device is reset is that it seems to take longer until all the initial messages have been received by MQTT explorer. I guess this is to be expected since they are queued up. Perhaps before messages were lost when the buffer was full.

I have also added "freeheap" to $stats in my fork of homie-esp2866, so I can monitor if anything runs away there.

The graph looks like this (around 15 minutes of history)
grafik

There is a significant drop directly after the start (expected) and then it is stable at around 30KB. You can also see the drop at the five minute intervals when sensor data is sent.

It is also interesting to see that when the $stats are sent, they now trickle in over the course of 1-3 seconds. With the previous version they came in the same second.

@bertmelis
Copy link
Contributor Author

bertmelis commented Feb 8, 2021

In which qos are you sending?

One notable difference in sending qos > 0 is that "the next message" is only sent when the previous has been acknowledged. I didn't want the logic to be too complicated to ensure proper message ordering (as per mqtt spec). In qos 0 this should not matter, but I might be wrong. I changed the logic a few times.

@luebbe
Copy link
Collaborator

luebbe commented Feb 8, 2021

Homie default. I didn't change anything. MQTT Explorer says qos 0?

@bertmelis
Copy link
Contributor Author

Homie default. I didn't change anything. MQTT Explorer says qos 0?

qos subscribing might differ from qos publishing.

Stats are sent qos 1 (https://github.com/homieiot/homie-esp8266/blob/9cd83972f27b394eab8a5e3e2baef20eea1b5408/src/Homie/Boot/BootNormal.cpp#L180). So yes, this is slower than before.

@luebbe
Copy link
Collaborator

luebbe commented Feb 8, 2021

The homie convention states https://homieiot.github.io/specification/#qos-and-retained-messages that it uses qos 1, but I have to check the source to confirm. I don't know from where MQTT Explorer gets the qos 0 info.

@luebbe
Copy link
Collaborator

luebbe commented Feb 8, 2021

qos subscribing might differ from qos publishing.

guess so

Stats are sent qos 1 (https://github.com/homieiot/homie-esp8266/blob/9cd83972f27b394eab8a5e3e2baef20eea1b5408/src/Homie/Boot/BootNormal.cpp#L180). So yes, this is slower than before.

Fine by me. Better slow & certain than quick & dirty

I'll leave my dev board and the monitor running at least until I go to sleep, but I'm already pretty confident that there's no memory leak.

@bertmelis
Copy link
Contributor Author

I'll anyhow add speed up to the wishlist.

@luebbe
Copy link
Collaborator

luebbe commented Feb 8, 2021

Just FYI, the sensor-demo has 84 subtopics. This is the second largest amount of my homie based applications.

@kleini
Copy link

kleini commented Feb 8, 2021

For me this results in a crash. Please see attached serial output from my relays-device. I am using SSL encrypted connection.
crash.log

@bertmelis
Copy link
Contributor Author

bertmelis commented Feb 8, 2021

Do you run your ESP on 160MHz? A soft watchdog reset on the sys stack during connecting is probably caused by the handshake taking too much time. Off course I can be wrong.

Edit: for the record, I am setting up TLS on my broker to test myself.

@kleini
Copy link

kleini commented Feb 8, 2021

It is a standard ESP12-F. This time with decoded crash:
crash.log

@bertmelis
Copy link
Contributor Author

gotta look into that. But still, it's a watchdog reset, so something is taking too long.
I workaround could be to not send immediately after connecting but on the first onPoll. You could try with commenting out line 382.
I don't know how long a broker waits to receive a CONNECT before disconnecting.

@kleini
Copy link

kleini commented Feb 8, 2021

I think, the connection to the MQTT broker does not work at all:

mosquitto    | 1612816096: New connection from 192.168.168.196 on port 1883.
mosquitto    | 1612816099: Client <unknown> has exceeded timeout, disconnecting.
mosquitto    | 1612816099: Socket error on client <unknown>, disconnecting.

And with some debug enabled, I see

✔ Wi-Fi connected, IP: 192.168.168.196
Triggering WIFI_CONNECTED event...
↕ Attempting to connect to MQTT...
CONNECTING
TCP conn, MQTT CONNECT
new front #1
snd #1: 165/100

--------------- CUT HERE FOR EXCEPTION DECODER ---------------

Soft WDT reset

So, is somehow the connect string not sent completely? Will try to sniff network traffic.

@bertmelis
Copy link
Contributor Author

It's been a while since I've used TLS on the esp8266. How did you solve this error:

ESPAsyncTCP/src/ESPAsyncTCP.cpp:1325:31: error: no matching function for call to 'AsyncClient::_recv(tcp_pcb*&, pbuf*&, int)'           
         c->_recv(pcb, p->pb, 0);  

@luebbe
Copy link
Collaborator

luebbe commented Feb 8, 2021

Still running fine after 3,5 hours. FreeHeap still at 30k, goes down to 29,4k .. 29,6k when values are sent and always goes back to the same max value.
No SSL.

@kleini
Copy link

kleini commented Feb 8, 2021

It's been a while since I've used TLS on the esp8266. How did you solve this error:

ESPAsyncTCP/src/ESPAsyncTCP.cpp:1325:31: error: no matching function for call to 'AsyncClient::_recv(tcp_pcb*&, pbuf*&, int)'           
         c->_recv(pcb, p->pb, 0);  

me-no-dev/ESPAsyncTCP#151 <- that requires a fix from @mhightower83, which is not merged for a 3/4 year now.

Sorry, enough for me for today.

@bertmelis
Copy link
Contributor Author

bertmelis commented Feb 8, 2021

Shoot me, with logging enable it gives this output:

CONNECTING              <-- connect call by the user
TCP conn, MQTT CONNECT  <-- TCP is connected (TLS handshake done)
new front #1            <-- CONNECT packet is added to the queue
snd #1: 101/40          <-- TCP sends 101 bytes whereas the CONNECT was only 40.

So the TCP lib is not returning the actual written payload when using encryption???

@kleini
Copy link

kleini commented Feb 9, 2021

Should TCP library normally return exact the amount of bytes, that have been sent? I saw the similar behaviour some comments above.

@bertmelis
Copy link
Contributor Author

bertmelis commented Feb 9, 2021

It does, but only the non-tls. I guess it returns the encrypted TCP payload size. If you ask me, this a bad API design.

@bertmelis
Copy link
Contributor Author

@luebbe To speed up Homie you can use set_tcp_nodelay to true if you use mosquitto as broker. But I added "inflight messages > 1" to the TODO.

@bertmelis
Copy link
Contributor Author

bertmelis commented Feb 9, 2021

I'm thinking about a workaround for TLS. It wont be pretty as it is impossible to know how much data actually has been sent: We can wait to send for the entire TCP buffer to be empty. Then copy a fixed amount to LWIP and "ack" our own data.

EDIT: Thinking about it... I might keep 2 sets of counters: one from tcp and one of our own and keep them in sync. I will work it out in my head while clearing the snow from the driveway.

@bertmelis
Copy link
Contributor Author

I've added the hacky solution for TLS. I've tested it with the example in the repo + added a 2000 byte message for chuncked sending.

Did it like described above.

@kleini
Copy link

kleini commented Feb 9, 2021

It works!
works.log
Reading your pull request, I see the _acked is already increased when sending the TLS data as it is unclear how log the data will be. The returned written length seems to be somewhere deep inside axTLS of ESPAsyncTCP. Pretty hard to just read and understand that code.

@bertmelis
Copy link
Contributor Author

Yes, there is room for improvement. I didn't want to touch the existing code too much. _acked is keeping track of the acked data in the mqtt client. But since we're copying the data to TCP, we can already consider it acked. On non-tls versions, _acked is only incremented when the real zck comes in.

I also realized I can use client->space() instead of the fixed number.

@bertmelis
Copy link
Contributor Author

bertmelis commented Feb 10, 2021

I'm going to give it a rest for a week or two (covid restrictions a eased a bit, whoohoo). I'll keep an eye on the issue and PR tracker and read whatever is going on. I'll keep supporting!

Meanwhile I'll think about possible improvements:

  • max inflight messages > 1 for qos > 0
  • publishing really big payloads (larger than heap)
  • ...

@kleini
Copy link

kleini commented Feb 10, 2021

Awesome work so far, I must say! Thank you very much @bertmelis !

@luebbe
Copy link
Collaborator

luebbe commented Feb 10, 2021

I'll go through old issues backwards and check if I can reproduce them. If not I'll tag them with "to be closed" and we can review together. Won't test TLS related stuff though. I didn't want to jump through that hoop yet.

@kleini
Copy link

kleini commented Feb 10, 2021

@luebbe I can do the TLS related stuff.

@bertmelis
Copy link
Contributor Author

Same problem here - I've just replaced "old" working develop with the new one, and got periodic:

Exception (9):
epc1=0x40225596 epc2=0x00000000 epc3=0x00000000 excvaddr=0x000006bf depc=0x00000000

LoadStoreAlignmentCause: Load or store to an unaligned address
  epc1=0x40225596 in AsyncMqttClient::publish(char const*, unsigned char, bool, char const*, unsigned int, bool, unsigned short) at lib/async-mqtt-client-develop/src/AsyncMqttClient.cpp:744

crashes...

Thank you, here, the line number isn't displayed. Furthermore, when I enable logging, the error is gone.

Now, if the line number is correct, it is on the packetId() call. It's a virtual call which returns the value _packetId when pointed to a Publish packet or zero when there's no method overloaded. Mind that the base class does not have a member variable _packetId. Am I doing something wrong here?

@Pablo2048
Copy link
Contributor

Well the complete stack trace is:

0x40225548 in AsyncMqttClient::publish(char const*, unsigned char, bool, char const*, unsigned int, bool, unsigned short) at lib/async-mqtt-client-develop/src/AsyncMqttClient.cpp:744
0x40260b90 in mqtt_if_output(netif*, pbuf*, ip4_addr const*) at lib/MQTTVPN/src/MQTTVPN.cpp:72
0x4025a69d in cnx_update_bss_more at ??:?
0x40260bc4 in mqtt_if_output(netif*, pbuf*, ip4_addr const*) at lib/MQTTVPN/src/MQTTVPN.cpp:78
0x40100e3f in umm_free_core at /home/pablo2048/.platformio/packages/framework-arduinoespressif8266/cores/esp8266/umm_malloc/umm_malloc.cpp:351
0x40101108 in free at /home/pablo2048/.platformio/packages/framework-arduinoespressif8266/cores/esp8266/umm_malloc/umm_malloc.cpp:398
0x402501d4 in mem_free at /local/users/gauchard/arduino/arduino_esp8266/esp8266-lwip/tools/sdk/lwip2/builder/lwip2-src/src/core/mem.c:237
0x40254459 in ieee80211_parse_beacon at ??:?
0x4024f6a8 in ip4_output_if_opt_src at /local/users/gauchard/arduino/arduino_esp8266/esp8266-lwip/tools/sdk/lwip2/builder/lwip2-src/src/core/ipv4/ip4.c:1764
0x40258de7 in sta_input at ??:?
0x402393ee in AsyncClient::_s_recv(void*, tcp_pcb*, pbuf*, long) at .pio/libdeps/esp12e/ESPAsyncTCP/src/ESPAsyncTCP.cpp:741
0x4024f6f0 in ip4_output_if_opt at /local/users/gauchard/arduino/arduino_esp8266/esp8266-lwip/tools/sdk/lwip2/builder/lwip2-src/src/core/ipv4/ip4.c:1577
0x402501d4 in mem_free at /local/users/gauchard/arduino/arduino_esp8266/esp8266-lwip/tools/sdk/lwip2/builder/lwip2-src/src/core/mem.c:237
0x4024f716 in ip4_output_if at /local/users/gauchard/arduino/arduino_esp8266/esp8266-lwip/tools/sdk/lwip2/builder/lwip2-src/src/core/ipv4/ip4.c:1550
0x4024fd1d in icmp_input at /local/users/gauchard/arduino/arduino_esp8266/esp8266-lwip/tools/sdk/lwip2/builder/lwip2-src/src/core/ipv4/icmp.c:291
0x40100e3f in umm_free_core at /home/pablo2048/.platformio/packages/framework-arduinoespressif8266/cores/esp8266/umm_malloc/umm_malloc.cpp:351
0x4024f451 in ip4_input at /local/users/gauchard/arduino/arduino_esp8266/esp8266-lwip/tools/sdk/lwip2/builder/lwip2-src/src/core/ipv4/ip4.c:1473
0x40246090 in esp2glue_ethernet_input at /local/users/gauchard/arduino/arduino_esp8266/esp8266-lwip/tools/sdk/lwip2/builder/glue-lwip/lwip-git.c:469
0x4026cb33 in ethernet_input at /local/users/gauchard/arduino/arduino_esp8266/esp8266-lwip/tools/sdk/lwip2/builder/glue-esp/lwip-esp.c:373
0x40260b25 in mqtt_if_Task(ETSEventTag*) at lib/MQTTVPN/src/MQTTVPN.cpp:182
0x40105029 in call_user_start_local at ??:?
0x4010502f in call_user_start_local at ??:?
0x4010000d in call_user_start at ??:?
0x401007c8 in ets_post at /home/pablo2048/.platformio/packages/framework-arduinoespressif8266/cores/esp8266/core_esp8266_main.cpp:177
0x40101143 in malloc at /home/pablo2048/.platformio/packages/framework-arduinoespressif8266/cores/esp8266/umm_malloc/umm_malloc.cpp:552
0x4022ddeb in esp8266::MDNSImplementation::MDNSResponder::stcMDNSSendParameter::addDomainCacheItem(void const*, bool, unsigned short) at /home/pablo2048/.platformio/packages/framework-arduinoespressif8266/libraries/ESP8266mDNS/src/LEAmDNS_Structs.cpp:2441 (discriminator 1)
0x40235300 in operator new(unsigned int) at /home/pablo2048/.platformio/packages/framework-arduinoespressif8266/cores/esp8266/abi.cpp:39
0x4022ddeb in esp8266::MDNSImplementation::MDNSResponder::stcMDNSSendParameter::addDomainCacheItem(void const*, bool, unsigned short) at /home/pablo2048/.platformio/packages/framework-arduinoespressif8266/libraries/ESP8266mDNS/src/LEAmDNS_Structs.cpp:2441 (discriminator 1)
0x4022eb8d in esp8266::MDNSImplementation::MDNSResponder::_writeMDNSRRDomain(esp8266::MDNSImplementation::MDNSResponder::stcMDNS_RRDomain const&, esp8266::MDNSImplementation::MDNSResponder::stcMDNSSendParameter&) at /home/pablo2048/.platformio/packages/framework-arduinoespressif8266/libraries/ESP8266mDNS/src/LEAmDNS_Transfer.cpp:1342
0x4022ed71 in esp8266::MDNSImplementation::MDNSResponder::_writeMDNSServiceDomain(esp8266::MDNSImplementation::MDNSResponder::stcMDNSService const&, bool, bool, esp8266::MDNSImplementation::MDNSResponder::stcMDNSSendParameter&) at /home/pablo2048/.platformio/packages/framework-arduinoespressif8266/libraries/ESP8266mDNS/src/LEAmDNS_Transfer.cpp:1431
0x4022e882 in esp8266::MDNSImplementation::MDNSResponder::_udpAppend8(unsigned char) at /home/pablo2048/.platformio/packages/framework-arduinoespressif8266/libraries/ESP8266mDNS/src/LEAmDNS_Transfer.cpp:1112
0x4022e9e9 in esp8266::MDNSImplementation::MDNSResponder::_write8(unsigned char, esp8266::MDNSImplementation::MDNSResponder::stcMDNSSendParameter&) at /home/pablo2048/.platformio/packages/framework-arduinoespressif8266/libraries/ESP8266mDNS/src/LEAmDNS_Transfer.cpp:1254
0x4022ed20 in esp8266::MDNSImplementation::MDNSResponder::_writeMDNSServiceDomain(esp8266::MDNSImplementation::MDNSResponder::stcMDNSService const&, bool, bool, esp8266::MDNSImplementation::MDNSResponder::stcMDNSSendParameter&) at /home/pablo2048/.platformio/packages/framework-arduinoespressif8266/libraries/ESP8266mDNS/src/LEAmDNS_Transfer.cpp:1425
0x40101143 in malloc at /home/pablo2048/.platformio/packages/framework-arduinoespressif8266/cores/esp8266/umm_malloc/umm_malloc.cpp:552
0x402501b4 in mem_malloc at /local/users/gauchard/arduino/arduino_esp8266/esp8266-lwip/tools/sdk/lwip2/builder/lwip2-src/src/core/mem.c:210
0x4022e882 in esp8266::MDNSImplementation::MDNSResponder::_udpAppend8(unsigned char) at /home/pablo2048/.platformio/packages/framework-arduinoespressif8266/libraries/ESP8266mDNS/src/LEAmDNS_Transfer.cpp:1112
0x402633f7 in pp_attach at ??:?
0x40263446 in pp_attach at ??:?
0x401007c8 in ets_post at /home/pablo2048/.platformio/packages/framework-arduinoespressif8266/cores/esp8266/core_esp8266_main.cpp:177
0x401057c9 in ets_timer_disarm at ??:?
0x401007c8 in ets_post at /home/pablo2048/.platformio/packages/framework-arduinoespressif8266/cores/esp8266/core_esp8266_main.cpp:177
0x401027bc in rcReachRetryLimit at ??:?
0x401057c9 in ets_timer_disarm at ??:?
0x40103d29 in lmacIsIdle at ??:?
0x40103d29 in lmacIsIdle at ??:?
0x401007c8 in ets_post at /home/pablo2048/.platformio/packages/framework-arduinoespressif8266/cores/esp8266/core_esp8266_main.cpp:177
0x401007c8 in ets_post at /home/pablo2048/.platformio/packages/framework-arduinoespressif8266/cores/esp8266/core_esp8266_main.cpp:177
0x40101c6c in pp_post at ??:?
0x40104f23 in lmacRxDone at ??:?
0x40101c6c in pp_post at ??:?
0x40102807 in rcReachRetryLimit at ??:?
0x40101c6c in pp_post at ??:?
0x401007c8 in ets_post at /home/pablo2048/.platformio/packages/framework-arduinoespressif8266/cores/esp8266/core_esp8266_main.cpp:177
0x401007c8 in ets_post at /home/pablo2048/.platformio/packages/framework-arduinoespressif8266/cores/esp8266/core_esp8266_main.cpp:177
0x40104f0b in lmacTxFrame at ??:?
0x40101c6c in pp_post at ??:?
0x40104f23 in lmacRxDone at ??:?
0x40102807 in rcReachRetryLimit at ??:?
0x401029e8 in rcReachRetryLimit at ??:?
0x401029e8 in rcReachRetryLimit at ??:?
0x40102eaa in wDev_ProcessFiq at ??:?
0x401007c8 in ets_post at /home/pablo2048/.platformio/packages/framework-arduinoespressif8266/cores/esp8266/core_esp8266_main.cpp:177
0x40101c6c in pp_post at ??:?
0x40104f23 in lmacRxDone at ??:?
0x40102807 in rcReachRetryLimit at ??:?
0x401029e8 in rcReachRetryLimit at ??:?
0x40102eaa in wDev_ProcessFiq at ??:?
0x40102bcc in wDev_ProcessFiq at ??:?
0x401007c8 in ets_post at /home/pablo2048/.platformio/packages/framework-arduinoespressif8266/cores/esp8266/core_esp8266_main.cpp:177
0x401008c2 in millis at /home/pablo2048/.platformio/packages/framework-arduinoespressif8266/cores/esp8266/core_esp8266_wiring.cpp:188 (discriminator 3)
0x40100e3f in umm_free_core at /home/pablo2048/.platformio/packages/framework-arduinoespressif8266/cores/esp8266/umm_malloc/umm_malloc.cpp:351
0x401008c2 in millis at /home/pablo2048/.platformio/packages/framework-arduinoespressif8266/cores/esp8266/core_esp8266_wiring.cpp:188 (discriminator 3)
0x401007c8 in ets_post at /home/pablo2048/.platformio/packages/framework-arduinoespressif8266/cores/esp8266/core_esp8266_main.cpp:177
0x401007e9 in esp_schedule at /home/pablo2048/.platformio/packages/framework-arduinoespressif8266/cores/esp8266/core_esp8266_main.cpp:125
0x40235740 in loop_wrapper() at /home/pablo2048/.platformio/packages/framework-arduinoespressif8266/cores/esp8266/core_esp8266_main.cpp:19

but - because it's from my MQTTVPN I don't think that it will help...

@bertmelis
Copy link
Contributor Author

bertmelis commented Feb 25, 2021

Got it. I thought I could be smart but I am not. Will push in a sec.

Strangely, on my stack trace, the line numbers don't show up. It could have helped me A LOT.

@proddy
Copy link

proddy commented Feb 25, 2021

Strangely, on my stack trace, the line numbers don't show up. It could have helped me A LOT.

you need to compile with -g or use build_type = debug in platformio.ini

@Pablo2048
Copy link
Contributor

Pablo2048 commented Feb 25, 2021

So with the new develop the crash is gone. Unfortunately it disconnects from the broker every ~24.9 seconds.
Snímek z 2021-02-25 17-14-51
... just to make it clear - this message is sent in onConnect callback.

@bertmelis
Copy link
Contributor Author

bertmelis commented Feb 25, 2021

You keepalive is probably standard, 15 secs?

@Pablo2048
Copy link
Contributor

Yes, I didn't modified it.

@Pablo2048
Copy link
Contributor

The reason reported in onDisconnect callback is always 0.

@Pablo2048
Copy link
Contributor

I do have some another observations. In the onConnect callback I publish to some topics like this:

    String info;

    info.reserve(512);
    info = F("{\"hw\":\"");
    info.concat(F(HW_NAME "-" HW_VARIANT));
    info.concat(F("\",\"app\":\""));
    info.concat(F(APP_NAME));
    info.concat(F("\",\"appversion\":\""));
    info.concat(F(APP_VERSION));
    info.concat(F("\",\"ip\":\""));
    IPAddress ip = mqtt_if->netif.ip_addr;
    info.concat(ip.toString().c_str());
    info.concat(F("\"}"));
    sprintf_P(buff, PSTR("%sinfo"), mqtt_if->service);
    mqtt_if->mqttcl.publish(buff, 0, true, info.c_str());
    sprintf_P(buff, PSTR("%sresetreason"), mqtt_if->service);
    mqtt_if->mqttcl.publish(buff, 0, true, ESP.getResetReason().c_str());
    mqtt_if->mqttcl.publish(mqtt_if->will, 1, true, online);

Publish to the /info and /resetreason topic seems to be ok (QoS 0), publish to the mqtt_if->will topic (it's my last will and testament topic) with QoS 1 NEVER succeed. The MQTT Explorer just logs periodic '0' writes to the last will topic, which probably goes from the broker itself.
Snímek z 2021-02-25 17-36-19

@bertmelis
Copy link
Contributor Author

what is this online variable?

@Pablo2048
Copy link
Contributor

Just character string:

static const char online[] = "1";
static const char offline[] = "0";

@bertmelis
Copy link
Contributor Author

Can't reproduce unfortunately. Is your code public?

@Pablo2048
Copy link
Contributor

Unfortunately not yet :-( . Maybe I can publish it at the weekend (at least some version...).

@bertmelis
Copy link
Contributor Author

bertmelis commented Feb 25, 2021

I messed up again with the packetId. Apparently you can have the same member variable in a derived class as in a base class without any compiler warning?

@Pablo2048
Copy link
Contributor

Pablo2048 commented Feb 25, 2021

I didn't notice that - I'll check through the code. Fortunately the last push did the trick - everything seems to be working again. Including the web access - didn't test the OTA so far.
Edit: So the OTA is also working... Thank You.

@bertmelis
Copy link
Contributor Author

Strangely enough the packetId errors didn't show up before I simplified the code. The error was there though.

Anyway, done for this week. Gotta do some other things now. Next week I'll be trying to speed things up by having more messages qos>0 in-flight.

@proddy
Copy link

proddy commented Feb 28, 2021

I'm getting regular crashes too on the ESP32 so getting my esp-prog debugger out to find where it's bombing.

Got it all working. I had to switch to a different ESP32 partition configuration. Anyway, latest develop has been running flawlessly on one of my projects for the last few days now with around 50K of topics per day on QOS0. Tested large payloads too. Works wonderfully, nice work Bert!

@bertmelis
Copy link
Contributor Author

Thanks. It's not over yet. The to do:

  • have more messages in flight (= don't wait for MQTT ack before publishing the next message with qos > 0)
  • change const char* to const uint8_t* in the callback

When that's done, I propose to merge the improvements to master, adjust the documentation and release a new version.

@bertmelis bertmelis changed the title Please test #develop Discussion: #develop branch Feb 28, 2021
@proddy
Copy link

proddy commented Feb 28, 2021

Nice. There's probably a few more things we can iron out too. I was looking at Phil's comments on AsyncMQTTClient that triggered him to create Pangolin. They are still valid concerns and it would be good to address them and mark them down as 'enhancements' in the GitHub so at least we have each feature logged.

@bertmelis
Copy link
Contributor Author

Do you happen to "know" which remarks are left? I'll also dig up his old rant as see what's left (I remember some of his remarks were as cryptic as the code in his mqtt client).

@proddy
Copy link

proddy commented Feb 28, 2021

Most of the original remarks were removed but you can trace it back here, hidden in the now removed files. Short summary:

  • large payloads
  • receiving chunked inbound payloads which are greater than the TCP MSS size, and correctly re-assembling them in onMessage
  • the common disconnect/reconnect problem (although I've never experienced this)

It's a shame that Marvin's library, which is widely used and works fine for most people's needs to get slammed. The whole point of open source is to collaborate and improve each other's code, instead of spinning off a competitive project and asking people to donate $.

@bertmelis
Copy link
Contributor Author

bertmelis commented Feb 28, 2021

  1. Spontaneous DCX/CNX. Done
  2. Will Topic bug Prevents sketch from starting with non-static input. goes into docs, alternatives?
  3. Bad Subscribe Invalid topic causes DCX/CNX I'm not going to implement input validation
  4. No topic validation for subscribe (see above)
  5. Discarded messages 1 what's this 1?
  6. Discarded messages 2 what's this 2?
  7. Total Message Loss what's this 3?
  8. No error handling callback TODO
  9. "Killer Packet" inbound stays as it is. Payload sizes > RAM needs to be enabled for OTA over MQTT
  10. QoS1 Protocol Violation Done
  11. Fragment Failure TCP does the reassembling, right? So what's this 5?
  12. Numerous API errors - sufficient for their own document Seriously? what's this 6?
  13. QoS 1/2 protocol violation - no message resend TODO implement retry together with inflight > 1
  14. QoS 1/2 protocol violation - no session recovery Done
  15. QoS 1 protocol violation - breach of delivery promise Done
  16. QoS 2 protocol violation - breach of delivery promise Done

My extra thoughts:

  • the dangling pointers never was an issue
  • currently there is no formal retry mechanism in place. However, when QoS>0 messages are not acked, the client "hangs" and will disconnect because of the keepalive mechanism. After reconnecting (with clean session == 0), the messages are sent again.
  • I'll gladly create a file with "helpers" to assemble large incoming payloads or to read payloads into a properly terminated c-string. Not to be included by default...

@mcspr
Copy link

mcspr commented Feb 28, 2021

Fragment Failure TCP does the reassembling, right? So what's this 5?

I wonder if this is related to this sort of issue? xoseperez/espurna#2166 (comment)
Single packet payload is allowed to be spread across multiple onmessage calls. Meaning, user needs to track 'total' and 'length', so idk if this is really a lib issue. Maybe another possible helper / example.

Other buffer / discard points seem to be related to add() / send() usage, since it is possible to get stuck when network buffers are full and 'lose' outgoing messages

Thx for fixing keepalive btw :)
(will try asap with some more load, only tried some basic services for like 10minutes)

@bertmelis
Copy link
Contributor Author

bertmelis commented Feb 28, 2021

Fragment Failure TCP does the reassembling, right? So what's this 5?

I wonder if this is related to this sort of issue? xoseperez/espurna#2166 (comment)
Single packet payload is allowed to be spread across multiple onmessage calls. Meaning, user needs to track 'total' and 'length', so idk if this is really a lib issue. Maybe another possible helper / example.

Something like this: #234

@bertmelis
Copy link
Contributor Author

To be clear. This is how the developbranch works:

  • all outgoing messages are placed in a queue. There is a memory limit (using a #define) to protect against low heap.
  • you cannot add messages when the client is not connected
  • the queue mechanism sends all outgoing messages one after the other (triggers are: adding an item, TCP acks, TCP polling)
  • there is a protection against double connects and double disconnects
  • when a message needs a confirmation (qos > 0, subscribe, unsibscribe), the queue waits for this confirmation to move on. Basically: max number of inflight messages is 1.

What I'm thinking of now is

  • expose a method to clear the queue
  • max number of inflight messages "unlimited"
  • settable max number of inflight messages (In case of unreliable WiFi: message ordering has to be preserved and when the first message of a long series doesn't get acked, the whole series has to be resent)

@bertmelis
Copy link
Contributor Author

Just a sign of life. I'm still alive 😂

I'm just very low on computer time.

@epiller
Copy link

epiller commented Apr 30, 2021

Sorry that I might be spamming as I have nothing to add, but I just wanted to give a huge thanks for the initiative to continue this project as it is the best one out there. Kudos @bertmelis, kudos.

@bertmelis
Copy link
Contributor Author

Merged and released. Closing.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
discussion Converstation
Projects
None yet
Development

No branches or pull requests

8 participants