-
Notifications
You must be signed in to change notification settings - Fork 46
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
PubSubClient on ESP32 overflows the stack #9
Comments
I am getting the same thing with the latest PubSubClient. However, when I put in the work around to flush after each publish, it locks up for 20 seconds and reconnects:
|
So it seems (I could be wrong though) that after a publish it doesn't go into the BR_SSL_RECVAPP state (a flush puts things in a run until BR_SSL_SENDAPP target) which causes this operation to time out. Will continue to investigate, but maybe forcing the flush instead straight away might work? Note this isn't looking to fix the issue at hand, just the workaround. |
Ah, running client.loop() after a client.publish seems to solve this problem. Edit: spoke too soon, this still causes the stack overflow issues. Edit2: doing the following with pubsub in the loop without the flushes (set to run every 30 seconds) causes it to crash the second time around the loop:
My MQTT_MAX_PACKET_SIZE is set to 1024 |
OH WOW. I set #define MQTT_MAX_TRANSFER_SIZE 80 in PubSubClient.h (note I always had ETHERNET_LARGE_BUFFERS set in Ethernet.h) and this fixed the above Edit2. I'm going to leave it running without the subscribe function (and without flush) in there and see if the original problem goes away. Edit: Looking good so far. Will leave things overnight. |
I added some more error logging to |
It crashed about 8 hours later without flushes and with MQTT_MAX_TRANSFER_SIZE 80 in PubSubClient.h set. Normally it would run for about 5-30 minutes. With flushes it actually runs just fine, it just has the long timeout issue. Trying with the latest and flushes re-enabled:
|
I tried running the flush operation (without MQTT_MAX_TRANSFER_SIZE set) after an .available() call and wasn't getting pauses (noticed in available() that it will force a flush if stuck in SENDAPP):
However, this still causes the overflow to occur shortly afterwards. |
Edit: Actually disregard below, client.available gets called in the PubSub loop method. Calling client.loop(); after each network operation would have the same effect. One thing I noticed in PubSubClient.cpp was that there isn't an available() call in the write function to the client, which goes against this library's recommendations concerning the write buffering. This is leading you to call the flush process manually as a fix for this.
I have placed "while (!_client->available()) {}" at the end of this function and will see how things go with the original issue at hand. This is weird though, because originally even though this available() function isn't being called, data is still being sent to the server (even without flush()). The publish call is the only network operation I have going. |
Aha! If I run a client.loop() before a ethClientSSL.flush() then flush works correctly without delay. Edit: it still caused crashes, but if I run client.loop as well before the publish, all is well! Basically doing this fixes everything:
Edit2: Looking into this further, it seems it might be a way buffers have been implemented around the place - esp8266/Arduino#3002 Not sure if it's a similar problem here (but it cropped up when the contributor was developing SSL in PubSubClient using WifiSecureClient - knolleary/pubsubclient#251), but definitely something to look at. |
That's interesting, I'm glad you found a fix. I suspect that one of the root causes is that BearSSL crashes/malfunctions when attempting to encrypt a 100% full buffer, but I will need to investigate further. |
Unfortunately spoke too soon. It does crash after several hours, so it's back around what it was like when I had MQTT_MAX_TRANSFER_SIZE set, except flushes now don't pause. I guess it's worth mentioning that I'm using the W6100, which has a double sized buffer (32k rather than 16k on the w5100). I have it set to 2 sockets, and the tx/rx buffers get dynamically set to maximise space when ETHERNET_LARGE_BUFFERS is set (i.e. each rx/tx buffer is 8k from a total of 32k). I am sending 180 bytes and my MQTT message size is 1024. I have a watchdog in there to reset as a workaround, but it's not the most ideal solution. Edit: I have this running all day for 10 hours with MQTT_MAX_TRANSFER_SIZE set to 256 as well as the .loop() call additions above and it's been rock solid so far (have an uptime counter keeping track). Will update tomorrow. Edit2: still running with both these changes 19 hours later. |
Uptime: Days: 1, Hours: :02, Minutes: :26, Seconds: :19 Still going strong! Edit: crashed at: Uptime: Days: 1, Hours: :12, Minutes: :23, Seconds: :19 Much better than before though. |
A bit of an update after some further dev and discovery. So it seems that running client.loop (note "client" is with regard to the defines in the MQTT example) as many times as possible without delay prevents the crashing I was having. Looking at the example I actually noticed you don't have a delay in your loop, whereas I had a delay(1) in there (which I have since now replaced with sleep code and interrupts on Ethernet), so that'd be why your flushes worked originally, whereas mine didn't. I have a check in the code that checks if ethClientSSL.available(), in which it sets a flag to run without delay/sleep until no longer available. Things have been rock solid for a good week now. So the flushes definitely prevent the aforementioned stack overflows, provided you run the client loop without delay (me big dumb dumb). Is it possible the original issue is related to frame corruption/data loss when not processing fast enough or if it's interrupted, which wreaks havoc with the SSL libraries? |
Maybe, but I don't think so. BearSSL's data should be completely isolated from the rest of the code, and SSLClient will not fill the buffer until BearSSL is finished processing. I think the most likely suspect is a buffer overflow happening somewhere in BearSSL or the Ethernet library. |
I know it's ESP8266 related, but I'm guessing the API implementation might be similar. They are possibly having the same issues over there - esp8266/Arduino#6143 Edit: ah nevermind, you mentioned you increased the stack size. |
Trying to debug why the flushes work and we get crashes without. It seems the crash occurs very reliably when the MQTT client loop() writes an MQTT keepAlive ping while it hasn't flushed the previous publish down the client stack yet (denoted* by size 2 here): publish: {"deviceID":"000102030405","pa":101972,"temp":22,"lux":7} So it seems if there is an output (or possibly input) buffer when publishing that hasn't been processed by bearSSL yet, it affects the way bearSSL works. I've never seen the crash occur during a regular publish which doesn't coincide with a ping (which is why it seems random). This would be why we're seeing this with PubSub. *Those calls are in SSLClient::write():
|
I've been experimenting with running: |
@bleckers are you changing the buffer size in PubSubClient? It's possible that this issue is related to knolleary/pubsubclient#764. |
No, not changing at runtime. Currently set to #define MQTT_MAX_PACKET_SIZE 1024 The workarounds have been rock solid for a good few months now without fail, provided you run flush() enough (those latest comments were more to see why it might be happening without the flush(). Also of note for those that might be using the Ethernet libraries. There's some issues with them that will cause lockups under certain network situations. arduino-libraries/Ethernet#137 Also there's a fix if using SSLClient as well that sorts out a few issues there too - #13 |
Hello, I got this error on serial print:
Do you believe is related to the known issue? or maybe I'm missing something? Also I have some doubts about why the sketch make a connection call to "arduinoClient" in function reconnect()? BR, |
@bleckers It's been awhile! I may have fixed the issue you were seeing. In |
@1simone0 It looks like that issue is with your network module. Does MQTT work without SSLClient, or does the EthernetHTTPS example work? If not, would you mind opening a new issue so discussion can continue in that thread? |
More than happy to give the latest a shot later in the week. I'll retry without the mitigations in place and see what happens. |
Testing right now with the client.flush() removed. Thank you! |
At 40k messages now; I'd say consider this issue fixed! |
I completely agree. I've tested this now for tens of thousand of publishes on M5Stack core 1's, M5Stack atom lites, and Adafruit esp32-based feathers. This at two different sites on different networks. Not. one. hiccup. Great job y'all! |
All seems perfectly fine with these changes. I reckon you can close this one. |
Awesome! Took awhile but I'm glad it's fixed now🎉. |
Working with @jhnwmr on #8 I discovered that the using PubSubClient with SSLClient causes a stack overflow on the ESP32:
Error Log
This error persists despite increasing the stack size to >16kb, suggesting that this error is not simply due to a shortage of memory. My best guess is a bug in the BearSSL implementation of the ChaCha/Poly cipher suite, however it is too early so say for sure.
This error can temporarily be worked around by flushing SSLClient's buffer using
SSLClient::flush
after every write to the network. I have updated the examples to include this workaround, however It would definitely be best if this issue was addressed with a more permanent fix for the future.The text was updated successfully, but these errors were encountered: