-
Notifications
You must be signed in to change notification settings - Fork 269
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Serious and fatal problem with entire library #193
Comments
Hi, nice writeup. I also saw this problem and there is already a (wrongly closed) bugreport #66 (comment) about this. Sadly this library contains a lot of other bugs and performance problems. I've fixed most of them here https://github.com/mknjc/async-mqtt-client/tree/mqtt5 but currently have no time to clean it up. Bugs and pull requests are welcome :-) |
Is I read that paragraph, the data is copied when you set the appropriate flag. In that case the variables holding the data can go out of scope. I believe this flag is set in ESPAsyncTCP. |
Nope, see for example async-mqtt-client/src/AsyncMqttClient.cpp Line 332 in 7f1ba48
|
Why not just add the flag then? Is see in ESPAsyncWebserver the flag is also not set, but there the lib takes care of memory management (allocates on beforehand and frees on ack). EDIT: don't get me wrong, I am convinced this is a problem. I 'solved' this myself on a higher level queueing the MQTT messages itself and using the MQTT acks to free. |
but my branch was focused in mqtt5 support and a fixed a lot of other things so this commit might not apply cleanly on this repo
Nope this doesn't fix the problem. The line I linked uses "sendbuffer" a stack allocated array in an internal callback. So nothing external could fix this. |
I see, although the data I add (topic, payload...) doesn't go out of scope, the data (headers etc) the library adds does go out of scope. And when you have a good connection where you don't have to retransmit (everything gets acked after first try), the library works miraculously fine. [EDIT] |
I'm seeing the same problem so implemented a similar queuing mechanism to remove messages only after an MQTT ACK from the publish. Only drawback is that you can't use Qos 0. Having said that when there's a lot of traffic on a noisy wifi I do see the occasional miss fires, so built in a re-retry mechanism too. Too much code now for a simple operation. |
yes this is the main problem
I needed a lot of time to find this out too. After a long packet capturing session I found a few packets with invalid checksum. They are likely caused by the missing copy.
Who do you ask? I used it in my fix a lot, some add calls misses them but these use constants or class variables. Why mavinroger doesn't used the copy? Because he liked the lib to be "lightweight" (whatever this means) see: #66 (comment) and me-no-dev said he tested it #66 (comment) (I don't know how, but in my reply I showed some ways how tests might get it wrong) [EDIT] |
Then as I stated, it works only by luck of the specific implementation. When porting to any other system e.g. STM32 or indeed if ESP8266, ESP32 change their LwIP or if ESPAsyncTCP changes, this lib will crash. It already does on STM32 which proves my point. The wider point is that the code is SO full of holes due to an apparent lack of understand of basic C issues e.g. the difference between uint8_t and char, the need for "strings" to be \0 terminated, that payloads are binary data and NOT character strings etc etc etc that it is beyond rescue. AND it does not correctlty implement QoS2 and session control. The "golden path" when everything works fine on a clean network and the connection to the server never breaks is fine...but that not the point of QoS2! The point is it should correctly recover when that breaks and this library does not! Again just looking at the code in detail shows that it cannot ever properly recover QoS2 retransmission after a reconnect. |
I welcome all contributors here to try my alternative library which I have just finished writing which fixes all of these problems - at least on my test system :) WITH THE WARNING that is is very very "alpha" and likely to contain "obvious" early bugs. Also, it does not yet handle TLS, nor have I tested it on ESP32 - I'm working on those right now and they will be added soon. I see no reason why it wouldn't / shouldn't drop straight in to ESP32 as long as no-one tries to do any fancy RTOS multiple task access and only uses it one one "thread" (I can't think why anyone would want to do otherwise :) ALSO it correctly handles binary payloads (up to the TCP_MSS size (ish)) which this lib does not ( you'd need to understand the difference between a char* and a uint8_t* and between a "string" and a "lump of arbitrary data" to do that. In my opinion, this author does not). Given that ALL MQTT payloads are length-specified binary blocks (!!!) and not strings or "strings" or Strings or sequences of null-terminated printable characters...minor changes are need to the API to correct those exisiting design errors. Notice how the examples never print the contents of the payload? I have included a hex dumper for that purpose and am currently adding api calls to take bare char* for anyone brave enough to want to use them and who also understand scope / lifetime and C pointers, plus overloads for std:::string and also String for die-hard Arduino fans. My lib correctly implements session control and QoS2 retransmits after failure, which again, this lib does not. Once its been alpha'd, I will add TCP large packet reconstruction "behind the scenes" - this lib expects the user to manage packet reassembly when payload > TCP_MSS!! To me, that says a lot about this lib, and none of it is good...I dispute the "lightweight" argument - the current lib has over 1800 lines of code, and although mine has the extras still to be added as I mentioned earlier, I don't expect it to get much bigger than the 622 lines it currently is... The difference in those two figures alone speaks volumes. Any help testing this alternative will be greatly welcomed: https://github.com/philbowles/asyncmqtt |
@qcasey Was that comment for me? I haven't implemented SSL yet - working on it now :) Also, I have hijacked this lib enough, going to suggest that any issues with my own replacement are dealt with on my own github. While it tries to preserve the same APi to make it "drop in" we should not forget it is a different library! |
No but it's another issue caused by the temporary stack pointers as you highlighted in your original post. I've only made this original library work by using my own queue to keep things in scope, it's so critical to the function it should be in the examples or fixed as you're doing in your rewrite. I will test yours and post any issues on your repo. Looking fwd to the TLS support :) |
@philbowles Thank you and congratulations ! Even though I never have had any crashes with marvinroger's library, except some disconnects with bad Wifi, I downloaded your new library and compiled it into an existing sketch. Thanks to the preservation of the same API I only had to do some type-casts and ignore some compiler warnings for a first step. So far my sketch is running and doing the same as it did before. I will comment further perceptions, issues or maybe pull requests on your GitHub repo https://github.com/philbowles/asyncmqtt In parallel I will also keep marvinroger's library and check if my very occasional disconnects are caused by the missing ASYNC_WRITE_FLAG_COPY in ESP8266 implementation. I do not have problems with other stuff you are complaining about. (e.g. difference between a char* and a uint8_t* ) |
I'll switch gladly. Then I can ditch my own queueing code. I don't agree with all of the issues you mention though (when talking about esp8266 only). And yes, using strings (whatever type of) as payload seems to confuse a lot of people. |
https://github.com/philbowles/asyncmqtt seems to be gone. Where can I find it? |
He has renamed it to https://github.com/philbowles/PangolinMQTT as you can see in the comment above. |
He renamed his project: https://github.com/philbowles/PangolinMQTT But don't expect a user-friendly atmosphere there. If you only need QoS0 I can only advice to stay with this lib and fix the issues. |
I just tried the Pangolin lib. Sadly PlatformIO is not supported, so this is no option for me. |
I have been working successfully with a slightly modified code of PangolinMQTT for a couple of months now. It also compiles
Also some improvements for Keep-Alive handling are added. To make it available for all I just created a fork https://github.com/obrain17/PangolinMQTT and uploaded my changes |
Very nice, but I preferably have only one interface for everything on my ESPs. Pangolin can't handle payloads larger than memory allows. So OTA-over-MQTT is not possible which I always found extremely handy. |
Closing this issue. It's being worked on. I must admit, the code in this lib is very readable. There are other libs where you can only guess the purpose of variables. |
This library has a fundamental problem that makes me wonder how it ever worked at all. There is even a comment in the code which led me to the surprising discovery that the code is simply not "fit for purpose". Of course I have concrete proof and an explanation of that statement which I present below. It also explains why ever since I have started using it, I have noticed "random" periods of disconnection / reconnection loops, so it actually doesn't work. As soon as I looked "under the hood" I could immediately see why.
This comment alerted me:
// Using a sendbuffer to fix bug setwill on SSL not working
The irony of this is that there is nothing wrong with either setwill or SSL or even the two combined. Certainly there is nothing wrong with LwIP that caused it. The problem is with your code and it is only coincidentally that the setwill issue showed it up. The same problem still exists in 90% of the rest of your library, and it is easy to explain.
The problem
ESPAsyncTCP is a thin wrapper arounf LwIP. LwIP client calls take char* parameters for "strings" e.g. clientID, topic etc. Thus of course, so does ESPAsyncTCP which effectively just "forwards" them to LwIP.
LwIP has many optimisations to improve throughput and performance, the main ones being buffering, queuing and coalescing of data. This is why there are two calls to send data rather than the more usual
send
found in synchronous aka "blocking" code.The
add
function simply tells LwIP to buffer the data. If it doesn't get any more within a given timeframe, it may then send the data. If gets moreadd
s in rapid succession, it lumps them all together until...The
send
function effectively tells LwIP "no more data from me for a while". At this point LwIP still may not actually put it "on the wire" - it has some complex algorithms - but the main point is that your code has no control over when LwIP finally decides to do that. It could be up to several seconds after you callsend
and long after your function has exited...invalidating all your temporary variables as it does so.Any code (not just this egregious example) that sends a temporary stack pointer (aka "local variable" address) data to a delayed function a) is asking for trouble b) has no concept of "asynchronous" c) breaks every rule of programming from chapter 1 of any book on C d) will run only by pure luck, if at all.
ALL the data you send to LwIP via ESPAsyncTCP are temporary stack pointers.
Example
ESPAsyncTCP
add
prototype takes achar*
It then passes that data * to LwIP
Code such as the next snippet (which is 90% of your library 'style') is going to fail. Not "might" or "could" but "absolutely and definitely will" . Your setwill / SSL already proves that, as do the frequent "random" disconnects.
scope / lifetime: stack variables
The ONLY reason this works at all on ESP8266 is that by happy accident and accident alone, the stack hasn't (yet) been trashed by the time LwIP does its actual send - which could be several seconds after your function has exited and released/destroyed and thus invalidated every pointer you just passed down the chain.
Whether, several seconds later the data bears any resemblance at all to the former values is a function of the number and size of LwIP buffers/queue length, the MCU clock, the latency of the network, the underlying 'system' code, interrupts etc etc etc -none of which you have any control over. Change any one of them and your code will crash at some point - as it currently does, regularly.
One of those "unlucky" "random" events just so happens to occur when setwill was used. Whatever you think you did to fix just slightly tweaked that list of uncontrollables above and accidentally hid it - till the next time...
Your library is basicially a whole collection of code written excusively in the style of the above snippet. It needs a ground-up rewrite to hold those pointers until the final ACK is received ...or make all variables paased to ESPAsyncTCP -> LwIP into globals.
And just in case you still don't beleive me, take it from the LwIP documentation:
The text was updated successfully, but these errors were encountered: