Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Serious and fatal problem with entire library #193

Closed
philbowles opened this issue May 27, 2020 · 21 comments
Closed

Serious and fatal problem with entire library #193

philbowles opened this issue May 27, 2020 · 21 comments

Comments

@philbowles
Copy link

This library has a fundamental problem that makes me wonder how it ever worked at all. There is even a comment in the code which led me to the surprising discovery that the code is simply not "fit for purpose". Of course I have concrete proof and an explanation of that statement which I present below. It also explains why ever since I have started using it, I have noticed "random" periods of disconnection / reconnection loops, so it actually doesn't work. As soon as I looked "under the hood" I could immediately see why.

This comment alerted me:

// Using a sendbuffer to fix bug setwill on SSL not working

The irony of this is that there is nothing wrong with either setwill or SSL or even the two combined. Certainly there is nothing wrong with LwIP that caused it. The problem is with your code and it is only coincidentally that the setwill issue showed it up. The same problem still exists in 90% of the rest of your library, and it is easy to explain.

The problem

ESPAsyncTCP is a thin wrapper arounf LwIP. LwIP client calls take char* parameters for "strings" e.g. clientID, topic etc. Thus of course, so does ESPAsyncTCP which effectively just "forwards" them to LwIP.

LwIP has many optimisations to improve throughput and performance, the main ones being buffering, queuing and coalescing of data. This is why there are two calls to send data rather than the more usual send found in synchronous aka "blocking" code.

The add function simply tells LwIP to buffer the data. If it doesn't get any more within a given timeframe, it may then send the data. If gets more adds in rapid succession, it lumps them all together until...

The send function effectively tells LwIP "no more data from me for a while". At this point LwIP still may not actually put it "on the wire" - it has some complex algorithms - but the main point is that your code has no control over when LwIP finally decides to do that. It could be up to several seconds after you call send and long after your function has exited...invalidating all your temporary variables as it does so.

Any code (not just this egregious example) that sends a temporary stack pointer (aka "local variable" address) data to a delayed function a) is asking for trouble b) has no concept of "asynchronous" c) breaks every rule of programming from chapter 1 of any book on C d) will run only by pure luck, if at all.

ALL the data you send to LwIP via ESPAsyncTCP are temporary stack pointers.

Example

ESPAsyncTCP add prototype takes a char*

size_t add(const char* data, size_t size, uint8_t apiflags=0);//add for sending

It then passes that data * to LwIP

Code such as the next snippet (which is 90% of your library 'style') is going to fail. Not "might" or "could" but "absolutely and definitely will" . Your setwill / SSL already proves that, as do the frequent "random" disconnects.

scope / lifetime: stack variables

// @ global scope
AsyncClient*  client=new AsyncClient();
void willCrash(){
  char[] shortlifetime="disaster waiting to happen";
  client->add(shortlifetime);
 client->send(); // at an undefined time posibly seconds into the future
} // char array "shortlifetime just disappeared out-of-scope
// yet you just passed it ot a lower level function = disaster

The ONLY reason this works at all on ESP8266 is that by happy accident and accident alone, the stack hasn't (yet) been trashed by the time LwIP does its actual send - which could be several seconds after your function has exited and released/destroyed and thus invalidated every pointer you just passed down the chain.

Whether, several seconds later the data bears any resemblance at all to the former values is a function of the number and size of LwIP buffers/queue length, the MCU clock, the latency of the network, the underlying 'system' code, interrupts etc etc etc -none of which you have any control over. Change any one of them and your code will crash at some point - as it currently does, regularly.

One of those "unlucky" "random" events just so happens to occur when setwill was used. Whatever you think you did to fix just slightly tweaked that list of uncontrollables above and accidentally hid it - till the next time...

Your library is basicially a whole collection of code written excusively in the style of the above snippet. It needs a ground-up rewrite to hold those pointers until the final ACK is received ...or make all variables paased to ESPAsyncTCP -> LwIP into globals.

And just in case you still don't beleive me, take it from the LwIP documentation:
image

@mknjc
Copy link

mknjc commented May 27, 2020

Hi,

nice writeup. I also saw this problem and there is already a (wrongly closed) bugreport #66 (comment) about this.

Sadly this library contains a lot of other bugs and performance problems.

I've fixed most of them here https://github.com/mknjc/async-mqtt-client/tree/mqtt5 but currently have no time to clean it up. Bugs and pull requests are welcome :-)

@bertmelis
Copy link
Contributor

Is I read that paragraph, the data is copied when you set the appropriate flag. In that case the variables holding the data can go out of scope.

I believe this flag is set in ESPAsyncTCP.

@mknjc
Copy link

mknjc commented May 27, 2020

I believe this flag is set in ESPAsyncTCP.

Nope, see for example

_client.add(sendbuffer, 12);

@bertmelis
Copy link
Contributor

bertmelis commented May 27, 2020

Why not just add the flag then?

Is see in ESPAsyncWebserver the flag is also not set, but there the lib takes care of memory management (allocates on beforehand and frees on ack).

EDIT: don't get me wrong, I am convinced this is a problem. I 'solved' this myself on a higher level queueing the MQTT messages itself and using the MQTT acks to free.

@mknjc
Copy link

mknjc commented May 27, 2020

Why not just add the flag then?
I've done this in this commit mknjc@1e179b2

but my branch was focused in mqtt5 support and a fixed a lot of other things so this commit might not apply cleanly on this repo

EDIT: don't get me wrong, I am convinced this is a problem. I 'solved' this myself on a higher level queueing the MQTT messages itself and using the MQTT acks to free.

Nope this doesn't fix the problem. The line I linked uses "sendbuffer" a stack allocated array in an internal callback. So nothing external could fix this.

@bertmelis
Copy link
Contributor

bertmelis commented May 27, 2020

EDIT: don't get me wrong, I am convinced this is a problem. I 'solved' this myself on a higher level queueing the MQTT messages itself and using the MQTT acks to free.

Nope this doesn't fix the problem. The line I linked uses "sendbuffer" a stack allocated array in an internal callback. So nothing external could fix this.

I see, although the data I add (topic, payload...) doesn't go out of scope, the data (headers etc) the library adds does go out of scope.

And when you have a good connection where you don't have to retransmit (everything gets acked after first try), the library works miraculously fine.

[EDIT]
FYI: are you not using the ESPAsyncTCPs flags for a particular reason?
https://github.com/me-no-dev/ESPAsyncTCP/blob/15476867dcbab906c0f1d47a7f63cdde223abeab/src/ESPAsyncTCP.h#L40

@proddy
Copy link

proddy commented May 27, 2020

I 'solved' this myself on a higher level queueing the MQTT messages itself and using the MQTT acks to free.

I'm seeing the same problem so implemented a similar queuing mechanism to remove messages only after an MQTT ACK from the publish. Only drawback is that you can't use Qos 0. Having said that when there's a lot of traffic on a noisy wifi I do see the occasional miss fires, so built in a re-retry mechanism too. Too much code now for a simple operation.

@mknjc
Copy link

mknjc commented May 27, 2020

I see, although the data I add (topic, payload...) doesn't go out of scope, the data (headers etc) the library adds does go out of scope.

yes this is the main problem

And when you have a good connection where you don't have to retransmit (everything gets acked after first try), the library works miraculously fine.

I needed a lot of time to find this out too. After a long packet capturing session I found a few packets with invalid checksum. They are likely caused by the missing copy.

FYI: are you not using the ESPAsyncTCPs flags for a particular reason?
https://github.com/me-no-dev/ESPAsyncTCP/blob/15476867dcbab906c0f1d47a7f63cdde223abeab/src/ESPAsyncTCP.h#L40

Who do you ask? I used it in my fix a lot, some add calls misses them but these use constants or class variables.

Why mavinroger doesn't used the copy? Because he liked the lib to be "lightweight" (whatever this means) see: #66 (comment) and me-no-dev said he tested it #66 (comment) (I don't know how, but in my reply I showed some ways how tests might get it wrong)

[EDIT]
Ah now I see what you mean. Yes I probably should have used the AsyncTCP Flags and not the lwip one.
I can't remember now, I guess I wasn't sure if me-no-dev keeps the flags updated if lwip changes the values. (After the comment "copy is not needed" my trust in me-no-dev isn't really high)

@philbowles
Copy link
Author

philbowles commented Jun 4, 2020

Is I read that paragraph, the data is copied when you set the appropriate flag. In that case the variables holding the data can go out of scope.

I believe this flag is set in ESPAsyncTCP.

Then as I stated, it works only by luck of the specific implementation. When porting to any other system e.g. STM32 or indeed if ESP8266, ESP32 change their LwIP or if ESPAsyncTCP changes, this lib will crash.

It already does on STM32 which proves my point. The wider point is that the code is SO full of holes due to an apparent lack of understand of basic C issues e.g. the difference between uint8_t and char, the need for "strings" to be \0 terminated, that payloads are binary data and NOT character strings etc etc etc that it is beyond rescue.

AND it does not correctlty implement QoS2 and session control. The "golden path" when everything works fine on a clean network and the connection to the server never breaks is fine...but that not the point of QoS2! The point is it should correctly recover when that breaks and this library does not!

Again just looking at the code in detail shows that it cannot ever properly recover QoS2 retransmission after a reconnect.

@philbowles
Copy link
Author

I welcome all contributors here to try my alternative library which I have just finished writing which fixes all of these problems - at least on my test system :)

WITH THE WARNING that is is very very "alpha" and likely to contain "obvious" early bugs. Also, it does not yet handle TLS, nor have I tested it on ESP32 - I'm working on those right now and they will be added soon. I see no reason why it wouldn't / shouldn't drop straight in to ESP32 as long as no-one tries to do any fancy RTOS multiple task access and only uses it one one "thread" (I can't think why anyone would want to do otherwise :)

ALSO it correctly handles binary payloads (up to the TCP_MSS size (ish)) which this lib does not ( you'd need to understand the difference between a char* and a uint8_t* and between a "string" and a "lump of arbitrary data" to do that. In my opinion, this author does not). Given that ALL MQTT payloads are length-specified binary blocks (!!!) and not strings or "strings" or Strings or sequences of null-terminated printable characters...minor changes are need to the API to correct those exisiting design errors.

Notice how the examples never print the contents of the payload? I have included a hex dumper for that purpose and am currently adding api calls to take bare char* for anyone brave enough to want to use them and who also understand scope / lifetime and C pointers, plus overloads for std:::string and also String for die-hard Arduino fans.

My lib correctly implements session control and QoS2 retransmits after failure, which again, this lib does not.

Once its been alpha'd, I will add TCP large packet reconstruction "behind the scenes" - this lib expects the user to manage packet reassembly when payload > TCP_MSS!! To me, that says a lot about this lib, and none of it is good...I dispute the "lightweight" argument - the current lib has over 1800 lines of code, and although mine has the extras still to be added as I mentioned earlier, I don't expect it to get much bigger than the 622 lines it currently is... The difference in those two figures alone speaks volumes.

Any help testing this alternative will be greatly welcomed: https://github.com/philbowles/asyncmqtt

@philbowles
Copy link
Author

@qcasey Was that comment for me? I haven't implemented SSL yet - working on it now :) Also, I have hijacked this lib enough, going to suggest that any issues with my own replacement are dealt with on my own github. While it tries to preserve the same APi to make it "drop in" we should not forget it is a different library!

@qcasey
Copy link

qcasey commented Jun 4, 2020

No but it's another issue caused by the temporary stack pointers as you highlighted in your original post. I've only made this original library work by using my own queue to keep things in scope, it's so critical to the function it should be in the examples or fixed as you're doing in your rewrite.

I will test yours and post any issues on your repo. Looking fwd to the TLS support :)

@obrain17
Copy link

obrain17 commented Jun 4, 2020

@philbowles Thank you and congratulations !

Even though I never have had any crashes with marvinroger's library, except some disconnects with bad Wifi, I downloaded your new library and compiled it into an existing sketch. Thanks to the preservation of the same API I only had to do some type-casts and ignore some compiler warnings for a first step. So far my sketch is running and doing the same as it did before.

I will comment further perceptions, issues or maybe pull requests on your GitHub repo https://github.com/philbowles/asyncmqtt

In parallel I will also keep marvinroger's library and check if my very occasional disconnects are caused by the missing ASYNC_WRITE_FLAG_COPY in ESP8266 implementation. I do not have problems with other stuff you are complaining about. (e.g. difference between a char* and a uint8_t* )

@bertmelis
Copy link
Contributor

I'll switch gladly. Then I can ditch my own queueing code.

I don't agree with all of the issues you mention though (when talking about esp8266 only).

And yes, using strings (whatever type of) as payload seems to confuse a lot of people.

@BorisBrock
Copy link

https://github.com/philbowles/asyncmqtt seems to be gone. Where can I find it?

@luebbe
Copy link
Collaborator

luebbe commented Nov 5, 2020

He has renamed it to https://github.com/philbowles/PangolinMQTT as you can see in the comment above.

@bertmelis
Copy link
Contributor

He renamed his project: https://github.com/philbowles/PangolinMQTT

But don't expect a user-friendly atmosphere there. If you only need QoS0 I can only advice to stay with this lib and fix the issues.

@BorisBrock
Copy link

He renamed his project: https://github.com/philbowles/PangolinMQTT

But don't expect a user-friendly atmosphere there. If you only need QoS0 I can only advice to stay with this lib and fix the issues.

I just tried the Pangolin lib. Sadly PlatformIO is not supported, so this is no option for me.
Too bad async-mqtt-client frequently starts to do wild disconnect/reconnect loops that break everything...

@obrain17
Copy link

obrain17 commented Nov 8, 2020

I have been working successfully with a slightly modified code of PangolinMQTT for a couple of months now.

It also compiles

  • with the -wall compiler option used by Ardiuno IDE if switched on:

werror 002

  • and with PlatformIO as well, that uses the -wall option by default

Also some improvements for Keep-Alive handling are added.

To make it available for all I just created a fork https://github.com/obrain17/PangolinMQTT and uploaded my changes

@bertmelis
Copy link
Contributor

Very nice, but I preferably have only one interface for everything on my ESPs. Pangolin can't handle payloads larger than memory allows. So OTA-over-MQTT is not possible which I always found extremely handy.

@bertmelis
Copy link
Contributor

Closing this issue. It's being worked on.

I must admit, the code in this lib is very readable. There are other libs where you can only guess the purpose of variables.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

8 participants