Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

MQTT fails on long messages #2

Closed
svdgraaf opened this issue Mar 25, 2016 · 26 comments
Closed

MQTT fails on long messages #2

svdgraaf opened this issue Mar 25, 2016 · 26 comments

Comments

@svdgraaf
Copy link
Contributor

I think I've found an issue. When I publish a message with a quote it in ", the message is never received on the iot end. When I publish a message without json, it does work:

The topic I subscribed to is $aws/things/xyz/shadow/update.

With regular text:

connecting to wifi
Connected
socket connected
MQTT connecting
MQTT connected
MQTT subscribed
Message 1 arrived: qos 1, retained 0, dup 0, packetid 2
Payload 
Message 2 arrived: qos 1, retained 0, dup 0, packetid 4
Payload foobar
Message 3 arrived: qos 1, retained 0, dup 0, packetid 5
Payload {}
Message 4 arrived: qos 1, retained 0, dup 0, packetid 6
Payload foo

With quotes:

connecting to wifi
Connected
socket connected
MQTT connecting
MQTT connected
MQTT subscribed

No response ;)

@svdgraaf
Copy link
Contributor Author

Hmmm, after some more testing, it seems to be that is something weird going on with the length of the string. If publish {"barrr"}, it doesn't seem to be receiving anything, but after a while, the esp will reboot (probably buffer overflow?). If I publish a slightly shorter string {"barr"}, it just works.

I'll see if I can find why.

@svdgraaf svdgraaf changed the title MQTT fails on json/quote? MQTT fails on long messages Mar 25, 2016
@svdgraaf
Copy link
Contributor Author

I can confirm that it works both ways (receiving and publishing), longer messages are not processed...

@svdgraaf
Copy link
Contributor Author

Uuuuuugh, got it! Took me a while! :D

Topic needs to be a const char*, it's probably used a lot, so not passing it by reference will eat up quite some memory probably ;)

Awesome, it works! :D woohoo!

@svdgraaf
Copy link
Contributor Author

Hmmm, there still is an issue with either long messages, or a long topic name. If a have a long Topic name, eg: $aws/things/your-device/shadow/update, and a long message: {"reported":{"foo": "bar"}, desired:{}}. Nothing will be delivered.
If I have a short message and a long topic, it works. A short topic and a long message wil work too. I haven't found where the bug is yet, or what the maximum amount of characters is.

@svdgraaf svdgraaf reopened this Mar 28, 2016
@svdgraaf
Copy link
Contributor Author

Ha! Got it :)

It was a bit of a search, but the Paho client is by default limited to max packages of X size. So ofcourse, this would result message failures when the topic + message gets a bit longer.

Changing:
MQTT::Client<IPStack, Countdown, 50, 1> client = MQTT::Client<IPStack, Countdown, 50, 1>(ipstack);
into:
MQTT::Client<IPStack, Countdown, 255, 1> client = MQTT::Client<IPStack, Countdown, 255, 1>(ipstack);

Fixed the issue, the 255 is the max length of the MQTT packages (the longer, the memory it will use as well, ofcourse). This is also valid for the incoming package length (so for receiving messages from the outside). Pretty important :)

@odelot
Copy link
Owner

odelot commented Mar 28, 2016

@svdgraaf I've notice this testing today. I've implemented the auto-reconnection feature and fix the need of that wait for connection workaround. Of course, it would be very nice to stress test these features ;-)

next target.... remove the need for bytebuffer

@svdgraaf
Copy link
Contributor Author

Awesome, I'll run some more testing tomorrow :)

Cheers!
Sander

On 28 mrt. 2016 20:58 +0200, odelotnotifications@github.com, wrote:

@svdgraaf(https://github.com/svdgraaf)I've notice this testing today. I've implemented the auto-reconnection feature and fix the need of that wait for connection workaround. Of course, it would be very nice to stress test these features ;-)

next target.... remove the need for bytebuffer


You are receiving this because you were mentioned.
Reply to this email directly orview it on GitHub(#2 (comment))

@odelot
Copy link
Owner

odelot commented Apr 1, 2016

hi @svdgraaf, let me ask you something?

are you using the stage or stable version of arduino esp8266 code? 2.1.0 or 2.1.0-rc2?

I am investigating the memory leak after many reconnections. Tried the stable version and nothing worked, neither the first connection. O.o

@svdgraaf
Copy link
Contributor Author

svdgraaf commented Apr 2, 2016

I'm running a git checkout, it's revision 633e48f3aec5f1c3c11d4498fc90d378d49e6e9f. I haven't had these issues yet (on a Adafruit huzzah board).

@odelot
Copy link
Owner

odelot commented Apr 3, 2016

I've got the latest version from git and I'm still having problems

this simple code shows that heapspace get smaller after each reconnection

        const char* host = "google.com";
    WiFiClient c;
    for (int i =0; i< 100; i+=1) {
        Serial.print (i);
        Serial.print ("-");
        Serial.println (ESP.getFreeHeap());
        if (c.connect (host,80)) {
            delay (100);
            c.flush ();
            c.stop ();
            delay (100);
        } else {
            Serial.println ("connection refused");
        }
    } 

this code is using esp8266 core :-( my board is a generic one but It seems to be a software issue.

@daniele-salvagni
Copy link

I tested that code and I had the same outcome, the heap keeps shrinking after every loop. I did a quick search and it looks like that you are not the first one to notice this, here is a blog post about that (see "The Unexpected Discovery" paragraph): http://internetofhomethings.com/homethings/?p=1232

And a couple issues I found showing more details and possible solutions: esp8266/Arduino#230, esp8266/Arduino#1070

Edit: I did some more testing, here is the output of 200 iterations with a 3 minutes delay in the middle. The heap does recover and this confirms that the issue is caused only because "closed connections go into TIME_WAIT state where they must be kept by the TCP stack for 2xMSL interval".

0-46128
1-45904
2-45712
[...]
97-27472
98-27280
99-27088
Sleep for 3 minutes
100-46128
101-45904
102-45712
[...]
197-27472
198-27280
199-27088

@odelot
Copy link
Owner

odelot commented Apr 4, 2016

thank you @daniele-salvagni, this mod fixed the issue esp8266/Arduino#230 (comment)

strange that it is not in the master branch yet.

@svdgraaf
Copy link
Contributor Author

svdgraaf commented Apr 8, 2016

I can concur that I'm having the same issue, I just did a recompile with the tcp_abort(_pcb); fix, let's see how long it will run.

btw: I had issues with connecting to the mqtt from the latest checkout on the master esp8266 branch, I'm back to my above mentioned hash now, and everything works like a charm so far.

@odelot
Copy link
Owner

odelot commented Apr 8, 2016

I'm still investigating before close issue #5

tcp_abort(_pcb) improved, but I am still losing memory. I am looking at the moment in the getCurrentTime function. Maybe it is a good idea to reuse the WifiClientobject. It seems that we lose some memory there. The problem is find free time to work on this :-( I will keep you informed about my findings

@svdgraaf
Copy link
Contributor Author

I'm now posting the heap size in my latest build to AWS as well, and plotting it into Kibana: screenshot 2016-04-12 08 03 26

The line is pretty lineair, so that's good 😉 I will see if I can find the time to debug the getCurrentTime() method. I saw some other library which used a ntp server for that. Might be a bit overkill though.

@odelot
Copy link
Owner

odelot commented Apr 12, 2016

@svdgraaf it is a great way to keep track of memory

Here is my findings:

Ive cleaned up the code a little bit (commited it already*) and changed the example (see below).

Now it seems that memory usage is almost ok, althought we need to test and investigate a little more. getCurrentTime is ok as well

Between the first and second connection, we lose 316 bytes (maybe the first subscribe? need to investigate it). After this, the memory seems to float but coming back eventually to the same level that was recorded in the second connection. When we got an error trying to connect to mqtt server, we lose 168 bytes as well.

program execution... free heap space between (parentheses)

connected
51191 - conn: 1 - (27744)
websocket layer connected / MQTT connecting / MQTT connected / MQTT subscribed
362290 - conn: 2 - (27408)
websocket layer connected / MQTT connecting / MQTT connected / MQTT subscribed
672864 - conn: 3 - (27408)
websocket layer connected / MQTT connecting / MQTT connected / MQTT subscribed
984979 - conn: 4 - (27408)
websocket layer connected / MQTT connecting / MQTT connected / MQTT subscribed
1298342 - conn: 5 - (13176)
websocket layer connected / MQTT connecting / MQTT connected / MQTT subscribed
1611359 - conn: 6 - (13176)
websocket layer connected / MQTT connecting / MQTT connected / MQTT subscribed
1922638 - conn: 7 - (27408)
websocket layer connected / MQTT connecting / MQTT connected / MQTT subscribed
2233730 - conn: 8 - (27408)
websocket layer connected / MQTT connecting / 
error connection to MQTT server2
2241845 - conn: 9 - (13176)
websocket layer connected / MQTT connecting / MQTT connected / MQTT subscribed
2551782 - conn: 10 - (27240)
websocket layer connected / MQTT connecting / MQTT connected / MQTT subscribed
2862456 - conn: 11 - (27240)
websocket layer connected / MQTT connecting / MQTT connected / MQTT subscribed

and here is my new example code (still didnt commit it*, I need to test more):

  • note that I've used 6000 bytes as circular buffer between WS layer and MQTT layer and 2056 bytes as max mqtt message lenght... these are far more than necessary, but help me to accelerate the soft reset caused by the memory leak
  • I transformed the mqtt client as a pointer to test its destroyer. I noticed that when mqtt connection got an error about 10000 bytes was lost, so I hope destroying it every connection helps us to avoid this. (interesting that after switching to pointer, we still lost 168 bytes when MQTT server could not connect)

* I probably should make a branch to avoid been commiting all the time. Sorry, I'am used to CVS, SVN, but git is still new to me. I will try to learn all git stuff in the near future ;-)

#include <Arduino.h>
#include <Stream.h>

#include <ESP8266WiFi.h>
#include <ESP8266WiFiMulti.h>

//AWS
#include "sha256.h"
#include "Utils.h"
#include "AWSClient2.h"

//WEBSockets
#include <Hash.h>
#include <WebSocketsClient.h>

//MQTT PAHO
#include <SPI.h>
#include <IPStack.h>
#include <Countdown.h>
#include <MQTTClient.h>



//AWS MQTT Websocket
#include "Client.h"
#include "AWSWebSocketClient.h"
#include "CircularByteBuffer.h"

//#define DEBUG_ESP_CORE
#define MQTT_DEBUG

//AWS IOT config, change these:
char wifi_ssid[]       = "ssid";
char wifi_password[]   = "pass";
char aws_endpoint[]    = "endpoit.iot.us-west-2.amazonaws.com";
char aws_key[]         = "AWSKEY";
char aws_secret[]      = "AWSSECRET";
char aws_region[]      = "us-west-2";
const char* aws_topic  = "$aws/things/owner-group-type-id/shadow/update";
int port = 443;

//MQTT config
const int maxMQTTpackageSize = 2056;
const int maxMQTTMessageHandlers = 1;

ESP8266WiFiMulti WiFiMulti;

AWSWebSocketClient awsWSclient(6000);

IPStack ipstack(awsWSclient);
MQTT::Client<IPStack, Countdown, maxMQTTpackageSize, maxMQTTMessageHandlers> *client = NULL;

//# of connections
long connection = 0;

//generate random mqtt clientID
char* generateClientID () {
  char* cID = new char[23]();
  for (int i=0; i<22; i+=1)
    cID[i]=(char)random(1, 256);
  return cID;
}

//count messages arrived
int arrivedcount = 0;

//callback to handle mqtt messages
void messageArrived(MQTT::MessageData& md)
{
  MQTT::Message &message = md.message;

  Serial.print("Message ");
  Serial.print(++arrivedcount);
  Serial.print(" arrived: qos ");
  Serial.print(message.qos);
  Serial.print(", retained ");
  Serial.print(message.retained);
  Serial.print(", dup ");
  Serial.print(message.dup);
  Serial.print(", packetid ");
  Serial.println(message.id);
  Serial.print("Payload ");
  char* msg = new char[message.payloadlen+1]();
  memcpy (msg,message.payload,message.payloadlen);
  Serial.println(msg);
  delete msg;
}

//connects to websocket layer and mqtt layer
bool connect () {

    if (client == NULL) {
      client = new MQTT::Client<IPStack, Countdown, maxMQTTpackageSize, maxMQTTMessageHandlers>(ipstack);
    } else {

      if (client->isConnected ()) {    
        client->disconnect ();
      }  
      delete client;
      client = new MQTT::Client<IPStack, Countdown, maxMQTTpackageSize, maxMQTTMessageHandlers>(ipstack);
    }


    //delay is not necessary... it just help us to get a "trustful" heap space value
    delay (1000);
    Serial.print (millis ());
    Serial.print (" - conn: ");
    Serial.print (++connection);
    Serial.print (" - (");
    Serial.print (ESP.getFreeHeap ());
    Serial.println (")");




   int rc = ipstack.connect(aws_endpoint, port);
    if (rc != 1)
    {
      Serial.println("error connection to the websocket server");
      return false;
    } else {
      Serial.println("websocket layer connected");
    }


    Serial.println("MQTT connecting");
    MQTTPacket_connectData data = MQTTPacket_connectData_initializer;
    data.MQTTVersion = 3;
    char* clientID = generateClientID ();
    data.clientID.cstring = clientID;
    rc = client->connect(data);
    delete[] clientID;
    if (rc != 0)
    {
      Serial.print("error connection to MQTT server");
      Serial.println(rc);
      return false;
    }
    Serial.println("MQTT connected");
    return true;
}

//subscribe to a mqtt topic
void subscribe () {
   //subscript to a topic
    int rc = client->subscribe(aws_topic, MQTT::QOS0, messageArrived);
    if (rc != 0) {
      Serial.print("rc from MQTT subscribe is ");
      Serial.println(rc);
      return;
    }
    Serial.println("MQTT subscribed");
}

//send a message to a mqtt topic
void sendmessage () {
    //send a message
    MQTT::Message message;
    char buf[100];
    strcpy(buf, "{\"state\":{\"reported\":{\"on\": false}, \"desired\":{\"on\": false}}}");
    message.qos = MQTT::QOS0;
    message.retained = false;
    message.dup = false;
    message.payload = (void*)buf;
    message.payloadlen = strlen(buf)+1;
    int rc = client->publish(aws_topic, message); 
}


void setup() {
    Serial.begin (115200);
    delay (2000);
    Serial.setDebugOutput(1);

    //fill with ssid and wifi password
    WiFiMulti.addAP(wifi_ssid, wifi_password);
    Serial.println ("connecting to wifi");
    while(WiFiMulti.run() != WL_CONNECTED) {
        delay(100);
        Serial.print (".");
    }
    Serial.println ("\nconnected");

    //fill AWS parameters    
    awsWSclient.setAWSRegion(aws_region);
    awsWSclient.setAWSDomain(aws_endpoint);
    awsWSclient.setAWSKeyID(aws_key);
    awsWSclient.setAWSSecretKey(aws_secret);
    awsWSclient.setUseSSL(true);

    if (connect ()){
      subscribe ();
      //sendmessage ();
    }

}

void loop() {
  //keep the mqtt up and running
  if (awsWSclient.connected ()) {    
      client->yield();
  } else {
    //handle reconnection
    if (connect ()){
      subscribe ();      
    }
  }

}

@odelot
Copy link
Owner

odelot commented Apr 14, 2016

after 28 hours, the esp8266 crashed again... but it seems not be caused by memory leak

I will try to investigate this exception (29) but the memory issue seems to be under control

have any of you guys already had problems with exception 29?

100187627 - conn: 372 - (27240)
websocket layer connected / MQTT connecting / MQTT connected / MQTT subscribed
100498617 - conn: 373 - (27240)
websocket layer connected / MQTT connecting / MQTT connected / MQTT subscribed
100809295 - conn: 374 - (27240)
websocket layer connected / MQTT connecting / MQTT connected / MQTT subscribed

Exception (29):
epc1=0x4000e1b2 epc2=0x00000000 epc3=0x00000000 excvaddr=0x00000000 depc=0x00000000

*note that memory usage (27240) from connection 374 is the same from the connection from last post

@svdgraaf
Copy link
Contributor Author

svdgraaf commented Apr 14, 2016

Nice! I haven't been able to look at your new code yet, but sounds very promising! :)

On 14 apr. 2016 14:16 +0200, odelotnotifications@github.com, wrote:

after 28 hours, the esp8266 crashed again... but it seems not be caused by memory leak

I will try to investigate this exception (29) but the memory issue seems to be under control

have any of you guys had already problems with exception 29?

100187627 - conn: 372 - (27240) websocket layer connected / MQTT connecting / MQTT connected / MQTT subscribed 100498617 - conn: 373 - (27240) websocket layer connected / MQTT connecting / MQTT connected / MQTT subscribed 100809295 - conn: 374 - (27240) websocket layer connected / MQTT connecting / MQTT connected / MQTT subscribed Exception (29): epc1=0x4000e1b2 epc2=0x00000000 epc3=0x00000000 excvaddr=0x00000000 depc=0x00000000

*note that memory usage (27240) from connection 374 is the same from the connection from last post


You are receiving this because you were mentioned.
Reply to this email directly orview it on GitHub(#2 (comment))

@svdgraaf
Copy link
Contributor Author

After some looking around, the 29 exception does point to an invalid reference somewhere (trying to overwrite a value in virtual memory which is prohibited). Can you perhaps create push a branch with your current code? I can hook up a debugger, and see where the issue is.

@svdgraaf
Copy link
Contributor Author

svdgraaf commented Apr 16, 2016

I figured my esp must have crashed by now, but hadn't checked the actual output. I just did, and it's still running (~29 hours now), the heap size is stable for the last few hours around 10k. It's posting to a mqtt channel every 10 seconds. I'm running 633e48f3aec5f1c3c11d4498fc90d378d49e6e9f of the esp8266 repo, with the manual 'fix' for the tcp_abort(). Let's see how long it will last...

screenshot 2016-04-16 12 45 47

@svdgraaf
Copy link
Contributor Author

Still going strong after 49 hours... Heap size still the same

@odelot
Copy link
Owner

odelot commented Apr 18, 2016

@svdgraaf the code at github is my most up to date code... I just didn't upload the example code yet, but I've posted it above. it seems that the memory issue was fixed.

@aaroneiche
Copy link

I found the sample code in the above comment to be tremendously helpful. The Sample I'd been working off previously was failing frequently on connection. This sample worked great :) I am connected and successfully receiving messages.

@aaroneiche
Copy link

It seems I spoke too soon - I'm still running into issues getting the websocket connection to work:


connected
11904 - conn: 1 - (26504)
please start sntp first !
[AWSc] Disconnected!
please start sntp first !
[AWSc] Disconnected!
please start sntp first !
pm open,type:2 0
[AWSc] Disconnected!
please start sntp first !
[AWSc] Disconnected!
please start sntp first !
error connection to the websocket server
19491 - conn: 2 - (11360)
[AWSc] Disconnected!
please start sntp first !
[AWSc] Disconnected!
please start sntp first !
[AWSc] Disconnected!
please start sntp first !
[AWSc] Disconnected!
please start sntp first !
[AWSc] Disconnected!
please start sntp first !
error connection to the websocket server
26987 - conn: 3 - (11360)
[AWSc] Disconnected!
please start sntp first !
[AWSc] Disconnected!
please start sntp first !
[AWSc] Disconnected!
please start sntp first !
[AWSc] Disconnected!
please start sntp first !
[AWSc] Disconnected!
please start sntp first !
error connection to the websocket server
34480 - conn: 4 - (11360)

It just keeps on rolling like that. Any suggestions on what to do to make the websocket connect more reliable? I got it working once, but now I have no idea what make that work.

@jatindera
Copy link

Hi,
Till yesterday, everything was working fine for me. I connected my ESP8266 development board to AWS IoT. Today morning, I connected it again without making any change and I started getting following error:

please start sntp first !
error connection to the websocket server

Appreciate your help on this.

@sh4nnongoh
Copy link

Regarding the SNTP issue:

Add the following line to the Setup section,

configTime(3 * 3600, 0, "pool.ntp.org", "time.nist.gov");

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

6 participants