ESP8266: Pelion connect fails with 'send tcp data' timeout #11544

ccli8 · 2019-09-23T07:44:28Z

Description

Environment

mbed-os-example-pelion
656809a Merge pull request #14 from ARMmbed/remove-patches

mbed-clout-client
3.4.0

mbed-os
83fca60 Merge pull request #11454 from Tharazi97/LSI_VALUE_STM

Target
NU_PFM_M2351_NPSA_NS

Pelion connection has some failure rate. For my check, I enable ESP8266 debug log. From the log, pelion connection failure results from send tcp data failing. Some questions with it:

The first AT+CIPSEND=0,642 fails due to timeout. If I enlarge ESP8266_SEND_TIMEOUT from 2000ms to e.g. 10000ms, the failure rate can decrease. But what is reasonable value of ESP8266_SEND_TIMEOUT? It depends on network status.
When the first AT+CIPSEND=0,642 times out, a second AT+CIPSEND=0,642 is retried. But there is Recv 642 bytes from ESP8266 module in the first AT+CIPSEND=0,642. In my understanding, this means ESP8266 module has received 642 bytes from host mcu and is sending it out to remote peer. If the second AT+CIPSEND=0,642 succeeds, then 642 bytes is duplicate sent?


AT> AT+CIPSEND=0,642

AT? >%n
AT< OK
AT= >
AT? SEND OK%n
AT<

AT< Recv 642 bytes


AT(Timeout)

AT? %n
AT? %n

AT> AT+CIPSEND=0,642

AT? >%n
AT! busy
AT(Aborted)
AT? %n
AT<
AT? %n
AT> AT+CIPSEND=0,642
AT? >%n
AT! busy
AT(Aborted)
AT? %n
AT<
AT! busy
AT(Aborted)
AT? %n
AT<
AT! busy
AT(Aborted)

Issue request type

[ ] Question
[ ] Enhancement
[x] Bug

The text was updated successfully, but these errors were encountered:

0xc0170 · 2019-09-23T09:57:48Z

cc @ARMmbed/mbed-os-wan please review

ciarmcom · 2019-09-23T11:46:34Z

Internal Jira reference: https://jira.arm.com/browse/MBOCUSTRIA-1805

michalpasztamobica · 2019-10-09T10:44:01Z

Hi @ccli8 ,

Ad 1) About the ESP8266_SEND_TIMEOUT - I can see you found a workaround by manually overriding the macro value in mbed_app.json. I think the current timeout is quite a reasonable value, as we keep running this in our CI every day and tests are pretty much always passing. What I can offer is that we can add a more convenient way of changing this configuration, like it was don for some pins config in this PR: #11331

Ad 2) This line:

AT? SEND OK%n

Is a mere debug print showing what we expect to receive.
A valid send operation will return a line:

AT= SEND OK

And only this is a confirmation that the data was indeed sent.

Regarding the AT< Recv 642 bytes line...
After we send the CIPSEND AT command ESP will start receiving bytes over serial data until it receives the amount provided in the CPISEND arguments.
The printout comes from this line, indicating this is an unexpected line as we are in fact looking for SEND OK.
I am not surprised we are not counting on this line as I haven't found a single word about it in the Espressifs documentation. I can therefore only assume that this is some undocumented message confirming the reception of serial data. It does not mean that the data was sent over the air.

michalpasztamobica · 2019-10-09T13:48:19Z

@teetak01 , have you seen ESP8266's tcp send timeout causing Pelion connection to fail?

ccli8 · 2019-10-10T02:28:28Z

Without timeout, I can see Recv 642 bytes and then AT= SEND OK. When timeout occurs, I just see Recv 642 bytes but no AT= SEND OK. I ever saw it takes 10s+ to get AT= SEND OK, so I enlarge ESP8266_SEND_TIMEOUT. Confirm one thing: SEND OK is replied by ESP8266 module to inform to host mcu that the data has sent successfully over the air (to remote peer), or just debug print?

teetak01 · 2019-10-10T05:17:30Z

@michalpasztamobica does this not look similar to https://jira.arm.com/browse/ONME-4352 ?

[DBG ][ESPA]: ESP8266::send(): send failed
[DBG ][ESPI]: Postponing SIGIO from the device
[DBG ][ESPA]: busy s...
[DBG ][ESPA]: ESP8266::send(): didn't get ">"
[DBG ][ESPA]: ESP8266::send(): modem busy
[DBG ][ESPI]: Postponing SIGIO from the device

michalpasztamobica · 2019-10-10T07:40:11Z

I assumed that Pelion connect will be something happening at the very beginning of operation, meaning that there would be no data pending to be received. Also - @ccli8 mentioned that the problem disappears when he increases timeout to 10 seconds. In ONME-4352 ESP gets stuck for minutes...

But just to be sure...
@ccli8 , there is an issue we've identified, that ESP8266 will not let tcp send take place, reporting it is busy s..., however in fact there is unread received data pending, which filled up the ESP's buffer. Could that by any chance be your case? Is there any tcp read-out taking place within the 10 seconds, that did not occur within 2 seconds?

Regarding your question in the last post. I confirm, only AT= SEND OK means data was successfully sent (and this is documented in ESP's AT command manual). If this message does not show in the logs - the data was not sent.

Just a side note - you can enable some higher-level debug traces by adding a call to mbed_trace_init(); at the beginning of your program. This way you will get logs similar to what Teemu pasted above, which may be easier for your debugging purposes.

ccli8 · 2019-10-10T12:56:21Z

@ccli8 mentioned that the problem disappears when he increases timeout to 10 seconds

It doesn't completely disappear. In most cases, send tcp data can complete (confirmed by receiving AT= SEND OK) in 2s. In somes cases, it takes 10s+ to complete.

Could that by any chance be your case? Is there any tcp read-out taking place within the 10 seconds, that did not occur within 2 seconds?

During the 10 seconds for 'send tcp data', I don't see tcp read-out taking place.

I guess when network status is bad, ESP8266 module doesn't receive ack from remote peer, so 'send tcp data' is not completed, and then no SEND OK replied to host mcu. 2s timeout cannot address this case.

michalpasztamobica · 2019-10-10T13:12:44Z

Ok, so it seems your case is not the one, that @teetak01 mentioned.

Poor network connectivity state may always cause some issue with connecting. Is there a retry mechanism that you could use instead of increasing the deeply buried HW-dependent timeout?

If not, then coming back to my original idea - would it be helpful for you if we provided a possibility of configuring the ESP8266_SEND_TIMEOUT inside mbed_app.json, in a similar way to how the pins or baudrate are configured now?

michalpasztamobica · 2019-10-10T13:22:03Z

Hi @juhaylinen ,
I found this PR to be the last time when the ESP internal timeouts were modified. Can you help us understand how they were chosen and if it makes sense to increase them or not?

ccli8 · 2019-10-10T14:38:03Z

Is there a retry mechanism that you could use instead of increasing the deeply buried HW-dependent timeout?

According to the log of #11544 (comment), there has been a retry mechanism. But I have concern with return code of ESP8266::send(...) on timeout. On receiving AT< Recv 642 bytes (but no AT= SEND OK) from ESP8266 module , it should be seen as if the data has sent out and ESP8266::send(...) should return NSAPI_ERROR_OK instead of NSAPI_ERROR_WOULD_BLOCK, even though the data is just buffered in ESP8266 module. If return code is NSAPI_ERROR_WOULD_BLOCK, the 642 bytes data is re-sent by the retry mechanism. That's duplicate. But if NSAPI_ERROR_OK is returned, the upper layer would send new data. In this case, the return code should be NSAPI_ERROR_WOULD_BLOCK if ESP8266 module returns busy. This makes upper layer address the timeout issue and ESP8266_SEND_TIMEOUT just for serial latency between host mcu and ESP8266 module, not related to network status.

Allow configuring the ESP8266_SEND_TIMEOUT in mbed_app.json is fine. But I think above is the point.

michalpasztamobica · 2019-10-14T12:29:52Z

I see your point, @ccli8 . We are in the grey area here - the data has been successfully sent to ESP8266 over UART, and unofficially acknowledged with AT< Recv 642 bytes, but the process of actually sending them over the network has not succeeded, as the AT= SEND OK did not arrive in time.

Please, bear in mind - the AT< Recv is a totally undocumented message, it would be risky to rely on it in our driver. The only official acknowledgement of the data being sent is the AT= SEND OK. I think that it is justified to return a timeout if the official acknowledgement doesn't arrive.

To confirm if the data is sent twice or not - may I ask you to take another look into your logs? If the data is indeed sent twice we would see a stray AT= SEND OK from the ESP, coming after the resend mechanism is triggered. In other words - the number of AT< Recv 642 bytes messages would be the same as the number of AT= SEND OK messages, even though we think one of them timed out.
If there is one AT=SEND OK message less that AT< Recv message - then it means the data was not sent out twice and timeout was justified.

Either way - double packets in TCP are not an issue, but I understand we want to avoid them for battery savings.

ccli8 · 2019-10-15T03:47:36Z

To see duplicate packets easily, I decrease ESP8266_SEND_TIMEOUT to 200ms.

First accepted AT+CIPSEND=0,90, but timeout with not receiving AT= SEND OK:

AT> AT+CIPSEND=0,90
AT? >%n
AT< OK
AT= >
AT? SEND OK%n
AT<
AT< Recv 90 bytes
AT(Timeout)
AT? %n
AT? %n

Because NSAPI_ERROR_WOULD_BLOCK is returned, following AT+CIPSEND=0,90 are retried, but rejected. I think ESP8266 module is in sending out the first AT+CIPSEND=0,90:

AT> AT+CIPSEND=0,90
AT? >%n
AT! busy
AT? %*4[^"]
%n
AT= s...

AT(Aborted)
AT? %n
AT> AT+CIPSEND=0,90
AT? >%n
AT! busy
AT? %*4[^"]
%n
AT= s...

Second accepted AT+CIPSEND=0,90. Notice AT< SEND OK is interrupting before AT< OK and AT= >. It should come from the first accepted AT+CIPSEND=0,90 that times out. The interrupting AT< SEND OK is not handled and NSAPI_ERROR_WOULD_BLOCK is returned:

AT> AT+CIPSEND=0,90
AT? >%n
AT< SEND OK
AT< OK
AT= >
AT? SEND OK%n
AT<
AT! +IPD
AT? ,%*d,%n
AT= ,0,
AT? %*d
%n
AT= 755

Expect SEND OK to match the second accepted AT+CIPSEND=0,90, but the log is disturbed by incoming packet.

AT? SEND OK%n
AT! +IPD
AT? ,%*d,%n
AT= ,0,
AT? %*d
%n
AT= 941

AT? SEND OK%n
AT< Recv 90 bytes
AT(Timeout)
AT? %n
AT? %n

Still repeat AT+CIPSEND=0,90:

AT> AT+CIPSEND=0,90
AT? >%n
AT! busy
AT? %*4[^"]
%n
AT= s...

AT(Aborted)
AT? %n
AT> AT+CIPSEND=0,90
AT? >%n
AT! busy
AT? %*4[^"]
%n
AT= s...

AT(Aborted)
AT? %n

Finally, see AT< SEND OK. This should match the second accepted AT+CIPSEND=0,90.

AT> AT+CIPSEND=0,90
AT? >%n
AT! busy
AT? %*4[^"]
%n
AT= s...

AT(Aborted)
AT? %n
AT< SEND OK
AT! +IPD
AT? ,%*d,%n
AT= ,0,
AT? %*d
%n
AT= 948

AT? %n
AT! 0,CLOSED

See two AT< SEND OK, so data is duplicated.

Either way - double packets in TCP are not an issue, but I understand we want to avoid them for battery savings.

The duplicate comes from AT layer. TCP is unaware of it and still treats it as distinct packets.

ccli8 · 2019-10-15T06:00:59Z

I also do an experiment to show my above idea: decrease ESP8266_SEND_TIMEOUT to 200ms to meet timeout easily and skip SEND OK check

mbed-os/components/wifi/esp8266-driver/ESP8266/ESP8266.cpp > ESP8266::send(...):

    if (_parser.write((char *)data, (int)amount) >= 0 /*&& _parser.recv("SEND OK")*/) {
        ret = NSAPI_ERROR_OK;
    }

AT+CIPSEND=0,653 is accepted. Skip SEND OK check. Return NSAPI_ERROR_OK.

AT> AT+CIPSEND=0,653
AT? >%n
AT< OK
AT= >
AT? %n
AT<
AT< Recv 653 bytes
AT? %n
AT? %n

New data AT+CIPSEND=0,75 is rejected due to above pending AT+CIPSEND=0,653. Return NSAPI_ERROR_WOULD_BLOCK.

AT> AT+CIPSEND=0,75
AT? >%n
AT! busy
AT? %*4[^"]
%n
AT= s...

AT(Aborted)
AT? %n
AT> AT+CIPSEND=0,75
AT? >%n
AT! busy
AT? %*4[^"]
%n
AT= s...

AT(Aborted)
AT? %n

New data AT+CIPSEND=0,75 finally gets accepted. Skip SEND OK check. Notice the interrupting AT< SEND OK for above pending AT+CIPSEND=0,653. Return NSAPI_ERROR_OK.

AT> AT+CIPSEND=0,75
AT? >%n
AT< SEND OK
AT< OK
AT= >
AT? %n
AT<
AT< Recv 75 bytes
AT? %n

Another new data AT+CIPSEND=0,84 is rejected due to above pending AT+CIPSEND=0,75. Return NSAPI_ERROR_WOULD_BLOCK.

AT> AT+CIPSEND=0,84
AT? >%n
AT! busy
AT? %*4[^"]
%n
AT= s...

AT(Aborted)
AT? %n
AT> AT+CIPSEND=0,84
AT? >%n
AT! busy
AT? %*4[^"]
%n
AT= s...

AT(Aborted)
AT? %n

Finally, new data AT+CIPSEND=0,84 gets accepted. Notice the AT< SEND OK for above pending AT+CIPSEND=0,75.

AT? %n
AT< SEND OK
AT> AT+CIPSEND=0,84
AT? >%n
AT< OK
AT= >
AT? %n
AT<
AT< Recv 84 bytes
AT? %n

michalpasztamobica · 2019-10-15T08:18:41Z

@ccli8 , thanks a lot for these logs and explanation. You definitely have a point here - the driver's behavior needs improvement. Extending the timeout is a good workaround, but I think we should rework the driver:

Increase the send timeout, to at least see the "busy" message, preferably make it dependent on the amount of data sent over serial and the baud rate.
Respawn the timeout in case of "busy s", but with some reasonable total timeout (30s?) still in place.
(this is risky) include the "Recv XXX bytes" in our calculations - although undocumented, this is an important moment of communication...

FYI, @AnttiKauppila

ccli8 · 2019-12-17T05:48:09Z

Update for it?

michalpasztamobica · 2019-12-17T07:30:07Z

Hi @ccli8 . We are actively working on this in the current sprint. Just trying to figure out the least intrusive change that would make most sense.

1. Support Nuvoton targets: - NUMAKER_IOT_M487 V1.3 Compared to V1.2, V1.3 adds support for ESP8266 RTS/CTS/RST pins. V1.3 is incompatible with V1.2 on ESP8266. 2. Change storage to NUSD (SD card in SDIO bus mode) from SD (SD card SPI bus mode) (1) Add COMPONENT_NUSD.lib. (2) Add component NUSD (target.components_add). (3) Change default BlockDevice to NUSD: Override BlockDevice::get_default_instance (nusd.provide-default-blockdevice). Required since mbed-cloud-client 2.1.0 (arm_uc_blockdevice_ext) or in: mbed-bootloader/modules/storage/pal-blockdevice/source/arm_uc_pal_blockdevice_mbed.cpp (4) Enable kvstore with blockdevice type being "other". i) Override get_other_blockdevice() (nusd.provide-kvstore-other-blockdevice). ii) Configure storage type to "FILESYSTEM". iii)Configure file system type to "LITTLE". vi) Configure block device type to "other". v) Configure external size to 64MiB, which cannot overlap with update-client. storage-address/update-client.storage-size. 3. Change back UARTSerial tx/rx buffer size from 1024/1024 to 256/256. This can reduce memory footprint by (1024 - 256) * 2 = 1.5KiB. Because this configuration influences both ESP8266 and default console (platfrom/mbed_retarget.cpp), memory reduction can achieve 1.5KiB x 2 = 3KiB. 4. Enlarge ESP8266 'send tcp data' timeout on Nuvoton targets Pelion connection has some failure rate with this. Enlarging ESP8266_SEND_TIMEOUT can just relieve the issue. A ticket has raised to address it: ARMmbed/mbed-os#11544

1. Support Nuvoton targets: - NUMAKER_IOT_M263A 2. Change storage to NUSD (SD card in SDIO bus mode) from SD (SD card SPI bus mode) (1) Add COMPONENT_NUSD.lib. (2) Add component NUSD (target.components_add). (3) Change default BlockDevice to NUSD: Override BlockDevice::get_default_instance (nusd.provide-default-blockdevice). Required since mbed-cloud-client 2.1.0 (arm_uc_blockdevice_ext) or in: mbed-bootloader/modules/storage/pal-blockdevice/source/arm_uc_pal_blockdevice_mbed.cpp (4) Enable kvstore with blockdevice type being "other". i) Override get_other_blockdevice() (nusd.provide-kvstore-other-blockdevice). ii) Configure storage type to "FILESYSTEM". iii)Configure file system type to "LITTLE". vi) Configure block device type to "other". v) Configure external size to 64MiB, which cannot overlap with update-client. storage-address/update-client.storage-size. 3. Change back UARTSerial tx/rx buffer size from 1024/1024 to 256/256. This can reduce memory footprint by (1024 - 256) * 2 = 1.5KiB. Because this configuration influences both ESP8266 and default console (platfrom/mbed_retarget.cpp), memory reduction can achieve 1.5KiB x 2 = 3KiB. 4. Enlarge ESP8266 'send tcp data' timeout on Nuvoton targets Pelion connection has some failure rate with this. Enlarging ESP8266_SEND_TIMEOUT can just relieve the issue. A ticket has raised to address it: ARMmbed/mbed-os#11544

ciarmcom added type: bug Jira status: OPEN labels Sep 23, 2019

ciarmcom added the mirrored label Sep 23, 2019

ciarmcom added Jira status: IN PROGRESS and removed Jira status: OPEN labels Dec 4, 2019

michalpasztamobica mentioned this issue Dec 20, 2019

ESP8266: Avoid duplicate data sends #12157

Merged

ciarmcom added Jira status: RESOLVED and removed Jira status: IN PROGRESS labels Jan 10, 2020

0xc0170 closed this as completed in #12157 Jan 15, 2020

ciarmcom added Jira status: CLOSED and removed Jira status: RESOLVED labels Jan 15, 2020

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

ESP8266: Pelion connect fails with 'send tcp data' timeout #11544

ESP8266: Pelion connect fails with 'send tcp data' timeout #11544

ccli8 commented Sep 23, 2019

0xc0170 commented Sep 23, 2019

ciarmcom commented Sep 23, 2019

michalpasztamobica commented Oct 9, 2019

michalpasztamobica commented Oct 9, 2019

ccli8 commented Oct 10, 2019

teetak01 commented Oct 10, 2019

michalpasztamobica commented Oct 10, 2019

ccli8 commented Oct 10, 2019

michalpasztamobica commented Oct 10, 2019

michalpasztamobica commented Oct 10, 2019

ccli8 commented Oct 10, 2019 •

edited

Loading

michalpasztamobica commented Oct 14, 2019

ccli8 commented Oct 15, 2019

ccli8 commented Oct 15, 2019 •

edited

Loading

michalpasztamobica commented Oct 15, 2019

ccli8 commented Dec 17, 2019

michalpasztamobica commented Dec 17, 2019

ESP8266: Pelion connect fails with 'send tcp data' timeout #11544

ESP8266: Pelion connect fails with 'send tcp data' timeout #11544

Comments

ccli8 commented Sep 23, 2019

Description

Environment

Issue request type

0xc0170 commented Sep 23, 2019

ciarmcom commented Sep 23, 2019

michalpasztamobica commented Oct 9, 2019

michalpasztamobica commented Oct 9, 2019

ccli8 commented Oct 10, 2019

teetak01 commented Oct 10, 2019

michalpasztamobica commented Oct 10, 2019

ccli8 commented Oct 10, 2019

michalpasztamobica commented Oct 10, 2019

michalpasztamobica commented Oct 10, 2019

ccli8 commented Oct 10, 2019 • edited Loading

michalpasztamobica commented Oct 14, 2019

ccli8 commented Oct 15, 2019

ccli8 commented Oct 15, 2019 • edited Loading

michalpasztamobica commented Oct 15, 2019

ccli8 commented Dec 17, 2019

michalpasztamobica commented Dec 17, 2019

ccli8 commented Oct 10, 2019 •

edited

Loading

ccli8 commented Oct 15, 2019 •

edited

Loading