Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ESP8266: Pelion connect fails with 'send tcp data' timeout #11544

Closed
ccli8 opened this issue Sep 23, 2019 · 17 comments · Fixed by #12157
Closed

ESP8266: Pelion connect fails with 'send tcp data' timeout #11544

ccli8 opened this issue Sep 23, 2019 · 17 comments · Fixed by #12157

Comments

@ccli8
Copy link
Contributor

ccli8 commented Sep 23, 2019

Description

Environment

mbed-os-example-pelion
656809a Merge pull request #14 from ARMmbed/remove-patches

mbed-clout-client
3.4.0

mbed-os
83fca60 Merge pull request #11454 from Tharazi97/LSI_VALUE_STM

Target
NU_PFM_M2351_NPSA_NS

Pelion connection has some failure rate. For my check, I enable ESP8266 debug log. From the log, pelion connection failure results from send tcp data failing. Some questions with it:

  1. The first AT+CIPSEND=0,642 fails due to timeout. If I enlarge ESP8266_SEND_TIMEOUT from 2000ms to e.g. 10000ms, the failure rate can decrease. But what is reasonable value of ESP8266_SEND_TIMEOUT? It depends on network status.
  2. When the first AT+CIPSEND=0,642 times out, a second AT+CIPSEND=0,642 is retried. But there is Recv 642 bytes from ESP8266 module in the first AT+CIPSEND=0,642. In my understanding, this means ESP8266 module has received 642 bytes from host mcu and is sending it out to remote peer. If the second AT+CIPSEND=0,642 succeeds, then 642 bytes is duplicate sent?

AT> AT+CIPSEND=0,642

AT? >%n
AT< OK
AT= >
AT? SEND OK%n
AT<

AT< Recv 642 bytes


AT(Timeout)

AT? %n
AT? %n

AT> AT+CIPSEND=0,642

AT? >%n
AT! busy
AT(Aborted)
AT? %n
AT<
AT? %n
AT> AT+CIPSEND=0,642
AT? >%n
AT! busy
AT(Aborted)
AT? %n
AT<
AT! busy
AT(Aborted)
AT? %n
AT<
AT! busy
AT(Aborted)

Issue request type

[ ] Question
[ ] Enhancement
[x] Bug
@0xc0170
Copy link
Contributor

0xc0170 commented Sep 23, 2019

cc @ARMmbed/mbed-os-wan please review

@ciarmcom
Copy link
Member

Internal Jira reference: https://jira.arm.com/browse/MBOCUSTRIA-1805

@michalpasztamobica
Copy link
Contributor

Hi @ccli8 ,

Ad 1) About the ESP8266_SEND_TIMEOUT - I can see you found a workaround by manually overriding the macro value in mbed_app.json. I think the current timeout is quite a reasonable value, as we keep running this in our CI every day and tests are pretty much always passing. What I can offer is that we can add a more convenient way of changing this configuration, like it was don for some pins config in this PR: #11331

Ad 2) This line:

AT? SEND OK%n

Is a mere debug print showing what we expect to receive.
A valid send operation will return a line:

AT= SEND OK

And only this is a confirmation that the data was indeed sent.

Regarding the AT< Recv 642 bytes line...
After we send the CIPSEND AT command ESP will start receiving bytes over serial data until it receives the amount provided in the CPISEND arguments.
The printout comes from this line, indicating this is an unexpected line as we are in fact looking for SEND OK.
I am not surprised we are not counting on this line as I haven't found a single word about it in the Espressifs documentation. I can therefore only assume that this is some undocumented message confirming the reception of serial data. It does not mean that the data was sent over the air.

@michalpasztamobica
Copy link
Contributor

@teetak01 , have you seen ESP8266's tcp send timeout causing Pelion connection to fail?

@ccli8
Copy link
Contributor Author

ccli8 commented Oct 10, 2019

Without timeout, I can see Recv 642 bytes and then AT= SEND OK. When timeout occurs, I just see Recv 642 bytes but no AT= SEND OK. I ever saw it takes 10s+ to get AT= SEND OK, so I enlarge ESP8266_SEND_TIMEOUT. Confirm one thing: SEND OK is replied by ESP8266 module to inform to host mcu that the data has sent successfully over the air (to remote peer), or just debug print?

@teetak01
Copy link
Contributor

@michalpasztamobica does this not look similar to https://jira.arm.com/browse/ONME-4352 ?

[DBG ][ESPA]: ESP8266::send(): send failed
[DBG ][ESPI]: Postponing SIGIO from the device
[DBG ][ESPA]: busy s...
[DBG ][ESPA]: ESP8266::send(): didn't get ">"
[DBG ][ESPA]: ESP8266::send(): modem busy
[DBG ][ESPI]: Postponing SIGIO from the device

@michalpasztamobica
Copy link
Contributor

I assumed that Pelion connect will be something happening at the very beginning of operation, meaning that there would be no data pending to be received. Also - @ccli8 mentioned that the problem disappears when he increases timeout to 10 seconds. In ONME-4352 ESP gets stuck for minutes...

But just to be sure...
@ccli8 , there is an issue we've identified, that ESP8266 will not let tcp send take place, reporting it is busy s..., however in fact there is unread received data pending, which filled up the ESP's buffer. Could that by any chance be your case? Is there any tcp read-out taking place within the 10 seconds, that did not occur within 2 seconds?

Regarding your question in the last post. I confirm, only AT= SEND OK means data was successfully sent (and this is documented in ESP's AT command manual). If this message does not show in the logs - the data was not sent.

Just a side note - you can enable some higher-level debug traces by adding a call to mbed_trace_init(); at the beginning of your program. This way you will get logs similar to what Teemu pasted above, which may be easier for your debugging purposes.

@ccli8
Copy link
Contributor Author

ccli8 commented Oct 10, 2019

@ccli8 mentioned that the problem disappears when he increases timeout to 10 seconds

It doesn't completely disappear. In most cases, send tcp data can complete (confirmed by receiving AT= SEND OK) in 2s. In somes cases, it takes 10s+ to complete.

Could that by any chance be your case? Is there any tcp read-out taking place within the 10 seconds, that did not occur within 2 seconds?

During the 10 seconds for 'send tcp data', I don't see tcp read-out taking place.

I guess when network status is bad, ESP8266 module doesn't receive ack from remote peer, so 'send tcp data' is not completed, and then no SEND OK replied to host mcu. 2s timeout cannot address this case.

@michalpasztamobica
Copy link
Contributor

Ok, so it seems your case is not the one, that @teetak01 mentioned.

Poor network connectivity state may always cause some issue with connecting. Is there a retry mechanism that you could use instead of increasing the deeply buried HW-dependent timeout?

If not, then coming back to my original idea - would it be helpful for you if we provided a possibility of configuring the ESP8266_SEND_TIMEOUT inside mbed_app.json, in a similar way to how the pins or baudrate are configured now?

@michalpasztamobica
Copy link
Contributor

Hi @juhaylinen ,
I found this PR to be the last time when the ESP internal timeouts were modified. Can you help us understand how they were chosen and if it makes sense to increase them or not?

@ccli8
Copy link
Contributor Author

ccli8 commented Oct 10, 2019

Is there a retry mechanism that you could use instead of increasing the deeply buried HW-dependent timeout?

According to the log of #11544 (comment), there has been a retry mechanism. But I have concern with return code of ESP8266::send(...) on timeout. On receiving AT< Recv 642 bytes (but no AT= SEND OK) from ESP8266 module , it should be seen as if the data has sent out and ESP8266::send(...) should return NSAPI_ERROR_OK instead of NSAPI_ERROR_WOULD_BLOCK, even though the data is just buffered in ESP8266 module. If return code is NSAPI_ERROR_WOULD_BLOCK, the 642 bytes data is re-sent by the retry mechanism. That's duplicate. But if NSAPI_ERROR_OK is returned, the upper layer would send new data. In this case, the return code should be NSAPI_ERROR_WOULD_BLOCK if ESP8266 module returns busy. This makes upper layer address the timeout issue and ESP8266_SEND_TIMEOUT just for serial latency between host mcu and ESP8266 module, not related to network status.

Allow configuring the ESP8266_SEND_TIMEOUT in mbed_app.json is fine. But I think above is the point.

@michalpasztamobica
Copy link
Contributor

I see your point, @ccli8 . We are in the grey area here - the data has been successfully sent to ESP8266 over UART, and unofficially acknowledged with AT< Recv 642 bytes, but the process of actually sending them over the network has not succeeded, as the AT= SEND OK did not arrive in time.

Please, bear in mind - the AT< Recv is a totally undocumented message, it would be risky to rely on it in our driver. The only official acknowledgement of the data being sent is the AT= SEND OK. I think that it is justified to return a timeout if the official acknowledgement doesn't arrive.

To confirm if the data is sent twice or not - may I ask you to take another look into your logs? If the data is indeed sent twice we would see a stray AT= SEND OK from the ESP, coming after the resend mechanism is triggered. In other words - the number of AT< Recv 642 bytes messages would be the same as the number of AT= SEND OK messages, even though we think one of them timed out.
If there is one AT=SEND OK message less that AT< Recv message - then it means the data was not sent out twice and timeout was justified.

Either way - double packets in TCP are not an issue, but I understand we want to avoid them for battery savings.

@ccli8
Copy link
Contributor Author

ccli8 commented Oct 15, 2019

To see duplicate packets easily, I decrease ESP8266_SEND_TIMEOUT to 200ms.

First accepted AT+CIPSEND=0,90, but timeout with not receiving AT= SEND OK:

AT> AT+CIPSEND=0,90
AT? >%n
AT< OK
AT= >
AT? SEND OK%n
AT<
AT< Recv 90 bytes
AT(Timeout)
AT? %n
AT? %n

Because NSAPI_ERROR_WOULD_BLOCK is returned, following AT+CIPSEND=0,90 are retried, but rejected. I think ESP8266 module is in sending out the first AT+CIPSEND=0,90:

AT> AT+CIPSEND=0,90
AT? >%n
AT! busy
AT? %*4[^"]
%n
AT= s...

AT(Aborted)
AT? %n
AT> AT+CIPSEND=0,90
AT? >%n
AT! busy
AT? %*4[^"]
%n
AT= s...

Second accepted AT+CIPSEND=0,90. Notice AT< SEND OK is interrupting before AT< OK and AT= >. It should come from the first accepted AT+CIPSEND=0,90 that times out. The interrupting AT< SEND OK is not handled and NSAPI_ERROR_WOULD_BLOCK is returned:

AT> AT+CIPSEND=0,90
AT? >%n
AT< SEND OK
AT< OK
AT= >
AT? SEND OK%n
AT<
AT! +IPD
AT? ,%*d,%n
AT= ,0,
AT? %*d
%n
AT= 755

Expect SEND OK to match the second accepted AT+CIPSEND=0,90, but the log is disturbed by incoming packet.

AT? SEND OK%n
AT! +IPD
AT? ,%*d,%n
AT= ,0,
AT? %*d
%n
AT= 941

AT? SEND OK%n
AT< Recv 90 bytes
AT(Timeout)
AT? %n
AT? %n

Still repeat AT+CIPSEND=0,90:

AT> AT+CIPSEND=0,90
AT? >%n
AT! busy
AT? %*4[^"]
%n
AT= s...

AT(Aborted)
AT? %n
AT> AT+CIPSEND=0,90
AT? >%n
AT! busy
AT? %*4[^"]
%n
AT= s...

AT(Aborted)
AT? %n

Finally, see AT< SEND OK. This should match the second accepted AT+CIPSEND=0,90.

AT> AT+CIPSEND=0,90
AT? >%n
AT! busy
AT? %*4[^"]
%n
AT= s...

AT(Aborted)
AT? %n
AT< SEND OK
AT! +IPD
AT? ,%*d,%n
AT= ,0,
AT? %*d
%n
AT= 948

AT? %n
AT! 0,CLOSED

See two AT< SEND OK, so data is duplicated.

Either way - double packets in TCP are not an issue, but I understand we want to avoid them for battery savings.

The duplicate comes from AT layer. TCP is unaware of it and still treats it as distinct packets.

@ccli8
Copy link
Contributor Author

ccli8 commented Oct 15, 2019

I also do an experiment to show my above idea: decrease ESP8266_SEND_TIMEOUT to 200ms to meet timeout easily and skip SEND OK check

mbed-os/components/wifi/esp8266-driver/ESP8266/ESP8266.cpp > ESP8266::send(...):

    if (_parser.write((char *)data, (int)amount) >= 0 /*&& _parser.recv("SEND OK")*/) {
        ret = NSAPI_ERROR_OK;
    }

AT+CIPSEND=0,653 is accepted. Skip SEND OK check. Return NSAPI_ERROR_OK.

AT> AT+CIPSEND=0,653
AT? >%n
AT< OK
AT= >
AT? %n
AT<
AT< Recv 653 bytes
AT? %n
AT? %n

New data AT+CIPSEND=0,75 is rejected due to above pending AT+CIPSEND=0,653. Return NSAPI_ERROR_WOULD_BLOCK.

AT> AT+CIPSEND=0,75
AT? >%n
AT! busy
AT? %*4[^"]
%n
AT= s...

AT(Aborted)
AT? %n
AT> AT+CIPSEND=0,75
AT? >%n
AT! busy
AT? %*4[^"]
%n
AT= s...

AT(Aborted)
AT? %n

New data AT+CIPSEND=0,75 finally gets accepted. Skip SEND OK check. Notice the interrupting AT< SEND OK for above pending AT+CIPSEND=0,653. Return NSAPI_ERROR_OK.

AT> AT+CIPSEND=0,75
AT? >%n
AT< SEND OK
AT< OK
AT= >
AT? %n
AT<
AT< Recv 75 bytes
AT? %n

Another new data AT+CIPSEND=0,84 is rejected due to above pending AT+CIPSEND=0,75. Return NSAPI_ERROR_WOULD_BLOCK.

AT> AT+CIPSEND=0,84
AT? >%n
AT! busy
AT? %*4[^"]
%n
AT= s...

AT(Aborted)
AT? %n
AT> AT+CIPSEND=0,84
AT? >%n
AT! busy
AT? %*4[^"]
%n
AT= s...

AT(Aborted)
AT? %n

Finally, new data AT+CIPSEND=0,84 gets accepted. Notice the AT< SEND OK for above pending AT+CIPSEND=0,75.

AT? %n
AT< SEND OK
AT> AT+CIPSEND=0,84
AT? >%n
AT< OK
AT= >
AT? %n
AT<
AT< Recv 84 bytes
AT? %n

@michalpasztamobica
Copy link
Contributor

@ccli8 , thanks a lot for these logs and explanation. You definitely have a point here - the driver's behavior needs improvement. Extending the timeout is a good workaround, but I think we should rework the driver:

  1. Increase the send timeout, to at least see the "busy" message, preferably make it dependent on the amount of data sent over serial and the baud rate.
  2. Respawn the timeout in case of "busy s", but with some reasonable total timeout (30s?) still in place.
  3. (this is risky) include the "Recv XXX bytes" in our calculations - although undocumented, this is an important moment of communication...

FYI, @AnttiKauppila

@ccli8
Copy link
Contributor Author

ccli8 commented Dec 17, 2019

Update for it?

@michalpasztamobica
Copy link
Contributor

Hi @ccli8 . We are actively working on this in the current sprint. Just trying to figure out the least intrusive change that would make most sense.

ccli8 added a commit to ccli8/mbed-os-example-pelion that referenced this issue Jan 9, 2020
1.  Support Nuvoton targets:
    -   NUMAKER_IOT_M487 V1.3
        Compared to V1.2, V1.3 adds support for ESP8266 RTS/CTS/RST pins. V1.3 is incompatible with V1.2 on ESP8266.
2.  Change storage to NUSD (SD card in SDIO bus mode) from SD (SD card SPI bus mode)
    (1) Add COMPONENT_NUSD.lib.
    (2) Add component NUSD (target.components_add).
    (3) Change default BlockDevice to NUSD:
        Override BlockDevice::get_default_instance (nusd.provide-default-blockdevice).
        Required since mbed-cloud-client 2.1.0 (arm_uc_blockdevice_ext) or in:
        mbed-bootloader/modules/storage/pal-blockdevice/source/arm_uc_pal_blockdevice_mbed.cpp
    (4) Enable kvstore with blockdevice type being "other".
        i)  Override get_other_blockdevice() (nusd.provide-kvstore-other-blockdevice).
        ii) Configure storage type to "FILESYSTEM".
        iii)Configure file system type to "LITTLE".
        vi) Configure block device type to "other".
        v)  Configure external size to 64MiB, which cannot overlap with update-client.
            storage-address/update-client.storage-size.
3.  Change back UARTSerial tx/rx buffer size from 1024/1024 to 256/256. This can reduce
    memory footprint by (1024 - 256) * 2 = 1.5KiB. Because this configuration influences
    both ESP8266 and default console (platfrom/mbed_retarget.cpp), memory reduction can
    achieve 1.5KiB x 2 = 3KiB.
4.  Enlarge ESP8266 'send tcp data' timeout on Nuvoton targets
    Pelion connection has some failure rate with this. Enlarging ESP8266_SEND_TIMEOUT
    can just relieve the issue. A ticket has raised to address it:
    ARMmbed/mbed-os#11544
ccli8 added a commit to ccli8/mbed-os-example-pelion that referenced this issue Feb 18, 2020
1.  Support Nuvoton targets:
    -   NUMAKER_IOT_M263A
2.  Change storage to NUSD (SD card in SDIO bus mode) from SD (SD card SPI bus mode)
    (1) Add COMPONENT_NUSD.lib.
    (2) Add component NUSD (target.components_add).
    (3) Change default BlockDevice to NUSD:
        Override BlockDevice::get_default_instance (nusd.provide-default-blockdevice).
        Required since mbed-cloud-client 2.1.0 (arm_uc_blockdevice_ext) or in:
        mbed-bootloader/modules/storage/pal-blockdevice/source/arm_uc_pal_blockdevice_mbed.cpp
    (4) Enable kvstore with blockdevice type being "other".
        i)  Override get_other_blockdevice() (nusd.provide-kvstore-other-blockdevice).
        ii) Configure storage type to "FILESYSTEM".
        iii)Configure file system type to "LITTLE".
        vi) Configure block device type to "other".
        v)  Configure external size to 64MiB, which cannot overlap with update-client.
            storage-address/update-client.storage-size.
3.  Change back UARTSerial tx/rx buffer size from 1024/1024 to 256/256. This can reduce
    memory footprint by (1024 - 256) * 2 = 1.5KiB. Because this configuration influences
    both ESP8266 and default console (platfrom/mbed_retarget.cpp), memory reduction can
    achieve 1.5KiB x 2 = 3KiB.
4.  Enlarge ESP8266 'send tcp data' timeout on Nuvoton targets
    Pelion connection has some failure rate with this. Enlarging ESP8266_SEND_TIMEOUT
    can just relieve the issue. A ticket has raised to address it:
    ARMmbed/mbed-os#11544
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging a pull request may close this issue.

5 participants