Skip to content

OTA update fails every other attempt after reset (> r2.4.2) #5955

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
6 tasks done
aboulfad opened this issue Apr 6, 2019 · 25 comments · Fixed by #5978
Closed
6 tasks done

OTA update fails every other attempt after reset (> r2.4.2) #5955

aboulfad opened this issue Apr 6, 2019 · 25 comments · Fixed by #5978

Comments

@aboulfad
Copy link

aboulfad commented Apr 6, 2019

Basic Infos

  • This issue complies with the issue POLICY doc.
  • I have read the documentation at readthedocs and the issue is not addressed there.
  • I have tested that the issue is present in current master branch (aka latest git).
  • I have searched the issue tracker for a similar issue.
  • If there is a stack dump, I have decoded it.
  • I have filled out all fields below.

Platform

  • Hardware: [ESP-12]
  • Core Version: [2.5.0-b3, 2.5.0, 2.4.2]
  • Development Env: [Arduino IDE, cmd line]
  • Operating System: [MacOS]

Settings in IDE

  • Module: [Nodemcu]
  • Flash Mode: [qio]
  • Flash Size: [4MB]
  • lwip Variant: [v2 Lower Memory]
  • Reset Method: [ck]
  • Flash Frequency: [40Mhz]
  • CPU Frequency: [80Mhz]
  • Upload Using: [OTA]
  • Upload Speed: OTA

Problem Description

Hello, Arduino OTA fails every other attempt after rest, after much searching and debugging I believe this is a side effect of issue #3481. Arduino OTA uses UDP to initiate the transfer and espota.py expects a reply to the UDP Flash command packet within a timeout. If the esp8266 does not reply with an UDP "OK" payload the script will timeout with the famous "No Answer".

Upon doing various tests, wireshark proves the above behaviour. The first UDP packet issue does not seem to exist in r2.4.2, where OTA update works all the time. The OTA UDP "OK" reply is always sent after an esp8266 reset.

I am not familiar with esp8266 src code, but i've looked at the diffs for 'ArduinoOTA.cpp' of 2.4.2 and 2.5.0, and although i know NULL about C++, it does not seem the issue is introduced from all those updates.

I also quickly diffed lwip v2 of 2.4.2 vs 2.5.0, there is a bunch of patches that also likely did not affect the UDP first packet issue.

I am willing to continue to dig further, if someone could be kind to point me in some direction! A temporary workaround could be to implement a retry on the sendto socket call

sent = sock2.sendto(message.encode(), remote_address)
in espota.py script.

Steps to reproduce

  • Use either 2.5.0 or later installed from IDE board manager.
  • Load BasicOTA sketch via serial, then reset
  • Attempt OTA update , it will fail on first try, observe that there is NO UDP OK packet sent from esp8266
  • Attempt another OTA update, it will work as the UDP OK packet is sent.

OTA_Fail_NoUDP_Reply
OTA_success

@d-a-v
Copy link
Collaborator

d-a-v commented Apr 6, 2019

Me, I would replace this by

  for (int i = 0; i < 10; i++)
  {
    _udp_ota->append("OK", 2);
    _udp_ota->send(ota_ip, _ota_udp_port);
    delay(10);
  }

and replace python's if (data != "OK"): by if ("OK" not in data):.

Better, I would replace this UDP message by a TCP one.
Because UDP has nothing to do here. A TCP message, would be always sent, received and acknowledged reliably.

@aboulfad
Copy link
Author

aboulfad commented Apr 6, 2019

@d-a-v , merci 😄 your workaround should work as we wait for a conclusion on issue #3481 . I wrongly thought a temporary workaround to retry on revc fail in espota.py would work, but noticed that the ArduinoOTA code initiates the TCP transfer irregardless if the UDP OK is sent (and it can’t know as it’s UDP), whereafter espota times out on UDP recv.

I agree whole heartedly that the OTA update init sequence (Flash command/OK) is best served using TCP, but I can’t comment why UDP was chosen.

PS: I just tried your code and it works as intended.

@aboulfad
Copy link
Author

aboulfad commented Apr 7, 2019

Salut @d-a-v , is the enhancement tag for implementing a TCP solution for OTA session initiation instead of UDP? If yes, and if this is accepted, I would like to « attempt » implementing it. I know it would take you guys literally hours to do. I have much spare time and would be nice to make this tiny contribution. I think the scope is mostly limited to ArduinoOTA.cpp and espota.py?

What do you think?

@d-a-v
Copy link
Collaborator

d-a-v commented Apr 7, 2019

Yes go ahead!

(Keep backward compatibility in mind though)

@aboulfad
Copy link
Author

aboulfad commented Apr 7, 2019

That was quick 😀

By backward compatibility, do you mean a solution that initiates the OTA handshake based on a common transport between client and esp?

A scenario I can think of, client has an old version of esports.py which does not support TCP - so switch to UDP, How valid is this?

Afraid I don’t much insight or history about all this so I may be way off... my first attempt will be a simple TCP transport implemented in both .py and .cpp.

PS: heavy focus on attempt as very new to esp code base and C++/python

@aboulfad
Copy link
Author

aboulfad commented Apr 7, 2019

@igrr , do you have any insight why UDP was chosen as the transport for the OTA Flash command? As you are one of the authors for OTA, do you have any additional thoughts regarding our proposal to use TCP for that initial OTA setup?

As this is my first public contrib, I just want to consult and include all relevant folks. Thank you.

@igrr
Copy link
Member

igrr commented Apr 7, 2019

UDP was chosen due to resources used, mostly. Keeping a TCP PCB (~socket) listening needed more memory than an open UDP PCB. Also the total number of TCP sockets was limited to 4 or so, in earlier versions of the LwIP (which came pre-compiled with the SDK).

Although on one hand I agree with the "eh, UDP is not reliable, let's slap on TCP here" approach, it might be worth actually looking into the UDP loss issue, especially if it happens predictably ("on every other reset").

@aboulfad
Copy link
Author

aboulfad commented Apr 7, 2019

Thank you for the very valuable insight and quick reply.
One point to add, and I know very little, after the U_FLASH command exchange which uses UDP, a TCP session is used for the firmware transfer.
Couldn’t this same TCP socket be also used for the initial commands too? This way there is no additional resource usage.

The UDP first packet loss is reproduced in every other OTA attempt, predictably, which I believe is related to issue #3481. I can also attempt to look into this but some guidance would be helpful.

@d-a-v
Copy link
Collaborator

d-a-v commented Apr 7, 2019

it might be worth actually looking into the UDP loss issue, especially if it happens predictably ("on every other reset").

I agree with that. I got back to #3481 and found that without some delays, UDP packets are not even sent out.

Couldn’t this same TCP socket be also used for the initial commands too? This way there is no additional resource usage.

I agree with that too.

@aboulfad
Copy link
Author

aboulfad commented Apr 8, 2019

@d-a-v @igrr , Small update, lwip v1.4 (high BW) does NOT exhibit the issue compared to lwip v2 (high BW) in same release 2.5.0. OTA transfer (and hence UDP exchange) works everytime.

The lwip v1.4 -> v2 evolution is convoluted to my noob eyes, I don't think I can narrow it further as the lwip codebase is large and don't have much history.

@aboulfad
Copy link
Author

aboulfad commented Apr 8, 2019

@d-a-v , need some help svp, trying to use lwip debugging by setting ULWIPDEBUG, using your instructions from #3884 (comment), running the sketch outputs nothing related to those defines. I verified that sdk/lib contains the newly compiled -llwip* (am using git checkout 2.5.0)

Other defines that I set, e.g. DEBUG_ESP_MDNS_RESPONDER works fine and show up on serial port.

And I believe if I set both defines UDEBUG & ULWIPDEBUG, compilations breaks ...

lwip2 make install
C  build-536-feat-v6/glue/doprint.o
C  build-536-feat-v6/glue/uprint.o
C  build-536-feat-v6/glue-lwip/lwip-git.o
glue-lwip/lwip-git.c: In function 'new_display_netif':
glue-lwip/lwip-git.c:144:37: error: 'ip_addr_t' has no member named 'addr'
 display_ip32(" ip=", netif->ip_addr.addr);
                                    ^
glue-lwip/lwip-git.c:145:39: error: 'ip_addr_t' has no member named 'addr'
 display_ip32(" mask=", netif->netmask.addr);
                                      ^
glue-lwip/lwip-git.c:146:32: error: 'ip_addr_t' has no member named 'addr'
 display_ip32(" gw=", netif->gw.addr);
                               ^
glue-lwip/lwip-git.c: At top level:
glue-lwip/lwip-git.c:467:9: note: #pragma message: 

-------- TCP_MSS = 536 --------
-------- LWIP_FEATURES = 1 --------
-------- LWIP_IPV6 = 1 --------

#pragma message "\n\n" VAR_NAME_VALUE(TCP_MSS) VAR_NAME_VALUE(LWIP_FEATURES) VAR_NAME_VALUE(LWIP_IPV6)
        ^
make[3]: *** [build-536-feat-v6/glue-lwip/lwip-git.o] Error 1
make[2]: *** [liblwip6-536-feat.a] Error 2
make -f makefiles/Makefile.glue-esp BUILD=build-1460-feat-v6 V=0
C  build-1460-feat-v6/glue-esp/lwip-esp.o
AR liblwip6-1460-feat.a
make -f makefiles/Makefile.glue target=arduino BUILD=build-1460-feat-v6 TCP_MSS=1460 LWIP_FEATURES=1 LWIP_IPV6=1 V=0
C  build-1460-feat-v6/glue/doprint.o
C  build-1460-feat-v6/glue/uprint.o
C  build-1460-feat-v6/glue-lwip/lwip-git.o
glue-lwip/lwip-git.c: In function 'new_display_netif':
glue-lwip/lwip-git.c:144:37: error: 'ip_addr_t' has no member named 'addr'
 display_ip32(" ip=", netif->ip_addr.addr);
                                    ^
glue-lwip/lwip-git.c:145:39: error: 'ip_addr_t' has no member named 'addr'
 display_ip32(" mask=", netif->netmask.addr);
                                      ^
glue-lwip/lwip-git.c:146:32: error: 'ip_addr_t' has no member named 'addr'
 display_ip32(" gw=", netif->gw.addr);
                               ^
glue-lwip/lwip-git.c: At top level:
glue-lwip/lwip-git.c:467:9: note: #pragma message: 

-------- TCP_MSS = 1460 --------
-------- LWIP_FEATURES = 1 --------
-------- LWIP_IPV6 = 1 --------

#pragma message "\n\n" VAR_NAME_VALUE(TCP_MSS) VAR_NAME_VALUE(LWIP_FEATURES) VAR_NAME_VALUE(LWIP_IPV6)
        ^
make[3]: *** [build-1460-feat-v6/glue-lwip/lwip-git.o] Error 1
make[2]: *** [liblwip6-1460-feat.a] Error 2
make[1]: *** [install] Error 2
make: *** [install] Error 2```
</details>

@d-a-v
Copy link
Collaborator

d-a-v commented Apr 9, 2019

This bug is already fixed in master (and I just fixed the previous fix), in tools/sdk/lwip2/:

cd builder
git fetch
git checkout master

edit: Sorry for being unclear. Only the compilation error is fixed.

@aboulfad
Copy link
Author

aboulfad commented Apr 9, 2019

Sorry, unsure to which bug you say is fixed, no lwip debugging output or setting both defines that breaks compilation? Ok the latter.

In any case I’ll checkout builder master and go from there. Thank you.
PS: found it d-a-v/esp82xx-nonos-linklayer@10c1fe1 Oh its fresh off the oven!

@d-a-v
Copy link
Collaborator

d-a-v commented Apr 9, 2019

You might also have a look to #3481 (comment)

@aboulfad
Copy link
Author

aboulfad commented Apr 9, 2019

Thanks I saw it when you commented there. For this OTA usecase, it’s just one single UDP packet not being sent (the first one post boot). I’ve done about 100’s of uploads, and I can’t put my finger on it, but probably there’s some interaction with the first ARP. I’ve noticed in some case esp does not respond to ARP request. And one person in that issue also mentioned ARP...

It would really help me if I can activate lwip2 debugging, does your latest fix has anything to do with ULWIPDEBUG ?
PS: I will run your script as this netdump may be helpful

@d-a-v
Copy link
Collaborator

d-a-v commented Apr 9, 2019

does your latest fix has anything to do with ULWIPDEBUG ?

No, it's supposed to work. You need to enable debug on serial in IDE Tools menu.

@aboulfad
Copy link
Author

aboulfad commented Apr 9, 2019

does your latest fix has anything to do with ULWIPDEBUG ?

No, it's supposed to work. You need to enable debug on serial in IDE Tools menu.

Yes, I did that, and also added (in case) Serial.setdebugouput(true) , no additional debug info is shown from whatever I set in debug level (OTA, wifi, core,...). Sadly I spent like 4 hours trying to figure it out before I asked for help!

@d-a-v
Copy link
Collaborator

d-a-v commented Apr 9, 2019

Ah!
That was working with lwIP-2.0.x but when I updated to lwIP-2.1.2 I reworked the whole lwipopts.h
and I didn't change the debug default options.
You need to change all (or some) LWIP_DBG_OFF to LWIP_DBG_ON in:
https://github.com/d-a-v/esp82xx-nonos-linklayer/blob/25d5e8186f710a230221021cba97727dbfdfd953/glue-lwip/arduino/lwipopts.h#L3259

@aboulfad
Copy link
Author

aboulfad commented Apr 9, 2019

@d-a-v , I needed to enable both UDEBUG & ULWIPDEBUG to get debug info printed. In your old comment #3884 (comment) it seems to imply that one can define only ULWIPDEBUG ?

In any case, there's tons of debug info, and once when I attempted an OTA transfer, i got a stack dump and most other times during the upload. Not sure if its a side effect of the vvv debug logs or not ... and there are periodic Exception (28) crashes by just letting the sketch run. I just attached FYI.

Maybe I'll use master so I am at par w most of you and take it from there.

Crash during OTA update and UDEBUG/ULWIPDEBUG = 1
Decoding stack results
0x4020414d: ClientContext::_s_connected(void*, tcp_pcb*, long) at $ESP8266/libraries/ESP8266WiFi/src/include/ClientContext.h line 605
0x40219475: tcp_input at core/tcp_in.c line 408
0x4021dd4d: ip_reass_free_complete_datagram at core/ipv4/ip4_frag.c line 190
0x40100630: _umm_free at $ESP8266/cores/esp8266/umm_malloc/umm_malloc.c line 1300
0x40215a25: dns_recv at core/dns.c line 1239
0x4021581f: dns_check_entry at core/dns.c line 1010
0x40235474: wifi_softap_dhcps_client_update at glue-lwip/esp-dhcpserver.c line 1239
0x4020c4fc: esp_yield() at $ESP8266/cores/esp8266/core_esp8266_main.cpp line 91
0x40201ad4: uart0_write_char at $ESP8266/cores/esp8266/uart.c line 450
0x4020c400: String::trim() at $ESP8266/cores/esp8266/WString.cpp line 776
0x4020c4fc: esp_yield() at $ESP8266/cores/esp8266/core_esp8266_main.cpp line 91
0x4020157f: delay at $ESP8266/cores/esp8266/core_esp8266_wiring.c line 51
0x40201ad4: uart0_write_char at $ESP8266/cores/esp8266/uart.c line 450
0x40100d84: free at $ESP8266/cores/esp8266/umm_malloc/umm_malloc.c line 1760
0x40100298: vPortFree at $ESP8266/cores/esp8266/heap.c line 59
0x40100630: _umm_free at $ESP8266/cores/esp8266/umm_malloc/umm_malloc.c line 1300
0x401002a8: pvPortMalloc at $ESP8266/cores/esp8266/heap.c line 153
0x40100b66: malloc at $ESP8266/cores/esp8266/umm_malloc/umm_malloc.c line 1681
0x402150f0: netif_init_ap at glue-lwip/lwip-git.c line 344
0x4021560e: lwip_htonl at core/def.c line 92
0x4021a84c: sys_timeouts_init at core/timeouts.c line 271
0x4021df68: ip4_reass at core/ipv4/ip4_frag.c line 558
0x4021a84c: sys_timeouts_init at core/timeouts.c line 271
0x4021d0ac: icmp_input at core/ipv4/icmp.c line 250
0x4021a84c: sys_timeouts_init at core/timeouts.c line 271
0x4021b1d8: autoip_tmr at core/ipv4/autoip.c line 242
0x4021df68: ip4_reass at core/ipv4/ip4_frag.c line 558
0x4021b258: autoip_tmr at core/ipv4/autoip.c line 424
0x4021dfb0: ip4_reass at core/ipv4/ip4_frag.c line 289
0x402162d0: pbuf_free_ooseq at core/pbuf.c line 131
0x402162d0: pbuf_free_ooseq at core/pbuf.c line 131
0x4021dfd6: ip4_reass at core/ipv4/ip4_frag.c line 301
0x4021a766: lwip_cyclic_timer at core/timeouts.c line 244
0x40219a0c: tcp_write at core/tcp_out.c line 781
0x40217344: tcp_close_shutdown at core/tcp.c line 352
0x40204120: ClientContext::_s_connected(void*, tcp_pcb*, long) at $ESP8266/libraries/ESP8266WiFi/src/include/ClientContext.h line 635
0x40100b66: malloc at $ESP8266/cores/esp8266/umm_malloc/umm_malloc.c line 1681
0x402048ab: WiFiClient::connect(IPAddress const&, unsigned short) at $ESP8266/libraries/ESP8266WiFi/src/include/ClientContext.h line 127
0x40203e80: WiFiClient::connect(char const*, unsigned short) at $ESP8266/libraries/ESP8266WiFi/src/WiFiClient.cpp line 133
0x4020b460: Print::println() at $ESP8266/cores/esp8266/Print.cpp line 178
0x4020cbe4: PubSubClient::connected() at $Arduinolibraries/PubSubClient/src/PubSubClient.cpp line 606
0x40204d03: PubSubClient::connect(char const*, char const*, char const*, char const*, unsigned char, unsigned char, char const*, unsigned char) at $Arduinolibraries/PubSubClient/src/PubSubClient.cpp line 129
0x4020c61d: optimistic_yield(uint32_t) at $ESP8266/cores/esp8266/core_esp8266_main.cpp line 115
0x40204ef4: PubSubClient::connect(char const*, char const*, unsigned char, unsigned char, char const*) at $Arduinolibraries/PubSubClient/src/PubSubClient.cpp line 114
0x401001b6: millis at $ESP8266/cores/esp8266/core_esp8266_wiring.c line 183
0x402028fa: mqttreconnect() at $Arduinoscripts/rssi/rssi.ino line 95
0x40202a10: loop() at $Arduinoscripts/rssi/rssi.ino line 202
0x4020c5a8: loop_wrapper() at $ESP8266/cores/esp8266/core_esp8266_main.cpp line 125
PS: just saw your above post after this one. I'll follow-up, and selectively use the debug option I need.

@d-a-v
Copy link
Collaborator

d-a-v commented Apr 9, 2019

it seems to imply that one can define only ULWIPDEBUG

It was enough when I tried today

periodic Exception (28) crashes

Can you please try and change

#define DEBUGV(fmt, ...) ::printf((PGM_P)PSTR(fmt), ## __VA_ARGS__)
to
#define DEBUGV(fmt, ...) ::printf(fmt, ## __VA_ARGS__) and report ?

@aboulfad
Copy link
Author

aboulfad commented Apr 9, 2019

Do you want me to try that on release 2.5.0 (which I am on now), or checkout master and apply that define change ? but first I need to digest the debug options in lwipopts.h

@d-a-v
Copy link
Collaborator

d-a-v commented Apr 9, 2019

I recommend using git when it comes to debugging.
On 2.5.0 you don't have the same line but you can try this change. Please note that on 2.5.0 the firmware is quite slower (loop() called less often, and there are not-understood issues with some boards)

@aboulfad
Copy link
Author

aboulfad commented Apr 9, 2019

Ok I set ETHARP_DEBUG, I got nada in the serial console. so there still must be something else to set... 😤 darn darn darn, need to set both ULWIPDEBUG && the individual defines !

I just got a nasty panic, i'll implement your one liner and try again.
Panic $ESP8266/libraries/ESP8266WiFi/src/include/ClientContext.h:604 err_t ClientContext::_connected(tcp_pcb*, err: Assertion '_connect_pending' failed.

PS: from now on, I am on master.

@aboulfad
Copy link
Author

aboulfad commented Apr 9, 2019

ok I implemented your one liner above, recompiled, uploaded... run a bit, then I reset the board, I got 3 panics in a row... in between resets. Now, is this related to the debug defines ???
Panic 1,2,3:
Panic $ESP8266/libraries/ESP8266WiFi/src/include/ClientContext.h:604 err_t ClientContext::_connected(tcp_pcb*, err: Assertion '_connect_pending' failed.

@aboulfad
Copy link
Author

Ok solution and problem details in my comment #3481 (comment). In the end, we will not change the OTA transport to TCP given ARP_QUEUEING=1 solves the OTA UDP packet loss.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants