OTA update fails every other attempt after reset (> r2.4.2) #5955
Comments
Me, I would replace this by
and replace python's
Better, I would replace this UDP message by a TCP one. |
@d-a-v , thanks 😄 your workaround should work while we wait for a conclusion on issue #3481 . I wrongly thought a temporary workaround that retries on recv failure in espota.py would work, but noticed that the ArduinoOTA code initiates the TCP transfer regardless of whether the UDP OK is sent (and it can't know, as it's UDP), after which espota times out on UDP recv. I agree wholeheartedly that the OTA update init sequence (Flash command/OK) is best served using TCP, but I can't comment on why UDP was chosen. PS: I just tried your code and it works as intended. |
Salut @d-a-v , is the enhancement tag for implementing a TCP solution for OTA session initiation instead of UDP? If yes, and if this is accepted, I would like to « attempt » implementing it. I know it would take you guys literally hours to do. I have much spare time and would be nice to make this tiny contribution. I think the scope is mostly limited to What do you think? |
Yes go ahead! (Keep backward compatibility in mind though) |
That was quick 😀 By backward compatibility, do you mean a solution that initiates the OTA handshake based on a common transport between client and esp? A scenario I can think of: the client has an old version of espota.py which does not support TCP, so we switch to UDP. How valid is this? I'm afraid I don't have much insight or history about all this, so I may be way off... my first attempt will be a simple TCP transport implemented in both .py and .cpp. PS: heavy focus on attempt, as I am very new to the esp code base and C++/python |
@igrr , do you have any insight why UDP was chosen as the transport for the OTA Flash command? As you are one of the authors for OTA, do you have any additional thoughts regarding our proposal to use TCP for that initial OTA setup? As this is my first public contrib, I just want to consult and include all relevant folks. Thank you. |
UDP was chosen due to resources used, mostly. Keeping a TCP PCB (~socket) listening needed more memory than an open UDP PCB. Also the total number of TCP sockets was limited to 4 or so, in earlier versions of the LwIP (which came pre-compiled with the SDK). Although on one hand I agree with the "eh, UDP is not reliable, let's slap on TCP here" approach, it might be worth actually looking into the UDP loss issue, especially if it happens predictably ("on every other reset"). |
Thank you for the very valuable insight and quick reply. The UDP first packet loss is reproduced in every other OTA attempt, predictably, which I believe is related to issue #3481. I can also attempt to look into this but some guidance would be helpful. |
I agree with that. I got back to #3481 and found that without some delays, UDP packets are not even sent out.
I agree with that too. |
@d-a-v @igrr , Small update: lwip v1.4 (high BW) does NOT exhibit the issue, compared to lwip v2 (high BW), in the same release 2.5.0. OTA transfer (and hence the UDP exchange) works every time. The lwip v1.4 -> v2 evolution is convoluted to my noob eyes; I don't think I can narrow it down further, as the lwip codebase is large and I don't have much history with it. |
@d-a-v , need some help please, trying to use lwip debugging by setting Other defines that I set, e.g. And I believe if I set both defines lwip2 make install
C build-536-feat-v6/glue/doprint.o
C build-536-feat-v6/glue/uprint.o
C build-536-feat-v6/glue-lwip/lwip-git.o
glue-lwip/lwip-git.c: In function 'new_display_netif':
glue-lwip/lwip-git.c:144:37: error: 'ip_addr_t' has no member named 'addr'
display_ip32(" ip=", netif->ip_addr.addr);
^
glue-lwip/lwip-git.c:145:39: error: 'ip_addr_t' has no member named 'addr'
display_ip32(" mask=", netif->netmask.addr);
^
glue-lwip/lwip-git.c:146:32: error: 'ip_addr_t' has no member named 'addr'
display_ip32(" gw=", netif->gw.addr);
^
glue-lwip/lwip-git.c: At top level:
glue-lwip/lwip-git.c:467:9: note: #pragma message:
-------- TCP_MSS = 536 --------
-------- LWIP_FEATURES = 1 --------
-------- LWIP_IPV6 = 1 --------
#pragma message "\n\n" VAR_NAME_VALUE(TCP_MSS) VAR_NAME_VALUE(LWIP_FEATURES) VAR_NAME_VALUE(LWIP_IPV6)
^
make[3]: *** [build-536-feat-v6/glue-lwip/lwip-git.o] Error 1
make[2]: *** [liblwip6-536-feat.a] Error 2
make -f makefiles/Makefile.glue-esp BUILD=build-1460-feat-v6 V=0
C build-1460-feat-v6/glue-esp/lwip-esp.o
AR liblwip6-1460-feat.a
make -f makefiles/Makefile.glue target=arduino BUILD=build-1460-feat-v6 TCP_MSS=1460 LWIP_FEATURES=1 LWIP_IPV6=1 V=0
C build-1460-feat-v6/glue/doprint.o
C build-1460-feat-v6/glue/uprint.o
C build-1460-feat-v6/glue-lwip/lwip-git.o
glue-lwip/lwip-git.c: In function 'new_display_netif':
glue-lwip/lwip-git.c:144:37: error: 'ip_addr_t' has no member named 'addr'
display_ip32(" ip=", netif->ip_addr.addr);
^
glue-lwip/lwip-git.c:145:39: error: 'ip_addr_t' has no member named 'addr'
display_ip32(" mask=", netif->netmask.addr);
^
glue-lwip/lwip-git.c:146:32: error: 'ip_addr_t' has no member named 'addr'
display_ip32(" gw=", netif->gw.addr);
^
glue-lwip/lwip-git.c: At top level:
glue-lwip/lwip-git.c:467:9: note: #pragma message:
-------- TCP_MSS = 1460 --------
-------- LWIP_FEATURES = 1 --------
-------- LWIP_IPV6 = 1 --------
#pragma message "\n\n" VAR_NAME_VALUE(TCP_MSS) VAR_NAME_VALUE(LWIP_FEATURES) VAR_NAME_VALUE(LWIP_IPV6)
^
make[3]: *** [build-1460-feat-v6/glue-lwip/lwip-git.o] Error 1
make[2]: *** [liblwip6-1460-feat.a] Error 2
make[1]: *** [install] Error 2
make: *** [install] Error 2 |
This bug is already fixed in master (and I just fixed the previous fix), in
edit: Sorry for being unclear. Only the compilation error is fixed. |
Sorry, I'm unsure which bug you say is fixed: no lwip debugging output, or setting both defines breaking compilation? OK, the latter. In any case I'll check out builder master and go from there. Thank you. |
You might also have a look at #3481 (comment) |
Thanks, I saw it when you commented there. For this OTA use case, it's just one single UDP packet not being sent (the first one post boot). I've done hundreds of uploads, and I can't put my finger on it, but there's probably some interaction with the first ARP. I've noticed that in some cases the esp does not respond to an ARP request. And one person in that issue also mentioned ARP... It would really help me if I could activate lwip2 debugging; does your latest fix have anything to do with |
No, it's supposed to work. You need to enable debug on serial in IDE Tools menu. |
Yes, I did that, and also added (in case) |
Ah! |
@d-a-v , I needed to enable both In any case, there's tons of debug info, and once when I attempted an OTA transfer, I got a stack dump, and most other times during the upload. Not sure if it's a side effect of the vvv debug logs or not ... and there are periodic Maybe I'll use master so I am on par with most of you and take it from there. Crash during OTA update and UDEBUG/ULWIPDEBUG = 1
|
It was enough when I tried today
Can you please try and change Line 8 in 77451d6
#define DEBUGV(fmt, ...) ::printf(fmt, ## __VA_ARGS__) and report ?
|
Do you want me to try that on release 2.5.0 (which I am on now), or checkout master and apply that define change ? but first I need to digest the debug options in |
I recommend using git when it comes to debugging. |
Ok I set I just got a nasty panic, I'll implement your one liner and try again. PS: from now on, I am on |
OK, I implemented your one liner above, recompiled, uploaded... it ran a bit, then I reset the board and got 3 panics in a row... in between resets. Now, is this related to the debug defines ??? |
Ok solution and problem details in my comment #3481 (comment). In the end, we will not change the OTA transport to TCP given ARP_QUEUEING=1 solves the OTA UDP packet loss. |
Basic Infos
Platform
Settings in IDE
Problem Description
Hello, Arduino OTA fails every other attempt after reset. After much searching and debugging, I believe this is a side effect of issue #3481. Arduino OTA uses UDP to initiate the transfer, and espota.py expects a reply to the UDP Flash command packet within a timeout. If the esp8266 does not reply with a UDP "OK" payload, the script will time out with the famous "No Answer".
Upon doing various tests, wireshark proves the above behaviour. The first UDP packet issue does not seem to exist in r2.4.2, where OTA update works all the time. The OTA UDP "OK" reply is always sent after an esp8266 reset.
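For readers who want to see the handshake in code, here is a minimal sketch of the client side: send one UDP datagram, then wait for a reply until a timeout expires. It is not the actual espota.py code; the function name `send_invitation`, the `attempts` parameter, and the buffer size are assumptions made for illustration, and the command payload is left to the caller.

```python
import socket


def send_invitation(device_addr, message, timeout=10.0, attempts=1):
    """Send a UDP command and wait for a reply.

    Hypothetical helper, not taken from espota.py.
    Returns the decoded reply, or None if nothing arrived in time
    (the situation the script reports as "No Answer").
    """
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.settimeout(timeout)
    try:
        for _ in range(attempts):
            sock.sendto(message, device_addr)
            try:
                data, _addr = sock.recvfrom(128)
            except socket.timeout:
                # Reply lost or never sent; resend if attempts remain.
                continue
            return data.decode(errors="replace")
        return None
    finally:
        sock.close()
```

A call such as `send_invitation(("192.168.1.50", 8266), payload)` would return `"OK"` when the device answers (address and port here are placeholders). With `attempts` greater than 1 the command is simply sent again, which only helps if the device tolerates receiving it twice.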
I am not familiar with the esp8266 source code, but I've looked at the diffs of 'ArduinoOTA.cpp' between 2.4.2 and 2.5.0, and although I know NULL about C++, the issue does not seem to have been introduced by any of those updates.
I also quickly diffed lwip v2 of 2.4.2 vs 2.5.0, there is a bunch of patches that also likely did not affect the UDP first packet issue.
I am willing to continue to dig further, if someone could be kind enough to point me in some direction! A temporary workaround could be to implement a retry on the sendto socket call (Arduino/tools/espota.py, line 95 in 77451d6) in the espota.py script.
Steps to reproduce