
OTA update seems corrupt on some nodes when updating from core 2.5.x or older #8264

Closed
4 of 6 tasks
TD-er opened this issue Aug 11, 2021 · 5 comments

Comments

@TD-er
Contributor

TD-er commented Aug 11, 2021

Basic Infos

  • This issue complies with the issue POLICY doc.
  • I have read the documentation at readthedocs and the issue is not addressed there.
  • I have tested that the issue is present in current master branch (aka latest git).
  • I have searched the issue tracker for a similar issue.
  • If there is a stack dump, I have decoded it.
  • I have filled out all fields below.

Platform

  • Hardware: [ESP-07]
  • Core Version: [latest git hash or date]
  • Development Env: [Platformio]
  • Operating System: [Windows]

Settings in IDE

  • Module: [Generic ESP8266 Module]
  • Flash Mode: [dio]
  • Flash Size: [4MB]
  • lwip Variant: [v2 Lower Memory|Higher Bandwidth]
  • Reset Method: [nodemcu]
  • Flash Frequency: [40Mhz]
  • CPU Frequency: [80Mhz]
  • Upload Using: [OTA|SERIAL]
  • Upload Speed: [115200] (serial upload only)

Problem Description

Over the last few months, I've encountered lots of unexplainable issues on a specific project.
The update runs fine on my own nodes, but when other nodes are updated we see all kinds of random issues.

A few weeks ago we found out that when those nodes are (re)flashed via serial, the nodes run just fine.

The boards we made all have an ESP-07S on board, which were pre-flashed by the seller with our image.
However, the initial image was based on core 2.5.x and we're now using core 2.7.4.
So I wonder if it is possible that the bootloader part (or whatever it is called) may not be overwritten using OTA and that this part may use different flash settings?
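
For reference, an ESP8266 firmware image stores its flash settings in the first bytes of its header, so the settings baked into two images can be compared directly. Below is a minimal sketch that decodes those bytes, assuming the header layout documented for esptool (magic byte 0xE9, segment count, flash mode, then flash size in the high nibble and flash frequency in the low nibble). The names `FlashHeader` and `decodeFlashHeader` are made up for this example and are not part of the core, and this does not show whether OTA rewrites that part of the flash.

```cpp
#include <cstdint>
#include <string>

// Flash settings as stored in the first 4 bytes of an ESP8266 image.
// Hypothetical struct, for illustration only.
struct FlashHeader {
    std::uint8_t segments;  // number of segments in the image
    std::string mode;       // "qio", "qout", "dio", "dout" or "unknown"
    std::string size;       // e.g. "4MB", or "other" if not in the table
    std::string frequency;  // e.g. "40MHz", or "other" if not in the table
};

// Decodes the header bytes pointed to by h (at least 4 bytes must be readable).
// Returns false when the magic byte is not 0xE9.
bool decodeFlashHeader(const std::uint8_t* h, FlashHeader& out) {
    if (h[0] != 0xE9) return false;

    static const char* const modes[] = {"qio", "qout", "dio", "dout"};
    out.segments = h[1];
    out.mode = h[2] < 4 ? modes[h[2]] : "unknown";

    switch (h[3] >> 4) {  // high nibble: flash size
        case 0x0: out.size = "512KB"; break;
        case 0x1: out.size = "256KB"; break;
        case 0x2: out.size = "1MB"; break;
        case 0x3: out.size = "2MB"; break;
        case 0x4: out.size = "4MB"; break;
        case 0x8: out.size = "8MB"; break;
        case 0x9: out.size = "16MB"; break;
        default:  out.size = "other"; break;
    }
    switch (h[3] & 0x0F) {  // low nibble: flash frequency
        case 0x0: out.frequency = "40MHz"; break;
        case 0x1: out.frequency = "26MHz"; break;
        case 0x2: out.frequency = "20MHz"; break;
        case 0xF: out.frequency = "80MHz"; break;
        default:  out.frequency = "other"; break;
    }
    return true;
}
```

With the settings listed above (dio, 4MB, 40MHz), the header would be expected to start with `E9 xx 02 40`, where `xx` is the segment count. Dumping the first bytes of a serial-flashed unit and of an OTA-updated unit and decoding both is one way to check whether the two ended up with different settings.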

The issues we encountered were so random that I'm really starting to think the flash was either not written correctly, or maybe not always read correctly.
I know for sure that the power supply is not to blame here, as I designed the boards myself.
The ESP is powered by its own AMS1117, has the capacitors described by Espressif's design guides and the boards run just fine after being flashed over serial.

Could this be somehow related to the changes made a while back regarding the driving voltage of XMC flash?
N.B. Not all units seem to have the same flash brand. Some report as XMC, but not all, and I am not entirely certain that all boards with stability issues use XMC flash.

Could it also be a write speed issue? No idea how quickly the flash is written via OTA.
When flashing using serial, there does not seem to be any difference when flashing at 115200 baud vs. 4 times that rate.

I also tested this on some boards by flashing an older image (of the entire flash) and then performing an OTA update, which would fail repeatedly.
Strangely enough, another build with only minor code changes, flashed via OTA, could run stable.
So it sounds like we may have some very specific bit pattern that is harder to flash?

I assume the OTA flashed data is verified using some checksum, but is it possible this checksum is calculated when the ESP is running "older" code to set the flash parameters?
Or could there be a timing issue where reading the flash sequentially differs from normal use, so that a post-OTA check may succeed while reads still fail when running in normal mode?
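
To make the read-back question concrete, here is a minimal sketch of a checksum comparison between the bytes that were sent and the bytes read back afterwards. CRC-32 is used purely for illustration; this is not claimed to be the check the core's OTA path performs, and `crc32Ieee` and `readBackMatches` are hypothetical names.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Bitwise CRC-32 (IEEE 802.3, reflected polynomial 0xEDB88320).
std::uint32_t crc32Ieee(const std::uint8_t* data, std::size_t len) {
    std::uint32_t crc = 0xFFFFFFFFu;
    for (std::size_t i = 0; i < len; ++i) {
        crc ^= data[i];
        for (int bit = 0; bit < 8; ++bit) {
            // Shift right; XOR in the polynomial when the dropped bit was 1.
            crc = (crc >> 1) ^ (0xEDB88320u & (0u - (crc & 1u)));
        }
    }
    return crc ^ 0xFFFFFFFFu;
}

// Hypothetical helper: true when the data read back from flash has the same
// length and CRC-32 as the data that was sent.
bool readBackMatches(const std::vector<std::uint8_t>& sent,
                     const std::vector<std::uint8_t>& readBack) {
    return sent.size() == readBack.size() &&
           crc32Ieee(sent.data(), sent.size()) ==
               crc32Ieee(readBack.data(), readBack.size());
}
```

A check like this only proves that the data could be read back correctly under the conditions in which the check itself ran. If the read-back happens with different flash settings or a different access pattern than normal execution, it can pass while later reads still fail, which is the scenario asked about above.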

@Jason2866
Contributor

Jason2866 commented Aug 15, 2021

I think you are running into issue #7267.
In my experience, every Arduino ESP8266 core built on the SDK 3.0 pre-release behaves unpredictably when upgrading. All releases between core 2.3.0 and the current 2.7.x had major issues.
We only got Tasmota rock solid with core 2.3.0 and the current 2.7.x.
The never-ending story is PWM without hiccups :-). That's a different story, though; blame Espressif...

@TD-er
Contributor Author

TD-er commented Aug 15, 2021

Yep looks like it.
I've had ESPEasy rock solid for only some builds, even with core 2.6.x and 2.5.x.
The problem is that even small changes can result in instability in the next build.
Really frustrating, and it seems to be even more of a problem with core 3.0.x.

Could the flash issues described here in this issue be the cause of those instabilities?
I got the impression it was more like a linker issue, but now I come to think of it, it could also be related to flash read errors.

@Jason2866
Contributor

Jason2866 commented Aug 15, 2021

XMC flash, well it is... When I encounter strange errors, I swap to an ESP8266 device with Winbond flash. I had one device that behaved weirdly: I read the flash directly with a CH341, flashed a new Winbond chip and swapped the flash chips. The device has been working fine ever since.
So I still don't know: a defective flash chip, or XMC problems with Arduino ESP8266?
But old devices with Winbond chips are still working, while newer devices with XMC chips have already had flash replacements. Coincidence? For me XMC is the same crap as PUYA and I avoid it.

@TD-er
Contributor Author

TD-er commented Aug 15, 2021

I recently also started looking at the quality of the crystals.

See the "Time Wander" value:
[image: screenshot showing the "Time Wander" value]

This is on an older NodeMCU board, which does have the Espressif ESP12F module (or ESP12E) and a Winbond flash by the way.
A value of less than 0.010 msec/sec (10 ppm) is well within specs.
But first replies from users already show that they're not getting even close to that.
Have to see how this develops, as it is now a tool to get a rough idea of the kind of quality we can expect from the board.
I suspect it will be mainly related to WiFi issues, but maybe there is also some correlation with flash reliability, or perhaps only some indicator of bad product design.

This "time wander" value is computed based on NTP or GPS updates over a longer period (at least 1 hour) compared to the internal time.
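
As a rough illustration of the arithmetic, the sketch below derives a drift figure in msec/sec from two elapsed times and converts it to ppm. The function names and the sign convention (positive means the local clock runs fast) are assumptions for this example; the actual ESPEasy computation is not shown in this thread.

```cpp
#include <cassert>

// Drift of the local clock relative to a reference clock (e.g. NTP or GPS),
// in milliseconds gained or lost per second of reference time.
// Assumed convention: positive means the local clock runs fast.
double timeWanderMsecPerSec(double localElapsedSec, double referenceElapsedSec) {
    assert(referenceElapsedSec > 0.0);
    return (localElapsedSec - referenceElapsedSec) / referenceElapsedSec * 1000.0;
}

// 1 msec/sec equals 1000 ppm, so 0.010 msec/sec equals 10 ppm.
double msecPerSecToPpm(double msecPerSec) {
    return msecPerSec * 1000.0;
}
```

For example, a local clock that counts 3600.036 s while the reference counts 3600 s has gained 36 ms in an hour, which works out to 0.010 msec/sec or 10 ppm.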

I also tested it on an Espressif ESP32 module, and that one reports values ranging from -0.001 to +0.002, so those are excellent. (6h NTP interval)

@earlephilhower
Collaborator

Looks like this is taken care of and the discussion is wandering. Closing.

Also, FWIW, quartz crystal frequency is affected by temperature. So you could be stuck not only with poorly tuned xtals, but also with variation due to ambient temps...
