Boot Loop after OTA Update #3443

Closed
danube opened this issue Dec 20, 2024 · 27 comments
Labels
bug Something isn't working

Comments

@danube

danube commented Dec 20, 2024

The Problem

I installed 15.7.0 with the web installer and then updated to 16.0.0-RC5 via OTA update. This resulted in a boot loop.

esp-web-tools-logs.txt

Version

16.0.0 RC5

Logfile

attached

Expected Behavior

No response

Screenshots

No response

Additional Context

No response

@danube danube added the bug Something isn't working label Dec 20, 2024
@caco3
Collaborator

caco3 commented Dec 20, 2024

Can you please provide the config file?
My impression is that the SD card is corrupted (not writable).
Please format it and update to the RC5 image once more.

@danube
Author

danube commented Dec 21, 2024

sure, there you go:

[TakeImage]
;RawImagesLocation = /log/source
WaitBeforeTakingPicture = 5
;RawImagesRetention = 15
Demo = false
Brightness = 1
Contrast = 2
Saturation = 2
;Sharpness = 
LEDIntensity = 15
ImageQuality = 12
ImageSize = VGA
;Zoom = false
;ZoomMode = 0
;ZoomOffsetX = 
;ZoomOffsetY = 
;Grayscale = false
;Negative = false
;Aec2 = false
;AutoExposureLevel = 
FixedExposure = false

[Alignment]
InitialRotate = 184.5
InitialMirror = false
SearchFieldX = 20
SearchFieldY = 20
AlignmentAlgo = default
FlipImageSize = false
/config/ref0.jpg 102 188
/config/ref1.jpg 383 144

[Digits]
Model = /config/dig-cont_0611_s3.tflite
CNNGoodThreshold = 0.5
;ROIImagesLocation = /log/digit
;ROIImagesRetention = 3
main.dig1 183 131 30 54 false
main.dig2 222 131 30 54 false
main.dig3 261 131 30 54 false
main.dig4 300 131 30 54 0
main.dig5 339 131 30 54 0

[Analog]
Model = /config/ana-cont_1208_s2_q.tflite
;ROIImagesLocation = /log/analog
;ROIImagesRetention = 3
main.ana1 374 215 80 80 false
main.ana2 332 299 80 80 false
main.ana3 251 335 80 80 false
main.ana4 146 299 80 80 false

[PostProcessing]
main.DecimalShift = 3
main.AnalogDigitalTransitionStart = 9.2
PreValueUse = true
PreValueAgeStartup = 720
main.AllowNegativeRates = false
main.MaxRateValue = 1000
;main.MaxRateType = AbsoluteChange
main.ExtendedResolution = false
main.IgnoreLeadingNaN = false
ErrorMessage = true
CheckDigitIncreaseConsistency = false

[MQTT]
Uri = mqtt://masked:1883
MainTopic = sensors/tele/esp-watermeter
ClientID = watermeter
user = masked
password = masked
RetainMessages = true
HomeassistantDiscovery = false
;MeterType = other
;CACert = undefined
;ClientCert = undefined
;ClientKey = undefined

;[InfluxDB]
;Uri = undefined
;Database = undefined
;user = undefined
;password = undefined
;main.Measurement = undefined
;main.Field = undefined

[InfluxDBv2]
Uri = http://masked:8086
Bucket = smarthome
Org = myHome
Token = masked
main.Measurement = watermeter
main.Field = absoluteValue

[GPIO]
;IO0 = input disabled 10 false false 
;IO1 = input disabled 10 false false 
;IO3 = input disabled 10 false false 
;IO4 = built-in-led disabled 10 false false 
IO12 = external-flash-ws281x disabled 10 false false 
;IO13 = input-pullup disabled 10 false false 
LEDType = WS2812B
LEDNumbers = 3
LEDColor = 4 12 15

[AutoTimer]
AutoStart = true
Interval = 5

[DataLogging]
DataLogActive = true
DataFilesRetention = 3

[Debug]
LogLevel = 1
LogfilesRetention = 3

[System]
TimeZone = CET-1CEST,M3.5.0,M10.5.0/3
;TimeServer = pool.ntp.org
;Hostname = undefined
;RSSIThreshold = 0
CPUFrequency = 160
SetupMode = false

@danube
Author

danube commented Dec 21, 2024

In the meantime, I deleted the entire contents of the SD and copied the contents of the manual-setup zip file. It now works for me. I just hope that by providing the necessary information, I can help prevent other people's devices from running into a boot loop.

@phulstaert

I also have the same issue.
I had the 15.7.0, I applied the 16 RC5 (OTA with the update file) and now it does not start anymore. I would guess this is also a bootloop.
I'll try to clear the SD and put the files from the manual file and report back.

@caco3
Collaborator

caco3 commented Dec 22, 2024

Thanks for providing the config.
I was now able to reproduce the issue and fix it.
It was an issue in the parameter migration from 15.7.0 to the latest release.

@caco3 caco3 closed this as completed Dec 22, 2024
@phulstaert

Thank you
Is there a fix without erasing the whole SD? E.g. only the config or something like that?
Or were there only a few lines that did not get migrated, and can we copy-paste a solution?

@caco3
Collaborator

caco3 commented Dec 22, 2024

If you haven't configured the device yet, the easiest option is to replace the config/config.ini file with the one from https://github.com/jomjol/AI-on-the-edge-device/blob/v16.0.0-RC5/sd-card/config/config.ini

Alternatively you can try to replace just the following parameters:

CamZoom
CamZoomSize
CamZoomOffsetX
CamZoomOffsetY

with those:

CamZoom = false
CamZoomSize = 0
CamZoomOffsetX = 0
CamZoomOffsetY = 0

Note that I did not test this variant!

@phulstaert

Did I read it correctly that CamZoom settings are causing a bootloop?
I would expect the device to boot, with only image capture or recognition failing to run.

If that were the case, I could update the settings through the web interface. Now I have to disassemble the case, take the ESP32 out, take out the SD, edit the file, put everything back, and rerun the setup for the camera/alignment/etc... because the camera moved.

I know for a developer this is "just the config.ini", but for a user this can really be a hassle.
I will look this evening at the config file and validate that those are the only variables that had issues.

@SybexX
Collaborator

SybexX commented Dec 23, 2024

A bootloop is generally caused when a variable has no value behind it, such as "Zoom =" or "Zoom" instead of "Zoom = false".
And this usually happens when the user makes any changes to the config.ini without knowing what exactly they are doing.
We've already rewritten most of the code so that this no longer happens, but only "most".

@caco3
Collaborator

caco3 commented Dec 23, 2024

🤣

It is not the parameters that cause the endless boot loops. We have precautions against endless boot loops during normal operation. However, here the migration scripts that update the configuration from 15.7.0 to 16.0.0-RCx had a bug. Since the migration happens automatically and directly after startup, there is indeed no time to run the OTA update to fix it before the next crash.

It is great that you and @danube reported this issue, so we could fix it. I hope you are fully aware that you installed a pre-release; it is even marked as RC (Release Candidate) to let you as a user know that it is not a final release!

Please also keep in mind that you are allowed to use the software free of charge. In contrast, we developers have invested (and still invest) hundreds of hours in it, all in our free time. Now, it would also be great to give something back -> check the donation link on https://github.com/jomjol/AI-on-the-edge-device?tab=readme-ov-file#donate-

@caco3
Collaborator

caco3 commented Dec 23, 2024

And this usually happens when the user makes any changes to the config.ini without knowing what exactly they are doing.
We've already rewritten most of the code so that this no longer happens, but only "most".

No, here it indeed was a broken migration. The migration lost the last part of the string and when we split it we had an out-of-index access.
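
To illustrate the failure mode described here, a minimal Python sketch follows. The firmware itself is not written in Python, and the helper name split_param and its fallback behaviour are assumptions made for illustration, not the project's code. When a line such as ";Zoom = " has lost its value, splitting it yields only one token, so reading the second token without a length check goes out of range. Checking the token count first avoids that:

```python
from typing import Optional, Tuple


def split_param(line: str) -> Tuple[str, Optional[str]]:
    """Split 'Key = value' into (key, value); value is None when it is missing.

    Hypothetical helper, not taken from the firmware.
    """
    tokens = line.replace("=", " ").split()
    if not tokens:
        return ("", None)
    key = tokens[0]
    # Guard: an unguarded tokens[1] would raise IndexError for ';Zoom = '.
    value = tokens[1] if len(tokens) > 1 else None
    return (key, value)


print(split_param("Zoom = false"))
print(split_param(";Zoom = "))
```

The caller can then decide what a missing value means (skip the line, use a default) instead of crashing during startup.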

@danube
Author

danube commented Dec 23, 2024

It is great that you and @danube reported this issue

@caco3 it's what I can do 😉

I know for a developer this is "just the config.ini", but for a user this can really be a hassle. I will look this evening at the config file and validate that those are the only variables that had issues.

@phulstaert, feel free to contribute with whatever skills you have. 🤟

@phulstaert

I am a developer myself (PHP/NodeJS) and know what RC means.
RC comes after Alpha and Beta versions. If the migration of rather standard settings does not work, that is something that should not happen in an RC version, and I would question the test scripts. An RC should really be a 'candidate for release' and something clients should be able to run. Only user-specific situations should pose a risk of errors, not a basic part of the migration that affects every update.

But on the other hand, in my field it is easy to test, and I have no idea how to test something that runs on an ESP32. I don't know if there are emulation environments for this. Every ESP32 project I have written in the past was for personal use, was in the Arduino suite for use with a single sensor and only a hardcoded MQTT client. So just saying "test better" is easy to say, but maybe not that easy to do; everything I say is said with the utmost respect and is meant to help make this project better and more robust.

That said, there are 2 issues at play here.

  1. The migration changed and was not tested. (I say "not" because these are fields that every config has, so it should happen every time you run the OTA.)
  2. The robustness of reading the config.ini should be better (to help with 'user errors' and 'migration errors' like this):
     • a line is either a comment (starts with ; in an ini file)
     • or a valid key-value pair
     • everything else should be ignored (no crash, bootloop, etc., and maybe be rewritten with "; unknown statement - " so the user can find their mistake more easily)
     • all missing required keys should be added with a default value that is sufficient to enter the running state (or, if for example the SSID is missing, you could flash the status LED in a specific color or sequence)
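
The reading rules proposed above can be sketched in Python as follows. This is an illustration under stated assumptions, not the firmware's parser: the names read_config and DEFAULTS and the two default entries are hypothetical, and sections are treated as plain pass-through lines. Note that the real config.ini also contains positional lines without "=" (for example the ROI definitions such as "main.dig1 183 131 30 54 false"), which this simplified sketch would flag as unknown; a real implementation would need an extra rule for them.

```python
from typing import Dict, List, Tuple

# Hypothetical defaults; the firmware's actual required keys are not listed in this thread.
DEFAULTS = {"CamZoom": "false", "CamZoomSize": "0"}


def read_config(text: str) -> Tuple[Dict[str, str], List[str]]:
    """Return (settings, cleaned_lines) without raising on malformed lines."""
    settings: Dict[str, str] = {}
    cleaned: List[str] = []
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith(";") or line.startswith("["):
            cleaned.append(raw)  # blank line, comment, or section header
            continue
        key, sep, value = line.partition("=")
        key, value = key.strip(), value.strip()
        if sep and key and value:
            settings[key] = value  # valid key-value pair
            cleaned.append(raw)
        else:
            # Keep the line visible to the user, but neutralise it.
            cleaned.append("; unknown statement - " + line)
    for key, value in DEFAULTS.items():
        if key not in settings:  # missing or invalid: fall back to the default
            settings[key] = value
            cleaned.append(f"{key} = {value}")
    return settings, cleaned


settings, cleaned = read_config("; comment\nCamZoom = true\nCamZoomSize =\ngarbage\n")
print(settings)
print("\n".join(cleaned))
```

With this approach a broken line costs the user one setting rather than a bootable device.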

@SybexX
Collaborator

SybexX commented Dec 23, 2024

This error does not occur when updating from version 15.7.0 to 16.0.0 (which was just tested), but if you update from e.g. 15.4.0 to 15.7.0 and then to 16.0.0, it may happen.

@phulstaert

That is strange.
I had a new ESP32; installed 15.7 from the web installer and after a few days I updated to 16.0 RC5 using the update file.
I will look at my config.ini tonight (in 7 hours from now) to confirm that this is the issue with my device. It could always be that I have another issue.

@phulstaert

There are multiple lines completely missing
(left = my file; right = file from github)

[3 screenshots: side-by-side comparison of the two config files]

@phulstaert

phulstaert commented Dec 23, 2024

The logfile kept repeating this:

[0d00h00m00s] 1970-01-01T01:23:12	<INF>	[MAIN] =================================================
[0d00h00m00s] 1970-01-01T01:23:12	<INF>	[MAIN] ==================== Start ======================
[0d00h00m00s] 1970-01-01T01:23:12	<INF>	[MAIN] =================================================
[0d00h00m00s] 1970-01-01T01:23:12	<INF>	[MAIN] PSRAM size: 8388608 byte (8MB / 64MBit)
[0d00h00m00s] 1970-01-01T01:23:12	<INF>	[MAIN] Total heap: 4382863 byte
[0d00h00m03s] 1970-01-01T01:23:15	<INF>	[MAIN] Camera info: PID: 0x26, VER: 0x42, MIDL: 0x7f, MIDH: 0xa2
[0d00h00m03s] 1970-01-01T01:23:15	<INF>	[SDCARD] Basic R/W check started...
[0d00h00m03s] 1970-01-01T01:23:15	<INF>	[SDCARD] Basic R/W check successful
[0d00h00m03s] 1970-01-01T01:23:16	<ERR>	[HELPER] Migrated Configfile line ';Brightness = ' to ';CamBrightness = '
[0d00h00m03s] 1970-01-01T01:23:16	<ERR>	[HELPER] Migrated Configfile line ';Contrast = ' to ';CamContrast = '
[0d00h00m04s] 1970-01-01T01:23:16	<ERR>	[HELPER] Migrated Configfile line ';Saturation = ' to ';CamSaturation = '
[0d00h00m04s] 1970-01-01T01:23:16	<ERR>	[HELPER] Migrated Configfile line ';Sharpness = ' to ';CamSharpness = '
[0d00h00m04s] 1970-01-01T01:23:16	<ERR>	[HELPER] Migrated Configfile line ';ImageQuality = ' to ';CamQuality = '
[0d00h00m04s] 1970-01-01T01:23:16	<ERR>	[HELPER] Migrated Configfile line ';ImageSize = VGA' to ';;UNUSED_PARAMETER = VGA'
[0d00h00m00s] 1970-01-01T01:23:18	<INF>	[MAIN] =================================================
[0d00h00m00s] 1970-01-01T01:23:18	<INF>	[MAIN] ==================== Start ======================
[0d00h00m00s] 1970-01-01T01:23:19	<INF>	[MAIN] =================================================
[0d00h00m00s] 1970-01-01T01:23:19	<INF>	[MAIN] PSRAM size: 8388608 byte (8MB / 64MBit)
[0d00h00m00s] 1970-01-01T01:23:19	<INF>	[MAIN] Total heap: 4382863 byte
[0d00h00m03s] 1970-01-01T01:23:21	<INF>	[MAIN] Camera info: PID: 0x26, VER: 0x42, MIDL: 0x7f, MIDH: 0xa2
[0d00h00m03s] 1970-01-01T01:23:22	<INF>	[SDCARD] Basic R/W check started...
[0d00h00m03s] 1970-01-01T01:23:22	<INF>	[SDCARD] Basic R/W check successful
[0d00h00m03s] 1970-01-01T01:23:22	<ERR>	[HELPER] Migrated Configfile line ';Brightness = ' to ';CamBrightness = '
[0d00h00m03s] 1970-01-01T01:23:22	<ERR>	[HELPER] Migrated Configfile line ';Contrast = ' to ';CamContrast = '
[0d00h00m04s] 1970-01-01T01:23:22	<ERR>	[HELPER] Migrated Configfile line ';Saturation = ' to ';CamSaturation = '
[0d00h00m04s] 1970-01-01T01:23:22	<ERR>	[HELPER] Migrated Configfile line ';Sharpness = ' to ';CamSharpness = '
[0d00h00m04s] 1970-01-01T01:23:23	<ERR>	[HELPER] Migrated Configfile line ';ImageQuality = ' to ';CamQuality = '
[0d00h00m04s] 1970-01-01T01:23:23	<ERR>	[HELPER] Migrated Configfile line ';ImageSize = VGA' to ';;UNUSED_PARAMETER = VGA'
[0d00h00m00s] 1970-01-01T01:23:25	<INF>	[MAIN] =================================================
[0d00h00m00s] 1970-01-01T01:23:25	<INF>	[MAIN] ==================== Start ======================
[0d00h00m00s] 1970-01-01T01:23:25	<INF>	[MAIN] =================================================
[0d00h00m00s] 1970-01-01T01:23:25	<INF>	[MAIN] PSRAM size: 8388608 byte (8MB / 64MBit)
[0d00h00m00s] 1970-01-01T01:23:25	<INF>	[MAIN] Total heap: 4382863 byte
[0d00h00m03s] 1970-01-01T01:23:28	<INF>	[MAIN] Camera info: PID: 0x26, VER: 0x42, MIDL: 0x7f, MIDH: 0xa2
[0d00h00m03s] 1970-01-01T01:23:28	<INF>	[SDCARD] Basic R/W check started...
[0d00h00m03s] 1970-01-01T01:23:28	<INF>	[SDCARD] Basic R/W check successful
[0d00h00m03s] 1970-01-01T01:23:28	<ERR>	[HELPER] Migrated Configfile line ';Brightness = ' to ';CamBrightness = '
[0d00h00m03s] 1970-01-01T01:23:29	<ERR>	[HELPER] Migrated Configfile line ';Contrast = ' to ';CamContrast = '
[0d00h00m04s] 1970-01-01T01:23:29	<ERR>	[HELPER] Migrated Configfile line ';Saturation = ' to ';CamSaturation = '
[0d00h00m04s] 1970-01-01T01:23:29	<ERR>	[HELPER] Migrated Configfile line ';Sharpness = ' to ';CamSharpness = '
[0d00h00m04s] 1970-01-01T01:23:29	<ERR>	[HELPER] Migrated Configfile line ';ImageQuality = ' to ';CamQuality = '
[0d00h00m04s] 1970-01-01T01:23:29	<ERR>	[HELPER] Migrated Configfile line ';ImageSize = VGA' to ';;UNUSED_PARAMETER = VGA'

@caco3
Collaborator

caco3 commented Dec 23, 2024

Did you follow one of the ways I proposed in #3443 (comment)?

@phulstaert

phulstaert commented Dec 23, 2024

I copied the version from github and adjusted what needed to be adjusted; now it works.
The extra information is more for debugging if you or any other contributors were interested.

The logfile was 23MB from bootlooping. Earlier it was said that there are systems in place to ensure there is no bootlooping. I would argue that this failsafe was not working... :-)

You can also see that it was migrating the configfile, but although it was writable, it did not write anything (except for the logfile).

@caco3
Collaborator

caco3 commented Dec 23, 2024

Earlier it was said that there are systems in place to ensure there is no bootlooping. I would argue that this failsafe was not working... :-)

No, that is not exactly right. The fail-safe mechanism works as intended; however, due to the complexity of the system, it only becomes active after the migration.

You can also see that it was migrating the configfile, but although it was writable, it did not write anything (except for the logfile).

If you had used the latest main instead of RC5, this would no longer have been an issue, since I fixed it in #3450.

Since you say you are a developer, I would have expected you to interpret this correctly:

[screenshot]

@phulstaert

There were no empty parameters in my config.ini nor in my config.bak.
You can see this in the screenshots I posted; so I don't know why this would have any effect on my situation.
It is exactly because of this, I posted the screenshots.

If you had used the latest main instead of RC5, this would no longer have been an issue, since I fixed it in #3450.

I don't know what you mean by this.

  • I updated to RC5 yesterday at 16:00 local time (GMT +1) and got into the bootloop.
  • I posted my issue here at 19:00 local time, since I found another similar-looking issue here.
  • You merged "catch empty parameters in migration" (#3450) at 22:00 local time.

When should I have used main instead of RC5? Before I knew this was going to happen, or after I got an unresponsive 'system'?

By the way, the reason I updated to a newer version was because the LED (flash) would not turn on a second time. Only a first time and then I needed to reboot. This issue is not present anymore in the RC5 - so I am happy with it.

@SybexX
Collaborator

SybexX commented Dec 23, 2024

There were no empty parameters in my config.ini nor in my config.bak.
You can see this in the screenshots I posted; so I don't know why this would have any effect on my situation.
It is exactly because of this, I posted the screenshots.

There are multiple lines completely missing
(left = my file; right = file from github)
[screenshot]

If you had used the latest main instead of RC5, this would no longer have been an issue, since I fixed it in #3450.

I don't know what you mean by this.

That's what he meant: https://github.com/jomjol/AI-on-the-edge-device/actions/runs/12457667210

[screenshot]
[screenshot]

I'm starting to get the feeling that you don't really know what migration does.
It looks for variables in the config.ini where we have made a name change and changes them.

Brightness = 1 is changed to CamBrightness = 1
Contrast = 2 is changed to CamContrast = 2
......
;Zoom = false is changed to CamZoom = false
;ZoomMode = 0 is changed to CamZoomSize = 0
;ZoomOffsetX = 0 is changed to CamZoomOffsetX = 0
;ZoomOffsetY = 0 is changed to CamZoomOffsetY = 0
........
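
The renaming described above can be sketched in Python as follows. This is an illustration, not the firmware's migration code: the function name migrate_line is hypothetical, and the RENAMES table contains only pairs quoted in this thread. The sketch assumes that a leading ";" is preserved, which is what the log lines posted earlier show (';Brightness = ' became ';CamBrightness = '), and it tolerates an empty value instead of indexing into it.

```python
# Old-name -> new-name pairs quoted in this thread; illustrative, not complete.
RENAMES = {
    "Brightness": "CamBrightness",
    "Contrast": "CamContrast",
    "Zoom": "CamZoom",
    "ZoomMode": "CamZoomSize",
    "ZoomOffsetX": "CamZoomOffsetX",
    "ZoomOffsetY": "CamZoomOffsetY",
}


def migrate_line(line: str) -> str:
    """Rename the key of one config line, keeping a leading ';' and the value."""
    stripped = line.lstrip()
    prefix = ";" if stripped.startswith(";") else ""
    body = stripped[len(prefix):]
    key, sep, value = body.partition("=")
    new_key = RENAMES.get(key.strip())
    if not sep or new_key is None:
        return line  # not a renamed parameter: leave the line untouched
    # value may be empty (e.g. ';Brightness = '); that must not crash.
    return f"{prefix}{new_key} = {value.strip()}"


for old in ["Brightness = 1", ";ZoomMode = 0", ";Brightness = ", "[TakeImage]"]:
    print(repr(old), "->", repr(migrate_line(old)))
```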

That is strange.
I had a new ESP32; installed 15.7 from the web installer and after a few days I updated to 16.0 RC5 using the update file.
I will look at my config.ini tonight (in 7 hours from now) to confirm that this is the issue with my device. It could always be that I have another issue.

Your config.ini is not from version 15.7.0, so you had an earlier version on the SD.
This is one from version 15.7.0 and as you can see there is no variable without a value behind it.
[screenshot]

@phulstaert

Ok, so in the migration, you are also parsing comment lines. I skipped those when going through the file, because comments are comments in my mind. This does however explain the "empty parameters" issue.

When I -personally- do migrations, I create a function to create a new configfile and fill it with known values; the unknown keys are always filled with default values when they are required. Doing it like this, I skip commented lines.
This ensures that there is always a valid configfile. (For example, when a config file has a specific key multiple times, which is a very common error when users manually edit files and not always easy to find, you always rewrite the config in a known and valid format.) It is also a good idea to include extra data like "what version of the software generated this config", "what version of the config file it is", "when it was last generated", etc.
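
A minimal Python sketch of this "regenerate instead of patch" approach is shown below. Everything in it is an assumption for illustration: the DEFAULTS schema, the CONFIG_FORMAT value, the header comment names, and the function name rebuild_config are hypothetical and do not come from the project. The generation timestamp mentioned above is left out to keep the output deterministic.

```python
from typing import Dict

# Hypothetical schema and format version, for illustration only.
DEFAULTS = {"CamZoom": "false", "CamZoomSize": "0", "LEDIntensity": "15"}
CONFIG_FORMAT = "2"


def rebuild_config(old_text: str, software_version: str) -> str:
    """Write a fresh config: known keys only, last value wins, defaults fill gaps."""
    found: Dict[str, str] = {}
    for raw in old_text.splitlines():
        line = raw.strip()
        if not line or line.startswith(";"):
            continue  # comments are skipped entirely
        key, sep, value = line.partition("=")
        key, value = key.strip(), value.strip()
        if sep and key in DEFAULTS and value:
            found[key] = value  # a repeated key simply overwrites the earlier one
    out = [
        f"; generated-by = {software_version}",
        f"; config-format = {CONFIG_FORMAT}",
    ]
    for key, default in DEFAULTS.items():
        out.append(f"{key} = {found.get(key, default)}")
    return "\n".join(out) + "\n"


old = "; CamZoom = true\nCamZoomSize = 1\nCamZoomSize = 2\nbogus\nLEDIntensity =\n"
print(rebuild_config(old, "16.0.0"))
```

The trade-off is that user comments and unknown keys are dropped on every rewrite, which may or may not be acceptable for a hand-edited file.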

My screenshot of the config however is really from a 15.7 version.
I formatted the SD (it was ext3 before), so certainly no old files. I did the webinstall and followed the instructions and used only those links. I only have the 15.7 zip and RC5 update file in my downloads folder. (just checked to be on the safe side)
I did however have issues with the LED flash and tried a few settings, which I adjusted using the web configuration page.

What parameters lead you to say this is not a 15.7 version? I have a few ESP32-CAMs, so I am willing to try to reproduce what I had. But knowing what to search for is rather valuable information when trying to reproduce something.

I hope it is only a language barrier thing, but I sense a sort of hostility and a condescending attitude. To be clear: I really want to help. There is no other project like this and I want to help create a stable and useful product for the world to use. You know as well as I do that people often abandon a product that does not work on the first try. I am trying to identify and rectify such an error.

@SybexX
Collaborator

SybexX commented Dec 24, 2024

I always use a translator because my English isn't that good^^
The parameters are assigned standard values in the firmware and, if an entry exists in the config.ini, initialized with these.
My picture shows the original config.ini from version 15.7.0 and this should actually be copied to the SD during a new installation, meaning there shouldn't be any parameters without a value behind them.

@phulstaert

Ok, I'll leave it at that regarding the "language" issues. I know that when translating Dutch (this is my native language), it also doesn't always relay the message with the same 'finesse' a native English speaker can write it. Because of this, I try to write my English directly, but as you might have noticed, I do sometimes make spelling or grammar mistakes...

Tonight, I will try to reproduce the error on a new ESP32-CAM and a new SD card. (or is there an open source/free emulator available that I can use to speed this up?)

Is this project open for new Collaborators? Or are you a more closed set of people?
I always like to do optimization, formatting, code deduplication, etc... On another great project (Leantime) I did the same thing and it resulted in a speed gain of 90% and the install package went from ~30MB to ~9MB. (OK, most of that was images, unused fonts and cached files, but still...) The challenge of the limited resources on an ESP32 looks like fun...

@caco3
Collaborator

caco3 commented Dec 24, 2024

@phulstaert The project is in need of more contributors who are willing to deep-dive into the legacy code.
But we need active contributors, not people who just say they would do it better.

The project has grown a lot over time and, as you might know, this is never good for clean code.
jomjol did a great job when he started the project. But some things are IMO not state of the art anymore. E.g. the configuration has some (IMO) odd concepts. Settings can have a value or be unchecked. Unchecked means they are changed to comments and handled like a missing parameter. I have wanted to replace this for a long time, but it would be a compatibility break with how the config is used now. Because of this and the HUGE effort, we did not touch it.
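
To make the "unchecked means commented out" concept concrete, here is a minimal Python sketch of how such a line could be interpreted. The names Param and parse_param are hypothetical and the logic is an assumption based only on the description above, not the project's implementation.

```python
from typing import NamedTuple, Optional


class Param(NamedTuple):
    key: str
    value: str
    enabled: bool  # False when the line was commented out ("unchecked")


def parse_param(line: str) -> Optional[Param]:
    """Parse 'Key = value' or ';Key = value'; return None for anything else."""
    text = line.strip()
    enabled = not text.startswith(";")
    if not enabled:
        text = text[1:].lstrip()
    key, sep, value = text.partition("=")
    if not sep or not key.strip():
        return None  # section headers, positional lines, free-text comments
    return Param(key.strip(), value.strip(), enabled)


print(parse_param("LEDIntensity = 15"))
print(parse_param(";Zoom = false"))
print(parse_param(";Sharpness = "))
```

This shows why a migration that treats commented lines as parameters has to cope with empty values: an unchecked setting such as ";Sharpness = " is still a parameter line, just without a value.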

Also keep in mind that there are only a few automated tests for the algorithm but none for the overall software. In my paid job, I do DevOps and implement test systems for such hardware-near software projects. And believe me, it would be a huge job to implement tests. Especially since there are no clear concepts written down and the APIs sadly are inconsistent (again due to the growth over time).
There are no emulators for ESPs AFAIK. So I would have to set up a runner at my home. But I am not willing to maintain such a system long term, as it needs continuous maintenance. Keep in mind that we do it all for free and all we get is a few thank-yous and sometimes some donations!

I invest my time because I use the system myself, want to learn/try out new things and help others get it working. But I am not investing months of time to refactor the software just for fun.

The best way would be to start from scratch, write a good concept and implement it with proper APIs and guidelines. But unless somebody provides me a year's salary, I am not going to do it, as it works for me as is.

@phulstaert

If you are ok with it, I will start with a bit of cleanup in the sense of "consistency". And while I am going through the files, I can get a sense of how everything is structured. I will try to do it in small pull requests per topic. This is a nice opportunity to refresh the language a bit; I have been using Javascript/PHP exclusively for a while now, so some refreshing is needed...

Next I will suggest updating things in a certain order so we can make it better one part at a time. But this does indeed depend on how it is built.
Everything I suggest, I will do. Not just ideas and remarks, but also execution.

It would be nice if the code cleanup did not become an unending job, with new code and changes by others held to the same standards and conventions.


4 participants