
Docker key.json has invalid contents and the system refuses to boot #1706

Closed
mdcollins05 opened this issue Jan 6, 2022 · 18 comments
Labels
board/generic-x86-64 Generic x86-64 Boards (like Intel NUC) bug

Comments

@mdcollins05

mdcollins05 commented Jan 6, 2022

Describe the issue you are experiencing

I'm 2 for 2 now on this. Each reboot after an upgrade to the OS somehow leads to a corrupt key.json file on my Home Assistant OS install. Each time I have to boot via a recovery USB drive, mount the right partition (in this case /dev/sda7) and delete the /etc/docker/key.json file to fix the system. I don't intentionally manage this file in any way, nor do I care about the contents, so the Docker generated version is just fine with me.
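For anyone hitting the same thing, those recovery steps look roughly like this from a live/recovery USB (a sketch; /dev/sda7 was the overlay partition on my system and the mountpoint name is arbitrary, so adjust for yours):

# mount the HAOS overlay partition from the recovery environment
mkdir -p /mnt/haos-overlay
mount /dev/sda7 /mnt/haos-overlay
# key.json is regenerated by Docker on the next start, so deleting it is safe
rm /mnt/haos-overlay/etc/docker/key.json
umount /mnt/haos-overlay
reboot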

As a test, I fixed the file, booted into Home Assistant and then restarted back into my recovery USB to inspect the contents. They were valid JSON instead of the invalid content.

What operating system image do you use?

generic-x86-64 (Generic UEFI capable x86-64 systems)

What version of Home Assistant Operating System is installed?

6.6

Did you upgrade the Operating System?

Yes

Steps to reproduce the issue

  1. Update Home Assistant OS via supervisor in the web interface
  2. Home Assistant then fails to reboot into a valid system

Anything in the Supervisor logs that might be useful for us?

Nothing in there from the update or prior to the successful boot of my system.

Anything in the Host logs that might be useful for us?

Nothing

System Health information

System Health

version core-2021.11.1
installation_type Home Assistant OS
dev false
hassio true
docker true
user root
virtualenv false
python_version 3.9.7
os_name Linux
os_version 5.10.88
arch x86_64
timezone America/Los_Angeles
Home Assistant Community Store
GitHub API ok
Github API Calls Remaining 4890
Installed Version 1.16.0
Stage running
Available Repositories 932
Installed Repositories 9
Home Assistant Cloud
logged_in true
subscription_expiration January 13, 2022, 4:00 PM
relayer_connected true
remote_enabled false
remote_connected false
alexa_enabled true
google_enabled false
remote_server us-west-2-1.ui.nabu.casa
can_reach_cert_server ok
can_reach_cloud_auth ok
can_reach_cloud ok
Home Assistant Supervisor
host_os Home Assistant OS 7.1
update_channel stable
supervisor_version supervisor-2021.12.2
docker_version 20.10.9
disk_total 219.4 GB
disk_used 7.4 GB
healthy true
supported true
board generic-x86-64
supervisor_api ok
version_api ok
installed_addons chrony (2.2.1), Samba share (9.5.1), Z-Wave JS to MQTT (0.27.0), Node-RED (10.1.1), Terminal & SSH (9.2.1), Network UPS Tools (0.9.0)
keymaster
zwave_integration zwave_js
network_status on
Lovelace
dashboards 4
resources 6
views 8
mode storage

Additional information

This is the contents of the file (photo attachment: PXL_20220106_031713090).

And these are the log entries from journalctl (photo attachment: PXL_20220106_032525837).

@mdcollins05 mdcollins05 added the bug label Jan 6, 2022
@agners agners added the board/generic-x86-64 Generic x86-64 Boards (like Intel NUC) label Jan 6, 2022
@agners
Member

agners commented Jan 6, 2022

Hm, we don't really touch the bind-mounted overlay partition on update, so I can't really see how this is related to the update process. Also, it seems that Docker uses AtomicWriteFile to make sure the file gets written completely to disk before replacing it:

liusdu/moby@dcc1d2e#diff-2c9b2092e4c0945dfd6f1a67b07a560c472bb9d66a02f9d2333fb7f9d4b46eafR147

I'd call it a hardware issue, but it's a bit strange that the same file was affected twice (and no other, at least not that we know of...).
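For context, the atomic-write pattern referenced above amounts to "write a temp file on the same filesystem, flush it, then rename over the target". A rough shell sketch of the idea (not Docker's actual Go code; new_key_json is a placeholder for the generated contents):

# write the full new contents to a temporary file on the same filesystem
tmp=$(mktemp /etc/docker/.key.json.XXXXXX)
printf '%s' "$new_key_json" > "$tmp"
# flush the data to disk before replacing the old file
sync "$tmp"
# rename() atomically swaps the file in; readers see either the old or the new contents, never a partial write
mv "$tmp" /etc/docker/key.json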

@mdcollins05
Author

Yeah, it's definitely a strange one. I've had the box for a good number of years but the SSD was new at the time I installed Home Assistant OS on it (~4 months ago). I've had no other issues (no freezing, reboots, or any other file corruption) that I've noticed. It's only during an OS upgrade that I have an issue.

@tomwilkie

Just hit this issue too, updating to 5.7; it appears this person also had the issue: https://community.home-assistant.io/t/hassio-wont-start-failed-to-start-docker-application-container-engine/95407/31

The file seems to be pretty empty:

root@ubuntu:~/mnt/etc/docker# hexdump -C key.json 
00000000  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
*
000000f0  00 00 00 00                                       |....|
000000f4
root@ubuntu:~/mnt/etc/docker# file key.json
key.json: data
root@ubuntu:~/mnt/etc/docker# ls -alh key.json 
-rw------- 1 root root 244 Mar  2 18:53 key.json

@agners
Member

agners commented Apr 5, 2022

5.7 is rather old at this point 😅

Did the update and the reboot thereafter go through without problems?

@tomwilkie

Sorry, I meant 7.6 (I don't know where I got that number from). Post-update, the reboot failed; I booted off an Ubuntu stick and deleted the key.json, and then the reboot worked.

@agners
Member

agners commented Apr 5, 2022

A clean reboot should always cause that file to be properly synced to disk; it's really not clear to me why this can happen.

What kind of system are you using?

@mdcollins05
Author

mdcollins05 commented May 18, 2022

FWIW, I woke up to my Home Assistant box not responding, and after a reboot I hit this Docker key.json issue again. After fixing that I upgraded to OS 7.6 and did not run into the same problem. So it seems that a reboot after an upgrade exposes this issue but may not be the cause of it.

@vlycop

vlycop commented Jun 16, 2022

I also had this happen to me 2 days ago on a new install (a couple of weeks old) running the x86 version of the OS on a NUC.
Neither boot slot A nor B would start.

After the upgrade the device boot-looped forever, restarting right after the Docker service failed.
I managed to pull the full systemd journal and found this:

juin 15 09:25:33 ap-nuc-001 dockerd[444]: failed to start daemon: Error loading key file /etc/docker/key.json: unable to decode private key JWK: decoding JWK Private Key JSON data: invalid character '\x00' looking for beginning of value

The file, which isn't on the homeassistant_data partition but on one called "homeassistant_overlay" (I don't know what it's used for), was full of `\00\00\00\00\00\00....00` in one very long line.
Removing the file fixed the issue.

When asking for help on the Discord I found that another user had the same issue a week ago (@\MiKe on the Discord) and had to reinstall the NUC and restore a backup, not knowing what else to do.

mgolisch also found the cause of the reboot: https://github.com/home-assistant/operating-system/blob/dev/buildroot-external/rootfs-overlay/etc/systemd/system/docker.service.d/failure.conf
I think it is a terrible idea to boot loop forever, so fast that you can't read what is causing the issue, on a project made for low-skill users. IMHO this should be reverted to dropping into a rescue shell.

Lastly, I don't believe the update process itself is the cause here, as I didn't find anything pointing to it, but something is breaking the key.json file on many devices, and that breaks rebooting, which happens during an upgrade.

I was moving from Core to OS for better uptime and a quicker upgrade cycle as I am slowly leaving the house, but this incident and its apparently somewhat regular occurrence made me go back on that for now.

As I am not comfortable releasing the full journalctl file, from installation to crash, into the wild (I don't know what kind of sensitive information is in it), I won't upload it, but I'll keep it at the disposal of any maintainer who asks for it.

Sincerely

@mfattoru

As mentioned above, I'm having the same issue, which forces me to never upgrade the OS or reboot the host, for fear of a broken, inaccessible install.

Having at least the possibility to work with the CLI while a fix is found would be great.

@agners
Member

agners commented Jun 18, 2022

I think it is a terrible idea to boot loop forever, so fast that you can't read what is causing the issue, on a project made for low-skill users. IMHO this should be reverted to dropping into a rescue shell.

The boot loop is implemented so that the bootloader can switch to the other (presumably good) installation (HAOS has two OS installations, A and B; each upgrade updates the one that is not currently running). However, if the old, presumably good installation is not booting either, simply rebooting is indeed not helpful. Currently we don't detect that situation/behave accordingly.

This case is somewhat special in that data corruption in the shared overlay partition causes both installations (A and B) to fail.

The file, which isn't on the homeassistant_data partition but on one called "homeassistant_overlay" (I don't know what it's used for), was full of `\00\00\00\00\00\00....00` in one very long line.
Removing the file fixed the issue.

We use the overlay partition to make certain parts of /etc writable. It is a simple ext4 file system with some directories bind-mounted onto certain directories in the real /etc, like /etc/docker.

Now in your case it looks like corruption happened to that partition. However, as written in #1706 (comment), I don't really understand why that can happen in the first place: ext4 is a journaling file system, and Docker uses an atomic write for this particular file.

If that is indeed a more common problem, maybe we should sanity-check that file or something.
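To illustrate the layout described above, the overlay scheme amounts to roughly the following (a sketch only; the partition label shown is an assumption and these are not the literal HAOS mount units):

# the overlay partition is a plain ext4 filesystem mounted at /mnt/overlay
mount /dev/disk/by-label/hassos-overlay /mnt/overlay
# selected directories are then bind-mounted into the real /etc, e.g. for Docker:
mount --bind /mnt/overlay/etc/docker /etc/docker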

@mfattoru


In the last month, I've had a similar boot loop happen 3 times. I've always had to reflash my SSD, as I could not do much with the install, and the error itself was hard to find.
I'm not sure if other people have had this happen as well and, due to the complexity of finding the error, just decided to do a clean install, letting the error live on silently undetected.

As you pointed out, this might not be 100% related to the OS upgrade, as I've had it happen once outside the update process. But I can't give more specific details.

For now I'll try to avoid rebooting the host, and if a reboot happens, I will try to gather more details.

@vlycop

vlycop commented Jun 19, 2022

The boot loop is implemented so that the bootloader can switch to the other (presumably good) installation (HAOS has two OS installations, A and B; each upgrade updates the one that is not currently running). However, if the old, presumably good installation is not booting either, simply rebooting is indeed not helpful. Currently we don't detect that situation/behave accordingly.

This case is somewhat special in that data corruption in the shared overlay partition causes both installations (A and B) to fail.

OK, I understand the goal there, but in that case you need an extra GRUB entry for an emergency or "safe mode" shell.

Bringing a monitor and keyboard over to edit the boot command was way too involved to guide someone through on Discord, and so is booting a live CD in my mind.

That aside, I was going to suggest an ExecStartPre for the Docker service that checks the file and wipes it if needed, but I was bothered by the idea of hiding file corruption...

@agners
Member

agners commented Jun 24, 2022

One way to remove the corrupted key.json is to press e to edit the boot entry, go to the end of the 3rd line and add systemd.unit=rescue.target, then press F10. This should boot into a rescue shell. Enter it using Enter, then type rm /mnt/overlay/etc/docker/key.json.

agners added a commit that referenced this issue Jun 25, 2022
* Remove key.json file if it appears to be corrupted (#1706)

* Check with jq if key.json is parsable
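The check added there is essentially the following (a sketch of the approach, not the exact script that shipped; the path is the one used on HAOS per the comments above):

# treat key.json as corrupt if jq cannot parse it; Docker regenerates a valid one on the next start
if ! jq -e . /mnt/overlay/etc/docker/key.json > /dev/null 2>&1; then
    rm -f /mnt/overlay/etc/docker/key.json
fi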
@jbg

jbg commented Jul 22, 2022

I just hit this after an unclean shutdown. The debugging experience was really bad because of the boot loop.

On a hunch, based on the boot loop and not having found this issue yet, I grepped this repo for FailureAction=reboot and found that in docker.service, so I booted with systemd.mask=docker.service and tried to run dockerd manually, which exited with an error about key.json, which indeed was full of null bytes. I removed it and rebooted and all was back to normal.

There is a docker-failure script which seems to automate this, which I guess will be in the next release, but the messages it prints will never be seen on hardware that reboots quickly.

You could consider using StartLimitAction=reboot instead of FailureAction=reboot, with StartLimitIntervalSec and StartLimitBurst set accordingly, so systemd will try to start docker a few times, giving an operator a chance to see which service is failing and any output from the docker-failure script, before rebooting if it continues to fail to start. This also means that if the problem was a corrupted key.json, it will successfully start on the second try with no need for a reboot.
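As a sketch, such a drop-in for docker.service might look like this (values are illustrative, not a tested configuration; the unit also needs Restart= set for repeated start attempts to ever hit the limit):

# /etc/systemd/system/docker.service.d/failure.conf (illustrative replacement for FailureAction=reboot)
[Unit]
# allow a few failed starts within a window before falling back to a reboot
StartLimitIntervalSec=120
StartLimitBurst=3
StartLimitAction=reboot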

I don't know much about docker, but does key.json actually need to be persisted? If it's just a key for docker clients to communicate with dockerd, why not just generate a random one before docker starts on each boot?

@agners
Member

agners commented Sep 12, 2022

With HAOS 9.0 an invalid key file will be detected; see #1988.

@mfattoru

Thanks for your work

@MasterCATZ

MasterCATZ commented Nov 27, 2022

I believe I just fell victim to this as well, and with HAOS 9.3!!!

How do I correct the corrupted key.json?

/etc/docker/daemon.json

is the only file?

{
    "storage-driver": "overlay2",
    "log-driver": "journald",
    "experimental": true,
    "ip6tables": true,
    "log-opts": {
        "tag": "{{.Name}}"
    },
    "data-root": "/mnt/data/docker",
    "deprecated-key-path": "/mnt/overlay/etc/docker/key.json"
}

and it still boot loops even when using

systemd.unit=rescue.target

@agners
Member

agners commented Nov 28, 2022

I believe I just fell victim to this as well, and with HAOS 9.3!!!

How do I correct the corrupted key.json?

You can just delete the file at /mnt/overlay/etc/docker/key.json

and it still boot loops even when using systemd.unit=rescue.target

Hm, that then sounds like a different problem. systemd.unit=rescue.target really should not attempt to start the Docker daemon, and therefore should not cause a boot loop. Can you share the boot log from a rescue.target boot?
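One way to grab that log, assuming the journal of the failing boot was persisted (a sketch; the output path is just an example):

# from the rescue shell, dump the previous boot's journal to a file for sharing
journalctl -b -1 --no-pager > /mnt/data/failed-boot.log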
