
Docker key.json has invalid contents and the system refuses to boot #1706

Closed
mdcollins05 opened this issue Jan 6, 2022 · 18 comments
Labels
board/generic-x86-64 Generic x86-64 Boards (like Intel NUC) bug

Comments

@mdcollins05

mdcollins05 commented Jan 6, 2022

Describe the issue you are experiencing

I'm 2 for 2 now on this. Each reboot after an upgrade to the OS somehow leads to a corrupt key.json file on my Home Assistant OS install. Each time I have to boot via a recovery USB drive, mount the right partition (in this case /dev/sda7) and delete the /etc/docker/key.json file to fix the system. I don't intentionally manage this file in any way, nor do I care about the contents, so the Docker generated version is just fine with me.
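For anyone hitting the same thing, those recovery steps look roughly like this from a live/recovery USB (a sketch; /dev/sda7 was the overlay partition on my system and the mountpoint name is arbitrary, so adjust for yours):

# mount the HAOS overlay partition from the recovery environment
mkdir -p /mnt/haos-overlay
mount /dev/sda7 /mnt/haos-overlay
# key.json is regenerated by Docker on the next start, so deleting it is safe
rm /mnt/haos-overlay/etc/docker/key.json
umount /mnt/haos-overlay
reboot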

As a test, I fixed the file, booted into Home Assistant and then restarted back into my recovery USB to inspect the contents. They were valid JSON instead of the invalid content.

What operating system image do you use?

generic-x86-64 (Generic UEFI capable x86-64 systems)

What version of Home Assistant Operating System is installed?

6.6

Did you upgrade the Operating System?

Yes

Steps to reproduce the issue

  1. Update Home Assistant OS via supervisor in the web interface
  2. Home Assistant then fails to reboot into a valid system

Anything in the Supervisor logs that might be useful for us?

Nothing in there from the update or prior to the successful boot of my system.

Anything in the Host logs that might be useful for us?

Nothing

System Health information

System Health

version core-2021.11.1
installation_type Home Assistant OS
dev false
hassio true
docker true
user root
virtualenv false
python_version 3.9.7
os_name Linux
os_version 5.10.88
arch x86_64
timezone America/Los_Angeles
Home Assistant Community Store
GitHub API ok
Github API Calls Remaining 4890
Installed Version 1.16.0
Stage running
Available Repositories 932
Installed Repositories 9
Home Assistant Cloud
logged_in true
subscription_expiration January 13, 2022, 4:00 PM
relayer_connected true
remote_enabled false
remote_connected false
alexa_enabled true
google_enabled false
remote_server us-west-2-1.ui.nabu.casa
can_reach_cert_server ok
can_reach_cloud_auth ok
can_reach_cloud ok
Home Assistant Supervisor
host_os Home Assistant OS 7.1
update_channel stable
supervisor_version supervisor-2021.12.2
docker_version 20.10.9
disk_total 219.4 GB
disk_used 7.4 GB
healthy true
supported true
board generic-x86-64
supervisor_api ok
version_api ok
installed_addons chrony (2.2.1), Samba share (9.5.1), Z-Wave JS to MQTT (0.27.0), Node-RED (10.1.1), Terminal & SSH (9.2.1), Network UPS Tools (0.9.0)
keymaster
zwave_integration zwave_js
network_status on
Lovelace
dashboards 4
resources 6
views 8
mode storage

Additional information

This is the contents of the file (photo attachment: PXL_20220106_031713090).

And these are the log entries from journalctl (photo attachment: PXL_20220106_032525837).

@mdcollins05 mdcollins05 added the bug label Jan 6, 2022
@agners agners added the board/generic-x86-64 Generic x86-64 Boards (like Intel NUC) label Jan 6, 2022
@agners
Member

agners commented Jan 6, 2022

Hm, we don't really touch the bind-mounted overlay partition on update, so I can't really see how this is related to the update process. Also, it seems that Docker uses AtomicWriteFile to make sure the file gets written completely to disk before replacing it:

liusdu/moby@dcc1d2e#diff-2c9b2092e4c0945dfd6f1a67b07a560c472bb9d66a02f9d2333fb7f9d4b46eafR147

I'd call it a hardware issue, but it's a bit strange that the same file was affected twice (and no other, at least not that we know of...).
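For context, the atomic-write pattern referenced above amounts to "write a temp file on the same filesystem, flush it, then rename over the target". A rough shell sketch of the idea (not Docker's actual Go code; new_key_json is a placeholder for the generated contents):

# write the full new contents to a temporary file on the same filesystem
tmp=$(mktemp /etc/docker/.key.json.XXXXXX)
printf '%s' "$new_key_json" > "$tmp"
# flush the data to disk before replacing the old file
sync "$tmp"
# rename() atomically swaps the file in; readers see either the old or the new contents, never a partial write
mv "$tmp" /etc/docker/key.json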

@mdcollins05
Author

Yeah, it's definitely a strange one. I've had the box for a good number of years but the SSD was new at the time I installed Home Assistant OS on it (~4 months ago). I've had no other issues (no freezing, reboots, or any other file corruption) that I've noticed. It's only during an OS upgrade that I have an issue.

@tomwilkie

Just hit this issue too, updating to 5.7; it appears this person also had the issue: https://community.home-assistant.io/t/hassio-wont-start-failed-to-start-docker-application-container-engine/95407/31

The file seems to be pretty empty:

root@ubuntu:~/mnt/etc/docker# hexdump -C key.json 
00000000  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
*
000000f0  00 00 00 00                                       |....|
000000f4
root@ubuntu:~/mnt/etc/docker# file key.json
key.json: data
root@ubuntu:~/mnt/etc/docker# ls -alh key.json 
-rw------- 1 root root 244 Mar  2 18:53 key.json

@agners
Member

agners commented Apr 5, 2022

5.7 is rather old at this point 😅

Did the update and the reboot thereafter go through without problems?

@tomwilkie

Sorry, I meant 7.6 (I don't know where I got that number from). Post-update, the reboot failed; I booted off an Ubuntu stick and deleted the key.json, and then the reboot worked.

@agners
Member

agners commented Apr 5, 2022

A clean reboot should always cause that file to be properly synced to disk; it's really not clear to me why this can happen.

What kind of system are you using?

@mdcollins05
Author

mdcollins05 commented May 18, 2022

FWIW, I woke up to my Home Assistant box not responding, and after a reboot I hit this Docker key.json issue again. After fixing that I upgraded to OS 7.6 and did not run into the same problem. So it seems that a reboot after an upgrade exposes this issue but may not be the cause of it.

@vlycop

vlycop commented Jun 16, 2022

I also had this happen to me 2 days ago on a new install (a couple of weeks old) running the x86 version of the OS on a NUC.
Neither boot slot A nor B would start.

After the upgrade the device boot-looped forever, restarting right after the Docker service failed.
I managed to pull the full systemd journal and found this:

juin 15 09:25:33 ap-nuc-001 dockerd[444]: failed to start daemon: Error loading key file /etc/docker/key.json: unable to decode private key JWK: decoding JWK Private Key JSON data: invalid character '\x00' looking for beginning of value

The file, which isn't on the homeassistant_data partition but on one called "homeassistant_overlay" (I don't know what it's used for), was full of `\00\00\00\00\00\00....00` in one very long line.
Removing the file fixed the issue.

When asking for help on the Discord I found that another user had the same issue a week ago (@\MiKe on the Discord) and had to reinstall the NUC and restore a backup, not knowing what else to do.

mgolisch also found the cause of the reboot: https://github.com/home-assistant/operating-system/blob/dev/buildroot-external/rootfs-overlay/etc/systemd/system/docker.service.d/failure.conf
I think it is a terrible idea to boot loop forever, so fast that you can't read what is causing the issue, on a project made for low-skill users. IMHO this should be reverted to dropping into a rescue shell.

Lastly, I don't believe the update process itself is the cause here, as I didn't find anything pointing to it, but something is breaking the key.json file on many devices, and that breaks rebooting, which happens during an upgrade.

I was moving from Core to OS for better uptime and a quicker upgrade cycle as I am slowly leaving the house, but this incident and its apparently somewhat regular occurrence made me go back on that for now.

As I am not comfortable releasing the full journalctl file, from installation to crash, into the wild (I don't know what kind of sensitive information is in it), I won't upload it, but I'll keep it at the disposal of any maintainer who asks for it.

Sincerely

@mfattoru

As mentioned above, I'm having the same issue, which forces me to never upgrade the OS or reboot the host, for fear of a broken, inaccessible install.

Having at least the possibility to work with the CLI while a fix is found would be great.

@agners
Member

agners commented Jun 18, 2022

I think it is a terrible idea to boot loop forever, so fast that you can't read what is causing the issue, on a project made for low-skill users. IMHO this should be reverted to dropping into a rescue shell.

The boot loop is implemented so that the bootloader can switch to the other (presumably good) installation (HAOS has two OS installations, A and B; each upgrade updates the one that is not currently running). However, if the old, presumably good installation is not booting either, simply rebooting is indeed not helpful. Currently we don't detect that situation/behave accordingly.

This case is somewhat special in that data corruption in the shared overlay partition causes both installations (A and B) to fail.

The file, which isn't on the homeassistant_data partition but on one called "homeassistant_overlay" (I don't know what it's used for), was full of `\00\00\00\00\00\00....00` in one very long line.
Removing the file fixed the issue.

We use the overlay partition to make certain parts of /etc writable. It is a simple ext4 file system with some directories bind-mounted onto certain directories in the real /etc, like /etc/docker.

Now in your case it looks like corruption happened to that partition. However, as written in #1706 (comment), I don't really understand why that can happen in the first place: ext4 is a journaling file system, and Docker uses an atomic write for this particular file.

If that is indeed a more common problem, maybe we should sanity-check that file or something.
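To illustrate the layout described above, the overlay scheme amounts to roughly the following (a sketch only; the partition label shown is an assumption and these are not the literal HAOS mount units):

# the overlay partition is a plain ext4 filesystem mounted at /mnt/overlay
mount /dev/disk/by-label/hassos-overlay /mnt/overlay
# selected directories are then bind-mounted into the real /etc, e.g. for Docker:
mount --bind /mnt/overlay/etc/docker /etc/docker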

@mfattoru


In the last month, I've had a similar boot loop happen 3 times. I've always had to reflash my SSD, as I could not do much with the install, and the error itself was hard to find.
I'm not sure if other people have had this happen as well and, due to the complexity of finding the error, just decided to do a clean install, letting the error live on silently undetected.

As you pointed out, this might not be 100% related to the OS upgrade, as I've had it happen once outside the update process. But I can't give more specific details.

For now I'll try to avoid rebooting the host, and if a reboot happens, I will try to gather more details.

@vlycop

vlycop commented Jun 19, 2022

The boot loop is implemented so that the bootloader can switch to the other (presumably good) installation (HAOS has two OS installations, A and B; each upgrade updates the one that is not currently running). However, if the old, presumably good installation is not booting either, simply rebooting is indeed not helpful. Currently we don't detect that situation/behave accordingly.

This case is somewhat special in that data corruption in the shared overlay partition causes both installations (A and B) to fail.

OK, I understand the goal there, but in that case you need an extra GRUB entry for an emergency or "safe mode" shell.

Bringing a monitor and keyboard over to edit the boot command was way too involved to guide someone through on Discord, and so is booting a live CD in my mind.

That aside, I was going to suggest an ExecStartPre for the Docker service that checks the file and wipes it if needed, but I was bothered by the idea of hiding file corruption...

@agners
Member

agners commented Jun 24, 2022

One way to remove the corrupted key.json is to press e to edit the boot entry, go to the end of the 3rd line and add systemd.unit=rescue.target, then press F10. This should boot into a rescue shell. Enter it using Enter, then type rm /mnt/overlay/etc/docker/key.json.

agners added a commit that referenced this issue Jun 25, 2022
* Remove key.json file if it appears to be corrupted (#1706)

* Check with jq if key.json is parsable
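The check added there is essentially the following (a sketch of the approach, not the exact script that shipped; the path is the one used on HAOS per the comments above):

# treat key.json as corrupt if jq cannot parse it; Docker regenerates a valid one on the next start
if ! jq -e . /mnt/overlay/etc/docker/key.json > /dev/null 2>&1; then
    rm -f /mnt/overlay/etc/docker/key.json
fi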
@jbg

jbg commented Jul 22, 2022

I just hit this after an unclean shutdown. The debugging experience was really bad because of the boot loop.

On a hunch, based on the boot loop and not having found this issue yet, I grepped this repo for FailureAction=reboot and found that in docker.service, so I booted with systemd.mask=docker.service and tried to run dockerd manually, which exited with an error about key.json, which indeed was full of null bytes. I removed it and rebooted and all was back to normal.

There is a docker-failure script which seems to automate this, which I guess will be in the next release, but the messages it prints will never be seen on hardware that reboots quickly.

You could consider using StartLimitAction=reboot instead of FailureAction=reboot, with StartLimitIntervalSec and StartLimitBurst set accordingly, so systemd will try to start docker a few times, giving an operator a chance to see which service is failing and any output from the docker-failure script, before rebooting if it continues to fail to start. This also means that if the problem was a corrupted key.json, it will successfully start on the second try with no need for a reboot.
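As a sketch, such a drop-in for docker.service might look like this (values are illustrative, not a tested configuration; the unit also needs Restart= set for repeated start attempts to ever hit the limit):

# /etc/systemd/system/docker.service.d/failure.conf (illustrative replacement for FailureAction=reboot)
[Unit]
# allow a few failed starts within a window before falling back to a reboot
StartLimitIntervalSec=120
StartLimitBurst=3
StartLimitAction=reboot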

I don't know much about docker, but does key.json actually need to be persisted? If it's just a key for docker clients to communicate with dockerd, why not just generate a random one before docker starts on each boot?

@agners
Member

agners commented Sep 12, 2022

With HAOS 9.0 an invalid key file will be detected; see #1988.

@mfattoru

Thanks for your work

@MasterCATZ

MasterCATZ commented Nov 27, 2022

I believe I just fell victim to this as well, and with HAOS 9.3!!!

How do I correct the corrupted key.json?

/etc/docker/daemon.json

is the only file?

{
    "storage-driver": "overlay2",
    "log-driver": "journald",
    "experimental": true,
    "ip6tables": true,
    "log-opts": {
        "tag": "{{.Name}}"
    },
    "data-root": "/mnt/data/docker",
    "deprecated-key-path": "/mnt/overlay/etc/docker/key.json"
}

and it still boot loops even when using

systemd.unit=rescue.target

@agners
Member

agners commented Nov 28, 2022

I believe I just fell victim to this as well, and with HAOS 9.3!!!

How do I correct the corrupted key.json?

You can just delete the file at /mnt/overlay/etc/docker/key.json

and it still boot loops even when using systemd.unit=rescue.target

Hm, that then sounds like a different problem. systemd.unit=rescue.target really should not attempt to start the Docker daemon, and therefore should not cause a boot loop. Can you share the boot log from a rescue.target boot?
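One way to grab that log, assuming the journal of the failing boot was persisted (a sketch; the output path is just an example):

# from the rescue shell, dump the previous boot's journal to a file for sharing
journalctl -b -1 --no-pager > /mnt/data/failed-boot.log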
