After a reboot, the node came up in maintenance mode. #9927

chumkaska · 2024-12-11T18:08:21Z

Bug Report

Description

There is a Talos cluster with three nodes. One of the nodes froze: it stopped responding on port 50000 (the Talos API), and although the network card still responded, VNC showed that the system was completely stuck. After a reboot, the node came back up and rejoined the cluster. However, etcd was dead on the second node. Logs indicated that the API server was also not working.

CNI logs revealed widespread issues. Linstor on this node was also not functioning. After a reboot, the node came up in maintenance mode. I applied the configuration, and everything started working again, but what could have caused this behavior?

The main question is: why did the Talos node come up in maintenance mode?
Talos v1.8.1

Logs

Environment

Talos version: v1.8.1
Kubernetes version: 1.30
Platform:

smira · 2024-12-12T10:30:37Z

You need to capture the kernel logs on boot and see what happened.

There are two possibilities:

- The disk got corrupted badly enough that the STATE partition is now missing, or the machine config is gone (unlikely, but still possible).
- There's a race between udevd reporting devices as ready and the moment Talos looks for STATE. This is known to happen with some RAID controllers in particular. In that case, try setting the `talos.device.settle_time` kernel parameter: https://www.talos.dev/v1.8/reference/kernel/#talosdevicesettle_time

To understand this better, look for the two most important events in the Talos logs:

- the moment your disk is reported by the kernel, e.g. sda appears in the logs
- the moment Talos reports a message from VolumeManagerController about the STATE volume transitioning between phases
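One way to compare those two events is to save the boot log to a file and filter it. The sketch below is only an illustration: the function name `find_boot_events` is hypothetical, the disk name defaults to `sda` as in the example above, and it matches on plain substrings because the exact log format is not shown in this thread.

```python
import re
from typing import Iterable, List, Tuple


def find_boot_events(lines: Iterable[str], disk: str = "sda") -> List[Tuple[int, str, str]]:
    """Return (line_number, kind, text) for log lines relevant to the STATE race.

    kind is "disk" when the line mentions the disk name as a whole word,
    and "state-volume" when it mentions both VolumeManagerController and STATE.
    The matching is a guess at what the log lines contain, not a known format.
    """
    disk_re = re.compile(rf"\b{re.escape(disk)}\b")
    events = []
    for number, raw in enumerate(lines, start=1):
        text = raw.rstrip("\n")
        if "VolumeManagerController" in text and "STATE" in text:
            events.append((number, "state-volume", text))
        elif disk_re.search(text):
            events.append((number, "disk", text))
    return events
```

Feed it the lines of a captured log (from the serial console, or from `talosctl dmesg` once the node is reachable) and check the order of the results. If the first "state-volume" line appears before the first "disk" line, that would be consistent with the udevd race described above.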
