Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

After a reboot, the node came up in maintenance mode. #9927

Open
chumkaska opened this issue Dec 11, 2024 · 1 comment
Open

After a reboot, the node came up in maintenance mode. #9927

chumkaska opened this issue Dec 11, 2024 · 1 comment

Comments

@chumkaska
Copy link

Bug Report

Description

There is a Talos cluster with three nodes. One of the nodes froze: it stopped responding on port 50k, the network card was responsive, but VNC showed that the system was completely stuck. After a reboot, the node came back up and rejoined the cluster. However, etcd was dead on the second node. Logs indicated that the API server was also not working.

CNI logs revealed widespread issues. Linstor on this node was also not functioning. After a reboot, the node came up in maintenance mode. I applied the configuration, and everything started working again, but what could have caused this behavior?

The main question is: why did the Talos node come up in maintenance mode?
Talos v1.8.1

Logs

Environment

  • Talos version: v1.8.1
  • Kubernetes version: 1.30
  • Platform:
@smira
Copy link
Member

smira commented Dec 12, 2024

You need to capture the kernel logs on boot and see what happened.

There are two possibilities:

  • disk got corrupted bad enough that STATE partition is missing now, or the machine config is gone (unlikely, but still)
  • there's a race with udevd reporting devices as ready, and the moment Talos looks for STATE. This is specifically true for some RAID controllers. In that case, try setting https://www.talos.dev/v1.8/reference/kernel/#talosdevicesettle_time

To understand better, follow the Talos logs for two most important pieces:

  • the moment your disk is reported by the kernel, e.g. sda appears in the logs
  • the moment Talos reports a message from VolumeManagerController about a transition of STATE volume between phases

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants