Node stuck in boot loop on Talos v1.8.0 upgrade #9369

Closed
chrede88 opened this issue Sep 24, 2024 · 28 comments

@chrede88

Bug Report

While upgrading my cluster to Talos v1.8.0, my first node (haven't tried upgrading the other two) is stuck in a boot loop. My machine gets past the initial Linux boot splash screen (showing the number of cores etc.), but then reboots just before reaching the dashboard.

Environment

  • Talos version: v1.8.0

  • Kubernetes version:
    Client Version: v1.31.1
    Kustomize Version: v5.4.2
    Server Version: v1.30.3

  • Platform:
    Intel NUC 12

@smira
Member

smira commented Sep 24, 2024

Thanks for reporting this, but we'll need some kernel logs to understand what is wrong.

You can also try adding panic=30 to the kernel args via the GRUB prompt (https://www.talos.dev/v1.8/reference/kernel/#panic) to make it pause before rebooting (if it's actually a panic).

@chrede88
Author

chrede88 commented Sep 24, 2024

How do I get any kernel logs if the machine never boots?

I've added panic=30 in GRUB. It makes no difference.
I also can't boot into maintenance mode.

@smira
Member

smira commented Sep 24, 2024

You have a GRUB menu where you can select the previous version of Talos if you did an upgrade.

There's no boot-to-maintenance-mode option; I'm not sure what you're referring to.

@chrede88
Author

chrede88 commented Sep 24, 2024

Yeah, unfortunately I booted to the previous version (v1.7.6) and tried again. This was a bad decision, as I now don't have that option anymore. It did boot into "the old" version just fine and came back to a healthy state. But in any case, I can't get any kernel logs for v1.8.0 from v1.7.6, right?

The boot menu has a "reset and return to maintenance mode" option.

@chrede88
Author

IMG_2046.mov

Maybe this short video of the boot process can help shed some light on the issue?

@PGimenez

I'm having the same issue on a Minisforum MS-01. I can run 1.7.6 just fine, but if I burn the 1.8 ISO to a USB stick and boot from it, it gets stuck in a boot loop.

CleanShot 2024-09-24 at 23 23 17

I also enabled the panic=0 option, but it does nothing.

IMG_6225

Here's a video of the boot process

fff.mov

@smira
Member

smira commented Sep 25, 2024

I'm not quite sure what this might be; I guess only a serial console can help here.

It seems to panic around the device detection process. It might be a bug in Linux that will be fixed in follow-up releases.

@chrede88
Author

@PGimenez people in the Home Operations Discord report that they needed to add the i915 and intel-mei kernel extensions for the MS-01. You can use the Talos Linux Image Factory to create custom upgrade images.

@smira do you think this could also be my issue? The MS-01 runs either 13th gen or 12th gen Intel CPUs. Mine is a 12th gen.

@smira
Member

smira commented Sep 26, 2024

I'm still confused about why it would reboot: not having drivers for i915, and even more so for intel-mei, shouldn't lead to a reboot.

@chrede88
Author

I'm just relaying information. It may very well be that these two drivers aren't related to the issue at all.

@tpretz

tpretz commented Sep 26, 2024

Had the same issue with an Intel N100 device.
Using a 1.8.0 image with the previously mentioned drivers allowed it to boot.

@kondanta

Had the same issue with the same hardware as @tpretz. Using the drivers mentioned above also allowed me to boot into the system.

@smira
Member

smira commented Sep 27, 2024

If anyone could submit the logs from the successful boot (talosctl dmesg) + (talosctl get pcidevices), that would be perfect, thank you!
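
For anyone following along, a minimal sketch of collecting both outputs into files. The function name collect_logs, the node address 192.0.2.10 and the output file names are placeholders chosen for illustration; only the two talosctl commands come from this comment.

```shell
# Sketch only: save the kernel log and the PCI device list of one node.
# $1 is the node IP (placeholder; use your own node address).
collect_logs() {
  # Kernel ring buffer of the node.
  talosctl --nodes "$1" dmesg > dmesg.log
  # PCI devices as reported by Talos.
  talosctl --nodes "$1" get pcidevices > pcidevices.txt
}

# Example call (hypothetical address):
#   collect_logs 192.0.2.10
```

Both files are written to the current directory and can then be attached to the issue.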

@tpretz

tpretz commented Sep 28, 2024

sure, let me know if you need anything else
dmesg.log
pcidevices.txt

@nathanpaul

I also had the same issue and resolved it with the drivers specified in this thread.

@hikkoiri

hikkoiri commented Sep 28, 2024

I can confirm that building your own image on https://factory.talos.dev with the extensions

siderolabs/i915-ucode (20240909) - This system extension provides Intel GPU microcode binaries.
siderolabs/mei (v1.8.0) - This system extension provides Intel Management Engine drivers kernel modules built against a specific Talos version. This driver enables the Intel Management Engine, a prerequisite for Intel Arc discrete GPUs.

fixed the boot loop issue on my 12th gen Alder Lake N100.

EDIT:
Using the custom image from a USB stick as the boot medium works. The OS starts in maintenance mode. After applying the config with talosctl apply-config and removing the installation medium, the version installed on the configured hard drive is now stuck in the boot loop. It seems these additional extensions were not passed on.

@chrede88
Author

> I can confirm that building your own image on https://factory.talos.dev with the extensions
>
> siderolabs/i915-ucode (20240909) - This system extension provides Intel GPU microcode binaries.
> siderolabs/mei (v1.8.0) - This system extension provides Intel Management Engine drivers kernel modules built against a specific Talos version. This driver enables the Intel Management Engine, a prerequisite for Intel Arc discrete GPUs.
>
> fixed the boot loop issue on my 12th gen Alder Lake N100.
>
> EDIT: Using the custom image from a USB stick as the boot medium works. The OS starts in maintenance mode. After applying the config with talosctl apply-config and removing the installation medium, the version installed on the configured hard drive is now stuck in the boot loop. It seems these additional extensions were not passed on.

Maybe you didn't pass your custom image to the machine config? If you don't specify an install image, it'll use the latest standard image, i.e. vanilla v1.8.0.
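
As a rough illustration, the install image is set under machine.install.image in the machine config. The sketch below writes a small patch file with that field; the function name write_install_patch, the file name and the schematic ID are placeholders, and the schematic ID has to be the one the Image Factory gave you for your extensions.

```shell
# Sketch only: write a machine config patch that pins the install image.
# $1 = Image Factory schematic ID (placeholder), $2 = Talos version, $3 = output file.
write_install_patch() {
  cat > "$3" <<EOF
machine:
  install:
    image: factory.talos.dev/installer/$1:$2
EOF
}

# Example call (hypothetical schematic ID):
#   write_install_patch "<schematic-id>" v1.8.0 install-image-patch.yaml
```

The resulting file can be merged into the machine config before it is applied, for example with the --config-patch option of talosctl gen config.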

@hikkoiri

Ahh, I see. Well, yes that fixed it for me. Thanks! :)

@tom130

tom130 commented Sep 28, 2024

Same here, I also forgot to specify the custom image with extensions. Maybe talosctl upgrade could warn users not to use the vanilla image if the node already has extensions.
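
Until something like that exists, a manual check before upgrading is possible. This is a sketch under assumptions: the function names, node address and schematic ID are placeholders, and the installer reference follows the factory.talos.dev/installer/<schematic-id>:<version> form.

```shell
# Sketch only: list the extensions currently installed on a node.
# $1 = node IP (placeholder).
list_extensions() {
  talosctl --nodes "$1" get extensions
}

# Sketch only: upgrade a node with an Image Factory installer image
# instead of the vanilla one.
# $1 = node IP, $2 = schematic ID (placeholder), $3 = Talos version.
upgrade_with_extensions() {
  talosctl upgrade --nodes "$1" --image "factory.talos.dev/installer/$2:$3"
}

# Example calls (hypothetical values):
#   list_extensions 192.0.2.10
#   upgrade_with_extensions 192.0.2.10 "<schematic-id>" v1.8.0
```

If list_extensions shows entries, the upgrade image should come from a schematic that includes the same extensions.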

@chrede88
Author

I finally got around to trying v1.8.0 with the two extra drivers. It works like a charm for me too. I'm also including the dmesg and PCI device output, after a successful bootstrap, for my three nodes.
talos_v180_logs.zip

I'll close this issue for now, as adding i915 and intel-mei drivers to the talos image seems to work for everyone. @smira feel free to reopen if you need more info.

UnknownBlunders added a commit to UnknownBlunders/K8s that referenced this issue Oct 1, 2024
Talos bootloops on my control plane nodes without these extensions
Didn't test without mei extension on workers, but they're similar enough

see: siderolabs/talos#9369
@smira
Member

smira commented Oct 2, 2024

Thank you, it feels like i915 without firmware might lead to the reboot because fbcon initialization fails? (just guessing here)

@tzabbi

tzabbi commented Oct 9, 2024

Why was the fix not included in the new version 1.8.1?

@chrede88
Author

chrede88 commented Oct 9, 2024

> Why was the fix not included in the new version 1.8.1?

As far as I know, this is not considered a Talos "issue". The drivers were probably dropped from the new Linux kernel (just guessing here), which means you'll have to add them using the Talos Image Factory. Just add the Intel i915 and Intel mei drivers (I'm not actually sure you need the mei driver) in the GUI and use the custom image URL provided at the end. This method works.

@tzabbi

tzabbi commented Oct 9, 2024

Okay, thanks for the update. I'm only asking because it would be very handy to have it in the default image, so we don't have to specify and build it ourselves every time.

@chrede88
Author

chrede88 commented Oct 9, 2024

> Okay, thanks for the update. I'm only asking because it would be very handy to have it in the default image, so we don't have to specify and build it ourselves every time.

Right, agreed. This is caused by the Talos team's decision not to build a lot of different images (as per the v1.8.0 release notes).

@tzabbi

tzabbi commented Oct 10, 2024

> > Why was the fix not included in the new version 1.8.1?
>
> As far as I know, this is not considered a Talos "issue". The drivers were probably dropped from the new Linux kernel (just guessing here), which means you'll have to add them using the Talos Image Factory. Just add the Intel i915 and Intel mei drivers (I'm not actually sure you need the mei driver) in the GUI and use the custom image URL provided at the end. This method works.

You are right. It works with only the Intel i915 extension.

@gerhard

gerhard commented Oct 20, 2024

This fixed my issue too, thanks @rothgar for mentioning this 💪

This is the second issue that I've hit when upgrading Talos from v1.7 to v1.8

@devzerops

devzerops commented Nov 20, 2024

If you boot from USB, the image must include the above extensions in order to boot.
When upgrading a control-plane or worker node using talosctl, the upgrade image must also include these extensions; otherwise the node ends up in an infinite boot loop.

customization:
    systemExtensions:
        officialExtensions:
            - siderolabs/i915-ucode
            - siderolabs/mei

As of now, you need to go to https://factory.talos.dev/ and boot from a USB stick whose image includes the above extensions. You should also use that same custom image for the installation itself. This approach resolves the infinite boot loop issue.
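
The same can be done without the web UI. A minimal sketch, assuming the Image Factory's schematics endpoint (POST to https://factory.talos.dev/schematics, which answers with JSON containing an "id" field) and its metal-amd64.iso download path; all function names and the file name are made up for illustration.

```shell
# Sketch only: write the schematic from this comment to a file ($1).
write_schematic() {
  cat > "$1" <<'EOF'
customization:
  systemExtensions:
    officialExtensions:
      - siderolabs/i915-ucode
      - siderolabs/mei
EOF
}

# Submit the schematic file ($1) to the Image Factory; prints the JSON response.
submit_schematic() {
  curl -sS -X POST --data-binary @"$1" https://factory.talos.dev/schematics
}

# Pull the value of the "id" field out of the JSON response read from stdin.
schematic_id_from_json() {
  sed -n 's/.*"id"[[:space:]]*:[[:space:]]*"\([^"]*\)".*/\1/p'
}

# Build the ISO download URL for a schematic ID ($1) and Talos version ($2).
iso_url() {
  printf 'https://factory.talos.dev/image/%s/%s/metal-amd64.iso\n' "$1" "$2"
}

# Example flow (needs network access):
#   write_schematic schematic.yaml
#   id="$(submit_schematic schematic.yaml | schematic_id_from_json)"
#   iso_url "$id" v1.8.0
```

The same schematic ID is what goes into the installer image reference used for installation and upgrades.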
