Worker node not booting (segfault) after upgrade to 1.7.1 #8753

Closed
jonkerj opened this issue May 17, 2024 · 2 comments · Fixed by #8772

@jonkerj
Contributor

jonkerj commented May 17, 2024

Bug Report

I've upgraded my CP from 1.6.7 to 1.7.1 (which went fine), and after upgrading the first of my worker nodes, Talos is segfaulting.

Description

My nodes (and their machine config) were created 3y40d ago, so my guess is that this may be caused by legacy settings still present in the machine config (MC).

Logs

As it is a virtual machine (qemu x86_64), I was able to catch messages over serial console:

[  493.462887] [talos] controller failed {"component": "controller-runtime", "controller": "secrets.RootOSController", "error": "controller \"secrets.RootOSController\" panicked: runtime error: invalid memory address or nil pointer dereference\n\ngoroutine 398 [running]:\nruntime/debug.Stack()\n\t/toolchain/go/src/runtime/debug/stack.go:24 +0x5e\ngithub.com/cosi-project/runtime/pkg/controller/runtime/internal/rruntime.(*Adapter).runOnce.func2()\n\t/.cache/mod/github.com/cosi-project/runtime@v0.4.1/pkg/controller/runtime/internal/rruntime/run.go:67 +0x58\npanic({0x2bc5be0?, 0x526eeb0?})\n\t/toolchain/go/src/runtime/panic.go:770 +0x132\ngithub.com/siderolabs/talos/internal/app/machined/pkg/controllers/secrets.NewRootOSController.func1({0xc000b74c08?, 0x40f2ff?}, {0x2e51380?, 0x30ac3c0?}, 0xc0017db501?, 0x36741e0?, 0xc0007fe360)\n\t/src/internal/app/machined/pkg/controllers/secrets/root.go:170 +0x1dc\ngithub.com/cosi-project/runtime/pkg/controller/generic/transform.NewControl...

Formatted, the trace would look like this:

runtime error: invalid memory address or nil pointer dereference

goroutine 398 [running]:
runtime/debug.Stack()
        /toolchain/go/src/runtime/debug/stack.go:24 +0x5e
github.com/cosi-project/runtime/pkg/controller/runtime/internal/rruntime.(*Adapter).runOnce.func2()
        /.cache/mod/github.com/cosi-project/runtime@v0.4.1/pkg/controller/runtime/internal/rruntime/run.go:67 +0x58
panic({0x2bc5be0?, 0x526eeb0?})
        /toolchain/go/src/runtime/panic.go:770 +0x132
github.com/siderolabs/talos/internal/app/machined/pkg/controllers/secrets.NewRootOSController.func1({0xc000b74c08?, 0x40f2ff?}, {0x2e51380?, 0x30ac3c0?}, 0xc0017db501?, 0x36741e0?, 0xc0007fe360)
        /src/internal/app/machined/pkg/controllers/secrets/root.go:170 +0x1dc
github.com/cosi-project/runtime/pkg/controller/generic/transform.NewControl...
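
For anyone who wants to reproduce the formatted view from a raw console line, here is a minimal Go sketch. It assumes the line carries a complete JSON object after the `[talos]` prefix, which is not the case for the truncated capture above, so it is illustrated with a made-up sample line. The names `logFields` and `extractError` are hypothetical helpers, not part of Talos.

```go
package main

import (
	"encoding/json"
	"fmt"
	"strings"
)

// logFields holds the only field of the JSON payload this sketch cares about.
// (Hypothetical helper type, not part of Talos.)
type logFields struct {
	Error string `json:"error"`
}

// extractError finds the first '{' in a console log line, decodes the JSON
// object that starts there, and returns its "error" field. JSON decoding
// turns the escaped \n and \t sequences into real newlines and tabs.
func extractError(line string) (string, error) {
	start := strings.Index(line, "{")
	if start < 0 {
		return "", fmt.Errorf("no JSON payload found in line")
	}

	var fields logFields
	if err := json.Unmarshal([]byte(line[start:]), &fields); err != nil {
		return "", fmt.Errorf("decoding JSON payload: %w", err)
	}

	return fields.Error, nil
}

func main() {
	// Made-up sample line; a truncated line will fail to decode.
	line := `[    1.000000] [talos] controller failed {"component": "controller-runtime", "error": "panicked: boom\n\ngoroutine 1 [running]:\nmain.f()\n\t/src/main.go:1 +0x1"}`

	msg, err := extractError(line)
	if err != nil {
		fmt.Println("decode failed:", err)
		return
	}

	fmt.Println(msg)
}
```

A truncated line such as the one captured over the serial console returns a decode error rather than a partial trace, so the full line is still needed.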

If needed, I can try to find a way to read beyond the final `...`. It might show on the VM console, but the resolution is (very) limited, it scrolls fast, and the dashboard is in the way.

Environment

  • Talos version: v1.7.1
  • Kubernetes version: v1.29.4
  • Platform: qemu x86_64
@smira
Member

smira commented May 21, 2024

Thanks for the report. It's indeed a bug, but I wonder how the node works in other respects without the worker's Talos API cert.

@smira smira self-assigned this May 21, 2024
@jonkerj
Contributor Author

jonkerj commented May 21, 2024

Here is my (redacted) machine config, which lacks machine.ca (the smoking gun @smira and I discussed on Slack): https://gist.github.com/jonkerj/1e17140d0dea9eeedd6d6e36ac7bab6b

smira added a commit to smira/talos that referenced this issue May 21, 2024
Fixes siderolabs#8753

There seems to be a problem in the machine config anyways, as
`machine.ca.crt` is missing for the worker (this should break `apid`
connectivity), but still Talos controller shouldn't enter a panic loop.

Signed-off-by: Andrey Smirnov <andrey.smirnov@siderolabs.com>
smira added a commit to smira/talos that referenced this issue May 28, 2024
Fixes siderolabs#8753

There seems to be a problem in the machine config anyways, as
`machine.ca.crt` is missing for the worker (this should break `apid`
connectivity), but still Talos controller shouldn't enter a panic loop.

Signed-off-by: Andrey Smirnov <andrey.smirnov@siderolabs.com>
(cherry picked from commit ce8c86d)
@github-actions github-actions bot locked as resolved and limited conversation to collaborators Jul 22, 2024