-
Notifications
You must be signed in to change notification settings - Fork 549
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat: support hardware watchdog timers #8313
Conversation
} | ||
|
||
logger.Info("Opened hardware watchdog", zap.String("filename", dev)) | ||
// TODO: support reboot/shutdown watchdogs |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We currently disable watchdog when quitting. systemd however allows for shutdown, reboot and kexec watchdogs to be set up to ensure power operations don't hang the system in inaccessible state. I could also do so if we clarify requirements for this and how are controllers being shut down during a restart or poweroff to be sure what timeout would be sane
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Controllers are terminated right before the reboot/shutdown (from the controller point of view, context will be cancelled).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Okay, then I might add a prop for that after the first review. However shutdown/reboot watchdog won't be turned on by default as I'm not sure what are the safe values and they depend on target config.
Why are the CI checks failing? Does |
You can ignore the gpg check, it needs to be signed by one of our devs |
Ok, would be happy to hear and address reviews then |
@dsseng your PR looks good to me, I would like to refactor some machine configuration stuff and probably see if we can do an integration test. I will find some time to look into that, but it might not be quick. |
No problem, I'll do needed improvements as soon as you request something. Thanks for reviewing! |
Note: kernel changes are required to load watchdog drivers, I haven't committed those yet |
d135c0e
to
88ca123
Compare
See siderolabs/talos#8313 Signed-off-by: Dmitry Sharshakov <dmitry.sharshakov@siderolabs.com> Signed-off-by: Andrey Smirnov <andrey.smirnov@siderolabs.com>
} | ||
|
||
// 3 pings per timeout should suffice in any case | ||
ticker = time.NewTicker(time.Duration(timeout/3) * time.Second) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we don't stop old ticker here... do we even need a ticker?
we could do something like https://github.com/cosi-project/runtime/blob/main/pkg/controller/runtime/internal/qruntime/internal/timer/resettable.go, but for Tickers
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Doesn't a ticker stop when GC'ed? Actually we can make use of resettable timer and reset it at each trigger, or develop an in-tree wrapper (or one contributed to the Cosi runtime) for ticker yes.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added ticker.Stop
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I meant to have a resettable Ticker here copying the approach from the above (to some extent).
The reason is that we don't want to have a ticker if there's no watchdog.
Or it could be implemented easier (there are plenty of examples in the codebase) by using a channel variable and setting it to nil when ticket is inactive.
internal/app/machined/pkg/controllers/runtime/watchdog_timer.go
Outdated
Show resolved
Hide resolved
internal/app/machined/pkg/controllers/runtime/watchdog_timer.go
Outdated
Show resolved
Hide resolved
Will update the kernel once more and rebase/squash stuff |
internal/app/machined/pkg/controllers/runtime/watchdog_timer.go
Outdated
Show resolved
Hide resolved
fb19f4a
to
60f47ab
Compare
@dsseng refactored controller a bit, other changes are config-related:
I will ask you to do integration test as a separate PR. |
/ok-to-test |
@dsseng added
|
""" | ||
|
||
[notes.watchdog] | ||
title = "Hardware Watchdog Timers" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i guess we also mention this is only available for amd64?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It is available on aarch64. We actually had all drivers built-in previously, so I didn't have to modify aarch64 config except for sysfs
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
oh, were they built in? I guess we need to keep them also as modules if not, just to be in parity with amd64 🤔 . I'm not so sure, @smira wdyt?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe some of those mustn't be modules due to the way they need to be probed, so that should be checked
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it supports PCI id database https://cateee.net/lkddb/web-lkddb/WATCHDOG.html, but we might hit other issue like we try to read before udev picks them up, I was more talking about these:
❯ grep IPMI_WA
kernel/build/config-arm64
3586:# CONFIG_IPMI_WATCHDOG is not set
kernel/build/config-amd64
3157:CONFIG_IPMI_WATCHDOG=m
❯ grep MFD_CORE
kernel/build/config-arm64
4514:CONFIG_MFD_CORE=y
kernel/build/config-amd64
3630:CONFIG_MFD_CORE=m
❯ grep SP5100
kernel/build/config-amd64
3581:CONFIG_SP5100_TCO=m
which are not consistent across amd64 and arm64
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we can fix it up, but it shouldn't be related to to this PR specifically.
I think arm64 in general needs a module cleanup the way I did for amd64 - try to make everything which can be a module to be a module.
arm64 kernel is not compressed, so moving to initramfs.xz is a good thing imho
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yeh, arm64 needs some cleanup
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
SP5100 is AMD WDT, MFD is for Intel one, but IPMI might work for ARM. Regarding PCI ID DB: aarch64 WDTs are typically part of the SoC
@@ -93,6 +93,11 @@ func (s *EventSinkV1Alpha1) KmsgLogURLs() []*url.URL { | |||
return nil | |||
} | |||
|
|||
// WatchdogTimer implements config.RuntimeConfig interface. | |||
func (s *EventSinkV1Alpha1) WatchdogTimer() config.WatchdogTimerConfig { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
hmm why does eventsink need watchdogtimer method implemented?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
that's the way I implemented Runtime()
config interface - all documents implement all methods, but return nil
if it's not their thing. it might be good or bad idea, idk, but it is this way right now ;)
We should probably report that using sysfs to also know timeout and timeleft |
Timeout we already report (as we set it), while |
Only enabled when activated by config, disabled on shutdown/reboot Fixes siderolabs#8284 Signed-off-by: Dmitry Sharshakov <dmitry.sharshakov@siderolabs.com> Signed-off-by: Dmitry Sharshakov <d3dx12.xx@gmail.com> Signed-off-by: Andrey Smirnov <andrey.smirnov@siderolabs.com>
/m |
Pull Request
What? (description)
Add support for Linux hardware watchdog API and corresponding configuration values. Watchdog timer is armed and fed by a controller (is it reliable to be sure a ticker will tick fine as long as no hang occurs?).
Watchdog timer can help automatically restoring a bare metal/supported VM node after a hang caused by a software or hardware error.
Why? (reasoning)
Fixes #8284, approved by developers in the comments
Acceptance
Please use the following checklist:
make conformance
- not done due to certainly being unrelated)make fmt
)make lint
)make docs
)make unit-tests
)