Rework exception & fatal error handling framework #11751
Comments
+1000 Eager to help with this. I've been wanting this to be common code for a long time.
Here's another requirement:
I spent a few hours last night chasing a bug where I knew a state was wrong at the top of an ISR, even though it really should have been locked. The ability to ask "where in the code was I when this lock wasn't taken?" would have been very useful. Instead, I resorted (seriously) to manually walking the stack, dumping pointers and checking them against zephyr.lst to see if they looked suspicious. (Even the stack is hard to extract from an ISR, but thankfully it was easy enough to see that ARM puts the interrupted SP at the very top of the isr_stack.) And even then I missed the actual bug: it was inside _Swap, where BASEPRI gets released having set the pended exception. I figured that pointer was a mistake and took the next one, which was its caller.
Stack unwinding on ARM will require some effort; it's not simple at all.
Yeah, it's a nice-to-have. At least there should be a spot for it so more conventional architectures can provide it.
@andyross @andrewboie ping :)
I'm working on a PR which implements some of this, mostly to at least move the error functions to common code.
My initial PR is now merged. Remaining TODO items, which we can either do here or open separate tickets for:
This item I am not sure is feasible in an arch-agnostic way:
I think this has to be done by arch code. We have a stack unwinding implementation for x86. ARM is very tricky; stack unwinding on that arch is extremely complex. It might be simple to do for other arches, though I haven't looked into it. I suggest we drop the priority to 'low' since the broad goals have been achieved now.
@andrewboie Is this issue completed?
This is done now.
We need a redesign of the error handling API. Right now, most architectures try to do this in some kind of fancy way, but in fact nothing actually tests any of that stuff and everything is either unexercised or incompatible.
Working up an x86_64 implementation, I found that all that's actually required and tested is:
_NanoFatalErrorHandler gets called with only three "reason" values (STACK_CHK_FAIL, OOPS, and PANIC), though these are not defined in a Zephyr header and have to be defined by the arch.
The only expected/tested action of _NanoFatalErrorHandler() is to call back into _SysFatalErrorHandler() with the same arguments. The difference is that the _Sys variant is a weak symbol that can be overridden by the app. Zephyr doesn't manage this; each arch does it on its own.
Nothing anywhere outside of arch code inspects any part of a NANO_ESF (stack frame) handle, not even for equality with another. It's totally ignored by everything.
This makes the existence of _default_esf (instead of, say, a NULL pointer) especially weird.
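For readers unfamiliar with the weak-symbol pattern described above, here is a minimal sketch of it, assuming a GCC/Clang toolchain. Every name in it (`nano_fatal_error_handler`, `sys_fatal_error_handler`, `struct esf`, the `REASON_*` values, `last_reason`) is a hypothetical stand-in, not the actual Zephyr definition; the real handlers and NANO_ESF live in arch code.

```c
#include <stddef.h>
#include <stdio.h>

/* Hypothetical reason codes; in the tree these are defined per arch. */
enum fatal_reason { REASON_STACK_CHK_FAIL, REASON_OOPS, REASON_PANIC };

/* Hypothetical placeholder for the arch-specific NANO_ESF frame. */
struct esf { unsigned long pc; };

/* Records the last reason seen, purely so the sketch is observable. */
int last_reason = -1;

/*
 * Weak default "system" handler. An application that links in a strong
 * definition with the same name replaces this one at link time.
 */
__attribute__((weak))
void sys_fatal_error_handler(unsigned int reason, const struct esf *esf)
{
    (void)esf; /* nothing outside arch code inspects the frame */
    last_reason = (int)reason;
    fprintf(stderr, "fatal error, reason %u\n", reason);
}

/* Arch-level entry point: forwards the same arguments unchanged. */
void nano_fatal_error_handler(unsigned int reason, const struct esf *esf)
{
    sys_fatal_error_handler(reason, esf);
}
```

Since the frame argument is never inspected by the caller side, passing NULL works just as well as a dummy frame in this sketch, e.g. `nano_fatal_error_handler(REASON_OOPS, NULL);`.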
So we should either rip all this out or make it work right. Some plausible requirements we would want in a portable system (that in some cases are done by the architectures already in varying ways):
An extensible reason code enumeration with better mapping to "mostly portable" values (i.e. all sane architectures have a divide-by-zero exception, an illegal instruction exception, a memory protection fault; those can all be standardized).
Unified logging. Most architectures will dump PC and register state (and sometimes stack frames) at exception time using printk(). The arch should provide this as some kind of "output to" or "stringify" API to the Zephyr level so we can share this appropriately (and redirect to the new logging framework, etc...)
Some kind of per-exception metadata framework, e.g. the fault address of an unrecoverable MMU fault should be exposable in a standard way.
Future debug integration. The ability to do a stack walk over an exception stack frame would be really useful, etc... We don't have a standard debugger layer either, but if we did this should work with it.
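To make the first three requirements concrete, here is one possible shape for them, offered only as a sketch under stated assumptions. All names (`FATAL_*`, `struct fatal_info`, `fatal_reason_str`, `fatal_info_format`) are hypothetical and do not exist in Zephyr; the idea is a shared reason enumeration with room for arch-specific codes, a small metadata record, and a "stringify" call that writes into a caller-supplied buffer so the output can go to printk() or a logging backend.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical portable reason codes; arch-specific ones start at the base. */
enum fatal_reason {
    FATAL_STACK_CHK_FAIL,
    FATAL_OOPS,
    FATAL_PANIC,
    FATAL_DIV_BY_ZERO,
    FATAL_ILLEGAL_INSN,
    FATAL_MEM_PROTECT,
    FATAL_ARCH_BASE = 0x100
};

/* Hypothetical per-exception metadata filled in by arch code. */
struct fatal_info {
    unsigned int reason;   /* enum fatal_reason or an arch-specific code */
    uintptr_t pc;          /* faulting program counter */
    bool has_fault_addr;   /* true if fault_addr is meaningful */
    uintptr_t fault_addr;  /* e.g. address of an unrecoverable MMU fault */
};

const char *fatal_reason_str(unsigned int reason)
{
    switch (reason) {
    case FATAL_STACK_CHK_FAIL: return "stack check failure";
    case FATAL_OOPS:           return "oops";
    case FATAL_PANIC:          return "panic";
    case FATAL_DIV_BY_ZERO:    return "divide by zero";
    case FATAL_ILLEGAL_INSN:   return "illegal instruction";
    case FATAL_MEM_PROTECT:    return "memory protection fault";
    default:
        return reason >= FATAL_ARCH_BASE ? "arch-specific fault" : "unknown";
    }
}

/* Formats into buf; returns what snprintf returns (length that was needed). */
int fatal_info_format(const struct fatal_info *info, char *buf, size_t len)
{
    if (info->has_fault_addr) {
        return snprintf(buf, len, "%s at pc=0x%lx, fault addr=0x%lx",
                        fatal_reason_str(info->reason),
                        (unsigned long)info->pc,
                        (unsigned long)info->fault_addr);
    }
    return snprintf(buf, len, "%s at pc=0x%lx",
                    fatal_reason_str(info->reason),
                    (unsigned long)info->pc);
}
```

Writing to a buffer rather than calling printk() directly is just one way to satisfy the "output to" idea; a callback-based sink would work equally well.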