Rework exception & fatal error handling framework #11751

andyross · 2018-11-29T20:24:27Z

We need a redesign of the error handling API. Right now, most architectures try to do this in some kind of fancy way, but in fact nothing actually tests any of that stuff and everything is either unexercised or incompatible.

Working up a x86_64 implementation, I found that all that's actually required and tested is:

_NanoFatalErrorHandler gets called with only three "reason" values (STACK_CHK_FAIL, OOPS, and PANIC), though these are not defined in a Zephyr header and have to be done by the arch.
The only expected/tested action of _NanoFatalErrorHandler() is to call back into _SysFatalErrorHandler() with the same arguments. The difference is that the _Sys variant is a weak symbol that can be overriden by the app. Zephyr doesn't manage this, each arch does it on its own.
Nothing anywhere outside of arch code inspects any part of a NANO_ESF (stack frame) handle, not even for equality with another. It's totally ignored by everything.
Thus making the existance of _default_esf (instead of, say, a NULL pointer) especially weird.

So we should either rip all this out or make it work right. Some plausible requirements we would want in a portable system (that in some cases are done by the architectures already in varying ways):

An extensible reason code enumeration with better mapping to "mostly portable" values (i.e. all sane architectures have a divide-by-zero exception, a illegal instruction exception, a memory protection fault, those can all be standardized).
Unified logging. Most architectures will dump PC and register state (and sometimes stack frames) at exception time using printk(). The arch should provide this as some kind of "output to" or "stringify" API to the Zephyr level so we can share this appropriately (and redirect to the new logging framework, etc...)
Some kind of per-exception metadata framework, e.g. the fault address of an unrecoverable MMU fault should be exposable in a standard way.
Future debug integration. The ability to do a stack walk over an exception stack frame would be really useful, etc... We don't have a standard debugger layer either, but if we did this should work with it.

andrewboie · 2018-11-29T20:31:53Z

+1000

Eager to help with this. I've been wanting this to be common code for a long time.

andyross · 2019-01-04T22:30:29Z

Here's another requirement:

It should be possible via the API in a standard way to get a register dump and (ideally) stack walk of any thread via only its struct k_thread pointer, including pended ones and the one that has been interrupted by an executing interrupt or exception handler.

I spent a few hours last night chasing a bug where I knew a state was wrong at the top of an ISR yet really should have been locked. The ability to ask "where was I in code that this lock wasn't taken?" would have been very useful. Instead, I resorted (seriously) to manually walking the stack (even that's hard to extract from an ISR, but thankfully it was easy enough to see that ARM puts the interrupted SP at the very top of the isr_stack) dumping pointers and checking them against zephyr.lst to see if they looked suspicious. And even then I missed the actual bug (it was inside _Swap where BASEPRI gets released having set the pended exception, and I figured that was a mistake and took the next pointer which was its caller).

andrewboie · 2019-01-05T01:39:04Z

stack walk of any thread via only its struct k_thread pointer

stack unwinding on ARM will require some effort, it's not simple at all.
https://wiki.linaro.org/KenWerner/Sandbox/libunwind?action=AttachFile&do=get&target=libunwind-LDS.pdf

andyross · 2019-01-05T02:03:41Z

Yeah, it's a nice-to-have. At least there should be a spot for it so more conventional architectures can provide it.

ioannisg · 2019-07-01T08:55:42Z

@andyross @nashif are we doing this for 2.0.0?

ioannisg · 2019-07-12T07:21:03Z

@andyross @andrewboie ping :)

andrewboie · 2019-07-13T22:13:07Z

I'm working on a PR which implements some of this, mostly to at least move the error functions to common code.

andrewboie · 2019-07-25T22:12:28Z

My initial PR is now merged.

Remaining TODO items, which we can either do here or open separate tickets for:

Extend the current enumeration of exception types to cover more error types, such as memory protection errors
Would be nice to make z_fatal_print() a macro, but I ran into an include hell putting it into the header. Would be nice to disentangle this.
Some kind of per-exception metadata framework, e.g. the fault address of an unrecoverable MMU fault should be exposable in a standard way. Started this, but decided to punt until the initial PR went in.

This item I am not sure is feasible in an arch-agnostic way:

Future debug integration. The ability to do a stack walk over an exception stack frame would be really useful, etc... We don't have a standard debugger layer either, but if we did this should work with it.

I think this has to be done by arch code. We have a stack unwinding implementation for X86. ARM is very tricky, stack unwinding on that arch is extremely complex. Might be simple to do for other arches though, I haven't looked into it.

Suggest we drop the priority to 'low' since the broad goals have been achieved now.

ioannisg · 2019-09-17T17:14:36Z

@andrewboie Is this issue completed?

nashif · 2020-02-13T23:31:23Z

this is done now.

nashif added Feature A planned feature with a milestone priority: medium Medium impact/importance bug labels Nov 29, 2018

nashif added Enhancement Changes/Updates/Additions to existing features and removed Feature A planned feature with a milestone labels Feb 21, 2019

andrewboie self-assigned this Jul 13, 2019

andrewboie mentioned this issue Jul 15, 2019

Fatal error handling consolidation and revamp #17544

Merged

jenmwms closed this as completed Aug 28, 2019

jenmwms reopened this Aug 28, 2019

andrewboie added priority: low Low impact/importance bug and removed priority: medium Medium impact/importance bug labels Sep 5, 2019

nashif closed this as completed Feb 13, 2020

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Rework exception & fatal error handling framework #11751

Rework exception & fatal error handling framework #11751

andyross commented Nov 29, 2018

andrewboie commented Nov 29, 2018

andyross commented Jan 4, 2019 •

edited

Loading

andrewboie commented Jan 5, 2019

andyross commented Jan 5, 2019

ioannisg commented Jul 1, 2019

ioannisg commented Jul 12, 2019

andrewboie commented Jul 13, 2019

andrewboie commented Jul 25, 2019

ioannisg commented Sep 17, 2019

nashif commented Feb 13, 2020

Rework exception & fatal error handling framework #11751

Rework exception & fatal error handling framework #11751

Comments

andyross commented Nov 29, 2018

andrewboie commented Nov 29, 2018

andyross commented Jan 4, 2019 • edited Loading

andrewboie commented Jan 5, 2019

andyross commented Jan 5, 2019

ioannisg commented Jul 1, 2019

ioannisg commented Jul 12, 2019

andrewboie commented Jul 13, 2019

andrewboie commented Jul 25, 2019

ioannisg commented Sep 17, 2019

nashif commented Feb 13, 2020

andyross commented Jan 4, 2019 •

edited

Loading