Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Rework exception & fatal error handling framework #11751

Closed
andyross opened this issue Nov 29, 2018 · 10 comments
Closed

Rework exception & fatal error handling framework #11751

andyross opened this issue Nov 29, 2018 · 10 comments
Assignees
Labels
Enhancement Changes/Updates/Additions to existing features priority: low Low impact/importance bug

Comments

@andyross
Copy link
Contributor

We need a redesign of the error handling API. Right now, most architectures try to do this in some kind of fancy way, but in fact nothing actually tests any of that stuff and everything is either unexercised or incompatible.

Working up a x86_64 implementation, I found that all that's actually required and tested is:

  • _NanoFatalErrorHandler gets called with only three "reason" values (STACK_CHK_FAIL, OOPS, and PANIC), though these are not defined in a Zephyr header and have to be done by the arch.

  • The only expected/tested action of _NanoFatalErrorHandler() is to call back into _SysFatalErrorHandler() with the same arguments. The difference is that the _Sys variant is a weak symbol that can be overriden by the app. Zephyr doesn't manage this, each arch does it on its own.

  • Nothing anywhere outside of arch code inspects any part of a NANO_ESF (stack frame) handle, not even for equality with another. It's totally ignored by everything.

  • Thus making the existance of _default_esf (instead of, say, a NULL pointer) especially weird.

So we should either rip all this out or make it work right. Some plausible requirements we would want in a portable system (that in some cases are done by the architectures already in varying ways):

  • An extensible reason code enumeration with better mapping to "mostly portable" values (i.e. all sane architectures have a divide-by-zero exception, a illegal instruction exception, a memory protection fault, those can all be standardized).

  • Unified logging. Most architectures will dump PC and register state (and sometimes stack frames) at exception time using printk(). The arch should provide this as some kind of "output to" or "stringify" API to the Zephyr level so we can share this appropriately (and redirect to the new logging framework, etc...)

  • Some kind of per-exception metadata framework, e.g. the fault address of an unrecoverable MMU fault should be exposable in a standard way.

  • Future debug integration. The ability to do a stack walk over an exception stack frame would be really useful, etc... We don't have a standard debugger layer either, but if we did this should work with it.

@andrewboie
Copy link
Contributor

+1000

Eager to help with this. I've been wanting this to be common code for a long time.

@nashif nashif added Feature A planned feature with a milestone priority: medium Medium impact/importance bug labels Nov 29, 2018
@andyross
Copy link
Contributor Author

andyross commented Jan 4, 2019

Here's another requirement:

  • It should be possible via the API in a standard way to get a register dump and (ideally) stack walk of any thread via only its struct k_thread pointer, including pended ones and the one that has been interrupted by an executing interrupt or exception handler.

I spent a few hours last night chasing a bug where I knew a state was wrong at the top of an ISR yet really should have been locked. The ability to ask "where was I in code that this lock wasn't taken?" would have been very useful. Instead, I resorted (seriously) to manually walking the stack (even that's hard to extract from an ISR, but thankfully it was easy enough to see that ARM puts the interrupted SP at the very top of the isr_stack) dumping pointers and checking them against zephyr.lst to see if they looked suspicious. And even then I missed the actual bug (it was inside _Swap where BASEPRI gets released having set the pended exception, and I figured that was a mistake and took the next pointer which was its caller).

@andrewboie
Copy link
Contributor

stack walk of any thread via only its struct k_thread pointer

stack unwinding on ARM will require some effort, it's not simple at all.
https://wiki.linaro.org/KenWerner/Sandbox/libunwind?action=AttachFile&do=get&target=libunwind-LDS.pdf

@andyross
Copy link
Contributor Author

andyross commented Jan 5, 2019

Yeah, it's a nice-to-have. At least there should be a spot for it so more conventional architectures can provide it.

@nashif nashif added Enhancement Changes/Updates/Additions to existing features and removed Feature A planned feature with a milestone labels Feb 21, 2019
@ioannisg
Copy link
Member

ioannisg commented Jul 1, 2019

@andyross @nashif are we doing this for 2.0.0?

@ioannisg
Copy link
Member

@andyross @andrewboie ping :)

@andrewboie
Copy link
Contributor

I'm working on a PR which implements some of this, mostly to at least move the error functions to common code.

@andrewboie
Copy link
Contributor

My initial PR is now merged.

Remaining TODO items, which we can either do here or open separate tickets for:

  • Extend the current enumeration of exception types to cover more error types, such as memory protection errors
  • Would be nice to make z_fatal_print() a macro, but I ran into an include hell putting it into the header. Would be nice to disentangle this.
  • Some kind of per-exception metadata framework, e.g. the fault address of an unrecoverable MMU fault should be exposable in a standard way. Started this, but decided to punt until the initial PR went in.

This item I am not sure is feasible in an arch-agnostic way:

Future debug integration. The ability to do a stack walk over an exception stack frame would be really useful, etc... We don't have a standard debugger layer either, but if we did this should work with it.

I think this has to be done by arch code. We have a stack unwinding implementation for X86. ARM is very tricky, stack unwinding on that arch is extremely complex. Might be simple to do for other arches though, I haven't looked into it.

Suggest we drop the priority to 'low' since the broad goals have been achieved now.

@jenmwms jenmwms closed this as completed Aug 28, 2019
@jenmwms jenmwms reopened this Aug 28, 2019
@andrewboie andrewboie added priority: low Low impact/importance bug and removed priority: medium Medium impact/importance bug labels Sep 5, 2019
@ioannisg
Copy link
Member

@andrewboie Is this issue completed?

@nashif
Copy link
Member

nashif commented Feb 13, 2020

this is done now.

@nashif nashif closed this as completed Feb 13, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Enhancement Changes/Updates/Additions to existing features priority: low Low impact/importance bug
Projects
None yet
Development

No branches or pull requests

5 participants