abolish Z_OOPS() in system call handlers #17735

andrewboie · 2019-07-23T19:22:01Z

If a user thread passes incorrect parameters to a system call, in most cases this will trigger a fatal exception with Z_OOPS() instead of returning an error.

This was done to keep kernel APIs exactly the same, as many did not propagate return values either. However we should provide opportunities for recovery, and not unconditionally explode like this, it's not good from a FuSa perspective and is fundamentally user-hostile.

For common errors like kernel object or memory buffer permissions, standardize on a set of errno codes for these and be consistent across system calls when these issues occur.

Change system call handlers to return an error value instead. This is related work to #16702

andrewboie · 2019-10-06T22:47:13Z

Has a dependency on #16702

andrewboie · 2020-02-07T21:16:01Z

Proposed error conditions returned, for any API exposed as system call.

Most of these checks can only be determined if invoked from user mode.

Note that we have some syscalls which return pointers and will require some thought on how to propagate stuff, for the moment we'll just have to return NULL:

-ENOSYS system call not implemented. For any syscall which has no z_vrfy/z_impl implemented or ifdef'd out, or driver syscall whose API struct member is NULL. ONLY for these cases.
-EBADF when provided a kernel object that is not of the expected type, or not a kernel object
-EACCES when provided a kernel object that the caller does not have permissions on
-EINVAL when provided a kernel object that must be initialized, but is not
-EADDRINUSE when provided a kernel object that must not be initialized, but is
-EFAULT when the caller supplies a memory buffer that is not part of its memory domain or has incorrect read/write permissions
-EINVAL returned when a supplied parameter that has some kind of validity or bounds check fails
-EINVAL returned when a size_t is provided and some final size value must be computed from it, and this generates an overflow (basically any size_*_overflow check)
-ENOMEM when the system call requires a resource pool allocation that fails

andrewboie · 2020-02-07T22:25:24Z

Another thing that needs fixing: Z_SYSCALL_VERIFY(), Z_SYSCALL_OBJ, Z_SYSCALL_MEMORY and related functions do something goofy: they expect their input to be a function which returns 0 on success and nonzero on failure. Z_SYSCALL_VERIFY() casts this to a boolean and then inverts its value, returning False on success and True on failure.

Probably need to provide new versions of these. They can be expressed as inline functions instead of macros I think since SSF is no longer passed around.

andrewboie · 2020-02-07T22:49:29Z

Questions to answer:

For syscalls that return pointers, do we want to use an ERR_PTR() mechanism to propagate return values?
For syscalls that return unsigned values, should we change them all to return signed values so we can return -errno?
Are there any syscall which return signed values where negative values are valid? Do we change them?
Are we OK with changing all syscalls that return void, to return integer, even if all checks are only done from user mode verification functions?
Should we use something like CHECKIF() in system call verification functions? Obviously CONFIG_NO_RUNTIME_CHECKS should be forbidden if CONFIG_USERSPACE is enabled, but I do see some value in CONFIG_ASSERT_ON_ERRORS which would induce a k_oops() on error instead of returning a value (which mimics the existing behavior of Z_OOPS()). We could do this in the syscall dispatch logic, just generate an oops if the return value is negative if we agree all negative return values are errors.
How much logging should we do when system call errors are encountered? Currently we are rather verbose, but part of this was because we immediately did a Z_OOPS() afterwards

andrewboie · 2020-02-07T22:54:12Z

My initial feelings on the above post, which are certainly subject to persuasion:

No, these APIs need to be changed. All APIs exposed as system calls must return some kind of signed integer value
Unsure about this. k_sem_count_get() is a good example. We'd have to change the prototype for k_sem_init() to take a signed integer for the initial count and limit values.
Not sure what to do here
I think this is OK but may irritate people who never use user mode, there may be a small footprint cost.
My feeling on the last two points is 1) yes use CHECKIF() 2) only verbosely log errors if CONFIG_ASSERT_ON_ERRORS is enabled. Any oopses triggered in this way should use the existing infra to make the check look like it came from the call site of the syscall to make debug easier.

andrewboie · 2020-02-08T01:22:25Z

Big big questions to answer:

what do functional safety requirements say about this?
Of the scenarios described in abolish Z_OOPS() in system call handlers #17735 (comment) can we leave some, or even all of them as Z_OOPS()? what are the use-cases for recovery from these scenarios in Zephyr context?

For example, we might decide to leave the EACCES, EBADF, EFAULT, ENOSYS, EADDRINUSE, and EINVAL (uniitialized object) cases as Z_OOPS(), in which case most syscalls will not need an API change.

ruuddw · 2020-02-08T22:05:35Z

Not so sure about this. Most 'new' error codes proposed would be systematic/design errors, that should never happen. Error return codes might not always be checked, introducing more latent problems. Would recovery be really easier with return values, or would the Z_OOPS simply be called by the application code that now needs to check for these errors?

andrewboie · 2020-02-12T20:56:39Z

@ruuddw I agree. And I spoke with @ceolin some more, we think we don't have specific fusa requirements to return values, killing the caller is acceptable as well. So in the end we may not do much here.

I think it helps to decompose this into categories. Using my previous post of return codes as a starting point:

The user has invoked an unimplemented system call (ENOSYS case)
The user has used a kernel object pointer improperly: they passed in a bad kernel object pointer, or it was of the wrong type, wrong initialization state, or the caller doesn't have permission on said object. (EBADF, EACCES, EADDRINUSE, one of the EINVAL cases)
Issues with pointers to memory buffers supplied by the user, either the caller doesn't have appropriate permission on the memory buffer, or some size calculation overflowed
General parameter bounds checks, done in the implementation function
General parameter bounds checks, done ONLY in the verification function due to user-mode only policies on use of that API (k_thread_create() is an excellent example)
Issues with failed heap allocations (ENOMEM case)

I think for categories 1, 2, and 3, these are purely programming mistakes. We could continue to raise Z_OOPS() for these. This would result in most syscall APIs staying as-is.

For category 4, these return errors. In the past, many of these were left as assertions, but are now more robustly checked with CHECKIF().

For category 5, checks that currently only happen when invoked from user mode are currently usually failed with Z_OOPS(), but should we change to return values to be consistent with category 4? There are very few syscalls that do this, I will put together a list.

For category 6, I think ENOMEM should definitely be propagated. I believe we already do so for those system calls that perform resource pool allocations.

ruuddw · 2020-02-18T15:48:18Z

Sounds like a good refinement, indeed ENOMEM is an example of a possible transient fault: if the caller tries again (maybe after cleaning something up) it could be recoverable.
On the other hand, safety critical systems prefer static allocation, exactly to avoid such ENOMEMs.
I would not be against adding return codes and propagating errors for user decisions in specific cases where more gracefull handling makes sense. Propagating error codes for some 4,5,6 items could well make sense.

As described in zephyrproject-rtos#17735. Signed-off-by: Andrew Boie <andrew.p.boie@intel.com>

As described in #17735. Signed-off-by: Andrew Boie <andrew.p.boie@intel.com>

As described in zephyrproject-rtos#17735. Signed-off-by: Andrew Boie <andrew.p.boie@intel.com>

andrewboie added Enhancement Changes/Updates/Additions to existing features area: Memory Protection labels Jul 23, 2019

andrewboie mentioned this issue Aug 7, 2019

userspace: Support for split 64 bit arguments #18064

Merged

andrewboie added this to the v2.1.0 milestone Sep 11, 2019

andrewboie self-assigned this Oct 6, 2019

andrewboie modified the milestones: v2.1.0, v2.2.0 Oct 24, 2019

andrewboie modified the milestones: v2.2.0, v2.3.0 Feb 7, 2020

andrewboie mentioned this issue Feb 8, 2020

WIP: Abolish Z_OOPS() #22617

Closed

andrewboie mentioned this issue Mar 25, 2020

doc: clarify syscall Z_OOPS() policy #23752

Merged

andrewboie pushed a commit to andrewboie/zephyr that referenced this issue Mar 25, 2020

doc: clarify syscall Z_OOPS() policy

897bb1d

As described in zephyrproject-rtos#17735. Signed-off-by: Andrew Boie <andrew.p.boie@intel.com>

andrewboie pushed a commit that referenced this issue Mar 26, 2020

doc: clarify syscall Z_OOPS() policy

df2b114

As described in #17735. Signed-off-by: Andrew Boie <andrew.p.boie@intel.com>

andrewboie closed this as completed Mar 26, 2020

hakehuang pushed a commit to hakehuang/zephyr that referenced this issue Apr 30, 2020

doc: clarify syscall Z_OOPS() policy

efdf5e2

As described in zephyrproject-rtos#17735. Signed-off-by: Andrew Boie <andrew.p.boie@intel.com>

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

abolish Z_OOPS() in system call handlers #17735

abolish Z_OOPS() in system call handlers #17735

andrewboie commented Jul 23, 2019

andrewboie commented Oct 6, 2019

andrewboie commented Feb 7, 2020 •

edited

Loading

andrewboie commented Feb 7, 2020

andrewboie commented Feb 7, 2020 •

edited

Loading

andrewboie commented Feb 7, 2020 •

edited

Loading

andrewboie commented Feb 8, 2020 •

edited

Loading

ruuddw commented Feb 8, 2020

andrewboie commented Feb 12, 2020 •

edited

Loading

ruuddw commented Feb 18, 2020

abolish Z_OOPS() in system call handlers #17735

abolish Z_OOPS() in system call handlers #17735

Comments

andrewboie commented Jul 23, 2019

andrewboie commented Oct 6, 2019

andrewboie commented Feb 7, 2020 • edited Loading

andrewboie commented Feb 7, 2020

andrewboie commented Feb 7, 2020 • edited Loading

andrewboie commented Feb 7, 2020 • edited Loading

andrewboie commented Feb 8, 2020 • edited Loading

ruuddw commented Feb 8, 2020

andrewboie commented Feb 12, 2020 • edited Loading

ruuddw commented Feb 18, 2020

andrewboie commented Feb 7, 2020 •

edited

Loading

andrewboie commented Feb 7, 2020 •

edited

Loading

andrewboie commented Feb 7, 2020 •

edited

Loading

andrewboie commented Feb 8, 2020 •

edited

Loading

andrewboie commented Feb 12, 2020 •

edited

Loading