Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

abolish Z_OOPS() in system call handlers #17735

Closed
andrewboie opened this issue Jul 23, 2019 · 9 comments
Closed

abolish Z_OOPS() in system call handlers #17735

andrewboie opened this issue Jul 23, 2019 · 9 comments
Assignees
Labels
area: Memory Protection Enhancement Changes/Updates/Additions to existing features
Milestone

Comments

@andrewboie
Copy link
Contributor

If a user thread passes incorrect parameters to a system call, in most cases this will trigger a fatal exception with Z_OOPS() instead of returning an error.

This was done to keep kernel APIs exactly the same, as many did not propagate return values either. However we should provide opportunities for recovery, and not unconditionally explode like this, it's not good from a FuSa perspective and is fundamentally user-hostile.

For common errors like kernel object or memory buffer permissions, standardize on a set of errno codes for these and be consistent across system calls when these issues occur.

Change system call handlers to return an error value instead. This is related work to #16702

@andrewboie andrewboie added Enhancement Changes/Updates/Additions to existing features area: Memory Protection labels Jul 23, 2019
@andrewboie andrewboie added this to the v2.1.0 milestone Sep 11, 2019
@andrewboie andrewboie self-assigned this Oct 6, 2019
@andrewboie
Copy link
Contributor Author

Has a dependency on #16702

@andrewboie andrewboie modified the milestones: v2.1.0, v2.2.0 Oct 24, 2019
@andrewboie
Copy link
Contributor Author

andrewboie commented Feb 7, 2020

Proposed error conditions returned, for any API exposed as system call.

Most of these checks can only be determined if invoked from user mode.

Note that we have some syscalls which return pointers and will require some thought on how to propagate stuff, for the moment we'll just have to return NULL:

  • -ENOSYS system call not implemented. For any syscall which has no z_vrfy/z_impl implemented or ifdef'd out, or driver syscall whose API struct member is NULL. ONLY for these cases.
  • -EBADF when provided a kernel object that is not of the expected type, or not a kernel object
  • -EACCES when provided a kernel object that the caller does not have permissions on
  • -EINVAL when provided a kernel object that must be initialized, but is not
  • -EADDRINUSE when provided a kernel object that must not be initialized, but is
  • -EFAULT when the caller supplies a memory buffer that is not part of its memory domain or has incorrect read/write permissions
  • -EINVAL returned when a supplied parameter that has some kind of validity or bounds check fails
  • -EINVAL returned when a size_t is provided and some final size value must be computed from it, and this generates an overflow (basically any size_*_overflow check)
  • -ENOMEM when the system call requires a resource pool allocation that fails

@andrewboie
Copy link
Contributor Author

Another thing that needs fixing: Z_SYSCALL_VERIFY(), Z_SYSCALL_OBJ, Z_SYSCALL_MEMORY and related functions do something goofy: they expect their input to be a function which returns 0 on success and nonzero on failure. Z_SYSCALL_VERIFY() casts this to a boolean and then inverts its value, returning False on success and True on failure.

Probably need to provide new versions of these. They can be expressed as inline functions instead of macros I think since SSF is no longer passed around.

@andrewboie
Copy link
Contributor Author

andrewboie commented Feb 7, 2020

Questions to answer:

  • For syscalls that return pointers, do we want to use an ERR_PTR() mechanism to propagate return values?
  • For syscalls that return unsigned values, should we change them all to return signed values so we can return -errno?
  • Are there any syscall which return signed values where negative values are valid? Do we change them?
  • Are we OK with changing all syscalls that return void, to return integer, even if all checks are only done from user mode verification functions?
  • Should we use something like CHECKIF() in system call verification functions? Obviously CONFIG_NO_RUNTIME_CHECKS should be forbidden if CONFIG_USERSPACE is enabled, but I do see some value in CONFIG_ASSERT_ON_ERRORS which would induce a k_oops() on error instead of returning a value (which mimics the existing behavior of Z_OOPS()). We could do this in the syscall dispatch logic, just generate an oops if the return value is negative if we agree all negative return values are errors.
  • How much logging should we do when system call errors are encountered? Currently we are rather verbose, but part of this was because we immediately did a Z_OOPS() afterwards

@andrewboie
Copy link
Contributor Author

andrewboie commented Feb 7, 2020

My initial feelings on the above post, which are certainly subject to persuasion:

  • No, these APIs need to be changed. All APIs exposed as system calls must return some kind of signed integer value
  • Unsure about this. k_sem_count_get() is a good example. We'd have to change the prototype for k_sem_init() to take a signed integer for the initial count and limit values.
  • Not sure what to do here
  • I think this is OK but may irritate people who never use user mode, there may be a small footprint cost.
  • My feeling on the last two points is 1) yes use CHECKIF() 2) only verbosely log errors if CONFIG_ASSERT_ON_ERRORS is enabled. Any oopses triggered in this way should use the existing infra to make the check look like it came from the call site of the syscall to make debug easier.

@andrewboie andrewboie modified the milestones: v2.2.0, v2.3.0 Feb 7, 2020
@andrewboie
Copy link
Contributor Author

andrewboie commented Feb 8, 2020

Big big questions to answer:

  1. what do functional safety requirements say about this?
  2. Of the scenarios described in abolish Z_OOPS() in system call handlers #17735 (comment) can we leave some, or even all of them as Z_OOPS()? what are the use-cases for recovery from these scenarios in Zephyr context?

For example, we might decide to leave the EACCES, EBADF, EFAULT, ENOSYS, EADDRINUSE, and EINVAL (uniitialized object) cases as Z_OOPS(), in which case most syscalls will not need an API change.

@ruuddw
Copy link
Member

ruuddw commented Feb 8, 2020

Not so sure about this. Most 'new' error codes proposed would be systematic/design errors, that should never happen. Error return codes might not always be checked, introducing more latent problems. Would recovery be really easier with return values, or would the Z_OOPS simply be called by the application code that now needs to check for these errors?

@andrewboie
Copy link
Contributor Author

andrewboie commented Feb 12, 2020

@ruuddw I agree. And I spoke with @ceolin some more, we think we don't have specific fusa requirements to return values, killing the caller is acceptable as well. So in the end we may not do much here.

I think it helps to decompose this into categories. Using my previous post of return codes as a starting point:

  1. The user has invoked an unimplemented system call (ENOSYS case)
  2. The user has used a kernel object pointer improperly: they passed in a bad kernel object pointer, or it was of the wrong type, wrong initialization state, or the caller doesn't have permission on said object. (EBADF, EACCES, EADDRINUSE, one of the EINVAL cases)
  3. Issues with pointers to memory buffers supplied by the user, either the caller doesn't have appropriate permission on the memory buffer, or some size calculation overflowed
  4. General parameter bounds checks, done in the implementation function
  5. General parameter bounds checks, done ONLY in the verification function due to user-mode only policies on use of that API (k_thread_create() is an excellent example)
  6. Issues with failed heap allocations (ENOMEM case)

I think for categories 1, 2, and 3, these are purely programming mistakes. We could continue to raise Z_OOPS() for these. This would result in most syscall APIs staying as-is.

For category 4, these return errors. In the past, many of these were left as assertions, but are now more robustly checked with CHECKIF().

For category 5, checks that currently only happen when invoked from user mode are currently usually failed with Z_OOPS(), but should we change to return values to be consistent with category 4? There are very few syscalls that do this, I will put together a list.

For category 6, I think ENOMEM should definitely be propagated. I believe we already do so for those system calls that perform resource pool allocations.

@ruuddw
Copy link
Member

ruuddw commented Feb 18, 2020

Sounds like a good refinement, indeed ENOMEM is an example of a possible transient fault: if the caller tries again (maybe after cleaning something up) it could be recoverable.
On the other hand, safety critical systems prefer static allocation, exactly to avoid such ENOMEMs.
I would not be against adding return codes and propagating errors for user decisions in specific cases where more gracefull handling makes sense. Propagating error codes for some 4,5,6 items could well make sense.

andrewboie pushed a commit to andrewboie/zephyr that referenced this issue Mar 25, 2020
As described in zephyrproject-rtos#17735.

Signed-off-by: Andrew Boie <andrew.p.boie@intel.com>
andrewboie pushed a commit that referenced this issue Mar 26, 2020
As described in #17735.

Signed-off-by: Andrew Boie <andrew.p.boie@intel.com>
hakehuang pushed a commit to hakehuang/zephyr that referenced this issue Apr 30, 2020
As described in zephyrproject-rtos#17735.

Signed-off-by: Andrew Boie <andrew.p.boie@intel.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area: Memory Protection Enhancement Changes/Updates/Additions to existing features
Projects
None yet
Development

No branches or pull requests

2 participants