Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

x86: enable CONFIG_USERSPACE on x86-64 #21710

Merged
merged 29 commits into from
Jan 13, 2020

Conversation

andrewboie
Copy link
Contributor

This patch series implements CONFIG_USERSPACE on x86-64.

Kernel Page Table Isolation (KPTI) is still a work-in-progress, so user mode cannot be enabled unless the target board sets CONFIG_X86_NO_MELTDOWN, indicating it is immune to these attacks. This is set for qemu_x86_64 to allow testing.

All tests are reliably passing with SMP disabled. With SMP enabled, the sporadic data corruption issue described in #21317 may be observed in CI runs, although this issue has been present before this patch series.

Major changes:

  • CPU exceptions are now recoverable on x86_64, allowing z_arch_user_string_nlen() to be implemented similar to how it is done in 32-bit.
  • GDT data descriptors for TSS regions (used to set %gs for per-cpu data) removed; instead we use MSRs to set GS_BASE. "swapgs" management implemented on privilege transitions to ensure the kernel is always using its expected GS segment. lfences added to mitigate the swapgs Spectre V1 variant.
  • Support for creating 64-bit thread page tables
  • Nearly all the userspace code written in C up-leveled to be used by both 32-bit and 64-bit
  • arch_new_thread() simplified for both 32-bit and 64-bit, with userspace logic moved to new functions in common code.
  • Linker script additions for kernel object tables and automatic memory partitions
  • System calls implemented using syscall/sysret instructions
  • hardware-assisted kernel oops exceptions implemented, Kconfig option removed (there's no compelling reason not to build this)
  • bit-rotted retpoline macros removed (we use Extended IBRS to mitigate Spectre V2 anyway)
  • Some test case fixes/workarounds for open bugs

Please see individual commit messages for more details.

@zephyrbot zephyrbot added area: Boards area: X86 x86 Architecture (32-bit) area: API Changes to public APIs area: Samples Samples area: Tests Issues related to a particular existing or missing test area: Kernel labels Jan 6, 2020
@zephyrbot
Copy link
Collaborator

zephyrbot commented Jan 6, 2020

All checks are passing now.

Tip: The bot edits this comment instead of posting a new one, so you can check the comment's history to see earlier messages.

Copy link
Contributor

@wentongwu wentongwu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

description of how retpolines work can be found here[1].

[1] https://support.google.com/faqs/answer/7625886

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

owww this is not working at all :/ totally in favor of removing it for now. The only caveat is that our fix requires cpu feature while thiis sw fix should work for all cpus.

Copy link
Member

@ceolin ceolin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

pr seems good to me

Copy link
Contributor

@andyross andyross left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can't find anything that looks wrong. One moderately serious nitpick and a few notes on test config.

@@ -3,3 +3,4 @@ CONFIG_ZTEST=y
CONFIG_ZTEST_STACKSIZE=2048
CONFIG_MAX_THREAD_BYTES=4
CONFIG_USERSPACE=y
CONFIG_SMP=n
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not a big deal, but I tend to prever CONFIG_MP_NUM_CPUS=1 to CONFIG_SMP=n for code coverage reasons. The former has the effect you want, without the other configuration changes that disabling SMP causes.

ztest_1cpu_unit_test(test_multiple_futex_wait_wake),
ztest_1cpu_unit_test(test_futex_wait_forever),
ztest_1cpu_unit_test(test_futex_wait_timeout),
ztest_1cpu_unit_test(test_futex_wait_nowait));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note that the 1cpu feature halts one CPU by grabbing a lock on it and spinning, which is fine from a correctness perspective but a performance hassle when run under sanitycheck (it doubles the load, and makes timer-related spurious failures more likely on other tests running on the host). In situations like this where basically everything is MP-unsafe it's probably better to use CONFIG_MP_NUM_CPUS=1 instead (or in addition to -- 1cpu obviously is a noop if there's only one cpu)

movl $0x600, %eax
xorl %edx, %edx
movl $X86_FMASK_MSR, %ecx
wrmsr
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Semi-hard style issue here. I think this is the wrong place. go64 is the lowest level bootstrap code required to get us to z_cstart(), these MSRs won't be inspected by the hardware until after the first user thread is spawned much, much later. Why can't this be done in the C side of initialization with much simpler code, e.g.:

z_x86_msr_write(X86_LSTAR_MSR, z_x86_syscall_entry_stub);
z_x86_msr_write(X86_STAR_MSR, X86_STAR_UPPER << 32);
z_x86_msr_write(X86_FMASK_MSR, 0x600);

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@andyross sure, it's better to do stuff in C if we can. I'll adjust this.

arch/x86/core/userspace.c Outdated Show resolved Hide resolved
@andrewboie
Copy link
Contributor Author

updated:

  • cleanup checkpatch whines
  • fix test cases per @andyross suggestion
  • add extra lfence per @ceolin suggestion
  • clean up arch_start_cpu to discard unused parameter and add a typedef for cpu start functions
  • move some per-cpu initialization to C domain per @andyross review
  • fix an incorrect comment I found

Andrew Boie added 26 commits January 13, 2020 12:26
This doesn't work properly on x86 unless the dynamic thread
struct allocated gets lucky and is aligned to 16 bytes.
Disabling for now until zephyrproject-rtos#17893 is fixed.

Signed-off-by: Andrew Boie <andrew.p.boie@intel.com>
The regular expressions used by this test to determine
success or failure get confounded if the log subsystem
drops the wrong messages due to buffers being full.

Just use minimal logging which synchronously logs
everything.

Signed-off-by: Andrew Boie <andrew.p.boie@intel.com>
Exceptions on x86_64 are incorrectly implemented, and if
a preemptible thread faults, and in its overridden
k_sys_fatal_error_handler() does something which invokes
a scheduling point (such as here where we give semaphores),
the thread will be swapped out on the per-CPU exception stack
and probably explode when it is switched back in.

For now, change the faulting thread priority to co-op so this
doesn't happen.

Workaround for zephyrproject-rtos#21462

Signed-off-by: Andrew Boie <andrew.p.boie@intel.com>
These were previously assumed to always be fatal.
We can't have the faulting thread's XMM registers
clobbered, so put the SIMD/FPU state onto the stack
as well. This is fairly large (512 bytes) and the
execption stack is already uncomfortably small, so
increase to 2K.

Signed-off-by: Andrew Boie <andrew.p.boie@intel.com>
Define MSR register addresses for various MSRs related to
SYSCALL/SYSRET. We also add MSRs for FS/GS base addresses
(for GS, both kernel and user mode) to support SWAPGS.

Signed-off-by: Andrew Boie <andrew.p.boie@intel.com>
Easier to establish correspondence with the technical
manuals.

Signed-off-by: Andrew Boie <andrew.p.boie@intel.com>
We don't need to set up GDT data descriptors for setting
%gs. Instead, we use the x86 MSRs to set GS_BASE and
KERNEL_GS_BASE.

We don't currently allow user mode to set %gs on its own,
but later on if we do, we have everything set up to issue
'swapgs' instructions on syscall or IRQ.

Unused entries in the GDT have been removed.

Signed-off-by: Andrew Boie <andrew.p.boie@intel.com>
These are arranged in the particular order required
by the syscall/sysret instructions.

Signed-off-by: Andrew Boie <andrew.p.boie@intel.com>
The esf has a different set of members on 64-bit.

Signed-off-by: Andrew Boie <andrew.p.boie@intel.com>
Slightly different layout since the top-lebel PML4
is page-sized and must be page-aligned.

Signed-off-by: Andrew Boie <andrew.p.boie@intel.com>
Add two new non-static APIs for dumping out the
page table entries for a specified memory address,
and move to the main MMU code. Has debugging uses
when trying to figure out why memory domains are not
set up correctly.

Signed-off-by: Andrew Boie <andrew.p.boie@intel.com>
z_x86_thread_page_tables_get() now works for both user
and supervisor threads, returning the kernel page tables
in the latter case. This API has been up-leveled to
a common header.

The per-thread privilege elevation stack initial stack
pointer, and the per-thread page table locations are no
longer computed from other values, and instead are stored
in thread->arch.

A problem where the wrong page tables were dumped out
on certain kinds of page faults has been fixed.

Signed-off-by: Andrew Boie <andrew.p.boie@intel.com>
These are now common code, all are related to user mode
threads. The rat's nest of ifdefs in ia32's arch_new_thread
has been greatly simplified, there is now just one hook
if user mode is turned on.

Signed-off-by: Andrew Boie <andrew.p.boie@intel.com>
Includes linker script fragments for the kernel object
tables and automatic memory partitions. The data section
is moved to the end per the requirements of
include/linker/kobject.h.

Signed-off-by: Andrew Boie <andrew.p.boie@intel.com>
Nothing too fancy here, we try as much as possible to
use the same register layout as the C calling convention.

Signed-off-by: Andrew Boie <andrew.p.boie@intel.com>
These are now part of the common Kconfig and we
build spec_ctrl.c for all.

Signed-off-by: Andrew Boie <andrew.p.boie@intel.com>
Hidden config to select dependencies.

Signed-off-by: Andrew Boie <andrew.p.boie@intel.com>
We should not be casting stuff related to memory sizes
to u32_t.

Signed-off-by: Andrew Boie <andrew.p.boie@intel.com>
This is causing problems, as if we create a thread in
a system call we will *not* be using the kernel page
tables if CONFIG_KPTI=n.

Just don't fiddle with this page's permissions; we don't
need it as a guard area anyway since we have a stack
guard placed immediately before it, and this page
is unused if user mode isn't active.

Signed-off-by: Andrew Boie <andrew.p.boie@intel.com>
We use a fixed value of 32 as the way interrupts/exceptions
are setup in x86_64's locore.S do not lend themselves to
Kconfig configuration of the vector to use.

HW-based kernel oops is now permanently on, there's no reason
to make it optional that I can see.

Default vectors for IPI and irq offload adjusted to not
collide.

Signed-off-by: Andrew Boie <andrew.p.boie@intel.com>
This code:

1) Doesn't work
2) Hasn't ever been enabled by default
3) We mitigate Spectre V2 via Extended IBRS anyway

Signed-off-by: Andrew Boie <andrew.p.boie@intel.com>
- In early boot, enable the syscall instruction and set up
  necessary MSRs
- Add a hook to update page tables on context switch
- Properly initialize thread based on whether it will
  start in user or supervisor mode
- Add landing function for system calls to execute the
  desired handler
- Implement arch_user_string_nlen()
- Implement logic for dropping a thread down to user mode
- Reserve per-CPU storage space for user and privilege
  elevation stack pointers, necessary for handling syscalls
  when no free registers are available
- Proper handling of gs register considerations when
  transitioning privilege levels

Kernel page table isolation (KPTI) is not yet implemented.

Signed-off-by: Andrew Boie <andrew.p.boie@intel.com>
See CVE-2019-1125. We mitigate this by adding an 'lfence'
upon interrupt/exception entry after the decision has been
made whether it's necessary to invoke 'swapgs' or not.

Only applies to x86_64, 32-bit doesn't use swapgs.

Signed-off-by: Andrew Boie <andrew.p.boie@intel.com>
KPTI is still work-in-progress on x86_64. Don't allow
user mode to be enabled unless the SOC/board configuration
indicates that the CPU in use is invulnerable to meltdown
attacks.

Signed-off-by: Andrew Boie <andrew.p.boie@intel.com>
The "key" parameter is legacy, remove it.

Add a typedef for the expected function pointer type.

Signed-off-by: Andrew Boie <andrew.p.boie@intel.com>
No reason we need to stay in assembly domain once we have
GS and a stack set up.

Signed-off-by: Andrew Boie <andrew.p.boie@intel.com>
Copy link
Contributor

@andyross andyross left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks great here

@nashif nashif merged commit 2690c9e into zephyrproject-rtos:master Jan 13, 2020
@andrewboie andrewboie deleted the x86-64-user-mode branch September 24, 2020 20:43
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area: API Changes to public APIs area: ARC ARC Architecture area: Boards area: Kernel area: Memory Protection area: Samples Samples area: Tests Issues related to a particular existing or missing test area: X86 x86 Architecture (32-bit)
Projects
None yet
Development

Successfully merging this pull request may close these issues.

8 participants