Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

crashes in tests/kernel/mem_protect/userspace case pass_noperms_object on x86_64 #22738

Closed
andrewboie opened this issue Feb 12, 2020 · 6 comments · Fixed by #25011
Closed

crashes in tests/kernel/mem_protect/userspace case pass_noperms_object on x86_64 #22738

andrewboie opened this issue Feb 12, 2020 · 6 comments · Fixed by #25011
Assignees
Labels
area: Tests Issues related to a particular existing or missing test area: X86_64 x86-64 Architecture (64-bit) bug The issue is a bug, or the PR is fixing a bug priority: medium Medium impact/importance bug

Comments

@andrewboie
Copy link
Contributor

===================================================================
starting test - pass_noperms_object
E: Page fault at address 0xd (error code 0x0)
E: LineaE:r  address nott present in page thables
read 0xE: P0ML4E0:00000000139b40 ( 0x0000000000178827 2W)r itable, User, Execute Enabled
does not have permission on k_sem 0x00000E0: PDPTE: 0x00000000000179827 Writable, Us0er, Execute Enabled
0013aa50
E:   PDE: 0x000000000017a827 Writable, User, Execute Enabled
E:   PTE: Non-present
E: permission bitmap
E: RAX: 0x0000000000000001 RBX: 0x0000000000139b40 RCX: 0x0000000000000001 RDX: 0x000000E0000000046
: E: RSI: 0x0000000000000002 RDI: 0x0000000000139b40 RBP: 0x000000000010e6b9 RSP: 0x000000000017ecf0
E:  R8: 0x00000000ffffffff  R9: 0x000000000017f000 R10: 0x000000000010afe2 R11: 0x0000000000000000
E: R12: 0x0000000000000002 R13: 0x0000000000000000 R14: 0x0000000000000000 R15: 0x0000000000000000
E: RSP: 0x000000000017ecf0 RFLAGS: 0x0000000000000202 CS: 0x0018 CR3: 0x0000000000126000
E: RIP: 0x000000000010e7008
E: >>> ZEPHYR FATAL ERROR 00 : CPU exception on CPU 0
E: Current thread: 0x0000000000139b40 (unknown)
Caught system error -- reason 0
starting test - pass_noperms_object
E: thread 0x0000000000139b40 (2) does not have permission on k_sem 0x000000000013aa50
E: p: permission bitmap
E%c: 0x00000000000003f8 used befoE: rE: Page fault at address 0x200010329a00  (error code 0x10)
E: Linear address not present in page tables
00 07 00 00 E: PML4E: 0x   0      000000000178827 Writable, User, Execute Enabled
E: PDPTE: Non-present
|.....   E: RAX: 0x
0000000000000072 RBX: 0x0000000000102119 RCX: 0x00000000000003f8 RDX: 0x00000000000003f8
EE: RSI: 0x:000000000 0syscall z_vrfy_k_sem_init failed check: access denied
000072 RDI: 0x00000000001bf760 RBP: 0x000000000017ebe8 RSP: 0x000000000017eb40
E:  R8: 0x0000000000000000  R9: 0x000000000013aa50 R10: 0xE: Bad system call from RIP 0x0100525
00000000010afe2 R11: 0x0000000000000000
E: R12: 0x0000000000000072 R13: 0x000000000017ec28 R14: 0x0000000000000001 R15: 0x0000000000000000
E: RSP: 0x000000000017eb40 RFLAGS: 0x0000000000000202 CS: 0x0018 CR3: 0x0000000000126000
E: RIP: 0x000000200010329a
E: >>> ZEPHYR FATAL ERROR E: >>> ZEPHYR FATAL ERROR 0: CPU exception on CPU 30
unknown)E: Current thread: 0x0000000000139b40 ( oops on CPU 1

Caught system error -- reason E: Current thread: 0x0000000000139b40 (unknown)
0Caught system error -- reason 3
starting test - pass_noperms_object
E%: s
%a: 0E: x0000000000100000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000

I've also seen this scenario just get stuck with no output. This seems fairly easy to reproduce outside of CI, it doesn't always fail, but it doesn't take long to get a run to go haywire.

@andrewboie andrewboie added the bug The issue is a bug, or the PR is fixing a bug label Feb 12, 2020
@andrewboie andrewboie self-assigned this Feb 12, 2020
@andrewboie
Copy link
Contributor Author

I have a theory that this has something to do with the system trying to reschedule while in exception context, which doesn't work right on x86-64 due to per-cpu exception stacks, exceptions are not taken on a thread stack.

@andrewboie
Copy link
Contributor Author

Experimenting with per-thread exception stacks but still having crashes. My code might not be right however. https://github.com/andrewboie/zephyr/tree/x86-64-exceptions

@jhedberg jhedberg added the priority: medium Medium impact/importance bug label Feb 18, 2020
@carlescufi carlescufi added area: Tests Issues related to a particular existing or missing test area: X86_64 x86-64 Architecture (64-bit) labels Apr 30, 2020
@andrewboie
Copy link
Contributor Author

I can't seem to get this to reproduce if I set CONFIG_MP_NUM_CPUS=1, so I think there is some interaction with SMP going on. I'm checking all the test code for potential races.

@andrewboie
Copy link
Contributor Author

andrewboie commented May 6, 2020

On x86_64 we should not be getting an exception dump at all. If we get a bad system call on x86_64, arch_syscall_oops() is called, which then invokes z_x86_fatal_error(K_ERR_KERNEL_OOPS, NULL) -> z_fatal_error(K_ERR_KERNEL_OOPS, NULL) -> the test case custom k_sys_fatal_error_handler()

pass_noperms_object is the second test which invokes an arch_syscall_oops(). It's nearly always this case that blows up, so now I'm thinking this is some kind of bad state or race condition related to the first time arch_syscall_oops() was invoked, which is from the immediately preceding pass_user_object scenario.

This may be an interaction with the custom k_sys_fatal_error_handler() provided by the test case.

Some common, robust infrastructure for doing ztests on APIs which can trigger a fatal error would be valuable, instead of implementing on an ad-hoc basis for every test.

I have a theory that this has something to do with the system trying to reschedule while in exception context, which doesn't work right on x86-64 due to per-cpu exception stacks, exceptions are not taken on a thread stack.

Definitely wrong, we're not in exception context...all of this is synchronous with the thread, on the privilege elevation stack.

@andrewboie
Copy link
Contributor Author

Adding k_sleep(K_MSEC(100)) to the beginning of pass_noperms_object() makes that crash much less likely to reproduce, although I got it to still hang once with no output. So I'm thinking race condition.

There is another pain point in this test however. I also see intermittent crashes in the domain_add_part_drop_to_user():

starting test - domain_add_part_drop_to_user
E: Page fault at address 0x0 (error code 0x2)
E: Linear address not present in page tables
E: PML4E: 0x0000000000137827 Writable, User, Execute Enabled
E: PDPTE: 0x0000000000136827 Writable, User, Execute Enabled
E:   PDE: 0x0000000000135827 Writable, User, Execute Enabled
E:   PTE: Non-present
E: RAX: 0x0000000000000000 RBX: 0x000000000013ab70 RCX: 0x0000000000000011 RDX: 0x0000000000000000
E: RSI: 0x0000000000000011 RDI: 0x000000000013ab70 RBP: 0x0000000000000247 RSP: 0x00000000001bffd8
E:  R8: 0x0000000000000001  R9: 0x0000000000000000 R10: 0x0000000000000000 R11: 0x0000000000000000
E: R12: 0x0000000000000000 R13: 0x0000000000000000 R14: 0x0000000000000000 R15: 0x0000000000000000
E: RSP: 0x00000000001bffd8 RFLAGS: 0x0000000000000046 CS: 0x0018 CR3: 0x0000000000127000
E: RIP: 0x000000000010d114
E: >>> ZEPHYR FATAL ERROR 0: CPU exception on CPU 1
E: Current thread: 0x000000000013b280 (idle)
Caught system error -- reason 0

Adding a sleep doesn't help. I am suspecting a concurrency issue. Whenever I see this scenario fail, the active thread is idle and RIP points to:

(gdb) x/fx 0x000000000010d114
0x10d114 <remove_timeout+32>:	0x48028948
(gdb) 

It's always the idle thread at that instruction.

@andrewboie
Copy link
Contributor Author

We are actually in the timer interrupt. When this fails, remove_timeout() is being called on a timeout object with NULL values for both the next and prev pointers.

The exception code is not reporting that we are in an ISR, which needs to be fixed. I caught this in a backtrace and got:

Thread 2 hit Breakpoint 1, remove_timeout (t=t@entry=0x13ab70 <ztest_thread+48>)
    at /home/apboie/projects/zephyr/zephyr/kernel/timeout.c:65
65			printk("t->node.next == NULL\n");
(gdb) bt
#0  remove_timeout (t=t@entry=0x13ab70 <ztest_thread+48>)
    at /home/apboie/projects/zephyr/zephyr/kernel/timeout.c:65
#1  0x000000000010d900 in z_clock_announce (ticks=<optimized out>)
    at /home/apboie/projects/zephyr/zephyr/kernel/timeout.c:243
#2  0x000000000000854e in irq_dispatch ()
    at /home/apboie/projects/zephyr/zephyr/arch/x86/core/intel64/locore.S:606
#3  0x0000000000000000 in ?? ()

z_clock_announce() gets invoked from the timer driver. I am not sure how we got into a state where sys_dlist_peek_head(&timeout_list) points to a sys_dnode_t with NULL next/prev pointers.

All access to timeout_list ought to be serialized on timeout_lock, maybe there's a case where it isn't, or something completely zeroed the associated struct _timeout which contains the dnode.

The bad struct _timeout is ztest_thread.base.timeout. Since it belongs to a thread, and when a thread exits z_abort_timeout() on its timeout struct gets called, I'm thinking this might be a race where k_thread_create() is being called on ztest_thread before the abort completes from it's previous lifecycle.

As it turns out, this test case calls ztest_test_pass() from a child of ztest_thread, and not ztest_thread itself. This causes the ztest infrastructure to try to re-use ztest_thread without aborting it first.

andrewboie pushed a commit to andrewboie/zephyr that referenced this issue May 6, 2020
If a ztest test case creates child thread(s), and one of the
descendent threads invokes ztest_test_pass(), ztest_test_fail(),
or ztest_thread_skip(), only that descendent thread will be
aborted.

Then ztest will try to run the next scenario on the ztest_thread
which is already in use. This was causing corruption issues on
SMP systems, and possibly other subtle, hard-to-debug
situations.

This patch ensures that ztest_thread is always dead before
re-using it, as run_test() now attempts to join on it instead
of using a semaphore.

The ztest_test_* functions now ensure that the ztest_thread
is always aborted, in addition to the current thread.

This isn't perfect. If the testcase spawned other threads,
they will keep running. The most robust way to fix this is to
iterate over all non-essential threads in the system and abort
them. Unfortunately, Zephyr doesn't have a facility to do
this safely.

It would also be simpler to re-use thread objects if
k_thread_create() could detect whether the thread was already
active and abort it, but this is currently not possible
since k_thread_create() can be used with uninitialzed
thread object memory and no checks are possible. This
may be improved in the future, see zephyrproject-rtos#23030.

Fixes: zephyrproject-rtos#22738
Partial fix for: zephyrproject-rtos#24713

Signed-off-by: Andrew Boie <andrew.p.boie@intel.com>
nashif pushed a commit that referenced this issue May 7, 2020
If a ztest test case creates child thread(s), and one of the
descendent threads invokes ztest_test_pass(), ztest_test_fail(),
or ztest_thread_skip(), only that descendent thread will be
aborted.

Then ztest will try to run the next scenario on the ztest_thread
which is already in use. This was causing corruption issues on
SMP systems, and possibly other subtle, hard-to-debug
situations.

This patch ensures that ztest_thread is always dead before
re-using it, as run_test() now attempts to join on it instead
of using a semaphore.

The ztest_test_* functions now ensure that the ztest_thread
is always aborted, in addition to the current thread.

This isn't perfect. If the testcase spawned other threads,
they will keep running. The most robust way to fix this is to
iterate over all non-essential threads in the system and abort
them. Unfortunately, Zephyr doesn't have a facility to do
this safely.

It would also be simpler to re-use thread objects if
k_thread_create() could detect whether the thread was already
active and abort it, but this is currently not possible
since k_thread_create() can be used with uninitialzed
thread object memory and no checks are possible. This
may be improved in the future, see #23030.

Fixes: #22738
Partial fix for: #24713

Signed-off-by: Andrew Boie <andrew.p.boie@intel.com>
hakehuang pushed a commit to hakehuang/zephyr that referenced this issue Jun 20, 2020
If a ztest test case creates child thread(s), and one of the
descendent threads invokes ztest_test_pass(), ztest_test_fail(),
or ztest_thread_skip(), only that descendent thread will be
aborted.

Then ztest will try to run the next scenario on the ztest_thread
which is already in use. This was causing corruption issues on
SMP systems, and possibly other subtle, hard-to-debug
situations.

This patch ensures that ztest_thread is always dead before
re-using it, as run_test() now attempts to join on it instead
of using a semaphore.

The ztest_test_* functions now ensure that the ztest_thread
is always aborted, in addition to the current thread.

This isn't perfect. If the testcase spawned other threads,
they will keep running. The most robust way to fix this is to
iterate over all non-essential threads in the system and abort
them. Unfortunately, Zephyr doesn't have a facility to do
this safely.

It would also be simpler to re-use thread objects if
k_thread_create() could detect whether the thread was already
active and abort it, but this is currently not possible
since k_thread_create() can be used with uninitialzed
thread object memory and no checks are possible. This
may be improved in the future, see zephyrproject-rtos#23030.

Fixes: zephyrproject-rtos#22738
Partial fix for: zephyrproject-rtos#24713

Signed-off-by: Andrew Boie <andrew.p.boie@intel.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area: Tests Issues related to a particular existing or missing test area: X86_64 x86-64 Architecture (64-bit) bug The issue is a bug, or the PR is fixing a bug priority: medium Medium impact/importance bug
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants