crashes in tests/kernel/mem_protect/userspace case pass_noperms_object on x86_64 #22738

andrewboie · 2020-02-12T00:09:42Z

===================================================================
starting test - pass_noperms_object
E: Page fault at address 0xd (error code 0x0)
E: LineaE:r  address nott present in page thables
read 0xE: P0ML4E0:00000000139b40 ( 0x0000000000178827 2W)r itable, User, Execute Enabled
does not have permission on k_sem 0x00000E0: PDPTE: 0x00000000000179827 Writable, Us0er, Execute Enabled
0013aa50
E:   PDE: 0x000000000017a827 Writable, User, Execute Enabled
E:   PTE: Non-present
E: permission bitmap
E: RAX: 0x0000000000000001 RBX: 0x0000000000139b40 RCX: 0x0000000000000001 RDX: 0x000000E0000000046
: E: RSI: 0x0000000000000002 RDI: 0x0000000000139b40 RBP: 0x000000000010e6b9 RSP: 0x000000000017ecf0
E:  R8: 0x00000000ffffffff  R9: 0x000000000017f000 R10: 0x000000000010afe2 R11: 0x0000000000000000
E: R12: 0x0000000000000002 R13: 0x0000000000000000 R14: 0x0000000000000000 R15: 0x0000000000000000
E: RSP: 0x000000000017ecf0 RFLAGS: 0x0000000000000202 CS: 0x0018 CR3: 0x0000000000126000
E: RIP: 0x000000000010e7008
E: >>> ZEPHYR FATAL ERROR 00 : CPU exception on CPU 0
E: Current thread: 0x0000000000139b40 (unknown)
Caught system error -- reason 0

starting test - pass_noperms_object
E: thread 0x0000000000139b40 (2) does not have permission on k_sem 0x000000000013aa50
E: p: permission bitmap
E%c: 0x00000000000003f8 used befoE: rE: Page fault at address 0x200010329a00  (error code 0x10)
E: Linear address not present in page tables
00 07 00 00 E: PML4E: 0x   0      000000000178827 Writable, User, Execute Enabled
E: PDPTE: Non-present
|.....   E: RAX: 0x
0000000000000072 RBX: 0x0000000000102119 RCX: 0x00000000000003f8 RDX: 0x00000000000003f8
EE: RSI: 0x:000000000 0syscall z_vrfy_k_sem_init failed check: access denied
000072 RDI: 0x00000000001bf760 RBP: 0x000000000017ebe8 RSP: 0x000000000017eb40
E:  R8: 0x0000000000000000  R9: 0x000000000013aa50 R10: 0xE: Bad system call from RIP 0x0100525
00000000010afe2 R11: 0x0000000000000000
E: R12: 0x0000000000000072 R13: 0x000000000017ec28 R14: 0x0000000000000001 R15: 0x0000000000000000
E: RSP: 0x000000000017eb40 RFLAGS: 0x0000000000000202 CS: 0x0018 CR3: 0x0000000000126000
E: RIP: 0x000000200010329a
E: >>> ZEPHYR FATAL ERROR E: >>> ZEPHYR FATAL ERROR 0: CPU exception on CPU 30
unknown)E: Current thread: 0x0000000000139b40 ( oops on CPU 1

Caught system error -- reason E: Current thread: 0x0000000000139b40 (unknown)
0Caught system error -- reason 3

starting test - pass_noperms_object
E%: s
%a: 0E: x0000000000100000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000

I've also seen this scenario just get stuck with no output. This seems fairly easy to reproduce outside of CI, it doesn't always fail, but it doesn't take long to get a run to go haywire.

The text was updated successfully, but these errors were encountered:

andrewboie · 2020-02-12T18:15:16Z

I have a theory that this has something to do with the system trying to reschedule while in exception context, which doesn't work right on x86-64 due to per-cpu exception stacks, exceptions are not taken on a thread stack.

andrewboie · 2020-02-12T20:39:05Z

Experimenting with per-thread exception stacks but still having crashes. My code might not be right however. https://github.com/andrewboie/zephyr/tree/x86-64-exceptions

andrewboie · 2020-05-06T00:28:00Z

I can't seem to get this to reproduce if I set CONFIG_MP_NUM_CPUS=1, so I think there is some interaction with SMP going on. I'm checking all the test code for potential races.

andrewboie · 2020-05-06T00:53:41Z

On x86_64 we should not be getting an exception dump at all. If we get a bad system call on x86_64, arch_syscall_oops() is called, which then invokes z_x86_fatal_error(K_ERR_KERNEL_OOPS, NULL) -> z_fatal_error(K_ERR_KERNEL_OOPS, NULL) -> the test case custom k_sys_fatal_error_handler()

pass_noperms_object is the second test which invokes an arch_syscall_oops(). It's nearly always this case that blows up, so now I'm thinking this is some kind of bad state or race condition related to the first time arch_syscall_oops() was invoked, which is from the immediately preceding pass_user_object scenario.

This may be an interaction with the custom k_sys_fatal_error_handler() provided by the test case.

Some common, robust infrastructure for doing ztests on APIs which can trigger a fatal error would be valuable, instead of implementing on an ad-hoc basis for every test.

I have a theory that this has something to do with the system trying to reschedule while in exception context, which doesn't work right on x86-64 due to per-cpu exception stacks, exceptions are not taken on a thread stack.

Definitely wrong, we're not in exception context...all of this is synchronous with the thread, on the privilege elevation stack.

andrewboie · 2020-05-06T01:13:50Z

Adding k_sleep(K_MSEC(100)) to the beginning of pass_noperms_object() makes that crash much less likely to reproduce, although I got it to still hang once with no output. So I'm thinking race condition.

There is another pain point in this test however. I also see intermittent crashes in the domain_add_part_drop_to_user():

starting test - domain_add_part_drop_to_user
E: Page fault at address 0x0 (error code 0x2)
E: Linear address not present in page tables
E: PML4E: 0x0000000000137827 Writable, User, Execute Enabled
E: PDPTE: 0x0000000000136827 Writable, User, Execute Enabled
E:   PDE: 0x0000000000135827 Writable, User, Execute Enabled
E:   PTE: Non-present
E: RAX: 0x0000000000000000 RBX: 0x000000000013ab70 RCX: 0x0000000000000011 RDX: 0x0000000000000000
E: RSI: 0x0000000000000011 RDI: 0x000000000013ab70 RBP: 0x0000000000000247 RSP: 0x00000000001bffd8
E:  R8: 0x0000000000000001  R9: 0x0000000000000000 R10: 0x0000000000000000 R11: 0x0000000000000000
E: R12: 0x0000000000000000 R13: 0x0000000000000000 R14: 0x0000000000000000 R15: 0x0000000000000000
E: RSP: 0x00000000001bffd8 RFLAGS: 0x0000000000000046 CS: 0x0018 CR3: 0x0000000000127000
E: RIP: 0x000000000010d114
E: >>> ZEPHYR FATAL ERROR 0: CPU exception on CPU 1
E: Current thread: 0x000000000013b280 (idle)
Caught system error -- reason 0

Adding a sleep doesn't help. I am suspecting a concurrency issue. Whenever I see this scenario fail, the active thread is idle and RIP points to:

(gdb) x/fx 0x000000000010d114
0x10d114 <remove_timeout+32>:	0x48028948
(gdb)

It's always the idle thread at that instruction.

andrewboie · 2020-05-06T03:39:35Z

We are actually in the timer interrupt. When this fails, remove_timeout() is being called on a timeout object with NULL values for both the next and prev pointers.

The exception code is not reporting that we are in an ISR, which needs to be fixed. I caught this in a backtrace and got:

Thread 2 hit Breakpoint 1, remove_timeout (t=t@entry=0x13ab70 <ztest_thread+48>)
    at /home/apboie/projects/zephyr/zephyr/kernel/timeout.c:65
65			printk("t->node.next == NULL\n");
(gdb) bt
#0  remove_timeout (t=t@entry=0x13ab70 <ztest_thread+48>)
    at /home/apboie/projects/zephyr/zephyr/kernel/timeout.c:65
#1  0x000000000010d900 in z_clock_announce (ticks=<optimized out>)
    at /home/apboie/projects/zephyr/zephyr/kernel/timeout.c:243
#2  0x000000000000854e in irq_dispatch ()
    at /home/apboie/projects/zephyr/zephyr/arch/x86/core/intel64/locore.S:606
#3  0x0000000000000000 in ?? ()

z_clock_announce() gets invoked from the timer driver. I am not sure how we got into a state where sys_dlist_peek_head(&timeout_list) points to a sys_dnode_t with NULL next/prev pointers.

All access to timeout_list ought to be serialized on timeout_lock, maybe there's a case where it isn't, or something completely zeroed the associated struct _timeout which contains the dnode.

The bad struct _timeout is ztest_thread.base.timeout. Since it belongs to a thread, and when a thread exits z_abort_timeout() on its timeout struct gets called, I'm thinking this might be a race where k_thread_create() is being called on ztest_thread before the abort completes from it's previous lifecycle.

As it turns out, this test case calls ztest_test_pass() from a child of ztest_thread, and not ztest_thread itself. This causes the ztest infrastructure to try to re-use ztest_thread without aborting it first.

If a ztest test case creates child thread(s), and one of the descendent threads invokes ztest_test_pass(), ztest_test_fail(), or ztest_thread_skip(), only that descendent thread will be aborted. Then ztest will try to run the next scenario on the ztest_thread which is already in use. This was causing corruption issues on SMP systems, and possibly other subtle, hard-to-debug situations. This patch ensures that ztest_thread is always dead before re-using it, as run_test() now attempts to join on it instead of using a semaphore. The ztest_test_* functions now ensure that the ztest_thread is always aborted, in addition to the current thread. This isn't perfect. If the testcase spawned other threads, they will keep running. The most robust way to fix this is to iterate over all non-essential threads in the system and abort them. Unfortunately, Zephyr doesn't have a facility to do this safely. It would also be simpler to re-use thread objects if k_thread_create() could detect whether the thread was already active and abort it, but this is currently not possible since k_thread_create() can be used with uninitialzed thread object memory and no checks are possible. This may be improved in the future, see zephyrproject-rtos#23030. Fixes: zephyrproject-rtos#22738 Partial fix for: zephyrproject-rtos#24713 Signed-off-by: Andrew Boie <andrew.p.boie@intel.com>

If a ztest test case creates child thread(s), and one of the descendent threads invokes ztest_test_pass(), ztest_test_fail(), or ztest_thread_skip(), only that descendent thread will be aborted. Then ztest will try to run the next scenario on the ztest_thread which is already in use. This was causing corruption issues on SMP systems, and possibly other subtle, hard-to-debug situations. This patch ensures that ztest_thread is always dead before re-using it, as run_test() now attempts to join on it instead of using a semaphore. The ztest_test_* functions now ensure that the ztest_thread is always aborted, in addition to the current thread. This isn't perfect. If the testcase spawned other threads, they will keep running. The most robust way to fix this is to iterate over all non-essential threads in the system and abort them. Unfortunately, Zephyr doesn't have a facility to do this safely. It would also be simpler to re-use thread objects if k_thread_create() could detect whether the thread was already active and abort it, but this is currently not possible since k_thread_create() can be used with uninitialzed thread object memory and no checks are possible. This may be improved in the future, see #23030. Fixes: #22738 Partial fix for: #24713 Signed-off-by: Andrew Boie <andrew.p.boie@intel.com>

If a ztest test case creates child thread(s), and one of the descendent threads invokes ztest_test_pass(), ztest_test_fail(), or ztest_thread_skip(), only that descendent thread will be aborted. Then ztest will try to run the next scenario on the ztest_thread which is already in use. This was causing corruption issues on SMP systems, and possibly other subtle, hard-to-debug situations. This patch ensures that ztest_thread is always dead before re-using it, as run_test() now attempts to join on it instead of using a semaphore. The ztest_test_* functions now ensure that the ztest_thread is always aborted, in addition to the current thread. This isn't perfect. If the testcase spawned other threads, they will keep running. The most robust way to fix this is to iterate over all non-essential threads in the system and abort them. Unfortunately, Zephyr doesn't have a facility to do this safely. It would also be simpler to re-use thread objects if k_thread_create() could detect whether the thread was already active and abort it, but this is currently not possible since k_thread_create() can be used with uninitialzed thread object memory and no checks are possible. This may be improved in the future, see zephyrproject-rtos#23030. Fixes: zephyrproject-rtos#22738 Partial fix for: zephyrproject-rtos#24713 Signed-off-by: Andrew Boie <andrew.p.boie@intel.com>

andrewboie added the bug The issue is a bug, or the PR is fixing a bug label Feb 12, 2020

andrewboie self-assigned this Feb 12, 2020

jhedberg added the priority: medium Medium impact/importance bug label Feb 18, 2020

carlescufi added area: Tests Issues related to a particular existing or missing test area: X86_64 x86-64 Architecture (64-bit) labels Apr 30, 2020

andrewboie mentioned this issue May 6, 2020

ztest: end tests more robustly #25011

Merged

nashif closed this as completed in #25011 May 7, 2020

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

crashes in tests/kernel/mem_protect/userspace case pass_noperms_object on x86_64 #22738

crashes in tests/kernel/mem_protect/userspace case pass_noperms_object on x86_64 #22738

andrewboie commented Feb 12, 2020

andrewboie commented Feb 12, 2020

andrewboie commented Feb 12, 2020

andrewboie commented May 6, 2020

andrewboie commented May 6, 2020 •

edited

Loading

andrewboie commented May 6, 2020

andrewboie commented May 6, 2020

crashes in tests/kernel/mem_protect/userspace case pass_noperms_object on x86_64 #22738

crashes in tests/kernel/mem_protect/userspace case pass_noperms_object on x86_64 #22738

Comments

andrewboie commented Feb 12, 2020

andrewboie commented Feb 12, 2020

andrewboie commented Feb 12, 2020

andrewboie commented May 6, 2020

andrewboie commented May 6, 2020 • edited Loading

andrewboie commented May 6, 2020

andrewboie commented May 6, 2020

andrewboie commented May 6, 2020 •

edited

Loading