crashes in tests/kernel/mem_protect/userspace case pass_noperms_object on x86_64 #22738
I have a theory that this has something to do with the system trying to reschedule while in exception context, which doesn't work right on x86-64 because of the per-CPU exception stacks: exceptions are not taken on a thread stack.
Experimenting with per-thread exception stacks, but still having crashes. My code might not be right, however. https://github.com/andrewboie/zephyr/tree/x86-64-exceptions
I can't seem to get this to reproduce if I set
On x86_64 we should not be getting an exception dump at all. If we get a bad system call on x86_64,
This may be an interaction with the custom Some common, robust infrastructure for doing ztests on APIs which can trigger a fatal error would be valuable, instead of implementing on an ad-hoc basis for every test.
Definitely wrong: we're not in exception context. All of this is synchronous with the thread, on the privilege elevation stack.
Adding There is another pain point in this test however. I also see intermittent crashes in the
Adding a sleep doesn't help. I am suspecting a concurrency issue. Whenever I see this scenario fail, the active thread is
It's always the idle thread at that instruction.
We are actually in the timer interrupt. When this fails, The exception code is not reporting that we are in an ISR, which needs to be fixed. I caught this in a backtrace and got:
All access to The bad As it turns out, this test case calls ztest_test_pass() from a child of ztest_thread, and not ztest_thread itself. This causes the ztest infrastructure to try to re-use ztest_thread without aborting it first.
If a ztest test case creates child thread(s), and one of the descendant threads invokes ztest_test_pass(), ztest_test_fail(), or ztest_test_skip(), only that descendant thread will be aborted. Then ztest will try to run the next scenario on the ztest_thread, which is already in use. This was causing corruption issues on SMP systems, and possibly other subtle, hard-to-debug situations.

This patch ensures that ztest_thread is always dead before re-using it, as run_test() now attempts to join on it instead of using a semaphore. The ztest_test_* functions now ensure that the ztest_thread is always aborted, in addition to the current thread.

This isn't perfect. If the test case spawned other threads, they will keep running. The most robust way to fix this is to iterate over all non-essential threads in the system and abort them. Unfortunately, Zephyr doesn't have a facility to do this safely.

It would also be simpler to re-use thread objects if k_thread_create() could detect whether the thread was already active and abort it, but this is currently not possible, since k_thread_create() can be used with uninitialized thread object memory and no checks are possible. This may be improved in the future, see zephyrproject-rtos#23030.

Fixes: zephyrproject-rtos#22738
Partial fix for: zephyrproject-rtos#24713
Signed-off-by: Andrew Boie <andrew.p.boie@intel.com>
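The join-before-reuse idea in the commit message can be illustrated outside Zephyr. What follows is only a rough POSIX threads analogy, not the actual ztest change: `passes`, `child_entry()`, `test_entry()` and `run_scenarios()` are hypothetical names, `pthread_cancel()` merely stands in for aborting ztest_thread, and `pthread_join()` stands in for the join that run_test() is described as doing. Zephyr's thread abort and POSIX cancellation do not have identical semantics, so treat this as a sketch of the pattern only.

```c
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>
#include <unistd.h>

/* Hypothetical bookkeeping for this sketch; none of these names are Zephyr APIs. */
static atomic_int passes;

/* Stands in for a child thread reporting a pass: record the result, then
 * stop the main test thread as well, not only the calling thread. */
static void *child_entry(void *arg)
{
	pthread_t test_thread = *(pthread_t *)arg;

	atomic_fetch_add(&passes, 1);
	pthread_cancel(test_thread);	/* analogue of aborting ztest_thread */
	return NULL;			/* the child itself ends here */
}

/* Stands in for a test case body that spawns a child and then blocks. */
static void *test_entry(void *arg)
{
	pthread_t self = pthread_self();
	pthread_t child;
	int rc;

	(void)arg;
	rc = pthread_create(&child, NULL, child_entry, &self);
	assert(rc == 0);
	pthread_detach(child);	/* the runner never tracks this thread */

	for (;;) {
		sleep(1);	/* cancellation point: the thread ends here */
	}
	return NULL;
}

/* Stands in for the runner: join on the test thread, so the slot is known
 * to be dead before the next scenario reuses it. */
static int run_scenarios(int count)
{
	pthread_t test_thread;	/* reused for every scenario */

	atomic_store(&passes, 0);
	for (int i = 0; i < count; i++) {
		void *exit_value;
		int rc = pthread_create(&test_thread, NULL, test_entry, NULL);

		assert(rc == 0);
		rc = pthread_join(test_thread, &exit_value);
		assert(rc == 0 && exit_value == PTHREAD_CANCELED);
	}
	return atomic_load(&passes);
}
```

Under these assumptions, a call such as `run_scenarios(3)` starts each scenario only after the previous test thread has fully terminated. The detached child is deliberately left untracked, which mirrors the limitation noted above: threads other than the main test thread are not cleaned up by the runner.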
I've also seen this scenario just get stuck with no output. This seems fairly easy to reproduce outside of CI. It doesn't always fail, but it doesn't take long to get a run to go haywire.