intermittent SMP crashes on x86_64 #21317
At this time I don't believe any of the changes I have made to page tables are the culprit. I did a git checkout of 1228798, which is the revision that introduced SMP into the qemu_x86_long target, and then did multiple runs of "sanitycheck -n -p qemu_x86_long; stty sane". I am seeing tests randomly fail there as well, although the errors are different, since the boot page tables have every page accessible and present. Not only that, but I have found a way to make the problem reproduce much more frequently:
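A minimal sketch of the kind of change being described, purely for illustration (the real diff is in the branch linked below; the function name and delay value here are hypothetical): have ztest sleep briefly after each test body returns, which widens the window for the race.

```c
#include <zephyr.h>

/* Hypothetical helper standing in for ztest's per-test runner; the only
 * point is the added sleep after the test body returns. */
static void run_one_test(void (*test_fn)(void))
{
	test_fn();

	/* Added delay: give the other CPU(s) time to run and hit the race.
	 * 100 ms is an arbitrary, illustrative value. */
	k_sleep(K_MSEC(100));
}
```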
I don't understand what this means yet, but instead of 0-2 failures for any given sanitycheck run, I now get more like 8-10. There's no reason this change should break anything, and indeed it doesn't on our other uniprocessor targets. I then tested, at the same revision 1228798, whether I could get failures with the older x86_64 port written by @andyross, which has since been removed from Zephyr. Unfortunately, I can. Multiple runs of "sanitycheck -n -p qemu_x86_64; stty sane" also produce about 10 or so errors each run. It's not the same set of tests each time, although some tests tend to fail more often than others. I then learned that our ARC and Xtensa SMP implementations are not being tested in CI. I filed #21469 for ARC; for Xtensa I don't think we have an emulator that will let us run SMP in CI. Because I can reproduce similar failures in both the old and new x86_64 ports, I am starting to think this is an issue in the core kernel.
I've prepared a branch at the revision where SMP was introduced in qemu_x86_long, with my change to ztest to add the sleep after running a test: https://github.com/andrewboie/zephyr/tree/smp-issues To reproduce what I am seeing, check this out, update west, and then run sanitycheck for either qemu_x86_long (new port) or qemu_x86_64 (old port); both should exhibit random failures. Example output in this branch:
A really good data point would be to see whether this can be observed with nsim_hs_smp or some Xtensa target. I'm trying to globally enable SMP for nsim_hs_smp, but I'm having some issues.
Some other observations:
We have some races causing random failures with this platform, set cpu number to one while we investigate and fix the issue. Related to #21317 Signed-off-by: Anas Nashif <anas.nashif@intel.com>
OK, we're getting farther. There's another, somewhat similar race involved with waiting for a z_swap() to complete. When swap begins, it does the scheduler work involved with re-queuing the _current thread (inside the scheduler lock, of course), and then it enters arch_switch() to do the actual context switch. But in the cycles between those two steps, the old/switching-from thread is in the queue and able to be run on the other CPU, despite the fact that it won't have its registers saved until somewhere in the middle of arch_switch()! A PoC "fix" for x86_64 appears below. Basically it stuffs a magic cookie into the last field saved in the switch code and, in the scheduler, spins until that field has actually been saved (i.e. the cookie has been clobbered) before returning the thread. Applying that results in an almost 100% reliable sanitycheck run (well under a failure per run, anyway, which is around the threshold where measurement gets confounded by the known timing slop). Now I just need to find a way to do this portably and simply without putting too many weird requirements on the architecture layer...

```diff
diff --git a/arch/x86/core/intel64/locore.S b/arch/x86/core/intel64/locore.S
index e38a0fa993..5140c3a0e8 100644
--- a/arch/x86/core/intel64/locore.S
+++ b/arch/x86/core/intel64/locore.S
@@ -212,7 +212,6 @@ z_x86_switch:
movq %r12, _thread_offset_to_r12(%rsi)
movq %r13, _thread_offset_to_r13(%rsi)
movq %r14, _thread_offset_to_r14(%rsi)
- movq %r15, _thread_offset_to_r15(%rsi)
#ifdef CONFIG_USERSPACE
/* We're always in supervisor mode if we get here, the other case
* is when __resume is invoked from irq_dispatch
@@ -220,6 +219,8 @@ z_x86_switch:
movq $X86_KERNEL_CS, _thread_offset_to_cs(%rsi)
movq $X86_KERNEL_DS, _thread_offset_to_ss(%rsi)
#endif
+ // HACK: move R15 to the end so it's the last value saved
+ movq %r15, _thread_offset_to_r15(%rsi)
movq %gs:__x86_tss64_t_ist1_OFFSET, %rsp
/* fall through to __resume */
diff --git a/kernel/include/kswap.h b/kernel/include/kswap.h
index 3537a24d21..e98a6cf609 100644
--- a/kernel/include/kswap.h
+++ b/kernel/include/kswap.h
@@ -57,8 +57,19 @@ static ALWAYS_INLINE unsigned int do_swap(unsigned int key,
k_spin_release(lock);
}
+ // HACK: poison the last value saved in arch_switch, so we can
+ // tell when it's done (i.e. when this magic number gets
+ // clobbered)
+ u64_t old_r15 = old_thread->callee_saved.r15;
+ old_thread->callee_saved.r15 = 0xff55aa11;
+
new_thread = z_get_next_ready_thread();
+ if (new_thread == old_thread) {
+ // (HACK: not switching, put it back)
+ _current->callee_saved.r15 = old_r15;
+ }
+
if (new_thread != old_thread) {
sys_trace_thread_switched_out();
#ifdef CONFIG_TIMESLICING
diff --git a/kernel/sched.c b/kernel/sched.c
index 2ebc9d4be9..007fc5bab4 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -248,6 +248,12 @@ static ALWAYS_INLINE struct k_thread *next_up(void)
}
z_mark_thread_as_not_queued(thread);
+ // HACK: spin for an incoming switch
+ if (thread != _current) {
+ volatile u64_t *xx = &thread->callee_saved.r15;
+ while (*xx == 0xff55aa11);
+ }
+
return thread;
#endif
}
```
(Applying that on top of #21903, that is.)
OK, #21903 now has both fixes (and a reversion of the CPU count workaround) and seems much improved to me. I'm still seeing occasional failures in samples/userspace/prod_consumer, samples/userspace/shared_mem and tests/kernel/common. A quick check running prod_consumer in isolation on an unloaded system showed 6 failures in 44 runs, all of which seem to be detected as a "no data?" error by the test before a crash (not sure if the crash is the bug or just an effect of the edge case). Still investigating, but I'd suggest reviewing and merging those fixes ASAP.
Well, that one was much quicker. A tiny one-line SMP race in the queue code was the source of the prod_consumer failures. Added to the series. The others might have gotten a little better; hard to say.
The major issues here are now fixed and we've re-enabled SMP in QEMU. One of the races was fixed with a delay, but I requested we leave this ticket open until IPI can be re-worked to make it bulletproof as described here: #21903 (comment) |
I'm still seeing some instabilities in the latest code after merging #21903. The first is a problem in tests/kernel/sched/schedule_api, where test_user_k_is_preempt fails the second k_thread_create() syscall with "k_thread in use". It's calling k_thread_abort() on a thread object and then immediately calling k_thread_create() on the same object. Either k_thread_abort() needs to block until the target thread actually exits, or we need a new blocking kernel API call to wait for a thread to exit. This doesn't always fail, but the explanation of the failure seems straightforward. The second is that samples/userspace/shared_mem still appears to be exiting the emulator. I am probably going to have to instrument QEMU itself to do a debug dump when it triple faults to discover more. This happens a lot.
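A sketch of the failing pattern as I understand it (names, stack size, and priority here are illustrative, not the actual schedule_api test code):

```c
#include <zephyr.h>

#define WORKER_STACK_SIZE 1024

K_THREAD_STACK_DEFINE(worker_stack, WORKER_STACK_SIZE);
static struct k_thread worker_thread;

static void worker(void *p1, void *p2, void *p3)
{
	/* ... child thread body ... */
}

void reuse_thread_object(void)
{
	k_thread_create(&worker_thread, worker_stack, WORKER_STACK_SIZE,
			worker, NULL, NULL, NULL,
			K_PRIO_PREEMPT(1), 0, K_NO_WAIT);

	k_thread_abort(&worker_thread);

	/* On SMP the aborted thread may still be finishing up on the other
	 * CPU at this point, so re-using the object right away can be
	 * rejected with "k_thread in use". */
	k_thread_create(&worker_thread, worker_stack, WORKER_STACK_SIZE,
			worker, NULL, NULL, NULL,
			K_PRIO_PREEMPT(1), 0, K_NO_WAIT);
}
```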
Saw a similar "k_thread in use" error in tests/kernel/mem_protect/sys_sem, this time from re-using a thread object from another test case. This looks like a test case issue that can only be solved by some kind of wait() API:
Here there is no explicit k_thread_abort() call. The code is using the semaphore to determine that the thread is finished... but it really isn't. Unfortunately, I think a fair number of tests do this.
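For illustration, the shape of that pattern (this is not the literal sys_sem test code; the names are made up):

```c
#include <zephyr.h>

#define CHILD_STACK_SIZE 1024

K_THREAD_STACK_DEFINE(child_stack, CHILD_STACK_SIZE);
static struct k_thread child_thread;
K_SEM_DEFINE(done_sem, 0, 1);

static void child(void *p1, void *p2, void *p3)
{
	/* ... test body ... */
	k_sem_give(&done_sem);
	/* The thread is still alive here, winding down, possibly on the
	 * other CPU, after the parent has already been woken up. */
}

void run_case_twice(void)
{
	k_thread_create(&child_thread, child_stack, CHILD_STACK_SIZE,
			child, NULL, NULL, NULL,
			K_PRIO_PREEMPT(1), 0, K_NO_WAIT);

	/* The test treats this as "the child has finished"... */
	k_sem_take(&done_sem, K_FOREVER);

	/* ...so this re-use of child_thread can still race with the child's
	 * teardown and fail with "k_thread in use". */
	k_thread_create(&child_thread, child_stack, CHILD_STACK_SIZE,
			child, NULL, NULL, NULL,
			K_PRIO_PREEMPT(1), 0, K_NO_WAIT);
}
```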
Anecdotally, the frequency of timing-related QEMU failures seems to have increased, although I don't have concrete data or even know whether this affects platforms outside of x86_64. I've observed the following weird crashes in test runs this morning; I have seen these only once:
The issue with samples/userspace/shared_mem is not a triple fault. It's a sample, which doesn't enable CONFIG_TEST, which in turn doesn't enable CONFIG_LOG, which is needed to see fatal errors. I turned that on (CONFIG_LOG=y and CONFIG_LOG_MINIMAL=y in prj.conf) and got more information. Running this test in a terminal, I am able to reproduce it locally outside of sanitycheck, although it can take several iterations before it crashes out, and sometimes instead of crashing it just hangs:
Two CPUs are crashing simultaneously, which is why the exception info is interleaved and a little garbled. But something really sticks out:
How is it that the same thread object at 0x122700 is running on two CPUs at the same time?
In the do_swap() function, what happens if CPU 1 accesses the ready queue before CPU 0 has done the actual context-save operations?
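To illustrate the window being asked about (a sketch, not actual kernel source; this is the same gap the r15-cookie PoC above is meant to close):

```c
/*
 *   CPU 0: do_swap()                    CPU 1: next_up()
 *   ----------------                    ----------------
 *   re-queue _current (old thread)
 *                                       pull that same thread off the
 *                                       ready queue and switch to it
 *   enter arch_switch(), start saving
 *   the outgoing thread's registers <-- CPU 1 is already running a thread
 *                                       whose registers CPU 0 has not yet
 *                                       finished saving
 */
```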
Addressed.
I'm seeing some sporadic crashes on x86_64.
These crashes seem to have the following characteristics:
Here's an example, but I have seen this occur in a lot of tests:
I started noticing this after I enabled boot page tables. It's unclear whether my work introduced this or whether it was an issue that was already present, although I'm starting to suspect the latter, since the code I brought in works great for 32-bit.
Due to sanitycheck's automatic retries of failed test cases (see #14173), this has gone undetected in CI.