Fix SMP races with thread swap and abort #21903
Conversation
Think I have this now. It was a little subtle. Actually one of the hardest parts was writing a clear explanation.
I can still reproduce #21317; patching ztest to add the sleep call makes this easier to reproduce.
```c
 * complete. Otherwise the thread still runs a bit
 * until it can swap, requiring a delay.
 */
if (IS_ENABLED(CONFIG_SMP) && arch_is_in_isr()) {
```
I'm not sure I understand this. z_thread_single_abort() can be called with a thread object running on another CPU, but we're checking whether the current CPU is in an ISR. Do we need to check the state of the CPU that the thread is running on?
```c
 * delay in that circumstance.
 */
if ((thread->base.thread_state & _THREAD_ABORTED_IN_ISR) == 0U) {
	k_busy_wait(THREAD_ABORT_DELAY_US);
```
I feel uneasy about this... is there some state in the other thread we can poll in a loop, instead of relying on an arbitrary delay?
Also, I'm assuming this logic happens with interrupts unlocked?
There is not, absent changing the way swap works internally. The thread stack is in use right up to the swap call, but the swapped-to context may have been interrupted; there is nowhere at the OS level to place code to run "after swap", since it just swaps back to whatever it was doing. I tried to find a way to make this work such that the scheduler lock is held and synchronously released at swap time, but even then "synchronously" isn't synchronous enough, because the CPU is still running C code at the point where that spinlock gets released.
An architectural fix would have every call to z_swap set a value in the swapped-from thread struct after the last memory access to its data had occurred, then we could spin waiting for that.
And yes, this delay happens outside the lock.
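The architectural fix described above can be modeled with a per-thread flag that the arch layer sets as its very last store on behalf of the outgoing context, which the aborting CPU then spins on. This is a minimal sketch in plain C11 atomics; `struct pseudo_thread`, `mark_switched_out()`, and `wait_for_abort()` are hypothetical names, not Zephyr APIs:

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical sketch, not Zephyr code: the swapped-from thread's
 * struct gains a flag that the arch layer sets only after the last
 * access to the thread's stack and data. An aborting CPU can then
 * spin on it instead of relying on an arbitrary delay.
 */
struct pseudo_thread {
	atomic_bool switched_out;
};

/* Called by the arch layer as the very last store on behalf of the
 * outgoing context, after all of its state has been saved.
 */
static void mark_switched_out(struct pseudo_thread *t)
{
	atomic_store_explicit(&t->switched_out, true, memory_order_release);
}

/* Called by the CPU waiting for the abort to really complete. */
static void wait_for_abort(struct pseudo_thread *t)
{
	while (!atomic_load_explicit(&t->switched_out,
				     memory_order_acquire)) {
		/* spin */
	}
}
```

The acquire/release pairing guarantees that once the waiter sees the flag, every earlier store made by the dying context is visible too.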
Thinking out loud here: if the race condition is two different contexts both trying to complete k_thread_abort() on the same thread object, could we introduce a per-thread object spinlock to synchronize this?
The synchronization is already there. The problem is that AFTER you've "completed" all the bookkeeping for thread abort and released the lock, you're still running until you get to the swap call. So another CPU can't be allowed to think that its own call to abort is complete until you're definitely not running any more.

(Which is why this is SMP-only: on UP, even if the newly-dead-but-still-running thread is interrupted before it swaps, it will never be resumed, so no race can happen. With SMP there is another processor that can do it.)

Basically, what happens now is that the other CPU thinks "Oh, OK, you're finished aborting" and then launches another thread in your stack area while you're still running, causing one or both to crash weirdly. So we need to "lock" not only the metadata needed for thread abort but the whole thread, its stack, and tracking object. And the only way to do that is deep inside swap.
Force-pushed from 4391b76 to fe4bb07.
Update to correct the too-clever handling of the ABORTED_IN_ISR flag. This fixes the thread end/spawn races for me, though there are still some spurious failures in there that need attention.

FWIW: applying this patch on top of the last commit before x86_64 userspace results in a tree that passes basically everything (though there's a semi-routine failure in the shell test that persists). The remaining failures look odd: RIP is getting set to bad values, sometimes nulls, sometimes pointers to rodata areas, and it's happening in interrupt contexts where a few dozen bytes have already been pushed on the stack. It has the look of a double fault, where it's actually logging the second exception and not the first, maybe? Still looking.
The exception code reports the vector that caused the fault, so you should see "Double Fault" if that was the case. Unless it faulted in the exception / irq code path...
I'm not seeing this. I checked out my tree at revision d3c525b which is immediately before the userspace series. I then applied this patch, and then added the sleep to z_ztest_run_test_suite(). I get several failures each run in a variety of tests. Here's the output of four runs in a row:
It may also be useful in testing to clone this branch: https://github.com/andrewboie/zephyr/tree/smp-issues which shows similar behavior with the old x86-64 port. If you backport your fix there (it doesn't apply cleanly) and run tests with the old x86_64 target, it will help identify whether the behavior is related to the arch port or not.
I can't tell if you're saying you can still see problems (where I agree), or whether you're saying this patch doesn't fix anything for you (it seems like a clear bug, and it absolutely improves things a ton for me). I genuinely don't think the stuff you're seeing is related to this issue. It's a separate bug.
I honestly cannot tell if this patch improves the situation. I have no objections to merging it. What I'm trying to make clear is that this doesn't fix the behavior reported in #21317 and I'm not seeing a significant reduction in the tests that are failing.
Actually that's not quite true, which is why I haven't +1'd this. I'm very uneasy with the solution of adding a 50us busy-wait to correct this race. I need some kind of guarantee that this could never fail to work, in all conceivable scheduling and interrupt delivery scenarios. Think hundreds/thousands of devices in a functional safety scenario: any failure would be unacceptable.

I understand a more robust solution requires much more invasive changes, but I'm not convinced yet that these changes don't need to be made. If you can prove that the delay is always guaranteed to work then I'll add a +1. I'm not at the point where I'd leave a -1, but this doesn't sit well enough with me yet to +1 it.
Force-pushed from fe4bb07 to 4c5f542.
OK, try this. It now includes a fix for a second race in z_swap(), and IMHO a pretty clever synchronization trick that avoids the need to change the architecture API. This is working at well under one failure per run on my rig. Will continue to look at the remaining spurious failures (samples/userspace/shared_mem and samples/userspace/prod_consumer are the most affected, with a few tests/kernel/common failures too, it seems).

Note that the original abort fix (with the delay) remains, unmodified. You might ask: can the tight spin loop used for swap be used for the same purpose here? I think it can, but it's going to require some rework to the way IPI works currently.
All checks are passing now.
I'm getting really good results with this PR. In several runs with the delay placed in ztest, most of the failures I see are just timing-related stuff that we already know about. I concur that only a few tests still have problems:

- samples/userspace/shared_mem - This test seems to run and then suddenly exit the emulator. QEMU does this when it triple-faults, and in a lot of cases dumps no debug information. I think this one is probably on my end.
- tests/kernel/common - Some of the failures I see are due to timing slop. But sometimes it crashes, with a trace that looks like it was trying to read/write a member of a NULL struct address. Not sure what it is about this test that elicits behavior like this.

I haven't seen any crashes in prod_consumer; got any logs?
How's this sound for a deal -- I've voiced some doubts about using a delay, but I'll shut up about it if we can file a GH enhancement for implementing a more robust solution in the fullness of time?
One more easy one in shared_mem that seems to have resolved that test. Down to what I'm measuring as 4 total failures across 7 runs so far, with no repetitions.
Races seem fixed, at least the ones measurable by existing tests. This reverts commit 3f3c72a. Signed-off-by: Andy Ross <andrew.j.ross@intel.com>
The original intent was that the output handle be written through the pointer in the second argument, though not all architectures used that scheme. As it turns out, that write is becoming a synchronization signal, so it's no longer optional. Clarify the documentation in arch_switch() about this requirement, and add an instruction to the x86_64 context switch to implement it as originally envisioned. Signed-off-by: Andy Ross <andrew.j.ross@intel.com>
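The handle-as-signal scheme this commit describes can be modeled in plain C11 atomics: the old thread's handle is cleared before switching away, the arch layer writes it only once the outgoing state is fully saved, and any other CPU spins until it is non-NULL. This is an illustrative sketch, not the Zephyr implementation; `struct toy_thread`, `begin_switch()`, `finish_switch()`, and `wait_for_switch()` are hypothetical names:

```c
#include <stdatomic.h>
#include <stddef.h>

/* Sketch only: per-thread switch handle used as a completion signal. */
struct toy_thread {
	void *_Atomic switch_handle;
};

/* Before switching away: mark this thread's state as not yet saved. */
static void begin_switch(struct toy_thread *old)
{
	atomic_store_explicit(&old->switch_handle, NULL,
			      memory_order_relaxed);
}

/* Arch layer: the final store, issued after all registers and stack
 * contents of the outgoing context have been saved.
 */
static void finish_switch(struct toy_thread *old, void *handle)
{
	atomic_store_explicit(&old->switch_handle, handle,
			      memory_order_release);
}

/* Scheduler on another CPU: refuse to resume the thread until the
 * save has visibly completed.
 */
static void *wait_for_switch(struct toy_thread *t)
{
	void *h;

	do {
		h = atomic_load_explicit(&t->switch_handle,
					 memory_order_acquire);
	} while (h == NULL);
	return h;
}
```

The design choice here is that no new architecture API is needed: the write the arch layer was already supposed to perform doubles as the "state is saved" signal.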
These two spots were calling z_sched_ipi() (the IPI handler run under the ISR, which is a noop here because obviously the current thread isn't DEAD) and not arch_sched_ipi() (which triggers an IPI on other CPUs to inform them of scheduling state changes), presumably because of a typo. Apparently we don't have tests for k_wakeup() and k_thread_priority_set() that are sensitive to latency in SMP contexts... Signed-off-by: Andy Ross <andrew.j.ross@intel.com>
It's possible for a thread to abort itself simultaneously with an external abort from another thread. In fact in our test suite this is a common thing, as ztest will abort its own spawned threads at the end of a test, as they tend to be exiting on their own. When that happens, the thread marks itself DEAD and does all its scheduler bookkeeping, but it is STILL RUNNING on its own stack until it makes its way to its final swap. The external context would see that "dead" metadata and return from k_thread_abort(), allowing the next test to reuse and spawn the same thread struct while the old context was still running. Obviously that's bad. Unfortunately, this is impossible to address completely without modifying every SMP architecture to add an API-visible hook to every swap that signals completion. In practice the best we can do is add a delay. But note the optimization: almost always, the scheduler IPI catches the running thread and kills it from interrupt context (i.e. on a different stack). When that happens, we know that the interrupted thread will never be resumed (because it's dead) and can elide the delay. We only pay the cost when we actually detect a race. Signed-off-by: Andy Ross <andrew.j.ross@intel.com>
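The delay-with-elision logic this commit describes reduces to a small decision function. A hedged sketch; the flag bit, delay value, and helper name below are illustrative stand-ins, not the actual kernel code:

```c
#include <stdbool.h>

#define TOY_ABORT_DELAY_US   50U   /* illustrative delay, in us */
#define TOY_ABORTED_IN_ISR   0x1U  /* illustrative state flag bit */

/* Hypothetical helper: decide how long the external aborter must
 * busy-wait after the target thread's metadata says it is dead.
 *
 * - UP: a dead-but-still-running thread can never be resumed, so no
 *   race exists and no delay is needed.
 * - SMP, killed from ISR: the scheduler IPI caught the thread in
 *   interrupt context (a different stack), so it will never run
 *   again and the delay can be elided.
 * - SMP, otherwise: the thread may still be running on its own stack
 *   until its final swap, so pay the delay.
 */
static unsigned int abort_delay_us(unsigned int thread_state, bool smp)
{
	if (!smp) {
		return 0U;
	}
	if ((thread_state & TOY_ABORTED_IN_ISR) != 0U) {
		return 0U;
	}
	return TOY_ABORT_DELAY_US;
}
```

So the cost is only paid in the one case where a race is actually possible, which is the optimization the commit message calls out.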
On SMP, there is an inherent race when swapping: the old thread adds itself back to the run queue before calling into the arch layer to do the context switch. The former is properly synchronized under the scheduler lock, and the latter operates with interrupts locally disabled. But until somewhere in the middle of arch_switch(), the old thread (that is in the run queue!) does not have complete saved state that can be restored. So it's possible for another CPU to grab a thread before it is saved and try to restore its unsaved register contents (which are garbage -- typically whatever state it had at the last interrupt). Fix this by leveraging the "swapped_from" pointer already passed to arch_switch() as a synchronization primitive. When the switch implementation writes the new handle value, we know the switch is complete. Then we can wait for that in z_swap() and at interrupt exit. Signed-off-by: Andy Ross <andrew.j.ross@intel.com>
Calling z_ready_thread() means the thread is now ready and can wake up at any moment on another CPU. But we weren't finished setting the return value! So the other side could wake up with a spurious "error" condition if it ran too soon. Note that on systems with a working IPI, that wakeup can happen much faster than you might think. Signed-off-by: Andy Ross <andrew.j.ross@intel.com>
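The ordering fix this commit describes boils down to: finish writing the wakeup return value before publishing the thread as runnable. A toy single-CPU illustration; the names are hypothetical (the real code operates under the scheduler lock and uses z_ready_thread()):

```c
#include <stdbool.h>

/* Toy model of the ordering bug: all bookkeeping on the sleeping
 * thread (here, its swap return value) must be complete before the
 * thread is marked runnable, because on SMP with a working IPI
 * another CPU may run it the instant it becomes ready.
 */
struct toy_waiter {
	int swap_retval;   /* value the thread's z_swap() will return */
	bool ready;        /* stand-in for being in the run queue */
};

static void wake_thread(struct toy_waiter *t, int retval)
{
	/* The buggy order would be: t->ready = true; then set retval,
	 * letting the woken thread observe a stale "error" value.
	 */
	t->swap_retval = retval;  /* finish the bookkeeping first... */
	t->ready = true;          /* ...only then publish the thread */
}
```

On real hardware the "publish" step would also need release semantics (or the scheduler lock) so the retval store cannot be reordered past it.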
There was some unlocked initialization code in the "enc" thread that would race with the "pt" and "ct" threads if another CPU was available to run them (it was safe on UP because "enc" entered the queue first and was cooperative, the others wouldn't run until it blocked). Move it to main() and remove the enc_state guard variable which is no longer doing anything. Signed-off-by: Andy Ross <andrew.j.ross@intel.com>
Force-pushed from 173108d to dbaf43a.
Fix commit message warning; I'd left a bare revert in.
```c
@@ -104,6 +103,15 @@ void main(void)
	struct k_mem_partition *dom0_parts[] = {&part0, &part1};
	k_tid_t tPT, tENC, tCT;

	fBUFIN = 0; /* clear flags */
```
Change looks good, but I'm still worried about how this was crashing by having the emulator suddenly exit. Even with a race, it should at least produce an exception that can be handled; if the whole thing triple-faults, that tells me we might have a bug in arch/x86 somewhere that I should investigate more, unless this race somehow got the page tables corrupted. What do you think?
I need some questions I asked earlier answered before this can go in.

- Are there any scenarios of interrupt delivery/scheduling where the k_busy_wait(THREAD_ABORT_DELAY_US) will not work, no matter how obscure? If the answer is yes, then we need at minimum to have an issue in the backlog to address it with a solution like was done with wait_for_switch(). Or ideally, just do it like that now.
- Was the samples/userspace/shared_mem test triple-faulting and exiting the emulator, or was it simply getting stuck and timing out? If it was triple-faulting and exiting, there may be a serious bug somewhere that needs to be looked at; no user thread should be able to get the kernel to explode like that.
This was just a vanilla failure. The triple-fault behavior was, AFAICT, exclusively related to the "two CPUs running on the same kernel stack" symptom. I don't have a good explanation for exactly how the process got out of control like that either, but it hasn't reappeared with the two main fixes here.

Regarding the delay: we don't currently spin on anything non-deterministic on the abort paths (i.e. from the point where the thread has self-aborted to the next swap or return from ISR). But obviously it's not possible to prove that will always be so; new APIs will always be able to invent new ways to k_thread_abort(_current) and then do something crazy like spin on an external resource.

If it bugs you, we can just wait: I have a handle on how to rework IPI to be synchronous. (There's a cost there too: you have to wait for all CPUs to have handled the IPI, and not just for one to have seen this particular thread, but that's not awful given that we're talking about thread abort and not a performance path.)
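The synchronous-IPI rework floated above can be sketched as a simple acknowledgment counter: the sender fires the IPI and then spins until every other CPU has bumped the counter from its interrupt handler. All names here are hypothetical, not current Zephyr code:

```c
#include <stdatomic.h>

/* Sketch only: count of CPUs that have acknowledged the IPI. */
static atomic_uint ipi_acks;

/* Each remote CPU would run this from its IPI interrupt handler;
 * acknowledging implies it will pass through interrupt exit, where
 * any dead thread it was running is finally off its stack.
 */
static void ipi_handler(void)
{
	atomic_fetch_add_explicit(&ipi_acks, 1U, memory_order_release);
}

/* The sender spins until all other CPUs have acknowledged, instead
 * of sleeping for a fixed, hoped-to-be-long-enough delay.
 */
static void wait_for_ipi_acks(unsigned int num_other_cpus)
{
	while (atomic_load_explicit(&ipi_acks, memory_order_acquire)
	       < num_other_cpus) {
		/* spin; real code would reset the counter per IPI and
		 * likely bound the wait for robustness
		 */
	}
}
```

This is exactly the trade-off described above: the sender waits for all CPUs rather than just the one running the target thread, but abort is not a performance path.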
Gotcha, thanks for the additional details.
Let's get this merged now then, but I would like to leave #21317 open until we can do the IPI re-work. It really has been bugging me, mostly because of our FuSa (functional safety) positioning.
Fixes for two races under SMP:

1. A thread aborting another thread that just self-aborted will return from k_thread_abort() before the target thread is actually finished running on its stack, leading to corruption/crash if the same thread object is then re-used immediately.
2. A runnable thread that enters z_swap() can be seen by other CPUs as part of the run queue before its own CPU has finished saving its state inside arch_switch().