Fix SMP races with thread swap and abort #21903
Conversation
Think I have this now. It was a little subtle. Actually one of the hardest parts was writing a clear explanation.
I can still reproduce #21317; patching ztest to add the sleep call makes this easier to reproduce.
```c
 * complete. Otherwise the thread still runs a bit
 * until it can swap, requiring a delay.
 */
if (IS_ENABLED(CONFIG_SMP) && arch_is_in_isr()) {
```
I'm not sure I understand this. z_thread_single_abort() can be called with a thread object running on another CPU, but we're checking whether the current CPU is in an ISR. Do we need to check the state of the CPU that the thread is running on?
```c
 * delay in that circumstance.
 */
if ((thread->base.thread_state & _THREAD_ABORTED_IN_ISR) == 0U) {
	k_busy_wait(THREAD_ABORT_DELAY_US);
```
I feel uneasy about this... is there some state in the other thread we can poll in a loop, instead of relying on an arbitrary delay?
Also, I'm assuming this logic happens with interrupts unlocked?
There is not, absent changing the way swap works internally. The thread stack is in use right up to the swap call, but the swapped-to context may have been interrupted; there is nowhere at the OS level to place code to run "after swap", since it just swaps back to whatever it was doing. I tried to find a way to make this work such that the scheduler lock is held and synchronously released at swap time, but even then "synchronously" isn't synchronous enough, because the CPU is still running C code at the point where that spinlock gets released.
An architectural fix would have every call to z_swap set a value in the swapped-from thread struct after the last memory access to its data had occurred, then we could spin waiting for that.
And yes, this delay happens outside the lock.
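The architectural fix described above can be modeled with a per-thread flag that the arch layer sets as its very last store on behalf of the outgoing context, which the aborting CPU then spins on. This is a minimal sketch in plain C11 atomics; `struct pseudo_thread`, `mark_switched_out()`, and `wait_for_abort()` are hypothetical names, not Zephyr APIs:

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical sketch, not Zephyr code: the swapped-from thread's
 * struct gains a flag that the arch layer sets only after the last
 * access to the thread's stack and data. An aborting CPU can then
 * spin on it instead of relying on an arbitrary delay.
 */
struct pseudo_thread {
	atomic_bool switched_out;
};

/* Called by the arch layer as the very last store on behalf of the
 * outgoing context, after all of its state has been saved.
 */
static void mark_switched_out(struct pseudo_thread *t)
{
	atomic_store_explicit(&t->switched_out, true, memory_order_release);
}

/* Called by the CPU waiting for the abort to really complete. */
static void wait_for_abort(struct pseudo_thread *t)
{
	while (!atomic_load_explicit(&t->switched_out,
				     memory_order_acquire)) {
		/* spin */
	}
}
```

The acquire/release pairing guarantees that once the waiter sees the flag, every earlier store made by the dying context is visible too.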
Thinking out loud here: if the race condition is two different contexts both trying to complete k_thread_abort() on the same thread object, could we introduce a per-thread object spinlock to synchronize this?
The synchronization is already there. The problem is that AFTER you've "completed" all the bookkeeping for thread abort and released the lock, you're still running until you get to the swap call. So another CPU can't be allowed to think that its own call to abort is complete until you're definitely not running any more.

(Which is why this is SMP-only: on UP, even if the newly-dead-but-still-running thread is interrupted before it swaps, it will never be resumed, so no race can happen. With SMP there is another processor that can do it.)

Basically, what happens now is that the other CPU thinks "Oh, OK, you're finished aborting" and then launches another thread in your stack area while you're still running, causing one or both to crash weirdly. So we need to "lock" not only the metadata needed for thread abort but the whole thread, its stack, and tracking object. And the only way to do that is deep inside swap.
Force-pushed from 4391b76 to fe4bb07.
Update to correct the too-clever handling of the ABORTED_IN_ISR flag. This fixes the thread end/spawn races for me, though there are still some spurious failures in there that need attention.

FWIW: applying this patch on top of the last commit before x86_64 userspace results in a tree that passes basically everything (though there's a semi-routine failure in the shell test that persists). The remaining failures look odd: RIP is getting set to bad values, sometimes nulls, sometimes pointers to rodata areas, and it's happening in interrupt contexts where a few dozen bytes have already been pushed on the stack. It has the look of a double fault, where it's actually logging the second exception and not the first, maybe? Still looking.
The exception code reports the vector that caused the fault, so you should see "Double Fault" if that was the case. Unless it faulted in the exception / irq code path...
I'm not seeing this. I checked out my tree at revision d3c525b which is immediately before the userspace series. I then applied this patch, and then added the sleep to z_ztest_run_test_suite(). I get several failures each run in a variety of tests. Here's the output of four runs in a row:
It may also be useful in testing to clone this branch: https://github.com/andrewboie/zephyr/tree/smp-issues which shows similar behavior with the old x86-64 port. If you backport your fix there (it doesn't apply cleanly) and run tests with the old x86_64 target, it will help identify whether the behavior is related to the arch port or not.
I can't tell if you're saying you can still see problems (where I agree), or whether you're saying this patch doesn't fix anything for you (it seems like a clear bug, and it absolutely improves things a ton for me). I genuinely don't think the stuff you're seeing is related to this issue. It's a separate bug.
I honestly cannot tell if this patch improves the situation. I have no objections to merging it. What I'm trying to make clear is that this doesn't fix the behavior reported in #21317 and I'm not seeing a significant reduction in the tests that are failing.
Actually that's not quite true, which is why I haven't +1'd this. I'm very uneasy with the solution of adding a 50us busy-wait to correct this race. I need some kind of guarantee that this could never fail to work, in all conceivable scheduling and interrupt delivery scenarios. Think hundreds/thousands of devices in a functional safety scenario: any failure would be unacceptable.

I understand a more robust solution requires much more invasive changes, but I'm not convinced yet that these changes don't need to be made. If you can prove that the delay is always guaranteed to work then I'll add a +1. I'm not at the point where I'd leave a -1, but this doesn't sit well enough with me yet to +1 it.
Force-pushed from fe4bb07 to 4c5f542.
OK, try this. It now includes a fix for a second race in z_swap(), and IMHO a pretty clever synchronization trick that avoids the need to change the architecture API. This is working at well under one failure per run on my rig. Will continue to look at the remaining spurious failures (samples/userspace/shared_mem and samples/userspace/prod_consumer are the most affected, with a few tests/kernel/common failures too, it seems).

Note that the original abort fix (with the delay) remains, unmodified. You might ask: can the tight spin loop used for swap be used for the same purpose here? I think it can, but it's going to require some rework to the way IPI works currently.
All checks are passing now.
I'm getting really good results with this PR. In several runs with the delay placed in ztest, most of the failures I see are just timing-related stuff that we already know about. I concur that only a few tests still have problems:

- samples/userspace/shared_mem - This test seems to run and then suddenly exit the emulator. QEMU does this when it triple-faults, and in a lot of cases dumps no debug information. I think this one is probably on my end.
- tests/kernel/common - Some of the failures I see are due to timing slop. But sometimes it crashes, with a trace that looks like it was trying to read/write a member of a NULL struct address. Not sure what it is about this test that elicits behavior like this.

I haven't seen any crashes in prod_consumer; got any logs?
How's this sound for a deal -- I've voiced some doubts about using a delay, but I'll shut up about it if we can file a GH enhancement for implementing a more robust solution in the fullness of time?
One more easy one in shared_mem that seems to have resolved that test. Down to what I'm measuring as 4 total failures across 7 runs so far, with no repetitions.
Races seem fixed, at least the ones measurable by existing tests. This reverts commit 3f3c72a. Signed-off-by: Andy Ross <andrew.j.ross@intel.com>
The original intent was that the output handle be written through the pointer in the second argument, though not all architectures used that scheme. As it turns out, that write is becoming a synchronization signal, so it's no longer optional. Clarify the documentation in arch_switch() about this requirement, and add an instruction to the x86_64 context switch to implement it as originally envisioned. Signed-off-by: Andy Ross <andrew.j.ross@intel.com>
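The handle-as-signal scheme this commit describes can be modeled in plain C11 atomics: the old thread's handle is cleared before switching away, the arch layer writes it only once the outgoing state is fully saved, and any other CPU spins until it is non-NULL. This is an illustrative sketch, not the Zephyr implementation; `struct toy_thread`, `begin_switch()`, `finish_switch()`, and `wait_for_switch()` are hypothetical names:

```c
#include <stdatomic.h>
#include <stddef.h>

/* Sketch only: per-thread switch handle used as a completion signal. */
struct toy_thread {
	void *_Atomic switch_handle;
};

/* Before switching away: mark this thread's state as not yet saved. */
static void begin_switch(struct toy_thread *old)
{
	atomic_store_explicit(&old->switch_handle, NULL,
			      memory_order_relaxed);
}

/* Arch layer: the final store, issued after all registers and stack
 * contents of the outgoing context have been saved.
 */
static void finish_switch(struct toy_thread *old, void *handle)
{
	atomic_store_explicit(&old->switch_handle, handle,
			      memory_order_release);
}

/* Scheduler on another CPU: refuse to resume the thread until the
 * save has visibly completed.
 */
static void *wait_for_switch(struct toy_thread *t)
{
	void *h;

	do {
		h = atomic_load_explicit(&t->switch_handle,
					 memory_order_acquire);
	} while (h == NULL);
	return h;
}
```

The design choice here is that no new architecture API is needed: the write the arch layer was already supposed to perform doubles as the "state is saved" signal.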
These two spots were calling z_sched_ipi() (the IPI handler run under the ISR, which is a noop here because obviously the current thread isn't DEAD) and not arch_sched_ipi() (which triggers an IPI on other CPUs to inform them of scheduling state changes), presumably because of a typo. Apparently we don't have tests for k_wakeup() and k_thread_priority_set() that are sensitive to latency in SMP contexts... Signed-off-by: Andy Ross <andrew.j.ross@intel.com>
It's possible for a thread to abort itself simultaneously with an external abort from another thread. In fact in our test suite this is a common thing, as ztest will abort its own spawned threads at the end of a test, as they tend to be exiting on their own. When that happens, the thread marks itself DEAD and does all its scheduler bookkeeping, but it is STILL RUNNING on its own stack until it makes its way to its final swap. The external context would see that "dead" metadata and return from k_thread_abort(), allowing the next test to reuse and spawn the same thread struct while the old context was still running. Obviously that's bad. Unfortunately, this is impossible to address completely without modifying every SMP architecture to add an API-visible hook to every swap that signals completion. In practice the best we can do is add a delay. But note the optimization: almost always, the scheduler IPI catches the running thread and kills it from interrupt context (i.e. on a different stack). When that happens, we know that the interrupted thread will never be resumed (because it's dead) and can elide the delay. We only pay the cost when we actually detect a race. Signed-off-by: Andy Ross <andrew.j.ross@intel.com>
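The delay-with-elision logic this commit describes reduces to a small decision function. A hedged sketch; the flag bit, delay value, and helper name below are illustrative stand-ins, not the actual kernel code:

```c
#include <stdbool.h>

#define TOY_ABORT_DELAY_US   50U   /* illustrative delay, in us */
#define TOY_ABORTED_IN_ISR   0x1U  /* illustrative state flag bit */

/* Hypothetical helper: decide how long the external aborter must
 * busy-wait after the target thread's metadata says it is dead.
 *
 * - UP: a dead-but-still-running thread can never be resumed, so no
 *   race exists and no delay is needed.
 * - SMP, killed from ISR: the scheduler IPI caught the thread in
 *   interrupt context (a different stack), so it will never run
 *   again and the delay can be elided.
 * - SMP, otherwise: the thread may still be running on its own stack
 *   until its final swap, so pay the delay.
 */
static unsigned int abort_delay_us(unsigned int thread_state, bool smp)
{
	if (!smp) {
		return 0U;
	}
	if ((thread_state & TOY_ABORTED_IN_ISR) != 0U) {
		return 0U;
	}
	return TOY_ABORT_DELAY_US;
}
```

So the cost is only paid in the one case where a race is actually possible, which is the optimization the commit message calls out.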
On SMP, there is an inherent race when swapping: the old thread adds itself back to the run queue before calling into the arch layer to do the context switch. The former is properly synchronized under the scheduler lock, and the latter operates with interrupts locally disabled. But until somewhere in the middle of arch_switch(), the old thread (that is in the run queue!) does not have complete saved state that can be restored. So it's possible for another CPU to grab a thread before it is saved and try to restore its unsaved register contents (which are garbage -- typically whatever state it had at the last interrupt). Fix this by leveraging the "swapped_from" pointer already passed to arch_switch() as a synchronization primitive. When the switch implementation writes the new handle value, we know the switch is complete. Then we can wait for that in z_swap() and at interrupt exit. Signed-off-by: Andy Ross <andrew.j.ross@intel.com>
Calling z_ready_thread() means the thread is now ready and can wake up at any moment on another CPU. But we weren't finished setting the return value! So the other side could wake up with a spurious "error" condition if it ran too soon. Note that on systems with a working IPI, that wakeup can happen much faster than you might think. Signed-off-by: Andy Ross <andrew.j.ross@intel.com>
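The ordering fix this commit describes boils down to: finish writing the wakeup return value before publishing the thread as runnable. A toy single-CPU illustration; the names are hypothetical (the real code operates under the scheduler lock and uses z_ready_thread()):

```c
#include <stdbool.h>

/* Toy model of the ordering bug: all bookkeeping on the sleeping
 * thread (here, its swap return value) must be complete before the
 * thread is marked runnable, because on SMP with a working IPI
 * another CPU may run it the instant it becomes ready.
 */
struct toy_waiter {
	int swap_retval;   /* value the thread's z_swap() will return */
	bool ready;        /* stand-in for being in the run queue */
};

static void wake_thread(struct toy_waiter *t, int retval)
{
	/* The buggy order would be: t->ready = true; then set retval,
	 * letting the woken thread observe a stale "error" value.
	 */
	t->swap_retval = retval;  /* finish the bookkeeping first... */
	t->ready = true;          /* ...only then publish the thread */
}
```

On real hardware the "publish" step would also need release semantics (or the scheduler lock) so the retval store cannot be reordered past it.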
There was some unlocked initialization code in the "enc" thread that would race with the "pt" and "ct" threads if another CPU was available to run them (it was safe on UP because "enc" entered the queue first and was cooperative, the others wouldn't run until it blocked). Move it to main() and remove the enc_state guard variable which is no longer doing anything. Signed-off-by: Andy Ross <andrew.j.ross@intel.com>
Force-pushed from 173108d to dbaf43a.
Fix commit message warning; I'd left a bare revert in.
```c
@@ -104,6 +103,15 @@ void main(void)
	struct k_mem_partition *dom0_parts[] = {&part0, &part1};
	k_tid_t tPT, tENC, tCT;

	fBUFIN = 0; /* clear flags */
```
Change looks good, but I'm still worried about how this was crashing by having the emulator suddenly exit. Even with a race, it should at least produce an exception that can be handled; if the whole thing triple-faults, that tells me we might have a bug in arch/x86 somewhere that I should investigate more, unless this race somehow got the page tables corrupted. What do you think?
I need some questions I asked earlier answered before this can go in.

- Are there any scenarios of interrupt delivery/scheduling where the k_busy_wait(THREAD_ABORT_DELAY_US) will not work, no matter how obscure? If the answer is yes, then we need at minimum to have an issue in the backlog to address it with a solution like was done with wait_for_switch(). Or ideally, just do it like that now.
- Was the samples/userspace/shared_mem test triple-faulting and exiting the emulator, or was it simply getting stuck and timing out? If it was triple-faulting and exiting, there may be a serious bug somewhere that needs to be looked at; no user thread should be able to get the kernel to explode like that.
This was just a vanilla failure. The triple-fault behavior was, AFAICT, exclusively related to the "two CPUs running on the same kernel stack" symptom. I don't have a good explanation for exactly how the process got out of control like that either, but it hasn't reappeared with the two main fixes here.

Regarding the delay: we don't currently spin on anything non-deterministic on the abort paths (i.e. from the point where the thread has self-aborted to the next swap or return from ISR). But obviously it's not possible to prove that will always be so; new APIs will always be able to invent new ways to k_thread_abort(_current) and then do something crazy like spin on an external resource.

If it bugs you, we can just wait: I have a handle on how to rework IPI to be synchronous. (There's a cost there too: you have to wait for all CPUs to have handled the IPI, and not just for one to have seen this particular thread, but that's not awful given that we're talking about thread abort and not a performance path.)
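The synchronous-IPI rework floated above can be sketched as a simple acknowledgment counter: the sender fires the IPI and then spins until every other CPU has bumped the counter from its interrupt handler. All names here are hypothetical, not current Zephyr code:

```c
#include <stdatomic.h>

/* Sketch only: count of CPUs that have acknowledged the IPI. */
static atomic_uint ipi_acks;

/* Each remote CPU would run this from its IPI interrupt handler;
 * acknowledging implies it will pass through interrupt exit, where
 * any dead thread it was running is finally off its stack.
 */
static void ipi_handler(void)
{
	atomic_fetch_add_explicit(&ipi_acks, 1U, memory_order_release);
}

/* The sender spins until all other CPUs have acknowledged, instead
 * of sleeping for a fixed, hoped-to-be-long-enough delay.
 */
static void wait_for_ipi_acks(unsigned int num_other_cpus)
{
	while (atomic_load_explicit(&ipi_acks, memory_order_acquire)
	       < num_other_cpus) {
		/* spin; real code would reset the counter per IPI and
		 * likely bound the wait for robustness
		 */
	}
}
```

This is exactly the trade-off described above: the sender waits for all CPUs rather than just the one running the target thread, but abort is not a performance path.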
Gotcha, thanks for the additional details.
Let's get this merged now then, but I would like to leave #21317 open until we can do the IPI re-work. It really has been bugging me, mostly because of our FuSa (functional safety) positioning.
Fixes for two races under SMP:

1. A thread aborting another thread that just self-aborted will return from k_thread_abort() before the target thread is actually finished running on its stack, leading to corruption/crash if the same thread object is then re-used immediately.
2. A runnable thread that enters z_swap() can be seen by other CPUs as part of the run queue before its own CPU has finished saving its state inside arch_switch().