kernel: SMP scheduling: edge case with pinned threads #69005
Comments
@peter-mitsis @andyross Please evaluate and assign a priority to this issue.
Which modifies the run queue, which will trigger another IPI, right? I'm pretty sure we get this right already? Is there an existence proof of the bug?
I may be mistaken, but I suspect that you are referring to the call to z_requeue_current() from within do_swap(). That routine does indeed call signal_pending_ipi(); however, signal_pending_ipi() will only call arch_sched_ipi() if pending_ipi is true, and pending_ipi is only set to true in flag_ipi(). The routine flag_ipi() is called from a few places, but I think the only relevant one in this scenario is ready_thread() (which is called by z_ready_thread()). As you know, z_ready_thread() can get invoked when unblocking a thread from one of our kernel objects. In the scenario I described, when CPU0 switches out T1 for T3, the pending_ipi flag is false because the original call to arch_sched_ipi() on another CPU cleared it and nothing has happened since to set it again.
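For reference, a minimal sketch of the gating described above (simplified control flow only, not the actual Zephyr source; the function names mirror the scheduler, the bodies are my paraphrase):

```c
/* Simplified sketch of the IPI gating described above. This is not the
 * actual Zephyr source, just the control flow under discussion:
 * flag_ipi() sets a one-shot flag, signal_pending_ipi() consumes it.
 */
#include <zephyr/sys/atomic.h>

extern void arch_sched_ipi(void);

static atomic_t pending_ipi;

static void flag_ipi(void)
{
	/* Called from ready_thread() when a newly runnable thread
	 * should preempt some CPU. */
	atomic_set(&pending_ipi, 1);
}

void signal_pending_ipi(void)
{
	/* Called at reschedule points (e.g. via z_requeue_current() in
	 * do_swap()). It only fires an IPI if the flag is still set; if
	 * an earlier reschedule already consumed the flag, this is a
	 * no-op, which is exactly the hole in the pinned-thread scenario. */
	if (atomic_cas(&pending_ipi, 1, 0)) {
		arch_sched_ipi();
	}
}
```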
As long as the threads involved have no restrictions on which CPUs they may run on, everything works correctly. The problem only rears its head in SMP when the system has a thread that is not allowed to run on one or more CPUs, because a CPU then no longer picks the highest priority ready thread, but rather the highest priority ready thread that it is allowed to run.
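In other words, the selection rule becomes something like the following sketch (the run queue and idle-thread helper are hypothetical names; in Zephyr the mask lives in thread->base.cpu_mask under CONFIG_SCHED_CPU_MASK):

```c
#include <zephyr/kernel.h>
#include <zephyr/sys/dlist.h>

extern sys_dlist_t run_queue;                     /* assumed: priority-ordered ready queue */
extern struct k_thread *idle_thread_for(int cpu); /* assumed helper */

static struct k_thread *next_up_for_cpu(int cpu)
{
	struct k_thread *t;

	/* Queue is assumed ordered highest priority first, so the first
	 * eligible thread is the best eligible thread for this CPU. */
	SYS_DLIST_FOR_EACH_CONTAINER(&run_queue, t, base.qnode_dlist) {
		if ((t->base.cpu_mask & BIT(cpu)) != 0U) {
			return t;
		}
		/* A higher priority thread was skipped here; this is how
		 * a higher priority thread can end up running nowhere. */
	}
	return idle_thread_for(cpu);
}
```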
I did create a test project to force this scenario, and I am confident that I succeeded. However, because it has to tightly control the scheduling points, it came out somewhat hard to follow, and I need to clean it up (and hopefully simplify it) before I post it here.
I am uploading a zip file that contains a test case to reproduce the failure. Build it using ...
When it fails, you should observe something like ...
To make it pass, edit the source file and remove/comment the following line ...
Rebuild, and the output when it passes will look something like ...
Oh... I think I see what you're saying. FWIW I don't think this really has anything to do with cpu_mask except as an exerciser strategy. It's a race/misfeature with the way we do preemption:
It's true that #2 is unlikely to happen under typical scheduling, because the preempting thread will likely have preempted whatever CPU it was made runnable on, which (by construction) is running a lower priority thread than ours. But you can also make it runnable from an interrupt, and see this effect too.

I'm... frankly not sure this is a bug we want to fix. The situation corrects itself the very next time an IPI is signaled for any reason, or when the CPU running the "wrong" thread gets an interrupt for any reason, or if its current thread hits a scheduling point for any reason. I think that's largely acceptable: single CPU apps simply can't preempt except in those circumstances anyway (stated differently: we're demanding that SMP add behavior to fix a bug that all single-core apps need to deal with anyway).

The fix, if we really want it, would be to synchronously IPI every time we preempt on an SMP system, and that's extremely expensive.

[1] The assumption being that whatever IPI needed to happen already did when the run queue was modified making us preemptible. But that IPI might have already been handled by the time we get to the context switch.
To clarify: you should be able to make this happen more simply (on a 2-CPU system in this context):
Again though, the fix will hurt more than the problem. If we do this, let's for sure gate it behind a default-n Kconfig.
Thanks for the insights. If this goes forward, I definitely see the value of having this as a Kconfig option whose default is 'n'. On that note, one idea I have been exploring to help with the cost of IPI cascades and IPIs in general is to target the IPIs only at the relevant CPUs. If I am following things correctly, any time a thread is made available to run, we broadcast an IPI to all CPUs. As a result, each CPU has to go through the scheduler, even if only to determine that it does not need to schedule in a new thread. If we can target the IPIs at just the CPUs that are executing a lower priority thread, then that ought to give us some performance benefit.
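As a sketch of that idea (the directed-IPI primitive and the unlocked walk over the kernel's per-CPU structures are assumptions for illustration, not current API):

```c
#include <zephyr/kernel.h>

/* Assumed primitive: send an IPI only to the CPUs named in the bitmap. */
extern void arch_sched_directed_ipi(uint32_t cpu_bitmap);

static void flag_ipi_targeted(struct k_thread *ready)
{
	uint32_t want = 0;
	unsigned int num_cpus = arch_num_cpus();

	for (unsigned int i = 0; i < num_cpus; i++) {
		struct k_thread *cur = _kernel.cpus[i].current;

		/* Zephyr convention: numerically lower priority value
		 * means higher priority, so '>' finds CPUs running a
		 * lower priority thread than the one just made ready. */
		if (cur->base.prio > ready->base.prio) {
			want |= BIT(i);
		}
	}

	if (want != 0) {
		arch_sched_directed_ipi(want);
	}
}
```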
Now that the IPI delivery optimization has been merged, I am going to finish tackling this issue. |
Describe the bug
SMP systems (with IPI support) that make use of pinned threads may fail to schedule the highest priority thread that is ready to run.
When the last CPU in an SMP system to service an IPI switches out its current thread in favour of a higher priority thread pinned to that CPU, the switched-out thread may not get scheduled anywhere until another scheduling point is reached.
To Reproduce
Imagine a system with 2 CPUs. In this system there are the following four threads to consider (a setup sketch follows the list).
T1 - executing on CPU0, not pinned, has priority 10
T2 - executing on CPU1, not pinned, has priority 15
T3 - pended, pinned to CPU0, has priority 5
T4 - pended, not pinned, has priority 12
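A hedged sketch of how such a thread set might be created with the CPU-mask API (CONFIG_SCHED_CPU_MASK); the stack sizes, entry points, and pend/unpend mechanics are placeholders, not the attached test case:

```c
#include <zephyr/kernel.h>

#define STACKSZ 1024

/* Placeholder entry points; the real bodies would pend/unpend as in the test. */
extern void t1_entry(void *p1, void *p2, void *p3);
extern void t2_entry(void *p1, void *p2, void *p3);
extern void t3_entry(void *p1, void *p2, void *p3);
extern void t4_entry(void *p1, void *p2, void *p3);

K_THREAD_STACK_ARRAY_DEFINE(stacks, 4, STACKSZ);
static struct k_thread t1, t2, t3, t4;

void setup_threads(void)
{
	/* T1 (prio 10) and T2 (prio 15): free to run on any CPU */
	k_thread_create(&t1, stacks[0], STACKSZ, t1_entry,
			NULL, NULL, NULL, 10, 0, K_NO_WAIT);
	k_thread_create(&t2, stacks[1], STACKSZ, t2_entry,
			NULL, NULL, NULL, 15, 0, K_NO_WAIT);

	/* T3 (prio 5): pinned to CPU0. The CPU mask may only be changed
	 * before a thread starts, hence K_FOREVER plus k_thread_start(). */
	k_thread_create(&t3, stacks[2], STACKSZ, t3_entry,
			NULL, NULL, NULL, 5, 0, K_FOREVER);
	k_thread_cpu_mask_clear(&t3);
	k_thread_cpu_mask_enable(&t3, 0);
	k_thread_start(&t3);

	/* T4 (prio 12): free to run on any CPU */
	k_thread_create(&t4, stacks[3], STACKSZ, t4_entry,
			NULL, NULL, NULL, 12, 0, K_NO_WAIT);
}
```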
Some event happens and CPU1 unpends both T3 and T4, making them runnable at the same time.
1. Since threads were made ready, flag_ipi() is called and the pending_ipi flag is set.
2. CPU1 hits its reschedule point to switch out T2 for T4 and issues a call to signal_pending_ipi(). This clears the pending_ipi flag and triggers the IPI.
3. CPU0 gets the IPI and hits a reschedule point. It is at this point that CPU0 switches out T1 for T3.
At this stage of events, we now have ...
Final Summary
T1 - priority 10, ready but not executing anywhere
T2 - priority 15, ready but not executing anywhere
T3 - priority 5, ready and executing on CPU0
T4 - priority 12, ready and executing on CPU1
Expected behavior
Based on the Final Summary shown above, since T1 is of a higher priority than T4 (10 vs 12), it is expected that T1 would be executing on CPU1 and not T4.
Impact
It is recognised that although T1 will eventually get scheduled when CPU1 reaches its next scheduling point, we cannot know when that will be. It might be a few hundred microseconds later, it could be seconds later, or it might never happen. Until it does, we essentially have a type of priority inversion on T1. In a system with a tickless kernel, this could be a very long time.
Fortunately, this inversion can only apply to systems that pin threads to execute on a subset of the CPUs in the system.
Additional context
I think that, in general, if the CPUs that are switching out threads in favour of pinned threads are not the first ones to do so, then we can encounter this issue.
Possible Solution
If a thread in an SMP system is being switched out for a pinned or CPU-restricted thread, then we could signal another IPI to the other non-restricted CPUs. Although this may lead to more interrupts and ISRs (which would have a performance impact), we would know that this type of priority inversion is kept to a minimum.
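A minimal sketch of what that could look like, reusing the flag/signal pair sketched earlier in the thread. The hook point, the mask test, and the Kconfig symbol are assumptions, and per the discussion above this is the sort of thing to gate behind a default-n option:

```c
#include <zephyr/kernel.h>
#include <zephyr/sys/util.h>

/* The one-shot flag/signal pair from the earlier sketch. */
extern void flag_ipi(void);
extern void signal_pending_ipi(void);

/* Hypothetical hook, called where a CPU swaps in its chosen thread. */
static inline void maybe_cascade_ipi(struct k_thread *incoming)
{
#ifdef CONFIG_SCHED_CPU_MASK
	uint32_t all_cpus = BIT_MASK(arch_num_cpus());

	if ((incoming->base.cpu_mask & all_cpus) != all_cpus) {
		/* A pinned/restricted thread was chosen: the thread we
		 * just displaced may now be the best choice for another
		 * CPU, but no IPI is pending, so raise one. */
		flag_ipi();
		signal_pending_ipi();
	}
#endif
}
```

Combined with the targeted-IPI idea discussed above, the extra interrupts would at least be confined to CPUs that could actually pick up the displaced thread.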