Conversation
This caches whether there are any idle CPUs, which we used to test a lot more frequently. Now, the only remaining user is the WAKE_SYNC path in scx_pick_idle_cpu(), which is not high-frequency enough to warrant tracking and caching the test result. Remove it.
Add @flags to scx_bpf_pick_idle_cpu() and implement the SCX_PICK_IDLE_CPU_WHOLE flag, which makes the function try to pick a CPU whose SMT siblings are also idle. This will be used to improve idle CPU selection.
Use the newly added SCX_PICK_IDLE_CPU_WHOLE to make scx_select_cpu_dfl() prefer wholly idle CPUs over a prev_cpu that's only partially idle. While at it, add SMT conditionals to scx_pick_idle_cpu().
In simple and flatcg, p->scx.dsq_vtime wasn't initialized on load or when a task is moved across cgroups. As vtimes start at zero, the bug is less noticeable on the first load; however, on subsequent loads and after cgroup migrations, some tasks may end up with a vtime far into the future and stall for an extended period of time. Re-init dsq_vtime on load and cgroup migrations.
* Drop short options for two less-used options and always explicitly specify the char used for each short option.
* Use a more compact code form for looking up a map and checking its result.
* Make error messages more consistent.
* s/dq/dsq/g
* Other cosmetic & misc changes.
Drop unused LAST_TASK and re-group the rest.
task_set_dsq() is a bit of a misnomer and inconsistent with other functions including pick_task_domain(). The names are really confusing. Clean up:
* task_set_dsq() -> task_set_domain()
* pick_task_domain() -> task_pick_domain()
* task_set_domain() -> task_pick_and_set_domain()
To prepare for future changes. No functional changes.
The BPF scheduler may depend on select_task_rq() being invoked during wakeups, and @p may end up executing on a different CPU regardless of what happens in the wakeup path, making the ttwu_queue optimization ineffective. Skip it if on SCX. Combined with an atropos bug, this caused significant execution stalls in low to moderate load conditions.
_padding's size is off by 4 bytes. This was okay in practice because struct alignment would still put it at the same size, but let's still fix it.
…reign CPU When not dispatching directly, ->select_cpu() would always put the waking task on one of the domestic CPUs, which guarantees that a CPU will soon try to dispatch from the DSQ. However, a task can be enqueued outside the enqueue path and thus without going through ->select_cpu(). If the task was on a foreign CPU before, this could lead to the task being queued on its domain's DSQ while none of its CPUs are woken up, leading to execution bubbles. Fix it by explicitly kicking a domestic CPU if the task being enqueued to a domain DSQ is on a foreign CPU.
p->scx.dsq_vtime wasn't being initialized on scheduler load, which could put some tasks far into the future, leading to long stalls. Fix it.
To avoid stepping over the L3-aware behavior of the new workqueue code, which sets permissive cpus_allowed and uses ->wake_cpu to steer kworkers.
W/ fio on dm-crypt + the L3-aware workqueue benchmark, it was observed that LOAD_IMBAL_REDUCTION_MIN_RATIO can often leave significant load imbalance in place for a very long period because it rejects a task transfer if the load imbalance reduction from that particular transfer is less than 10%. If a large load imbalance is caused by a combination of many smaller loads, they'll never get balanced. However, if we remove LOAD_IMBAL_REDUCTION_MIN_RATIO, the load balancer becomes prone to oscillations because it tries to balance fully on each round.

Instead, add LOAD_IMBAL_XFER_TARGET_RATIO, which dictates the target transfer amount for each push/pull pair, and set it to 50%. ie. the load balancer now searches for transfer candidates targeting a 50% load imbalance reduction. However, the final transfer candidate selection is still based on the eventual load imbalance reduction so that we eventually converge on balanced load. A sketch of this selection logic follows below.

After the above changes, we always converge to the balanced state (as defined by LOAD_IMBAL_HIGH_RATIO), but gradually. To ensure work conservation, let's set the default --greedy-threshold to 1. This does have negative performance implications, but that will be addressed in future patches.
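A minimal sketch of that selection logic, assuming hypothetical names (pick_xfer_candidate, candidates as plain loads) rather than the actual atropos code:

```rust
/// Target 50% of the imbalance per push/pull round, per the commit above.
const LOAD_IMBAL_XFER_TARGET_RATIO: f64 = 0.5;

/// Given the load imbalance between a push/pull domain pair and the loads
/// of transferable tasks, pick the candidate closest to the per-round
/// target, so repeated rounds converge instead of oscillating.
fn pick_xfer_candidate(imbal: f64, candidate_loads: &[f64]) -> Option<f64> {
    let target = imbal * LOAD_IMBAL_XFER_TARGET_RATIO;
    candidate_loads
        .iter()
        .copied()
        // A transfer larger than the remaining imbalance would overshoot
        // and flip the imbalance direction; reject it, since selection
        // still keys off the eventual imbalance reduction.
        .filter(|&load| load > 0.0 && load <= imbal)
        .min_by(|a, b| {
            let da = (a - target).abs();
            let db = (b - target).abs();
            da.partial_cmp(&db).unwrap()
        })
}
```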
Neither procfs nor fb_procfs can determine per-CPU utilization reliably with CPU hot[un]plugs. Roll our own. eminence/procfs#274 facebookincubator/below#8190
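A minimal sketch of the roll-our-own approach, assuming the standard /proc/stat layout; this is illustrative, not the actual atropos code:

```rust
use std::fs;

/// Read (busy, total) scheduler ticks for each CPU that currently shows
/// up in /proc/stat. CPUs that are offline or not present simply don't
/// appear, so hot[un]plug falls out naturally.
fn per_cpu_ticks() -> std::io::Result<Vec<(String, u64, u64)>> {
    let stat = fs::read_to_string("/proc/stat")?;
    let mut out = Vec::new();
    for line in stat.lines() {
        // Keep only the per-CPU "cpuN ..." lines; the aggregate line is
        // "cpu " with a trailing space.
        if !line.starts_with("cpu") || line.starts_with("cpu ") {
            continue;
        }
        let mut fields = line.split_whitespace();
        let name = fields.next().unwrap().to_string();
        let ticks: Vec<u64> = fields.filter_map(|t| t.parse().ok()).collect();
        let total: u64 = ticks.iter().sum();
        // Fields 3 and 4 (0-indexed) are idle and iowait.
        let idle = ticks.get(3).copied().unwrap_or(0)
            + ticks.get(4).copied().unwrap_or(0);
        out.push((name, total - idle, total));
    }
    Ok(out)
}
```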
To enable adding a mechanism which runs at a different frequency than the load balancer:
* Refactor the scheduling loop into Scheduler::run().
* Refactor topology handling into struct Topology.
Target CPU selection in select_cpu() is more challenging for atropos as there are tradeoffs between keeping a task local and trying to exploit a less occupied CPU in a foreign domain. atropos didn't consider CPU utilization when making target CPU selections, which limited its effectiveness. This patch improves the situation by:

* Implementing a new high-level userland subsystem called Tuner which runs more frequently than LoadBalancer (100ms by default) and updates the parameters in struct tune_input to modify the BPF scheduler's behavior. Currently, Tuner only considers CPU utilization and tells the BPF scheduler whether a given CPU is eligible for the new DIRECT_GREEDY and KICK_GREEDY optimizations (see the sketch after this list).
* Updating select_cpu() to use the new SCX_PICK_IDLE_CPU_WHOLE for domestic CPUs, preferring wholly idle CPUs over a partially idle @prev_cpu.
* If @prev_cpu was in a foreign domain which has enough unused CPU cycles and @prev_cpu is wholly idle (all its siblings are idle too), keeping @prev_cpu by putting the task directly into @prev_cpu's DSQ, which is counted as DIRECT_GREEDY. Keeping @prev_cpu only when it's wholly idle removes the negative performance impact of this mechanism observed in the fio on kcryptd benchmark.
* Implementing a similar pattern of preferring wholly idle CPUs and gating foreign DIRECT_GREEDY according to the domain's CPU utilization down the CPU preference ladder.
* If no direct dispatch path could be taken and there are any idle CPUs in the system, domestic or foreign, kicking one after enqueueing to the domain. This doesn't seem to have a negative performance impact while improving work conservation in some cases.

With these optimizations, atropos matches or outperforms CFS at various load levels when running fio on top of a dm-crypt device.
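A sketch of the Tuner's eligibility decision described in the first bullet; the threshold values and names here are assumptions for illustration, not the actual implementation:

```rust
/// Hypothetical utilization cutoffs; the real values live in the Tuner's
/// configuration, not here.
const DIRECT_GREEDY_UTIL_MAX: f64 = 0.75;
const KICK_GREEDY_UTIL_MAX: f64 = 0.90;

/// Runs every tuning interval (100ms by default per the commit message)
/// and marks each domain eligible or not for the greedy paths, to be
/// written into struct tune_input for the BPF side.
fn tune(dom_util: &[f64], direct_greedy: &mut [bool], kick_greedy: &mut [bool]) {
    for (dom, &util) in dom_util.iter().enumerate() {
        // Only lightly loaded domains may keep foreign tasks directly
        // (DIRECT_GREEDY); somewhat busier ones may still be kicked.
        direct_greedy[dom] = util < DIRECT_GREEDY_UTIL_MAX;
        kick_greedy[dom] = util < KICK_GREEDY_UTIL_MAX;
    }
}
```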
If a CPU's cache information can't be determined, atropos falls back to DOM0 after generating a warning. This can be jarring and confusing on many modern AMD processors which have some cores disabled and report them as possible but not present. Update the rust side so that it uses Option::None in Topology::cpu_dom to clearly distinguish the CPUs that are on DOM0 because cache information couldn't be determined. From the BPF side, this doesn't make any difference; however, the userland side can now differentiate those CPUs and handle them appropriately (e.g. not setting them in direct/kick_greedy_cpumask). A sketch of the intended shape follows below.

As userland can now detect CPUs that have come up after atropos is initialized and thus don't have a proper domain assigned, it should now be possible for atropos to reload itself to reflect the new hardware configuration. This is left for the future for now.
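A sketch of that shape on the rust side; Topology::cpu_dom comes from the description above, while build_greedy_masks and the per-CPU bool vector are hypothetical illustration:

```rust
struct Topology {
    /// None means the CPU is reported possible but its cache (and thus
    /// its domain) couldn't be determined; BPF still treats it as DOM0.
    cpu_dom: Vec<Option<usize>>,
}

fn build_greedy_masks(topo: &Topology, direct_greedy: &mut [bool]) {
    for (cpu, dom) in topo.cpu_dom.iter().enumerate() {
        match dom {
            Some(_dom_id) => direct_greedy[cpu] = true,
            // No cache info: keep the CPU out of direct/kick_greedy so
            // we never deliberately steer work to it.
            None => direct_greedy[cpu] = false,
        }
    }
}
```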
Awesome job with all the debugging. Overall the changes look great, just left some comments on a first read-through. Haven't looked at main.rs yet
Just 1 very minor nit for main.rs. Honestly kinda skimmed over it, but it looks reasonable
…ctly Suggested by David Vernet. This also allows removing explicit #ifdefs.
Whenever a CPU's idle state is cleared, the matching SMT mask should be updated accordingly; however, scx_pick_idle_cpu() open-coded this logic and other users of test_and_clear_idle_cpu() weren't updating idle_masks.smt correctly. Fix it by moving SMT mask clearing from scx_pick_idle_cpu() into test_and_clear_idle_cpu(). This issue was noticed by David Vernet.
On some error exit paths, atropos_select_cpu() was releasing idle_smtmask before it was acquired. For some reason, the BPF verifier didn't notice this. It didn't actually break anything because the release operation is a no-op in this specific case. Fix it by acquiring idle_smtmask at the beginning of the function and always releasing it before returning. While at it, make error return values more consistent.
scx_bpf_pick_idle_cpu() returning -EBUSY doesn't mean that the caller can assume that there will be future dispatch events because idle tracking may race with actual scheduler state transitions. The only way to guarantee timely future dispatch is kicking one of the target CPUs. Document it and add scx_bpf_pick_any_cpu() which can be used to pick a CPU, idle or not. atropos is updated to use the new function instead of using cpumask_any() directly on the allowed CPUs.
atropos_enqueue() currently kicks the previous domain if the task has dispatch_local set, with the comment explaining that we need to do so because we own the idle state of the task. This doesn't seem to make sense because if select_cpu() picked prev_cpu and set dispatch_local, the CPU is already woken up and enqueue() would be running on that CPU. Whether the task is dispatched locally or not doesn't cause the CPU to stall. It will go check all the queues after this and reassert idle. As there were other stall issues, it could just be that I was confused. Preliminary testing doesn't show any problems after removing it. However, on the new domain side, it could miss waking up a CPU when no idle bits are set, which is racy and has some chance of causing a temporary stall. Fix it using scx_bpf_pick_any_cpu().
std::thread::sleep(
    next_sched_at
        .min(next_tune_at)
        .duration_since(std::time::SystemTime::now())?,
@htejun if I run rcutorture with atropos, the scheduler gets kicked out after a minute or so with "Error: second time provided was later than self" from, presumably, here. It looks like SystemTime.duration_since() isn't really safe to use according to https://doc.rust-lang.org/std/time/struct.SystemTime.html#method.duration_since
This function may fail because measurements taken earlier are not guaranteed to always be before later measurements (due to anomalies such as the system clock being adjusted either forwards or backwards). Instant can be used to measure elapsed time without this risk of failure.
Should we rework this to use Instant and Duration? Or to make it resilient to clock skew?
I think use of Instant seems more reasonable here, as the code doesn't actually require "full" wall-clock time to check whether an interval has elapsed.
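A minimal sketch of that rework using Instant's saturating_duration_since(); the loop structure and names (run_loop, sched_interval, tune_interval) are illustrative, not the actual main.rs code:

```rust
use std::time::{Duration, Instant};

fn run_loop(sched_interval: Duration, tune_interval: Duration) {
    let now = Instant::now();
    let mut next_sched_at = now + sched_interval;
    let mut next_tune_at = now + tune_interval;

    loop {
        // Instant is monotonic, and saturating_duration_since() returns
        // zero instead of erroring if the deadline has already passed,
        // so there's no "second time provided was later than self".
        let next = next_sched_at.min(next_tune_at);
        std::thread::sleep(next.saturating_duration_since(Instant::now()));

        let now = Instant::now();
        if now >= next_sched_at {
            // ...run the load balancer...
            next_sched_at += sched_interval;
        }
        if now >= next_tune_at {
            // ...run the tuner...
            next_tune_at += tune_interval;
        }
        // Real code would loop until shutdown is requested.
    }
}
```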
@vnepogodin yep, that was merged in 52db350
Haven't finished looking through main.rs - but went through the bpf logic and it makes sense - few comments inline
* If offline, @cpu is not its own sibling and
* scx_pick_idle_cpu() can get caught in an infinite loop as
* @cpu is never cleared from idle_masks.smt. Ensure that @cpu
* is eventually cleared.
Can you help me understand this change a bit more? What does this test and clear have to do with smt siblings?
So, there are two levels of idle masks, cpumask and smtmask. cpumask is the real one and tracks the idle CPUs. smtmask is the hint mask that tracks whether the whole core is idle. Let's say the topology is [0, 1], [2, 3] where the bracketed pairs are SMT siblings.
- If all are idle, both cpumask and smtmask would be 1111.
- If then CPU0 becomes busy, cpumask=0111 smtmask=0011
- If then CPU1 becomes busy, cpumask=0011 smtmask=0011
- If then CPU0 becomes idle, cpumask=1011 smtmask=0011
- If then CPU1 becomes idle, cpumask=1111 smtmask=1111
ie. the bits in smtmask get cleared for the whole physical core when any of its logical threads is busy. Now, this is implemented by clearing what the kernel thinks the SMT mask is, topology_sibling_cpumask(), whenever a CPU becomes busy, and asserting the same mask on the idle transition if all siblings are idle.
Now, select_cpu does something like the following:

retry:
    if (there's a whole core idle) {
        clear the sibling group in the smtmask;
        try test_and_set that CPU in cpumask; if it fails, goto retry;
    }
    if (there's an idle cpu)
        try test_and_set that CPU in cpumask; if it fails, goto retry;
The problem is that topology_sibling_cpumask() of a CPU becomes empty when the CPU is offline. So, the "clearing the sibling group" part doesn't do anything, and the test_and_set part may fail because smtmask can go out of sync with cpumask, which puts us in an infinite loop. So, we clear cpumask_of(cpu) instead in that case to guarantee forward progress.
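As a toy illustration (not the kernel code), the transition rules above can be modeled with two small bitmasks for the [0, 1], [2, 3] topology; note this toy doesn't model the offline case where topology_sibling_cpumask() goes empty:

```rust
// Stand-in for topology_sibling_cpumask(): both threads of @cpu's core.
fn sibling_mask(cpu: usize) -> u8 {
    0b11u8 << (cpu & !1)
}

// Busy transition: clear the CPU from cpumask and drop the whole core's
// "fully idle" hint from smtmask.
fn set_busy(cpu: usize, cpumask: &mut u8, smtmask: &mut u8) {
    *cpumask &= !(1u8 << cpu);
    *smtmask &= !sibling_mask(cpu);
}

// Idle transition: set the CPU in cpumask and, only if every sibling is
// idle too, reassert the core in smtmask.
fn set_idle(cpu: usize, cpumask: &mut u8, smtmask: &mut u8) {
    *cpumask |= 1u8 << cpu;
    if *cpumask & sibling_mask(cpu) == sibling_mask(cpu) {
        *smtmask |= sibling_mask(cpu);
    }
}

fn main() {
    let (mut cpumask, mut smtmask) = (0b1111u8, 0b1111u8);
    set_busy(0, &mut cpumask, &mut smtmask); // core {0,1} hint dropped
    set_busy(1, &mut cpumask, &mut smtmask);
    set_idle(0, &mut cpumask, &mut smtmask); // core still not fully idle
    set_idle(1, &mut cpumask, &mut smtmask); // both idle: hint reasserted
    assert_eq!((cpumask, smtmask), (0b1111, 0b1111));
}
```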
* @flags: %SCX_PICK_IDLE_CPU_* flags
*
* Pick and claim an idle cpu in @cpus_allowed. If none is available, pick any
* CPU in @cpus_allowed. Guaranteed to suceed and returns the picked idle cpu
nit: suceed => succeed
} tune_input;

__u64 tune_params_gen;
private(A) struct bpf_cpumask __kptr *direct_greedy_cpumask;
I'm not familiar with this private(A) syntax - what does it do?
It hides it from the automatic global mmapped access from userland. Otherwise, bpf gets unhappy with putting kptrs (allocated cpumask) there because userspace can clobber the value.
stat_add(ATROPOS_STAT_PREV_IDLE, 1);
cpu = prev_cpu;
goto local;
}
Just confirming - you say that removing this prev-idle check improved perf?
So, yeah, that was my initial observation. For whatever reason, removing that noticeably improved performance. However, what the new code does now is keep the foreign prev_cpu if the whole SMT pair is idle, and that actually gains performance.
// with CPU hot[un]plugs. Roll our own.
//
// https://github.com/eminence/procfs/issues/274
// https://github.com/facebookincubator/below/issues/8190
@brianc118 fixed this in below this week. I'll move us back onto fb_procfs unless you have any concerns.
Yeah, that's fine but let's do that next week after sending out the patchset.
dom_cpus[dom_id].set(cpu, true);
cpu_dom.push(Some(dom_id));
}
None => {
In what case can you lack a cpu -> cache mapping?
So, on my zen2 test machine:
root@rbw /s/d/s/cpu# cat /proc/cpuinfo | grep name | head -1
model name : AMD Ryzen 9 3900X 12-Core Processor
root@rbw ~# cat /sys/devices/system/cpu/possible
0-31
root@rbw ~# cat /sys/devices/system/cpu/present
0-23
root@rbw ~# cat /sys/devices/system/cpu/online
0-23
root@rbw ~# cd /sys/devices/system/cpu
root@rbw /s/d/s/cpu# cat possible present online
0-31
0-23
0-23
root@rbw /s/d/s/cpu# ls -d cpu*/cache | wc -l
24
The CCXs have some CPUs disabled. They can't be brought online but the firmware still reports them as possible but not present as if they can become present. So, we end up with possible CPUs without cache information. The same thing happens with some CCXs on belgamos IIRC. No idea why firmware is reporting it like that. It's silly because we end up allocating for the impossible possible CPUs and iterating over them too.
But I mean, in general, if hotplug is possible, the cache information may not be known until the CPU is plugged in. Even CPUs which are plugged in but brought down by echoing 0 to online don't report cache information. So, we need to be able to handle them. Eventually, I think what we should do is just reload the scheduler when hotplugged CPUs are detected.
cat /sys/devices/system/cpu/possible
Seems to be the same issue as the one I reported to you with my 5900X.
@htejun
I remember that this "issue" was not present in 6.2 kernels, but it was in 6.1 and 6.3/6.4.
But it's good that there is a solution for atropos! I'll give it another round of testing next time. Thanks for working on it!
Fixes for a couple tricky bugs and an improvement to select_cpu_dfl().
Atropos saw a lot of updates:
With these improvements, on an AMD zen2 CPU, atropos reliably outperforms CFS when running fio randrw workloads on a dm-crypt SSD across different concurrency levels.