scx_layered: More optimal core allocation #1109

Draft

wants to merge 1 commit into main
Conversation

kkdwivedi
Contributor

This PR documents my initial attempt at more optimal layer core order generation, and invites others to provide ideas.

I am not going to keep fixing the current attempt (there are a few odd order generations with more LLCs), but the idea is to space layers out as much as possible to minimize overlap. For this we use a greedy approach: find the segment with the maximum run of unallocated LLCs, place the new layer there, then grow it in either direction until we run out of cores. (A sketch follows the examples below.)

For a machine with two LLCs covering cores 0-19 and 20-39:

Old:

layer: kkd algo: Sticky core order: [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39]
layer: kkd 2 algo: Sticky core order: [28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27]
layer: kkd 3 algo: Sticky core order: [32, 33, 34, 35, 36, 37, 38, 39, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31]

New:

layer: kkd algo: StickyTopo core order: [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39]
layer: kkd 2 algo: StickyTopo core order: [39, 38, 37, 36, 35, 34, 33, 32, 31, 30, 29, 28, 27, 26, 25, 24, 23, 22, 21, 20, 19, 18, 17, 16, 15, 14, 13, 12, 11, 10, 9, 8, 7, 6, 5, 4, 3, 2, 1, 0]
layer: kkd 3 algo: StickyTopo core order: [19, 18, 17, 16, 15, 14, 13, 12, 11, 10, 9, 8, 7, 6, 5, 4, 3, 2, 1, 0, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39]
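
For reference, a minimal sketch of the greedy segment search described above (illustrative and standalone; `longest_free_run` and the boolean allocation map are not the actual scx_layered data structures):

```rust
// Illustrative sketch: find the longest contiguous run of unallocated LLCs,
// which is where a new layer would be placed before growing outwards.
fn longest_free_run(allocated: &[bool]) -> Option<(usize, usize)> {
    let mut best: Option<(usize, usize)> = None; // (start, len)
    let (mut start, mut len) = (0, 0);
    for (i, &used) in allocated.iter().enumerate() {
        if used {
            len = 0;
            continue;
        }
        if len == 0 {
            start = i;
        }
        len += 1;
        if best.map_or(true, |(_, best_len)| len > best_len) {
            best = Some((start, len));
        }
    }
    best
}
```

A new layer starts inside that run and then grows outward, which is what produces the reversed and centered orders shown above.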

We can be more intelligent here, e.g. growing into our own NUMA domain's LLCs first before spreading outside it, but all of this is ultimately futile because this algorithm won't be globally optimal. It assumes all layers have equal size and load, which is not true. Using a "weight" to push layers left or right when allocating LLCs works, but it cannot adapt to layer sizes changing at runtime, which is the prevalent case. Thus, even with this algorithm there will eventually be overrun.

Instead, after talking to Tejun, I will update this to estimate layer sizes at runtime and regenerate the core order by picking free LLCs for each layer, starting with the largest layers (which thus take precedence), and attempt to pack the rest if necessary.
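
Roughly, something like the following (illustrative names and fallback, not the final implementation):

```rust
// Rough sketch: allocate free LLCs to layers in order of estimated size,
// largest first, then pack any leftover layers onto the last LLC.
fn regen_llc_assignment(est_sizes: &[usize], nr_llcs: usize) -> Vec<Vec<usize>> {
    debug_assert!(nr_llcs > 0);
    let mut order: Vec<usize> = (0..est_sizes.len()).collect();
    // Largest estimated size first, so bigger layers take precedence.
    order.sort_by(|&a, &b| est_sizes[b].cmp(&est_sizes[a]));

    let mut free: Vec<usize> = (0..nr_llcs).collect();
    let mut assignment = vec![Vec::new(); est_sizes.len()];
    for &layer in &order {
        let take = est_sizes[layer].min(free.len());
        assignment[layer] = free.drain(..take).collect();
        if assignment[layer].is_empty() {
            // Out of free LLCs: pack onto the last LLC as a fallback.
            assignment[layer].push(nr_llcs - 1);
        }
    }
    assignment
}
```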

@likewhatevs
Contributor

> invites others to provide ideas.

this is cool. it might be cool if this were P-core/E-core aware, but idk how that could work well without knowing whether the user prefers energy efficiency or performance (maybe via a flag, idk)?

Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
@kkdwivedi
Contributor Author

kkdwivedi commented Jan 14, 2025

Here's an updated approach; some implementation details may keep changing until I've tested this properly with a real workload. I will remove the draft status once I've tested with a real workload and verified the edge cases properly, and then it should be good to go.

Approach

The idea is to assign entire LLCs to layers at once, so allocation happens at LLC granularity.

There is a notion of heavy "sticky" layers and light "low" layers, based on utilization. Sticky layers can forcibly reclaim/reassign LLCs from low layers. LLCs used by sticky layers are also not visible in the "free" LLC pool for allocation.

For now I was hacking the code to mark layers sticky by matching on their names, but it should either be done automatically (above some threshold layer size) or through the config, by indicating the main workload vs. misc layers there.
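
An automatic variant could be as simple as the following (threshold and names purely illustrative):

```rust
// Illustrative automatic classification: a layer is "sticky" if it accounts
// for a large enough share of total utilization. The threshold is made up.
const STICKY_UTIL_FRAC: f64 = 0.25;

fn is_sticky(layer_util: f64, total_util: f64) -> bool {
    total_util > 0.0 && layer_util / total_util >= STICKY_UTIL_FRAC
}
```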

Compaction

Compaction is driven by layer utilization. Low/light layers are merged onto the same or fewer LLCs based on a utilization target (hardcoded to 20% for now, but this will become configurable).
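
One reading of the per-step math, as a sketch (my interpretation of the target, with illustrative names):

```rust
// Sketch: low layers are packed onto the minimum number of LLCs such that
// the combined utilization per LLC stays at or under the target.
const UTIL_TARGET: f64 = 0.20; // hardcoded for now, to become configurable

/// `low_layer_utils` holds each low layer's utilization as a fraction of
/// one LLC's capacity; returns how many LLCs the merged layers need.
fn llcs_for_low_layers(low_layer_utils: &[f64]) -> usize {
    let total: f64 = low_layer_utils.iter().sum();
    ((total / UTIL_TARGET).ceil() as usize).max(1)
}
```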

Hysteresis

Only if a layer stays above or below its utilization range for 2 consecutive intervals (arbitrarily chosen for now) of the step function do we grow or shrink it. This avoids flip-flopping the core reallocation algorithm on sudden spikes or boundary conditions. The utilization range is hardcoded for testing for now, but it can be made configurable.
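
The counter logic looks roughly like this (a sketch; struct and constant names are illustrative):

```rust
// Sketch of the hysteresis counter: only act after utilization stays out of
// range for HYST_INTERVALS consecutive steps.
const HYST_INTERVALS: u32 = 2;

#[derive(Default)]
struct Hysteresis {
    above: u32,
    below: u32,
}

enum Action { Grow, Shrink, Hold }

impl Hysteresis {
    fn step(&mut self, util: f64, lo: f64, hi: f64) -> Action {
        if util > hi {
            self.above += 1;
            self.below = 0;
        } else if util < lo {
            self.below += 1;
            self.above = 0;
        } else {
            // Back in range: a single in-range interval resets both counters.
            self.above = 0;
            self.below = 0;
        }
        if self.above >= HYST_INTERVALS {
            self.above = 0;
            Action::Grow
        } else if self.below >= HYST_INTERVALS {
            self.below = 0;
            Action::Shrink
        } else {
            Action::Hold
        }
    }
}
```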

TODOs:

  • Still debugging an issue where the LLC picking logic behaves incorrectly with multiple loaded layers. WIP.
  • More testing; so far only lightly tested with stress-ng on my own server with 8 LLCs.
  • Refactor the code into a new StickyDynamic algorithm if the approach looks sane. Currently, I directly modified the step() function for testing.
  • Split into smaller commits and write commit logs.
  • Remove hardcoded utilization constants and set them from the config and at runtime.

@kkdwivedi kkdwivedi changed the title scx_layered: More optimal layer core order generation scx_layered: More optimal core allocation Jan 14, 2025
for (i, layer) in self.layers.iter().enumerate() {
    // A layer's total utilization is its owned plus open consumption.
    let owned = self.sched_stats.layer_utils[i][LAYER_USAGE_OWNED];
    let open = self.sched_stats.layer_utils[i][LAYER_USAGE_OPEN];
    let total_util = owned + open;
Contributor
When determining the target number of CPUs, open consumption is considered iff the layer has no CPUs, because otherwise grouped layers end up overallocating. Also, I wonder whether it'd make more sense to determine the number of LLCs to allocate from the result of the target CPU calculations.

Contributor
Note that the allocation needs to be fair within each LLC too, for Intel and other CPUs with one or few LLCs.

    wants
}

fn weighted_target_llcs(&self, raw_wants: &[usize]) -> Vec<usize> {
Contributor
Ditto, wouldn't it make more sense to determine this according to the number of CPUs allocated to each layer?

let assigned_count = layer.llcs_assigned.len();
if layer.is_sticky && assigned_count > 0 {
    // remove from free_llcs
    for &llc_id in layer.llcs_assigned.keys() {
Contributor
These operations may be easier with HashSet or BTreeSet.
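
For example (field types illustrative), with sets the sticky-layer reclaim above becomes direct membership operations:

```rust
use std::collections::BTreeSet;

// Illustrative: with free/assigned LLC ids kept in sets, removing a sticky
// layer's LLCs from the free pool is a direct set operation.
fn remove_assigned(free_llcs: &mut BTreeSet<usize>, assigned: &BTreeSet<usize>) {
    for llc_id in assigned {
        free_llcs.remove(llc_id);
    }
}
```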
