From 33ac6cb41d9be6ba5714b7814f9432197e86a8c6 Mon Sep 17 00:00:00 2001
From: Chris Fallin
Date: Tue, 13 Apr 2021 23:26:56 -0700
Subject: [PATCH] Heuristic improvement: reg-scan offset by inst location.

We currently use a heuristic that our scan for an available PReg starts
at an index into the register list that rotates with the bundle index.
This is a simple way to distribute contention across the whole register
file more evenly and avoid repeating less-likely-to-succeed reg-map
probes to lower-numbered registers for every bundle.

After some experimentation with different options (queue that
dynamically puts registers at end after allocating, various ways of
mixing/hashing indices, etc.), adding the *instruction offset* (of the
start of the first range in the bundle) as well gave the best results.
This is very simple and gives us a likely better-than-random conflict
avoidance because ranges tend to be local, so rotating through registers
as we scan down the list of instructions seems like a very natural
strategy.

On the tests used by our `cargo bench` benchmark, this reduces regfile
probes for the largest (459 instruction) benchmark from 1538 to 829,
i.e., approximately by half, and results in an 11% allocation speedup.
---
 README.md      | 20 +++++++-------------
 src/ion/mod.rs | 15 +++++++++++++--
 2 files changed, 20 insertions(+), 15 deletions(-)

diff --git a/README.md b/README.md
index e755c4c0..c187fe91 100644
--- a/README.md
+++ b/README.md
@@ -111,27 +111,21 @@ benches/0               time:   [365.68 us 367.36 us 369.04 us]
 ```
 
 I then measured three different fuzztest-SSA-generator test cases in
-this allocator, `regalloc2`, measuring between 1.05M and 2.3M
+this allocator, `regalloc2`, measuring between 1.1M and 2.3M
 instructions per second (closer to the former for larger functions):
 
 ```plain
 ==== 459 instructions
-benches/0               time:   [424.46 us 425.65 us 426.59 us]
-                        thrpt:  [1.0760 Melem/s 1.0784 Melem/s 1.0814 Melem/s]
+benches/0               time:   [377.91 us 378.09 us 378.27 us]
+                        thrpt:  [1.2134 Melem/s 1.2140 Melem/s 1.2146 Melem/s]
 
 ==== 225 instructions
-benches/1               time:   [213.05 us 213.28 us 213.54 us]
-                        thrpt:  [1.0537 Melem/s 1.0549 Melem/s 1.0561 Melem/s]
+benches/1               time:   [202.03 us 202.14 us 202.27 us]
+                        thrpt:  [1.1124 Melem/s 1.1131 Melem/s 1.1137 Melem/s]
 
-Found 1 outliers among 100 measurements (1.00%)
-  1 (1.00%) high mild
 ==== 21 instructions
-benches/2               time:   [9.0495 us 9.0571 us 9.0641 us]
-                        thrpt:  [2.3168 Melem/s 2.3186 Melem/s 2.3206 Melem/s]
-
-Found 4 outliers among 100 measurements (4.00%)
-  2 (2.00%) high mild
-  2 (2.00%) high severe
+benches/2               time:   [9.5605 us 9.5655 us 9.5702 us]
+                        thrpt:  [2.1943 Melem/s 2.1954 Melem/s 2.1965 Melem/s]
 ```
 
 Though not apples-to-apples (SSA vs. non-SSA, completely different
diff --git a/src/ion/mod.rs b/src/ion/mod.rs
index 78d42dca..303c31ae 100644
--- a/src/ion/mod.rs
+++ b/src/ion/mod.rs
@@ -2570,6 +2570,17 @@ impl<'a, F: Function> Env<'a, F> {
                     } else {
                         n_regs
                     };
+                    // Heuristic: start the scan for an available
+                    // register at an offset influenced both by our
+                    // location in the code and by the bundle we're
+                    // considering. This has the effect of spreading
+                    // demand more evenly across registers.
+                    let scan_offset = self.ranges[self.bundles[bundle.index()].first_range.index()]
+                        .range
+                        .from
+                        .inst
+                        .index()
+                        + bundle.index();
                     for i in 0..loop_count {
                         // The order in which we try registers is somewhat complex:
                         // - First, if there is a hint, we try that.
@@ -2587,7 +2598,7 @@ impl<'a, F: Function> Env<'a, F> {
                             (0, Some(hint_reg)) => hint_reg,
                             (i, Some(hint_reg)) => {
                                 let reg = self.env.regs_by_class[class as u8 as usize]
-                                    [(i - 1 + bundle.index()) % n_regs];
+                                    [(i - 1 + scan_offset) % n_regs];
                                 if reg == hint_reg {
                                     continue;
                                 }
@@ -2595,7 +2606,7 @@ impl<'a, F: Function> Env<'a, F> {
                             }
                             (i, None) => {
                                 self.env.regs_by_class[class as u8 as usize]
-                                    [(i + bundle.index()) % n_regs]
+                                    [(i + scan_offset) % n_regs]
                             }
                         };
 
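
Note for reviewers: the probe order this change produces can be sketched in isolation as below. This is a minimal illustration, not code from the patch. The function `probe_order` and its parameters are hypothetical, and registers are represented as plain positions in a per-class register list rather than as `PReg` values, which is a simplifying assumption.

```rust
/// Hypothetical helper (not part of the patch): returns the order in which
/// positions of a per-class register list would be probed.
///
/// - `n_regs`: number of registers in the class (assumed non-zero).
/// - `first_inst`: instruction index at the start of the bundle's first range.
/// - `bundle_index`: index of the bundle being allocated.
/// - `hint`: optional hinted register position, tried before the rotating scan.
fn probe_order(
    n_regs: usize,
    first_inst: usize,
    bundle_index: usize,
    hint: Option<usize>,
) -> Vec<usize> {
    // Same combination as in the patch: code location plus bundle index.
    let scan_offset = first_inst + bundle_index;

    let mut order = Vec::with_capacity(n_regs);

    // If there is a hint, it is always the first probe.
    if let Some(h) = hint {
        order.push(h);
    }

    // Then scan the whole register list, starting at the rotated offset
    // and wrapping around; skip the hint since it was already tried.
    for i in 0..n_regs {
        let reg = (i + scan_offset) % n_regs;
        if Some(reg) == hint {
            continue;
        }
        order.push(reg);
    }
    order
}
```

With four registers, `probe_order(4, 10, 3, None)` gives `[1, 2, 3, 0]`, while a bundle one instruction later, `probe_order(4, 11, 3, None)`, gives `[2, 3, 0, 1]`: neighbouring bundles start their scans at different registers. With a hint, `probe_order(4, 10, 3, Some(3))` gives `[3, 1, 2, 0]`.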