
Fastalloc1 #181

Merged: 103 commits merged into bytecodealliance:main on Sep 23, 2024
Conversation

@d-sonuga (Contributor) commented Aug 5, 2024

This is the initial implementation of the fast register allocator in src/fastalloc.
It's still a work in progress and I haven't done any optimization on it yet; it still uses less-than-optimal data structures in places.

During initial development, I relied solely on the fuzzer to surface cases to think about and to determine correctness, and after running it for a few days without any errors I thought I was done getting the algorithm correct. But after trying to measure its performance with Sightglass and watching it crash several times, I rethought and reworked the code. I've now run it through the fuzzer for a few hours, but it still crashes on some of the benchmarks, and that's what I'm working on resolving now.

I'm opening this draft PR now to get some feedback on what I've done so far and on the algorithm itself.

A summary of performance:

Comparing the Rust compiler benchmarks with fastalloc vs. Ion, we have a speedup of 0.3%-18%.
Comparing the Sightglass benchmarks: two had no significant difference in performance; for the rest, compiling with fastalloc gives a speedup of about 1.07-5.26x.
Profiling Wasmtime compilation on inputs from the Sightglass benchmarks shows an increase in regalloc speed of about 6x.

@cfallin (Member) left a comment

Some initial comments from a pass over most of the "peripheral" code -- everything but the core in fastalloc/mod.rs. I'll continue my detailed pass over that and have some more thoughts later.

Overall impression: a very good start! I'm happy to see some nice abstractions built up, and the overall shape of the code looks like what I would expect. We should be pretty close to being able to land the initial version, once the notes here and whatever Amanieu sees are addressed, and the correctness issue you're chasing now is fixed.

Thanks again for all the work on this!

src/lib.rs Outdated
@@ -1558,4 +1563,7 @@ pub struct RegallocOptions {

/// Run the SSA validator before allocating registers.
pub validate_ssa: bool,

/// Run the SSRA algorithm
pub use_fastalloc: bool,
Member:

Rather than a bool, can we define an enum Algorithm { Ion, Fastalloc } and take an option pub algorithm: Algorithm? This leaves room for adding others in the future and makes the intent more clear.

(We can keep the Default derivation working by adding #[default] to the Ion arm of the enum)
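
A minimal sketch of the suggested shape (the extra derives here are assumptions, not the final API):

#[derive(Clone, Copy, Debug, Default, PartialEq, Eq)]
pub enum Algorithm {
    /// The existing backtracking allocator.
    #[default]
    Ion,
    /// The new single-pass allocator.
    Fastalloc,
}

// ...and in RegallocOptions, replacing `use_fastalloc`:
// pub algorithm: Algorithm,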


trace!("Final edits: {:?}", env.edits);
trace!("safepoint_slots: {:?}", env.safepoint_slots);
trace!("\n\n\n\n\n\n\n");
Member:

minor nit, but cargo fmt should indent these statements properly

struct Z(usize);
impl std::fmt::Debug for Z {
    fn fmt(&self, f: &mut core::fmt::Formatter<'_>) -> core::fmt::Result {
        write!(f, "v{}", self.0)
Member:

I think the VReg should have an equivalent Display implementation, so instead we could push VReg::new(i, RegClass::Int) below (class isn't printed by Display)

Reuse,
}

impl PartialEq<OperandConstraint> for OperandConstraintKind {
Member:

Rather than this impl, can we implement From<OperandConstraint> for OperandConstraintKind, then one could do constraint.kind() == OperandConstraintKind::Reg? This is slightly more general and cleaner in my experience...
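
A minimal sketch of the suggested conversion; the exact variant sets of OperandConstraint and OperandConstraintKind here are assumptions made for illustration:

impl From<OperandConstraint> for OperandConstraintKind {
    fn from(constraint: OperandConstraint) -> Self {
        // Drop any payload, keeping only the kind of constraint.
        match constraint {
            OperandConstraint::Any => Self::Any,
            OperandConstraint::Reg => Self::Reg,
            OperandConstraint::Stack => Self::Stack,
            OperandConstraint::FixedReg(_) => Self::FixedReg,
            OperandConstraint::Reuse(_) => Self::Reuse,
        }
    }
}

// Hypothetical helper so call sites can read:
//     constraint.kind() == OperandConstraintKind::Reg
impl OperandConstraint {
    fn kind(self) -> OperandConstraintKind {
        self.into()
    }
}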

}

#[derive(Clone, Copy, PartialEq)]
struct SearchConstraint {
Member:

This is a neat abstraction and is certainly providing some benefit by generalizing all the cases below. However I worry a little bit about efficiency here: we're explicitly building a data structure in memory to define what kind of operands we're matching, and then dispatching based on this (iterating, in some cases). Data-driven control flow that is "constant in practice" can sometimes be optimized but often not. It may be simpler and faster to do something like:

impl<'a> Operands<'a> {
  fn matching<F: Fn(Operand) -> bool + 'a>(&self, predicate: F) -> impl Iterator<Item = Operand> + 'a {
    self.iter().filter(move |op| predicate(*op))
  }
}

then

impl<'a> Operands<'a> {
  fn non_fixed_non_late(&self) -> impl Iterator<Item = Operand> + 'a {
    self.matching(|op| op.pos() == ... && op.constraint().kind() != ...)
  }
}

This offers a little more flexibility (vs. e.g. the need to have two slots for "must not have") and should inline nicely into a filtered for-loop.

#[derive(Clone, Copy, Debug)]
pub struct LruNode {
    /// The previous physical register in the list.
    pub prev: usize,
Member:

These can almost certainly be u16 and we'll save a little memory (so cache locality) by doing so...

Contributor Author:

Is there any reason why it can't be a u8?

Member:

I had been thinking about the total number of possible PRegs and for some reason had thought we could have more than 256 of them across all classes; but (i) we actually can't, since the entire PReg is packed into 8 bits, and (ii) your LRU data structure is per-class anyway. So given both those reasons it can definitely be a u8!
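
A small sketch of the resulting node layout (assuming the `next` link shrinks along with `prev`):

#[derive(Clone, Copy, Debug)]
pub struct LruNode {
    /// The previous physical register in the list, stored as a hw_enc index (fits in a u8).
    pub prev: u8,
    /// The next physical register in the list, stored as a hw_enc index (fits in a u8).
    pub next: u8,
}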

}
self.head = i;
trace!("Poked {:?} in {:?} LRU", preg, self.regclass);
self.check_for_cycle();
Member:

This could probably be annotated with #[cfg(debug_assertions)] -- that will enable this call when we have assertions compiled in, e.g. for fuzzing or in a general debug build, but avoid it in release builds where performance matters. I suspect that this is probably a reasonably significant slowdown right now if it's running in all your performance tests!

(likewise below too)
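
A sketch of the gating being suggested at the call site (the walk inside check_for_cycle could also be compiled out the same way):

self.head = i;
trace!("Poked {:?} in {:?} LRU", preg, self.regclass);
// Only run the consistency check when debug assertions are compiled in
// (debug builds, fuzzing); skip it entirely in release builds.
#[cfg(debug_assertions)]
self.check_for_cycle();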


impl Allocs {
    fn new<F: Function>(func: &F, env: &MachineEnv) -> Self {
        // The number of operands is <= number of virtual registers
Member:

I don't think this is actually true -- we could have more operands than virtual registers because of uses (imagine we have a function with a single vreg v1, and an instruction that uses v1 a million times). It shouldn't really matter here since this is just a pre-allocation heuristic, but it may be better to do something like func.num_insts() * 3 as a first guess.
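
A one-line sketch of that first guess (the factor of 3 is the heuristic suggested above, not a measured constant; the element type is assumed to be Allocation):

// Pre-allocation heuristic only; the Vec still grows if the guess is low.
let capacity_guess = func.num_insts() * 3;
let allocs: Vec<Allocation> = Vec::with_capacity(capacity_guess);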

Contributor Author:

Ah, yes. This is an outdated comment.

/// The virtual registers that are currently live.
live_vregs: HashSet<VReg>,
/// Allocatable free physical registers for classes Int, Float, and Vector, respectively.
freepregs: PartedByRegClass<BTreeSet<PReg>>,
Member:

As Amanieu mentioned earlier, a PRegSet should be much more efficient here (and I think it will allow us to replace some iteration below with intersection/union operations).
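
A hedged sketch of what the switch could look like, using only PRegSet's basic operations (building the per-class free set from MachineEnv; whether intersection helpers apply depends on the surrounding code):

use crate::{MachineEnv, PRegSet, RegClass};

// Build the initial free-register set for one class as a bit set
// instead of a BTreeSet<PReg>.
fn initial_free_pregs(env: &MachineEnv, class: RegClass) -> PRegSet {
    let mut set = PRegSet::empty();
    for &preg in env.preferred_regs_by_class[class as usize]
        .iter()
        .chain(env.non_preferred_regs_by_class[class as usize].iter())
    {
        set.add(preg);
    }
    set
}

// Allocating a register becomes set.remove(preg), freeing it becomes set.add(preg),
// and merging two sets is a single bitwise union via set.union_from(other).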

/// When allocation is completed, this contains all the safepoint instructions
/// in the function.
/// This is used to build the stackmap after allocation is complete.
safepoint_insts: Vec<(Block, Inst)>,
Member:

Also as we noted in the call -- we can remove safepoint-related logic as we're likely to switch over to a non-regalloc-provided safepoint implementation in Cranelift soon. This should simplify things quite a bit!

@cfallin (Member) left a comment

I made another pass over this, and reviewed src/fastalloc/mod.rs (the core algorithms) for plausibility and code style -- the open threads with @Amanieu still need to be resolved, but once they are, I'm happy to see this merge!


pub fn insert(&mut self, vreg: VReg) {
    // Intentionally assuming that the set doesn't already
    // contain `vreg`.
Member:

Can we assert this? It looks like we always use self.items[vreg.vreg()] as the node for this entry -- can we check if its vreg field is VReg::invalid(), for example, and set it as such when we remove?
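
A sketch of the suggested check, relying on the `vreg` field shown in the hunk below and on `remove` resetting it to VReg::invalid():

pub fn insert(&mut self, vreg: VReg) {
    // The slot must currently be unoccupied; `remove` is expected to have
    // reset its `vreg` field to the invalid sentinel.
    debug_assert_eq!(self.items[vreg.vreg()].vreg, VReg::invalid());
    self.items[vreg.vreg()].vreg = vreg;
    // ...link the node into the list as before...
}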

next: VRegIndex::new(num_vregs),
vreg: VReg::invalid()
};
num_vregs + 1
Member:

it looks like we're using index 0 as a sentinel node (which is a great idea); do we need to add one below when we index into it with vreg numbers?

@d-sonuga (Contributor Author) commented Sep 14, 2024:

It's the last index that's used as a sentinel, so there isn't any need to add one.
VReg indexes range from 0 to n - 1; the last position, n, is used as the head.

regs[i.checked_sub(1).unwrap_or(no_of_regs - 1)],
regs[if i >= no_of_regs - 1 { 0 } else { i + 1 }],
);
data[reg.hw_enc()].prev = prev_reg.hw_enc() as u8;
Member:

Indeed, the issue is that this will result in sparse usage of data if we use the bitpacked index -- not the worst problem but something it'd be reasonable to avoid if we can. I'm personally fine with hw_enc here.

// Some edits are added due to clobbers, not operands.
// Anyways, I think this may be a reasonable guess.
let inst_edits_len_guess = max_operand_len as usize * 2;
let total_edits_len_guess = inst_edits_len_guess * num_insts;
Member:

it might be good to collect some data manually: record total_edits_len_guess in this struct, and when done with allocation, print that guess, as well as edits.len(). This worst-case-size logic seems reasonable, but it also assumes all instructions will have max_operand_len (which may be large due to e.g. a call with many arguments) so I wonder if it may actually be much larger than needed. If we find that's the case, we could make a capacity estimate based on (say) num_insts * K where K is 2, or 4, or whatever factor we see.
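
A sketch of the kind of instrumentation meant here; the `stats` field is hypothetical and would only be kept while gathering data:

// Hypothetical: remember the guess so it can be compared with reality later.
self.stats.total_edits_len_guess = total_edits_len_guess;

// ...and once allocation is complete:
trace!(
    "total_edits_len_guess: {}, actual edits: {}",
    self.stats.total_edits_len_guess,
    self.edits.len()
);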

return self.fixed_stack_slots.contains(alloc.as_reg().unwrap());
}
false
}
Member:

We can write this as alloc.is_stack() || (alloc.is_reg() && self.fixed_stack_slots.contains(...)) -- possibly a little more concise?
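
Spelled out as a full body (the method name and signature are assumed from context):

fn is_stack(&self, alloc: Allocation) -> bool {
    alloc.is_stack()
        || (alloc.is_reg() && self.fixed_stack_slots.contains(alloc.as_reg().unwrap()))
}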

}
}

fn evict_vreg_in_preg_before_inst(&mut self, inst: Inst, preg: PReg) {
Member:

It looks like this function and evict_vreg_in_preg below are the same except for the InstPosition on the move; is that right? Can we pass it in as a parameter and deduplicate the two functions?
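
A sketch of the suggested deduplication; which InstPosition each existing wrapper passes is an assumption here:

fn evict_vreg_in_preg_at(&mut self, inst: Inst, preg: PReg, pos: InstPosition) {
    // ...the shared eviction logic, emitting the spill move at `pos`...
}

fn evict_vreg_in_preg_before_inst(&mut self, inst: Inst, preg: PReg) {
    self.evict_vreg_in_preg_at(inst, preg, InstPosition::Before)
}

fn evict_vreg_in_preg(&mut self, inst: Inst, preg: PReg) {
    self.evict_vreg_in_preg_at(inst, preg, InstPosition::After)
}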

@cfallin (Member) left a comment

Thanks a bunch for this new allocator implementation! Now that the initial version is complete and reviewed, let's merge it, and we can iterate on it and provide some performance summaries as followups.

cfallin merged commit e684ee5 into bytecodealliance:main on Sep 23, 2024
6 checks passed
@d-sonuga (Contributor Author) commented Sep 30, 2024

A summary of performance:

Comparing the Rust compiler benchmarks with fastalloc vs. Ion, we have a speedup of 0.3%-18%.
Comparing the Sightglass benchmarks: two had no significant difference in performance; for the rest, compiling with fastalloc gives a speedup of about 1.07-5.26x.
Profiling Wasmtime compilation on inputs from the Sightglass benchmarks shows an increase in regalloc speed of about 6x.

cfallin added a commit to cfallin/regalloc2 that referenced this pull request Nov 15, 2024
This includes two major updates:

- The new single-pass fast allocator (bytecodealliance#181);
- An ability to reuse allocations across runs (bytecodealliance#196).
cfallin mentioned this pull request Nov 15, 2024
cfallin added a commit to cfallin/wasmtime that referenced this pull request Nov 15, 2024
In bytecodealliance/regalloc2#181, @d-sonuga added a fast single-pass
algorithm option to regalloc2, in addition to its existing backtracking
allocator. This produces code much more quickly, at the expense of code
quality. Sometimes this tradeoff is desirable (e.g. when performing a
debug build in a fast-iteration development situation, or in an initial
JIT tier).

This PR adds a Cranelift option to select the RA2 algorithm, plumbs it
through to a Wasmtime option, and adds the option to Wasmtime fuzzing as
well.

An initial compile-time measurement in Wasmtime: `spidermonkey.wasm`
builds in 1.383s with backtracking (existing algorithm), and 1.065s with
single-pass. The resulting binary runs a simple Fibonacci benchmark in
2.060s with backtracking vs. 3.455s with single-pass.

Hence, the single-pass algorithm yields a 23% compile-time reduction, at
the cost of a 67% runtime increase.
cfallin added a commit to cfallin/wasmtime that referenced this pull request Nov 15, 2024, with the same message as above.
cfallin added a commit that referenced this pull request Nov 15, 2024
This includes two major updates:

- The new single-pass fast allocator (#181);
- An ability to reuse allocations across runs (#196).
cfallin added three more commits to cfallin/wasmtime that referenced this pull request Nov 15, 2024, each with the same message as above.
github-merge-queue bot pushed a commit to bytecodealliance/wasmtime that referenced this pull request Nov 15, 2024
* Cranelift: add option to use new single-pass register allocator.

(Same commit message as the cfallin/wasmtime commits above.)

* cargo-vet audit for allocator-api2 0.2.18 -> 0.2.20.