
Fastalloc1 #181

Merged: 103 commits merged into bytecodealliance:main on Sep 23, 2024
Conversation

@d-sonuga (Contributor) commented Aug 5, 2024

This is the initial implementation of the fast register allocator in src/fastalloc.
It's still a work in progress and I haven't done any optimization on it yet; it still uses less-than-optimal data structures in places.

During initial development, I relied solely on the fuzzer to surface cases to think about and to determine correctness, and after running it for a few days without any errors I thought I was done getting the algorithm correct. But after trying to measure its performance with Sightglass and watching it crash several times, I rethought and reworked the code. I've now run it through the fuzzer for a few hours, but it still crashes on some of the benchmarks, and that's what I'm working on resolving now.

I'm opening this draft PR now to get some feedback on what I've done so far and on the algorithm itself.

A summary of performance:

Comparing the Rust compiler benchmarks with fastalloc vs. Ion, we have a speedup of 0.3%-18%.
Comparing the Sightglass benchmarks: two had no significant difference in performance; for the rest, compiling with fastalloc gives a speedup of about 1.07-5.26x.
Profiling Wasmtime compilation on inputs from the Sightglass benchmarks shows an increase in regalloc speed of about 6x.

@cfallin (Member) left a comment

Some initial comments from a pass over most of the "peripheral" code -- everything but the core in fastalloc/mod.rs. I'll continue my detailed pass over that and have some more thoughts later.

Overall impression: a very good start! I'm happy to see some nice abstractions built up, and the overall shape of the code looks like what I would expect. We should be pretty close to being able to land the initial version, once the notes here and whatever Amanieu sees are addressed, and the correctness issue you're chasing now is fixed.

Thanks again for all the work on this!

src/lib.rs Outdated
@@ -1558,4 +1563,7 @@ pub struct RegallocOptions {

/// Run the SSA validator before allocating registers.
pub validate_ssa: bool,

/// Run the SSRA algorithm
pub use_fastalloc: bool,
Member:

Rather than a bool, can we define an enum Algorithm { Ion, Fastalloc } and take an option pub algorithm: Algorithm? This leaves room for adding others in the future and makes the intent more clear.

(We can keep the Default derivation working by adding #[default] to the Ion arm of the enum)
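
A minimal sketch of the suggested shape (the extra derives here are assumptions, not the final API):

#[derive(Clone, Copy, Debug, Default, PartialEq, Eq)]
pub enum Algorithm {
    /// The existing backtracking allocator.
    #[default]
    Ion,
    /// The new single-pass allocator.
    Fastalloc,
}

// ...and in RegallocOptions, replacing `use_fastalloc`:
// pub algorithm: Algorithm,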


trace!("Final edits: {:?}", env.edits);
trace!("safepoint_slots: {:?}", env.safepoint_slots);
trace!("\n\n\n\n\n\n\n");
Member:

minor nit, but cargo fmt should indent these statements properly

struct Z(usize);
impl std::fmt::Debug for Z {
    fn fmt(&self, f: &mut core::fmt::Formatter<'_>) -> core::fmt::Result {
        write!(f, "v{}", self.0)
Member:

I think the VReg should have an equivalent Display implementation, so instead we could push VReg::new(i, RegClass::Int) below (class isn't printed by Display)

Reuse,
}

impl PartialEq<OperandConstraint> for OperandConstraintKind {
Member:

Rather than this impl, can we implement From<OperandConstraint> for OperandConstraintKind, then one could do constraint.kind() == OperandConstraintKind::Reg? This is slightly more general and cleaner in my experience...
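
A minimal sketch of the suggested conversion; the exact variant sets of OperandConstraint and OperandConstraintKind here are assumptions made for illustration:

impl From<OperandConstraint> for OperandConstraintKind {
    fn from(constraint: OperandConstraint) -> Self {
        // Drop any payload, keeping only the kind of constraint.
        match constraint {
            OperandConstraint::Any => Self::Any,
            OperandConstraint::Reg => Self::Reg,
            OperandConstraint::Stack => Self::Stack,
            OperandConstraint::FixedReg(_) => Self::FixedReg,
            OperandConstraint::Reuse(_) => Self::Reuse,
        }
    }
}

// Hypothetical helper so call sites can read:
//     constraint.kind() == OperandConstraintKind::Reg
impl OperandConstraint {
    fn kind(self) -> OperandConstraintKind {
        self.into()
    }
}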

}

#[derive(Clone, Copy, PartialEq)]
struct SearchConstraint {
Member:

This is a neat abstraction and is certainly providing some benefit by generalizing all the cases below. However I worry a little bit about efficiency here: we're explicitly building a data structure in memory to define what kind of operands we're matching, and then dispatching based on this (iterating, in some cases). Data-driven control flow that is "constant in practice" can sometimes be optimized but often not. It may be simpler and faster to do something like:

impl<'a> Operands<'a> {
  fn matching<F: Fn(Operand) -> bool + 'a>(&self, predicate: F) -> impl Iterator<Item = Operand> + 'a {
    self.iter().filter(move |op| predicate(*op))
  }
}

then

impl<'a> Operands<'a> {
  fn non_fixed_non_late(&self) -> impl Iterator<Item = Operand> + 'a {
    self.matching(|op| op.pos() == ... && op.constraint().kind() != ...)
  }
}

This offers a little more flexibility (vs. e.g. the need to have two slots for "must not have") and should inline nicely into a filtered for-loop.

#[derive(Clone, Copy, Debug)]
pub struct LruNode {
    /// The previous physical register in the list.
    pub prev: usize,
Member:

These can almost certainly be u16 and we'll save a little memory (so cache locality) by doing so...

Contributor Author:

Is there any reason why it can't be a u8?

Member:

I had been thinking about the total number of possible PRegs and for some reason had thought we could have more than 256 of them across all classes; but (i) we actually can't, since the entire PReg is packed into 8 bits, and (ii) your LRU data structure is per-class anyway. So given both those reasons it can definitely be a u8!
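
A small sketch of the resulting node layout (assuming the `next` link shrinks along with `prev`):

#[derive(Clone, Copy, Debug)]
pub struct LruNode {
    /// The previous physical register in the list, stored as a hw_enc index (fits in a u8).
    pub prev: u8,
    /// The next physical register in the list, stored as a hw_enc index (fits in a u8).
    pub next: u8,
}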

}
self.head = i;
trace!("Poked {:?} in {:?} LRU", preg, self.regclass);
self.check_for_cycle();
Member:

This could probably be annotated with #[cfg(debug_assertions)] -- that will enable this call when we have assertions compiled in, e.g. for fuzzing or in a general debug build, but avoid it in release builds where performance matters. I suspect that this is probably a reasonably significant slowdown right now if it's running in all your performance tests!

(likewise below too)
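
A sketch of the gating being suggested at the call site (the walk inside check_for_cycle could also be compiled out the same way):

self.head = i;
trace!("Poked {:?} in {:?} LRU", preg, self.regclass);
// Only run the consistency check when debug assertions are compiled in
// (debug builds, fuzzing); skip it entirely in release builds.
#[cfg(debug_assertions)]
self.check_for_cycle();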


impl Allocs {
    fn new<F: Function>(func: &F, env: &MachineEnv) -> Self {
        // The number of operands is <= number of virtual registers
Member:

I don't think this is actually true -- we could have more operands than virtual registers because of uses (imagine we have a function with a single vreg v1, and an instruction that uses v1 a million times). It shouldn't really matter here since this is just a pre-allocation heuristic, but it may be better to do something like func.num_insts() * 3 as a first guess.
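
A one-line sketch of that first guess (the factor of 3 is the heuristic suggested above, not a measured constant; the element type is assumed to be Allocation):

// Pre-allocation heuristic only; the Vec still grows if the guess is low.
let capacity_guess = func.num_insts() * 3;
let allocs: Vec<Allocation> = Vec::with_capacity(capacity_guess);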

Contributor Author:

Ah, yes. This is an outdated comment.

/// The virtual registers that are currently live.
live_vregs: HashSet<VReg>,
/// Allocatable free physical registers for classes Int, Float, and Vector, respectively.
freepregs: PartedByRegClass<BTreeSet<PReg>>,
Member:

As Amanieu mentioned earlier, a PRegSet should be much more efficient here (and I think it will allow us to replace some iteration below with intersection/union operations).
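
A hedged sketch of what the switch could look like, using only PRegSet's basic operations (building the per-class free set from MachineEnv; whether intersection helpers apply depends on the surrounding code):

use crate::{MachineEnv, PRegSet, RegClass};

// Build the initial free-register set for one class as a bit set
// instead of a BTreeSet<PReg>.
fn initial_free_pregs(env: &MachineEnv, class: RegClass) -> PRegSet {
    let mut set = PRegSet::empty();
    for &preg in env.preferred_regs_by_class[class as usize]
        .iter()
        .chain(env.non_preferred_regs_by_class[class as usize].iter())
    {
        set.add(preg);
    }
    set
}

// Allocating a register becomes set.remove(preg), freeing it becomes set.add(preg),
// and merging two sets is a single bitwise union via set.union_from(other).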

/// When allocation is completed, this contains all the safepoint instructions
/// in the function.
/// This is used to build the stackmap after allocation is complete.
safepoint_insts: Vec<(Block, Inst)>,
Member:

Also as we noted in the call -- we can remove safepoint-related logic as we're likely to switch over to a non-regalloc-provided safepoint implementation in Cranelift soon. This should simplify things quite a bit!

@cfallin (Member) left a comment

I made another pass over this, and reviewed src/fastalloc/mod.rs (the core algorithms) for plausibility and code style -- the open threads with @Amanieu still need to be resolved, but once they are, I'm happy to see this merge!


pub fn insert(&mut self, vreg: VReg) {
    // Intentionally assuming that the set doesn't already
    // contain `vreg`.
Member:

Can we assert this? It looks like we always use self.items[vreg.vreg()] as the node for this entry -- can we check if its vreg field is VReg::invalid(), for example, and set it as such when we remove?
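
A sketch of the suggested check, relying on the `vreg` field shown in the hunk below and on `remove` resetting it to VReg::invalid():

pub fn insert(&mut self, vreg: VReg) {
    // The slot must currently be unoccupied; `remove` is expected to have
    // reset its `vreg` field to the invalid sentinel.
    debug_assert_eq!(self.items[vreg.vreg()].vreg, VReg::invalid());
    self.items[vreg.vreg()].vreg = vreg;
    // ...link the node into the list as before...
}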

next: VRegIndex::new(num_vregs),
vreg: VReg::invalid()
};
num_vregs + 1
Member:

it looks like we're using index 0 as a sentinel node (which is a great idea); do we need to add one below when we index into it with vreg numbers?

@d-sonuga (Contributor Author) commented Sep 14, 2024:

It's the last index that's used as a sentinel, so there isn't any need to add one.
VReg indexes range from 0 to n - 1; the last position, n, is used as the head.

regs[i.checked_sub(1).unwrap_or(no_of_regs - 1)],
regs[if i >= no_of_regs - 1 { 0 } else { i + 1 }],
);
data[reg.hw_enc()].prev = prev_reg.hw_enc() as u8;
Member:

Indeed, the issue is that this will result in sparse usage of data if we use the bitpacked index -- not the worst problem but something it'd be reasonable to avoid if we can. I'm personally fine with hw_enc here.

// Some edits are added due to clobbers, not operands.
// Anyways, I think this may be a reasonable guess.
let inst_edits_len_guess = max_operand_len as usize * 2;
let total_edits_len_guess = inst_edits_len_guess * num_insts;
Member:

it might be good to collect some data manually: record total_edits_len_guess in this struct, and when done with allocation, print that guess, as well as edits.len(). This worst-case-size logic seems reasonable, but it also assumes all instructions will have max_operand_len (which may be large due to e.g. a call with many arguments) so I wonder if it may actually be much larger than needed. If we find that's the case, we could make a capacity estimate based on (say) num_insts * K where K is 2, or 4, or whatever factor we see.
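
A sketch of the kind of instrumentation meant here; the `stats` field is hypothetical and would only be kept while gathering data:

// Hypothetical: remember the guess so it can be compared with reality later.
self.stats.total_edits_len_guess = total_edits_len_guess;

// ...and once allocation is complete:
trace!(
    "total_edits_len_guess: {}, actual edits: {}",
    self.stats.total_edits_len_guess,
    self.edits.len()
);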

return self.fixed_stack_slots.contains(alloc.as_reg().unwrap());
}
false
}
Member:

We can write this as alloc.is_stack() || (alloc.is_reg() && self.fixed_stack_slots.contains(...)) -- possibly a little more concise?
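
Spelled out as a full body (the method name and signature are assumed from context):

fn is_stack(&self, alloc: Allocation) -> bool {
    alloc.is_stack()
        || (alloc.is_reg() && self.fixed_stack_slots.contains(alloc.as_reg().unwrap()))
}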

}
}

fn evict_vreg_in_preg_before_inst(&mut self, inst: Inst, preg: PReg) {
Member:

It looks like this function and evict_vreg_in_preg below are the same except for the InstPosition on the move; is that right? Can we pass it in as a parameter and deduplicate the two functions?
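
A sketch of the suggested deduplication; which InstPosition each existing wrapper passes is an assumption here:

fn evict_vreg_in_preg_at(&mut self, inst: Inst, preg: PReg, pos: InstPosition) {
    // ...the shared eviction logic, emitting the spill move at `pos`...
}

fn evict_vreg_in_preg_before_inst(&mut self, inst: Inst, preg: PReg) {
    self.evict_vreg_in_preg_at(inst, preg, InstPosition::Before)
}

fn evict_vreg_in_preg(&mut self, inst: Inst, preg: PReg) {
    self.evict_vreg_in_preg_at(inst, preg, InstPosition::After)
}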

@cfallin (Member) left a comment

Thanks a bunch for this new allocator implementation! Now that the initial version is complete and reviewed, let's merge it, and we can iterate on it and provide some performance summaries as followups.

cfallin merged commit e684ee5 into bytecodealliance:main on Sep 23, 2024
6 checks passed
@d-sonuga (Contributor Author) commented Sep 30, 2024

A summary of performance:

Comparing the Rust compiler benchmarks with fastalloc vs. Ion, we have a speedup of 0.3%-18%.
Comparing the Sightglass benchmarks: two had no significant difference in performance; for the rest, compiling with fastalloc gives a speedup of about 1.07-5.26x.
Profiling Wasmtime compilation on inputs from the Sightglass benchmarks shows an increase in regalloc speed of about 6x.

cfallin added a commit to cfallin/regalloc2 that referenced this pull request Nov 15, 2024
This includes two major updates:

- The new single-pass fast allocator (bytecodealliance#181);
- An ability to reuse allocations across runs (bytecodealliance#196).
cfallin mentioned this pull request Nov 15, 2024
cfallin added a commit to cfallin/wasmtime that referenced this pull request Nov 15, 2024
In bytecodealliance/regalloc2#181, @d-sonuga added a fast single-pass
algorithm option to regalloc2, in addition to its existing backtracking
allocator. This produces code much more quickly, at the expense of code
quality. Sometimes this tradeoff is desirable (e.g. when performing a
debug build in a fast-iteration development situation, or in an initial
JIT tier).

This PR adds a Cranelift option to select the RA2 algorithm, plumbs it
through to a Wasmtime option, and adds the option to Wasmtime fuzzing as
well.

An initial compile-time measurement in Wasmtime: `spidermonkey.wasm`
builds in 1.383s with backtracking (existing algorithm), and 1.065s with
single-pass. The resulting binary runs a simple Fibonacci benchmark in
2.060s with backtracking vs. 3.455s with single-pass.

Hence, the single-pass algorithm yields a 23% compile-time reduction, at
the cost of a 67% runtime increase.
cfallin added a commit to cfallin/wasmtime that referenced this pull request Nov 15, 2024, with the same message as above.
cfallin added a commit that referenced this pull request Nov 15, 2024
This includes two major updates:

- The new single-pass fast allocator (#181);
- An ability to reuse allocations across runs (#196).
cfallin added three more commits to cfallin/wasmtime that referenced this pull request Nov 15, 2024, each with the same message as above.
github-merge-queue bot pushed a commit to bytecodealliance/wasmtime that referenced this pull request Nov 15, 2024
* Cranelift: add option to use new single-pass register allocator.

(Same commit message as the cfallin/wasmtime commits above.)

* cargo-vet audit for allocator-api2 0.2.18 -> 0.2.20.