Skip to content

[AVR/RegAllocGreedy] Allocator clobbers a live register #81911

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
Patryk27 opened this issue Feb 15, 2024 · 11 comments · Fixed by #85277
Closed

[AVR/RegAllocGreedy] Allocator clobbers a live register #81911

Patryk27 opened this issue Feb 15, 2024 · 11 comments · Fixed by #85277
Assignees

Comments

@Patryk27
Copy link
Contributor

Patryk27 commented Feb 15, 2024

Hi,

I've got some code which works on-O0 and -O1 -regalloc=basic, but gets miscompiled with -O1 (and even -O1 -opt-bisect-limit=0!) - the issue downstream is:

Rahix/avr-hal#505 (comment)

... which I've been able to somewhat minimize, down to:

Show LLVM IR
@buf = private unnamed_addr constant <{}> zeroinitializer, align 1

define internal fastcc void @main() unnamed_addr addrspace(1) #1 {
start:
  %0 = alloca i64, align 1
  %1 = alloca { ptr, i16 }, align 1
  %2 = alloca i64, align 1
  %buf = alloca [3 x i8], align 1
  tail call void asm sideeffect alignstack "sei", "~{sreg},~{memory}"() #5, !srcloc !3
  %3 = tail call i8 asm sideeffect alignstack "in ${0}, 0x3F", "=&r,~{sreg},~{memory}"() #5, !srcloc !4
  tail call void asm sideeffect alignstack "cli", "~{sreg},~{memory}"() #5, !srcloc !5
  store i64 12, ptr %2, align 1
  call void asm sideeffect "", "r,~{memory}"(ptr nonnull %2) #5, !srcloc !6
  %counter = load i64, ptr %2, align 1, !noundef !7
  %dummy4.i = urem i64 %counter, 10
  store i64 %dummy4.i, ptr %0, align 1, !noalias !8
  call void asm sideeffect "", "r,~{memory}"(ptr nonnull %0) #5, !noalias !8, !srcloc !6
  %4 = icmp ult i64 %counter, 10
  br i1 %4, label %main.fmt_num.exit, label %bb6.i

bb2.i:
  %5 = icmp ult i32 %i.05.i, 2147483647
  %spec.select = select i1 %5, i16 0, i16 3
  %spec.select2 = select i1 %5, ptr @buf, ptr %buf
  br label %main.fmt_num.exit

bb6.i:
  %n.06.i = phi i64 [ %6, %bb6.i ], [ %counter, %start ]
  %i.05.i = phi i32 [ %7, %bb6.i ], [ 0, %start ]
  %6 = udiv i64 %n.06.i, 10
  %7 = add i32 %i.05.i, 1
  %dummy.i = urem i64 %6, 10
  store i64 %dummy.i, ptr %0, align 1, !noalias !8
  call void asm sideeffect "", "r,~{memory}"(ptr nonnull %0) #5, !noalias !8, !srcloc !6
  %8 = icmp ult i64 %n.06.i, 100
  br i1 %8, label %bb2.i, label %bb6.i

main.fmt_num.exit:
  %9 = phi i16 [ 3, %start ], [ %spec.select, %bb2.i ]
  %10 = phi ptr [ %buf, %start ], [ %spec.select2, %bb2.i ]
  store ptr %10, ptr %1, align 1
  %11 = getelementptr inbounds i8, ptr %1, i16 2
  store i16 %9, ptr %11, align 1
  call void asm sideeffect "", "r,~{memory}"(ptr nonnull %1) #5, !srcloc !6
  %12 = icmp sgt i8 %3, -1
  br i1 %12, label %bb3, label %bb2

bb3:
  call fastcc addrspace(1) void @report_irq_disabled() #5
  br label %bb5.preheader

bb2:
  call fastcc addrspace(1) void @report_irq_enabled() #5
  br label %bb5.preheader

bb5.preheader:
  br label %bb5

bb5:
  br label %bb5
}

declare noundef zeroext i1 @uart_write(ptr noalias noundef nonnull align 1, i8 noundef) unnamed_addr addrspace(1) #2
declare fastcc void @report_irq_enabled() unnamed_addr addrspace(1) #0;
declare fastcc void @report_irq_disabled() unnamed_addr addrspace(1) #0;

attributes #0 = { noinline nounwind "target-cpu"="atmega328p" }
attributes #1 = { noreturn nounwind "target-cpu"="atmega328p" }
attributes #2 = { nounwind "target-cpu"="atmega328p" }
attributes #3 = { mustprogress nocallback nofree nounwind willreturn memory(argmem: write) }
attributes #4 = { mustprogress nocallback nofree nosync nounwind willreturn memory(argmem: readwrite) }
attributes #5 = { nounwind }
attributes #6 = { noreturn nounwind }

!llvm.module.flags = !{!0, !1}
!llvm.ident = !{!2}

!0 = !{i32 8, !"PIC Level", i32 2}
!1 = !{i32 7, !"PIE Level", i32 2}
!2 = !{!"rustc version 1.78.0-nightly (ee9c7c940 2024-02-14)"}
!3 = !{i32 733181}
!4 = !{i32 732071}
!5 = !{i32 731113}
!6 = !{i32 189156}
!7 = !{}
!8 = !{!9}
!9 = distinct !{!9, !10, !"main.fmt_num: %buf.0"}
!10 = distinct !{!10, !"main.fmt_num"}

Now, the most important bit is this assignment:

%3 = tail call addrspace(0) i8 asm sideeffect alignstack "in ${0}, 0x3F", "=&r,~{sreg},~{memory}"() #5, !srcloc !4

... which doesn't get used up until the end of the program, when it's being tested on:

  %12 = icmp sgt i8 %3, -1
  br i1 %12, label %bb3, label %bb2

LLVM (understandably) decides to spill the result of that in ..., 0x3f into memory:

	in	r24, 63
	std	Y+3, r24

... to later load back:

	ldd	r22, Y+3

... and - even later - test on:

.LBB0_18:
	/* ... */        
	tst	r22
	brmi	.LBB0_20

The direct issue is that this ldd r22, Y+3 instruction is executed only in one code path (out of two) that leads to .LBB0_18!

(that is, in the generated assembly, .LBB0_18 can be reached from two blocks, but ldd happens in only one of them)

Code seems to be compiled correctly with -O0 and -O1 -regalloc=basic - the later in particular keeps ldd and tst together instead of two different basic blocks:

	ldd	r24, Y+1
	tst	r24
	brmi	.LBB0_19
@benshi001
Copy link
Member

It seems -O2 and -O3 also have this problem.
o2o3.zip

@Patryk27
Copy link
Contributor Author

Patryk27 commented Mar 3, 2024

I've got some progress! -- I think this issue might be actually related to opt.

If we take a look at the input LLVM IR (the one directly generated by rustc), we'll see:

  %4 = call addrspace(0) i8 asm sideeffect alignstack "ldi ${0}, 123", "=&r,~{sreg},~{memory}"(), !srcloc !7
  store i8 %4, ptr %canary, align 1
/* ... rest of the code ... */
  %_7 = load i8, ptr %0, align 1, !noundef !3
  %10 = icmp eq i8 %_7, 123 /* the actual comparison */

But for some reason, opt moves this store from the beginning of the function up to its end!

  %5 = tail call addrspace(0) i8 asm sideeffect alignstack "ldi ${0}, 123", "=&r,~{sreg},~{memory}"() #4, !srcloc !3
/* ... rest of the code ... */
  store i8 %5, ptr %1, align 1
  %_7 = load i8, ptr %1, align 1, !noundef !5
  call addrspace(1) void @llvm.lifetime.end.p0(i64 1, ptr nonnull %1)
  %15 = icmp eq i8 %_7, 123

... this might cause the codegen/regalloc/something to go mayhem and confuse the registers in-between.

I'm attaching the input LLVM IR here (it's somewhat big, 94 KB, but opt gets that down into 13 KB):
bug.txt

Seizing the day, I've also simplified the case here - it corresponds to the following Rust code:

Show
#![feature(asm_experimental_arch)]
#![no_std]
#![no_main]

use core::arch::asm;
use core::hint::black_box;
use panic_halt as _;

#[arduino_hal::entry]
fn main() -> ! {
    let canary: u8;

    unsafe {
        asm!("ldi {reg}, 123", reg = out(reg) canary);
    }

    black_box(magic(black_box(&[]), black_box(12)));

    if black_box(canary) == 123 {
        report_ok();
    } else {
        report_err();
    }

    loop {
        //
    }
}

fn magic(buf: &[u8], mut n: u64) -> &[u8] {
    let mut i = 0;

    loop {
        black_box(n % 10);

        n /= 10;

        if n == 0 {
            return if i > 0 { &[] } else { buf };
        } else {
            i += 1;
        }
    }
}

#[inline(never)]
fn report_ok() {
    let dp = unsafe { arduino_hal::Peripherals::steal() };
    let pins = arduino_hal::pins!(dp);
    let mut serial = arduino_hal::default_serial!(dp, pins, 57600);

    serial.write_byte(b'O');
    serial.write_byte(b'\n');
}

#[inline(never)]
fn report_err() {
    let dp = unsafe { arduino_hal::Peripherals::steal() };
    let pins = arduino_hal::pins!(dp);
    let mut serial = arduino_hal::default_serial!(dp, pins, 57600);

    serial.write_byte(b'E');
    serial.write_byte(b'\n');
}

... which currently prints E instead of O.

@Patryk27
Copy link
Contributor Author

Progress: disabling register coalescing (--join-liveintervals=false) generates correct code.

@Patryk27
Copy link
Contributor Author

Patryk27 commented Mar 12, 2024

Progress: found the part where regalloc does the illegal split!

If you compile this with:

llc -march=avr -mcpu=atmega328p -O1 -o bug.asm bug.ll -debug

... you'll see this bit in logs:

selectOrSplit GPR8:%73 [16e,1032r:0)[1336r,1360B:1)[2608B,2704r:2) 0@16e 1@1336r 2@2608B-phi  weight:1.971373e-03 w=1.971373e-03
RS_Split Cascade 2
Analyze counted 4 instrs in 4 blocks, through 3 blocks.
Compact region bundles, v=3 EB#3.
/* ... */
Split for $r18 in 1 bundles, intv 1.
splitAroundRegion with 2 globals.
%bb.0 isolated.
%bb.4 isolated.
%bb.6 [1296B;1360B), uses 1336r-1336r, reg-out 1, enter after invalid, defined in block after interference.
    selectIntv 1 -> 1
    useIntv [1336r;1360B): [1336r;1360B):1
%bb.11 [2608B;2832B), uses 2704r-2704r, reg-in 1, leave before invalid, killed in block before interference.
    selectIntv 1 -> 1
    useIntv [2608B;2704r): [1336r;1360B):1 [2608B;2704r):1
%bb.2 [880B;960B) intf invalid-invalid, live-through 0 -> 1, reload on exit.
    selectIntv 1 -> 1
    enterIntvAtEnd %bb.2, 944d: valno 0 [936r;960B):1 [1336r;1360B):1 [2608B;2704r):1
Direct complement def at 16e
Removing 0 back-copies.
  blit [16e,1032r:0): [16e;936r)=0(%94):0 [936r;960B)=1(%95):0 [960B;1032r)=0(%94):0
  blit [1336r,1360B:1): [1336r;1360B)=1(%95):1
  blit [2608B,2704r:2): [2608B;2704r)=1(%95):2
  rewr %bb.6	1336e:0	early-clobber %94:gpr8 = LDDRdPtrQ %stack.4, 0 :: (load (s8) from %stack.4)
  rewr %bb.0	16e:0	INLINEASM &"ldi ${0}, 123" [sideeffect] [mayload] [maystore] [alignstack] [attdialect], $0:[regdef-ec:GPR8], def early-clobber %94:gpr8, $1:[clobber], implicit-def dead early-clobber $sreg, $2:[reguse], $r1
  rewr %bb.11	2704B:1	STDPtrQRr %stack.1, 0, %95:gpr8 :: (store (s8) into %ir.1)
  rewr %bb.4	1032B:0	STDPtrQRr %stack.4, 0, %94:gpr8 :: (store (s8) into %stack.4)
  rewr %bb.2	936B:0	%95:gpr8 = COPY %94:gpr8
queuing new interval: %94 [16e,936r:0)[960B,1032r:0) 0@16e  weight:1.810036e-03
Enqueuing %94
queuing new interval: %95 [936r,960B:0)[1336r,1360B:1)[2608B,2704r:2) 0@936r 1@1336r 2@2608B-phi  weight:2.795203e-03
Enqueuing %95

In there, LLVM decides to materialize %95:gpr8 = COPY %94:gpr only inside %bb.2!

To make it less abstract, that's how the CFG looks like before this split:

click to see graph

... where bb0 contains the ldi instruction, bb4->bb6 contain a "spill" into a temporary vreg %72, and later the flow converges back at bb11, in both cases loading stuff from vreg %73.

After this selectOrSplit from above, LLVM generates an invalid CFG, where vreg %95 is left uninitialized in one of the code paths:

click to see graph

(almost as if %95 = COPY %94 was supposed to be put in both blocks or inside bb11 instead?)

It's even possible to force LLVM to tell yo something's wrong, but it takes two commands:

llc -march=avr -mcpu=atmega328p -O1 -o bug.mir bug.ll -stop-after=greedy
llc -march=avr -mcpu=atmega328p -O1 -o bug.asm bug.mir -start-after=greedy

*** Bad machine code: Virtual register defs don't dominate all uses. ***
- function:    main
- v. register: %95

(edit: -verify-machineinstrs does the job as well!)

@Patryk27
Copy link
Contributor Author

Patryk27 commented Mar 12, 2024

@aykevl, if you don't mind me pinging - you've used to do some llvm-avr stuff, maybe you've got an idea what can be happening in here?

It doesn't seem to be AVR-specific this time, but I'm having hard time trying to reproduce it on other architectures -- and regalloc is not the easiest code to follow 👀

@aykevl
Copy link
Contributor

aykevl commented Mar 13, 2024

I've done some LLVM stuff but I haven't touched the register allocator. Looks like you're already further into it now that I have ever been.
The main thing to look out for is that all constraints are put in the correct place for each instruction, and that all the hooks are correct. There are a number of backend hooks that modify instructions such as AVRInstrInfo::copyPhysReg that you could take a look at. I've seen bugs in them in the past.

@benshi001
Copy link
Member

benshi001 commented Mar 14, 2024

I have not looked deep into this issue, but there is another similar issue #55159. They both

  1. are triggered by inline assembly.
  2. get wrong in register allocation.

@benshi001 benshi001 self-assigned this Mar 14, 2024
@benshi001
Copy link
Member

I will have a look, maybe several weeks later when I have spare time.

@Patryk27
Copy link
Contributor Author

Patryk27 commented Mar 14, 2024

Thanks, @aykevl - your idea with looking into copyPhysReg() gave an insight!

Following loadRegFromStackSlot(), I've taken a look into the tablegen and found this constraint:

let Constraints = "@earlyclobber $reg" in def LDDRdPtrQ

It looks like removing that constraint (i.e. having just def LDDRdPtrQ) solves the issue and doesn't cause any test to fail.

I'm yet to understand whether removing that constraint actually makes sense, but it's a step forward 😄

@aykevl
Copy link
Contributor

aykevl commented Mar 14, 2024

It looks like removing that constraint (i.e. having just def LDDRdPtrQ) solves the issue and doesn't cause any test to fail.

My guess would be that it only hides the bug (not actually solves it), but it's useful to know of course.

@Patryk27
Copy link
Contributor Author

Patryk27 commented Mar 14, 2024

I think it might be related - it looks like that's the same reason we've got this 👀

// An identical pseudo instruction to LDDWRdPtrQ, expect restricted to the Y

Having that said, maybe the proper fix would be to create a separate opcode for the loadRegFromStackSlot() to use, not sure, just throwing some guesses.

benshi001 pushed a commit that referenced this issue Mar 15, 2024
LDDRdPtrQ was marked as `earlyclobber`, which doesn't play well with
GreedyRA (which can generate this instruction through `loadRegFromStackSlot()`).

This seems to be the same case as:

https://github.com/llvm/llvm-project/blob/a99b912c9b74f6ef91786b4dfbc25160c27d3b41/llvm/lib/Target/AVR/AVRInstrInfo.td#L1421

Closes #81911.
llvmbot pushed a commit to llvmbot/llvm-project that referenced this issue Mar 16, 2024
LDDRdPtrQ was marked as `earlyclobber`, which doesn't play well with
GreedyRA (which can generate this instruction through `loadRegFromStackSlot()`).

This seems to be the same case as:

https://github.com/llvm/llvm-project/blob/a99b912c9b74f6ef91786b4dfbc25160c27d3b41/llvm/lib/Target/AVR/AVRInstrInfo.td#L1421

Closes llvm#81911.

(cherry picked from commit 328cb9b)
@benshi001 benshi001 added this to the LLVM 18.X Release milestone Mar 18, 2024
@github-project-automation github-project-automation bot moved this to Needs Triage in LLVM Release Status Mar 18, 2024
@benshi001 benshi001 assigned Patryk27 and unassigned benshi001 Mar 18, 2024
@nikic nikic moved this from Needs Triage to Done in LLVM Release Status Mar 19, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
Development

Successfully merging a pull request may close this issue.

4 participants