[RISCV][WIP] Let RA do the CSR saves. #90819

mgudim · 2024-05-02T05:08:50Z

We turn the problem of saving and restoring callee-saved registers efficiently into a register allocation problem. This has the advantage that the register allocator can essentialy do shrink-wrapping on per register basis. Currently, shrink-wrapping pass saves all CSR in the same place which may be suboptimal. Also, improvements to register allocation / coalescing will translate to improvements in shrink-wrapping.

In finalizeLowering() we copy all callee-saved registers from a physical register to a virtual one. In all return blocks we copy do the reverse.

llvmbot · 2024-05-02T05:09:27Z

@llvm/pr-subscribers-backend-risc-v

Author: Mikhail Gudim (mgudim)

Changes

We turn the problem of saving and restoring callee-saved registers efficiently into a register allocation problem. This has the advantage that the register allocator can essentialy do shrink-wrapping on per register basis. Currently, shrink-wrapping pass saves all CSR in the same place which may be suboptimal. Also, improvements to register allocation / coalescing will translate to improvements in shrink-wrapping.

In finalizeLowering() we copy all callee-saved registers from a physical register to a virtual one. In all return blocks we copy do the reverse.

Full diff: https://github.com/llvm/llvm-project/pull/90819.diff

6 Files Affected:

(modified) llvm/lib/Target/RISCV/RISCVFrameLowering.cpp (+57-6)
(modified) llvm/lib/Target/RISCV/RISCVFrameLowering.h (+1)
(modified) llvm/lib/Target/RISCV/RISCVISelLowering.cpp (+77)
(modified) llvm/lib/Target/RISCV/RISCVISelLowering.h (+2)
(modified) llvm/lib/Target/RISCV/RISCVSubtarget.cpp (+9)
(modified) llvm/lib/Target/RISCV/RISCVSubtarget.h (+2)

diff --git a/llvm/lib/Target/RISCV/RISCVFrameLowering.cpp b/llvm/lib/Target/RISCV/RISCVFrameLowering.cpp
index cb41577c5d9435..b725bfb56389bc 100644
--- a/llvm/lib/Target/RISCV/RISCVFrameLowering.cpp
+++ b/llvm/lib/Target/RISCV/RISCVFrameLowering.cpp
@@ -1026,12 +1026,51 @@ RISCVFrameLowering::getFrameIndexReference(const MachineFunction &MF, int FI,
   return Offset;
 }
 
-void RISCVFrameLowering::determineCalleeSaves(MachineFunction &MF,
-                                              BitVector &SavedRegs,
-                                              RegScavenger *RS) const {
-  TargetFrameLowering::determineCalleeSaves(MF, SavedRegs, RS);
-  // Unconditionally spill RA and FP only if the function uses a frame
-  // pointer.
+void RISCVFrameLowering::determineMustCalleeSaves(MachineFunction &MF,
+                                              BitVector &SavedRegs) const {
+  const TargetRegisterInfo &TRI = *MF.getSubtarget().getRegisterInfo();
+
+  // Resize before the early returns. Some backends expect that
+  // SavedRegs.size() == TRI.getNumRegs() after this call even if there are no
+  // saved registers.
+  SavedRegs.resize(TRI.getNumRegs());
+
+  // When interprocedural register allocation is enabled caller saved registers
+  // are preferred over callee saved registers.
+  if (MF.getTarget().Options.EnableIPRA &&
+      isSafeForNoCSROpt(MF.getFunction()) &&
+      isProfitableForNoCSROpt(MF.getFunction()))
+    return;
+
+  // Get the callee saved register list...
+  const MCPhysReg *CSRegs = MF.getRegInfo().getCalleeSavedRegs();
+
+  // Early exit if there are no callee saved registers.
+  if (!CSRegs || CSRegs[0] == 0)
+    return;
+
+  // In Naked functions we aren't going to save any registers.
+  if (MF.getFunction().hasFnAttribute(Attribute::Naked))
+    return;
+
+  // Noreturn+nounwind functions never restore CSR, so no saves are needed.
+  // Purely noreturn functions may still return through throws, so those must
+  // save CSR for caller exception handlers.
+  //
+  // If the function uses longjmp to break out of its current path of
+  // execution we do not need the CSR spills either: setjmp stores all CSRs
+  // it was called with into the jmp_buf, which longjmp then restores.
+  if (MF.getFunction().hasFnAttribute(Attribute::NoReturn) &&
+        MF.getFunction().hasFnAttribute(Attribute::NoUnwind) &&
+        !MF.getFunction().hasFnAttribute(Attribute::UWTable) &&
+        enableCalleeSaveSkip(MF))
+    return;
+
+  // Functions which call __builtin_unwind_init get all their registers saved.
+  if (MF.callsUnwindInit()) {
+    SavedRegs.set();
+    return;
+  }
   if (hasFP(MF)) {
     SavedRegs.set(RISCV::X1);
     SavedRegs.set(RISCV::X8);
@@ -1041,6 +1080,18 @@ void RISCVFrameLowering::determineCalleeSaves(MachineFunction &MF,
     SavedRegs.set(RISCVABI::getBPReg());
 }
 
+void RISCVFrameLowering::determineCalleeSaves(MachineFunction &MF,
+                                              BitVector &SavedRegs,
+                                              RegScavenger *RS) const {
+  const auto &ST = MF.getSubtarget<RISCVSubtarget>();
+  const Function &F = MF.getFunction();
+  determineMustCalleeSaves(MF, SavedRegs);
+  if (ST.doCSRSavesInRA() && F.doesNotThrow())
+    return;
+
+  TargetFrameLowering::determineCalleeSaves(MF, SavedRegs, RS);
+}
+
 std::pair<int64_t, Align>
 RISCVFrameLowering::assignRVVStackObjectOffsets(MachineFunction &MF) const {
   MachineFrameInfo &MFI = MF.getFrameInfo();
diff --git a/llvm/lib/Target/RISCV/RISCVFrameLowering.h b/llvm/lib/Target/RISCV/RISCVFrameLowering.h
index 28ab4aff3b9d51..d7b9df8bd68515 100644
--- a/llvm/lib/Target/RISCV/RISCVFrameLowering.h
+++ b/llvm/lib/Target/RISCV/RISCVFrameLowering.h
@@ -31,6 +31,7 @@ class RISCVFrameLowering : public TargetFrameLowering {
   StackOffset getFrameIndexReference(const MachineFunction &MF, int FI,
                                      Register &FrameReg) const override;
 
+  void determineMustCalleeSaves(MachineFunction &MF, BitVector &SavedRegs) const;
   void determineCalleeSaves(MachineFunction &MF, BitVector &SavedRegs,
                             RegScavenger *RS) const override;
 
diff --git a/llvm/lib/Target/RISCV/RISCVISelLowering.cpp b/llvm/lib/Target/RISCV/RISCVISelLowering.cpp
index 19ef1f2f18ec1a..7978dac4aa7944 100644
--- a/llvm/lib/Target/RISCV/RISCVISelLowering.cpp
+++ b/llvm/lib/Target/RISCV/RISCVISelLowering.cpp
@@ -21314,6 +21314,83 @@ unsigned RISCVTargetLowering::getCustomCtpopCost(EVT VT,
   return isCtpopFast(VT) ? 0 : 1;
 }
 
+void RISCVTargetLowering::finalizeLowering(MachineFunction &MF) const {
+  const Function &F = MF.getFunction();
+  if (!Subtarget.doCSRSavesInRA() || !F.doesNotThrow()) { 
+    TargetLoweringBase::finalizeLowering(MF);
+    return;
+  }
+
+  MachineRegisterInfo &MRI = MF.getRegInfo();
+  const TargetInstrInfo &TII = *MF.getSubtarget().getInstrInfo();
+  const RISCVRegisterInfo &TRI = *Subtarget.getRegisterInfo();
+  const RISCVFrameLowering &TFI = *Subtarget.getFrameLowering();
+
+  SmallVector<MachineBasicBlock *, 4> RestoreMBBs;
+  SmallVector<MachineBasicBlock *, 4> SaveMBBs;
+  SaveMBBs.push_back(&MF.front());
+  for (MachineBasicBlock &MBB : MF) {
+    if (MBB.isReturnBlock())
+      RestoreMBBs.push_back(&MBB);
+  }
+
+  BitVector MustCalleeSavedRegs;
+  TFI.determineMustCalleeSaves(MF, MustCalleeSavedRegs);
+  const MCPhysReg * CSRegs = MF.getRegInfo().getCalleeSavedRegs();
+  SmallVector<MCPhysReg, 4> EligibleRegs;
+  for (int i = 0; CSRegs[i]; ++i) {
+    if (!MustCalleeSavedRegs[i])
+      EligibleRegs.push_back(CSRegs[i]);
+  }
+
+  dbgs() << "EligibleRegs: " << EligibleRegs.size() << "\n";
+  SmallVector<Register, 4> VRegs;
+  for (MachineBasicBlock *SaveMBB : SaveMBBs) {
+    for (MCPhysReg Reg : EligibleRegs) {
+      SaveMBB->addLiveIn(Reg);
+      // TODO: should we use Maximal register class instead?
+      Register VReg = MRI.createVirtualRegister(TRI.getMinimalPhysRegClass(Reg));
+      VRegs.push_back(VReg);
+      BuildMI(
+        *SaveMBB,
+        SaveMBB->begin(),
+        SaveMBB->findDebugLoc(SaveMBB->begin()),
+        TII.get(TargetOpcode::COPY),
+        VReg
+      )
+      .addReg(Reg);
+    }
+  }
+
+  for (MachineBasicBlock *RestoreMBB : RestoreMBBs) {
+    MachineInstr &ReturnMI = RestoreMBB->back();
+    assert(ReturnMI.isReturn() && "Expected return instruction!");
+    auto VRegI = VRegs.begin();
+    for (MCPhysReg Reg : EligibleRegs) {
+      Register VReg = *VRegI;
+      BuildMI(
+        *RestoreMBB,
+        ReturnMI.getIterator(),
+        ReturnMI.getDebugLoc(),
+        TII.get(TargetOpcode::COPY),
+        Reg
+      )
+      .addReg(VReg);
+      ReturnMI.addOperand(
+        MF,
+        MachineOperand::CreateReg(
+          Reg,
+          /*isDef=*/false,
+          /*isImplicit=*/true
+        )
+      );
+      VRegI++;
+    }
+  }
+
+  TargetLoweringBase::finalizeLowering(MF);
+}
+
 bool RISCVTargetLowering::fallBackToDAGISel(const Instruction &Inst) const {
 
   // GISel support is in progress or complete for these opcodes.
diff --git a/llvm/lib/Target/RISCV/RISCVISelLowering.h b/llvm/lib/Target/RISCV/RISCVISelLowering.h
index 78f99e70c083a7..ea1079af2ead05 100644
--- a/llvm/lib/Target/RISCV/RISCVISelLowering.h
+++ b/llvm/lib/Target/RISCV/RISCVISelLowering.h
@@ -853,6 +853,8 @@ class RISCVTargetLowering : public TargetLowering {
 
   bool fallBackToDAGISel(const Instruction &Inst) const override;
 
+  void finalizeLowering(MachineFunction &MF) const override;
+
   bool lowerInterleavedLoad(LoadInst *LI,
                             ArrayRef<ShuffleVectorInst *> Shuffles,
                             ArrayRef<unsigned> Indices,
diff --git a/llvm/lib/Target/RISCV/RISCVSubtarget.cpp b/llvm/lib/Target/RISCV/RISCVSubtarget.cpp
index d3236bb07d56d5..15476fc2d3c583 100644
--- a/llvm/lib/Target/RISCV/RISCVSubtarget.cpp
+++ b/llvm/lib/Target/RISCV/RISCVSubtarget.cpp
@@ -65,6 +65,11 @@ static cl::opt<unsigned> RISCVMinimumJumpTableEntries(
     "riscv-min-jump-table-entries", cl::Hidden,
     cl::desc("Set minimum number of entries to use a jump table on RISCV"));
 
+static cl::opt<bool> RISCVEnableSaveCSRByRA(
+    "riscv-enable-save-csr-in-ra",
+    cl::desc("Let register alloctor do csr saves/restores"),
+    cl::init(false), cl::Hidden);
+
 void RISCVSubtarget::anchor() {}
 
 RISCVSubtarget &
@@ -130,6 +135,10 @@ bool RISCVSubtarget::useConstantPoolForLargeInts() const {
   return !RISCVDisableUsingConstantPoolForLargeInts;
 }
 
+bool RISCVSubtarget::doCSRSavesInRA() const {
+  return RISCVEnableSaveCSRByRA;
+}
+
 unsigned RISCVSubtarget::getMaxBuildIntsCost() const {
   // Loading integer from constant pool needs two instructions (the reason why
   // the minimum cost is 2): an address calculation instruction and a load
diff --git a/llvm/lib/Target/RISCV/RISCVSubtarget.h b/llvm/lib/Target/RISCV/RISCVSubtarget.h
index c880c9e921e0ea..f3d8a70c0df14e 100644
--- a/llvm/lib/Target/RISCV/RISCVSubtarget.h
+++ b/llvm/lib/Target/RISCV/RISCVSubtarget.h
@@ -270,6 +270,8 @@ class RISCVSubtarget : public RISCVGenSubtargetInfo {
 
   bool useConstantPoolForLargeInts() const;
 
+  bool doCSRSavesInRA() const;
+
   // Maximum cost used for building integers, integers will be put into constant
   // pool if exceeded.
   unsigned getMaxBuildIntsCost() const;

github-actions · 2024-05-02T05:11:46Z

⚠️ C/C++ code formatter, clang-format found issues in your code. ⚠️

You can test this locally with the following command:

git-clang-format --diff 5192cb772ad58af4b557539791ff8de60ab450a3 10e60c750ec233404e86941d35c56b4f3c346d29 --extensions h,cpp -- llvm/lib/Target/RISCV/RISCVCFIInserter.cpp llvm/include/llvm/CodeGen/ReachingDefAnalysis.h llvm/include/llvm/CodeGen/TargetFrameLowering.h llvm/include/llvm/CodeGen/TargetSubtargetInfo.h llvm/lib/CodeGen/MachineLICM.cpp llvm/lib/CodeGen/PrologEpilogInserter.cpp llvm/lib/CodeGen/ReachingDefAnalysis.cpp llvm/lib/CodeGen/RegAllocEvictionAdvisor.cpp llvm/lib/CodeGen/TargetSubtargetInfo.cpp llvm/lib/Target/RISCV/RISCV.h llvm/lib/Target/RISCV/RISCVFrameLowering.cpp llvm/lib/Target/RISCV/RISCVFrameLowering.h llvm/lib/Target/RISCV/RISCVISelLowering.cpp llvm/lib/Target/RISCV/RISCVISelLowering.h llvm/lib/Target/RISCV/RISCVInstrInfo.h llvm/lib/Target/RISCV/RISCVMachineFunctionInfo.cpp llvm/lib/Target/RISCV/RISCVMachineFunctionInfo.h llvm/lib/Target/RISCV/RISCVRegisterInfo.cpp llvm/lib/Target/RISCV/RISCVSubtarget.cpp llvm/lib/Target/RISCV/RISCVSubtarget.h llvm/lib/Target/RISCV/RISCVTargetMachine.cpp

View the diff from clang-format here.

diff --git a/llvm/include/llvm/CodeGen/TargetFrameLowering.h b/llvm/include/llvm/CodeGen/TargetFrameLowering.h
index 3d0a5151cb..d63f3992cd 100644
--- a/llvm/include/llvm/CodeGen/TargetFrameLowering.h
+++ b/llvm/include/llvm/CodeGen/TargetFrameLowering.h
@@ -34,7 +34,7 @@ namespace llvm {
     WasmLocal = 3,
     NoAlloc = 255
   };
-}
+  }
 
 /// Information about stack frame layout on the target.  It holds the direction
 /// of stack growth, the known stack alignment on entry to each function, and
diff --git a/llvm/lib/Target/RISCV/RISCVCFIInserter.cpp b/llvm/lib/Target/RISCV/RISCVCFIInserter.cpp
index 24017c50df..b1a4bb050b 100644
--- a/llvm/lib/Target/RISCV/RISCVCFIInserter.cpp
+++ b/llvm/lib/Target/RISCV/RISCVCFIInserter.cpp
@@ -199,7 +199,8 @@ void RISCVCFIInstrInserter::calculateCFAInfo(MachineFunction &MF) {
   updateSuccCFAInfo(MBBVector[MF.front().getNumber()]);
 
   LLVM_DEBUG(dbgs() << "Calculated CFI info for " << MF.getName() << "\n";
-             for (MachineBasicBlock &MBB : MF) {
+             for (MachineBasicBlock &MBB
+                  : MF) {
                dbgs() << "BasicBlock: " << MBB.getNumber() << " "
                       << MBB.getName() << "\n";
                dbgs() << "IncomingCSRLocations:\n";

mgudim · 2024-05-02T05:24:43Z

Currently the shrink wrapping saves / restores all registers at the same point. We noticed that this is not optimal in several places in SPEC. GCC can do this on per-register basis and there is an option to turn this off. By using this flag in gcc we determined that this can make up to 2% difference in dynamic instruction count on gcc benchmark and some smaller impacts in other benchmarks.

Instead of trying to improve shrink-wrapping in llvm, I decided to try this (crazy) idea since it was pretty easy to implement this prototype. I disabled this for functions which can unwind, because this is just an experiment to see if this is viable. Thus all c++ benchmarks that have exceptions were not affected. I got 1.7% improvement on gcc and 1.7% degradation on xz in dynamic instruciton count.

Do you think it's worth pursuing this idea further? To fix the degradations, we need to improve register allocator / coalescing which may have more impact than just improving shrink wrapping.

topperc · 2024-05-07T22:12:23Z

Will this still work with -msave-restore and cm.push/pop?

topperc · 2024-05-07T22:20:07Z

I fought some old post about improving shrink wrapping. https://lists.llvm.org/pipermail/llvm-dev/2017-August/116137.html not sure how much of it was implemented.

CC @francisvm

mshockwave · 2024-05-07T23:01:34Z

llvm/lib/Target/RISCV/RISCVISelLowering.cpp

+      EligibleRegs.push_back(CSRegs[i]);
+  }
+
+  dbgs() << "EligibleRegs: " << EligibleRegs.size() << "\n";


please wrap this with LLVM_DEBUG.

mshockwave · 2024-05-07T23:06:14Z

Does this update any existing tests?

topperc · 2024-05-07T23:07:37Z

Does this update any existing tests?

It's gated by a command line flag that isn't enabled by default.

francisvm · 2024-05-08T00:00:31Z

I fought some old post about improving shrink wrapping. https://lists.llvm.org/pipermail/llvm-dev/2017-August/116137.html not sure how much of it was implemented.

CC @francisvm

Not much of it was implemented in the end. It turned out that for us, the compile-time impact, the unwinding impact and the various regressions were not worth the speedup in the few benchmarks. I wish I had the old data to justify this, but that was a while ago. The unwinder impact was more significant than I initially thought at that time, since Apple platforms rely heavily on compact unwinding, therefore missing some frames and falling back to eh/debug_frame CFI is a huge penalty, and in some memory/permission-constrained environments, very very hard, or impossible for those that only look at prologues/epilogues to quickly unwind (some crash reporters, for example). That made enabling it by default a hard task, and we never pursued an attribute/flag-based approach.

mgudim · 2024-05-09T14:34:46Z

Will this still work with -msave-restore and cm.push/pop?

I think so. All it does is it makes a callee-saved register a spill, so this should work. I'll run spec with these options to check. If we need to optimize for size, than we can just turn this off with a flag.

mgudim · 2024-05-09T14:38:37Z

At this point I just want to understand if I should continue with this approach or not. If "yes", I'll investigate the degradation on xz, figure out what goes wrong if function unwinds and add tests.

topperc · 2024-05-21T19:59:55Z

At this point I just want to understand if I should continue with this approach or not. If "yes", I'll investigate the degradation on xz, figure out what goes wrong if function unwinds and add tests.

I think you should continue.

I'm also curious how this interacts with -fno-omit-frame-pointer which makes the X8 register reserved, but its still in the callee save list.

wangpc-pp · 2024-05-22T05:33:59Z

I support to continue to investigate where we can reach via this approach, I think this can be a good workaroud to enhance "shrink wrapping". Thanks for your great idea and I will do some evaluations too.

mgudim · 2024-06-07T23:42:26Z

Hi all, just just wanted to give an update on this:

I looked at the xz degradation:

Half of it is due to the fact that since now we have some long live ranges, early-machinelicm doesn't hoist as well as before. This is easy to fix though. If the register pressure for some class exceeds the limit then there will be a spill. If there are live ranges that are live-through entire loop without any uses inside the loop, then these ranges will be spilled first. So basically, I fixed this by ignoring such live-through-no-uses-inside-loop live ranges in InitRegPressure.

The other problem is that RA seems to be doing something wrong and I got some extra spills. I hope that can be fixed too.

The biggest problem is that we need to emit CFI directives that tell runtime how to restore ALL callee-saved registers. With my approach I don't know what is the best way to do this. I don't know this CFI stuff very well, so I am hoping someone can give an advise / suggestion. I can only think of having a pass at the very end of the pipeline which tracks what happens to callee-saved registers and emits the directives accordingly. BTW, there is something which looks relevant in CFIFixup and CFIInstrInserter, but I haven't looked closely yet.

@topperc @preames @mshockwave @wangpc-pp @asb any suggestions how to do this?

wangpc-pp · 2024-06-17T07:14:30Z

How will GCC handle these CFIs?

qcolombet · 2024-06-20T10:24:57Z

Hi @mgudim

At this point I just want to understand if I should continue with this approach or not. If "yes", I'll investigate the degradation on xz, figure out what goes wrong if function unwinds and add tests.

The approach is not crazy at all!
In fact, I believe LLVM current approach is the crazy one :).
I believe you will have a bunch of regressions as it stands though. Up to you to pursue.
Since you're doing that only for 1 target and not all of them, you may have an easier time to track and fix the regression (as opposed to changing RA for all targets at once.)

FWIW, I believe the whole CSR distinction in LLVM is wrong and the proper way to handle them would be to issue parallel copies in the entry block and exit blocks right during ISel (at the very beginning).
In a sense what you are doing is reproducing exactly that, although you do not use parallel copies.

As you saw, this change in representation creates regression because the CSR distinction is baked left and right in the backend. So we can fix them as we find them, but there'll be a long tail.

One remark, since you're not using parallel copies, you're creating artificial interferences and that can have a cascading effect. (See below.) Even if you were to use parallel copies (in LLVM that would be bundles of copies) that wouldn't work out-of-the-box either since this constructs is not a first class citizen.

The TL;DR is you're testing the RA heuristic in ways that has not been exercised and some things will get better, some things will get worse.

Example of issue with CSR using sequential copies

v1 = COPY csr1
v2 = COPY csr2
v3 = COPY csr3

If for some reason csr3 gets used for something else, you are at the mercy of the RA ordering for the coalescing. I.e., worst case, you may end up with:

some_reg = COPY csr1
csr1 = COPY csr2
csr2 = COPY csr3

instead of

csr1 = COPY csr1 # no-op
csr2 = COPY csr2 # no-op
some_reg = COPY csr3

mgudim · 2024-06-25T20:53:51Z

@qcolombet

Hi Quentin,

Thank you for the comments. After I looked into some degradations, I see that you are exactly right!

proper way to handle them would be to issue parallel copies

Interesting, you mean we should put all the copies into one instruction bundle?

As you saw, this change in representation creates regression because the CSR distinction is baked left and right in the backend. So we can fix them as we find them, but there'll be a long tail.

In particular in the register allocator I see that CSRs are treated differently: since there is some cost to use CSR for the first time, there is some logic trying to minimize these costs. In this approach, I think this logic is working against us.

Here's another problem that I saw: when trying to allocate for the live range of x1 (the return-address register on RISCV), the allocator thinks its a good idea to evict previously allocated CSR's live range (even if it means breaking a hint). This eventually leads to many unnecessary spills. Looks like this is exactly the situation you described here:

Example of issue with CSR using sequential copies

v1 = COPY csr1
v2 = COPY csr2
v3 = COPY csr3
If for some reason csr3 gets used for something else, you are at the mercy of the RA ordering for the coalescing. I.e., worst case, you may end up with:

some_reg = COPY csr1
csr1 = COPY csr2
csr2 = COPY csr3
instead of

csr1 = COPY csr1 # no-op
csr2 = COPY csr2 # no-op
some_reg = COPY csr3

I have tried some hacks which basically disable eviction if it breaks hints and they make the problem go away, but I am not sure what the proper solution to this should be, maybe we can discuss it during your office hour.

qcolombet · 2024-06-26T08:11:59Z

@qcolombet

Hi Quentin,

Thank you for the comments. After I looked into some degradations, I see that you are exactly right!

proper way to handle them would be to issue parallel copies

Interesting, you mean we should put all the copies into one instruction bundle?

Yes, but that will not be enough because RA needs to be taught how to break the bundle if some of the copies need to be spilled.

As you saw, this change in representation creates regression because the CSR distinction is baked left and right in the backend. So we can fix them as we find them, but there'll be a long tail.

In particular in the register allocator I see that CSRs are treated differently: since there is some cost to use CSR for the first time, there is some logic trying to minimize these costs. In this approach, I think this logic is working against us.

Not surprised and that's what I was referring to when I was saying that CSRs are treated differently left and right.

Here's another problem that I saw: when trying to allocate for the live range of x1 (the return-address register on RISCV), the allocator thinks its a good idea to evict previously allocated CSR's live range (even if it means breaking a hint). This eventually leads to many unnecessary spills. Looks like this is exactly the situation you described here:

Example of issue with CSR using sequential copies

v1 = COPY csr1 v2 = COPY csr2 v3 = COPY csr3 If for some reason csr3 gets used for something else, you are at the mercy of the RA ordering for the coalescing. I.e., worst case, you may end up with:

some_reg = COPY csr1 csr1 = COPY csr2 csr2 = COPY csr3 instead of

csr1 = COPY csr1 # no-op csr2 = COPY csr2 # no-op some_reg = COPY csr3

I have tried some hacks which basically disable eviction if it breaks hints and they make the problem go away, but I am not sure what the proper solution to this should be, maybe we can discuss it during your office hour.

Sadly I don't think there's a good solution here. RA's decisions are currently too narrow-scoped to reason about this kind of things. The eviction chain heuristic is supposed to help with these cases, but this is a heuristic! (Plus it may be disabled by default, I don't remember.)

Long story short, if we want to go down that path, I think that a more serious overhaul of the RA infrastructure is needed.

michaelmaitland · 2024-10-30T18:48:30Z

I think something went wrong on the last push.

mgudim · 2024-10-30T19:29:26Z

I think something went wrong on the last push.

Yes, thanks! I am working on cleaning up the code and posting the branch so others can try it. Then I'll break it up into commits and try to merge the one-by-one

mgudim · 2024-11-06T17:55:05Z

I've updated the branch with the latest prototype. To try the branch, add the following flags to clang:
-mllvm -riscv-enable-save-csr-in-ra=true
mllvm -min-weight-ratio-needed-to-evict-hint=7.5 - this controls a new heuristic I added to register allocator. I've arrived at 7.5 by experiment, so other value could work better in your case.

Looking forward to your feedback. If there are no principal issues moving forward I can start upstreaming this work in small commits.

We turn the problem of saving and restoring callee-saved registers efficiently into a register allocation problem. This has the advantage that the register allocator can essentialy do shrink-wrapping on per register basis. Currently, shrink-wrapping pass saves all CSR in the same place which may be suboptimal. Also, improvements to register allocation / coalescing will translate to improvements in shrink-wrapping. In finalizeLowering() we copy all callee-saved registers from a physical register to a virtual one. In all return blocks we copy do the reverse.

michaelmaitland · 2024-11-06T19:44:58Z

What is the thought process of taking this approach over modifying the ShrinkWrap pass to communicate the PrologEpilogInserter where to put each callee saved register?

michaelmaitland

I like the idea of this patch. I left some comments as I went through it trying to understand what was going on.

michaelmaitland · 2024-11-06T21:00:27Z

llvm/include/llvm/CodeGen/ReachingDefAnalysis.h

@@ -138,6 +141,9 @@ class ReachingDefAnalysis : public MachineFunctionPass {
  DenseMap<MachineInstr *, int> InstIds;

  MBBReachingDefsInfo MBBReachingDefs;
+  using MBBFrameObjsReachingDefsInfo =
+      std::vector<std::vector<std::vector<int>>>;
+  MBBFrameObjsReachingDefsInfo MBBFrameObjsReachingDefs;


It would be nice to add a docstring that describes this object. My understanding is that this is representing a map of MBBNumber -> Frame Object Index -> reaching def instruction numbers, since it is not clear by the data type definition.

michaelmaitland · 2024-11-06T21:07:46Z

llvm/lib/CodeGen/ReachingDefAnalysis.cpp

 void ReachingDefAnalysis::enterBasicBlock(MachineBasicBlock *MBB) {
  unsigned MBBNumber = MBB->getNumber();
  assert(MBBNumber < MBBReachingDefs.numBlockIDs() &&
         "Unexpected basic block number.");
  MBBReachingDefs.startBasicBlock(MBBNumber, NumRegUnits);

+  MBBFrameObjsReachingDefs[MBBNumber].resize(NumStackObjects);
+  for (unsigned FOIdx = 0; FOIdx < NumStackObjects; ++FOIdx) {
+    MBBFrameObjsReachingDefs[MBBNumber][FOIdx].push_back(-1);


Avoid initializing with -1 for all elements here by returning LatestDef when MBBFrameObjsReachingDefs[MBBNumber][FOIdx].empty() below?

michaelmaitland · 2024-11-06T21:15:32Z

llvm/include/llvm/CodeGen/ReachingDefAnalysis.h

  LoopTraversal::TraversalOrder TraversedMBBOrder;
  unsigned NumRegUnits = 0;
+  unsigned NumStackObjects = 0;


I think NumStackObjects and ObjectIndexBegin are available in enterBasicBlock through MBB->getParent()->getFrameInfo().getNumObjects() and MBB->getParent()->getFrameInfo().getObjectIndexBegin().

We could probably avoid storing them as member variables in ReachingDefAnalysis to keep the object smaller.

michaelmaitland · 2024-11-06T21:25:25Z

llvm/lib/CodeGen/PrologEpilogInserter.cpp

@@ -262,6 +267,7 @@ bool PEI::runOnMachineFunction(MachineFunction &MF) {
  // called functions.  Because of this, calculateCalleeSavedRegisters()
  // must be called before this function in order to set the AdjustsStack
  // and MaxCallFrameSize variables.
+  RDA->reset();


Why is a call to reset() needed? Can it be dirty into this function or after the getAnalysis<ReachingDefAnalysis>() above? If it is needed, could we have done this instead in case the function had the Naked attribute:

if (!F.hasFnAttribute(Attribute::Naked)) { RDA->reset(); insertPrologEpilogCode(MF); }

michaelmaitland · 2024-11-06T21:28:24Z

llvm/lib/CodeGen/PrologEpilogInserter.cpp

@@ -1164,6 +1170,7 @@ void PEI::calculateFrameObjectOffsets(MachineFunction &MF) {
 void PEI::insertPrologEpilogCode(MachineFunction &MF) {
  const TargetFrameLowering &TFI = *MF.getSubtarget().getFrameLowering();

+  TFI.emitCFIsForCSRsHandledByRA(MF, RDA);


It might be nice to add a comment explaining that CSRs that are handled by RA still need CFI information and emitPrologue does not emit CFI info for those, since the CSRs may no longer be located in the prologue.

As a separate note, RISCVFrameLowering::emitPrologue has code to iterate over list of callee-saved registers and emit .cfi_offset directive. Can you help me understand how we know to skip over those instructions?

Only the registers added by RISCVFrameLowering::determineMustCalleeSaves will be handled the same way as before (with usual prologue / epilogue).

michaelmaitland · 2024-11-06T21:58:29Z

llvm/lib/CodeGen/PrologEpilogInserter.cpp

@@ -1164,6 +1170,7 @@ void PEI::calculateFrameObjectOffsets(MachineFunction &MF) {
 void PEI::insertPrologEpilogCode(MachineFunction &MF) {
  const TargetFrameLowering &TFI = *MF.getSubtarget().getFrameLowering();

+  TFI.emitCFIsForCSRsHandledByRA(MF, RDA);


Why is the CFIInserter pass needed if we are inserting CFI information here?

suppose we had the following code (before block layout is finalized):

ENTRY: save x1 .cfi_offset x1, ... ... beq ... %RET BB: ... RET: ld x1, ... .cfi_register x1, x1 ... ret

after the block layout the code may look like this (now %RET is layout successor of %ENTRY):

ENTRY: save x1 .cfi_offset x1, ... ... bne ... %BB RET: ld x1, ... .cfi_register x1, x1 BB: ... j %RET ... ret

The cfi directives apply to all the code below it in the layout, so the cfi's are wrong now. The CFIInserter compares the CFI info from the layout successor to what it has to be and corrects it if necessary.

I basically copied that pass from the existing one CodeGen/ CFIInstrInserter.cpp and modified it.

I basically copied that pass from the existing one CodeGen/CFIInstrInserter.cpp and modified it.

Is it possible to modify CodeGen/CFIInstrInserter.cpp instead of having a RISC-V specific copy? Maybe using target hooks where necessary.

Yes, it's possible. In some cases (notably when we have vector spills), we need to use cfi expression instead of simple .cfi_offset. I can add a small cfi-expression decoder to CFIInserter for that.

Here, I recorded cfi info into RISCVMachineFunctionInfo so it can be used by RISCVCFIInserter.

michaelmaitland · 2024-11-06T22:00:33Z

llvm/lib/Target/RISCV/RISCVISelLowering.cpp

@@ -21874,6 +21874,108 @@ bool RISCVTargetLowering::fallBackToDAGISel(const Instruction &Inst) const {
  return false;
 }

+static MachineInstr *findInstrWhichNeedAllCSRs(MachineBasicBlock &MBB) {
+  // Some instructions may require (implicitly) all CSRs to be saved.


Is it possible that there are other functions in addition to these functions that need all CSRs to be saved? Is it possible that there can be multiple MI in MBB that need this kind of handling?

Doesn't callsUnwindInit need all CSR saved too?

Is it possible that there are other functions in addition to these functions that need all CSRs to be saved? Is it possible that there can be multiple MI in MBB that need this kind of handling?

It's possible, I only added the ones I know.

Do we need to be worried about not handling cases that we don't currently know about? Do we need to solve this problem more generally?

I don't know how we can solve this more generally. Maybe we can introduce a new attribute?

michaelmaitland · 2024-11-06T22:07:21Z

llvm/lib/Target/RISCV/RISCVISelLowering.cpp

+      // instruction");
+      //++MII;
+      MachineBasicBlock::iterator MII = MBB.getFirstTerminator();
+      MachineInstr *NewRetMI = BuildMI(MBB, MII, CallToCxaThrow->getDebugLoc(),


The introduction of UnreachableRET is interesting to me. What is the purpose of it? Why is it relevant for this work? Is it useful in any other instances where we have a return after a call to a function marked noreturn?

This is a hack! I just wanted to trigger CSR saves. I hope there are better ways to do this.

michaelmaitland · 2024-11-06T22:08:27Z

llvm/lib/Target/RISCV/RISCVISelLowering.cpp

+  SmallVector<MachineBasicBlock *, 4> SaveMBBs;
+  SaveMBBs.push_back(&MF.front());
+  for (MachineBasicBlock &MBB : MF) {
+    if (MBB.isReturnBlock())


If it's not a return block, do we need to call findInstrWhichNeedAllCSRs and replace the terminator?

We add the new fake terminator UnreachableRET

michaelmaitland · 2024-11-06T22:13:48Z

llvm/lib/Target/RISCV/RISCVFrameLowering.cpp

  if (hasFP(MF)) {
-    SavedRegs.set(RAReg);
-    SavedRegs.set(FPReg);
+    SavedRegs.set(RISCV::X1);


What was the reason for changing this? Isn't it the same?

I didn't mean to change this, it's a typo.

mgudim · 2024-11-07T09:53:31Z

@michaelmaitland thank you for the review! I wanted this branch to be just a proof of concept (I haven't even cleaned up the commented-out code, sorry!), so others can see the full context for the subsequent smaller PRs, I will address your comments there.

mgudim · 2024-11-07T09:53:37Z

About using the ReachingDefAnalysis: I wonder if it would be better to use RDF framework for this. This seems like it is capable to do the same thing, but if we make it work for RISCV, that could be some additional gains. Here I extended ReachingDefAnalysis to work with stack slots because it was easier (for me) to understand.

mgudim · 2024-11-07T09:54:32Z

What is the thought process of taking this approach over modifying the ShrinkWrap pass to communicate the PrologEpilogInserter where to put each callee saved register?

If you ignore the CFI stuff, this approach is much easier, simplifies things (now we don't special treatment of CSRs in RA) and allows for more optimizations than just shrink-wrapping by itself. Essentially, we're exposing the code to save / restore CSRs to all the optimizations after ISel.

qcolombet

I feel like there are at least three different changes in this PR and they would need to be broken down:

The LICM changes
The RDA changes
The CSR changes for RISC-V

I think the overall direction makes sense. I'm not super happy how the CFI are handled but I haven't given it much thoughts.

Disclaimer, I haven't looked at the RISC-V changes at all.

qcolombet · 2024-12-02T10:09:37Z

llvm/include/llvm/CodeGen/TargetFrameLowering.h

@@ -210,6 +211,11 @@ class TargetFrameLowering {
  /// for noreturn nounwind functions.
  virtual bool enableCalleeSaveSkip(const MachineFunction &MF) const;

+  virtual void emitCFIsForCSRsHandledByRA(MachineFunction &MF,


We need a comment here to tell backend authors what they are supposed to do with this hook and when they should consider implementing it.

Thanks @qcolombet I'll make sure I'll add a comment when I create a separate MR for it.

qcolombet · 2024-12-02T10:09:56Z

llvm/include/llvm/CodeGen/TargetSubtargetInfo.h

@@ -317,6 +317,8 @@ class TargetSubtargetInfo : public MCSubtargetInfo {
    return false;
  }

+  virtual bool doCSRSavesInRA() const;


Ditto missing comments

mgudim · 2024-12-02T12:16:39Z

I feel like there are at least three different changes in this PR and they would need to be broken down:

The LICM changes

The RDA changes

The CSR changes for RISC-V

I think the overall direction makes sense. I'm not super happy how the CFI are handled but I haven't given it much thoughts.

Disclaimer, I haven't looked at the RISC-V changes at all.

Here's MR for RDA:

#118097

michaelmaitland · 2025-01-14T02:46:26Z

Any chance we could rebase this patch on main? It would enable me to run some more experiments. Currently having some trouble when I tried to rebase it myself.

mgudim · 2025-01-15T10:18:06Z

Any chance we could rebase this patch on main? It would enable me to run some more experiments. Currently having some trouble when I tried to rebase it myself.

@michaelmaitland I'll do it soon, sorry for delay. I had to switch to something else for now.

mgudim · 2025-01-17T12:27:13Z

Any chance we could rebase this patch on main? It would enable me to run some more experiments. Currently having some trouble when I tried to rebase it myself.

@michaelmaitland I'll do it soon, sorry for delay. I had to switch to something else for now.

I rebased and pushed to a branch in my fork (https://github.com/mgudim/llvm-project/tree/rebased_save_csr_in_ra), so we can updo the rebase if needed.

mgudim · 2025-01-21T13:31:04Z

Any chance we could rebase this patch on main? It would enable me to run some more experiments. Currently having some trouble when I tried to rebase it myself.

@michaelmaitland I'll do it soon, sorry for delay. I had to switch to something else for now.

I rebased and pushed to a branch in my fork (https://github.com/mgudim/llvm-project/tree/rebased_save_csr_in_ra), so we can updo the rebase if needed.

In my original patch there was a hack related to emission of CFI's and the CFI inserter. I am trying to get rid of that we can get rid of the RISCV CFI inserter.

michaelmaitland · 2025-01-21T13:48:44Z

In my original patch there was a hack related to emission of CFI's and the CFI inserter. I am trying to get rid of that we can get rid of the RISCV CFI inserter.

I strongly support this direction. Copying the target agnostic CFI inserter as a target specific CFI inserter is likely not a viable direction. If we need to add some target specific logic to the CFI inserter, the we should use target hooks.

mgudim · 2025-01-25T07:19:03Z

In my original patch there was a hack related to emission of CFI's and the CFI inserter. I am trying to get rid of that we can get rid of the RISCV CFI inserter.

I strongly support this direction. Copying the target agnostic CFI inserter as a target specific CFI inserter is likely not a viable direction. If we need to add some target specific logic to the CFI inserter, the we should use target hooks.

I made the patch use the main CFIInserter and got rid of the RISCVCFIInserter completely. I also ran spec for scalar and everything passed. Running with vector now.

mgudim requested review from asb, preames, mshockwave, topperc, pcwang-thead and wangpc-pp May 2, 2024 05:08

llvmbot added the backend:RISC-V label May 2, 2024

mgudim force-pushed the save_csr_in_ra branch from d83e517 to de5e1cd Compare May 2, 2024 05:26

mshockwave reviewed May 7, 2024

View reviewed changes

wangpc-pp removed the request for review from pcwang-thead May 20, 2024 04:15

mgudim mentioned this pull request Jul 10, 2024

[RegAllocEvictAdvisor] Add minimum weight ratio heuristic. #98109

Open

mgudim force-pushed the save_csr_in_ra branch from de5e1cd to 64d855a Compare October 30, 2024 18:33

mgudim closed this Nov 1, 2024

mgudim deleted the save_csr_in_ra branch November 1, 2024 12:28

mgudim restored the save_csr_in_ra branch November 6, 2024 17:26

mgudim deleted the save_csr_in_ra branch November 6, 2024 17:26

mgudim restored the save_csr_in_ra branch November 6, 2024 17:32

mgudim reopened this Nov 6, 2024

mgudim force-pushed the save_csr_in_ra branch from b89105e to ae5d2b4 Compare November 6, 2024 17:58

mgudim force-pushed the save_csr_in_ra branch from ae5d2b4 to 10e60c7 Compare November 6, 2024 18:04

michaelmaitland self-requested a review November 6, 2024 19:45

michaelmaitland reviewed Nov 6, 2024

View reviewed changes

topperc requested a review from qcolombet November 20, 2024 17:16

mgudim mentioned this pull request Nov 29, 2024

[ReachingDefAnalysis] Extend the analysis to stack objects. #118097

Merged

qcolombet reviewed Dec 2, 2024

View reviewed changes

mgudim mentioned this pull request Dec 5, 2024

[MachineLICM] Improve register pressure estimation. #118615

Open

michaelmaitland mentioned this pull request Jan 6, 2025

[RISCV][AArch64] llvm has 9-12% performance degradation compared to gcc for spec2017/500.perlbench_r #121813

Open

mgudim mentioned this pull request Feb 21, 2025

[CFIInstrInserter] Tests for scenarios not currently handled by CFIInstrInserter. #128211

Closed

[RISCV][WIP] Let RA do the CSR saves. #90819

Are you sure you want to change the base?

[RISCV][WIP] Let RA do the CSR saves. #90819

Uh oh!

Conversation

mgudim commented May 2, 2024

Uh oh!

llvmbot commented May 2, 2024

Uh oh!

github-actions bot commented May 2, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

mgudim commented May 2, 2024

Uh oh!

topperc commented May 7, 2024

Uh oh!

topperc commented May 7, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

mshockwave commented May 7, 2024

Uh oh!

topperc commented May 7, 2024

Uh oh!

francisvm commented May 8, 2024

Uh oh!

mgudim commented May 9, 2024

Uh oh!

mgudim commented May 9, 2024

Uh oh!

topperc commented May 21, 2024

Uh oh!

wangpc-pp commented May 22, 2024

Uh oh!

mgudim commented Jun 7, 2024

Uh oh!

wangpc-pp commented Jun 17, 2024

Uh oh!

qcolombet commented Jun 20, 2024

Uh oh!

mgudim commented Jun 25, 2024

Uh oh!

qcolombet commented Jun 26, 2024

Uh oh!

michaelmaitland commented Oct 30, 2024

Uh oh!

mgudim commented Oct 30, 2024

Uh oh!

mgudim commented Nov 6, 2024

Uh oh!

michaelmaitland commented Nov 6, 2024

Uh oh!

michaelmaitland left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

mgudim Nov 7, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

michaelmaitland Nov 7, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

github-actions bot commented May 2, 2024 •

edited

Loading

topperc commented May 7, 2024 •

edited

Loading

mgudim Nov 7, 2024 •

edited

Loading

michaelmaitland Nov 7, 2024 •

edited

Loading

mgudim commented Nov 7, 2024 •

edited

Loading