Skip to content

[RISCV][WIP] Let RA do the CSR saves. #90819

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 2 commits into
base: main
Choose a base branch
from
Open

Conversation

mgudim
Copy link
Contributor

@mgudim mgudim commented May 2, 2024

We turn the problem of saving and restoring callee-saved registers efficiently into a register allocation problem. This has the advantage that the register allocator can essentialy do shrink-wrapping on per register basis. Currently, shrink-wrapping pass saves all CSR in the same place which may be suboptimal. Also, improvements to register allocation / coalescing will translate to improvements in shrink-wrapping.

In finalizeLowering() we copy all callee-saved registers from a physical register to a virtual one. In all return blocks we copy do the reverse.

@llvmbot
Copy link
Member

llvmbot commented May 2, 2024

@llvm/pr-subscribers-backend-risc-v

Author: Mikhail Gudim (mgudim)

Changes

We turn the problem of saving and restoring callee-saved registers efficiently into a register allocation problem. This has the advantage that the register allocator can essentialy do shrink-wrapping on per register basis. Currently, shrink-wrapping pass saves all CSR in the same place which may be suboptimal. Also, improvements to register allocation / coalescing will translate to improvements in shrink-wrapping.

In finalizeLowering() we copy all callee-saved registers from a physical register to a virtual one. In all return blocks we copy do the reverse.


Full diff: https://github.com/llvm/llvm-project/pull/90819.diff

6 Files Affected:

  • (modified) llvm/lib/Target/RISCV/RISCVFrameLowering.cpp (+57-6)
  • (modified) llvm/lib/Target/RISCV/RISCVFrameLowering.h (+1)
  • (modified) llvm/lib/Target/RISCV/RISCVISelLowering.cpp (+77)
  • (modified) llvm/lib/Target/RISCV/RISCVISelLowering.h (+2)
  • (modified) llvm/lib/Target/RISCV/RISCVSubtarget.cpp (+9)
  • (modified) llvm/lib/Target/RISCV/RISCVSubtarget.h (+2)
diff --git a/llvm/lib/Target/RISCV/RISCVFrameLowering.cpp b/llvm/lib/Target/RISCV/RISCVFrameLowering.cpp
index cb41577c5d9435..b725bfb56389bc 100644
--- a/llvm/lib/Target/RISCV/RISCVFrameLowering.cpp
+++ b/llvm/lib/Target/RISCV/RISCVFrameLowering.cpp
@@ -1026,12 +1026,51 @@ RISCVFrameLowering::getFrameIndexReference(const MachineFunction &MF, int FI,
   return Offset;
 }
 
-void RISCVFrameLowering::determineCalleeSaves(MachineFunction &MF,
-                                              BitVector &SavedRegs,
-                                              RegScavenger *RS) const {
-  TargetFrameLowering::determineCalleeSaves(MF, SavedRegs, RS);
-  // Unconditionally spill RA and FP only if the function uses a frame
-  // pointer.
+void RISCVFrameLowering::determineMustCalleeSaves(MachineFunction &MF,
+                                              BitVector &SavedRegs) const {
+  const TargetRegisterInfo &TRI = *MF.getSubtarget().getRegisterInfo();
+
+  // Resize before the early returns. Some backends expect that
+  // SavedRegs.size() == TRI.getNumRegs() after this call even if there are no
+  // saved registers.
+  SavedRegs.resize(TRI.getNumRegs());
+
+  // When interprocedural register allocation is enabled caller saved registers
+  // are preferred over callee saved registers.
+  if (MF.getTarget().Options.EnableIPRA &&
+      isSafeForNoCSROpt(MF.getFunction()) &&
+      isProfitableForNoCSROpt(MF.getFunction()))
+    return;
+
+  // Get the callee saved register list...
+  const MCPhysReg *CSRegs = MF.getRegInfo().getCalleeSavedRegs();
+
+  // Early exit if there are no callee saved registers.
+  if (!CSRegs || CSRegs[0] == 0)
+    return;
+
+  // In Naked functions we aren't going to save any registers.
+  if (MF.getFunction().hasFnAttribute(Attribute::Naked))
+    return;
+
+  // Noreturn+nounwind functions never restore CSR, so no saves are needed.
+  // Purely noreturn functions may still return through throws, so those must
+  // save CSR for caller exception handlers.
+  //
+  // If the function uses longjmp to break out of its current path of
+  // execution we do not need the CSR spills either: setjmp stores all CSRs
+  // it was called with into the jmp_buf, which longjmp then restores.
+  if (MF.getFunction().hasFnAttribute(Attribute::NoReturn) &&
+        MF.getFunction().hasFnAttribute(Attribute::NoUnwind) &&
+        !MF.getFunction().hasFnAttribute(Attribute::UWTable) &&
+        enableCalleeSaveSkip(MF))
+    return;
+
+  // Functions which call __builtin_unwind_init get all their registers saved.
+  if (MF.callsUnwindInit()) {
+    SavedRegs.set();
+    return;
+  }
   if (hasFP(MF)) {
     SavedRegs.set(RISCV::X1);
     SavedRegs.set(RISCV::X8);
@@ -1041,6 +1080,18 @@ void RISCVFrameLowering::determineCalleeSaves(MachineFunction &MF,
     SavedRegs.set(RISCVABI::getBPReg());
 }
 
+void RISCVFrameLowering::determineCalleeSaves(MachineFunction &MF,
+                                              BitVector &SavedRegs,
+                                              RegScavenger *RS) const {
+  const auto &ST = MF.getSubtarget<RISCVSubtarget>();
+  const Function &F = MF.getFunction();
+  determineMustCalleeSaves(MF, SavedRegs);
+  if (ST.doCSRSavesInRA() && F.doesNotThrow())
+    return;
+
+  TargetFrameLowering::determineCalleeSaves(MF, SavedRegs, RS);
+}
+
 std::pair<int64_t, Align>
 RISCVFrameLowering::assignRVVStackObjectOffsets(MachineFunction &MF) const {
   MachineFrameInfo &MFI = MF.getFrameInfo();
diff --git a/llvm/lib/Target/RISCV/RISCVFrameLowering.h b/llvm/lib/Target/RISCV/RISCVFrameLowering.h
index 28ab4aff3b9d51..d7b9df8bd68515 100644
--- a/llvm/lib/Target/RISCV/RISCVFrameLowering.h
+++ b/llvm/lib/Target/RISCV/RISCVFrameLowering.h
@@ -31,6 +31,7 @@ class RISCVFrameLowering : public TargetFrameLowering {
   StackOffset getFrameIndexReference(const MachineFunction &MF, int FI,
                                      Register &FrameReg) const override;
 
+  void determineMustCalleeSaves(MachineFunction &MF, BitVector &SavedRegs) const;
   void determineCalleeSaves(MachineFunction &MF, BitVector &SavedRegs,
                             RegScavenger *RS) const override;
 
diff --git a/llvm/lib/Target/RISCV/RISCVISelLowering.cpp b/llvm/lib/Target/RISCV/RISCVISelLowering.cpp
index 19ef1f2f18ec1a..7978dac4aa7944 100644
--- a/llvm/lib/Target/RISCV/RISCVISelLowering.cpp
+++ b/llvm/lib/Target/RISCV/RISCVISelLowering.cpp
@@ -21314,6 +21314,83 @@ unsigned RISCVTargetLowering::getCustomCtpopCost(EVT VT,
   return isCtpopFast(VT) ? 0 : 1;
 }
 
+void RISCVTargetLowering::finalizeLowering(MachineFunction &MF) const {
+  const Function &F = MF.getFunction();
+  if (!Subtarget.doCSRSavesInRA() || !F.doesNotThrow()) { 
+    TargetLoweringBase::finalizeLowering(MF);
+    return;
+  }
+
+  MachineRegisterInfo &MRI = MF.getRegInfo();
+  const TargetInstrInfo &TII = *MF.getSubtarget().getInstrInfo();
+  const RISCVRegisterInfo &TRI = *Subtarget.getRegisterInfo();
+  const RISCVFrameLowering &TFI = *Subtarget.getFrameLowering();
+
+  SmallVector<MachineBasicBlock *, 4> RestoreMBBs;
+  SmallVector<MachineBasicBlock *, 4> SaveMBBs;
+  SaveMBBs.push_back(&MF.front());
+  for (MachineBasicBlock &MBB : MF) {
+    if (MBB.isReturnBlock())
+      RestoreMBBs.push_back(&MBB);
+  }
+
+  BitVector MustCalleeSavedRegs;
+  TFI.determineMustCalleeSaves(MF, MustCalleeSavedRegs);
+  const MCPhysReg * CSRegs = MF.getRegInfo().getCalleeSavedRegs();
+  SmallVector<MCPhysReg, 4> EligibleRegs;
+  for (int i = 0; CSRegs[i]; ++i) {
+    if (!MustCalleeSavedRegs[i])
+      EligibleRegs.push_back(CSRegs[i]);
+  }
+
+  dbgs() << "EligibleRegs: " << EligibleRegs.size() << "\n";
+  SmallVector<Register, 4> VRegs;
+  for (MachineBasicBlock *SaveMBB : SaveMBBs) {
+    for (MCPhysReg Reg : EligibleRegs) {
+      SaveMBB->addLiveIn(Reg);
+      // TODO: should we use Maximal register class instead?
+      Register VReg = MRI.createVirtualRegister(TRI.getMinimalPhysRegClass(Reg));
+      VRegs.push_back(VReg);
+      BuildMI(
+        *SaveMBB,
+        SaveMBB->begin(),
+        SaveMBB->findDebugLoc(SaveMBB->begin()),
+        TII.get(TargetOpcode::COPY),
+        VReg
+      )
+      .addReg(Reg);
+    }
+  }
+
+  for (MachineBasicBlock *RestoreMBB : RestoreMBBs) {
+    MachineInstr &ReturnMI = RestoreMBB->back();
+    assert(ReturnMI.isReturn() && "Expected return instruction!");
+    auto VRegI = VRegs.begin();
+    for (MCPhysReg Reg : EligibleRegs) {
+      Register VReg = *VRegI;
+      BuildMI(
+        *RestoreMBB,
+        ReturnMI.getIterator(),
+        ReturnMI.getDebugLoc(),
+        TII.get(TargetOpcode::COPY),
+        Reg
+      )
+      .addReg(VReg);
+      ReturnMI.addOperand(
+        MF,
+        MachineOperand::CreateReg(
+          Reg,
+          /*isDef=*/false,
+          /*isImplicit=*/true
+        )
+      );
+      VRegI++;
+    }
+  }
+
+  TargetLoweringBase::finalizeLowering(MF);
+}
+
 bool RISCVTargetLowering::fallBackToDAGISel(const Instruction &Inst) const {
 
   // GISel support is in progress or complete for these opcodes.
diff --git a/llvm/lib/Target/RISCV/RISCVISelLowering.h b/llvm/lib/Target/RISCV/RISCVISelLowering.h
index 78f99e70c083a7..ea1079af2ead05 100644
--- a/llvm/lib/Target/RISCV/RISCVISelLowering.h
+++ b/llvm/lib/Target/RISCV/RISCVISelLowering.h
@@ -853,6 +853,8 @@ class RISCVTargetLowering : public TargetLowering {
 
   bool fallBackToDAGISel(const Instruction &Inst) const override;
 
+  void finalizeLowering(MachineFunction &MF) const override;
+
   bool lowerInterleavedLoad(LoadInst *LI,
                             ArrayRef<ShuffleVectorInst *> Shuffles,
                             ArrayRef<unsigned> Indices,
diff --git a/llvm/lib/Target/RISCV/RISCVSubtarget.cpp b/llvm/lib/Target/RISCV/RISCVSubtarget.cpp
index d3236bb07d56d5..15476fc2d3c583 100644
--- a/llvm/lib/Target/RISCV/RISCVSubtarget.cpp
+++ b/llvm/lib/Target/RISCV/RISCVSubtarget.cpp
@@ -65,6 +65,11 @@ static cl::opt<unsigned> RISCVMinimumJumpTableEntries(
     "riscv-min-jump-table-entries", cl::Hidden,
     cl::desc("Set minimum number of entries to use a jump table on RISCV"));
 
+static cl::opt<bool> RISCVEnableSaveCSRByRA(
+    "riscv-enable-save-csr-in-ra",
+    cl::desc("Let register alloctor do csr saves/restores"),
+    cl::init(false), cl::Hidden);
+
 void RISCVSubtarget::anchor() {}
 
 RISCVSubtarget &
@@ -130,6 +135,10 @@ bool RISCVSubtarget::useConstantPoolForLargeInts() const {
   return !RISCVDisableUsingConstantPoolForLargeInts;
 }
 
+bool RISCVSubtarget::doCSRSavesInRA() const {
+  return RISCVEnableSaveCSRByRA;
+}
+
 unsigned RISCVSubtarget::getMaxBuildIntsCost() const {
   // Loading integer from constant pool needs two instructions (the reason why
   // the minimum cost is 2): an address calculation instruction and a load
diff --git a/llvm/lib/Target/RISCV/RISCVSubtarget.h b/llvm/lib/Target/RISCV/RISCVSubtarget.h
index c880c9e921e0ea..f3d8a70c0df14e 100644
--- a/llvm/lib/Target/RISCV/RISCVSubtarget.h
+++ b/llvm/lib/Target/RISCV/RISCVSubtarget.h
@@ -270,6 +270,8 @@ class RISCVSubtarget : public RISCVGenSubtargetInfo {
 
   bool useConstantPoolForLargeInts() const;
 
+  bool doCSRSavesInRA() const;
+
   // Maximum cost used for building integers, integers will be put into constant
   // pool if exceeded.
   unsigned getMaxBuildIntsCost() const;

Copy link

github-actions bot commented May 2, 2024

⚠️ C/C++ code formatter, clang-format found issues in your code. ⚠️

You can test this locally with the following command:
git-clang-format --diff 5192cb772ad58af4b557539791ff8de60ab450a3 10e60c750ec233404e86941d35c56b4f3c346d29 --extensions h,cpp -- llvm/lib/Target/RISCV/RISCVCFIInserter.cpp llvm/include/llvm/CodeGen/ReachingDefAnalysis.h llvm/include/llvm/CodeGen/TargetFrameLowering.h llvm/include/llvm/CodeGen/TargetSubtargetInfo.h llvm/lib/CodeGen/MachineLICM.cpp llvm/lib/CodeGen/PrologEpilogInserter.cpp llvm/lib/CodeGen/ReachingDefAnalysis.cpp llvm/lib/CodeGen/RegAllocEvictionAdvisor.cpp llvm/lib/CodeGen/TargetSubtargetInfo.cpp llvm/lib/Target/RISCV/RISCV.h llvm/lib/Target/RISCV/RISCVFrameLowering.cpp llvm/lib/Target/RISCV/RISCVFrameLowering.h llvm/lib/Target/RISCV/RISCVISelLowering.cpp llvm/lib/Target/RISCV/RISCVISelLowering.h llvm/lib/Target/RISCV/RISCVInstrInfo.h llvm/lib/Target/RISCV/RISCVMachineFunctionInfo.cpp llvm/lib/Target/RISCV/RISCVMachineFunctionInfo.h llvm/lib/Target/RISCV/RISCVRegisterInfo.cpp llvm/lib/Target/RISCV/RISCVSubtarget.cpp llvm/lib/Target/RISCV/RISCVSubtarget.h llvm/lib/Target/RISCV/RISCVTargetMachine.cpp
View the diff from clang-format here.
diff --git a/llvm/include/llvm/CodeGen/TargetFrameLowering.h b/llvm/include/llvm/CodeGen/TargetFrameLowering.h
index 3d0a5151cb..d63f3992cd 100644
--- a/llvm/include/llvm/CodeGen/TargetFrameLowering.h
+++ b/llvm/include/llvm/CodeGen/TargetFrameLowering.h
@@ -34,7 +34,7 @@ namespace llvm {
     WasmLocal = 3,
     NoAlloc = 255
   };
-}
+  }
 
 /// Information about stack frame layout on the target.  It holds the direction
 /// of stack growth, the known stack alignment on entry to each function, and
diff --git a/llvm/lib/Target/RISCV/RISCVCFIInserter.cpp b/llvm/lib/Target/RISCV/RISCVCFIInserter.cpp
index 24017c50df..b1a4bb050b 100644
--- a/llvm/lib/Target/RISCV/RISCVCFIInserter.cpp
+++ b/llvm/lib/Target/RISCV/RISCVCFIInserter.cpp
@@ -199,7 +199,8 @@ void RISCVCFIInstrInserter::calculateCFAInfo(MachineFunction &MF) {
   updateSuccCFAInfo(MBBVector[MF.front().getNumber()]);
 
   LLVM_DEBUG(dbgs() << "Calculated CFI info for " << MF.getName() << "\n";
-             for (MachineBasicBlock &MBB : MF) {
+             for (MachineBasicBlock &MBB
+                  : MF) {
                dbgs() << "BasicBlock: " << MBB.getNumber() << " "
                       << MBB.getName() << "\n";
                dbgs() << "IncomingCSRLocations:\n";

@mgudim
Copy link
Contributor Author

mgudim commented May 2, 2024

Currently the shrink wrapping saves / restores all registers at the same point. We noticed that this is not optimal in several places in SPEC. GCC can do this on per-register basis and there is an option to turn this off. By using this flag in gcc we determined that this can make up to 2% difference in dynamic instruction count on gcc benchmark and some smaller impacts in other benchmarks.

Instead of trying to improve shrink-wrapping in llvm, I decided to try this (crazy) idea since it was pretty easy to implement this prototype. I disabled this for functions which can unwind, because this is just an experiment to see if this is viable. Thus all c++ benchmarks that have exceptions were not affected. I got 1.7% improvement on gcc and 1.7% degradation on xz in dynamic instruciton count.

Do you think it's worth pursuing this idea further? To fix the degradations, we need to improve register allocator / coalescing which may have more impact than just improving shrink wrapping.

@mgudim mgudim force-pushed the save_csr_in_ra branch from d83e517 to de5e1cd Compare May 2, 2024 05:26
@topperc
Copy link
Collaborator

topperc commented May 7, 2024

Will this still work with -msave-restore and cm.push/pop?

@topperc
Copy link
Collaborator

topperc commented May 7, 2024

I fought some old post about improving shrink wrapping. https://lists.llvm.org/pipermail/llvm-dev/2017-August/116137.html not sure how much of it was implemented.

CC @francisvm

EligibleRegs.push_back(CSRegs[i]);
}

dbgs() << "EligibleRegs: " << EligibleRegs.size() << "\n";
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

please wrap this with LLVM_DEBUG.

@mshockwave
Copy link
Member

Does this update any existing tests?

@topperc
Copy link
Collaborator

topperc commented May 7, 2024

Does this update any existing tests?

It's gated by a command line flag that isn't enabled by default.

@francisvm
Copy link
Collaborator

I fought some old post about improving shrink wrapping. https://lists.llvm.org/pipermail/llvm-dev/2017-August/116137.html not sure how much of it was implemented.

CC @francisvm

Not much of it was implemented in the end. It turned out that for us, the compile-time impact, the unwinding impact and the various regressions were not worth the speedup in the few benchmarks. I wish I had the old data to justify this, but that was a while ago. The unwinder impact was more significant than I initially thought at that time, since Apple platforms rely heavily on compact unwinding, therefore missing some frames and falling back to eh/debug_frame CFI is a huge penalty, and in some memory/permission-constrained environments, very very hard, or impossible for those that only look at prologues/epilogues to quickly unwind (some crash reporters, for example). That made enabling it by default a hard task, and we never pursued an attribute/flag-based approach.

@mgudim
Copy link
Contributor Author

mgudim commented May 9, 2024

Will this still work with -msave-restore and cm.push/pop?

I think so. All it does is it makes a callee-saved register a spill, so this should work. I'll run spec with these options to check. If we need to optimize for size, than we can just turn this off with a flag.

@mgudim
Copy link
Contributor Author

mgudim commented May 9, 2024

At this point I just want to understand if I should continue with this approach or not. If "yes", I'll investigate the degradation on xz, figure out what goes wrong if function unwinds and add tests.

@wangpc-pp wangpc-pp removed the request for review from pcwang-thead May 20, 2024 04:15
@topperc
Copy link
Collaborator

topperc commented May 21, 2024

At this point I just want to understand if I should continue with this approach or not. If "yes", I'll investigate the degradation on xz, figure out what goes wrong if function unwinds and add tests.

I think you should continue.

I'm also curious how this interacts with -fno-omit-frame-pointer which makes the X8 register reserved, but its still in the callee save list.

@wangpc-pp
Copy link
Contributor

I support to continue to investigate where we can reach via this approach, I think this can be a good workaroud to enhance "shrink wrapping". Thanks for your great idea and I will do some evaluations too.

@mgudim
Copy link
Contributor Author

mgudim commented Jun 7, 2024

Hi all, just just wanted to give an update on this:

I looked at the xz degradation:

Half of it is due to the fact that since now we have some long live ranges, early-machinelicm doesn't hoist as well as before. This is easy to fix though. If the register pressure for some class exceeds the limit then there will be a spill. If there are live ranges that are live-through entire loop without any uses inside the loop, then these ranges will be spilled first. So basically, I fixed this by ignoring such live-through-no-uses-inside-loop live ranges in InitRegPressure.

The other problem is that RA seems to be doing something wrong and I got some extra spills. I hope that can be fixed too.

The biggest problem is that we need to emit CFI directives that tell runtime how to restore ALL callee-saved registers. With my approach I don't know what is the best way to do this. I don't know this CFI stuff very well, so I am hoping someone can give an advise / suggestion. I can only think of having a pass at the very end of the pipeline which tracks what happens to callee-saved registers and emits the directives accordingly. BTW, there is something which looks relevant in CFIFixup and CFIInstrInserter, but I haven't looked closely yet.

@topperc @preames @mshockwave @wangpc-pp @asb any suggestions how to do this?

@wangpc-pp
Copy link
Contributor

How will GCC handle these CFIs?

@qcolombet
Copy link
Collaborator

Hi @mgudim

At this point I just want to understand if I should continue with this approach or not. If "yes", I'll investigate the degradation on xz, figure out what goes wrong if function unwinds and add tests.

The approach is not crazy at all!
In fact, I believe LLVM current approach is the crazy one :).
I believe you will have a bunch of regressions as it stands though. Up to you to pursue.
Since you're doing that only for 1 target and not all of them, you may have an easier time to track and fix the regression (as opposed to changing RA for all targets at once.)

FWIW, I believe the whole CSR distinction in LLVM is wrong and the proper way to handle them would be to issue parallel copies in the entry block and exit blocks right during ISel (at the very beginning).
In a sense what you are doing is reproducing exactly that, although you do not use parallel copies.

As you saw, this change in representation creates regression because the CSR distinction is baked left and right in the backend. So we can fix them as we find them, but there'll be a long tail.

One remark, since you're not using parallel copies, you're creating artificial interferences and that can have a cascading effect. (See below.) Even if you were to use parallel copies (in LLVM that would be bundles of copies) that wouldn't work out-of-the-box either since this constructs is not a first class citizen.

The TL;DR is you're testing the RA heuristic in ways that has not been exercised and some things will get better, some things will get worse.

Example of issue with CSR using sequential copies

v1 = COPY csr1
v2 = COPY csr2
v3 = COPY csr3

If for some reason csr3 gets used for something else, you are at the mercy of the RA ordering for the coalescing. I.e., worst case, you may end up with:

some_reg = COPY csr1
csr1 = COPY csr2
csr2 = COPY csr3

instead of

csr1 = COPY csr1 # no-op
csr2 = COPY csr2 # no-op
some_reg = COPY csr3

@mgudim
Copy link
Contributor Author

mgudim commented Jun 25, 2024

@qcolombet

Hi Quentin,

Thank you for the comments. After I looked into some degradations, I see that you are exactly right!

proper way to handle them would be to issue parallel copies

Interesting, you mean we should put all the copies into one instruction bundle?

As you saw, this change in representation creates regression because the CSR distinction is baked left and right in the backend. So we can fix them as we find them, but there'll be a long tail.

In particular in the register allocator I see that CSRs are treated differently: since there is some cost to use CSR for the first time, there is some logic trying to minimize these costs. In this approach, I think this logic is working against us.

Here's another problem that I saw: when trying to allocate for the live range of x1 (the return-address register on RISCV), the allocator thinks its a good idea to evict previously allocated CSR's live range (even if it means breaking a hint). This eventually leads to many unnecessary spills. Looks like this is exactly the situation you described here:

Example of issue with CSR using sequential copies

v1 = COPY csr1
v2 = COPY csr2
v3 = COPY csr3
If for some reason csr3 gets used for something else, you are at the mercy of the RA ordering for the coalescing. I.e., worst case, you may end up with:

some_reg = COPY csr1
csr1 = COPY csr2
csr2 = COPY csr3
instead of

csr1 = COPY csr1 # no-op
csr2 = COPY csr2 # no-op
some_reg = COPY csr3

I have tried some hacks which basically disable eviction if it breaks hints and they make the problem go away, but I am not sure what the proper solution to this should be, maybe we can discuss it during your office hour.

@qcolombet
Copy link
Collaborator

@qcolombet

Hi Quentin,

Thank you for the comments. After I looked into some degradations, I see that you are exactly right!

proper way to handle them would be to issue parallel copies

Interesting, you mean we should put all the copies into one instruction bundle?

Yes, but that will not be enough because RA needs to be taught how to break the bundle if some of the copies need to be spilled.

As you saw, this change in representation creates regression because the CSR distinction is baked left and right in the backend. So we can fix them as we find them, but there'll be a long tail.

In particular in the register allocator I see that CSRs are treated differently: since there is some cost to use CSR for the first time, there is some logic trying to minimize these costs. In this approach, I think this logic is working against us.

Not surprised and that's what I was referring to when I was saying that CSRs are treated differently left and right.

Here's another problem that I saw: when trying to allocate for the live range of x1 (the return-address register on RISCV), the allocator thinks its a good idea to evict previously allocated CSR's live range (even if it means breaking a hint). This eventually leads to many unnecessary spills. Looks like this is exactly the situation you described here:

Example of issue with CSR using sequential copies

v1 = COPY csr1 v2 = COPY csr2 v3 = COPY csr3 If for some reason csr3 gets used for something else, you are at the mercy of the RA ordering for the coalescing. I.e., worst case, you may end up with:

some_reg = COPY csr1 csr1 = COPY csr2 csr2 = COPY csr3 instead of

csr1 = COPY csr1 # no-op csr2 = COPY csr2 # no-op some_reg = COPY csr3

I have tried some hacks which basically disable eviction if it breaks hints and they make the problem go away, but I am not sure what the proper solution to this should be, maybe we can discuss it during your office hour.

Sadly I don't think there's a good solution here. RA's decisions are currently too narrow-scoped to reason about this kind of things. The eviction chain heuristic is supposed to help with these cases, but this is a heuristic! (Plus it may be disabled by default, I don't remember.)

Long story short, if we want to go down that path, I think that a more serious overhaul of the RA infrastructure is needed.

@michaelmaitland
Copy link
Contributor

I think something went wrong on the last push.

@mgudim
Copy link
Contributor Author

mgudim commented Oct 30, 2024

I think something went wrong on the last push.

Yes, thanks! I am working on cleaning up the code and posting the branch so others can try it. Then I'll break it up into commits and try to merge the one-by-one

@mgudim mgudim closed this Nov 1, 2024
@mgudim mgudim deleted the save_csr_in_ra branch November 1, 2024 12:28
@mgudim mgudim restored the save_csr_in_ra branch November 6, 2024 17:26
@mgudim mgudim deleted the save_csr_in_ra branch November 6, 2024 17:26
@mgudim mgudim restored the save_csr_in_ra branch November 6, 2024 17:32
@mgudim mgudim reopened this Nov 6, 2024
@mgudim
Copy link
Contributor Author

mgudim commented Nov 6, 2024

I've updated the branch with the latest prototype. To try the branch, add the following flags to clang:
-mllvm -riscv-enable-save-csr-in-ra=true
mllvm -min-weight-ratio-needed-to-evict-hint=7.5 - this controls a new heuristic I added to register allocator. I've arrived at 7.5 by experiment, so other value could work better in your case.

Looking forward to your feedback. If there are no principal issues moving forward I can start upstreaming this work in small commits.

We turn the problem of saving and restoring callee-saved registers efficiently into a register allocation problem. This has the advantage that the register allocator can essentialy do shrink-wrapping on per register basis. Currently, shrink-wrapping pass saves all CSR in the same place which may be suboptimal. Also, improvements to register allocation / coalescing will translate to improvements in shrink-wrapping.

In finalizeLowering() we copy all callee-saved registers from a physical register to a virtual one. In all return blocks we copy do the reverse.
@michaelmaitland
Copy link
Contributor

What is the thought process of taking this approach over modifying the ShrinkWrap pass to communicate the PrologEpilogInserter where to put each callee saved register?

@michaelmaitland michaelmaitland self-requested a review November 6, 2024 19:45
Copy link
Contributor

@michaelmaitland michaelmaitland left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like the idea of this patch. I left some comments as I went through it trying to understand what was going on.

@@ -138,6 +141,9 @@ class ReachingDefAnalysis : public MachineFunctionPass {
DenseMap<MachineInstr *, int> InstIds;

MBBReachingDefsInfo MBBReachingDefs;
using MBBFrameObjsReachingDefsInfo =
std::vector<std::vector<std::vector<int>>>;
MBBFrameObjsReachingDefsInfo MBBFrameObjsReachingDefs;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would be nice to add a docstring that describes this object. My understanding is that this is representing a map of MBBNumber -> Frame Object Index -> reaching def instruction numbers, since it is not clear by the data type definition.

void ReachingDefAnalysis::enterBasicBlock(MachineBasicBlock *MBB) {
unsigned MBBNumber = MBB->getNumber();
assert(MBBNumber < MBBReachingDefs.numBlockIDs() &&
"Unexpected basic block number.");
MBBReachingDefs.startBasicBlock(MBBNumber, NumRegUnits);

MBBFrameObjsReachingDefs[MBBNumber].resize(NumStackObjects);
for (unsigned FOIdx = 0; FOIdx < NumStackObjects; ++FOIdx) {
MBBFrameObjsReachingDefs[MBBNumber][FOIdx].push_back(-1);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Avoid initializing with -1 for all elements here by returning LatestDef when MBBFrameObjsReachingDefs[MBBNumber][FOIdx].empty() below?

LoopTraversal::TraversalOrder TraversedMBBOrder;
unsigned NumRegUnits = 0;
unsigned NumStackObjects = 0;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think NumStackObjects and ObjectIndexBegin are available in enterBasicBlock through MBB->getParent()->getFrameInfo().getNumObjects() and MBB->getParent()->getFrameInfo().getObjectIndexBegin().

We could probably avoid storing them as member variables in ReachingDefAnalysis to keep the object smaller.

@@ -262,6 +267,7 @@ bool PEI::runOnMachineFunction(MachineFunction &MF) {
// called functions. Because of this, calculateCalleeSavedRegisters()
// must be called before this function in order to set the AdjustsStack
// and MaxCallFrameSize variables.
RDA->reset();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is a call to reset() needed? Can it be dirty into this function or after the getAnalysis<ReachingDefAnalysis>() above? If it is needed, could we have done this instead in case the function had the Naked attribute:

  if (!F.hasFnAttribute(Attribute::Naked)) {
    RDA->reset();
    insertPrologEpilogCode(MF);
}

@@ -1164,6 +1170,7 @@ void PEI::calculateFrameObjectOffsets(MachineFunction &MF) {
void PEI::insertPrologEpilogCode(MachineFunction &MF) {
const TargetFrameLowering &TFI = *MF.getSubtarget().getFrameLowering();

TFI.emitCFIsForCSRsHandledByRA(MF, RDA);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It might be nice to add a comment explaining that CSRs that are handled by RA still need CFI information and emitPrologue does not emit CFI info for those, since the CSRs may no longer be located in the prologue.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As a separate note, RISCVFrameLowering::emitPrologue has code to iterate over list of callee-saved registers and emit .cfi_offset directive. Can you help me understand how we know to skip over those instructions?

Copy link
Contributor Author

@mgudim mgudim Nov 7, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Only the registers added by RISCVFrameLowering::determineMustCalleeSaves will be handled the same way as before (with usual prologue / epilogue).

@@ -1164,6 +1170,7 @@ void PEI::calculateFrameObjectOffsets(MachineFunction &MF) {
void PEI::insertPrologEpilogCode(MachineFunction &MF) {
const TargetFrameLowering &TFI = *MF.getSubtarget().getFrameLowering();

TFI.emitCFIsForCSRsHandledByRA(MF, RDA);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is the CFIInserter pass needed if we are inserting CFI information here?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

suppose we had the following code (before block layout is finalized):

ENTRY:
save x1
.cfi_offset x1, ...
...
beq ... %RET

BB:
...

RET:
ld x1, ...
.cfi_register x1, x1
...
ret

after the block layout the code may look like this (now %RET is layout successor of %ENTRY):

ENTRY:
save x1
.cfi_offset x1, ...
...
bne ... %BB

RET:
ld x1, ...
.cfi_register x1, x1

BB:
...
j %RET

...
ret

The cfi directives apply to all the code below it in the layout, so the cfi's are wrong now. The CFIInserter compares the CFI info from the layout successor to what it has to be and corrects it if necessary.

I basically copied that pass from the existing one CodeGen/ CFIInstrInserter.cpp and modified it.

Copy link
Contributor

@michaelmaitland michaelmaitland Nov 7, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I basically copied that pass from the existing one CodeGen/CFIInstrInserter.cpp and modified it.

Is it possible to modify CodeGen/CFIInstrInserter.cpp instead of having a RISC-V specific copy? Maybe using target hooks where necessary.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, it's possible. In some cases (notably when we have vector spills), we need to use cfi expression instead of simple .cfi_offset. I can add a small cfi-expression decoder to CFIInserter for that.

Here, I recorded cfi info into RISCVMachineFunctionInfo so it can be used by RISCVCFIInserter.

@@ -21874,6 +21874,108 @@ bool RISCVTargetLowering::fallBackToDAGISel(const Instruction &Inst) const {
return false;
}

static MachineInstr *findInstrWhichNeedAllCSRs(MachineBasicBlock &MBB) {
// Some instructions may require (implicitly) all CSRs to be saved.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it possible that there are other functions in addition to these functions that need all CSRs to be saved? Is it possible that there can be multiple MI in MBB that need this kind of handling?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Doesn't callsUnwindInit need all CSR saved too?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it possible that there are other functions in addition to these functions that need all CSRs to be saved? Is it possible that there can be multiple MI in MBB that need this kind of handling?

It's possible, I only added the ones I know.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need to be worried about not handling cases that we don't currently know about? Do we need to solve this problem more generally?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't know how we can solve this more generally. Maybe we can introduce a new attribute?

// instruction");
//++MII;
MachineBasicBlock::iterator MII = MBB.getFirstTerminator();
MachineInstr *NewRetMI = BuildMI(MBB, MII, CallToCxaThrow->getDebugLoc(),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The introduction of UnreachableRET is interesting to me. What is the purpose of it? Why is it relevant for this work? Is it useful in any other instances where we have a return after a call to a function marked noreturn?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a hack! I just wanted to trigger CSR saves. I hope there are better ways to do this.

SmallVector<MachineBasicBlock *, 4> SaveMBBs;
SaveMBBs.push_back(&MF.front());
for (MachineBasicBlock &MBB : MF) {
if (MBB.isReturnBlock())
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If it's not a return block, do we need to call findInstrWhichNeedAllCSRs and replace the terminator?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We add the new fake terminator UnreachableRET

if (hasFP(MF)) {
SavedRegs.set(RAReg);
SavedRegs.set(FPReg);
SavedRegs.set(RISCV::X1);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What was the reason for changing this? Isn't it the same?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I didn't mean to change this, it's a typo.

@mgudim
Copy link
Contributor Author

mgudim commented Nov 7, 2024

@michaelmaitland thank you for the review! I wanted this branch to be just a proof of concept (I haven't even cleaned up the commented-out code, sorry!), so others can see the full context for the subsequent smaller PRs, I will address your comments there.

@mgudim
Copy link
Contributor Author

mgudim commented Nov 7, 2024

About using the ReachingDefAnalysis: I wonder if it would be better to use RDF framework for this. This seems like it is capable to do the same thing, but if we make it work for RISCV, that could be some additional gains. Here I extended ReachingDefAnalysis to work with stack slots because it was easier (for me) to understand.

@mgudim
Copy link
Contributor Author

mgudim commented Nov 7, 2024

What is the thought process of taking this approach over modifying the ShrinkWrap pass to communicate the PrologEpilogInserter where to put each callee saved register?

If you ignore the CFI stuff, this approach is much easier, simplifies things (now we don't special treatment of CSRs in RA) and allows for more optimizations than just shrink-wrapping by itself. Essentially, we're exposing the code to save / restore CSRs to all the optimizations after ISel.

Copy link
Collaborator

@qcolombet qcolombet left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I feel like there are at least three different changes in this PR and they would need to be broken down:

  1. The LICM changes
  2. The RDA changes
  3. The CSR changes for RISC-V

I think the overall direction makes sense. I'm not super happy how the CFI are handled but I haven't given it much thoughts.

Disclaimer, I haven't looked at the RISC-V changes at all.

@@ -210,6 +211,11 @@ class TargetFrameLowering {
/// for noreturn nounwind functions.
virtual bool enableCalleeSaveSkip(const MachineFunction &MF) const;

virtual void emitCFIsForCSRsHandledByRA(MachineFunction &MF,
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We need a comment here to tell backend authors what they are supposed to do with this hook and when they should consider implementing it.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @qcolombet I'll make sure I'll add a comment when I create a separate MR for it.

@@ -317,6 +317,8 @@ class TargetSubtargetInfo : public MCSubtargetInfo {
return false;
}

virtual bool doCSRSavesInRA() const;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ditto missing comments

@mgudim
Copy link
Contributor Author

mgudim commented Dec 2, 2024

I feel like there are at least three different changes in this PR and they would need to be broken down:

  1. The LICM changes
  2. The RDA changes
  3. The CSR changes for RISC-V

I think the overall direction makes sense. I'm not super happy how the CFI are handled but I haven't given it much thoughts.

Disclaimer, I haven't looked at the RISC-V changes at all.

Here's MR for RDA:

#118097

@michaelmaitland
Copy link
Contributor

Any chance we could rebase this patch on main? It would enable me to run some more experiments. Currently having some trouble when I tried to rebase it myself.

@mgudim
Copy link
Contributor Author

mgudim commented Jan 15, 2025

Any chance we could rebase this patch on main? It would enable me to run some more experiments. Currently having some trouble when I tried to rebase it myself.

@michaelmaitland I'll do it soon, sorry for delay. I had to switch to something else for now.

@mgudim
Copy link
Contributor Author

mgudim commented Jan 17, 2025

Any chance we could rebase this patch on main? It would enable me to run some more experiments. Currently having some trouble when I tried to rebase it myself.

@michaelmaitland I'll do it soon, sorry for delay. I had to switch to something else for now.

I rebased and pushed to a branch in my fork (https://github.com/mgudim/llvm-project/tree/rebased_save_csr_in_ra), so we can updo the rebase if needed.

@mgudim
Copy link
Contributor Author

mgudim commented Jan 21, 2025

Any chance we could rebase this patch on main? It would enable me to run some more experiments. Currently having some trouble when I tried to rebase it myself.

@michaelmaitland I'll do it soon, sorry for delay. I had to switch to something else for now.

I rebased and pushed to a branch in my fork (https://github.com/mgudim/llvm-project/tree/rebased_save_csr_in_ra), so we can updo the rebase if needed.

In my original patch there was a hack related to emission of CFI's and the CFI inserter. I am trying to get rid of that we can get rid of the RISCV CFI inserter.

@michaelmaitland
Copy link
Contributor

In my original patch there was a hack related to emission of CFI's and the CFI inserter. I am trying to get rid of that we can get rid of the RISCV CFI inserter.

I strongly support this direction. Copying the target agnostic CFI inserter as a target specific CFI inserter is likely not a viable direction. If we need to add some target specific logic to the CFI inserter, the we should use target hooks.

@mgudim
Copy link
Contributor Author

mgudim commented Jan 25, 2025

In my original patch there was a hack related to emission of CFI's and the CFI inserter. I am trying to get rid of that we can get rid of the RISCV CFI inserter.

I strongly support this direction. Copying the target agnostic CFI inserter as a target specific CFI inserter is likely not a viable direction. If we need to add some target specific logic to the CFI inserter, the we should use target hooks.

I made the patch use the main CFIInserter and got rid of the RISCVCFIInserter completely. I also ran spec for scalar and everything passed. Running with vector now.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

8 participants