
[MachineSink] Extend loop sinking capability #117247


Merged

jrbyrnes merged 12 commits into llvm:main from jrbyrnes:MIRAggressiveSinkRebase0 on Jan 24, 2025

Conversation

jrbyrnes
Contributor

The current MIR cycle-sinking capability is rather limited: it only supports sinking copies into a single successor block, and only within fixed limits.

This opt-in feature adds a more aggressive option that is not subject to those restrictions. It tries to "sink" by duplicating any top-level preheader instruction (that we are sure is safe to sink) into any user block, then performs some dead-code cleanup. In particular, this is useful in high register pressure (RP) situations where loop bodies contain control flow.
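Conceptually, the transformation looks like this (a hypothetical MIR sketch with made-up opcodes, registers, and block names, not taken from this patch):

  ; Before: %v is defined once in the preheader and stays live across the
  ; whole cycle, holding a register the entire time.
  bb.preheader:
    %v:vgpr_32 = V_MOV_B32_e32 42, implicit $exec
  bb.then:   ; inside the cycle
    %a:vgpr_32 = V_ADD_F32_e32 %v, %x, implicit $mode, implicit $exec
  bb.else:   ; inside the cycle
    %b:vgpr_32 = V_ADD_F32_e32 %v, %y, implicit $mode, implicit $exec

  ; After: the def is duplicated into each user block with a fresh vreg, the
  ; uses are rewritten, and the now-dead preheader def is erased.
  bb.then:
    %v1:vgpr_32 = V_MOV_B32_e32 42, implicit $exec
    %a:vgpr_32 = V_ADD_F32_e32 %v1, %x, implicit $mode, implicit $exec
  bb.else:
    %v2:vgpr_32 = V_MOV_B32_e32 42, implicit $exec
    %b:vgpr_32 = V_ADD_F32_e32 %v2, %y, implicit $mode, implicit $exec

The duplicated computation is the price paid for shorter live ranges, which is the right trade when the loop would otherwise spill.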

@llvmbot
Member

llvmbot commented Nov 21, 2024

@llvm/pr-subscribers-backend-systemz

@llvm/pr-subscribers-backend-amdgpu

Author: Jeffrey Byrnes (jrbyrnes)

Changes

Patch is 49.46 KiB, truncated to 20.00 KiB below, full version: https://github.com/llvm/llvm-project/pull/117247.diff

4 Files Affected:

  • (modified) llvm/lib/CodeGen/MachineSink.cpp (+184)
  • (added) llvm/test/CodeGen/AMDGPU/aggressive-loop-sink-nonstandard.ll (+20)
  • (modified) llvm/test/CodeGen/AMDGPU/machine-sink-ignorable-exec-use.mir (+360)
  • (modified) llvm/test/CodeGen/AMDGPU/machine-sink-lane-mask.mir (+139-69)
diff --git a/llvm/lib/CodeGen/MachineSink.cpp b/llvm/lib/CodeGen/MachineSink.cpp
index c470bd71dfb29f..d8dd6e8478686d 100644
--- a/llvm/lib/CodeGen/MachineSink.cpp
+++ b/llvm/lib/CodeGen/MachineSink.cpp
@@ -100,6 +100,12 @@ static cl::opt<bool>
                                 "register spills"),
                        cl::init(false), cl::Hidden);
 
+static cl::opt<bool> AggressivelySinkInstsIntoCycle(
+    "aggressively-sink-insts-to-avoid-spills",
+    cl::desc("Aggressively sink instructions into cycles to avoid "
+             "register spills"),
+    cl::init(false), cl::Hidden);
+
 static cl::opt<unsigned> SinkIntoCycleLimit(
     "machine-sink-cycle-limit",
     cl::desc(
@@ -256,6 +262,13 @@ class MachineSinking : public MachineFunctionPass {
                                SmallVectorImpl<MachineInstr *> &Candidates);
   bool SinkIntoCycle(MachineCycle *Cycle, MachineInstr &I);
 
+  bool isDead(const MachineInstr *MI) const;
+  bool AggressivelySinkIntoCycle(
+      MachineCycle *Cycle, MachineInstr &I,
+      DenseMap<MachineInstr *,
+               std::list<std::pair<MachineBasicBlock *, MachineInstr *>>>
+          SunkInstrs);
+
   bool isProfitableToSinkTo(Register Reg, MachineInstr &MI,
                             MachineBasicBlock *MBB,
                             MachineBasicBlock *SuccToSinkTo,
@@ -679,6 +692,10 @@ void MachineSinking::FindCycleSinkCandidates(
     SmallVectorImpl<MachineInstr *> &Candidates) {
   for (auto &MI : *BB) {
     LLVM_DEBUG(dbgs() << "CycleSink: Analysing candidate: " << MI);
+    if (MI.isDebugInstr()) {
+      LLVM_DEBUG(dbgs() << "CycleSink: Dont sink debug instructions\n");
+      continue;
+    }
     if (!TII->shouldSink(MI)) {
       LLVM_DEBUG(dbgs() << "CycleSink: Instruction not a candidate for this "
                            "target\n");
@@ -799,6 +816,30 @@ bool MachineSinking::runOnMachineFunction(MachineFunction &MF) {
     }
   }
 
+  if (AggressivelySinkInstsIntoCycle) {
+    SmallVector<MachineCycle *, 8> Cycles(CI->toplevel_cycles());
+    DenseMap<MachineInstr *,
+             std::list<std::pair<MachineBasicBlock *, MachineInstr *>>>
+        SunkInstrs;
+    for (auto *Cycle : Cycles) {
+      MachineBasicBlock *Preheader = Cycle->getCyclePreheader();
+      if (!Preheader) {
+        LLVM_DEBUG(dbgs() << "AggressiveCycleSink: Can't find preheader\n");
+        continue;
+      }
+      SmallVector<MachineInstr *, 8> Candidates;
+      FindCycleSinkCandidates(Cycle, Preheader, Candidates);
+
+      // Walk the candidates in reverse order so that we start with the use
+      // of a def-use chain, if there is any.
+      for (MachineInstr *I : llvm::reverse(Candidates)) {
+        AggressivelySinkIntoCycle(Cycle, *I, SunkInstrs);
+        EverMadeChange = true;
+        ++NumCycleSunk;
+      }
+    }
+  }
+
   HasStoreCache.clear();
   StoreInstrCache.clear();
 
@@ -1574,6 +1615,149 @@ bool MachineSinking::hasStoreBetween(MachineBasicBlock *From,
   return HasAliasedStore;
 }
 
+/// Copy paste from DeadMachineInstructionElimImpl
+
+bool MachineSinking::isDead(const MachineInstr *MI) const {
+  // Instructions without side-effects are dead iff they only define dead regs.
+  // This function is hot and this loop returns early in the common case,
+  // so only perform additional checks before this if absolutely necessary.
+  for (const MachineOperand &MO : MI->all_defs()) {
+    Register Reg = MO.getReg();
+    if (Reg.isPhysical()) {
+      return false;
+    } else {
+      if (MO.isDead()) {
+#ifndef NDEBUG
+        // Basic check on the register. All of them should be 'undef'.
+        for (auto &U : MRI->use_nodbg_operands(Reg))
+          assert(U.isUndef() && "'Undef' use on a 'dead' register is found!");
+#endif
+        continue;
+      }
+      for (const MachineInstr &Use : MRI->use_nodbg_instructions(Reg)) {
+        if (&Use != MI)
+          // This def has a non-debug use. Don't delete the instruction!
+          return false;
+      }
+    }
+  }
+
+  // Technically speaking inline asm without side effects and no defs can still
+  // be deleted. But there is so much bad inline asm code out there, we should
+  // let them be.
+  if (MI->isInlineAsm())
+    return false;
+
+  // FIXME: See issue #105950 for why LIFETIME markers are considered dead here.
+  if (MI->isLifetimeMarker())
+    return true;
+
+  // If there are no defs with uses, the instruction might be dead.
+  return MI->wouldBeTriviallyDead();
+}
+
+/// Aggressively sink instructions into cycles. This will aggressively try to
+/// sink all instructions in the top-most preheaders in an attempt to reduce RP.
+/// In particular, it will sink into multiple successor blocks without limits
+/// based on the amount of sinking, or the type of ops being sunk (so long as
+/// they are safe to sink).
+bool MachineSinking::AggressivelySinkIntoCycle(
+    MachineCycle *Cycle, MachineInstr &I,
+    DenseMap<MachineInstr *,
+             std::list<std::pair<MachineBasicBlock *, MachineInstr *>>>
+        SunkInstrs) {
+  LLVM_DEBUG(dbgs() << "AggressiveCycleSink: Finding sink block for: " << I);
+  MachineBasicBlock *Preheader = Cycle->getCyclePreheader();
+  assert(Preheader && "Cycle sink needs a preheader block");
+  SmallVector<std::pair<MachineOperand, MachineInstr *>> Uses;
+  // TODO: support instructions with multiple defs
+  if (I.getNumDefs() > 1)
+    return false;
+
+  MachineOperand DefMO = I.getOperand(0);
+  for (MachineInstr &MI : MRI->use_instructions(DefMO.getReg())) {
+    Uses.push_back({DefMO, &MI});
+  }
+
+  for (std::pair<MachineOperand, MachineInstr *> Entry : Uses) {
+    MachineInstr *MI = Entry.second;
+    LLVM_DEBUG(dbgs() << "AggressiveCycleSink:   Analysing use: " << MI);
+    if (MI->isPHI()) {
+      LLVM_DEBUG(
+          dbgs() << "AggressiveCycleSink:   Not attempting to sink for PHI.\n");
+      continue;
+    }
+    // We cannot sink before the prologue
+    if (TII->isBasicBlockPrologue(*MI) || MI->isPosition()) {
+      LLVM_DEBUG(dbgs() << "AggressiveCycleSink:   Use is BasicBlock prologue, "
+                           "can't sink.\n");
+      continue;
+    }
+    if (!Cycle->contains(MI->getParent())) {
+      LLVM_DEBUG(
+          dbgs() << "AggressiveCycleSink:   Use not in cycle, can't sink.\n");
+      continue;
+    }
+
+    MachineBasicBlock *SinkBlock = MI->getParent();
+    MachineInstr *NewMI = nullptr;
+
+    // Check for the case in which we have already sunk a copy of this
+    // instruction into the user block.
+    if (SunkInstrs.contains(&I)) {
+      auto SunkBlocks = SunkInstrs[&I];
+      auto Match = std::find_if(
+          SunkBlocks.begin(), SunkBlocks.end(),
+          [&SinkBlock](
+              std::pair<MachineBasicBlock *, MachineInstr *> SunkEntry) {
+            return SunkEntry.first == SinkBlock;
+          });
+      if (Match != SunkBlocks.end()) {
+        LLVM_DEBUG(dbgs() << "AggressiveCycleSink:   Already sunk to block: "
+                          << printMBBReference(*SinkBlock) << "\n");
+        NewMI = Match->second;
+      }
+    }
+
+    // Create a copy of the instruction in the use block.
+    if (!NewMI) {
+      LLVM_DEBUG(dbgs() << "AggressiveCycleSink: Sinking instruction to block: "
+                        << printMBBReference(*SinkBlock) << "\n");
+
+      NewMI = I.getMF()->CloneMachineInstr(&I);
+      if (DefMO.getReg().isVirtual()) {
+        const TargetRegisterClass *TRC = MRI->getRegClass(DefMO.getReg());
+        Register DestReg = MRI->createVirtualRegister(TRC);
+        NewMI->substituteRegister(DefMO.getReg(), DestReg, DefMO.getSubReg(),
+                                  *TRI);
+      }
+      SinkBlock->insert(SinkBlock->SkipPHIsAndLabels(SinkBlock->begin()),
+                        NewMI);
+      SunkInstrs[&I].push_back({SinkBlock, NewMI});
+    }
+
+    // Conservatively clear any kill flags on uses of sunk instruction
+    for (MachineOperand &MO : NewMI->operands()) {
+      if (MO.isReg() && MO.readsReg())
+        RegsToClearKillFlags.insert(MO.getReg());
+    }
+
+    // The instruction is moved from its basic block, so do not retain the
+    // debug information.
+    assert(!NewMI->isDebugInstr() && "Should not sink debug inst");
+    NewMI->setDebugLoc(DebugLoc());
+
+    // Replace the use with the newly created virtual register.
+    MachineOperand UseMO = Entry.first;
+    MI->substituteRegister(UseMO.getReg(), NewMI->getOperand(0).getReg(),
+                           UseMO.getSubReg(), *TRI);
+  }
+  // If we have replaced all uses, then delete the dead instruction
+  if (isDead(&I))
+    I.eraseFromParent();
+  return true;
+}
+
 /// Sink instructions into cycles if profitable. This especially tries to
 /// prevent register spills caused by register pressure if there is little to no
 /// overhead moving instructions into cycles.
diff --git a/llvm/test/CodeGen/AMDGPU/aggressive-loop-sink-nonstandard.ll b/llvm/test/CodeGen/AMDGPU/aggressive-loop-sink-nonstandard.ll
new file mode 100644
index 00000000000000..72b4495297a1c5
--- /dev/null
+++ b/llvm/test/CodeGen/AMDGPU/aggressive-loop-sink-nonstandard.ll
@@ -0,0 +1,20 @@
+; RUN: llc -mtriple=amdgcn-amd-amdhsa -mcpu=gfx942 --aggressively-sink-insts-to-avoid-spills=1  < %s | FileCheck -check-prefix=SUNK %s
+
+; Check that various edge cases do not crash the compiler
+
+; Multiple uses of sunk valu, chain of sink candidates
+
+define half @global_agent_atomic_fmin_ret_f16__amdgpu_no_fine_grained_memory(ptr addrspace(1) %ptr, half %val) {
+; SUNK-LABEL: global_agent_atomic_fmin_ret_f16__amdgpu_no_fine_grained_memory:
+  %result = atomicrmw fmin ptr addrspace(1) %ptr, half %val syncscope("agent") seq_cst
+  ret half %result
+}
+
+; Sink candidates with multiple defs
+
+define void @memmove_p5_p5(ptr addrspace(5) align 1 %dst, ptr addrspace(5) align 1 readonly %src, i64 %sz) {
+; SUNK-LABEL: memmove_p5_p5:
+entry:
+  tail call void @llvm.memmove.p5.p5.i64(ptr addrspace(5) noundef nonnull align 1 %dst, ptr addrspace(5) noundef nonnull align 1 %src, i64 %sz, i1 false)
+  ret void
+}
diff --git a/llvm/test/CodeGen/AMDGPU/machine-sink-ignorable-exec-use.mir b/llvm/test/CodeGen/AMDGPU/machine-sink-ignorable-exec-use.mir
index efa21052e3ae2f..f93d8f3dde21b6 100644
--- a/llvm/test/CodeGen/AMDGPU/machine-sink-ignorable-exec-use.mir
+++ b/llvm/test/CodeGen/AMDGPU/machine-sink-ignorable-exec-use.mir
@@ -1,5 +1,7 @@
 # NOTE: Assertions have been autogenerated by utils/update_mir_test_checks.py
 # RUN: llc -mtriple=amdgcn -mcpu=gfx908 -verify-machineinstrs -run-pass=machine-sink -o - %s | FileCheck -check-prefixes=GFX9 %s
+# RUN: llc -mtriple=amdgcn -mcpu=gfx908 -verify-machineinstrs -run-pass=machine-sink --aggressively-sink-insts-to-avoid-spills=1 -o - %s | FileCheck -check-prefixes=GFX9-SUNK %s
+
 
 ---
 name:            test_sink_fmac_to_only_use
@@ -48,6 +50,47 @@ body:             |
   ; GFX9-NEXT: {{  $}}
   ; GFX9-NEXT: bb.3:
   ; GFX9-NEXT:   S_ENDPGM 0, implicit [[PHI]], implicit [[PHI1]]
+  ;
+  ; GFX9-SUNK-LABEL: name: test_sink_fmac_to_only_use
+  ; GFX9-SUNK: bb.0:
+  ; GFX9-SUNK-NEXT:   successors: %bb.2(0x40000000), %bb.1(0x40000000)
+  ; GFX9-SUNK-NEXT:   liveins: $vgpr0, $vgpr1, $vgpr2
+  ; GFX9-SUNK-NEXT: {{  $}}
+  ; GFX9-SUNK-NEXT:   [[COPY:%[0-9]+]]:vgpr_32 = COPY $vgpr0
+  ; GFX9-SUNK-NEXT:   [[COPY1:%[0-9]+]]:vgpr_32 = COPY $vgpr1
+  ; GFX9-SUNK-NEXT:   [[V_MOV_B32_e32_:%[0-9]+]]:vgpr_32 = V_MOV_B32_e32 0, implicit $exec
+  ; GFX9-SUNK-NEXT:   [[V_MOV_B32_e32_1:%[0-9]+]]:vgpr_32 = V_MOV_B32_e32 0, implicit $exec
+  ; GFX9-SUNK-NEXT:   [[S_MOV_B64_:%[0-9]+]]:sreg_64 = S_MOV_B64 0
+  ; GFX9-SUNK-NEXT:   [[S_MOV_B64_1:%[0-9]+]]:sreg_64 = S_MOV_B64 0
+  ; GFX9-SUNK-NEXT:   [[COPY2:%[0-9]+]]:vreg_64 = COPY [[S_MOV_B64_]]
+  ; GFX9-SUNK-NEXT:   [[COPY3:%[0-9]+]]:vreg_64 = COPY [[S_MOV_B64_1]]
+  ; GFX9-SUNK-NEXT:   [[GLOBAL_LOAD_DWORD:%[0-9]+]]:vgpr_32 = GLOBAL_LOAD_DWORD killed [[COPY2]], 0, 0, implicit $exec :: (load (s32), addrspace 1)
+  ; GFX9-SUNK-NEXT:   [[GLOBAL_LOAD_DWORD1:%[0-9]+]]:vgpr_32 = GLOBAL_LOAD_DWORD killed [[COPY3]], 0, 0, implicit $exec :: (load (s32), addrspace 1)
+  ; GFX9-SUNK-NEXT:   [[COPY4:%[0-9]+]]:vgpr_32(s32) = COPY $vgpr0
+  ; GFX9-SUNK-NEXT:   [[S_MOV_B32_:%[0-9]+]]:sreg_32 = S_MOV_B32 1
+  ; GFX9-SUNK-NEXT:   [[V_CMP_LT_I32_e64_:%[0-9]+]]:sreg_64 = V_CMP_LT_I32_e64 [[COPY4]](s32), [[S_MOV_B32_]], implicit $exec
+  ; GFX9-SUNK-NEXT:   [[SI_IF:%[0-9]+]]:sreg_64 = SI_IF [[V_CMP_LT_I32_e64_]], %bb.2, implicit-def dead $exec, implicit-def dead $scc, implicit $exec
+  ; GFX9-SUNK-NEXT:   S_BRANCH %bb.1
+  ; GFX9-SUNK-NEXT: {{  $}}
+  ; GFX9-SUNK-NEXT: bb.1:
+  ; GFX9-SUNK-NEXT:   successors: %bb.2(0x80000000)
+  ; GFX9-SUNK-NEXT: {{  $}}
+  ; GFX9-SUNK-NEXT:   [[V_FMAC_F32_e64_:%[0-9]+]]:vgpr_32 = contract nofpexcept V_FMAC_F32_e64 0, [[GLOBAL_LOAD_DWORD]], 0, [[COPY]], 0, [[COPY1]], 0, 0, implicit $mode, implicit $exec
+  ; GFX9-SUNK-NEXT:   [[V_FMAC_F32_e64_1:%[0-9]+]]:vgpr_32 = contract nofpexcept V_FMAC_F32_e64 0, [[V_FMAC_F32_e64_]], 0, [[COPY]], 0, [[COPY1]], 0, 0, implicit $mode, implicit $exec
+  ; GFX9-SUNK-NEXT:   [[V_FMAC_F32_e64_2:%[0-9]+]]:vgpr_32 = contract nofpexcept V_FMAC_F32_e64 0, [[GLOBAL_LOAD_DWORD1]], 0, [[COPY1]], 0, [[COPY]], 0, 0, implicit $mode, implicit $exec
+  ; GFX9-SUNK-NEXT:   [[V_FMAC_F32_e64_3:%[0-9]+]]:vgpr_32 = contract nofpexcept V_FMAC_F32_e64 0, [[V_FMAC_F32_e64_2]], 0, [[COPY1]], 0, [[COPY]], 0, 0, implicit $mode, implicit $exec
+  ; GFX9-SUNK-NEXT:   [[V_ADD_F32_e32_:%[0-9]+]]:vgpr_32 = V_ADD_F32_e32 [[V_FMAC_F32_e64_]], [[V_FMAC_F32_e64_1]], implicit $mode, implicit $exec
+  ; GFX9-SUNK-NEXT:   [[V_ADD_F32_e32_1:%[0-9]+]]:vgpr_32 = V_ADD_F32_e32 [[V_FMAC_F32_e64_2]], [[V_FMAC_F32_e64_3]], implicit $mode, implicit $exec
+  ; GFX9-SUNK-NEXT: {{  $}}
+  ; GFX9-SUNK-NEXT: bb.2:
+  ; GFX9-SUNK-NEXT:   successors: %bb.3(0x80000000)
+  ; GFX9-SUNK-NEXT: {{  $}}
+  ; GFX9-SUNK-NEXT:   [[PHI:%[0-9]+]]:vgpr_32 = PHI [[V_MOV_B32_e32_]], %bb.0, [[V_ADD_F32_e32_]], %bb.1
+  ; GFX9-SUNK-NEXT:   [[PHI1:%[0-9]+]]:vgpr_32 = PHI [[V_MOV_B32_e32_1]], %bb.0, [[V_ADD_F32_e32_1]], %bb.1
+  ; GFX9-SUNK-NEXT:   SI_END_CF [[SI_IF]], implicit-def dead $exec, implicit-def dead $scc, implicit $exec
+  ; GFX9-SUNK-NEXT: {{  $}}
+  ; GFX9-SUNK-NEXT: bb.3:
+  ; GFX9-SUNK-NEXT:   S_ENDPGM 0, implicit [[PHI]], implicit [[PHI1]]
   bb.0:
     liveins: $vgpr0, $vgpr1, $vgpr2
     %1:vgpr_32 = COPY $vgpr0
@@ -131,6 +174,48 @@ body:             |
   ; GFX9-NEXT: bb.3:
   ; GFX9-NEXT:   [[V_ADD_F32_e32_2:%[0-9]+]]:vgpr_32 = V_ADD_F32_e32 [[V_FMAC_F32_e64_3]], [[V_FMAC_F32_e64_1]], implicit $mode, implicit $exec
   ; GFX9-NEXT:   S_ENDPGM 0, implicit [[PHI]], implicit [[PHI1]]
+  ;
+  ; GFX9-SUNK-LABEL: name: test_no_sink_into_if_cond_multiple_uses
+  ; GFX9-SUNK: bb.0:
+  ; GFX9-SUNK-NEXT:   successors: %bb.2(0x40000000), %bb.1(0x40000000)
+  ; GFX9-SUNK-NEXT:   liveins: $vgpr0, $vgpr1, $vgpr2
+  ; GFX9-SUNK-NEXT: {{  $}}
+  ; GFX9-SUNK-NEXT:   [[COPY:%[0-9]+]]:vgpr_32 = COPY $vgpr0
+  ; GFX9-SUNK-NEXT:   [[COPY1:%[0-9]+]]:vgpr_32 = COPY $vgpr1
+  ; GFX9-SUNK-NEXT:   [[V_MOV_B32_e32_:%[0-9]+]]:vgpr_32 = V_MOV_B32_e32 0, implicit $exec
+  ; GFX9-SUNK-NEXT:   [[V_MOV_B32_e32_1:%[0-9]+]]:vgpr_32 = V_MOV_B32_e32 0, implicit $exec
+  ; GFX9-SUNK-NEXT:   [[S_MOV_B64_:%[0-9]+]]:sreg_64 = S_MOV_B64 0
+  ; GFX9-SUNK-NEXT:   [[S_MOV_B64_1:%[0-9]+]]:sreg_64 = S_MOV_B64 0
+  ; GFX9-SUNK-NEXT:   [[COPY2:%[0-9]+]]:vreg_64 = COPY [[S_MOV_B64_]]
+  ; GFX9-SUNK-NEXT:   [[COPY3:%[0-9]+]]:vreg_64 = COPY [[S_MOV_B64_1]]
+  ; GFX9-SUNK-NEXT:   [[GLOBAL_LOAD_DWORD:%[0-9]+]]:vgpr_32 = GLOBAL_LOAD_DWORD killed [[COPY2]], 0, 0, implicit $exec :: (load (s32), addrspace 1)
+  ; GFX9-SUNK-NEXT:   [[V_FMAC_F32_e64_:%[0-9]+]]:vgpr_32 = contract nofpexcept V_FMAC_F32_e64 0, [[GLOBAL_LOAD_DWORD]], 0, [[COPY]], 0, [[COPY1]], 0, 0, implicit $mode, implicit $exec
+  ; GFX9-SUNK-NEXT:   [[V_FMAC_F32_e64_1:%[0-9]+]]:vgpr_32 = contract nofpexcept V_FMAC_F32_e64 0, [[V_FMAC_F32_e64_]], 0, [[COPY]], 0, [[COPY1]], 0, 0, implicit $mode, implicit $exec
+  ; GFX9-SUNK-NEXT:   [[GLOBAL_LOAD_DWORD1:%[0-9]+]]:vgpr_32 = GLOBAL_LOAD_DWORD killed [[COPY3]], 0, 0, implicit $exec :: (load (s32), addrspace 1)
+  ; GFX9-SUNK-NEXT:   [[V_FMAC_F32_e64_2:%[0-9]+]]:vgpr_32 = contract nofpexcept V_FMAC_F32_e64 0, [[GLOBAL_LOAD_DWORD1]], 0, [[COPY1]], 0, [[COPY]], 0, 0, implicit $mode, implicit $exec
+  ; GFX9-SUNK-NEXT:   [[V_FMAC_F32_e64_3:%[0-9]+]]:vgpr_32 = contract nofpexcept V_FMAC_F32_e64 0, [[V_FMAC_F32_e64_2]], 0, [[COPY1]], 0, [[COPY]], 0, 0, implicit $mode, implicit $exec
+  ; GFX9-SUNK-NEXT:   [[COPY4:%[0-9]+]]:vgpr_32(s32) = COPY $vgpr0
+  ; GFX9-SUNK-NEXT:   [[S_MOV_B32_:%[0-9]+]]:sreg_32 = S_MOV_B32 1
+  ; GFX9-SUNK-NEXT:   [[V_CMP_LT_I32_e64_:%[0-9]+]]:sreg_64 = V_CMP_LT_I32_e64 [[COPY4]](s32), [[S_MOV_B32_]], implicit $exec
+  ; GFX9-SUNK-NEXT:   [[SI_IF:%[0-9]+]]:sreg_64 = SI_IF [[V_CMP_LT_I32_e64_]], %bb.2, implicit-def dead $exec, implicit-def dead $scc, implicit $exec
+  ; GFX9-SUNK-NEXT:   S_BRANCH %bb.1
+  ; GFX9-SUNK-NEXT: {{  $}}
+  ; GFX9-SUNK-NEXT: bb.1:
+  ; GFX9-SUNK-NEXT:   successors: %bb.2(0x80000000)
+  ; GFX9-SUNK-NEXT: {{  $}}
+  ; GFX9-SUNK-NEXT:   [[V_ADD_F32_e32_:%[0-9]+]]:vgpr_32 = V_ADD_F32_e32 [[V_FMAC_F32_e64_]], [[V_FMAC_F32_e64_1]], implicit $mode, implicit $exec
+  ; GFX9-SUNK-NEXT:   [[V_ADD_F32_e32_1:%[0-9]+]]:vgpr_32 = V_ADD_F32_e32 [[V_FMAC_F32_e64_2]], [[V_FMAC_F32_e64_3]], implicit $mode, implicit $exec
+  ; GFX9-SUNK-NEXT: {{  $}}
+  ; GFX9-SUNK-NEXT: bb.2:
+  ; GFX9-SUNK-NEXT:   successors: %bb.3(0x80000000)
+  ; GFX9-SUNK-NEXT: {{  $}}
+  ; GFX9-SUNK-NEXT:   [[PHI:%[0-9]+]]:vgpr_32 = PHI [[V_MOV_B32_e32_]], %bb.0, [[V_ADD_F32_e32_]], %bb.1
+  ; GFX9-SUNK-NEXT:   [[PHI1:%[0-9]+]]:vgpr_32 = PHI [[V_MOV_B32_e32_1]], %bb.0, [[V_ADD_F32_e32_1]], %bb.1
+  ; GFX9-SUNK-NEXT:   SI_END_CF [[SI_IF]], implicit-def dead $exec, implicit-def dead $scc, implicit $exec
+  ; GFX9-SUNK-NEXT: {{  $}}
+  ; GFX9-SUNK-NEXT: bb.3:
+  ; GFX9-SUNK-NEXT:   [[V_ADD_F32_e32_2:%[0-9]+]]:vgpr_32 = V_ADD_F32_e32 [[V_FMAC_F32_e64_3]], [[V_FMAC_F32_e64_1]], implicit $mode, implicit $exec
+  ; GFX9-SUNK-NEXT:   S_ENDPGM 0, implicit [[PHI]], implicit [[PHI1]]
   bb.0:
     liveins: $vgpr0, $vgpr1, $vgpr2
     %1:vgpr_32 = COPY $vgpr0
@@ -215,6 +300,48 @@ body:             |
   ; GFX9-NEXT: {{  $}}
   ; GFX9-NEXT: bb.3:
   ; GFX9-NEXT:   S_ENDPGM 0, implicit [[PHI]], implicit [[PHI1]]
+  ;
+  ; GFX9-SUNK-LABEL: name: no_sink_fmac_not_constant_mode
+  ; GFX9-SUNK: bb.0:
+  ; GFX9-SUNK-NEXT:   successors: %bb.2(0x40000000), %bb.1(0x40000000)
+  ; GFX9-SUNK-NEXT:   liveins: $vgpr0, $vgpr1, $vgpr2
+  ; GFX9-SUNK-NEXT: {{  $}}
+  ; GFX9-SUNK-NEXT:   $mode = IMPLICIT_DEF
+  ; GFX9-SUNK-NEXT:   [[COPY:%[0-9]+]]:vgpr_32 = COPY $vgpr0
+  ; GFX9-SUNK-NEXT:   [[COPY1:%[0-9]+]]:vgpr_32 = COPY $vgpr1
+  ; GFX9-SUNK-NEXT:   [[V_MOV_B32_e32_:%[0-9]+]]:vgpr_32 = V_MOV_B32_e32 0, implicit $exec
+  ; GFX9-SUNK-NEXT:   [[V_MOV_B32_e32_1:%[0-9]+]]:vgpr_32 = V_MOV_B32_e32 0, implicit $exec
+  ; GFX9-SUNK-NEXT:   [[S_MOV_B64_:%[0-9]+]]:sreg_64 = S_MOV_B64 0
+  ; GFX9-SUNK-NEXT:   [[S_MOV_B64_1:%[0-9]+]]:sreg_64 = S_MOV_B64 0
+  ; GFX9-SUNK-NEXT:   [[COPY2:%[0-9]+]]:vreg_64 = COPY [[S_MOV_B64_]]
+  ; GFX9-SUNK-NEXT:   [[COPY3:%[0-9]+]]:vreg_64 = COPY [[S_MOV_B64_1]]
+  ; GFX9-SUNK-NEXT:   [[GLOBAL_LOAD_DWORD:%[0-9]+]]:vgpr_32 = GLOBAL_LOAD_DWORD killed [[COPY2]], 0, 0, implicit $exec :: (load (s32), addrspace 1)
+  ; GFX9-SUNK-NEXT:   [[V_FMAC_F32_e64_:%[0-9]+]]:vgpr_32 = contract nofpexcept V_FMAC_F32_e64 0, [[GLOBAL_LOAD_DWORD]], 0, [[COPY]], 0, [[COPY1]], 0, 0, implicit $mode, implicit $exec
+  ; GFX9-SUNK-NEXT:   [[V_FMAC_F32_e64_1:%[0-9]+]]:vgpr_32 = contract nofpexcept V_FMAC_F32_e64 0, [[V_FMAC_F32_e64_]], 0, [[COPY]], 0, [[COPY1]], 0, 0, implicit $mode, implicit $exec
+  ; GFX9-SUNK-NEXT:   [[GLOBAL_LOAD_DWORD1:%[0-9]+]]:vgpr_32 = GLOBAL_LOAD_DWORD killed [[COPY3]], 0, 0, implicit $exec :: (load (s32), addrspace 1)
+  ; GFX9-SUNK-NEXT:   [[V_FMAC_F32_e64_2:%[0-9]+]]:vgpr_32 = contract nofpexcept V_FMAC_F32_e64 0, [[GLOBAL_LOAD_DWORD1]], 0, [[COPY1]], 0, [[COPY]], 0, 0, implicit $mode, implicit $exec
+  ; GFX9-SUNK-NEXT:   [[V_FMAC_F32_e64_3:%[0-9]+]]:vgpr_32 = contract nofpexcept V_FMAC_F32_e64 0, [[V_FMAC_F32_e64_2]], 0, [[COPY1]], 0, [[COPY]], 0, 0, implicit $mode, implicit $exec
+  ; GFX9-SUNK-NEXT:   [[COPY4:%[0-9]+]]:vgpr_32(s32) = COPY $vgpr0
+  ; GFX9-SUNK-NEXT:   [[S_MOV_B32_:%[0-9]+]]:sreg_...
[truncated]

@jrbyrnes
Contributor Author

jrbyrnes commented Dec 2, 2024

Ping

@jrbyrnes
Contributor Author

jrbyrnes commented Dec 6, 2024

Reworked the approach to introduce CycleSinkStage. This sinks into loops in progressively more aggressive stages, with each stage running only if we have not yet met our RP goals. The first stage sinks copies and honors the sink-cycle limit, the second sinks low-latency instructions, and the third sinks all candidates.

Reorganized the tests to better exercise the new implementation.
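A minimal sketch of the staged structure described above (hypothetical helper names; the real code lives in MachineSink.cpp and differs in detail):

  // Hedged sketch of staged cycle sinking: each stage runs only while the
  // RP goal is unmet, and each stage admits more candidates than the last.
  enum class CycleSinkStage { Copy, LowLatency, All };

  for (CycleSinkStage Stage :
       {CycleSinkStage::Copy, CycleSinkStage::LowLatency, CycleSinkStage::All}) {
    if (rpGoalsMet(Cycle))               // hypothetical RP query
      break;
    sinkStageCandidates(Cycle, Stage);   // Copy honors machine-sink-cycle-limit;
                                         // later stages relax the limits
  }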

@jrbyrnes jrbyrnes changed the title [MachineSink] Add option for aggressive loop sinking [MachineSink] Add capability for aggressive loop sinking Dec 6, 2024
@jrbyrnes
Contributor Author

Any other concerns here?

@jrbyrnes
Contributor Author

ping

@jrbyrnes jrbyrnes changed the title [MachineSink] Add capability for aggressive loop sinking [MachineSink] Extend loop sinking capability Jan 3, 2025
@jrbyrnes
Contributor Author

jrbyrnes commented Jan 3, 2025

Changed the commit title to reflect that this isn't adding a new flag.

@jrbyrnes
Contributor Author

jrbyrnes commented Jan 6, 2025

Bring in #121818 -- this provides an interface for checking if an MI is dead.

As a result, I've deleted MachineSinking::isDead. The implementation concerns of isDead still exist in DeadMachineInstructionElim, but they should be handled separately from this PR.
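For reference, the call-site shape after that change is roughly the following (a hedged sketch; the exact signature comes from the follow-up PRs, not this one):

  // Assumed interface from #121818/#123531: deadness is queried on the
  // MachineInstr itself instead of a local reimplementation in MachineSink.
  if (I.isDead(*MRI))
    I.eraseFromParent();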

@jrbyrnes
Contributor Author

Ping --

Sorry for the short-interval ping, but I would like to get this in.

jrbyrnes force-pushed the MIRAggressiveSinkRebase0 branch from 8087333 to 4d5f49b on January 19, 2025, 21:49
@jrbyrnes
Contributor Author

Replace #121818 with #123531

@jrbyrnes jrbyrnes merged commit acb7859 into llvm:main Jan 24, 2025
8 checks passed
@llvm-ci
Collaborator

llvm-ci commented Jan 24, 2025

LLVM Buildbot has detected a new failure on builder openmp-offload-libc-amdgpu-runtime running on omp-vega20-1 while building llvm at step 7 "Add check check-offload".

Full details are available at: https://lab.llvm.org/buildbot/#/builders/73/builds/12318

Here is the relevant piece of the build log for the reference
Step 7 (Add check check-offload) failure: test (failure)
******************** TEST 'libomptarget :: amdgcn-amd-amdhsa :: offloading/pgo1.c' FAILED ********************
Exit Code: 1

Command Output (stdout):
--
# RUN: at line 1
/home/ompworker/bbot/openmp-offload-libc-amdgpu-runtime/llvm.build/./bin/clang -fopenmp    -I /home/ompworker/bbot/openmp-offload-libc-amdgpu-runtime/llvm.src/offload/test -I /home/ompworker/bbot/openmp-offload-libc-amdgpu-runtime/llvm.build/runtimes/runtimes-bins/openmp/runtime/src -L /home/ompworker/bbot/openmp-offload-libc-amdgpu-runtime/llvm.build/runtimes/runtimes-bins/offload -L /home/ompworker/bbot/openmp-offload-libc-amdgpu-runtime/llvm.build/./lib -L /home/ompworker/bbot/openmp-offload-libc-amdgpu-runtime/llvm.build/runtimes/runtimes-bins/openmp/runtime/src  -nogpulib -Wl,-rpath,/home/ompworker/bbot/openmp-offload-libc-amdgpu-runtime/llvm.build/runtimes/runtimes-bins/offload -Wl,-rpath,/home/ompworker/bbot/openmp-offload-libc-amdgpu-runtime/llvm.build/runtimes/runtimes-bins/openmp/runtime/src -Wl,-rpath,/home/ompworker/bbot/openmp-offload-libc-amdgpu-runtime/llvm.build/./lib  -fopenmp-targets=amdgcn-amd-amdhsa /home/ompworker/bbot/openmp-offload-libc-amdgpu-runtime/llvm.src/offload/test/offloading/pgo1.c -o /home/ompworker/bbot/openmp-offload-libc-amdgpu-runtime/llvm.build/runtimes/runtimes-bins/offload/test/amdgcn-amd-amdhsa/offloading/Output/pgo1.c.tmp -Xoffload-linker -lc -Xoffload-linker -lm /home/ompworker/bbot/openmp-offload-libc-amdgpu-runtime/llvm.build/./lib/libomptarget.devicertl.a -fprofile-instr-generate      -Xclang "-fprofile-instrument=clang"
# executed command: /home/ompworker/bbot/openmp-offload-libc-amdgpu-runtime/llvm.build/./bin/clang -fopenmp -I /home/ompworker/bbot/openmp-offload-libc-amdgpu-runtime/llvm.src/offload/test -I /home/ompworker/bbot/openmp-offload-libc-amdgpu-runtime/llvm.build/runtimes/runtimes-bins/openmp/runtime/src -L /home/ompworker/bbot/openmp-offload-libc-amdgpu-runtime/llvm.build/runtimes/runtimes-bins/offload -L /home/ompworker/bbot/openmp-offload-libc-amdgpu-runtime/llvm.build/./lib -L /home/ompworker/bbot/openmp-offload-libc-amdgpu-runtime/llvm.build/runtimes/runtimes-bins/openmp/runtime/src -nogpulib -Wl,-rpath,/home/ompworker/bbot/openmp-offload-libc-amdgpu-runtime/llvm.build/runtimes/runtimes-bins/offload -Wl,-rpath,/home/ompworker/bbot/openmp-offload-libc-amdgpu-runtime/llvm.build/runtimes/runtimes-bins/openmp/runtime/src -Wl,-rpath,/home/ompworker/bbot/openmp-offload-libc-amdgpu-runtime/llvm.build/./lib -fopenmp-targets=amdgcn-amd-amdhsa /home/ompworker/bbot/openmp-offload-libc-amdgpu-runtime/llvm.src/offload/test/offloading/pgo1.c -o /home/ompworker/bbot/openmp-offload-libc-amdgpu-runtime/llvm.build/runtimes/runtimes-bins/offload/test/amdgcn-amd-amdhsa/offloading/Output/pgo1.c.tmp -Xoffload-linker -lc -Xoffload-linker -lm /home/ompworker/bbot/openmp-offload-libc-amdgpu-runtime/llvm.build/./lib/libomptarget.devicertl.a -fprofile-instr-generate -Xclang -fprofile-instrument=clang
# RUN: at line 3
/home/ompworker/bbot/openmp-offload-libc-amdgpu-runtime/llvm.build/runtimes/runtimes-bins/offload/test/amdgcn-amd-amdhsa/offloading/Output/pgo1.c.tmp 2>&1 | /home/ompworker/bbot/openmp-offload-libc-amdgpu-runtime/llvm.build/./bin/FileCheck /home/ompworker/bbot/openmp-offload-libc-amdgpu-runtime/llvm.src/offload/test/offloading/pgo1.c      --check-prefix="CLANG-PGO"
# executed command: /home/ompworker/bbot/openmp-offload-libc-amdgpu-runtime/llvm.build/runtimes/runtimes-bins/offload/test/amdgcn-amd-amdhsa/offloading/Output/pgo1.c.tmp
# executed command: /home/ompworker/bbot/openmp-offload-libc-amdgpu-runtime/llvm.build/./bin/FileCheck /home/ompworker/bbot/openmp-offload-libc-amdgpu-runtime/llvm.src/offload/test/offloading/pgo1.c --check-prefix=CLANG-PGO
# .---command stderr------------
# | /home/ompworker/bbot/openmp-offload-libc-amdgpu-runtime/llvm.src/offload/test/offloading/pgo1.c:32:20: error: CLANG-PGO-NEXT: expected string not found in input
# | // CLANG-PGO-NEXT: [ 0 11 20 ]
# |                    ^
# | <stdin>:3:28: note: scanning from here
# | ======== Counters =========
# |                            ^
# | <stdin>:4:1: note: possible intended match here
# | [ 0 14 20 ]
# | ^
# | 
# | Input file: <stdin>
# | Check file: /home/ompworker/bbot/openmp-offload-libc-amdgpu-runtime/llvm.src/offload/test/offloading/pgo1.c
# | 
# | -dump-input=help explains the following input dump.
# | 
# | Input was:
# | <<<<<<
# |            1: ======= GPU Profile ======= 
# |            2: Target: amdgcn-amd-amdhsa 
# |            3: ======== Counters ========= 
# | next:32'0                                X error: no match found
# |            4: [ 0 14 20 ] 
# | next:32'0     ~~~~~~~~~~~~
# | next:32'1     ?            possible intended match
# |            5: [ 10 ] 
# | next:32'0     ~~~~~~~
# |            6: [ 20 ] 
# | next:32'0     ~~~~~~~
# |            7: ========== Data =========== 
# | next:32'0     ~~~~~~~~~~~~~~~~~~~~~~~~~~~~
# |            8: { 9515997471539760012 4749112401 0xffffffffffffffd8 0x0 0x0 0x0 3 [...] 0 } 
# | next:32'0     ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
# |            9: { 3666282617048535130 24 0xffffffffffffffb0 0x0 0x0 0x0 1 [...] 0 } 
# | next:32'0     ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
# |            .
# |            .
# |            .
...

@preames
Collaborator

preames commented Feb 28, 2025

@jrbyrnes - Can I ask what your plans are here? I've seen cases where something like this would be useful for the RISCV backend. Are you going to move this towards a per-backend enable? Aside from #127133, are you otherwise happy with the heuristics here?

@jrbyrnes
Contributor Author

Hey @preames --

At the moment, I don't really have plans to continue along this direction -- we needed some short-term way to cope with code that had huge live-in sets. The heuristics are okay -- I assume the RP calculations are inaccurate, and there is no evaluation of the latency of sunk code versus the RP / spill reduction. But for our use cases, they are sufficient.

For our (AMDGPU) backend, the long-term approach is to do rematerialization during scheduling. At that point, we have a better idea of the final state of RP and more accurate info regarding latency. This approach also lets us write target-specific remat / sink strategies. Although I would expect the basic principles of such strategies to be universal, different targets tend to care about RP in different ways that are not always easy to express via a hook.
