RegAllocGreedy: Fix subrange based instruction split logic #120199

Open · wants to merge 1 commit into main

Conversation

arsenm
Contributor

@arsenm arsenm commented Dec 17, 2024

Fix the logic for readsLaneSubset: check the use operands of the
instruction at the correct point, instead of the result. Only consider
the use register operands, and stop checking whether the subranges are
actually live at that point.

This avoids some unproductive splits. It also happens to avoid a
use-after-free caused by a split of an unspillable register. That
issue still exists if the instruction does not reference the full
set of register lanes.

Contributor Author

arsenm commented Dec 17, 2024

@arsenm arsenm force-pushed the users/arsenm/greedy-fix-subrange-instruction-split-logic branch from 9125238 to 204f93c Compare December 17, 2024 09:11
@arsenm arsenm marked this pull request as ready for review December 17, 2024 09:17
@llvmbot
Member

llvmbot commented Dec 17, 2024

@llvm/pr-subscribers-llvm-regalloc

@llvm/pr-subscribers-backend-amdgpu

Author: Matt Arsenault (arsenm)

Changes

Fix the logic for readsLaneSubset. Check at the correct point
for the use operands of the instruction, instead of the result.
Only consider the use register operands, and stop considering
whether the subranges are actually live at this point.

This avoids some unproductive splits. This also happens to avoid
a use after free due to a split of an unspillable register. That
issue still exists if the instruction does not reference the full
set of register lanes.


Patch is 110.59 KiB, truncated to 20.00 KiB below, full version: https://github.com/llvm/llvm-project/pull/120199.diff

9 Files Affected:

  • (modified) llvm/lib/CodeGen/RegAllocGreedy.cpp (+24-38)
  • (modified) llvm/test/CodeGen/AMDGPU/gfx-callable-return-types.ll (+60-54)
  • (added) llvm/test/CodeGen/AMDGPU/inflated-reg-class-snippet-copy-inst-reads-lane-subset-use-after-free.mir (+58)
  • (added) llvm/test/CodeGen/AMDGPU/inflated-reg-class-snippet-copy-use-after-free.mir (+111)
  • (modified) llvm/test/CodeGen/AMDGPU/remat-smrd.mir (+5-7)
  • (modified) llvm/test/CodeGen/AMDGPU/splitkit-copy-live-lanes.mir (+296-286)
  • (modified) llvm/test/CodeGen/RISCV/rvv/fixed-vectors-interleaved-access.ll (+3-14)
  • (modified) llvm/test/CodeGen/RISCV/rvv/vector-deinterleave-load.ll (+14-27)
  • (modified) llvm/test/CodeGen/RISCV/rvv/vector-deinterleave.ll (+34-44)
diff --git a/llvm/lib/CodeGen/RegAllocGreedy.cpp b/llvm/lib/CodeGen/RegAllocGreedy.cpp
index 4fa2bc76b38b4b..040a02edba2b4f 100644
--- a/llvm/lib/CodeGen/RegAllocGreedy.cpp
+++ b/llvm/lib/CodeGen/RegAllocGreedy.cpp
@@ -1344,37 +1344,7 @@ static unsigned getNumAllocatableRegsForConstraints(
   return RCI.getNumAllocatableRegs(ConstrainedRC);
 }
 
-static LaneBitmask getInstReadLaneMask(const MachineRegisterInfo &MRI,
-                                       const TargetRegisterInfo &TRI,
-                                       const MachineInstr &FirstMI,
-                                       Register Reg) {
-  LaneBitmask Mask;
-  SmallVector<std::pair<MachineInstr *, unsigned>, 8> Ops;
-  (void)AnalyzeVirtRegInBundle(const_cast<MachineInstr &>(FirstMI), Reg, &Ops);
-
-  for (auto [MI, OpIdx] : Ops) {
-    const MachineOperand &MO = MI->getOperand(OpIdx);
-    assert(MO.isReg() && MO.getReg() == Reg);
-    unsigned SubReg = MO.getSubReg();
-    if (SubReg == 0 && MO.isUse()) {
-      if (MO.isUndef())
-        continue;
-      return MRI.getMaxLaneMaskForVReg(Reg);
-    }
-
-    LaneBitmask SubRegMask = TRI.getSubRegIndexLaneMask(SubReg);
-    if (MO.isDef()) {
-      if (!MO.isUndef())
-        Mask |= ~SubRegMask;
-    } else
-      Mask |= SubRegMask;
-  }
-
-  return Mask;
-}
-
-/// Return true if \p MI at \P Use reads a subset of the lanes live in \p
-/// VirtReg.
+/// Return true if \p MI at \P Use reads a subset of the lanes of \p VirtReg.
 static bool readsLaneSubset(const MachineRegisterInfo &MRI,
                             const MachineInstr *MI, const LiveInterval &VirtReg,
                             const TargetRegisterInfo *TRI, SlotIndex Use,
@@ -1387,18 +1357,34 @@ static bool readsLaneSubset(const MachineRegisterInfo &MRI,
       DestSrc->Destination->getSubReg() == DestSrc->Source->getSubReg())
     return false;
 
+  Register Reg = VirtReg.reg();
+
   // FIXME: We're only considering uses, but should be consider defs too?
-  LaneBitmask ReadMask = getInstReadLaneMask(MRI, *TRI, *MI, VirtReg.reg());
+  LaneBitmask UseMask;
+  SmallVector<std::pair<MachineInstr *, unsigned>, 8> Ops;
+  (void)AnalyzeVirtRegInBundle(const_cast<MachineInstr &>(*MI), Reg, &Ops);
 
-  LaneBitmask LiveAtMask;
-  for (const LiveInterval::SubRange &S : VirtReg.subranges()) {
-    if (S.liveAt(Use))
-      LiveAtMask |= S.LaneMask;
+  for (auto [MI, OpIdx] : Ops) {
+    const MachineOperand &MO = MI->getOperand(OpIdx);
+    assert(MO.isReg() && MO.getReg() == Reg);
+    unsigned SubReg = MO.getSubReg();
+    if (SubReg == 0 && MO.isUse()) {
+      if (MO.isUndef())
+        continue;
+      return false;
+    }
+
+    LaneBitmask SubRegMask = TRI->getSubRegIndexLaneMask(SubReg);
+    if (MO.isDef()) {
+      if (!MO.isUndef())
+        UseMask |= ~SubRegMask;
+    } else
+      UseMask |= SubRegMask;
   }
 
   // If the live lanes aren't different from the lanes used by the instruction,
   // this doesn't help.
-  return (ReadMask & ~(LiveAtMask & TRI->getCoveringLanes())).any();
+  return MRI.getMaxLaneMaskForVReg(VirtReg.reg()) != UseMask;
 }
 
 /// tryInstructionSplit - Split a live range around individual instructions.
@@ -1450,7 +1436,7 @@ unsigned RAGreedy::tryInstructionSplit(const LiveInterval &VirtReg,
                                                    TII, TRI, RegClassInfo)) ||
           // TODO: Handle split for subranges with subclass constraints?
           (!SplitSubClass && VirtReg.hasSubRanges() &&
-           !readsLaneSubset(*MRI, MI, VirtReg, TRI, Use, TII))) {
+           !readsLaneSubset(*MRI, MI, VirtReg, TRI, Use.getBaseIndex(), TII))) {
         LLVM_DEBUG(dbgs() << "    skip:\t" << Use << '\t' << *MI);
         continue;
       }
diff --git a/llvm/test/CodeGen/AMDGPU/gfx-callable-return-types.ll b/llvm/test/CodeGen/AMDGPU/gfx-callable-return-types.ll
index 7d07641f455e3f..fcdf870bd046ac 100644
--- a/llvm/test/CodeGen/AMDGPU/gfx-callable-return-types.ll
+++ b/llvm/test/CodeGen/AMDGPU/gfx-callable-return-types.ll
@@ -3181,7 +3181,7 @@ define amdgpu_gfx void @call_72xi32() #1 {
 ; GFX11-NEXT:    s_delay_alu instid0(SALU_CYCLE_1)
 ; GFX11-NEXT:    s_and_b32 s33, s33, 0xfffffe00
 ; GFX11-NEXT:    s_or_saveexec_b32 s0, -1
-; GFX11-NEXT:    scratch_store_b32 off, v60, s33 offset:1600 ; 4-byte Folded Spill
+; GFX11-NEXT:    scratch_store_b32 off, v63, s33 offset:1584 ; 4-byte Folded Spill
 ; GFX11-NEXT:    s_mov_b32 exec_lo, s0
 ; GFX11-NEXT:    s_mov_b32 s0, 0
 ; GFX11-NEXT:    v_mov_b32_e32 v4, 0
@@ -3191,19 +3191,22 @@ define amdgpu_gfx void @call_72xi32() #1 {
 ; GFX11-NEXT:    v_dual_mov_b32 v0, s0 :: v_dual_mov_b32 v1, s1
 ; GFX11-NEXT:    v_dual_mov_b32 v2, s2 :: v_dual_mov_b32 v3, s3
 ; GFX11-NEXT:    s_addk_i32 s32, 0xa00
-; GFX11-NEXT:    s_clause 0xb
-; GFX11-NEXT:    scratch_store_b32 off, v40, s33 offset:44
-; GFX11-NEXT:    scratch_store_b32 off, v41, s33 offset:40
-; GFX11-NEXT:    scratch_store_b32 off, v42, s33 offset:36
-; GFX11-NEXT:    scratch_store_b32 off, v43, s33 offset:32
-; GFX11-NEXT:    scratch_store_b32 off, v44, s33 offset:28
-; GFX11-NEXT:    scratch_store_b32 off, v45, s33 offset:24
-; GFX11-NEXT:    scratch_store_b32 off, v46, s33 offset:20
-; GFX11-NEXT:    scratch_store_b32 off, v47, s33 offset:16
-; GFX11-NEXT:    scratch_store_b32 off, v56, s33 offset:12
-; GFX11-NEXT:    scratch_store_b32 off, v57, s33 offset:8
-; GFX11-NEXT:    scratch_store_b32 off, v58, s33 offset:4
-; GFX11-NEXT:    scratch_store_b32 off, v59, s33
+; GFX11-NEXT:    s_clause 0xe
+; GFX11-NEXT:    scratch_store_b32 off, v40, s33 offset:56
+; GFX11-NEXT:    scratch_store_b32 off, v41, s33 offset:52
+; GFX11-NEXT:    scratch_store_b32 off, v42, s33 offset:48
+; GFX11-NEXT:    scratch_store_b32 off, v43, s33 offset:44
+; GFX11-NEXT:    scratch_store_b32 off, v44, s33 offset:40
+; GFX11-NEXT:    scratch_store_b32 off, v45, s33 offset:36
+; GFX11-NEXT:    scratch_store_b32 off, v46, s33 offset:32
+; GFX11-NEXT:    scratch_store_b32 off, v47, s33 offset:28
+; GFX11-NEXT:    scratch_store_b32 off, v56, s33 offset:24
+; GFX11-NEXT:    scratch_store_b32 off, v57, s33 offset:20
+; GFX11-NEXT:    scratch_store_b32 off, v58, s33 offset:16
+; GFX11-NEXT:    scratch_store_b32 off, v59, s33 offset:12
+; GFX11-NEXT:    scratch_store_b32 off, v60, s33 offset:8
+; GFX11-NEXT:    scratch_store_b32 off, v61, s33 offset:4
+; GFX11-NEXT:    scratch_store_b32 off, v62, s33
 ; GFX11-NEXT:    s_add_i32 s0, s32, 0xa0
 ; GFX11-NEXT:    s_add_i32 s1, s32, 0x90
 ; GFX11-NEXT:    scratch_store_b128 off, v[0:3], s32
@@ -3224,7 +3227,7 @@ define amdgpu_gfx void @call_72xi32() #1 {
 ; GFX11-NEXT:    s_add_i32 s0, s32, 32
 ; GFX11-NEXT:    s_add_i32 s1, s32, 16
 ; GFX11-NEXT:    s_add_i32 s2, s33, 0x200
-; GFX11-NEXT:    v_writelane_b32 v60, s30, 0
+; GFX11-NEXT:    v_writelane_b32 v63, s30, 0
 ; GFX11-NEXT:    scratch_store_b128 off, v[0:3], s0
 ; GFX11-NEXT:    scratch_store_b128 off, v[0:3], s1
 ; GFX11-NEXT:    v_dual_mov_b32 v0, s2 :: v_dual_mov_b32 v3, 0
@@ -3245,7 +3248,7 @@ define amdgpu_gfx void @call_72xi32() #1 {
 ; GFX11-NEXT:    v_dual_mov_b32 v31, 0 :: v_dual_mov_b32 v30, 0
 ; GFX11-NEXT:    s_mov_b32 s1, return_72xi32@abs32@hi
 ; GFX11-NEXT:    s_mov_b32 s0, return_72xi32@abs32@lo
-; GFX11-NEXT:    v_writelane_b32 v60, s31, 1
+; GFX11-NEXT:    v_writelane_b32 v63, s31, 1
 ; GFX11-NEXT:    s_swappc_b64 s[30:31], s[0:1]
 ; GFX11-NEXT:    s_clause 0x1
 ; GFX11-NEXT:    scratch_load_b128 v[45:48], off, s33 offset:624
@@ -3267,7 +3270,8 @@ define amdgpu_gfx void @call_72xi32() #1 {
 ; GFX11-NEXT:    s_waitcnt vmcnt(2)
 ; GFX11-NEXT:    v_dual_mov_b32 v14, v1 :: v_dual_mov_b32 v1, v4
 ; GFX11-NEXT:    s_waitcnt vmcnt(0)
-; GFX11-NEXT:    scratch_store_b128 off, v[16:19], s33 offset:1584 ; 16-byte Folded Spill
+; GFX11-NEXT:    v_dual_mov_b32 v62, v19 :: v_dual_mov_b32 v61, v18
+; GFX11-NEXT:    v_mov_b32_e32 v60, v17
 ; GFX11-NEXT:    s_clause 0x3
 ; GFX11-NEXT:    scratch_load_b128 v[16:19], off, s33 offset:528
 ; GFX11-NEXT:    scratch_load_b128 v[20:23], off, s33 offset:544
@@ -3285,17 +3289,18 @@ define amdgpu_gfx void @call_72xi32() #1 {
 ; GFX11-NEXT:    s_waitcnt vmcnt(0)
 ; GFX11-NEXT:    scratch_store_b128 off, v[28:31], s33 offset:1536 ; 16-byte Folded Spill
 ; GFX11-NEXT:    scratch_store_b128 off, v[32:35], s32
-; GFX11-NEXT:    v_dual_mov_b32 v31, v47 :: v_dual_mov_b32 v32, v36
+; GFX11-NEXT:    v_mov_b32_e32 v32, v36
 ; GFX11-NEXT:    v_dual_mov_b32 v33, v48 :: v_dual_mov_b32 v34, v49
+; GFX11-NEXT:    v_mov_b32_e32 v49, v52
 ; GFX11-NEXT:    v_dual_mov_b32 v35, v50 :: v_dual_mov_b32 v48, v51
-; GFX11-NEXT:    v_dual_mov_b32 v49, v52 :: v_dual_mov_b32 v50, v53
-; GFX11-NEXT:    v_dual_mov_b32 v51, v54 :: v_dual_mov_b32 v36, v55
-; GFX11-NEXT:    v_dual_mov_b32 v53, v41 :: v_dual_mov_b32 v52, v40
-; GFX11-NEXT:    v_dual_mov_b32 v54, v42 :: v_dual_mov_b32 v41, v56
-; GFX11-NEXT:    v_dual_mov_b32 v55, v43 :: v_dual_mov_b32 v40, v44
-; GFX11-NEXT:    v_dual_mov_b32 v42, v57 :: v_dual_mov_b32 v57, v12
+; GFX11-NEXT:    v_dual_mov_b32 v50, v53 :: v_dual_mov_b32 v51, v54
+; GFX11-NEXT:    v_mov_b32_e32 v36, v55
+; GFX11-NEXT:    v_dual_mov_b32 v52, v40 :: v_dual_mov_b32 v53, v41
+; GFX11-NEXT:    v_dual_mov_b32 v54, v42 :: v_dual_mov_b32 v55, v43
+; GFX11-NEXT:    v_mov_b32_e32 v40, v44
+; GFX11-NEXT:    v_dual_mov_b32 v41, v56 :: v_dual_mov_b32 v42, v57
 ; GFX11-NEXT:    v_dual_mov_b32 v43, v58 :: v_dual_mov_b32 v56, v59
-; GFX11-NEXT:    v_mov_b32_e32 v58, v13
+; GFX11-NEXT:    v_dual_mov_b32 v57, v12 :: v_dual_mov_b32 v58, v13
 ; GFX11-NEXT:    v_dual_mov_b32 v12, v15 :: v_dual_mov_b32 v13, v0
 ; GFX11-NEXT:    v_dual_mov_b32 v15, v2 :: v_dual_mov_b32 v0, v3
 ; GFX11-NEXT:    v_dual_mov_b32 v2, v5 :: v_dual_mov_b32 v3, v6
@@ -3310,57 +3315,58 @@ define amdgpu_gfx void @call_72xi32() #1 {
 ; GFX11-NEXT:    scratch_store_b128 off, v[0:3], s2
 ; GFX11-NEXT:    v_mov_b32_e32 v0, 24
 ; GFX11-NEXT:    s_add_i32 s2, s32, 0x70
-; GFX11-NEXT:    v_mov_b32_e32 v6, v17
+; GFX11-NEXT:    v_mov_b32_e32 v2, v60
 ; GFX11-NEXT:    scratch_store_b128 off, v[12:15], s2
-; GFX11-NEXT:    v_mov_b32_e32 v13, v24
+; GFX11-NEXT:    v_mov_b32_e32 v15, v26
 ; GFX11-NEXT:    s_add_i32 s2, s32, 0x6c
-; GFX11-NEXT:    v_mov_b32_e32 v7, v18
+; GFX11-NEXT:    v_dual_mov_b32 v4, v62 :: v_dual_mov_b32 v13, v24
 ; GFX11-NEXT:    scratch_store_b32 off, v0, s2
 ; GFX11-NEXT:    s_add_i32 s2, s32, 0x60
-; GFX11-NEXT:    v_dual_mov_b32 v8, v19 :: v_dual_mov_b32 v15, v26
+; GFX11-NEXT:    v_dual_mov_b32 v6, v17 :: v_dual_mov_b32 v31, v47
 ; GFX11-NEXT:    scratch_store_b96 off, v[56:58], s2
 ; GFX11-NEXT:    s_add_i32 s2, s32, 0x50
-; GFX11-NEXT:    v_dual_mov_b32 v12, v23 :: v_dual_mov_b32 v29, v45
+; GFX11-NEXT:    v_mov_b32_e32 v7, v18
 ; GFX11-NEXT:    scratch_store_b128 off, v[40:43], s2
 ; GFX11-NEXT:    s_add_i32 s2, s32, 64
-; GFX11-NEXT:    v_mov_b32_e32 v14, v25
+; GFX11-NEXT:    v_dual_mov_b32 v8, v19 :: v_dual_mov_b32 v29, v45
 ; GFX11-NEXT:    scratch_store_b128 off, v[52:55], s2
 ; GFX11-NEXT:    s_add_i32 s2, s32, 48
-; GFX11-NEXT:    v_mov_b32_e32 v16, v27
+; GFX11-NEXT:    v_mov_b32_e32 v12, v23
 ; GFX11-NEXT:    scratch_store_b128 off, v[36:39], s2
 ; GFX11-NEXT:    s_add_i32 s2, s32, 32
-; GFX11-NEXT:    v_mov_b32_e32 v30, v46
+; GFX11-NEXT:    v_mov_b32_e32 v14, v25
 ; GFX11-NEXT:    scratch_store_b128 off, v[48:51], s2
 ; GFX11-NEXT:    s_add_i32 s2, s32, 16
+; GFX11-NEXT:    v_mov_b32_e32 v16, v27
 ; GFX11-NEXT:    scratch_store_b128 off, v[32:35], s2
-; GFX11-NEXT:    scratch_load_b128 v[1:4], off, s33 offset:1584 ; 16-byte Folded Reload
-; GFX11-NEXT:    s_waitcnt vmcnt(0)
-; GFX11-NEXT:    v_mov_b32_e32 v1, 42
 ; GFX11-NEXT:    s_clause 0x2
 ; GFX11-NEXT:    scratch_load_b128 v[17:20], off, s33 offset:1568
 ; GFX11-NEXT:    scratch_load_b128 v[21:24], off, s33 offset:1552
 ; GFX11-NEXT:    scratch_load_b128 v[25:28], off, s33 offset:1536
 ; GFX11-NEXT:    s_add_i32 s2, s33, 0x400
-; GFX11-NEXT:    s_delay_alu instid0(SALU_CYCLE_1)
-; GFX11-NEXT:    v_mov_b32_e32 v0, s2
+; GFX11-NEXT:    v_dual_mov_b32 v3, v61 :: v_dual_mov_b32 v30, v46
+; GFX11-NEXT:    v_dual_mov_b32 v0, s2 :: v_dual_mov_b32 v1, 42
 ; GFX11-NEXT:    s_swappc_b64 s[30:31], s[0:1]
-; GFX11-NEXT:    s_clause 0xb
-; GFX11-NEXT:    scratch_load_b32 v59, off, s33
-; GFX11-NEXT:    scratch_load_b32 v58, off, s33 offset:4
-; GFX11-NEXT:    scratch_load_b32 v57, off, s33 offset:8
-; GFX11-NEXT:    scratch_load_b32 v56, off, s33 offset:12
-; GFX11-NEXT:    scratch_load_b32 v47, off, s33 offset:16
-; GFX11-NEXT:    scratch_load_b32 v46, off, s33 offset:20
-; GFX11-NEXT:    scratch_load_b32 v45, off, s33 offset:24
-; GFX11-NEXT:    scratch_load_b32 v44, off, s33 offset:28
-; GFX11-NEXT:    scratch_load_b32 v43, off, s33 offset:32
-; GFX11-NEXT:    scratch_load_b32 v42, off, s33 offset:36
-; GFX11-NEXT:    scratch_load_b32 v41, off, s33 offset:40
-; GFX11-NEXT:    scratch_load_b32 v40, off, s33 offset:44
-; GFX11-NEXT:    v_readlane_b32 s31, v60, 1
-; GFX11-NEXT:    v_readlane_b32 s30, v60, 0
+; GFX11-NEXT:    s_clause 0xe
+; GFX11-NEXT:    scratch_load_b32 v62, off, s33
+; GFX11-NEXT:    scratch_load_b32 v61, off, s33 offset:4
+; GFX11-NEXT:    scratch_load_b32 v60, off, s33 offset:8
+; GFX11-NEXT:    scratch_load_b32 v59, off, s33 offset:12
+; GFX11-NEXT:    scratch_load_b32 v58, off, s33 offset:16
+; GFX11-NEXT:    scratch_load_b32 v57, off, s33 offset:20
+; GFX11-NEXT:    scratch_load_b32 v56, off, s33 offset:24
+; GFX11-NEXT:    scratch_load_b32 v47, off, s33 offset:28
+; GFX11-NEXT:    scratch_load_b32 v46, off, s33 offset:32
+; GFX11-NEXT:    scratch_load_b32 v45, off, s33 offset:36
+; GFX11-NEXT:    scratch_load_b32 v44, off, s33 offset:40
+; GFX11-NEXT:    scratch_load_b32 v43, off, s33 offset:44
+; GFX11-NEXT:    scratch_load_b32 v42, off, s33 offset:48
+; GFX11-NEXT:    scratch_load_b32 v41, off, s33 offset:52
+; GFX11-NEXT:    scratch_load_b32 v40, off, s33 offset:56
+; GFX11-NEXT:    v_readlane_b32 s31, v63, 1
+; GFX11-NEXT:    v_readlane_b32 s30, v63, 0
 ; GFX11-NEXT:    s_or_saveexec_b32 s0, -1
-; GFX11-NEXT:    scratch_load_b32 v60, off, s33 offset:1600 ; 4-byte Folded Reload
+; GFX11-NEXT:    scratch_load_b32 v63, off, s33 offset:1584 ; 4-byte Folded Reload
 ; GFX11-NEXT:    s_mov_b32 exec_lo, s0
 ; GFX11-NEXT:    s_addk_i32 s32, 0xf600
 ; GFX11-NEXT:    s_mov_b32 s33, s34
diff --git a/llvm/test/CodeGen/AMDGPU/inflated-reg-class-snippet-copy-inst-reads-lane-subset-use-after-free.mir b/llvm/test/CodeGen/AMDGPU/inflated-reg-class-snippet-copy-inst-reads-lane-subset-use-after-free.mir
new file mode 100644
index 00000000000000..c2adafb5e742aa
--- /dev/null
+++ b/llvm/test/CodeGen/AMDGPU/inflated-reg-class-snippet-copy-inst-reads-lane-subset-use-after-free.mir
@@ -0,0 +1,58 @@
+# RUN: llc -mtriple=amdgcn-amd-amdhsa -mcpu=gfx90a -simplify-mir -start-before=greedy,2 -stress-regalloc=4 -stop-before=virtregrewriter,2 -filetype=null -verify-regalloc %s | FileCheck %s
+
+# This test is similar to
+# inflated-reg-class-snippet-copy-use-after-free.mir, except it is
+# still broken when the use instruction does not read the full set of
+# lanes
+
+--- |
+  define amdgpu_kernel void @inflated_reg_class_copy_use_after_free_lane_subset() {
+    ret void
+  }
+...
+---
+name:            inflated_reg_class_copy_use_after_free_lane_subset
+tracksRegLiveness: true
+machineFunctionInfo:
+  explicitKernArgSize: 8
+  maxKernArgAlign: 8
+  isEntryFunction: true
+  memoryBound:     true
+  waveLimiter:     true
+  scratchRSrcReg:  '$sgpr72_sgpr73_sgpr74_sgpr75'
+  stackPtrOffsetReg: '$sgpr32'
+  returnsVoid:     true
+  occupancy:       7
+  vgprForAGPRCopy: '$vgpr255'
+  sgprForEXECCopy: '$sgpr74_sgpr75'
+  longBranchReservedReg: ''
+body:             |
+  bb.0:
+    liveins: $vgpr0, $sgpr4_sgpr5
+
+    %0:vgpr_32 = IMPLICIT_DEF
+    renamable $sgpr0_sgpr1 = S_LOAD_DWORDX2_IMM killed undef renamable $sgpr4_sgpr5, 0, 0 :: (load (s64), addrspace 4)
+    S_NOP 0, implicit-def undef %1.sub12_sub13_sub14_sub15:vreg_512_align2
+    S_NOP 0, implicit-def %1.sub8_sub9_sub10_sub11:vreg_512_align2
+    S_NOP 0, implicit-def %1.sub4_sub5_sub6_sub7:vreg_512_align2
+    S_NOP 0, implicit-def %1.sub0_sub1_sub2_sub3:vreg_512_align2
+    S_NOP 0, implicit-def early-clobber %2:vreg_512_align2, implicit %1.sub0_sub1_sub2_sub3, implicit %1.sub4_sub5_sub6_sub7
+    %1.sub2:vreg_512_align2 = COPY %2.sub3
+    %1.sub3:vreg_512_align2 = COPY %2.sub2
+    %1.sub4:vreg_512_align2 = COPY %2.sub0
+    %1.sub5:vreg_512_align2 = V_MOV_B32_e32 0, implicit $exec
+    %1.sub6:vreg_512_align2 = V_MOV_B32_e32 0, implicit $exec
+    %1.sub7:vreg_512_align2 = V_MOV_B32_e32 0, implicit $exec
+    %1.sub8:vreg_512_align2 = V_MOV_B32_e32 0, implicit $exec
+    %1.sub9:vreg_512_align2 = V_MOV_B32_e32 0, implicit $exec
+    %1.sub10:vreg_512_align2 = V_MOV_B32_e32 0, implicit $exec
+    %1.sub11:vreg_512_align2 = V_MOV_B32_e32 0, implicit $exec
+    %1.sub12:vreg_512_align2 = V_MOV_B32_e32 0, implicit $exec
+    %1.sub13:vreg_512_align2 = V_MOV_B32_e32 0, implicit $exec
+    %1.sub14:vreg_512_align2 = V_MOV_B32_e32 0, implicit $exec
+    %1.sub15:vreg_512_align2 = V_MOV_B32_e32 0, implicit $exec
+    S_NOP 0, implicit-def %1:vreg_512_align2, implicit %1.sub0_sub1_sub2_sub3, implicit %1.sub4_sub5_sub6_sub7, implicit %1.sub8_sub9_sub10_sub11
+    GLOBAL_STORE_DWORDX4_SADDR undef %3:vgpr_32, %1.sub12_sub13_sub14_sub15, undef renamable $sgpr0_sgpr1, 96, 0, implicit $exec :: (store (s128), addrspace 1)
+    S_ENDPGM 0
+
+...
diff --git a/llvm/test/CodeGen/AMDGPU/inflated-reg-class-snippet-copy-use-after-free.mir b/llvm/test/CodeGen/AMDGPU/inflated-reg-class-snippet-copy-use-after-free.mir
new file mode 100644
index 00000000000000..5422d303958b51
--- /dev/null
+++ b/llvm/test/CodeGen/AMDGPU/inflated-reg-class-snippet-copy-use-after-free.mir
@@ -0,0 +1,111 @@
+# RUN: not llc -mtriple=amdgcn-amd-amdhsa -mcpu=gfx90a -simplify-mir -start-before=greedy,2 -stress-regalloc=4 -stop-before=virtregrewriter,2 -o - -verify-regalloc %s 2> %t.err | FileCheck %s
+# RUN: FileCheck -check-prefix=ERR %s < %t.err
+
+# To allocate the vreg_512_align2, the allocation will attempt to
+# inflate the register class to av_512_align2. This will ultimately
+# not work, and the allocation will fail. Make sure we don't introduce
+# an unproductive live range split of the inflated virtual register
+# which will later hit a use after free.
+
+# ERR: error: <unknown>:0:0: ran out of registers during register allocation in function 'inflated_reg_class_copy_use_after_free'
+
+# CHECK: S_NOP 0, implicit-def %{{[0-9]+}}.sub0_sub1_sub2_sub3
+# CHECK-NEXT: SI_SPILL_AV512_SAVE %{{[0-9]+}}, %stack.0, $sgpr32, 0, implicit $exec :: (store (s512) into %stack.0, align 4, addrspace 5)
+# CHECK-NEXT: %{{[0-9]+}}:vreg_512_align2 = SI_SPILL_V512_RESTORE %stack.0, $sgpr32, 0, implicit $exec :: (load (s512) from %stack.0, align 4, addrspace 5)
+# CHECK-NEXT: early-clobber %{{[0-9]+}}:vreg_512_align2 = V_MFMA_F32_16X16X1F32_vgprcd_e64 undef %3:vgpr_32, undef %3:vgpr_32, %{{[0-9]+}}, 0, 0, 0, implicit $mode, implicit $exec, implicit $mode, implicit $exec
+# CHECK-NEXT: undef %{{[0-9]+}}.sub2_sub3:av_512_align2 = COPY %{{[0-9]+}}.sub2_sub3 {
+# CHECK-NEXT: internal %{{[0-9]+}}.sub0:av_512_align2 = COPY %{{[0-9]+}}.sub0
+# CHECK-NEXT: }
+# CHECK-NEXT: undef %{{[0-9]+}}.sub2_sub3:av_512_align2 = COPY %{{[0-9]+}}.sub2_sub3 {
+# CHECK-NEXT: internal %{{[0-9]+}}.sub0:av_512_align2 = COPY %{{[0-9]+}}.sub0
+# CHECK-NEXT: }
+# CHECK-NEXT: undef %{{[0-9]+}}.sub2_sub3:av_512_align2 = COPY %{{[0-9]+}}.sub2_sub3 {
+# CHECK-NEXT: internal %{{[0-9]+}}.sub0:av_512_align2 = COPY %{{[0-9]+}}.sub0
+# CHECK-NEXT: }
+# CHECK-NEXT: undef %27.sub2_sub3:av_512_align2 = COPY %{{[0-9]+}}.sub2_sub3 {
+# CHECK-NEXT: internal %27.sub0:av_512_align2 = COPY %{{[0-9]+}}.sub0
+# CHECK-NEXT: }
+# CHECK-NEXT: SI_SPILL_AV512_SAVE %27, %stack.1, $sgpr32, 0, implicit $exec :: (store (s512) into %stack.1, align 4, addrspace 5)
+# CHECK-NEXT: %16:vreg_512_align2 = SI_SPILL_V512_RESTORE %stack.0, $sgpr32, 0, implicit $exec :: (load (s512) from %stack.0, align 4, addrspace 5)
+# CHECK-NEXT: undef %{{[0-9]+}}.sub0_sub1:av_512_align2 = COPY %16.sub0_sub1
+# CHECK-NEXT: %28:av_512_align2 = SI_SPILL_AV512_RESTORE %stack.1, $sgpr32, 0, implicit $exec :: (load (s512) from %stack.1, align 4, addrspace 5)
+# CHECK-NEXT: undef %{{[0-9]+}}.sub2_sub3:av_512_align2 = COPY %28.sub2_sub3 {
+# CHECK-NEXT: internal %{{[0-9]+}}.sub0:av_512_align2 = COPY %28.sub0
+# CHECK-NEXT: }
+# CHECK-NEXT: un...
[truncated]

@arsenm arsenm force-pushed the users/arsenm/greedy-fix-subrange-instruction-split-logic branch from 204f93c to 7885f55 Compare December 17, 2024 09:40
Collaborator

@qcolombet qcolombet left a comment


The change doesn't look correct to me, but I may be missing something.

if (SubReg == 0 && MO.isUse()) {
  if (MO.isUndef())
    continue;
  return false;
Collaborator

Here we read the full virtual register; shouldn't we return true?

Contributor Author

No, this is checking for a strict subset.

Collaborator

Ah that makes sense. Thanks for updating the comment of the function.

  }

  // If the live lanes aren't different from the lanes used by the instruction,
  // this doesn't help.
- return (ReadMask & ~(LiveAtMask & TRI->getCoveringLanes())).any();
+ return MRI.getMaxLaneMaskForVReg(VirtReg.reg()) != UseMask;
Collaborator

Instead of an exact match, shouldn't we check for an overlap?
The comment of this method says "reads a subset of the lanes". So any lane read should return true.

Contributor Author

A read of the whole register should be rejected. This must be a strict subset.

@arsenm arsenm force-pushed the users/arsenm/greedy-fix-subrange-instruction-split-logic branch 2 times, most recently from 05e8c2e to d376736 Compare December 20, 2024 08:45
@arsenm arsenm changed the base branch from main to users/arsenm/greedy-only-unassign-after-last-chance-recoloring-if-assigned December 20, 2024 08:45
@arsenm arsenm force-pushed the users/arsenm/greedy-only-unassign-after-last-chance-recoloring-if-assigned branch from 5398c21 to c65ef86 Compare January 3, 2025 05:27
Base automatically changed from users/arsenm/greedy-only-unassign-after-last-chance-recoloring-if-assigned to main January 6, 2025 16:12
Fix the logic for readsLaneSubset. Check at the correct point
for the use operands of the instruction, instead of the result.
Only consider the use register operands, and stop considering
whether the subranges are actually live at this point.

This avoids some unproductive splits. This also happens to avoid
a use after free due to a split of an unspillable register. That
issue still exists if the instruction does not reference the full
set of register lanes.
@arsenm arsenm force-pushed the users/arsenm/greedy-fix-subrange-instruction-split-logic branch from d376736 to 6cf5acc Compare January 6, 2025 16:15