[AMDGPU] Move WWM register pre-allocation to during regalloc #70618
Conversation
@llvm/pr-subscribers-backend-amdgpu
Author: Carl Ritson (perlfu)
Changes: Move the SIPreAllocateWWMRegs pass to just before VGPR allocation. This saves recomputing the virtual register matrix and the live register map, at the cost of a slight regression at O0: live intervals and slot indexes must now be computed.
Patch is 418.23 KiB, truncated to 20.00 KiB below; full version: https://github.com/llvm/llvm-project/pull/70618.diff
11 Files Affected:
diff --git a/llvm/lib/Target/AMDGPU/AMDGPUTargetMachine.cpp b/llvm/lib/Target/AMDGPU/AMDGPUTargetMachine.cpp
index dc7321cd5de9fcd..a049bdf63bb5868 100644
--- a/llvm/lib/Target/AMDGPU/AMDGPUTargetMachine.cpp
+++ b/llvm/lib/Target/AMDGPU/AMDGPUTargetMachine.cpp
@@ -1276,7 +1276,6 @@ void GCNPassConfig::addFastRegAlloc() {
insertPass(&PHIEliminationID, &SILowerControlFlowID);
insertPass(&TwoAddressInstructionPassID, &SIWholeQuadModeID);
- insertPass(&TwoAddressInstructionPassID, &SIPreAllocateWWMRegsID);
TargetPassConfig::addFastRegAlloc();
}
@@ -1285,7 +1284,6 @@ void GCNPassConfig::addOptimizedRegAlloc() {
// Allow the scheduler to run before SIWholeQuadMode inserts exec manipulation
// instructions that cause scheduling barriers.
insertPass(&MachineSchedulerID, &SIWholeQuadModeID);
- insertPass(&MachineSchedulerID, &SIPreAllocateWWMRegsID);
if (OptExecMaskPreRA)
insertPass(&MachineSchedulerID, &SIOptimizeExecMaskingPreRAID);
@@ -1372,6 +1370,7 @@ bool GCNPassConfig::addRegAssignAndRewriteFast() {
// Equivalent of PEI for SGPRs.
addPass(&SILowerSGPRSpillsID);
+ addPass(&SIPreAllocateWWMRegsID);
addPass(createVGPRAllocPass(false));
@@ -1395,6 +1394,7 @@ bool GCNPassConfig::addRegAssignAndRewriteOptimized() {
// Equivalent of PEI for SGPRs.
addPass(&SILowerSGPRSpillsID);
+ addPass(&SIPreAllocateWWMRegsID);
addPass(createVGPRAllocPass(true));
diff --git a/llvm/lib/Target/AMDGPU/SIPreAllocateWWMRegs.cpp b/llvm/lib/Target/AMDGPU/SIPreAllocateWWMRegs.cpp
index c2ddfd7881ab760..fc35bec0edd3bdb 100644
--- a/llvm/lib/Target/AMDGPU/SIPreAllocateWWMRegs.cpp
+++ b/llvm/lib/Target/AMDGPU/SIPreAllocateWWMRegs.cpp
@@ -56,11 +56,9 @@ class SIPreAllocateWWMRegs : public MachineFunctionPass {
void getAnalysisUsage(AnalysisUsage &AU) const override {
AU.addRequired<LiveIntervals>();
- AU.addPreserved<LiveIntervals>();
AU.addRequired<VirtRegMap>();
AU.addRequired<LiveRegMatrix>();
- AU.addPreserved<SlotIndexes>();
- AU.setPreservesCFG();
+ AU.setPreservesAll();
MachineFunctionPass::getAnalysisUsage(AU);
}
diff --git a/llvm/test/CodeGen/AMDGPU/bb-prolog-spill-during-regalloc.ll b/llvm/test/CodeGen/AMDGPU/bb-prolog-spill-during-regalloc.ll
index e22cb912552f970..28780fdf441381c 100644
--- a/llvm/test/CodeGen/AMDGPU/bb-prolog-spill-during-regalloc.ll
+++ b/llvm/test/CodeGen/AMDGPU/bb-prolog-spill-during-regalloc.ll
@@ -11,7 +11,7 @@ define i32 @prolog_spill(i32 %arg0, i32 %arg1, i32 %arg2) {
; REGALLOC-NEXT: renamable $vgpr3 = IMPLICIT_DEF
; REGALLOC-NEXT: SI_SPILL_V32_SAVE killed $vgpr2, %stack.5, $sgpr32, 0, implicit $exec :: (store (s32) into %stack.5, addrspace 5)
; REGALLOC-NEXT: SI_SPILL_V32_SAVE killed $vgpr1, %stack.4, $sgpr32, 0, implicit $exec :: (store (s32) into %stack.4, addrspace 5)
- ; REGALLOC-NEXT: renamable $vgpr1 = COPY killed $vgpr0
+ ; REGALLOC-NEXT: renamable $vgpr1 = COPY $vgpr0
; REGALLOC-NEXT: $vgpr0 = SI_SPILL_WWM_V32_RESTORE %stack.2, $sgpr32, 0, implicit $exec :: (load (s32) from %stack.2, addrspace 5)
; REGALLOC-NEXT: renamable $sgpr4 = S_MOV_B32 49
; REGALLOC-NEXT: renamable $sgpr4_sgpr5 = V_CMP_GT_I32_e64 killed $vgpr1, killed $sgpr4, implicit $exec
diff --git a/llvm/test/CodeGen/AMDGPU/global_atomics_scan_fadd.ll b/llvm/test/CodeGen/AMDGPU/global_atomics_scan_fadd.ll
index 429bdd805ec5e11..7445dca09453eb2 100644
--- a/llvm/test/CodeGen/AMDGPU/global_atomics_scan_fadd.ll
+++ b/llvm/test/CodeGen/AMDGPU/global_atomics_scan_fadd.ll
@@ -725,36 +725,36 @@ define amdgpu_kernel void @global_atomic_fadd_uni_address_div_value_agent_scope_
; GFX9-DPP-NEXT: s_swappc_b64 s[30:31], s[16:17]
; GFX9-DPP-NEXT: v_mbcnt_lo_u32_b32 v1, exec_lo, 0
; GFX9-DPP-NEXT: v_mbcnt_hi_u32_b32 v1, exec_hi, v1
-; GFX9-DPP-NEXT: v_mov_b32_e32 v3, v0
+; GFX9-DPP-NEXT: v_mov_b32_e32 v40, v0
; GFX9-DPP-NEXT: s_not_b64 exec, exec
-; GFX9-DPP-NEXT: v_bfrev_b32_e32 v3, 1
+; GFX9-DPP-NEXT: v_bfrev_b32_e32 v40, 1
; GFX9-DPP-NEXT: s_not_b64 exec, exec
; GFX9-DPP-NEXT: s_or_saveexec_b64 s[0:1], -1
-; GFX9-DPP-NEXT: v_bfrev_b32_e32 v5, 1
-; GFX9-DPP-NEXT: v_bfrev_b32_e32 v4, 1
+; GFX9-DPP-NEXT: v_bfrev_b32_e32 v42, 1
+; GFX9-DPP-NEXT: v_bfrev_b32_e32 v41, 1
; GFX9-DPP-NEXT: s_nop 0
-; GFX9-DPP-NEXT: v_mov_b32_dpp v5, v3 row_shr:1 row_mask:0xf bank_mask:0xf
-; GFX9-DPP-NEXT: v_add_f32_e32 v3, v3, v5
-; GFX9-DPP-NEXT: v_bfrev_b32_e32 v5, 1
+; GFX9-DPP-NEXT: v_mov_b32_dpp v42, v40 row_shr:1 row_mask:0xf bank_mask:0xf
+; GFX9-DPP-NEXT: v_add_f32_e32 v40, v40, v42
+; GFX9-DPP-NEXT: v_bfrev_b32_e32 v42, 1
; GFX9-DPP-NEXT: s_nop 1
-; GFX9-DPP-NEXT: v_mov_b32_dpp v5, v3 row_shr:2 row_mask:0xf bank_mask:0xf
-; GFX9-DPP-NEXT: v_add_f32_e32 v3, v3, v5
-; GFX9-DPP-NEXT: v_bfrev_b32_e32 v5, 1
+; GFX9-DPP-NEXT: v_mov_b32_dpp v42, v40 row_shr:2 row_mask:0xf bank_mask:0xf
+; GFX9-DPP-NEXT: v_add_f32_e32 v40, v40, v42
+; GFX9-DPP-NEXT: v_bfrev_b32_e32 v42, 1
; GFX9-DPP-NEXT: s_nop 1
-; GFX9-DPP-NEXT: v_mov_b32_dpp v5, v3 row_shr:4 row_mask:0xf bank_mask:0xf
-; GFX9-DPP-NEXT: v_add_f32_e32 v3, v3, v5
-; GFX9-DPP-NEXT: v_bfrev_b32_e32 v5, 1
+; GFX9-DPP-NEXT: v_mov_b32_dpp v42, v40 row_shr:4 row_mask:0xf bank_mask:0xf
+; GFX9-DPP-NEXT: v_add_f32_e32 v40, v40, v42
+; GFX9-DPP-NEXT: v_bfrev_b32_e32 v42, 1
; GFX9-DPP-NEXT: s_nop 1
-; GFX9-DPP-NEXT: v_mov_b32_dpp v5, v3 row_shr:8 row_mask:0xf bank_mask:0xf
-; GFX9-DPP-NEXT: v_add_f32_e32 v3, v3, v5
-; GFX9-DPP-NEXT: v_bfrev_b32_e32 v5, 1
+; GFX9-DPP-NEXT: v_mov_b32_dpp v42, v40 row_shr:8 row_mask:0xf bank_mask:0xf
+; GFX9-DPP-NEXT: v_add_f32_e32 v40, v40, v42
+; GFX9-DPP-NEXT: v_bfrev_b32_e32 v42, 1
; GFX9-DPP-NEXT: s_nop 1
-; GFX9-DPP-NEXT: v_mov_b32_dpp v5, v3 row_bcast:15 row_mask:0xa bank_mask:0xf
-; GFX9-DPP-NEXT: v_add_f32_e32 v3, v3, v5
+; GFX9-DPP-NEXT: v_mov_b32_dpp v42, v40 row_bcast:15 row_mask:0xa bank_mask:0xf
+; GFX9-DPP-NEXT: v_add_f32_e32 v40, v40, v42
; GFX9-DPP-NEXT: s_nop 1
-; GFX9-DPP-NEXT: v_mov_b32_dpp v4, v3 row_bcast:31 row_mask:0xc bank_mask:0xf
-; GFX9-DPP-NEXT: v_add_f32_e32 v3, v3, v4
-; GFX9-DPP-NEXT: v_readlane_b32 s4, v3, 63
+; GFX9-DPP-NEXT: v_mov_b32_dpp v41, v40 row_bcast:31 row_mask:0xc bank_mask:0xf
+; GFX9-DPP-NEXT: v_add_f32_e32 v40, v40, v41
+; GFX9-DPP-NEXT: v_readlane_b32 s4, v40, 63
; GFX9-DPP-NEXT: s_mov_b64 exec, s[0:1]
; GFX9-DPP-NEXT: v_cmp_eq_u32_e32 vcc, 0, v1
; GFX9-DPP-NEXT: s_and_saveexec_b64 s[0:1], vcc
@@ -809,50 +809,50 @@ define amdgpu_kernel void @global_atomic_fadd_uni_address_div_value_agent_scope_
; GFX1064-DPP-NEXT: s_waitcnt lgkmcnt(0)
; GFX1064-DPP-NEXT: s_swappc_b64 s[30:31], s[16:17]
; GFX1064-DPP-NEXT: s_or_saveexec_b64 s[0:1], -1
-; GFX1064-DPP-NEXT: v_bfrev_b32_e32 v3, 1
+; GFX1064-DPP-NEXT: v_bfrev_b32_e32 v40, 1
; GFX1064-DPP-NEXT: s_mov_b64 exec, s[0:1]
-; GFX1064-DPP-NEXT: v_mov_b32_e32 v4, v0
+; GFX1064-DPP-NEXT: v_mov_b32_e32 v41, v0
; GFX1064-DPP-NEXT: s_not_b64 exec, exec
-; GFX1064-DPP-NEXT: v_bfrev_b32_e32 v4, 1
+; GFX1064-DPP-NEXT: v_bfrev_b32_e32 v41, 1
; GFX1064-DPP-NEXT: s_not_b64 exec, exec
; GFX1064-DPP-NEXT: s_or_saveexec_b64 s[0:1], -1
-; GFX1064-DPP-NEXT: v_mov_b32_dpp v3, v4 row_xmask:1 row_mask:0xf bank_mask:0xf
-; GFX1064-DPP-NEXT: v_bfrev_b32_e32 v5, 1
-; GFX1064-DPP-NEXT: v_add_f32_e32 v3, v4, v3
-; GFX1064-DPP-NEXT: v_bfrev_b32_e32 v4, 1
-; GFX1064-DPP-NEXT: v_mov_b32_dpp v5, v3 row_xmask:2 row_mask:0xf bank_mask:0xf
-; GFX1064-DPP-NEXT: v_add_f32_e32 v3, v3, v5
-; GFX1064-DPP-NEXT: v_bfrev_b32_e32 v5, 1
-; GFX1064-DPP-NEXT: v_mov_b32_dpp v4, v3 row_xmask:4 row_mask:0xf bank_mask:0xf
-; GFX1064-DPP-NEXT: v_add_f32_e32 v3, v3, v4
-; GFX1064-DPP-NEXT: v_mov_b32_dpp v5, v3 row_xmask:8 row_mask:0xf bank_mask:0xf
-; GFX1064-DPP-NEXT: v_add_f32_e32 v3, v3, v5
-; GFX1064-DPP-NEXT: v_mov_b32_e32 v4, v3
-; GFX1064-DPP-NEXT: v_permlanex16_b32 v4, v4, -1, -1
-; GFX1064-DPP-NEXT: v_add_f32_e32 v3, v3, v4
-; GFX1064-DPP-NEXT: v_readlane_b32 s2, v3, 0
-; GFX1064-DPP-NEXT: v_readlane_b32 s3, v3, 32
+; GFX1064-DPP-NEXT: v_mov_b32_dpp v40, v41 row_xmask:1 row_mask:0xf bank_mask:0xf
+; GFX1064-DPP-NEXT: v_bfrev_b32_e32 v42, 1
+; GFX1064-DPP-NEXT: v_add_f32_e32 v40, v41, v40
+; GFX1064-DPP-NEXT: v_bfrev_b32_e32 v41, 1
+; GFX1064-DPP-NEXT: v_mov_b32_dpp v42, v40 row_xmask:2 row_mask:0xf bank_mask:0xf
+; GFX1064-DPP-NEXT: v_add_f32_e32 v40, v40, v42
+; GFX1064-DPP-NEXT: v_bfrev_b32_e32 v42, 1
+; GFX1064-DPP-NEXT: v_mov_b32_dpp v41, v40 row_xmask:4 row_mask:0xf bank_mask:0xf
+; GFX1064-DPP-NEXT: v_add_f32_e32 v40, v40, v41
+; GFX1064-DPP-NEXT: v_mov_b32_dpp v42, v40 row_xmask:8 row_mask:0xf bank_mask:0xf
+; GFX1064-DPP-NEXT: v_add_f32_e32 v40, v40, v42
+; GFX1064-DPP-NEXT: v_mov_b32_e32 v41, v40
+; GFX1064-DPP-NEXT: v_permlanex16_b32 v41, v41, -1, -1
+; GFX1064-DPP-NEXT: v_add_f32_e32 v40, v40, v41
+; GFX1064-DPP-NEXT: v_readlane_b32 s2, v40, 0
+; GFX1064-DPP-NEXT: v_readlane_b32 s3, v40, 32
; GFX1064-DPP-NEXT: s_mov_b64 exec, s[0:1]
; GFX1064-DPP-NEXT: v_mbcnt_lo_u32_b32 v0, exec_lo, 0
; GFX1064-DPP-NEXT: s_or_saveexec_b64 s[0:1], -1
-; GFX1064-DPP-NEXT: v_add_f32_e64 v3, s2, s3
+; GFX1064-DPP-NEXT: v_add_f32_e64 v40, s2, s3
; GFX1064-DPP-NEXT: s_mov_b64 exec, s[0:1]
; GFX1064-DPP-NEXT: v_mbcnt_hi_u32_b32 v0, exec_hi, v0
-; GFX1064-DPP-NEXT: v_mov_b32_e32 v2, v3
+; GFX1064-DPP-NEXT: v_mov_b32_e32 v2, v40
; GFX1064-DPP-NEXT: v_cmp_eq_u32_e32 vcc, 0, v0
; GFX1064-DPP-NEXT: s_and_saveexec_b64 s[0:1], vcc
; GFX1064-DPP-NEXT: s_cbranch_execz .LBB1_3
; GFX1064-DPP-NEXT: ; %bb.1:
; GFX1064-DPP-NEXT: s_load_dwordx2 s[0:1], s[34:35], 0x24
-; GFX1064-DPP-NEXT: v_mov_b32_e32 v6, 0
+; GFX1064-DPP-NEXT: v_mov_b32_e32 v3, 0
; GFX1064-DPP-NEXT: s_mov_b64 s[2:3], 0
; GFX1064-DPP-NEXT: s_waitcnt lgkmcnt(0)
-; GFX1064-DPP-NEXT: global_load_dword v1, v6, s[0:1]
+; GFX1064-DPP-NEXT: global_load_dword v1, v3, s[0:1]
; GFX1064-DPP-NEXT: .LBB1_2: ; %atomicrmw.start
; GFX1064-DPP-NEXT: ; =>This Inner Loop Header: Depth=1
; GFX1064-DPP-NEXT: s_waitcnt vmcnt(0)
; GFX1064-DPP-NEXT: v_add_f32_e32 v0, v1, v2
-; GFX1064-DPP-NEXT: global_atomic_cmpswap v0, v6, v[0:1], s[0:1] glc
+; GFX1064-DPP-NEXT: global_atomic_cmpswap v0, v3, v[0:1], s[0:1] glc
; GFX1064-DPP-NEXT: s_waitcnt vmcnt(0)
; GFX1064-DPP-NEXT: v_cmp_eq_u32_e32 vcc, v0, v1
; GFX1064-DPP-NEXT: v_mov_b32_e32 v1, v0
@@ -892,44 +892,44 @@ define amdgpu_kernel void @global_atomic_fadd_uni_address_div_value_agent_scope_
; GFX1032-DPP-NEXT: s_waitcnt lgkmcnt(0)
; GFX1032-DPP-NEXT: s_swappc_b64 s[30:31], s[16:17]
; GFX1032-DPP-NEXT: s_or_saveexec_b32 s0, -1
-; GFX1032-DPP-NEXT: v_bfrev_b32_e32 v3, 1
+; GFX1032-DPP-NEXT: v_bfrev_b32_e32 v40, 1
; GFX1032-DPP-NEXT: s_mov_b32 exec_lo, s0
-; GFX1032-DPP-NEXT: v_mov_b32_e32 v4, v0
+; GFX1032-DPP-NEXT: v_mov_b32_e32 v41, v0
; GFX1032-DPP-NEXT: s_not_b32 exec_lo, exec_lo
-; GFX1032-DPP-NEXT: v_bfrev_b32_e32 v4, 1
+; GFX1032-DPP-NEXT: v_bfrev_b32_e32 v41, 1
; GFX1032-DPP-NEXT: s_not_b32 exec_lo, exec_lo
; GFX1032-DPP-NEXT: s_or_saveexec_b32 s0, -1
-; GFX1032-DPP-NEXT: v_mov_b32_dpp v3, v4 row_xmask:1 row_mask:0xf bank_mask:0xf
-; GFX1032-DPP-NEXT: v_bfrev_b32_e32 v5, 1
-; GFX1032-DPP-NEXT: v_add_f32_e32 v3, v4, v3
-; GFX1032-DPP-NEXT: v_bfrev_b32_e32 v4, 1
-; GFX1032-DPP-NEXT: v_mov_b32_dpp v5, v3 row_xmask:2 row_mask:0xf bank_mask:0xf
-; GFX1032-DPP-NEXT: v_add_f32_e32 v3, v3, v5
-; GFX1032-DPP-NEXT: v_bfrev_b32_e32 v5, 1
-; GFX1032-DPP-NEXT: v_mov_b32_dpp v4, v3 row_xmask:4 row_mask:0xf bank_mask:0xf
-; GFX1032-DPP-NEXT: v_add_f32_e32 v3, v3, v4
-; GFX1032-DPP-NEXT: v_mov_b32_dpp v5, v3 row_xmask:8 row_mask:0xf bank_mask:0xf
-; GFX1032-DPP-NEXT: v_add_f32_e32 v3, v3, v5
-; GFX1032-DPP-NEXT: v_mov_b32_e32 v4, v3
-; GFX1032-DPP-NEXT: v_permlanex16_b32 v4, v4, -1, -1
-; GFX1032-DPP-NEXT: v_add_f32_e32 v3, v3, v4
+; GFX1032-DPP-NEXT: v_mov_b32_dpp v40, v41 row_xmask:1 row_mask:0xf bank_mask:0xf
+; GFX1032-DPP-NEXT: v_bfrev_b32_e32 v42, 1
+; GFX1032-DPP-NEXT: v_add_f32_e32 v40, v41, v40
+; GFX1032-DPP-NEXT: v_bfrev_b32_e32 v41, 1
+; GFX1032-DPP-NEXT: v_mov_b32_dpp v42, v40 row_xmask:2 row_mask:0xf bank_mask:0xf
+; GFX1032-DPP-NEXT: v_add_f32_e32 v40, v40, v42
+; GFX1032-DPP-NEXT: v_bfrev_b32_e32 v42, 1
+; GFX1032-DPP-NEXT: v_mov_b32_dpp v41, v40 row_xmask:4 row_mask:0xf bank_mask:0xf
+; GFX1032-DPP-NEXT: v_add_f32_e32 v40, v40, v41
+; GFX1032-DPP-NEXT: v_mov_b32_dpp v42, v40 row_xmask:8 row_mask:0xf bank_mask:0xf
+; GFX1032-DPP-NEXT: v_add_f32_e32 v40, v40, v42
+; GFX1032-DPP-NEXT: v_mov_b32_e32 v41, v40
+; GFX1032-DPP-NEXT: v_permlanex16_b32 v41, v41, -1, -1
+; GFX1032-DPP-NEXT: v_add_f32_e32 v40, v40, v41
; GFX1032-DPP-NEXT: s_mov_b32 exec_lo, s0
; GFX1032-DPP-NEXT: v_mbcnt_lo_u32_b32 v0, exec_lo, 0
-; GFX1032-DPP-NEXT: v_mov_b32_e32 v2, v3
+; GFX1032-DPP-NEXT: v_mov_b32_e32 v2, v40
; GFX1032-DPP-NEXT: s_mov_b32 s2, 0
; GFX1032-DPP-NEXT: v_cmp_eq_u32_e32 vcc_lo, 0, v0
; GFX1032-DPP-NEXT: s_and_saveexec_b32 s0, vcc_lo
; GFX1032-DPP-NEXT: s_cbranch_execz .LBB1_3
; GFX1032-DPP-NEXT: ; %bb.1:
; GFX1032-DPP-NEXT: s_load_dwordx2 s[0:1], s[34:35], 0x24
-; GFX1032-DPP-NEXT: v_mov_b32_e32 v6, 0
+; GFX1032-DPP-NEXT: v_mov_b32_e32 v3, 0
; GFX1032-DPP-NEXT: s_waitcnt lgkmcnt(0)
-; GFX1032-DPP-NEXT: global_load_dword v1, v6, s[0:1]
+; GFX1032-DPP-NEXT: global_load_dword v1, v3, s[0:1]
; GFX1032-DPP-NEXT: .LBB1_2: ; %atomicrmw.start
; GFX1032-DPP-NEXT: ; =>This Inner Loop Header: Depth=1
; GFX1032-DPP-NEXT: s_waitcnt vmcnt(0)
; GFX1032-DPP-NEXT: v_add_f32_e32 v0, v1, v2
-; GFX1032-DPP-NEXT: global_atomic_cmpswap v0, v6, v[0:1], s[0:1] glc
+; GFX1032-DPP-NEXT: global_atomic_cmpswap v0, v3, v[0:1], s[0:1] glc
; GFX1032-DPP-NEXT: s_waitcnt vmcnt(0)
; GFX1032-DPP-NEXT: v_cmp_eq_u32_e32 vcc_lo, v0, v1
; GFX1032-DPP-NEXT: v_mov_b32_e32 v1, v0
@@ -959,53 +959,53 @@ define amdgpu_kernel void @global_atomic_fadd_uni_address_div_value_agent_scope_
; GFX1164-DPP-NEXT: s_waitcnt lgkmcnt(0)
; GFX1164-DPP-NEXT: s_swappc_b64 s[30:31], s[16:17]
; GFX1164-DPP-NEXT: s_or_saveexec_b64 s[0:1], -1
-; GFX1164-DPP-NEXT: v_bfrev_b32_e32 v1, 1
+; GFX1164-DPP-NEXT: v_bfrev_b32_e32 v40, 1
; GFX1164-DPP-NEXT: s_mov_b64 exec, s[0:1]
-; GFX1164-DPP-NEXT: v_mov_b32_e32 v2, v0
+; GFX1164-DPP-NEXT: v_mov_b32_e32 v41, v0
; GFX1164-DPP-NEXT: s_not_b64 exec, exec
-; GFX1164-DPP-NEXT: v_bfrev_b32_e32 v2, 1
+; GFX1164-DPP-NEXT: v_bfrev_b32_e32 v41, 1
; GFX1164-DPP-NEXT: s_not_b64 exec, exec
; GFX1164-DPP-NEXT: s_or_saveexec_b64 s[0:1], -1
; GFX1164-DPP-NEXT: s_waitcnt_depctr 0xfff
-; GFX1164-DPP-NEXT: v_mov_b32_dpp v1, v2 row_xmask:1 row_mask:0xf bank_mask:0xf
-; GFX1164-DPP-NEXT: v_bfrev_b32_e32 v3, 1
+; GFX1164-DPP-NEXT: v_mov_b32_dpp v40, v41 row_xmask:1 row_mask:0xf bank_mask:0xf
+; GFX1164-DPP-NEXT: v_bfrev_b32_e32 v42, 1
; GFX1164-DPP-NEXT: s_delay_alu instid0(VALU_DEP_2) | instskip(SKIP_1) | instid1(VALU_DEP_2)
-; GFX1164-DPP-NEXT: v_add_f32_e32 v1, v2, v1
-; GFX1164-DPP-NEXT: v_bfrev_b32_e32 v2, 1
-; GFX1164-DPP-NEXT: v_mov_b32_dpp v3, v1 row_xmask:2 row_mask:0xf bank_mask:0xf
+; GFX1164-DPP-NEXT: v_add_f32_e32 v40, v41, v40
+; GFX1164-DPP-NEXT: v_bfrev_b32_e32 v41, 1
+; GFX1164-DPP-NEXT: v_mov_b32_dpp v42, v40 row_xmask:2 row_mask:0xf bank_mask:0xf
; GFX1164-DPP-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(SKIP_1) | instid1(VALU_DEP_2)
-; GFX1164-DPP-NEXT: v_add_f32_e32 v1, v1, v3
-; GFX1164-DPP-NEXT: v_bfrev_b32_e32 v3, 1
-; GFX1164-DPP-NEXT: v_mov_b32_dpp v2, v1 row_xmask:4 row_mask:0xf bank_mask:0xf
+; GFX1164-DPP-NEXT: v_add_f32_e32 v40, v40, v42
+; GFX1164-DPP-NEXT: v_bfrev_b32_e32 v42, 1
+; GFX1164-DPP-NEXT: v_mov_b32_dpp v41, v40 row_xmask:4 row_mask:0xf bank_mask:0xf
; GFX1164-DPP-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
-; GFX1164-DPP-NEXT: v_add_f32_e32 v1, v1, v2
-; GFX1164-DPP-NEXT: v_mov_b32_dpp v3, v1 row_xmask:8 row_mask:0xf bank_mask:0xf
+; GFX1164-DPP-NEXT: v_add_f32_e32 v40, v40, v41
+; GFX1164-DPP-NEXT: v_mov_b32_dpp v42, v40 row_xmask:8 row_mask:0xf bank_mask:0xf
; GFX1164-DPP-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
-; GFX1164-DPP-NEXT: v_add_f32_e32 v1, v1, v3
-; GFX1164-DPP-NEXT: v_mov_b32_e32 v2, v1
+; GFX1164-DPP-NEXT: v_add_f32_e32 v40, v40, v42
+; GFX1164-DPP-NEXT: v_mov_b32_e32 v41, v40
; GFX1164-DPP-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
-; GFX1164-DPP-NEXT: v_permlanex16_b32 v2, v2, -1, -1
-; GFX1164-DPP-NEXT: v_add_f32_e32 v1, v1, v2
+; GFX1164-DPP-NEXT: v_permlanex16_b32 v41, v41, -1, -1
+; GFX1164-DPP-NEXT: v_add_f32_e32 v40, v40, v41
; GFX1164-DPP-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(SKIP_1) | instid1(SALU_CYCLE_1)
-; GFX1164-DPP-NEXT: v_permlane64_b32 v2, v1
+; GFX1164-DPP-NEXT: v_permlane64_b32 v41, v40
; GFX1164-DPP-NEXT: s_mov_b64 exec, s[0:1]
; GFX1164-DPP-NEXT: v_mbcnt_lo_u32_b32 v0, exec_lo, 0
; GFX1164-DPP-NEXT: s_or_saveexec_b64 s[0:1], -1
; GFX1164-DPP-NEXT: s_delay_alu instid0(VALU_DEP_2)
-; GFX1164-DPP-NEXT: v_add_f32_e32 v1, v1, v2
+; GFX1164-DPP-NEXT: v_add_f32_e32 v40, v40, v41
; GFX1164-DPP-NEXT: s_mov_b64 exec, s[0:1]
; GFX1164-DPP-NEXT: s_delay_alu instid0(VALU_DEP_2) | instid1(SALU_CYCLE_1)
-; GFX1164-DPP-NEXT: v_mbcnt_hi_u32_b32 v4, exec_hi, v0
+; GFX1164-DPP-NEXT: v_mbcnt_hi_u32_b32 v1, exec_hi, v0
; GFX1164-DPP-NEXT: s_delay_alu instid0(VALU_DEP_2) | instskip(SKIP_1) | instid1(VALU_DEP_2)
-; GFX1164-DPP-NEXT: v_mov_b32_e32 v0, v1
+; GFX1164-DPP-NEXT: v_mov_b32_e32 v0, v40
; GFX1164-DPP-NEXT: s_mov_b64 s[0:1], exec
-; GFX1164-DPP-NEXT: v_cmpx_eq_u32_e32 0, v4
+; GFX1164-DPP-NEXT: v_cmpx_eq_u32_e32 0, v1
; GFX1164-DPP-NEXT: s_cbranch_execz .LBB1_2
; GFX1164-DPP-NEXT: ; %bb.1:
; GFX1164-DPP-NEXT: s_load_b64 s[0:1], s[34:35], 0x24
-; GFX1164-DPP-NEXT: v_mov_b32_e32 v4, 0
+; GFX1164-DPP-NEXT: v_mov_b32_e32 v1, 0
; GFX1164-DPP-NEXT: s_waitcnt lgkmcnt(0)
-; GFX1164-DPP-NEXT: global_atomic_add_f32 v4, v0, s[0:1]
+; GFX1164-DPP-NEXT: global_atomic_add_f32 v1, v0, s[0:1]
; GFX1164-DPP-NEXT: .LBB1_2:
; GFX1164-DPP-NEXT: s_nop 0
; GFX1164-DPP-NEXT: s_sendmsg sendmsg(MSG_DEALLOC_VGPRS)
@@ -1031,45 +1031,45 @@ define amdgpu_kernel void @global_atomic_fadd_uni_address_div_value_agent_scope_
; GFX1132-DPP-NEXT: s_waitcnt lgkmcnt(0)
; GFX1132-DPP-NEXT: s_swappc_b64 s[30:31], s[16:17]
; GFX1132-DPP-NEXT: s_or_saveexec_b32 s0, -1
-; GFX1132-DPP-NEXT: v_bfrev_b32_e32 v1, 1
+; GFX1132-DPP-NEXT: v_bfrev_b32_e32 v40, 1
; GFX1132-DPP-NEXT: s_mov_b32 exec_lo, s0
-; GFX1132-DPP-NEXT: v_mov_b32_e32 v2, v0
+; GFX1132-DPP-NEXT: v_mov_b32_e32 v41, v0
; GFX1132-DPP-NEXT: s_not_b32 exec_lo, exec_lo
-; GFX1132-DPP-NEXT: v_bfrev_b32_e32 v2, 1
+; GFX1132-DPP-NEXT: v_bfrev_b32_e32 v41, 1
; GFX1132-DPP-NEXT: s_not_b32 exec_lo, exec_lo
; GFX1132-DPP-NEXT: s_or_saveexec_b32 s0, -1
; GFX1132-DPP-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(SKIP_1) | instid1(VALU_DEP_2)
-; GFX1132-DPP-NEXT: v_mov_b32_dpp v1, v2 row_xmask:1 row_mask:0xf bank_mask:0xf
-; GFX1132-DPP-NEXT: v_bfrev_b32_e32 v3, 1
-; GFX1132-DPP-NEXT: v_add_f32_e32 v1, v2, v1
-; GFX1132-DPP-NEXT: v_bfrev_b32_e32 v2, 1
+; GFX1132-DPP-NEXT: v_mov_b32_dpp v40, v41 row_xmask:1 row_mask:0xf bank_mask:0xf
+; GFX1132-DPP-NEXT: v_bfrev_b32_e32 v42, 1
+; GFX1132-DPP-NEXT: v_add_f32_e32 v40, v41, v40
+; GFX1132-DPP-NEXT: v_bfrev_b32_e32 v41, 1
; GFX1132-DPP-NEXT: s_delay_alu instid0(VALU_DEP_2) | instskip(NEXT) | instid1(VALU_DEP_1)
-; GFX1132-DPP-NEXT: v_mov_b32_dpp v3, v1 row_xmask:2 row_mask:0xf bank_mask:0xf
-; GFX1132-DPP-NEXT: v_add_f32_e32 v1, v1, v3
-; GFX1132-DPP-NEXT: v_bfrev_b32_e32 v3, 1
+; GFX1132-DPP-NEXT: v_mov_b32_dpp v42, v40 row_xmask:2 row_mask:0xf bank_mask:0xf
+; GFX1132-DPP-...
[truncated]
Really the allocator should be handling this, so as we move towards that point it makes sense to move where this runs.
We are not concerned about clobbers from calls, as these will be spilled correctly; they were never considered before.
Force-pushed from c116901 to 8f454eb.
I realized that we need to explicitly ignore SI_SPILL_S32_TO_VGPR instructions which occur in STRICT_WWM sections; otherwise they will unintentionally be assigned WWM registers. This should now be good to go.
If there is no further feedback I will merge this on November 6th.
Change looks good overall but I don't quite understand this:
Do SI_SPILL_S32_TO_VGPR instructions refer to physical VGPRs? If so, why do you need to explicitly ignore them in this pass? If SI_SPILL_S32_TO_VGPR instructions refer to virtual VGPRs then I don't understand how they "get first access to [physical?] VGPRs".
SI_SPILL_S32_TO_VGPR can take a physical or virtual VGPR. So at a high level we currently have two independent code paths which can allocate physical registers for WWM use.
One correction. I always wanted the SGPR spill lowering to use virtual VGPRs here so that RA can allocate them efficiently. Currently, we use physical registers for CSR SGPR spills inserted at the prolog/epilog and virtual registers for the rest of the spills (the ones introduced during SGPR regalloc). This is to ensure that the CSR spills get VGPRs that won't be reused by RA for any other live ranges in the function, which in turn ensures the CSR spills get static CFI entries. We earlier encountered an issue in rocgdb during frame unwinding where the CFI entries for the CSR spills were broken. Skipping AMDGPU::SI_SPILL_S32_TO_VGPR and AMDGPU::SI_RESTORE_S32_FROM_VGPR here is the right thing, because they are meant to be allocated during RA.
SI_RESTORE_S32_FROM_VGPR is not included here because it does not define any VGPRs, so WWM pre-allocate will ignore it as-is.
Ok.