[AMDGPU] Disable VALU sinking and hoisting with WWM #123124

Open
wants to merge 1 commit into main
Conversation

rampitec
Collaborator

Machine LICM can hoist a VALU instruction out of a WWM region.
In this case the WQM pass has to create yet another WWM region
around the hoisted instruction, which is not desired.

Unfortunately, we cannot tell whether an instruction is in a WWM
region, so this patch disables hoisting if WWM is used in the
function.

This works around the bug SWDEV-502411.
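For illustration, a hypothetical IR shape for the failure mode (function and value names are illustrative, not taken from the patch's tests): the inactive-lane value materialized for `set.inactive` is loop-invariant, so after instruction selection Machine LICM can hoist that VALU mov out of the loop, and the WQM pass must then wrap the hoisted instruction in a second exec-manipulating WWM region.

```llvm
; Hypothetical reduction of the failure mode. The v_mov_b32 that
; materializes the inactive-lane value (0) for set.inactive lives inside
; the WWM region; it is loop-invariant, so Machine LICM may hoist it out
; of the loop, and SIWholeQuadMode then has to build another WWM region
; around the hoisted instruction.
define amdgpu_ps void @hoist_from_wwm(ptr addrspace(1) %out, i32 %n) {
entry:
  br label %loop

loop:
  %i = phi i32 [ 0, %entry ], [ %i.next, %loop ]
  %inactive = call i32 @llvm.amdgcn.set.inactive.i32(i32 %i, i32 0)
  %wwm = call i32 @llvm.amdgcn.strict.wwm.i32(i32 %inactive)
  %gep = getelementptr i32, ptr addrspace(1) %out, i32 %i
  store i32 %wwm, ptr addrspace(1) %gep
  %i.next = add i32 %i, 1
  %cc = icmp ult i32 %i.next, %n
  br i1 %cc, label %loop, label %exit

exit:
  ret void
}

declare i32 @llvm.amdgcn.set.inactive.i32(i32, i32)
declare i32 @llvm.amdgcn.strict.wwm.i32(i32)
```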

Collaborator Author

rampitec commented Jan 15, 2025

@rampitec rampitec requested review from jayfoad and perlfu January 15, 2025 21:53
@rampitec rampitec marked this pull request as ready for review January 15, 2025 22:37
@llvmbot
Member

llvmbot commented Jan 15, 2025

@llvm/pr-subscribers-backend-amdgpu

Author: Stanislav Mekhanoshin (rampitec)

Changes

Machine LICM can hoist a VALU instruction out of a WWM region.
In this case the WQM pass has to create yet another WWM region
around the hoisted instruction, which is not desired.

Unfortunately, we cannot tell whether an instruction is in a WWM
region, so this patch disables hoisting if WWM is used in the
function.

This works around the bug SWDEV-502411.


Patch is 1.34 MiB, truncated to 20.00 KiB below, full version: https://github.com/llvm/llvm-project/pull/123124.diff

14 Files Affected:

  • (modified) llvm/lib/Target/AMDGPU/AMDGPUISelDAGToDAG.cpp (+3)
  • (modified) llvm/lib/Target/AMDGPU/AMDGPUInstructionSelector.cpp (+5-1)
  • (modified) llvm/lib/Target/AMDGPU/SIInstrInfo.cpp (+5-1)
  • (modified) llvm/test/CodeGen/AMDGPU/atomic_optimizations_global_pointer.ll (+700-680)
  • (modified) llvm/test/CodeGen/AMDGPU/atomic_optimizations_local_pointer.ll (+3067-2967)
  • (modified) llvm/test/CodeGen/AMDGPU/cse-convergent.ll (+11-9)
  • (modified) llvm/test/CodeGen/AMDGPU/global_atomics_scan_fadd.ll (+1138-1077)
  • (modified) llvm/test/CodeGen/AMDGPU/global_atomics_scan_fmax.ll (+792-745)
  • (modified) llvm/test/CodeGen/AMDGPU/global_atomics_scan_fmin.ll (+792-745)
  • (modified) llvm/test/CodeGen/AMDGPU/global_atomics_scan_fsub.ll (+1179-1110)
  • (modified) llvm/test/CodeGen/AMDGPU/set-inactive-wwm-overwrite.ll (+11-10)
  • (modified) llvm/test/CodeGen/AMDGPU/should-not-hoist-set-inactive.ll (+4-4)
  • (modified) llvm/test/CodeGen/AMDGPU/wwm-reserved-spill.ll (+12-10)
  • (modified) llvm/test/CodeGen/AMDGPU/wwm-reserved.ll (+20-20)
diff --git a/llvm/lib/Target/AMDGPU/AMDGPUISelDAGToDAG.cpp b/llvm/lib/Target/AMDGPU/AMDGPUISelDAGToDAG.cpp
index 27e9018d68a03e..041cf40a7588df 100644
--- a/llvm/lib/Target/AMDGPU/AMDGPUISelDAGToDAG.cpp
+++ b/llvm/lib/Target/AMDGPU/AMDGPUISelDAGToDAG.cpp
@@ -2772,6 +2772,9 @@ void AMDGPUDAGToDAGISel::SelectINTRINSIC_WO_CHAIN(SDNode *N) {
   case Intrinsic::amdgcn_wwm:
   case Intrinsic::amdgcn_strict_wwm:
     Opcode = AMDGPU::STRICT_WWM;
+    CurDAG->getMachineFunction()
+        .getInfo<SIMachineFunctionInfo>()
+        ->setInitWholeWave();
     break;
   case Intrinsic::amdgcn_strict_wqm:
     Opcode = AMDGPU::STRICT_WQM;
diff --git a/llvm/lib/Target/AMDGPU/AMDGPUInstructionSelector.cpp b/llvm/lib/Target/AMDGPU/AMDGPUInstructionSelector.cpp
index 926c1e4b23b4a1..96d0c321704f1e 100644
--- a/llvm/lib/Target/AMDGPU/AMDGPUInstructionSelector.cpp
+++ b/llvm/lib/Target/AMDGPU/AMDGPUInstructionSelector.cpp
@@ -1055,8 +1055,12 @@ bool AMDGPUInstructionSelector::selectG_INTRINSIC(MachineInstr &I) const {
   case Intrinsic::amdgcn_softwqm:
     return constrainCopyLikeIntrin(I, AMDGPU::SOFT_WQM);
   case Intrinsic::amdgcn_strict_wwm:
-  case Intrinsic::amdgcn_wwm:
+  case Intrinsic::amdgcn_wwm: {
+    MachineFunction *MF = I.getParent()->getParent();
+    SIMachineFunctionInfo *MFInfo = MF->getInfo<SIMachineFunctionInfo>();
+    MFInfo->setInitWholeWave();
     return constrainCopyLikeIntrin(I, AMDGPU::STRICT_WWM);
+  }
   case Intrinsic::amdgcn_strict_wqm:
     return constrainCopyLikeIntrin(I, AMDGPU::STRICT_WQM);
   case Intrinsic::amdgcn_writelane:
diff --git a/llvm/lib/Target/AMDGPU/SIInstrInfo.cpp b/llvm/lib/Target/AMDGPU/SIInstrInfo.cpp
index 8fc32d9e60bf20..498080caf69624 100644
--- a/llvm/lib/Target/AMDGPU/SIInstrInfo.cpp
+++ b/llvm/lib/Target/AMDGPU/SIInstrInfo.cpp
@@ -184,7 +184,11 @@ static bool resultDependsOnExec(const MachineInstr &MI) {
 bool SIInstrInfo::isIgnorableUse(const MachineOperand &MO) const {
   // Any implicit use of exec by VALU is not a real register read.
   return MO.getReg() == AMDGPU::EXEC && MO.isImplicit() &&
-         isVALU(*MO.getParent()) && !resultDependsOnExec(*MO.getParent());
+         isVALU(*MO.getParent()) && !resultDependsOnExec(*MO.getParent()) &&
+         !MO.getParent()
+              ->getMF()
+              ->getInfo<SIMachineFunctionInfo>()
+              ->hasInitWholeWave();
 }
 
 bool SIInstrInfo::isSafeToSink(MachineInstr &MI,
diff --git a/llvm/test/CodeGen/AMDGPU/atomic_optimizations_global_pointer.ll b/llvm/test/CodeGen/AMDGPU/atomic_optimizations_global_pointer.ll
index 9577230c6c52e2..b5a8d72af5c3f1 100644
--- a/llvm/test/CodeGen/AMDGPU/atomic_optimizations_global_pointer.ll
+++ b/llvm/test/CodeGen/AMDGPU/atomic_optimizations_global_pointer.ll
@@ -2825,44 +2825,44 @@ define amdgpu_kernel void @add_i64_varying(ptr addrspace(1) %out, ptr addrspace(
 ; GFX8_DPP-NEXT:    v_mbcnt_hi_u32_b32 v6, exec_hi, v6
 ; GFX8_DPP-NEXT:    s_or_saveexec_b64 s[4:5], -1
 ; GFX8_DPP-NEXT:    v_cndmask_b32_e64 v3, 0, v0, s[4:5]
-; GFX8_DPP-NEXT:    v_mov_b32_e32 v5, 0
+; GFX8_DPP-NEXT:    v_mov_b32_e32 v5, v1
 ; GFX8_DPP-NEXT:    v_cndmask_b32_e64 v2, 0, 0, s[4:5]
-; GFX8_DPP-NEXT:    v_mov_b32_e32 v4, 0
+; GFX8_DPP-NEXT:    v_mov_b32_e32 v4, v1
 ; GFX8_DPP-NEXT:    v_mov_b32_dpp v5, v3 row_shr:1 row_mask:0xf bank_mask:0xf
 ; GFX8_DPP-NEXT:    v_add_u32_e32 v3, vcc, v3, v5
 ; GFX8_DPP-NEXT:    v_mov_b32_dpp v4, v2 row_shr:1 row_mask:0xf bank_mask:0xf
-; GFX8_DPP-NEXT:    v_mov_b32_e32 v5, 0
+; GFX8_DPP-NEXT:    v_mov_b32_e32 v5, v1
 ; GFX8_DPP-NEXT:    v_addc_u32_e32 v2, vcc, v2, v4, vcc
-; GFX8_DPP-NEXT:    v_mov_b32_e32 v4, 0
+; GFX8_DPP-NEXT:    v_mov_b32_e32 v4, v1
 ; GFX8_DPP-NEXT:    v_mov_b32_dpp v5, v3 row_shr:2 row_mask:0xf bank_mask:0xf
 ; GFX8_DPP-NEXT:    v_add_u32_e32 v3, vcc, v3, v5
 ; GFX8_DPP-NEXT:    v_mov_b32_dpp v4, v2 row_shr:2 row_mask:0xf bank_mask:0xf
-; GFX8_DPP-NEXT:    v_mov_b32_e32 v5, 0
+; GFX8_DPP-NEXT:    v_mov_b32_e32 v5, v1
 ; GFX8_DPP-NEXT:    v_addc_u32_e32 v2, vcc, v2, v4, vcc
-; GFX8_DPP-NEXT:    v_mov_b32_e32 v4, 0
+; GFX8_DPP-NEXT:    v_mov_b32_e32 v4, v1
 ; GFX8_DPP-NEXT:    v_mov_b32_dpp v5, v3 row_shr:4 row_mask:0xf bank_mask:0xf
 ; GFX8_DPP-NEXT:    v_add_u32_e32 v3, vcc, v3, v5
 ; GFX8_DPP-NEXT:    v_mov_b32_dpp v4, v2 row_shr:4 row_mask:0xf bank_mask:0xf
-; GFX8_DPP-NEXT:    v_mov_b32_e32 v5, 0
+; GFX8_DPP-NEXT:    v_mov_b32_e32 v5, v1
 ; GFX8_DPP-NEXT:    v_addc_u32_e32 v2, vcc, v2, v4, vcc
-; GFX8_DPP-NEXT:    v_mov_b32_e32 v4, 0
+; GFX8_DPP-NEXT:    v_mov_b32_e32 v4, v1
 ; GFX8_DPP-NEXT:    v_mov_b32_dpp v5, v3 row_shr:8 row_mask:0xf bank_mask:0xf
 ; GFX8_DPP-NEXT:    v_add_u32_e32 v3, vcc, v3, v5
 ; GFX8_DPP-NEXT:    v_mov_b32_dpp v4, v2 row_shr:8 row_mask:0xf bank_mask:0xf
-; GFX8_DPP-NEXT:    v_mov_b32_e32 v5, 0
+; GFX8_DPP-NEXT:    v_mov_b32_e32 v5, v1
 ; GFX8_DPP-NEXT:    v_addc_u32_e32 v2, vcc, v2, v4, vcc
-; GFX8_DPP-NEXT:    v_mov_b32_e32 v4, 0
+; GFX8_DPP-NEXT:    v_mov_b32_e32 v4, v1
 ; GFX8_DPP-NEXT:    v_mov_b32_dpp v5, v3 row_bcast:15 row_mask:0xa bank_mask:0xf
 ; GFX8_DPP-NEXT:    v_add_u32_e32 v3, vcc, v3, v5
 ; GFX8_DPP-NEXT:    v_mov_b32_dpp v4, v2 row_bcast:15 row_mask:0xa bank_mask:0xf
-; GFX8_DPP-NEXT:    v_mov_b32_e32 v5, 0
+; GFX8_DPP-NEXT:    v_mov_b32_e32 v5, v1
 ; GFX8_DPP-NEXT:    v_addc_u32_e32 v2, vcc, v2, v4, vcc
-; GFX8_DPP-NEXT:    v_mov_b32_e32 v4, 0
+; GFX8_DPP-NEXT:    v_mov_b32_e32 v4, v1
 ; GFX8_DPP-NEXT:    v_mov_b32_dpp v5, v3 row_bcast:31 row_mask:0xc bank_mask:0xf
 ; GFX8_DPP-NEXT:    v_add_u32_e32 v3, vcc, v3, v5
 ; GFX8_DPP-NEXT:    v_mov_b32_dpp v4, v2 row_bcast:31 row_mask:0xc bank_mask:0xf
 ; GFX8_DPP-NEXT:    v_addc_u32_e32 v4, vcc, v2, v4, vcc
-; GFX8_DPP-NEXT:    v_mov_b32_e32 v2, 0
+; GFX8_DPP-NEXT:    v_mov_b32_e32 v2, v1
 ; GFX8_DPP-NEXT:    v_readlane_b32 s7, v4, 63
 ; GFX8_DPP-NEXT:    v_readlane_b32 s6, v3, 63
 ; GFX8_DPP-NEXT:    v_mov_b32_dpp v2, v4 wave_shr:1 row_mask:0xf bank_mask:0xf
@@ -2908,44 +2908,44 @@ define amdgpu_kernel void @add_i64_varying(ptr addrspace(1) %out, ptr addrspace(
 ; GFX9_DPP-NEXT:    v_mbcnt_hi_u32_b32 v6, exec_hi, v6
 ; GFX9_DPP-NEXT:    s_or_saveexec_b64 s[4:5], -1
 ; GFX9_DPP-NEXT:    v_cndmask_b32_e64 v3, 0, v0, s[4:5]
-; GFX9_DPP-NEXT:    v_mov_b32_e32 v5, 0
+; GFX9_DPP-NEXT:    v_mov_b32_e32 v5, v1
 ; GFX9_DPP-NEXT:    v_cndmask_b32_e64 v2, 0, 0, s[4:5]
-; GFX9_DPP-NEXT:    v_mov_b32_e32 v4, 0
+; GFX9_DPP-NEXT:    v_mov_b32_e32 v4, v1
 ; GFX9_DPP-NEXT:    v_mov_b32_dpp v5, v3 row_shr:1 row_mask:0xf bank_mask:0xf
 ; GFX9_DPP-NEXT:    v_add_co_u32_e32 v3, vcc, v3, v5
 ; GFX9_DPP-NEXT:    v_mov_b32_dpp v4, v2 row_shr:1 row_mask:0xf bank_mask:0xf
-; GFX9_DPP-NEXT:    v_mov_b32_e32 v5, 0
+; GFX9_DPP-NEXT:    v_mov_b32_e32 v5, v1
 ; GFX9_DPP-NEXT:    v_addc_co_u32_e32 v2, vcc, v2, v4, vcc
-; GFX9_DPP-NEXT:    v_mov_b32_e32 v4, 0
+; GFX9_DPP-NEXT:    v_mov_b32_e32 v4, v1
 ; GFX9_DPP-NEXT:    v_mov_b32_dpp v5, v3 row_shr:2 row_mask:0xf bank_mask:0xf
 ; GFX9_DPP-NEXT:    v_add_co_u32_e32 v3, vcc, v3, v5
 ; GFX9_DPP-NEXT:    v_mov_b32_dpp v4, v2 row_shr:2 row_mask:0xf bank_mask:0xf
-; GFX9_DPP-NEXT:    v_mov_b32_e32 v5, 0
+; GFX9_DPP-NEXT:    v_mov_b32_e32 v5, v1
 ; GFX9_DPP-NEXT:    v_addc_co_u32_e32 v2, vcc, v2, v4, vcc
-; GFX9_DPP-NEXT:    v_mov_b32_e32 v4, 0
+; GFX9_DPP-NEXT:    v_mov_b32_e32 v4, v1
 ; GFX9_DPP-NEXT:    v_mov_b32_dpp v5, v3 row_shr:4 row_mask:0xf bank_mask:0xf
 ; GFX9_DPP-NEXT:    v_add_co_u32_e32 v3, vcc, v3, v5
 ; GFX9_DPP-NEXT:    v_mov_b32_dpp v4, v2 row_shr:4 row_mask:0xf bank_mask:0xf
-; GFX9_DPP-NEXT:    v_mov_b32_e32 v5, 0
+; GFX9_DPP-NEXT:    v_mov_b32_e32 v5, v1
 ; GFX9_DPP-NEXT:    v_addc_co_u32_e32 v2, vcc, v2, v4, vcc
-; GFX9_DPP-NEXT:    v_mov_b32_e32 v4, 0
+; GFX9_DPP-NEXT:    v_mov_b32_e32 v4, v1
 ; GFX9_DPP-NEXT:    v_mov_b32_dpp v5, v3 row_shr:8 row_mask:0xf bank_mask:0xf
 ; GFX9_DPP-NEXT:    v_add_co_u32_e32 v3, vcc, v3, v5
 ; GFX9_DPP-NEXT:    v_mov_b32_dpp v4, v2 row_shr:8 row_mask:0xf bank_mask:0xf
-; GFX9_DPP-NEXT:    v_mov_b32_e32 v5, 0
+; GFX9_DPP-NEXT:    v_mov_b32_e32 v5, v1
 ; GFX9_DPP-NEXT:    v_addc_co_u32_e32 v2, vcc, v2, v4, vcc
-; GFX9_DPP-NEXT:    v_mov_b32_e32 v4, 0
+; GFX9_DPP-NEXT:    v_mov_b32_e32 v4, v1
 ; GFX9_DPP-NEXT:    v_mov_b32_dpp v5, v3 row_bcast:15 row_mask:0xa bank_mask:0xf
 ; GFX9_DPP-NEXT:    v_add_co_u32_e32 v3, vcc, v3, v5
 ; GFX9_DPP-NEXT:    v_mov_b32_dpp v4, v2 row_bcast:15 row_mask:0xa bank_mask:0xf
-; GFX9_DPP-NEXT:    v_mov_b32_e32 v5, 0
+; GFX9_DPP-NEXT:    v_mov_b32_e32 v5, v1
 ; GFX9_DPP-NEXT:    v_addc_co_u32_e32 v2, vcc, v2, v4, vcc
-; GFX9_DPP-NEXT:    v_mov_b32_e32 v4, 0
+; GFX9_DPP-NEXT:    v_mov_b32_e32 v4, v1
 ; GFX9_DPP-NEXT:    v_mov_b32_dpp v5, v3 row_bcast:31 row_mask:0xc bank_mask:0xf
 ; GFX9_DPP-NEXT:    v_add_co_u32_e32 v3, vcc, v3, v5
 ; GFX9_DPP-NEXT:    v_mov_b32_dpp v4, v2 row_bcast:31 row_mask:0xc bank_mask:0xf
 ; GFX9_DPP-NEXT:    v_addc_co_u32_e32 v4, vcc, v2, v4, vcc
-; GFX9_DPP-NEXT:    v_mov_b32_e32 v2, 0
+; GFX9_DPP-NEXT:    v_mov_b32_e32 v2, v1
 ; GFX9_DPP-NEXT:    v_readlane_b32 s7, v4, 63
 ; GFX9_DPP-NEXT:    v_readlane_b32 s6, v3, 63
 ; GFX9_DPP-NEXT:    v_mov_b32_dpp v2, v4 wave_shr:1 row_mask:0xf bank_mask:0xf
@@ -2984,76 +2984,76 @@ define amdgpu_kernel void @add_i64_varying(ptr addrspace(1) %out, ptr addrspace(
 ; GFX1064_DPP-LABEL: add_i64_varying:
 ; GFX1064_DPP:       ; %bb.0: ; %entry
 ; GFX1064_DPP-NEXT:    s_or_saveexec_b64 s[0:1], -1
-; GFX1064_DPP-NEXT:    v_mov_b32_e32 v1, 0
-; GFX1064_DPP-NEXT:    v_cndmask_b32_e64 v2, 0, v0, s[0:1]
-; GFX1064_DPP-NEXT:    v_cndmask_b32_e64 v3, 0, 0, s[0:1]
-; GFX1064_DPP-NEXT:    v_mov_b32_e32 v4, 0
-; GFX1064_DPP-NEXT:    v_mov_b32_e32 v6, 0
-; GFX1064_DPP-NEXT:    v_mov_b32_e32 v5, 0
-; GFX1064_DPP-NEXT:    v_mov_b32_dpp v1, v2 row_shr:1 row_mask:0xf bank_mask:0xf
-; GFX1064_DPP-NEXT:    v_mov_b32_e32 v7, 0
-; GFX1064_DPP-NEXT:    v_mov_b32_dpp v4, v3 row_shr:1 row_mask:0xf bank_mask:0xf
-; GFX1064_DPP-NEXT:    v_add_co_u32 v1, vcc, v2, v1
-; GFX1064_DPP-NEXT:    v_add_co_ci_u32_e32 v2, vcc, v3, v4, vcc
-; GFX1064_DPP-NEXT:    v_mov_b32_e32 v4, 0
-; GFX1064_DPP-NEXT:    v_mov_b32_dpp v6, v1 row_shr:2 row_mask:0xf bank_mask:0xf
-; GFX1064_DPP-NEXT:    v_mov_b32_e32 v3, 0
-; GFX1064_DPP-NEXT:    v_mov_b32_dpp v5, v2 row_shr:2 row_mask:0xf bank_mask:0xf
-; GFX1064_DPP-NEXT:    v_add_co_u32 v1, vcc, v1, v6
-; GFX1064_DPP-NEXT:    v_add_co_ci_u32_e32 v2, vcc, v2, v5, vcc
-; GFX1064_DPP-NEXT:    v_mov_b32_e32 v6, 0
-; GFX1064_DPP-NEXT:    v_mov_b32_dpp v4, v1 row_shr:4 row_mask:0xf bank_mask:0xf
-; GFX1064_DPP-NEXT:    v_mov_b32_e32 v5, 0
-; GFX1064_DPP-NEXT:    v_mov_b32_dpp v3, v2 row_shr:4 row_mask:0xf bank_mask:0xf
-; GFX1064_DPP-NEXT:    v_add_co_u32 v1, vcc, v1, v4
-; GFX1064_DPP-NEXT:    v_add_co_ci_u32_e32 v2, vcc, v2, v3, vcc
-; GFX1064_DPP-NEXT:    v_mov_b32_e32 v3, 0
-; GFX1064_DPP-NEXT:    v_mov_b32_dpp v6, v1 row_shr:8 row_mask:0xf bank_mask:0xf
-; GFX1064_DPP-NEXT:    v_mov_b32_dpp v5, v2 row_shr:8 row_mask:0xf bank_mask:0xf
-; GFX1064_DPP-NEXT:    v_add_co_u32 v1, vcc, v1, v6
-; GFX1064_DPP-NEXT:    v_add_co_ci_u32_e32 v2, vcc, v2, v5, vcc
-; GFX1064_DPP-NEXT:    v_mov_b32_e32 v5, 0
-; GFX1064_DPP-NEXT:    v_permlanex16_b32 v4, v1, -1, -1
-; GFX1064_DPP-NEXT:    v_permlanex16_b32 v6, v2, -1, -1
-; GFX1064_DPP-NEXT:    v_mov_b32_dpp v3, v4 quad_perm:[0,1,2,3] row_mask:0xa bank_mask:0xf
-; GFX1064_DPP-NEXT:    v_mov_b32_dpp v5, v6 quad_perm:[0,1,2,3] row_mask:0xa bank_mask:0xf
-; GFX1064_DPP-NEXT:    v_add_co_u32 v1, vcc, v1, v3
-; GFX1064_DPP-NEXT:    v_add_co_ci_u32_e32 v2, vcc, v2, v5, vcc
-; GFX1064_DPP-NEXT:    v_mov_b32_e32 v3, 0
-; GFX1064_DPP-NEXT:    v_readlane_b32 s2, v1, 31
-; GFX1064_DPP-NEXT:    v_mov_b32_e32 v5, 0
-; GFX1064_DPP-NEXT:    v_readlane_b32 s3, v2, 31
-; GFX1064_DPP-NEXT:    v_mov_b32_e32 v4, s2
-; GFX1064_DPP-NEXT:    v_mov_b32_e32 v6, s3
-; GFX1064_DPP-NEXT:    v_mov_b32_dpp v3, v4 quad_perm:[0,1,2,3] row_mask:0xc bank_mask:0xf
-; GFX1064_DPP-NEXT:    v_mov_b32_dpp v5, v6 quad_perm:[0,1,2,3] row_mask:0xc bank_mask:0xf
-; GFX1064_DPP-NEXT:    v_mov_b32_e32 v6, 0
+; GFX1064_DPP-NEXT:    v_mov_b32_e32 v2, 0
+; GFX1064_DPP-NEXT:    v_cndmask_b32_e64 v1, 0, v0, s[0:1]
+; GFX1064_DPP-NEXT:    v_cndmask_b32_e64 v4, 0, 0, s[0:1]
+; GFX1064_DPP-NEXT:    v_mov_b32_e32 v3, v2
+; GFX1064_DPP-NEXT:    v_mov_b32_e32 v5, v2
+; GFX1064_DPP-NEXT:    v_mov_b32_e32 v7, v2
+; GFX1064_DPP-NEXT:    v_mov_b32_e32 v6, v2
+; GFX1064_DPP-NEXT:    v_mov_b32_dpp v3, v1 row_shr:1 row_mask:0xf bank_mask:0xf
+; GFX1064_DPP-NEXT:    v_mov_b32_dpp v5, v4 row_shr:1 row_mask:0xf bank_mask:0xf
 ; GFX1064_DPP-NEXT:    v_add_co_u32 v1, vcc, v1, v3
-; GFX1064_DPP-NEXT:    v_add_co_ci_u32_e32 v2, vcc, v2, v5, vcc
+; GFX1064_DPP-NEXT:    v_add_co_ci_u32_e32 v3, vcc, v4, v5, vcc
+; GFX1064_DPP-NEXT:    v_mov_b32_e32 v5, v2
+; GFX1064_DPP-NEXT:    v_mov_b32_dpp v7, v1 row_shr:2 row_mask:0xf bank_mask:0xf
+; GFX1064_DPP-NEXT:    v_mov_b32_e32 v4, v2
+; GFX1064_DPP-NEXT:    v_mov_b32_dpp v6, v3 row_shr:2 row_mask:0xf bank_mask:0xf
+; GFX1064_DPP-NEXT:    v_add_co_u32 v1, vcc, v1, v7
+; GFX1064_DPP-NEXT:    v_add_co_ci_u32_e32 v3, vcc, v3, v6, vcc
+; GFX1064_DPP-NEXT:    v_mov_b32_e32 v7, v2
+; GFX1064_DPP-NEXT:    v_mov_b32_dpp v5, v1 row_shr:4 row_mask:0xf bank_mask:0xf
+; GFX1064_DPP-NEXT:    v_mov_b32_e32 v6, v2
+; GFX1064_DPP-NEXT:    v_mov_b32_dpp v4, v3 row_shr:4 row_mask:0xf bank_mask:0xf
+; GFX1064_DPP-NEXT:    v_add_co_u32 v1, vcc, v1, v5
+; GFX1064_DPP-NEXT:    v_add_co_ci_u32_e32 v3, vcc, v3, v4, vcc
+; GFX1064_DPP-NEXT:    v_mov_b32_e32 v4, v2
+; GFX1064_DPP-NEXT:    v_mov_b32_dpp v7, v1 row_shr:8 row_mask:0xf bank_mask:0xf
+; GFX1064_DPP-NEXT:    v_mov_b32_dpp v6, v3 row_shr:8 row_mask:0xf bank_mask:0xf
+; GFX1064_DPP-NEXT:    v_add_co_u32 v1, vcc, v1, v7
+; GFX1064_DPP-NEXT:    v_add_co_ci_u32_e32 v3, vcc, v3, v6, vcc
+; GFX1064_DPP-NEXT:    v_mov_b32_e32 v6, v2
+; GFX1064_DPP-NEXT:    v_permlanex16_b32 v5, v1, -1, -1
+; GFX1064_DPP-NEXT:    v_permlanex16_b32 v7, v3, -1, -1
+; GFX1064_DPP-NEXT:    v_mov_b32_dpp v4, v5 quad_perm:[0,1,2,3] row_mask:0xa bank_mask:0xf
+; GFX1064_DPP-NEXT:    v_mov_b32_e32 v5, v2
+; GFX1064_DPP-NEXT:    v_mov_b32_dpp v6, v7 quad_perm:[0,1,2,3] row_mask:0xa bank_mask:0xf
+; GFX1064_DPP-NEXT:    v_add_co_u32 v4, vcc, v1, v4
+; GFX1064_DPP-NEXT:    v_add_co_ci_u32_e32 v3, vcc, v3, v6, vcc
+; GFX1064_DPP-NEXT:    v_mov_b32_e32 v6, v2
+; GFX1064_DPP-NEXT:    v_readlane_b32 s2, v4, 31
+; GFX1064_DPP-NEXT:    v_readlane_b32 s3, v3, 31
+; GFX1064_DPP-NEXT:    v_mov_b32_e32 v1, s2
+; GFX1064_DPP-NEXT:    v_mov_b32_e32 v7, s3
+; GFX1064_DPP-NEXT:    v_mov_b32_dpp v5, v1 quad_perm:[0,1,2,3] row_mask:0xc bank_mask:0xf
+; GFX1064_DPP-NEXT:    v_mov_b32_e32 v1, v2
+; GFX1064_DPP-NEXT:    v_mov_b32_dpp v6, v7 quad_perm:[0,1,2,3] row_mask:0xc bank_mask:0xf
+; GFX1064_DPP-NEXT:    v_add_co_u32 v4, vcc, v4, v5
+; GFX1064_DPP-NEXT:    v_add_co_ci_u32_e32 v3, vcc, v3, v6, vcc
 ; GFX1064_DPP-NEXT:    s_mov_b64 exec, s[0:1]
 ; GFX1064_DPP-NEXT:    s_load_dwordx4 s[0:3], s[4:5], 0x24
 ; GFX1064_DPP-NEXT:    v_mbcnt_lo_u32_b32 v0, exec_lo, 0
 ; GFX1064_DPP-NEXT:    s_or_saveexec_b64 s[4:5], -1
-; GFX1064_DPP-NEXT:    v_mov_b32_dpp v6, v1 row_shr:1 row_mask:0xf bank_mask:0xf
-; GFX1064_DPP-NEXT:    v_mov_b32_dpp v7, v2 row_shr:1 row_mask:0xf bank_mask:0xf
-; GFX1064_DPP-NEXT:    v_readlane_b32 s6, v2, 15
-; GFX1064_DPP-NEXT:    v_readlane_b32 s7, v1, 15
-; GFX1064_DPP-NEXT:    v_readlane_b32 s8, v2, 31
-; GFX1064_DPP-NEXT:    v_readlane_b32 s9, v1, 31
-; GFX1064_DPP-NEXT:    v_readlane_b32 s10, v1, 47
-; GFX1064_DPP-NEXT:    v_writelane_b32 v7, s6, 16
-; GFX1064_DPP-NEXT:    v_writelane_b32 v6, s7, 16
-; GFX1064_DPP-NEXT:    v_readlane_b32 s6, v1, 63
-; GFX1064_DPP-NEXT:    v_readlane_b32 s11, v2, 47
-; GFX1064_DPP-NEXT:    v_readlane_b32 s7, v2, 63
-; GFX1064_DPP-NEXT:    v_writelane_b32 v7, s8, 32
-; GFX1064_DPP-NEXT:    v_writelane_b32 v6, s9, 32
+; GFX1064_DPP-NEXT:    v_mov_b32_dpp v1, v4 row_shr:1 row_mask:0xf bank_mask:0xf
+; GFX1064_DPP-NEXT:    v_mov_b32_dpp v2, v3 row_shr:1 row_mask:0xf bank_mask:0xf
+; GFX1064_DPP-NEXT:    v_readlane_b32 s6, v3, 15
+; GFX1064_DPP-NEXT:    v_readlane_b32 s7, v4, 15
+; GFX1064_DPP-NEXT:    v_readlane_b32 s8, v3, 31
+; GFX1064_DPP-NEXT:    v_readlane_b32 s9, v4, 31
+; GFX1064_DPP-NEXT:    v_readlane_b32 s10, v4, 47
+; GFX1064_DPP-NEXT:    v_writelane_b32 v2, s6, 16
+; GFX1064_DPP-NEXT:    v_writelane_b32 v1, s7, 16
+; GFX1064_DPP-NEXT:    v_readlane_b32 s6, v4, 63
+; GFX1064_DPP-NEXT:    v_readlane_b32 s11, v3, 47
+; GFX1064_DPP-NEXT:    v_readlane_b32 s7, v3, 63
+; GFX1064_DPP-NEXT:    v_writelane_b32 v2, s8, 32
+; GFX1064_DPP-NEXT:    v_writelane_b32 v1, s9, 32
 ; GFX1064_DPP-NEXT:    s_mov_b64 exec, s[4:5]
 ; GFX1064_DPP-NEXT:    v_mbcnt_hi_u32_b32 v0, exec_hi, v0
 ; GFX1064_DPP-NEXT:    s_or_saveexec_b64 s[8:9], -1
 ; GFX1064_DPP-NEXT:    s_mov_b64 s[4:5], s[6:7]
-; GFX1064_DPP-NEXT:    v_writelane_b32 v7, s11, 48
-; GFX1064_DPP-NEXT:    v_writelane_b32 v6, s10, 48
+; GFX1064_DPP-NEXT:    v_writelane_b32 v2, s11, 48
+; GFX1064_DPP-NEXT:    v_writelane_b32 v1, s10, 48
 ; GFX1064_DPP-NEXT:    s_mov_b64 exec, s[8:9]
 ; GFX1064_DPP-NEXT:    v_cmp_eq_u32_e32 vcc, 0, v0
 ; GFX1064_DPP-NEXT:    s_mov_b32 s6, -1
@@ -3076,8 +3076,8 @@ define amdgpu_kernel void @add_i64_varying(ptr addrspace(1) %out, ptr addrspace(
 ; GFX1064_DPP-NEXT:    s_or_b64 exec, exec, s[8:9]
 ; GFX1064_DPP-NEXT:    s_waitcnt lgkmcnt(0)
 ; GFX1064_DPP-NEXT:    v_readfirstlane_b32 s2, v8
-; GFX1064_DPP-NEXT:    v_mov_b32_e32 v10, v6
-; GFX1064_DPP-NEXT:    v_mov_b32_e32 v11, v7
+; GFX1064_DPP-NEXT:    v_mov_b32_e32 v10, v1
+; GFX1064_DPP-NEXT:    v_mov_b32_e32 v11, v2
 ; GFX1064_DPP-NEXT:    v_readfirstlane_b32 s3, v9
 ; GFX1064_DPP-NEXT:    v_add_co_u32 v8, vcc, s2, v10
 ; GFX1064_DPP-NEXT:    s_mov_b32 s2, s6
@@ -3089,70 +3089,70 @@ define amdgpu_kernel void @add_i64_varying(ptr addrspace(1) %out, ptr addrspace(
 ; GFX1032_DPP-LABEL: add_i64_varying:
 ; GFX1032_DPP:       ; %bb.0: ; %entry
 ; GFX1032_DPP-NEXT:    s_or_saveexec_b32 s0, -1
-; GFX1032_DPP-NEXT:    v_mov_b32_e32 v1, 0
-; GFX1032_DPP-NEXT:    v_cndmask_b32_e64 v2, 0, v0, s0
-; GFX1032_DPP-NEXT:    v_cndmask_b32_e64 v3, 0, 0, s0
-; GFX1032_DPP-NEXT:    v_mov_b32_e32 v4, 0
-; GFX1032_DPP-NEXT:    v_mov_b32_e32 v6, 0
-; GFX1032_DPP-NEXT:    v_mov_b32_e32 v5, 0
-; GFX1032_DPP-NEXT:    v_mov_b32_dpp v1, v2 row_shr:1 row_mask:0xf bank_mask:0xf
-; GFX1032_DPP-NEXT:    v_mov_b32_e32 v8, 0
-; GFX1032_DPP-NEXT:    v_mov_b32_dpp v4, v3 row_shr:1 row_mask:0xf bank_mask:0xf
-; GFX1032_DPP-NEXT:    v_mov_b32_e32 v7, 0
-; GFX1032_DPP-NEXT:    v_add_co_u32 v1, vcc_lo, v2, v1
-; GFX1032_DPP-NEXT:    v_add_co_ci_u32_e32 v2, vcc_lo, v3, v4, vcc_lo
-; GFX1032_DPP-NEXT:    v_mov_b32_e32 v4, 0
-; GFX1032_DPP-NEXT:    v_mov_b32_dpp v6, v1 row_shr:2 row_mask:0xf bank_mask:0xf
-; GFX1032_DPP-NEXT:    v_mov_b32_e32 v3, 0
-; GFX1032_DPP-NEXT:    v_mov_b32_dpp v5, v2 row_shr:2 row_mask:0xf bank_mask:0xf
-; GFX1032_DPP-NEXT:    v_add_co_u32 v1, vcc_lo, v1, v6
-; GFX1032_DPP-NEXT:    v_add_co_ci_u32_e32 v2, vcc_lo, v2, v5, vcc_lo
-; GFX1032_DPP-NEXT:    v_mov_b32_e32 v6, 0
-; GFX1032_DPP-NEXT:    v_mov_b32_dpp v4, v1 row_shr:4 row_mask:0xf bank_mask:0xf
-; GFX1032_DPP-NEXT:    v_mov_b32_e32 v5, 0
-; GFX1032_DPP-NEXT:    v_mov_b32_dpp v3, v2 row_shr:4 row_mask:0xf bank_mask:0xf
-; GFX1032_DPP-NEXT:    v_add_co_u32 v1, vcc_lo, v1, v4
-; GFX1032_DPP-NEXT:    v_add_co_ci_u32_e32 v2, vcc_lo, v2, v3, vcc_lo
-; GFX1032_DPP-NEXT:    v_mov_b32_e32 v3, 0
-; GFX1032_DPP-NEXT:    v_mov_b32_dpp v6, v1 row_shr:8 row_mask:0xf bank_mask:0xf
-; GFX1032_DPP-NEXT:    v_mov_b32_dpp v5, v2 row_shr:8 row_mask:0xf bank_mask:0xf
-; GFX1032_DPP-NEXT:    v_add_co_u32 v1, vcc_lo, v1, v6
-; GFX1032_DPP-NEXT:    v_add_co_ci_u32_e32 v2, vcc_lo, v2, v5, vcc_lo
-; GFX1032_DPP-NEXT:    v_mov_b32_e32 v5, 0
-; GFX1032_DPP-NEXT:    v_permlanex16_b32 v4, v1, -1, -1
-; GFX1032_DPP-NEXT:    v_permlanex16_b32 v6, v2, -1, -1
-; GFX1032_DPP-NEXT:    v_mov_b32_dpp v3, v4 quad_perm:[0,1,2,3] row_mask:0xa bank_mask:0xf
-; GFX1032_DPP-NEXT:    v_mov_b32_dpp v5, v6 quad_perm:[0,1,2,3] row_mask:0xa bank_mask:0xf
+; GFX1032_DPP-NEXT:    v_mov_b32_e32 v2, 0
+; GFX1032_DPP-NEXT:    v_cndmask_b32_e64 v1, 0, v0, s0
+; GFX1032_DPP-NEXT:    v_cndmask_b32_e64 v4, 0, 0, s0
+; GFX1032_DPP-NEXT:    v_mov_b32_e32 v3, v2
+; GFX1032_DPP-NEXT:    v_mov_b32_e32 v5, v2
+; GFX1032_DPP-NEXT:    v_mov_b32_e32 v7, v2
+; GFX1032_DPP-NEXT:    v_mov_b32_e32 v6, v2
+; GFX1032_DPP-NEXT: ...
[truncated]

Contributor

@arsenm arsenm left a comment


Missing new test?

@rampitec
Collaborator Author

Missing new test?

Yes. I have spent 3 hours today trying to reproduce the original problem in a reasonably small and clear testcase, and have failed so far. I can create a small mir testcase (not very clear either, due to the pass pipeline differences between llc and llpc), plus a testcase showing that the property is added to the MFI, but I really want to hear from gfx whether that is a welcome approach at all. The gfx input here is essential because the problem really happens when the atomic optimizer is run. It is disabled for compute but enabled for llpc. I am not aware of any other scenario for compute where a strict.wwm call will happen, and will happen late enough for an IR LICM to miss it.

@rampitec rampitec force-pushed the users/rampitec/01-15-_amdgpu_disable_valu_sinking_and_hoisting_with_wwm branch from 5b9cca3 to f7558c1 Compare January 16, 2025 20:01
@rampitec rampitec changed the base branch from main to users/rampitec/01-16-_amdgpu_add_test_for_valu_hoisiting_from_wwm_region._nfc January 16, 2025 20:01
@rampitec
Collaborator Author

Missing new test?

Tests added.

@perlfu
Contributor

perlfu commented Jan 17, 2025

I guess my concern is performance regressions if any use of WWM (e.g. atomic optimizer) essentially turns off Machine LICM.

@rampitec
Collaborator Author

I guess my concern is performance regressions if any use of WWM (e.g. atomic optimizer) essentially turns off Machine LICM.

I agree. But when moving the code, LLVM thinks it is something cheap, and it is not, which is also a performance problem. Things would be much easier if we could tell that an instruction belongs to a WWM region.

@rampitec rampitec force-pushed the users/rampitec/01-15-_amdgpu_disable_valu_sinking_and_hoisting_with_wwm branch from f7558c1 to 22d28d8 Compare January 17, 2025 09:29
@rampitec rampitec requested a review from a team as a code owner January 17, 2025 09:29
@rampitec rampitec changed the base branch from users/rampitec/01-16-_amdgpu_add_test_for_valu_hoisiting_from_wwm_region._nfc to main January 17, 2025 09:29
@rampitec rampitec force-pushed the users/rampitec/01-15-_amdgpu_disable_valu_sinking_and_hoisting_with_wwm branch from 22d28d8 to 9b437e6 Compare January 17, 2025 09:37
@rampitec rampitec changed the base branch from main to users/rampitec/01-16-_amdgpu_add_test_for_valu_hoisiting_from_wwm_region._nfc January 17, 2025 09:37
@@ -2773,6 +2773,9 @@ void AMDGPUDAGToDAGISel::SelectINTRINSIC_WO_CHAIN(SDNode *N) {
case Intrinsic::amdgcn_wwm:
case Intrinsic::amdgcn_strict_wwm:
Opcode = AMDGPU::STRICT_WWM;
CurDAG->getMachineFunction()
.getInfo<SIMachineFunctionInfo>()
->setInitWholeWave();
Collaborator


I would not recommend using HasInitWholeWave for this. That has a very narrow meaning referring only to the use of the llvm.amdgcn.init.whole.wave intrinsic, not WWM in general.

Collaborator Author


Ack. I can create a separate property HasWWM, but I really want to hear if we even want to go that way.

Base automatically changed from users/rampitec/01-16-_amdgpu_add_test_for_valu_hoisiting_from_wwm_region._nfc to main January 17, 2025 18:06
Machine LICM can hoist a VALU instruction out of a WWM region.
In this case the WQM pass has to create yet another WWM region
around the hoisted instruction, which is not desired.

Unfortunately, we cannot tell whether an instruction is in a WWM
region, so this patch disables hoisting if WWM is used in the
function.

This works around the bug SWDEV-502411.
@rampitec rampitec force-pushed the users/rampitec/01-15-_amdgpu_disable_valu_sinking_and_hoisting_with_wwm branch from 9b437e6 to 14727cf Compare January 17, 2025 18:09