[AMDGPU] Disable VALU sinking and hoisting with WWM #123124
base: main

Conversation
@llvm/pr-subscribers-backend-amdgpu

Author: Stanislav Mekhanoshin (rampitec)

Changes

Machine LICM can hoist a VALU instruction from a WWM region. In this case the WQM pass will have to create yet another WWM region around the hoisted instruction, which is not desired. Unfortunately we cannot tell if an instruction is in a WWM region, so this patch disables hoisting if WWM is used in the function.

This works around the bug SWDEV-502411.

Patch is 1.34 MiB, truncated to 20.00 KiB below, full version: https://github.com/llvm/llvm-project/pull/123124.diff

14 Files Affected:
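To make the mechanism concrete before the diff: the patch makes an implicit `exec` read stop counting as "ignorable" once any WWM intrinsic has been selected in the function, which conservatively blocks Machine LICM from treating such VALU instructions as loop-invariant. The sketch below is a simplified, hypothetical model of that decision — the names (`Instr`, `MachineFunctionInfo`, `is_ignorable_use`, `can_hoist`) only loosely mirror the LLVM concepts and none of this is LLVM's actual API:

```python
# Hypothetical, simplified model of the hoisting decision in this patch.
# Not LLVM's actual MachineLICM; names and structures are made up.
from dataclasses import dataclass, field

@dataclass
class MachineFunctionInfo:
    # Set when a strict.wwm intrinsic is selected (the patch reuses the
    # init-whole-wave flag for this; reviewers suggest a separate HasWWM).
    has_whole_wave: bool = False

@dataclass
class Instr:
    uses: list                                   # explicit register reads
    implicit: set = field(default_factory=set)   # implicit reads, e.g. "exec"

def is_ignorable_use(reg: str, instr: Instr, mfi: MachineFunctionInfo) -> bool:
    # Before the patch: an implicit exec read by a VALU op was always
    # ignorable, so LICM could treat the instruction as loop-invariant.
    # After the patch: once WWM is present anywhere in the function, exec
    # is never ignorable, which blocks hoisting out of WWM regions.
    return reg == "exec" and reg in instr.implicit and not mfi.has_whole_wave

def can_hoist(instr: Instr, loop_defs: set, mfi: MachineFunctionInfo) -> bool:
    # Loop-invariant iff every non-ignorable use is defined outside the loop.
    for reg in list(instr.uses) + sorted(instr.implicit):
        if is_ignorable_use(reg, instr, mfi):
            continue
        if reg in loop_defs:
            return False
    return True
```

With `exec` redefined inside the loop (as a WWM region's `s_or_saveexec` does), the same instruction is hoistable without WWM and pinned once WWM is present.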
diff --git a/llvm/lib/Target/AMDGPU/AMDGPUISelDAGToDAG.cpp b/llvm/lib/Target/AMDGPU/AMDGPUISelDAGToDAG.cpp
index 27e9018d68a03e..041cf40a7588df 100644
--- a/llvm/lib/Target/AMDGPU/AMDGPUISelDAGToDAG.cpp
+++ b/llvm/lib/Target/AMDGPU/AMDGPUISelDAGToDAG.cpp
@@ -2772,6 +2772,9 @@ void AMDGPUDAGToDAGISel::SelectINTRINSIC_WO_CHAIN(SDNode *N) {
case Intrinsic::amdgcn_wwm:
case Intrinsic::amdgcn_strict_wwm:
Opcode = AMDGPU::STRICT_WWM;
+ CurDAG->getMachineFunction()
+ .getInfo<SIMachineFunctionInfo>()
+ ->setInitWholeWave();
break;
case Intrinsic::amdgcn_strict_wqm:
Opcode = AMDGPU::STRICT_WQM;
diff --git a/llvm/lib/Target/AMDGPU/AMDGPUInstructionSelector.cpp b/llvm/lib/Target/AMDGPU/AMDGPUInstructionSelector.cpp
index 926c1e4b23b4a1..96d0c321704f1e 100644
--- a/llvm/lib/Target/AMDGPU/AMDGPUInstructionSelector.cpp
+++ b/llvm/lib/Target/AMDGPU/AMDGPUInstructionSelector.cpp
@@ -1055,8 +1055,12 @@ bool AMDGPUInstructionSelector::selectG_INTRINSIC(MachineInstr &I) const {
case Intrinsic::amdgcn_softwqm:
return constrainCopyLikeIntrin(I, AMDGPU::SOFT_WQM);
case Intrinsic::amdgcn_strict_wwm:
- case Intrinsic::amdgcn_wwm:
+ case Intrinsic::amdgcn_wwm: {
+ MachineFunction *MF = I.getParent()->getParent();
+ SIMachineFunctionInfo *MFInfo = MF->getInfo<SIMachineFunctionInfo>();
+ MFInfo->setInitWholeWave();
return constrainCopyLikeIntrin(I, AMDGPU::STRICT_WWM);
+ }
case Intrinsic::amdgcn_strict_wqm:
return constrainCopyLikeIntrin(I, AMDGPU::STRICT_WQM);
case Intrinsic::amdgcn_writelane:
diff --git a/llvm/lib/Target/AMDGPU/SIInstrInfo.cpp b/llvm/lib/Target/AMDGPU/SIInstrInfo.cpp
index 8fc32d9e60bf20..498080caf69624 100644
--- a/llvm/lib/Target/AMDGPU/SIInstrInfo.cpp
+++ b/llvm/lib/Target/AMDGPU/SIInstrInfo.cpp
@@ -184,7 +184,11 @@ static bool resultDependsOnExec(const MachineInstr &MI) {
bool SIInstrInfo::isIgnorableUse(const MachineOperand &MO) const {
// Any implicit use of exec by VALU is not a real register read.
return MO.getReg() == AMDGPU::EXEC && MO.isImplicit() &&
- isVALU(*MO.getParent()) && !resultDependsOnExec(*MO.getParent());
+ isVALU(*MO.getParent()) && !resultDependsOnExec(*MO.getParent()) &&
+ !MO.getParent()
+ ->getMF()
+ ->getInfo<SIMachineFunctionInfo>()
+ ->hasInitWholeWave();
}
bool SIInstrInfo::isSafeToSink(MachineInstr &MI,
diff --git a/llvm/test/CodeGen/AMDGPU/atomic_optimizations_global_pointer.ll b/llvm/test/CodeGen/AMDGPU/atomic_optimizations_global_pointer.ll
index 9577230c6c52e2..b5a8d72af5c3f1 100644
--- a/llvm/test/CodeGen/AMDGPU/atomic_optimizations_global_pointer.ll
+++ b/llvm/test/CodeGen/AMDGPU/atomic_optimizations_global_pointer.ll
@@ -2825,44 +2825,44 @@ define amdgpu_kernel void @add_i64_varying(ptr addrspace(1) %out, ptr addrspace(
; GFX8_DPP-NEXT: v_mbcnt_hi_u32_b32 v6, exec_hi, v6
; GFX8_DPP-NEXT: s_or_saveexec_b64 s[4:5], -1
; GFX8_DPP-NEXT: v_cndmask_b32_e64 v3, 0, v0, s[4:5]
-; GFX8_DPP-NEXT: v_mov_b32_e32 v5, 0
+; GFX8_DPP-NEXT: v_mov_b32_e32 v5, v1
; GFX8_DPP-NEXT: v_cndmask_b32_e64 v2, 0, 0, s[4:5]
-; GFX8_DPP-NEXT: v_mov_b32_e32 v4, 0
+; GFX8_DPP-NEXT: v_mov_b32_e32 v4, v1
; GFX8_DPP-NEXT: v_mov_b32_dpp v5, v3 row_shr:1 row_mask:0xf bank_mask:0xf
; GFX8_DPP-NEXT: v_add_u32_e32 v3, vcc, v3, v5
; GFX8_DPP-NEXT: v_mov_b32_dpp v4, v2 row_shr:1 row_mask:0xf bank_mask:0xf
-; GFX8_DPP-NEXT: v_mov_b32_e32 v5, 0
+; GFX8_DPP-NEXT: v_mov_b32_e32 v5, v1
; GFX8_DPP-NEXT: v_addc_u32_e32 v2, vcc, v2, v4, vcc
-; GFX8_DPP-NEXT: v_mov_b32_e32 v4, 0
+; GFX8_DPP-NEXT: v_mov_b32_e32 v4, v1
; GFX8_DPP-NEXT: v_mov_b32_dpp v5, v3 row_shr:2 row_mask:0xf bank_mask:0xf
; GFX8_DPP-NEXT: v_add_u32_e32 v3, vcc, v3, v5
; GFX8_DPP-NEXT: v_mov_b32_dpp v4, v2 row_shr:2 row_mask:0xf bank_mask:0xf
-; GFX8_DPP-NEXT: v_mov_b32_e32 v5, 0
+; GFX8_DPP-NEXT: v_mov_b32_e32 v5, v1
; GFX8_DPP-NEXT: v_addc_u32_e32 v2, vcc, v2, v4, vcc
-; GFX8_DPP-NEXT: v_mov_b32_e32 v4, 0
+; GFX8_DPP-NEXT: v_mov_b32_e32 v4, v1
; GFX8_DPP-NEXT: v_mov_b32_dpp v5, v3 row_shr:4 row_mask:0xf bank_mask:0xf
; GFX8_DPP-NEXT: v_add_u32_e32 v3, vcc, v3, v5
; GFX8_DPP-NEXT: v_mov_b32_dpp v4, v2 row_shr:4 row_mask:0xf bank_mask:0xf
-; GFX8_DPP-NEXT: v_mov_b32_e32 v5, 0
+; GFX8_DPP-NEXT: v_mov_b32_e32 v5, v1
; GFX8_DPP-NEXT: v_addc_u32_e32 v2, vcc, v2, v4, vcc
-; GFX8_DPP-NEXT: v_mov_b32_e32 v4, 0
+; GFX8_DPP-NEXT: v_mov_b32_e32 v4, v1
; GFX8_DPP-NEXT: v_mov_b32_dpp v5, v3 row_shr:8 row_mask:0xf bank_mask:0xf
; GFX8_DPP-NEXT: v_add_u32_e32 v3, vcc, v3, v5
; GFX8_DPP-NEXT: v_mov_b32_dpp v4, v2 row_shr:8 row_mask:0xf bank_mask:0xf
-; GFX8_DPP-NEXT: v_mov_b32_e32 v5, 0
+; GFX8_DPP-NEXT: v_mov_b32_e32 v5, v1
; GFX8_DPP-NEXT: v_addc_u32_e32 v2, vcc, v2, v4, vcc
-; GFX8_DPP-NEXT: v_mov_b32_e32 v4, 0
+; GFX8_DPP-NEXT: v_mov_b32_e32 v4, v1
; GFX8_DPP-NEXT: v_mov_b32_dpp v5, v3 row_bcast:15 row_mask:0xa bank_mask:0xf
; GFX8_DPP-NEXT: v_add_u32_e32 v3, vcc, v3, v5
; GFX8_DPP-NEXT: v_mov_b32_dpp v4, v2 row_bcast:15 row_mask:0xa bank_mask:0xf
-; GFX8_DPP-NEXT: v_mov_b32_e32 v5, 0
+; GFX8_DPP-NEXT: v_mov_b32_e32 v5, v1
; GFX8_DPP-NEXT: v_addc_u32_e32 v2, vcc, v2, v4, vcc
-; GFX8_DPP-NEXT: v_mov_b32_e32 v4, 0
+; GFX8_DPP-NEXT: v_mov_b32_e32 v4, v1
; GFX8_DPP-NEXT: v_mov_b32_dpp v5, v3 row_bcast:31 row_mask:0xc bank_mask:0xf
; GFX8_DPP-NEXT: v_add_u32_e32 v3, vcc, v3, v5
; GFX8_DPP-NEXT: v_mov_b32_dpp v4, v2 row_bcast:31 row_mask:0xc bank_mask:0xf
; GFX8_DPP-NEXT: v_addc_u32_e32 v4, vcc, v2, v4, vcc
-; GFX8_DPP-NEXT: v_mov_b32_e32 v2, 0
+; GFX8_DPP-NEXT: v_mov_b32_e32 v2, v1
; GFX8_DPP-NEXT: v_readlane_b32 s7, v4, 63
; GFX8_DPP-NEXT: v_readlane_b32 s6, v3, 63
; GFX8_DPP-NEXT: v_mov_b32_dpp v2, v4 wave_shr:1 row_mask:0xf bank_mask:0xf
@@ -2908,44 +2908,44 @@ define amdgpu_kernel void @add_i64_varying(ptr addrspace(1) %out, ptr addrspace(
; GFX9_DPP-NEXT: v_mbcnt_hi_u32_b32 v6, exec_hi, v6
; GFX9_DPP-NEXT: s_or_saveexec_b64 s[4:5], -1
; GFX9_DPP-NEXT: v_cndmask_b32_e64 v3, 0, v0, s[4:5]
-; GFX9_DPP-NEXT: v_mov_b32_e32 v5, 0
+; GFX9_DPP-NEXT: v_mov_b32_e32 v5, v1
; GFX9_DPP-NEXT: v_cndmask_b32_e64 v2, 0, 0, s[4:5]
-; GFX9_DPP-NEXT: v_mov_b32_e32 v4, 0
+; GFX9_DPP-NEXT: v_mov_b32_e32 v4, v1
; GFX9_DPP-NEXT: v_mov_b32_dpp v5, v3 row_shr:1 row_mask:0xf bank_mask:0xf
; GFX9_DPP-NEXT: v_add_co_u32_e32 v3, vcc, v3, v5
; GFX9_DPP-NEXT: v_mov_b32_dpp v4, v2 row_shr:1 row_mask:0xf bank_mask:0xf
-; GFX9_DPP-NEXT: v_mov_b32_e32 v5, 0
+; GFX9_DPP-NEXT: v_mov_b32_e32 v5, v1
; GFX9_DPP-NEXT: v_addc_co_u32_e32 v2, vcc, v2, v4, vcc
-; GFX9_DPP-NEXT: v_mov_b32_e32 v4, 0
+; GFX9_DPP-NEXT: v_mov_b32_e32 v4, v1
; GFX9_DPP-NEXT: v_mov_b32_dpp v5, v3 row_shr:2 row_mask:0xf bank_mask:0xf
; GFX9_DPP-NEXT: v_add_co_u32_e32 v3, vcc, v3, v5
; GFX9_DPP-NEXT: v_mov_b32_dpp v4, v2 row_shr:2 row_mask:0xf bank_mask:0xf
-; GFX9_DPP-NEXT: v_mov_b32_e32 v5, 0
+; GFX9_DPP-NEXT: v_mov_b32_e32 v5, v1
; GFX9_DPP-NEXT: v_addc_co_u32_e32 v2, vcc, v2, v4, vcc
-; GFX9_DPP-NEXT: v_mov_b32_e32 v4, 0
+; GFX9_DPP-NEXT: v_mov_b32_e32 v4, v1
; GFX9_DPP-NEXT: v_mov_b32_dpp v5, v3 row_shr:4 row_mask:0xf bank_mask:0xf
; GFX9_DPP-NEXT: v_add_co_u32_e32 v3, vcc, v3, v5
; GFX9_DPP-NEXT: v_mov_b32_dpp v4, v2 row_shr:4 row_mask:0xf bank_mask:0xf
-; GFX9_DPP-NEXT: v_mov_b32_e32 v5, 0
+; GFX9_DPP-NEXT: v_mov_b32_e32 v5, v1
; GFX9_DPP-NEXT: v_addc_co_u32_e32 v2, vcc, v2, v4, vcc
-; GFX9_DPP-NEXT: v_mov_b32_e32 v4, 0
+; GFX9_DPP-NEXT: v_mov_b32_e32 v4, v1
; GFX9_DPP-NEXT: v_mov_b32_dpp v5, v3 row_shr:8 row_mask:0xf bank_mask:0xf
; GFX9_DPP-NEXT: v_add_co_u32_e32 v3, vcc, v3, v5
; GFX9_DPP-NEXT: v_mov_b32_dpp v4, v2 row_shr:8 row_mask:0xf bank_mask:0xf
-; GFX9_DPP-NEXT: v_mov_b32_e32 v5, 0
+; GFX9_DPP-NEXT: v_mov_b32_e32 v5, v1
; GFX9_DPP-NEXT: v_addc_co_u32_e32 v2, vcc, v2, v4, vcc
-; GFX9_DPP-NEXT: v_mov_b32_e32 v4, 0
+; GFX9_DPP-NEXT: v_mov_b32_e32 v4, v1
; GFX9_DPP-NEXT: v_mov_b32_dpp v5, v3 row_bcast:15 row_mask:0xa bank_mask:0xf
; GFX9_DPP-NEXT: v_add_co_u32_e32 v3, vcc, v3, v5
; GFX9_DPP-NEXT: v_mov_b32_dpp v4, v2 row_bcast:15 row_mask:0xa bank_mask:0xf
-; GFX9_DPP-NEXT: v_mov_b32_e32 v5, 0
+; GFX9_DPP-NEXT: v_mov_b32_e32 v5, v1
; GFX9_DPP-NEXT: v_addc_co_u32_e32 v2, vcc, v2, v4, vcc
-; GFX9_DPP-NEXT: v_mov_b32_e32 v4, 0
+; GFX9_DPP-NEXT: v_mov_b32_e32 v4, v1
; GFX9_DPP-NEXT: v_mov_b32_dpp v5, v3 row_bcast:31 row_mask:0xc bank_mask:0xf
; GFX9_DPP-NEXT: v_add_co_u32_e32 v3, vcc, v3, v5
; GFX9_DPP-NEXT: v_mov_b32_dpp v4, v2 row_bcast:31 row_mask:0xc bank_mask:0xf
; GFX9_DPP-NEXT: v_addc_co_u32_e32 v4, vcc, v2, v4, vcc
-; GFX9_DPP-NEXT: v_mov_b32_e32 v2, 0
+; GFX9_DPP-NEXT: v_mov_b32_e32 v2, v1
; GFX9_DPP-NEXT: v_readlane_b32 s7, v4, 63
; GFX9_DPP-NEXT: v_readlane_b32 s6, v3, 63
; GFX9_DPP-NEXT: v_mov_b32_dpp v2, v4 wave_shr:1 row_mask:0xf bank_mask:0xf
@@ -2984,76 +2984,76 @@ define amdgpu_kernel void @add_i64_varying(ptr addrspace(1) %out, ptr addrspace(
; GFX1064_DPP-LABEL: add_i64_varying:
; GFX1064_DPP: ; %bb.0: ; %entry
; GFX1064_DPP-NEXT: s_or_saveexec_b64 s[0:1], -1
-; GFX1064_DPP-NEXT: v_mov_b32_e32 v1, 0
-; GFX1064_DPP-NEXT: v_cndmask_b32_e64 v2, 0, v0, s[0:1]
-; GFX1064_DPP-NEXT: v_cndmask_b32_e64 v3, 0, 0, s[0:1]
-; GFX1064_DPP-NEXT: v_mov_b32_e32 v4, 0
-; GFX1064_DPP-NEXT: v_mov_b32_e32 v6, 0
-; GFX1064_DPP-NEXT: v_mov_b32_e32 v5, 0
-; GFX1064_DPP-NEXT: v_mov_b32_dpp v1, v2 row_shr:1 row_mask:0xf bank_mask:0xf
-; GFX1064_DPP-NEXT: v_mov_b32_e32 v7, 0
-; GFX1064_DPP-NEXT: v_mov_b32_dpp v4, v3 row_shr:1 row_mask:0xf bank_mask:0xf
-; GFX1064_DPP-NEXT: v_add_co_u32 v1, vcc, v2, v1
-; GFX1064_DPP-NEXT: v_add_co_ci_u32_e32 v2, vcc, v3, v4, vcc
-; GFX1064_DPP-NEXT: v_mov_b32_e32 v4, 0
-; GFX1064_DPP-NEXT: v_mov_b32_dpp v6, v1 row_shr:2 row_mask:0xf bank_mask:0xf
-; GFX1064_DPP-NEXT: v_mov_b32_e32 v3, 0
-; GFX1064_DPP-NEXT: v_mov_b32_dpp v5, v2 row_shr:2 row_mask:0xf bank_mask:0xf
-; GFX1064_DPP-NEXT: v_add_co_u32 v1, vcc, v1, v6
-; GFX1064_DPP-NEXT: v_add_co_ci_u32_e32 v2, vcc, v2, v5, vcc
-; GFX1064_DPP-NEXT: v_mov_b32_e32 v6, 0
-; GFX1064_DPP-NEXT: v_mov_b32_dpp v4, v1 row_shr:4 row_mask:0xf bank_mask:0xf
-; GFX1064_DPP-NEXT: v_mov_b32_e32 v5, 0
-; GFX1064_DPP-NEXT: v_mov_b32_dpp v3, v2 row_shr:4 row_mask:0xf bank_mask:0xf
-; GFX1064_DPP-NEXT: v_add_co_u32 v1, vcc, v1, v4
-; GFX1064_DPP-NEXT: v_add_co_ci_u32_e32 v2, vcc, v2, v3, vcc
-; GFX1064_DPP-NEXT: v_mov_b32_e32 v3, 0
-; GFX1064_DPP-NEXT: v_mov_b32_dpp v6, v1 row_shr:8 row_mask:0xf bank_mask:0xf
-; GFX1064_DPP-NEXT: v_mov_b32_dpp v5, v2 row_shr:8 row_mask:0xf bank_mask:0xf
-; GFX1064_DPP-NEXT: v_add_co_u32 v1, vcc, v1, v6
-; GFX1064_DPP-NEXT: v_add_co_ci_u32_e32 v2, vcc, v2, v5, vcc
-; GFX1064_DPP-NEXT: v_mov_b32_e32 v5, 0
-; GFX1064_DPP-NEXT: v_permlanex16_b32 v4, v1, -1, -1
-; GFX1064_DPP-NEXT: v_permlanex16_b32 v6, v2, -1, -1
-; GFX1064_DPP-NEXT: v_mov_b32_dpp v3, v4 quad_perm:[0,1,2,3] row_mask:0xa bank_mask:0xf
-; GFX1064_DPP-NEXT: v_mov_b32_dpp v5, v6 quad_perm:[0,1,2,3] row_mask:0xa bank_mask:0xf
-; GFX1064_DPP-NEXT: v_add_co_u32 v1, vcc, v1, v3
-; GFX1064_DPP-NEXT: v_add_co_ci_u32_e32 v2, vcc, v2, v5, vcc
-; GFX1064_DPP-NEXT: v_mov_b32_e32 v3, 0
-; GFX1064_DPP-NEXT: v_readlane_b32 s2, v1, 31
-; GFX1064_DPP-NEXT: v_mov_b32_e32 v5, 0
-; GFX1064_DPP-NEXT: v_readlane_b32 s3, v2, 31
-; GFX1064_DPP-NEXT: v_mov_b32_e32 v4, s2
-; GFX1064_DPP-NEXT: v_mov_b32_e32 v6, s3
-; GFX1064_DPP-NEXT: v_mov_b32_dpp v3, v4 quad_perm:[0,1,2,3] row_mask:0xc bank_mask:0xf
-; GFX1064_DPP-NEXT: v_mov_b32_dpp v5, v6 quad_perm:[0,1,2,3] row_mask:0xc bank_mask:0xf
-; GFX1064_DPP-NEXT: v_mov_b32_e32 v6, 0
+; GFX1064_DPP-NEXT: v_mov_b32_e32 v2, 0
+; GFX1064_DPP-NEXT: v_cndmask_b32_e64 v1, 0, v0, s[0:1]
+; GFX1064_DPP-NEXT: v_cndmask_b32_e64 v4, 0, 0, s[0:1]
+; GFX1064_DPP-NEXT: v_mov_b32_e32 v3, v2
+; GFX1064_DPP-NEXT: v_mov_b32_e32 v5, v2
+; GFX1064_DPP-NEXT: v_mov_b32_e32 v7, v2
+; GFX1064_DPP-NEXT: v_mov_b32_e32 v6, v2
+; GFX1064_DPP-NEXT: v_mov_b32_dpp v3, v1 row_shr:1 row_mask:0xf bank_mask:0xf
+; GFX1064_DPP-NEXT: v_mov_b32_dpp v5, v4 row_shr:1 row_mask:0xf bank_mask:0xf
; GFX1064_DPP-NEXT: v_add_co_u32 v1, vcc, v1, v3
-; GFX1064_DPP-NEXT: v_add_co_ci_u32_e32 v2, vcc, v2, v5, vcc
+; GFX1064_DPP-NEXT: v_add_co_ci_u32_e32 v3, vcc, v4, v5, vcc
+; GFX1064_DPP-NEXT: v_mov_b32_e32 v5, v2
+; GFX1064_DPP-NEXT: v_mov_b32_dpp v7, v1 row_shr:2 row_mask:0xf bank_mask:0xf
+; GFX1064_DPP-NEXT: v_mov_b32_e32 v4, v2
+; GFX1064_DPP-NEXT: v_mov_b32_dpp v6, v3 row_shr:2 row_mask:0xf bank_mask:0xf
+; GFX1064_DPP-NEXT: v_add_co_u32 v1, vcc, v1, v7
+; GFX1064_DPP-NEXT: v_add_co_ci_u32_e32 v3, vcc, v3, v6, vcc
+; GFX1064_DPP-NEXT: v_mov_b32_e32 v7, v2
+; GFX1064_DPP-NEXT: v_mov_b32_dpp v5, v1 row_shr:4 row_mask:0xf bank_mask:0xf
+; GFX1064_DPP-NEXT: v_mov_b32_e32 v6, v2
+; GFX1064_DPP-NEXT: v_mov_b32_dpp v4, v3 row_shr:4 row_mask:0xf bank_mask:0xf
+; GFX1064_DPP-NEXT: v_add_co_u32 v1, vcc, v1, v5
+; GFX1064_DPP-NEXT: v_add_co_ci_u32_e32 v3, vcc, v3, v4, vcc
+; GFX1064_DPP-NEXT: v_mov_b32_e32 v4, v2
+; GFX1064_DPP-NEXT: v_mov_b32_dpp v7, v1 row_shr:8 row_mask:0xf bank_mask:0xf
+; GFX1064_DPP-NEXT: v_mov_b32_dpp v6, v3 row_shr:8 row_mask:0xf bank_mask:0xf
+; GFX1064_DPP-NEXT: v_add_co_u32 v1, vcc, v1, v7
+; GFX1064_DPP-NEXT: v_add_co_ci_u32_e32 v3, vcc, v3, v6, vcc
+; GFX1064_DPP-NEXT: v_mov_b32_e32 v6, v2
+; GFX1064_DPP-NEXT: v_permlanex16_b32 v5, v1, -1, -1
+; GFX1064_DPP-NEXT: v_permlanex16_b32 v7, v3, -1, -1
+; GFX1064_DPP-NEXT: v_mov_b32_dpp v4, v5 quad_perm:[0,1,2,3] row_mask:0xa bank_mask:0xf
+; GFX1064_DPP-NEXT: v_mov_b32_e32 v5, v2
+; GFX1064_DPP-NEXT: v_mov_b32_dpp v6, v7 quad_perm:[0,1,2,3] row_mask:0xa bank_mask:0xf
+; GFX1064_DPP-NEXT: v_add_co_u32 v4, vcc, v1, v4
+; GFX1064_DPP-NEXT: v_add_co_ci_u32_e32 v3, vcc, v3, v6, vcc
+; GFX1064_DPP-NEXT: v_mov_b32_e32 v6, v2
+; GFX1064_DPP-NEXT: v_readlane_b32 s2, v4, 31
+; GFX1064_DPP-NEXT: v_readlane_b32 s3, v3, 31
+; GFX1064_DPP-NEXT: v_mov_b32_e32 v1, s2
+; GFX1064_DPP-NEXT: v_mov_b32_e32 v7, s3
+; GFX1064_DPP-NEXT: v_mov_b32_dpp v5, v1 quad_perm:[0,1,2,3] row_mask:0xc bank_mask:0xf
+; GFX1064_DPP-NEXT: v_mov_b32_e32 v1, v2
+; GFX1064_DPP-NEXT: v_mov_b32_dpp v6, v7 quad_perm:[0,1,2,3] row_mask:0xc bank_mask:0xf
+; GFX1064_DPP-NEXT: v_add_co_u32 v4, vcc, v4, v5
+; GFX1064_DPP-NEXT: v_add_co_ci_u32_e32 v3, vcc, v3, v6, vcc
; GFX1064_DPP-NEXT: s_mov_b64 exec, s[0:1]
; GFX1064_DPP-NEXT: s_load_dwordx4 s[0:3], s[4:5], 0x24
; GFX1064_DPP-NEXT: v_mbcnt_lo_u32_b32 v0, exec_lo, 0
; GFX1064_DPP-NEXT: s_or_saveexec_b64 s[4:5], -1
-; GFX1064_DPP-NEXT: v_mov_b32_dpp v6, v1 row_shr:1 row_mask:0xf bank_mask:0xf
-; GFX1064_DPP-NEXT: v_mov_b32_dpp v7, v2 row_shr:1 row_mask:0xf bank_mask:0xf
-; GFX1064_DPP-NEXT: v_readlane_b32 s6, v2, 15
-; GFX1064_DPP-NEXT: v_readlane_b32 s7, v1, 15
-; GFX1064_DPP-NEXT: v_readlane_b32 s8, v2, 31
-; GFX1064_DPP-NEXT: v_readlane_b32 s9, v1, 31
-; GFX1064_DPP-NEXT: v_readlane_b32 s10, v1, 47
-; GFX1064_DPP-NEXT: v_writelane_b32 v7, s6, 16
-; GFX1064_DPP-NEXT: v_writelane_b32 v6, s7, 16
-; GFX1064_DPP-NEXT: v_readlane_b32 s6, v1, 63
-; GFX1064_DPP-NEXT: v_readlane_b32 s11, v2, 47
-; GFX1064_DPP-NEXT: v_readlane_b32 s7, v2, 63
-; GFX1064_DPP-NEXT: v_writelane_b32 v7, s8, 32
-; GFX1064_DPP-NEXT: v_writelane_b32 v6, s9, 32
+; GFX1064_DPP-NEXT: v_mov_b32_dpp v1, v4 row_shr:1 row_mask:0xf bank_mask:0xf
+; GFX1064_DPP-NEXT: v_mov_b32_dpp v2, v3 row_shr:1 row_mask:0xf bank_mask:0xf
+; GFX1064_DPP-NEXT: v_readlane_b32 s6, v3, 15
+; GFX1064_DPP-NEXT: v_readlane_b32 s7, v4, 15
+; GFX1064_DPP-NEXT: v_readlane_b32 s8, v3, 31
+; GFX1064_DPP-NEXT: v_readlane_b32 s9, v4, 31
+; GFX1064_DPP-NEXT: v_readlane_b32 s10, v4, 47
+; GFX1064_DPP-NEXT: v_writelane_b32 v2, s6, 16
+; GFX1064_DPP-NEXT: v_writelane_b32 v1, s7, 16
+; GFX1064_DPP-NEXT: v_readlane_b32 s6, v4, 63
+; GFX1064_DPP-NEXT: v_readlane_b32 s11, v3, 47
+; GFX1064_DPP-NEXT: v_readlane_b32 s7, v3, 63
+; GFX1064_DPP-NEXT: v_writelane_b32 v2, s8, 32
+; GFX1064_DPP-NEXT: v_writelane_b32 v1, s9, 32
; GFX1064_DPP-NEXT: s_mov_b64 exec, s[4:5]
; GFX1064_DPP-NEXT: v_mbcnt_hi_u32_b32 v0, exec_hi, v0
; GFX1064_DPP-NEXT: s_or_saveexec_b64 s[8:9], -1
; GFX1064_DPP-NEXT: s_mov_b64 s[4:5], s[6:7]
-; GFX1064_DPP-NEXT: v_writelane_b32 v7, s11, 48
-; GFX1064_DPP-NEXT: v_writelane_b32 v6, s10, 48
+; GFX1064_DPP-NEXT: v_writelane_b32 v2, s11, 48
+; GFX1064_DPP-NEXT: v_writelane_b32 v1, s10, 48
; GFX1064_DPP-NEXT: s_mov_b64 exec, s[8:9]
; GFX1064_DPP-NEXT: v_cmp_eq_u32_e32 vcc, 0, v0
; GFX1064_DPP-NEXT: s_mov_b32 s6, -1
@@ -3076,8 +3076,8 @@ define amdgpu_kernel void @add_i64_varying(ptr addrspace(1) %out, ptr addrspace(
; GFX1064_DPP-NEXT: s_or_b64 exec, exec, s[8:9]
; GFX1064_DPP-NEXT: s_waitcnt lgkmcnt(0)
; GFX1064_DPP-NEXT: v_readfirstlane_b32 s2, v8
-; GFX1064_DPP-NEXT: v_mov_b32_e32 v10, v6
-; GFX1064_DPP-NEXT: v_mov_b32_e32 v11, v7
+; GFX1064_DPP-NEXT: v_mov_b32_e32 v10, v1
+; GFX1064_DPP-NEXT: v_mov_b32_e32 v11, v2
; GFX1064_DPP-NEXT: v_readfirstlane_b32 s3, v9
; GFX1064_DPP-NEXT: v_add_co_u32 v8, vcc, s2, v10
; GFX1064_DPP-NEXT: s_mov_b32 s2, s6
@@ -3089,70 +3089,70 @@ define amdgpu_kernel void @add_i64_varying(ptr addrspace(1) %out, ptr addrspace(
; GFX1032_DPP-LABEL: add_i64_varying:
; GFX1032_DPP: ; %bb.0: ; %entry
; GFX1032_DPP-NEXT: s_or_saveexec_b32 s0, -1
-; GFX1032_DPP-NEXT: v_mov_b32_e32 v1, 0
-; GFX1032_DPP-NEXT: v_cndmask_b32_e64 v2, 0, v0, s0
-; GFX1032_DPP-NEXT: v_cndmask_b32_e64 v3, 0, 0, s0
-; GFX1032_DPP-NEXT: v_mov_b32_e32 v4, 0
-; GFX1032_DPP-NEXT: v_mov_b32_e32 v6, 0
-; GFX1032_DPP-NEXT: v_mov_b32_e32 v5, 0
-; GFX1032_DPP-NEXT: v_mov_b32_dpp v1, v2 row_shr:1 row_mask:0xf bank_mask:0xf
-; GFX1032_DPP-NEXT: v_mov_b32_e32 v8, 0
-; GFX1032_DPP-NEXT: v_mov_b32_dpp v4, v3 row_shr:1 row_mask:0xf bank_mask:0xf
-; GFX1032_DPP-NEXT: v_mov_b32_e32 v7, 0
-; GFX1032_DPP-NEXT: v_add_co_u32 v1, vcc_lo, v2, v1
-; GFX1032_DPP-NEXT: v_add_co_ci_u32_e32 v2, vcc_lo, v3, v4, vcc_lo
-; GFX1032_DPP-NEXT: v_mov_b32_e32 v4, 0
-; GFX1032_DPP-NEXT: v_mov_b32_dpp v6, v1 row_shr:2 row_mask:0xf bank_mask:0xf
-; GFX1032_DPP-NEXT: v_mov_b32_e32 v3, 0
-; GFX1032_DPP-NEXT: v_mov_b32_dpp v5, v2 row_shr:2 row_mask:0xf bank_mask:0xf
-; GFX1032_DPP-NEXT: v_add_co_u32 v1, vcc_lo, v1, v6
-; GFX1032_DPP-NEXT: v_add_co_ci_u32_e32 v2, vcc_lo, v2, v5, vcc_lo
-; GFX1032_DPP-NEXT: v_mov_b32_e32 v6, 0
-; GFX1032_DPP-NEXT: v_mov_b32_dpp v4, v1 row_shr:4 row_mask:0xf bank_mask:0xf
-; GFX1032_DPP-NEXT: v_mov_b32_e32 v5, 0
-; GFX1032_DPP-NEXT: v_mov_b32_dpp v3, v2 row_shr:4 row_mask:0xf bank_mask:0xf
-; GFX1032_DPP-NEXT: v_add_co_u32 v1, vcc_lo, v1, v4
-; GFX1032_DPP-NEXT: v_add_co_ci_u32_e32 v2, vcc_lo, v2, v3, vcc_lo
-; GFX1032_DPP-NEXT: v_mov_b32_e32 v3, 0
-; GFX1032_DPP-NEXT: v_mov_b32_dpp v6, v1 row_shr:8 row_mask:0xf bank_mask:0xf
-; GFX1032_DPP-NEXT: v_mov_b32_dpp v5, v2 row_shr:8 row_mask:0xf bank_mask:0xf
-; GFX1032_DPP-NEXT: v_add_co_u32 v1, vcc_lo, v1, v6
-; GFX1032_DPP-NEXT: v_add_co_ci_u32_e32 v2, vcc_lo, v2, v5, vcc_lo
-; GFX1032_DPP-NEXT: v_mov_b32_e32 v5, 0
-; GFX1032_DPP-NEXT: v_permlanex16_b32 v4, v1, -1, -1
-; GFX1032_DPP-NEXT: v_permlanex16_b32 v6, v2, -1, -1
-; GFX1032_DPP-NEXT: v_mov_b32_dpp v3, v4 quad_perm:[0,1,2,3] row_mask:0xa bank_mask:0xf
-; GFX1032_DPP-NEXT: v_mov_b32_dpp v5, v6 quad_perm:[0,1,2,3] row_mask:0xa bank_mask:0xf
+; GFX1032_DPP-NEXT: v_mov_b32_e32 v2, 0
+; GFX1032_DPP-NEXT: v_cndmask_b32_e64 v1, 0, v0, s0
+; GFX1032_DPP-NEXT: v_cndmask_b32_e64 v4, 0, 0, s0
+; GFX1032_DPP-NEXT: v_mov_b32_e32 v3, v2
+; GFX1032_DPP-NEXT: v_mov_b32_e32 v5, v2
+; GFX1032_DPP-NEXT: v_mov_b32_e32 v7, v2
+; GFX1032_DPP-NEXT: v_mov_b32_e32 v6, v2
+; GFX1032_DPP-NEXT: ...
[truncated]
Missing new test?
Yes. I have spent 3 hours today trying to reproduce the original problem in a reasonably small and clear test case and have failed so far. I can create a small MIR test case, though not a clear one due to the pass pipeline differences between llc and llpc, plus a test case showing the property is added to the MFI, but I really want to hear from gfx whether that is a welcome approach at all. The gfx input here is essential because the problem really happens when the atomic optimizer is run. It is disabled for compute but enabled for llpc. I am not aware of any other scenario for compute where a strict.wwm call would happen, and happen late enough for an IR LICM to miss it.
Force-pushed 5b9cca3 to f7558c1
Tests added.
I guess my concern is performance regressions if any use of WWM (e.g. the atomic optimizer) essentially turns off Machine LICM.
I agree. But when moving the code LLVM thinks it is something cheap, and it is not, which is also a performance problem. Things would be much easier if we could tell that an instruction belongs to a WWM region.
Force-pushed f7558c1 to 22d28d8
Force-pushed 22d28d8 to 9b437e6
@@ -2773,6 +2773,9 @@ void AMDGPUDAGToDAGISel::SelectINTRINSIC_WO_CHAIN(SDNode *N) {
  case Intrinsic::amdgcn_wwm:
  case Intrinsic::amdgcn_strict_wwm:
    Opcode = AMDGPU::STRICT_WWM;
    CurDAG->getMachineFunction()
        .getInfo<SIMachineFunctionInfo>()
        ->setInitWholeWave();
I would not recommend using HasInitWholeWave for this. That has a very narrow meaning referring only to the use of the llvm.amdgcn.init.whole.wave intrinsic, not WWM in general.
Ack. I can create a separate property HasWWM, but I really want to hear if we even want to go that way.
Machine LICM can hoist a VALU instruction from a WWM region. In this case the WQM pass will have to create yet another WWM region around the hoisted instruction, which is not desired. Unfortunately we cannot tell if an instruction is in a WWM region, so this patch disables hoisting if WWM is used in the function. This works around the bug SWDEV-502411.
Force-pushed 9b437e6 to 14727cf
Machine LICM can hoist a VALU instruction from a WWM region.
In this case WQM pass will have to create yet another WWM region
around the hoisted instruction, which is not desired.
Unfortunately we cannot tell if an instruction is in the WWM
region, so this patch disables hoisting if WWM is used in the
function.
This works around the bug SWDEV-502411.
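For readers less familiar with WWM: inside a strict WWM region the exec mask is temporarily forced to all-ones (visible as the `s_or_saveexec_b64 s[4:5], -1` lines in the test diff above), so a VALU instruction produces a result in every lane. The toy model below — purely an illustration, not LLVM or hardware behavior, with made-up lane counts and values — shows why an instruction moved out of such a region runs under a different exec mask, which is why WQM would have to wrap it in yet another WWM region:

```python
# Toy 4-lane model of exec-masked VALU execution. Purely illustrative.
ALL_ONES = 0b1111  # exec mask inside a strict WWM region

def valu_mov(dst, src, exec_mask, lanes=4):
    # A lane writes its result only when its exec bit is set;
    # inactive lanes keep their previous destination value.
    return [src[i] if (exec_mask >> i) & 1 else dst[i] for i in range(lanes)]

prev = [0, 0, 0, 0]
src = [10, 20, 30, 40]

inside_wwm = valu_mov(prev, src, ALL_ONES)  # all lanes updated
outside = valu_mov(prev, src, 0b0101)       # only lanes 0 and 2 updated
```

Run once, `inside_wwm` updates all four lanes while `outside` leaves lanes 1 and 3 at their old values — the divergence that makes hoisting out of the region change which lanes hold valid data.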