[AMDGPU][SDAG] DAGCombine PTRADD -> disjoint OR #146075
base: users/ritter-x2a/06-26-_sdag_amdgpu_allow_opting_in_to_oob-generating_ptradd_transforms
Conversation
If we can't fold a PTRADD's offset into its users, lowering it to a disjoint OR is preferable: often, a single 32-bit OR instruction suffices where we'd otherwise use a pair of 32-bit additions with carry.

This needs to be a DAGCombine (and not a selection rule) because its main purpose is to enable subsequent DAGCombines for bitwise operations. We don't want to just turn PTRADDs into disjoint ORs whenever that's sound, because this transform loses the information that the operation implements pointer arithmetic, which we will soon need to fold offsets into FLAT instructions. Currently, disjoint ORs can still be used for offset folding, so that part of the logic can't be tested yet.

The PR contains a hacky workaround for a situation where an AssertAlign operand of a PTRADD is not DAGCombined before the PTRADD, causing the PTRADD to be turned into a disjoint OR although reassociating it with the operand of the AssertAlign would be better. This wouldn't be a problem if the DAGCombiner ensured that a node is only processed after all its operands have been processed.

For SWDEV-516125.
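For intuition, here is a minimal standalone sketch (illustrative only, not the PR's DAGCombine code; the values and names are assumptions) of the arithmetic fact the transform relies on: when the base and the offset share no set bits, the addition can never produce a carry, so base + offset equals base | offset and a single 32-bit OR on the low half can stand in for the 64-bit add-with-carry pair.

// Minimal standalone sketch (not the PR's code): when two values share no
// set bits, addition never produces a carry, so it agrees with bitwise OR.
// This is the property that DAG.haveNoCommonBitsSet establishes before the
// PTRADD is rewritten as a disjoint OR.
#include <cassert>
#include <cstdint>
#include <cstdio>

// True iff A and B have no set bit in common.
static bool haveNoCommonBits(uint64_t A, uint64_t B) { return (A & B) == 0; }

int main() {
  // A 16-byte-aligned base address and a small offset, mirroring the
  // ptrmask + gep pattern in the new gep_disjoint_or test below.
  uint64_t Base = 0x7ffff1234500ull & ~0xfull; // low 4 bits known to be zero
  uint64_t Offset = 4;                         // fits entirely in those bits

  assert(haveNoCommonBits(Base, Offset));
  // No carry chain is triggered, so a plain OR on the low 32 bits computes
  // the same address as a pair of 32-bit additions with carry would.
  assert(Base + Offset == (Base | Offset));
  printf("0x%llx\n", (unsigned long long)(Base | Offset));
  return 0;
}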
@llvm/pr-subscribers-llvm-selectiondag @llvm/pr-subscribers-backend-amdgpu

Author: Fabian Ritter (ritter-x2a)

Full diff: https://github.com/llvm/llvm-project/pull/146075.diff

2 Files Affected:
diff --git a/llvm/lib/Target/AMDGPU/SIISelLowering.cpp b/llvm/lib/Target/AMDGPU/SIISelLowering.cpp
index 822bab88c8a09..71230078edc69 100644
--- a/llvm/lib/Target/AMDGPU/SIISelLowering.cpp
+++ b/llvm/lib/Target/AMDGPU/SIISelLowering.cpp
@@ -15136,6 +15136,41 @@ SDValue SITargetLowering::performPtrAddCombine(SDNode *N,
return Folded;
}
+ // Transform (ptradd a, b) -> (or disjoint a, b) if it is equivalent and if
+ // that transformation can't block an offset folding at any use of the ptradd.
+ // This should be done late, after legalization, so that it doesn't block
+ // other ptradd combines that could enable more offset folding.
+ bool HasIntermediateAssertAlign =
+ N0->getOpcode() == ISD::AssertAlign && N0->getOperand(0)->isAnyAdd();
+ // This is a hack to work around an ordering problem for DAGs like this:
+ // (ptradd (AssertAlign (ptradd p, c1), k), c2)
+ // If the outer ptradd is handled first by the DAGCombiner, it can be
+ // transformed into a disjoint or. Then, when the generic AssertAlign combine
+ // pushes the AssertAlign through the inner ptradd, it's too late for the
+ // ptradd reassociation to trigger.
+ if (!DCI.isBeforeLegalizeOps() && !HasIntermediateAssertAlign &&
+ DAG.haveNoCommonBitsSet(N0, N1)) {
+ bool TransformCanBreakAddrMode = any_of(N->users(), [&](SDNode *User) {
+ if (auto *LoadStore = dyn_cast<MemSDNode>(User);
+ LoadStore && LoadStore->getBasePtr().getNode() == N) {
+ unsigned AS = LoadStore->getAddressSpace();
+ // Currently, we only really need ptradds to fold offsets into flat
+ // memory instructions.
+ if (AS != AMDGPUAS::FLAT_ADDRESS)
+ return false;
+ TargetLoweringBase::AddrMode AM;
+ AM.HasBaseReg = true;
+ EVT VT = LoadStore->getMemoryVT();
+ Type *AccessTy = VT.getTypeForEVT(*DAG.getContext());
+ return isLegalAddressingMode(DAG.getDataLayout(), AM, AccessTy, AS);
+ }
+ return false;
+ });
+
+ if (!TransformCanBreakAddrMode)
+ return DAG.getNode(ISD::OR, DL, VT, N0, N1, SDNodeFlags::Disjoint);
+ }
+
if (N1.getOpcode() != ISD::ADD || !N1.hasOneUse())
return SDValue();
diff --git a/llvm/test/CodeGen/AMDGPU/ptradd-sdag-optimizations.ll b/llvm/test/CodeGen/AMDGPU/ptradd-sdag-optimizations.ll
index 893deb35fe822..64e041103a563 100644
--- a/llvm/test/CodeGen/AMDGPU/ptradd-sdag-optimizations.ll
+++ b/llvm/test/CodeGen/AMDGPU/ptradd-sdag-optimizations.ll
@@ -100,7 +100,7 @@ define void @baseptr_null(i64 %offset, i8 %v) {
; Taken from implicit-kernarg-backend-usage.ll, tests the PTRADD handling in the
; assertalign DAG combine.
-define amdgpu_kernel void @llvm_amdgcn_queue_ptr(ptr addrspace(1) %ptr) #0 {
+define amdgpu_kernel void @llvm_amdgcn_queue_ptr(ptr addrspace(1) %ptr) {
; GFX942-LABEL: llvm_amdgcn_queue_ptr:
; GFX942: ; %bb.0:
; GFX942-NEXT: v_mov_b32_e32 v2, 0
@@ -416,6 +416,60 @@ entry:
ret void
}
+; Check that ptradds can be lowered to disjoint ORs.
+define ptr @gep_disjoint_or(ptr %base) {
+; GFX942-LABEL: gep_disjoint_or:
+; GFX942: ; %bb.0:
+; GFX942-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
+; GFX942-NEXT: v_and_or_b32 v0, v0, -16, 4
+; GFX942-NEXT: s_setpc_b64 s[30:31]
+ %p = call ptr @llvm.ptrmask(ptr %base, i64 s0xf0)
+ %gep = getelementptr nuw inbounds i8, ptr %p, i64 4
+ ret ptr %gep
+}
+
+; Check that AssertAlign nodes between ptradd nodes don't block offset folding,
+; taken from preload-implicit-kernargs.ll
+define amdgpu_kernel void @random_incorrect_offset(ptr addrspace(1) inreg %out) #0 {
+; GFX942_PTRADD-LABEL: random_incorrect_offset:
+; GFX942_PTRADD: ; %bb.1:
+; GFX942_PTRADD-NEXT: s_load_dwordx2 s[2:3], s[0:1], 0x0
+; GFX942_PTRADD-NEXT: s_waitcnt lgkmcnt(0)
+; GFX942_PTRADD-NEXT: s_branch .LBB21_0
+; GFX942_PTRADD-NEXT: .p2align 8
+; GFX942_PTRADD-NEXT: ; %bb.2:
+; GFX942_PTRADD-NEXT: .LBB21_0:
+; GFX942_PTRADD-NEXT: s_load_dword s0, s[0:1], 0xa
+; GFX942_PTRADD-NEXT: v_mov_b32_e32 v0, 0
+; GFX942_PTRADD-NEXT: s_waitcnt lgkmcnt(0)
+; GFX942_PTRADD-NEXT: v_mov_b32_e32 v1, s0
+; GFX942_PTRADD-NEXT: global_store_dword v0, v1, s[2:3]
+; GFX942_PTRADD-NEXT: s_endpgm
+;
+; GFX942_LEGACY-LABEL: random_incorrect_offset:
+; GFX942_LEGACY: ; %bb.1:
+; GFX942_LEGACY-NEXT: s_load_dwordx2 s[2:3], s[0:1], 0x0
+; GFX942_LEGACY-NEXT: s_waitcnt lgkmcnt(0)
+; GFX942_LEGACY-NEXT: s_branch .LBB21_0
+; GFX942_LEGACY-NEXT: .p2align 8
+; GFX942_LEGACY-NEXT: ; %bb.2:
+; GFX942_LEGACY-NEXT: .LBB21_0:
+; GFX942_LEGACY-NEXT: s_mov_b32 s4, 8
+; GFX942_LEGACY-NEXT: s_load_dword s0, s[0:1], s4 offset:0x2
+; GFX942_LEGACY-NEXT: v_mov_b32_e32 v0, 0
+; GFX942_LEGACY-NEXT: s_waitcnt lgkmcnt(0)
+; GFX942_LEGACY-NEXT: v_mov_b32_e32 v1, s0
+; GFX942_LEGACY-NEXT: global_store_dword v0, v1, s[2:3]
+; GFX942_LEGACY-NEXT: s_endpgm
+ %imp_arg_ptr = call ptr addrspace(4) @llvm.amdgcn.implicitarg.ptr()
+ %gep = getelementptr i8, ptr addrspace(4) %imp_arg_ptr, i32 2
+ %load = load i32, ptr addrspace(4) %gep
+ store i32 %load, ptr addrspace(1) %out
+ ret void
+}
+
declare void @llvm.memcpy.p0.p4.i64(ptr noalias nocapture writeonly, ptr addrspace(4) noalias nocapture readonly, i64, i1 immarg)
!0 = !{}
+
+attributes #0 = { "amdgpu-agpr-alloc"="0" "amdgpu-no-completion-action" "amdgpu-no-default-queue" "amdgpu-no-dispatch-id" "amdgpu-no-dispatch-ptr" "amdgpu-no-heap-ptr" "amdgpu-no-hostcall-ptr" "amdgpu-no-lds-kernel-id" "amdgpu-no-multigrid-sync-arg" "amdgpu-no-queue-ptr" "amdgpu-no-workgroup-id-x" "amdgpu-no-workgroup-id-y" "amdgpu-no-workgroup-id-z" "amdgpu-no-workitem-id-x" "amdgpu-no-workitem-id-y" "amdgpu-no-workitem-id-z" "uniform-work-group-size"="false" }
None of these attributes should be relevant
bool TransformCanBreakAddrMode = any_of(N->users(), [&](SDNode *User) {
  if (auto *LoadStore = dyn_cast<MemSDNode>(User);
      LoadStore && LoadStore->getBasePtr().getNode() == N) {
    unsigned AS = LoadStore->getAddressSpace();
    // Currently, we only really need ptradds to fold offsets into flat
    // memory instructions.
    if (AS != AMDGPUAS::FLAT_ADDRESS)
      return false;
    TargetLoweringBase::AddrMode AM;
    AM.HasBaseReg = true;
    EVT VT = LoadStore->getMemoryVT();
    Type *AccessTy = VT.getTypeForEVT(*DAG.getContext());
    return isLegalAddressingMode(DAG.getDataLayout(), AM, AccessTy, AS);
  }
  return false;
I don't think you need all of this target specific logic, and can just deal with this in the generic combine.
Also can we just drop the AssertAlign in the case where the source is an add of a constant offset? Or push it through an add + constant?
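For reference, a value-level sketch of what "push it through an add + constant" amounts to (assumed naming, not the generic DAGCombiner code): if (p + C) is asserted to be A-aligned and C is a constant, the assertion that survives on p is the smaller of A and the largest power of two dividing C.

// Hypothetical value-level sketch of pushing an alignment assertion through
// an add of a constant; function names here are illustrative, not LLVM API.
#include <cstdint>
#include <cstdio>

// Largest power of two dividing X (X == 0 is treated as maximally aligned).
static uint64_t maxPowerOfTwoDivisor(uint64_t X) { return X ? (X & -X) : ~0ull; }

// Given that (p + C) is AssertedAlign-aligned, return the alignment that can
// still be asserted on p itself once the assertion is pushed through the add.
static uint64_t alignAfterPushThroughAdd(uint64_t AssertedAlign, uint64_t C) {
  uint64_t AlignOfC = maxPowerOfTwoDivisor(C);
  return AssertedAlign < AlignOfC ? AssertedAlign : AlignOfC;
}

int main() {
  // AssertAlign(p + 8, 16): 8 is only 8-aligned, so after pushing the
  // assertion inward, p can only be asserted to be 8-aligned.
  printf("%llu\n", (unsigned long long)alignAfterPushThroughAdd(16, 8));
  // AssertAlign(p + 32, 16): 32 is 16-aligned, so the full 16-byte
  // assertion carries over to p unchanged.
  printf("%llu\n", (unsigned long long)alignAfterPushThroughAdd(16, 32));
  return 0;
}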