-
Notifications
You must be signed in to change notification settings - Fork 14.3k
[AMDGPU] Use alias info to relax waitcounts for LDS DMA #74537
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
LDA DMA loads increase VMCNT and a load from the LDS stored must wait on this counter to only read memory after it is written. Wait count insertion pass does not track memory dependencies, it tracks register dependencies. To model the LDS dependency a psuedo register is used in the scoreboard, acting like if LDS DMA writes it and LDS load reads it. This patch adds 8 more pseudo registers to use for independent LDS locations if we can prove they are disjoint using alias analysis. Fixes: SWDEV-433427
@llvm/pr-subscribers-backend-amdgpu Author: Stanislav Mekhanoshin (rampitec) ChangesLDA DMA loads increase VMCNT and a load from the LDS stored must wait on this counter to only read memory after it is written. Wait count insertion pass does not track memory dependencies, it tracks register dependencies. To model the LDS dependency a psuedo register is used in the scoreboard, acting like if LDS DMA writes it and LDS load reads it. This patch adds 8 more pseudo registers to use for independent LDS locations if we can prove they are disjoint using alias analysis. Fixes: SWDEV-433427 Patch is 20.08 KiB, truncated to 20.00 KiB below, full version: https://github.com/llvm/llvm-project/pull/74537.diff 6 Files Affected:
diff --git a/llvm/lib/Target/AMDGPU/SIISelLowering.cpp b/llvm/lib/Target/AMDGPU/SIISelLowering.cpp
index a7f4d63229b7e..2e079404b087f 100644
--- a/llvm/lib/Target/AMDGPU/SIISelLowering.cpp
+++ b/llvm/lib/Target/AMDGPU/SIISelLowering.cpp
@@ -1128,11 +1128,10 @@ bool SITargetLowering::getTgtMemIntrinsic(IntrinsicInfo &Info,
MachineMemOperand::MOStore |
MachineMemOperand::MODereferenceable;
- // XXX - Should this be volatile without known ordering?
- Info.flags |= MachineMemOperand::MOVolatile;
-
switch (IntrID) {
default:
+ // XXX - Should this be volatile without known ordering?
+ Info.flags |= MachineMemOperand::MOVolatile;
break;
case Intrinsic::amdgcn_raw_buffer_load_lds:
case Intrinsic::amdgcn_raw_ptr_buffer_load_lds:
@@ -1140,6 +1139,7 @@ bool SITargetLowering::getTgtMemIntrinsic(IntrinsicInfo &Info,
case Intrinsic::amdgcn_struct_ptr_buffer_load_lds: {
unsigned Width = cast<ConstantInt>(CI.getArgOperand(2))->getZExtValue();
Info.memVT = EVT::getIntegerVT(CI.getContext(), Width * 8);
+ Info.ptrVal = CI.getArgOperand(1);
return true;
}
}
@@ -1268,8 +1268,8 @@ bool SITargetLowering::getTgtMemIntrinsic(IntrinsicInfo &Info,
Info.opc = ISD::INTRINSIC_VOID;
unsigned Width = cast<ConstantInt>(CI.getArgOperand(2))->getZExtValue();
Info.memVT = EVT::getIntegerVT(CI.getContext(), Width * 8);
- Info.flags |= MachineMemOperand::MOLoad | MachineMemOperand::MOStore |
- MachineMemOperand::MOVolatile;
+ Info.ptrVal = CI.getArgOperand(1);
+ Info.flags |= MachineMemOperand::MOLoad | MachineMemOperand::MOStore;
return true;
}
case Intrinsic::amdgcn_ds_bvh_stack_rtn: {
@@ -9084,7 +9084,9 @@ SDValue SITargetLowering::LowerINTRINSIC_VOID(SDValue Op,
MachinePointerInfo LoadPtrI = LoadMMO->getPointerInfo();
MachinePointerInfo StorePtrI = LoadPtrI;
- StorePtrI.V = nullptr;
+ LoadPtrI.V = UndefValue::get(
+ PointerType::get(*DAG.getContext(), AMDGPUAS::GLOBAL_ADDRESS));
+ LoadPtrI.AddrSpace = AMDGPUAS::GLOBAL_ADDRESS;
StorePtrI.AddrSpace = AMDGPUAS::LOCAL_ADDRESS;
auto F = LoadMMO->getFlags() &
@@ -9162,6 +9164,8 @@ SDValue SITargetLowering::LowerINTRINSIC_VOID(SDValue Op,
MachinePointerInfo LoadPtrI = LoadMMO->getPointerInfo();
LoadPtrI.Offset = Op->getConstantOperandVal(5);
MachinePointerInfo StorePtrI = LoadPtrI;
+ LoadPtrI.V = UndefValue::get(
+ PointerType::get(*DAG.getContext(), AMDGPUAS::GLOBAL_ADDRESS));
LoadPtrI.AddrSpace = AMDGPUAS::GLOBAL_ADDRESS;
StorePtrI.AddrSpace = AMDGPUAS::LOCAL_ADDRESS;
auto F = LoadMMO->getFlags() &
diff --git a/llvm/lib/Target/AMDGPU/SIInsertWaitcnts.cpp b/llvm/lib/Target/AMDGPU/SIInsertWaitcnts.cpp
index ede4841b8a5fd..50ad22130e939 100644
--- a/llvm/lib/Target/AMDGPU/SIInsertWaitcnts.cpp
+++ b/llvm/lib/Target/AMDGPU/SIInsertWaitcnts.cpp
@@ -31,6 +31,7 @@
#include "llvm/ADT/MapVector.h"
#include "llvm/ADT/PostOrderIterator.h"
#include "llvm/ADT/Sequence.h"
+#include "llvm/Analysis/AliasAnalysis.h"
#include "llvm/CodeGen/MachineLoopInfo.h"
#include "llvm/CodeGen/MachinePostDominators.h"
#include "llvm/InitializePasses.h"
@@ -121,8 +122,13 @@ enum RegisterMapping {
SQ_MAX_PGM_VGPRS = 512, // Maximum programmable VGPRs across all targets.
AGPR_OFFSET = 256, // Maximum programmable ArchVGPRs across all targets.
SQ_MAX_PGM_SGPRS = 256, // Maximum programmable SGPRs across all targets.
- NUM_EXTRA_VGPRS = 1, // A reserved slot for DS.
- EXTRA_VGPR_LDS = 0, // An artificial register to track LDS writes.
+ NUM_EXTRA_VGPRS = 9, // Reserved slots for DS.
+ // Artificial register slots to track LDS writes into specific LDS locations
+ // if a location is known. When slots are exhausted or location is
+ // unknown use the first slot. The first slot is also always updated in
+ // addition to known location's slot to properly generate waits if dependent
+ // instruction's location is unknown.
+ EXTRA_VGPR_LDS = 0,
NUM_ALL_VGPRS = SQ_MAX_PGM_VGPRS + NUM_EXTRA_VGPRS, // Where SGPR starts.
};
@@ -292,6 +298,10 @@ class WaitcntBrackets {
VgprVmemTypes[GprNo] = 0;
}
+ const SmallVectorImpl<const MachineInstr *>& getLDSDMAStores() const {
+ return LDSDMAStores;
+ }
+
void print(raw_ostream &);
void dump() { print(dbgs()); }
@@ -354,6 +364,9 @@ class WaitcntBrackets {
// Bitmask of the VmemTypes of VMEM instructions that might have a pending
// write to each vgpr.
unsigned char VgprVmemTypes[NUM_ALL_VGPRS] = {0};
+ // Store representative LDS DMA operations. The only useful info here is
+ // alias info. One store is kept per unique AAInfo.
+ SmallVector<const MachineInstr *, NUM_EXTRA_VGPRS - 1> LDSDMAStores;
};
class SIInsertWaitcnts : public MachineFunctionPass {
@@ -369,6 +382,7 @@ class SIInsertWaitcnts : public MachineFunctionPass {
DenseMap<MachineBasicBlock *, bool> PreheadersToFlush;
MachineLoopInfo *MLI;
MachinePostDominatorTree *PDT;
+ AliasAnalysis *AA = nullptr;
struct BlockInfo {
std::unique_ptr<WaitcntBrackets> Incoming;
@@ -411,6 +425,8 @@ class SIInsertWaitcnts : public MachineFunctionPass {
AU.setPreservesCFG();
AU.addRequired<MachineLoopInfo>();
AU.addRequired<MachinePostDominatorTree>();
+ AU.addRequired<AAResultsWrapperPass>();
+ AU.addPreserved<AAResultsWrapperPass>();
MachineFunctionPass::getAnalysisUsage(AU);
}
@@ -452,7 +468,7 @@ class SIInsertWaitcnts : public MachineFunctionPass {
// FLAT instruction.
WaitEventType getVmemWaitEventType(const MachineInstr &Inst) const {
assert(SIInstrInfo::isVMEM(Inst) || SIInstrInfo::isFLAT(Inst));
- if (!ST->hasVscnt())
+ if (!ST->hasVscnt() || SIInstrInfo::isLDSDMA(Inst))
return VMEM_ACCESS;
if (Inst.mayStore() && !SIInstrInfo::isAtomicRet(Inst)) {
// FLAT and SCRATCH instructions may access scratch. Other VMEM
@@ -547,8 +563,7 @@ void WaitcntBrackets::setExpScore(const MachineInstr *MI,
// MUBUF and FLAT LDS DMA operations need a wait on vmcnt before LDS written
// can be accessed. A load from LDS to VMEM does not need a wait.
static bool mayWriteLDSThroughDMA(const MachineInstr &MI) {
- return SIInstrInfo::isVALU(MI) &&
- (SIInstrInfo::isMUBUF(MI) || SIInstrInfo::isFLAT(MI)) &&
+ return SIInstrInfo::isLDSDMA(MI) &&
MI.getOpcode() != AMDGPU::BUFFER_STORE_LDS_DWORD;
}
@@ -704,7 +719,36 @@ void WaitcntBrackets::updateByEvent(const SIInstrInfo *TII,
}
}
if (Inst.mayStore() && (TII->isDS(Inst) || mayWriteLDSThroughDMA(Inst))) {
- setRegScore(SQ_MAX_PGM_VGPRS + EXTRA_VGPR_LDS, T, CurrScore);
+ unsigned Slot = 0;
+ for (const auto *MemOp : Inst.memoperands()) {
+ if (MemOp->isStore() &&
+ MemOp->getAddrSpace() == AMDGPUAS::LOCAL_ADDRESS) {
+ // Comparing just AA info does not guarantee memoperands are equal
+ // in general, but this is so for LDS DMA on practice.
+ auto AAI = MemOp->getAAInfo();
+ if (!AAI)
+ break;
+ auto I = llvm::find_if(LDSDMAStores, [&AAI](const MachineInstr *I) {
+ for (const auto *MemOp : I->memoperands()) {
+ if (MemOp->isStore())
+ return AAI == MemOp->getAAInfo();
+ }
+ return false;
+ });
+ if (I != LDSDMAStores.end()) {
+ Slot = I - LDSDMAStores.begin() + 1;
+ break;
+ }
+ if (LDSDMAStores.size() == NUM_EXTRA_VGPRS - 1)
+ break;
+ LDSDMAStores.push_back(&Inst);
+ Slot = LDSDMAStores.size();
+ break;
+ }
+ }
+ setRegScore(SQ_MAX_PGM_VGPRS + EXTRA_VGPR_LDS + Slot, T, CurrScore);
+ if (Slot)
+ setRegScore(SQ_MAX_PGM_VGPRS + EXTRA_VGPR_LDS, T, CurrScore);
}
}
}
@@ -1180,9 +1224,21 @@ bool SIInsertWaitcnts::generateWaitcntInstBefore(MachineInstr &MI,
// No need to wait before load from VMEM to LDS.
if (mayWriteLDSThroughDMA(MI))
continue;
- unsigned RegNo = SQ_MAX_PGM_VGPRS + EXTRA_VGPR_LDS;
+
// VM_CNT is only relevant to vgpr or LDS.
- ScoreBrackets.determineWait(VM_CNT, RegNo, Wait);
+ unsigned RegNo = SQ_MAX_PGM_VGPRS + EXTRA_VGPR_LDS;
+ bool FoundAliasingStore = false;
+ if (Ptr && Memop->getAAInfo()) {
+ const auto &LDSDMAStores = ScoreBrackets.getLDSDMAStores();
+ for (unsigned I = 0, E = LDSDMAStores.size(); I != E; ++I) {
+ if (MI.mayAlias(AA, *LDSDMAStores[I], true)) {
+ FoundAliasingStore = true;
+ ScoreBrackets.determineWait(VM_CNT, RegNo + I + 1, Wait);
+ }
+ }
+ }
+ if (!FoundAliasingStore)
+ ScoreBrackets.determineWait(VM_CNT, RegNo, Wait);
if (Memop->isStore()) {
ScoreBrackets.determineWait(EXP_CNT, RegNo, Wait);
}
@@ -1818,6 +1874,7 @@ bool SIInsertWaitcnts::runOnMachineFunction(MachineFunction &MF) {
const SIMachineFunctionInfo *MFI = MF.getInfo<SIMachineFunctionInfo>();
MLI = &getAnalysis<MachineLoopInfo>();
PDT = &getAnalysis<MachinePostDominatorTree>();
+ AA = &getAnalysis<AAResultsWrapperPass>().getAAResults();
ForceEmitZeroWaitcnts = ForceEmitZeroFlag;
for (auto T : inst_counter_types())
diff --git a/llvm/lib/Target/AMDGPU/SIInstrInfo.cpp b/llvm/lib/Target/AMDGPU/SIInstrInfo.cpp
index b5b456d691254..17befd6c0e346 100644
--- a/llvm/lib/Target/AMDGPU/SIInstrInfo.cpp
+++ b/llvm/lib/Target/AMDGPU/SIInstrInfo.cpp
@@ -3654,8 +3654,8 @@ bool SIInstrInfo::areMemAccessesTriviallyDisjoint(const MachineInstr &MIa,
// underlying address space, even if it was lowered to a different one,
// e.g. private accesses lowered to use MUBUF instructions on a scratch
// buffer.
- if (isDS(MIa)) {
- if (isDS(MIb))
+ if (isDS(MIa) || isLDSDMA(MIa)) {
+ if (isDS(MIb) || isLDSDMA(MIb))
return checkInstOffsetsDoNotOverlap(MIa, MIb);
return !isFLAT(MIb) || isSegmentSpecificFLAT(MIb);
diff --git a/llvm/lib/Target/AMDGPU/SIInstrInfo.h b/llvm/lib/Target/AMDGPU/SIInstrInfo.h
index e388b5550cb10..456d8d835986f 100644
--- a/llvm/lib/Target/AMDGPU/SIInstrInfo.h
+++ b/llvm/lib/Target/AMDGPU/SIInstrInfo.h
@@ -547,6 +547,14 @@ class SIInstrInfo final : public AMDGPUGenInstrInfo {
return get(Opcode).TSFlags & SIInstrFlags::DS;
}
+ static bool isLDSDMA(const MachineInstr &MI) {
+ return isVALU(MI) && (isMUBUF(MI) || isFLAT(MI));
+ }
+
+ bool isLDSDMA(uint16_t Opcode) {
+ return isVALU(Opcode) && (isMUBUF(Opcode) || isFLAT(Opcode));
+ }
+
static bool isGWS(const MachineInstr &MI) {
return MI.getDesc().TSFlags & SIInstrFlags::GWS;
}
diff --git a/llvm/lib/Target/AMDGPU/lds-dma-waits.ll b/llvm/lib/Target/AMDGPU/lds-dma-waits.ll
new file mode 100644
index 0000000000000..31155290216d0
--- /dev/null
+++ b/llvm/lib/Target/AMDGPU/lds-dma-waits.ll
@@ -0,0 +1,154 @@
+; RUN: llc -march=amdgcn -mcpu=gfx900 -verify-machineinstrs < %s | FileCheck %s --check-prefixeses=GCN,GFX9
+; RUN: llc -march=amdgcn -mcpu=gfx1030 -verify-machineinstrs < %s | FileCheck %s --check-prefixeses=GCN,GFX10
+
+@lds.0 = internal addrspace(3) global [64 x float] poison, align 16
+@lds.1 = internal addrspace(3) global [64 x float] poison, align 16
+@lds.2 = internal addrspace(3) global [64 x float] poison, align 16
+@lds.3 = internal addrspace(3) global [64 x float] poison, align 16
+@lds.4 = internal addrspace(3) global [64 x float] poison, align 16
+@lds.5 = internal addrspace(3) global [64 x float] poison, align 16
+@lds.6 = internal addrspace(3) global [64 x float] poison, align 16
+@lds.7 = internal addrspace(3) global [64 x float] poison, align 16
+@lds.8 = internal addrspace(3) global [64 x float] poison, align 16
+@lds.9 = internal addrspace(3) global [64 x float] poison, align 16
+
+declare void @llvm.amdgcn.raw.buffer.load.lds(<4 x i32> %rsrc, ptr addrspace(3) nocapture, i32 %size, i32 %voffset, i32 %soffset, i32 %offset, i32 %aux)
+declare void @llvm.amdgcn.global.load.lds(ptr addrspace(1) nocapture %gptr, ptr addrspace(3) nocapture %lptr, i32 %size, i32 %offset, i32 %aux)
+
+; GCN-LABEL: {{^}}buffer_load_lds_dword_2_arrays:
+; GCN-COUNT-4: buffer_load_dword
+; GCN: s_waitcnt vmcnt(2)
+; GCN: ds_read_b32
+; GCN: s_waitcnt vmcnt(0)
+; GCN: ds_read_b32
+define amdgpu_kernel void @buffer_load_lds_dword_2_arrays(<4 x i32> %rsrc, i32 %i1, i32 %i2, ptr addrspace(1) %out) {
+main_body:
+ call void @llvm.amdgcn.raw.buffer.load.lds(<4 x i32> %rsrc, ptr addrspace(3) @lds.0, i32 4, i32 0, i32 0, i32 0, i32 0)
+ call void @llvm.amdgcn.raw.buffer.load.lds(<4 x i32> %rsrc, ptr addrspace(3) @lds.0, i32 4, i32 4, i32 0, i32 0, i32 0)
+ call void @llvm.amdgcn.raw.buffer.load.lds(<4 x i32> %rsrc, ptr addrspace(3) @lds.1, i32 4, i32 8, i32 0, i32 0, i32 0)
+ call void @llvm.amdgcn.raw.buffer.load.lds(<4 x i32> %rsrc, ptr addrspace(3) @lds.1, i32 4, i32 12, i32 0, i32 0, i32 0)
+ %gep.0 = getelementptr float, ptr addrspace(3) @lds.0, i32 %i1
+ %gep.1 = getelementptr float, ptr addrspace(3) @lds.1, i32 %i2
+ %val.0 = load float, ptr addrspace(3) %gep.0, align 4
+ call void @llvm.amdgcn.wave.barrier()
+ %val.1 = load float, ptr addrspace(3) %gep.1, align 4
+ %tmp.0 = insertelement <2 x float> undef, float %val.0, i32 0
+ %res = insertelement <2 x float> %tmp.0, float %val.1, i32 1
+ store <2 x float> %res, ptr addrspace(1) %out
+ ret void
+}
+
+; On gfx9 if there is a pending FLAT operation, and this is a VMem or LGKM
+; waitcnt and the target can report early completion, then we need to force a waitcnt 0.
+
+; GCN-LABEL: {{^}}global_load_lds_dword_2_arrays:
+; GCN-COUNT-4: global_load_dword
+; GFX9: s_waitcnt vmcnt(0)
+; GFX9-COUNT-2: ds_read_b32
+; GFX10: s_waitcnt vmcnt(2)
+; GFX10: ds_read_b32
+; GFX10: s_waitcnt vmcnt(0)
+; GFX10: ds_read_b32
+define amdgpu_kernel void @global_load_lds_dword_2_arrays(ptr addrspace(1) nocapture %gptr, i32 %i1, i32 %i2, ptr addrspace(1) %out) {
+main_body:
+ call void @llvm.amdgcn.global.load.lds(ptr addrspace(1) %gptr, ptr addrspace(3) @lds.0, i32 4, i32 0, i32 0)
+ call void @llvm.amdgcn.global.load.lds(ptr addrspace(1) %gptr, ptr addrspace(3) @lds.0, i32 4, i32 4, i32 0)
+ call void @llvm.amdgcn.global.load.lds(ptr addrspace(1) %gptr, ptr addrspace(3) @lds.1, i32 4, i32 8, i32 0)
+ call void @llvm.amdgcn.global.load.lds(ptr addrspace(1) %gptr, ptr addrspace(3) @lds.1, i32 4, i32 12, i32 0)
+ %gep.0 = getelementptr float, ptr addrspace(3) @lds.0, i32 %i1
+ %gep.1 = getelementptr float, ptr addrspace(3) @lds.1, i32 %i2
+ %val.0 = load float, ptr addrspace(3) %gep.0, align 4
+ call void @llvm.amdgcn.wave.barrier()
+ %val.1 = load float, ptr addrspace(3) %gep.1, align 4
+ %tmp.0 = insertelement <2 x float> undef, float %val.0, i32 0
+ %res = insertelement <2 x float> %tmp.0, float %val.1, i32 1
+ store <2 x float> %res, ptr addrspace(1) %out
+ ret void
+}
+
+; There are 8 pseudo registers defined to track LDS DMA dependencies.
+; When exhausted we default to vmcnt(0).
+
+; GCN-LABEL: {{^}}buffer_load_lds_dword_10_arrays:
+; GCN-COUNT-10: buffer_load_dword
+; GCN: s_waitcnt vmcnt(8)
+; GCN: ds_read_b32
+; GCN: s_waitcnt vmcnt(7)
+; GCN: ds_read_b32
+; GCN: s_waitcnt vmcnt(6)
+; GCN: ds_read_b32
+; GCN: s_waitcnt vmcnt(5)
+; GCN: ds_read_b32
+; GCN: s_waitcnt vmcnt(4)
+; GCN: ds_read_b32
+; GCN: s_waitcnt vmcnt(3)
+; GCN: ds_read_b32
+; GCN: s_waitcnt vmcnt(2)
+; GCN-NOT: s_waitcnt vmcnt
+; GCN: ds_read_b32
+; GCN: s_waitcnt vmcnt(0)
+; GCN: ds_read_b32
+define amdgpu_kernel void @buffer_load_lds_dword_10_arrays(<4 x i32> %rsrc, i32 %i1, i32 %i2, i32 %i3, i32 %i4, i32 %i5, i32 %i6, i32 %i7, i32 %i8, i32 %i9, ptr addrspace(1) %out) {
+main_body:
+ call void @llvm.amdgcn.raw.buffer.load.lds(<4 x i32> %rsrc, ptr addrspace(3) @lds.0, i32 4, i32 0, i32 0, i32 0, i32 0)
+ call void @llvm.amdgcn.raw.buffer.load.lds(<4 x i32> %rsrc, ptr addrspace(3) @lds.1, i32 4, i32 0, i32 0, i32 0, i32 0)
+ call void @llvm.amdgcn.raw.buffer.load.lds(<4 x i32> %rsrc, ptr addrspace(3) @lds.2, i32 4, i32 0, i32 0, i32 0, i32 0)
+ call void @llvm.amdgcn.raw.buffer.load.lds(<4 x i32> %rsrc, ptr addrspace(3) @lds.3, i32 4, i32 0, i32 0, i32 0, i32 0)
+ call void @llvm.amdgcn.raw.buffer.load.lds(<4 x i32> %rsrc, ptr addrspace(3) @lds.4, i32 4, i32 0, i32 0, i32 0, i32 0)
+ call void @llvm.amdgcn.raw.buffer.load.lds(<4 x i32> %rsrc, ptr addrspace(3) @lds.5, i32 4, i32 0, i32 0, i32 0, i32 0)
+ call void @llvm.amdgcn.raw.buffer.load.lds(<4 x i32> %rsrc, ptr addrspace(3) @lds.6, i32 4, i32 0, i32 0, i32 0, i32 0)
+ call void @llvm.amdgcn.raw.buffer.load.lds(<4 x i32> %rsrc, ptr addrspace(3) @lds.7, i32 4, i32 0, i32 0, i32 0, i32 0)
+ call void @llvm.amdgcn.raw.buffer.load.lds(<4 x i32> %rsrc, ptr addrspace(3) @lds.8, i32 4, i32 0, i32 0, i32 0, i32 0)
+ call void @llvm.amdgcn.raw.buffer.load.lds(<4 x i32> %rsrc, ptr addrspace(3) @lds.9, i32 4, i32 0, i32 0, i32 0, i32 0)
+ %gep.0 = getelementptr float, ptr addrspace(3) @lds.0, i32 %i1
+ %gep.1 = getelementptr float, ptr addrspace(3) @lds.1, i32 %i2
+ %gep.2 = getelementptr float, ptr addrspace(3) @lds.2, i32 %i2
+ %gep.3 = getelementptr float, ptr addrspace(3) @lds.3, i32 %i2
+ %gep.4 = getelementptr float, ptr addrspace(3) @lds.4, i32 %i2
+ %gep.5 = getelementptr float, ptr addrspace(3) @lds.5, i32 %i2
+ %gep.6 = getelementptr float, ptr addrspace(3) @lds.6, i32 %i2
+ %gep.7 = getelementptr float, ptr addrspace(3) @lds.7, i32 %i2
+ %gep.8 = getelementptr float, ptr addrspace(3) @lds.8, i32 %i2
+ %gep.9 = getelementptr float, ptr addrspace(3) @lds.9, i32 %i2
+ %val.0 = load float, ptr addrspace(3) %gep.0, align 4
+ call void @llvm.amdgcn.wave.barrier()
+ %val.1 = load float, ptr addrspace(3) %gep.1, align 4
+ call void @llvm.amdgcn.wave.barrier()
+ %val.2 = load float, ptr addrspace(3) %gep.2, align 4
+ call void @llvm.amdgcn.wave.barrier()
+ %val.3 = load float, ptr addrspace(3) %gep.3, align 4
+ call void @llvm.amdgcn.wave.barrier()
+ %val.4 = load float, ptr addrspace(3) %gep.4, align 4
+ call void @llvm.amdgcn.wave.barrier()
+ %val.5 = load float, ptr addrspace(3) %gep.5, align 4
+ call void @llvm.amdgcn.wave.barrier()
+ %val.6 = load float, ptr addrspace(3) %gep.6, align 4
+ call void @llvm.amdgcn.wave.barrier()
+ %val.7 = load float, ptr addrspace(3) %gep.7, align 4
+ call void @llvm.amdgcn.wave.barrier()
+ %val.8 = load float, ptr addrspace(3) %gep.8, align 4
+ call void @llvm.amdgcn.wave.barrier()
+ %val.9 = load float, ptr addrspace(3) %gep.9, align 4
+ %out.gep.1 = getelementptr float, ptr addrspace(1) %out, i32 1
+ %out.gep.2 = getelementptr float, ptr addrspace(1) %out, i32 2
+ %out.gep.3 = getelementptr float, ptr addrspace(1) %out, i32 3
+ %out.gep.4 = getelementptr float, ptr addrspace(1) %out, i32 4
+ %out.gep.5 = getelementptr float, ptr addrspace(1) %out, i32 5
+ %out.gep.6 = getelementptr float, ptr addrspace(1) %out, i32 6
+ %out.gep.7 = getelementptr float, ptr addrspace(1) %out, i32 7
+ %out.gep.8 = getelementptr float, ptr addrspace(1) %out, i32 8
+ %out.gep.9 = getelementptr float, ptr addrspace(1) %out, i32 9
+ store float %val.0, ptr addrspace(1) %out
+ store float %val.1, ptr addrspace(1) %out.gep.1
+ store float %val.2, ptr addrspace(1) %out.gep.2
+ store float %val.3, ptr addrspace(1) %out.gep.3
+ store float %val.4, ptr addrspace(1) %out.gep.4
+ store float %val.5, ptr addrspace(1) %out.gep.5
+ store float %val.6, ptr addrspace(1) %out.gep.6
+ store float %val.7, ptr addrspace(1) %out.gep.7
+ store float %val.8, ptr addrspace(1) %out.gep.8
+ store float %val.9, ptr addrspace(1) %out.gep.9
+ ret void
+}
+
+declare void @llvm.amdgcn.wave.barrier()
diff --git a/llvm/test/CodeGen/AMDGPU/llc-pipeline.ll b/llvm/test/CodeGen/AMDGPU/llc-pipeline.ll
index 7abb789019f1f..9787b8b6c6fbe 100644
--- a/llvm/test/CodeGen/AMDGPU/llc-pipeline.ll
+++ b/llvm/test/CodeGen/AMDGPU/llc-pipeline.ll
@@ -130,6 +130,8 @@
; GCN-O0-NEXT: MachineDominator Tree Construction
; GCN-O0-NEXT: Machine Natural Loop Construction
; GCN-O0-NEXT: MachinePostDominator Tree Construction
+; GCN-O0-NEXT: Basic Alias Analysis (stateless AA impl)
+; GCN-O0-NEXT: Function Alias Analysis Results
; GCN-O0-NEXT: SI insert wait instructions
; GCN-O0-NEXT: Insert...
[truncated]
|
✅ With the latest revision this PR passed the C/C++ code formatter. |
return false; | ||
}); | ||
if (I != LDSDMAStores.end()) { | ||
Slot = I - LDSDMAStores.begin() + 1; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't understand this iterator logic
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I just need an index of the instruction in the list. One is added because the first pseudo register is reserved for general case when we cannot find a memory location.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have simplified it, just replaced with a regular loop.
Changed getVmemWaitEventType() to use mayWriteLDSThroughDMA instead of isLDSDMA as this is more sound and added a comment.
break; | ||
for (unsigned I = 0, E = LDSDMAStores.size(); I != E && !Slot; ++I) { | ||
for (const auto *MemOp : LDSDMAStores[I]->memoperands()) { | ||
if (MemOp->isStore() && AAI == MemOp->getAAInfo()) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure I understand the equality check here; you perform an actual mayAlias query later? What is the point of this filter? Don't you need to consider any possibly aliasing write as an event?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Alias check is done later at the line 1229, and this is also a place where any aliasing access gets actual waits, read or write. Anyway, that part did not change here. This search is just to find if we already have a slot allocated for this memory location. In reality we do not need instruction, we just need LDS memory location, but since mayAlias is an interface which needs a MachineInstr, I am searching and keeping instructions in the list. All we really need from this instruction is AA tags.
if (!MemOp->isStore() || | ||
MemOp->getAddrSpace() != AMDGPUAS::LOCAL_ADDRESS) | ||
continue; | ||
// Comparing just AA info does not guarantee memoperands are equal |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If you want exact equality of location, you would use the Value/PseudoSourceValue?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
MI::mayAlias is extremely conservative and does not work with PsedoSourceValue, it just ignores AA tags in this case. The Value is also unreliable because it is really a GEP, always a different GEP.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Speaking of Value, this is how a real program's MIR look like. These are all loads and stores of the same arrays. You may see that Value there is not helpful, only alias scope is (look at the second memory operand, LDS store):
BUFFER_LOAD_DWORD_LDS_OFFEN %51:vgpr_32, %48:sgpr_128, %52:sreg_32, 0, 0, 0, implicit $exec, implicit $m0 :: (dereferenceable load (s32) from `ptr addrspace(1) poison`, align 1, !alias.scope !7, !noalias !10, addrspace 1), (dereferenceable store (s32) into %ir.10, align 1, !alias.scope !7, !noalias !10, addrspace 3)
BUFFER_LOAD_DWORD_LDS_OFFEN killed %54:vgpr_32, %48:sgpr_128, %52:sreg_32, 0, 0, 0, implicit $exec, implicit $m0 :: (dereferenceable load (s32) from `ptr addrspace(1) poison`, align 1, !alias.scope !7, !noalias !10, addrspace 1), (dereferenceable store (s32) into %ir.12, align 1, !alias.scope !7, !noalias !10, addrspace 3)
BUFFER_LOAD_DWORD_LDS_OFFEN killed %56:vgpr_32, %48:sgpr_128, %52:sreg_32, 0, 0, 0, implicit $exec, implicit $m0 :: (dereferenceable load (s32) from `ptr addrspace(1) poison`, align 1, !alias.scope !7, !noalias !10, addrspace 1), (dereferenceable store (s32) into %ir.14, align 1, !alias.scope !7, !noalias !10, addrspace 3)
BUFFER_LOAD_DWORD_LDS_OFFEN killed %58:vgpr_32, %48:sgpr_128, %52:sreg_32, 0, 0, 0, implicit $exec, implicit $m0 :: (dereferenceable load (s32) from `ptr addrspace(1) poison`, align 1, !alias.scope !7, !noalias !10, addrspace 1), (dereferenceable store (s32) into %ir.16, align 1, !alias.scope !7, !noalias !10, addrspace 3)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
PseudoSourceValue::mayAlias is supposed to report aliasing to possible IR values. It looks like it's layered weirdly, and expects you to go through MachineInstr::mayAlias. MachineInstr::mayAlias ought to be using the AA tags, it shouldn't be a fundamental limitation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It looks to me like it does use it if you pass UseTBAA=true. Not sure why this would be a parameter in the first place
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
PseudoSourceValue::mayAlias is supposed to report aliasing to possible IR values. It looks like it's layered weirdly, and expects you to go through MachineInstr::mayAlias. MachineInstr::mayAlias ought to be using the AA tags, it shouldn't be a fundamental limitation
This is all PSV::mayAlias() does:
bool PseudoSourceValue::mayAlias(const MachineFrameInfo *) const {
return !(isGOT() || isConstantPool() || isJumpTable());
}
No very useful. Then even to get to the AA tags check MI:mayAlias() shall go through all IR values' checks first.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It looks to me like it does use it if you pass UseTBAA=true. Not sure why this would be a parameter in the first place
I am passing it, but to get to that check it shall first go through all Value and offset checks. Using AA is the last thing it does: https://llvm.org/doxygen/MachineInstr_8cpp_source.html#l01285
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The values don't need to be identical, that's the point of the AA query. BasicAA will parse through the offsets
I also think that values don't need to be identical. But that is what MI:mayAlias() does before it checks AA: https://llvm.org/doxygen/MachineInstr_8cpp_source.html#l01285
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
But there's no PseudoSourceValue in this example, it should be a straightforward Value-to-Value comparison
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Right, there is no PSV. I have mentioned PSV because you have earlier suggested to use it. For the real IR value: it is not helpful to compare it. The IR value is a GEP, and this GEP is always different. I.e. these values never compare equal. The rest of the IR is already gone and unavailable for the analysis. Even if it would be available this GEP will address kernel module LDS variable, a single huge LDS array, and will be useless again. In this case it will tell you any LDS operation aliases any other. Now during the module LDS lowering I am creating alias scope info specifically to disambiguate aliasing after the pass has squashed all LDS variables.
Ping |
To make it easier I am splitting the patch. I have pre-comitted the test, and there is a part which fixes lack of wait on GFX10 : #75245 |
Another part is improving memoperand info: #75247. This is NFCI just by itself. |
Yet another part to fix disjoint memory checks with LDS DMA: #75249 |
All split off parts were merged and this patch is merged with main. Only waitcount insertion pass changes remained here. |
How does this work in a case like this?
i.e.
|
It does not know the pointer, so it uses default slot 0 and waits till 0. I have to tell anyone interested here: before I even wrote this code it didn't know of the dependency and did not wait for anything at all. Everyone was happy. |
Test case:
Generates:
The
I am still happy, because buffer/flat/global-load-to-lds was removed in GFX11. |
This is still correct, pointer argument cannot alias module global. A pointer argument to a kernel is an LDS external requested by the host side, and host cannot see module LDS. |
I.e. that is really the point of the patch: if we are able to definitively identify an LDS object targeted by both load and store we only wait on that store or stores. And the only way to definitively identify the object at this stage is via alias.scope info which we are generating ourselves during module LDS lowering. |
I have added a check for the presence of alias scope info just in case we get a rogue AA. The testcase with a pointer argument still produces correct code with vmcnt(1). |
Actually since I am only using alias scope I can avoid all alias analysis altogether and only compare alias scope. This does not need an analysis pass, calls to mayAlias, and in general simpler code. You can see an alternative PR if you like it more: #75974 |
Ping |
1 similar comment
Ping |
if (!MemOp->isStore() || | ||
MemOp->getAddrSpace() != AMDGPUAS::LOCAL_ADDRESS) | ||
continue; | ||
// Comparing just AA info does not guarantee memoperands are equal |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
But there's no PseudoSourceValue in this example, it should be a straightforward Value-to-Value comparison
; GCN-O0-NEXT: Basic Alias Analysis (stateless AA impl) | ||
; GCN-O0-NEXT: Function Alias Analysis Results |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you skip querying for AA at -O0?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If I just skip getAnalysis call it does not help since analysis is requested in the getAnalysisUsage. If I do not request it it is not available at any optlevel. This is the benefit of the alternative #75974, it does not request the full analysis.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You can check the opt level in getAnalysisUsage too, from the TargetMachine
ScoreBrackets.determineWait(VM_CNT, RegNo, Wait); | ||
unsigned RegNo = SQ_MAX_PGM_VGPRS + EXTRA_VGPR_LDS; | ||
bool FoundAliasingStore = false; | ||
if (Ptr && Memop->getAAInfo() && Memop->getAAInfo().Scope) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I still don't understand the usage of scope; scope isn't special, isn't common and I do at all like specially treating it. I think you should just let the AA query figure out what to do with it
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have reserved just 8 pseudo registers to track it. I do not want to fill it with unrelated stuff. I know that the only way AA will be able to handle this very specific situation is if there is scope info, otherwise there is no reason to waste a slot and compile time. If I do not enter this 'if' the pass will just do conservatively correct thing and wait for this memory regardless of aliasing or lack of it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have added more comments to explain this. The place which fills the LDS DMA slot bails if there is no scope info not to waste limited tracking slots. In that case a generic first slot is still used for such operation (it is always used, regardless if we can or cannot be more specific about the underlying object). Here AA will be unable to disambiguate aliasing if there is no scope info, so this condition is simply a shortcut to avoid an expensive loop and AA query. I can remove this part of the condition here and nothing will change except it will work slower. Note that not entering this 'if' statement will always produce a conservatively correct wait using first generic tracking slot, which always gets a score regardless of our ability to track a specific object. The condition is around the relaxation code to avoid a generic and conservative 'wait for everything' part below.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
lgtm, but can still fix the -O0 thing
; GCN-O0-NEXT: Basic Alias Analysis (stateless AA impl) | ||
; GCN-O0-NEXT: Function Alias Analysis Results |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You can check the opt level in getAnalysisUsage too, from the TargetMachine
But where do I get TM in the getAnalysisUsage? |
Found addUsedIfAvailable() which does the trick. |
MF.getTarget() (or maybe a pass parameter is necessary?) |
There is no MF there of course. |
LDA DMA loads increase VMCNT and a load from the LDS stored must wait on this counter to only read memory after it is written. Wait count insertion pass does not track memory dependencies, it tracks register dependencies. To model the LDS dependency a pseudo register is used in the scoreboard, acting like if LDS DMA writes it and LDS load reads it. This patch adds 8 more pseudo registers to use for independent LDS locations if we can prove they are disjoint using alias analysis. Fixes: SWDEV-433427
LDA DMA loads increase VMCNT and a load from the LDS stored must wait on this counter to only read memory after it is written. Wait count insertion pass does not track memory dependencies, it tracks register dependencies. To model the LDS dependency a pseudo register is used in the scoreboard, acting like if LDS DMA writes it and LDS load reads it. This patch adds 8 more pseudo registers to use for independent LDS locations if we can prove they are disjoint using alias analysis. Fixes: SWDEV-433427 Change-Id: I21e5931c3db0676a8778489b07f490976c6ef45e
LDA DMA loads increase VMCNT and a load from the LDS stored must wait on this counter to only read memory after it is written. Wait count insertion pass does not track memory dependencies, it tracks register dependencies. To model the LDS dependency a pseudo register is used in the scoreboard, acting like if LDS DMA writes it and LDS load reads it.
This patch adds 8 more pseudo registers to use for independent LDS locations if we can prove they are disjoint using alias analysis.
Fixes: SWDEV-433427