
[AMDGPU] Use alias info to relax waitcounts for LDS DMA #74537


Merged
21 commits merged into llvm:main from the lds-dma-wait branch on Jan 18, 2024

Conversation

@rampitec (Collaborator) commented Dec 5, 2023

LDS DMA loads increment VMCNT, and a load from the LDS location stored to must wait on this counter so that it reads memory only after it has been written. The waitcount insertion pass does not track memory dependencies, it tracks register dependencies. To model the LDS dependency, a pseudo register is used in the scoreboard, acting as if the LDS DMA writes it and the LDS load reads it.

This patch adds 8 more pseudo registers, used for independent LDS locations when alias analysis can prove they are disjoint.

Fixes: SWDEV-433427
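To make the slot scheme concrete, here is a minimal, runnable sketch of the numbering described above. The enum values mirror the patch; the helpers and the stand-ins for the pass internals are illustrative assumptions, not the actual pass code:

```cpp
#include <cstdio>

// Values mirrored from SIInsertWaitcnts.cpp in this patch.
enum { SQ_MAX_PGM_VGPRS = 512, NUM_EXTRA_VGPRS = 9, EXTRA_VGPR_LDS = 0 };

static unsigned CurrScore = 1;                      // stand-in for pass state
static void setRegScore(unsigned Reg, unsigned S) { // stand-in for the pass
  std::printf("pseudo reg %u <- score %u\n", Reg, S);
}

// Slot 0 tracks "any LDS location"; slots 1..8 track DMA stores whose
// LDS target is known to be distinct via alias information.
static unsigned ldsTrackingReg(unsigned Slot) {
  return SQ_MAX_PGM_VGPRS + EXTRA_VGPR_LDS + Slot;
}

static void scoreLDSDMAStore(unsigned Slot) {
  setRegScore(ldsTrackingReg(Slot), CurrScore);
  // The generic slot is always scored too, so a later load with an
  // unknown location still waits on every outstanding LDS DMA.
  if (Slot)
    setRegScore(ldsTrackingReg(0), CurrScore);
}

int main() { scoreLDSDMAStore(3); } // a store tracked in slot 3
```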

@llvmbot (Member) commented Dec 5, 2023

@llvm/pr-subscribers-backend-amdgpu

Author: Stanislav Mekhanoshin (rampitec)

Changes

LDS DMA loads increment VMCNT, and a load from the LDS location stored to must wait on this counter so that it reads memory only after it has been written. The waitcount insertion pass does not track memory dependencies, it tracks register dependencies. To model the LDS dependency, a pseudo register is used in the scoreboard, acting as if the LDS DMA writes it and the LDS load reads it.

This patch adds 8 more pseudo registers to use for independent LDS locations if we can prove they are disjoint using alias analysis.

Fixes: SWDEV-433427


Patch is 20.08 KiB, truncated to 20.00 KiB below, full version: https://github.com/llvm/llvm-project/pull/74537.diff

6 Files Affected:

  • (modified) llvm/lib/Target/AMDGPU/SIISelLowering.cpp (+10-6)
  • (modified) llvm/lib/Target/AMDGPU/SIInsertWaitcnts.cpp (+65-8)
  • (modified) llvm/lib/Target/AMDGPU/SIInstrInfo.cpp (+2-2)
  • (modified) llvm/lib/Target/AMDGPU/SIInstrInfo.h (+8)
  • (added) llvm/lib/Target/AMDGPU/lds-dma-waits.ll (+154)
  • (modified) llvm/test/CodeGen/AMDGPU/llc-pipeline.ll (+2)
diff --git a/llvm/lib/Target/AMDGPU/SIISelLowering.cpp b/llvm/lib/Target/AMDGPU/SIISelLowering.cpp
index a7f4d63229b7e..2e079404b087f 100644
--- a/llvm/lib/Target/AMDGPU/SIISelLowering.cpp
+++ b/llvm/lib/Target/AMDGPU/SIISelLowering.cpp
@@ -1128,11 +1128,10 @@ bool SITargetLowering::getTgtMemIntrinsic(IntrinsicInfo &Info,
                     MachineMemOperand::MOStore |
                     MachineMemOperand::MODereferenceable;
 
-      // XXX - Should this be volatile without known ordering?
-      Info.flags |= MachineMemOperand::MOVolatile;
-
       switch (IntrID) {
       default:
+        // XXX - Should this be volatile without known ordering?
+        Info.flags |= MachineMemOperand::MOVolatile;
         break;
       case Intrinsic::amdgcn_raw_buffer_load_lds:
       case Intrinsic::amdgcn_raw_ptr_buffer_load_lds:
@@ -1140,6 +1139,7 @@ bool SITargetLowering::getTgtMemIntrinsic(IntrinsicInfo &Info,
       case Intrinsic::amdgcn_struct_ptr_buffer_load_lds: {
         unsigned Width = cast<ConstantInt>(CI.getArgOperand(2))->getZExtValue();
         Info.memVT = EVT::getIntegerVT(CI.getContext(), Width * 8);
+        Info.ptrVal = CI.getArgOperand(1);
         return true;
       }
       }
@@ -1268,8 +1268,8 @@ bool SITargetLowering::getTgtMemIntrinsic(IntrinsicInfo &Info,
     Info.opc = ISD::INTRINSIC_VOID;
     unsigned Width = cast<ConstantInt>(CI.getArgOperand(2))->getZExtValue();
     Info.memVT = EVT::getIntegerVT(CI.getContext(), Width * 8);
-    Info.flags |= MachineMemOperand::MOLoad | MachineMemOperand::MOStore |
-                  MachineMemOperand::MOVolatile;
+    Info.ptrVal = CI.getArgOperand(1);
+    Info.flags |= MachineMemOperand::MOLoad | MachineMemOperand::MOStore;
     return true;
   }
   case Intrinsic::amdgcn_ds_bvh_stack_rtn: {
@@ -9084,7 +9084,9 @@ SDValue SITargetLowering::LowerINTRINSIC_VOID(SDValue Op,
     MachinePointerInfo LoadPtrI = LoadMMO->getPointerInfo();
 
     MachinePointerInfo StorePtrI = LoadPtrI;
-    StorePtrI.V = nullptr;
+    LoadPtrI.V = UndefValue::get(
+        PointerType::get(*DAG.getContext(), AMDGPUAS::GLOBAL_ADDRESS));
+    LoadPtrI.AddrSpace = AMDGPUAS::GLOBAL_ADDRESS;
     StorePtrI.AddrSpace = AMDGPUAS::LOCAL_ADDRESS;
 
     auto F = LoadMMO->getFlags() &
@@ -9162,6 +9164,8 @@ SDValue SITargetLowering::LowerINTRINSIC_VOID(SDValue Op,
     MachinePointerInfo LoadPtrI = LoadMMO->getPointerInfo();
     LoadPtrI.Offset = Op->getConstantOperandVal(5);
     MachinePointerInfo StorePtrI = LoadPtrI;
+    LoadPtrI.V = UndefValue::get(
+        PointerType::get(*DAG.getContext(), AMDGPUAS::GLOBAL_ADDRESS));
     LoadPtrI.AddrSpace = AMDGPUAS::GLOBAL_ADDRESS;
     StorePtrI.AddrSpace = AMDGPUAS::LOCAL_ADDRESS;
     auto F = LoadMMO->getFlags() &
diff --git a/llvm/lib/Target/AMDGPU/SIInsertWaitcnts.cpp b/llvm/lib/Target/AMDGPU/SIInsertWaitcnts.cpp
index ede4841b8a5fd..50ad22130e939 100644
--- a/llvm/lib/Target/AMDGPU/SIInsertWaitcnts.cpp
+++ b/llvm/lib/Target/AMDGPU/SIInsertWaitcnts.cpp
@@ -31,6 +31,7 @@
 #include "llvm/ADT/MapVector.h"
 #include "llvm/ADT/PostOrderIterator.h"
 #include "llvm/ADT/Sequence.h"
+#include "llvm/Analysis/AliasAnalysis.h"
 #include "llvm/CodeGen/MachineLoopInfo.h"
 #include "llvm/CodeGen/MachinePostDominators.h"
 #include "llvm/InitializePasses.h"
@@ -121,8 +122,13 @@ enum RegisterMapping {
   SQ_MAX_PGM_VGPRS = 512, // Maximum programmable VGPRs across all targets.
   AGPR_OFFSET = 256,      // Maximum programmable ArchVGPRs across all targets.
   SQ_MAX_PGM_SGPRS = 256, // Maximum programmable SGPRs across all targets.
-  NUM_EXTRA_VGPRS = 1,    // A reserved slot for DS.
-  EXTRA_VGPR_LDS = 0,     // An artificial register to track LDS writes.
+  NUM_EXTRA_VGPRS = 9,    // Reserved slots for DS.
+  // Artificial register slots to track LDS writes into specific LDS locations
+  // if a location is known. When slots are exhausted or a location is
+  // unknown, use the first slot. The first slot is also always updated in
+  // addition to the known location's slot to properly generate waits if the
+  // dependent instruction's location is unknown.
+  EXTRA_VGPR_LDS = 0,
   NUM_ALL_VGPRS = SQ_MAX_PGM_VGPRS + NUM_EXTRA_VGPRS, // Where SGPR starts.
 };
 
@@ -292,6 +298,10 @@ class WaitcntBrackets {
     VgprVmemTypes[GprNo] = 0;
   }
 
+  const SmallVectorImpl<const MachineInstr *>& getLDSDMAStores() const {
+    return LDSDMAStores;
+  }
+
   void print(raw_ostream &);
   void dump() { print(dbgs()); }
 
@@ -354,6 +364,9 @@ class WaitcntBrackets {
   // Bitmask of the VmemTypes of VMEM instructions that might have a pending
   // write to each vgpr.
   unsigned char VgprVmemTypes[NUM_ALL_VGPRS] = {0};
+  // Store representative LDS DMA operations. The only useful info here is
+  // alias info. One store is kept per unique AAInfo.
+  SmallVector<const MachineInstr *, NUM_EXTRA_VGPRS - 1> LDSDMAStores;
 };
 
 class SIInsertWaitcnts : public MachineFunctionPass {
@@ -369,6 +382,7 @@ class SIInsertWaitcnts : public MachineFunctionPass {
   DenseMap<MachineBasicBlock *, bool> PreheadersToFlush;
   MachineLoopInfo *MLI;
   MachinePostDominatorTree *PDT;
+  AliasAnalysis *AA = nullptr;
 
   struct BlockInfo {
     std::unique_ptr<WaitcntBrackets> Incoming;
@@ -411,6 +425,8 @@ class SIInsertWaitcnts : public MachineFunctionPass {
     AU.setPreservesCFG();
     AU.addRequired<MachineLoopInfo>();
     AU.addRequired<MachinePostDominatorTree>();
+    AU.addRequired<AAResultsWrapperPass>();
+    AU.addPreserved<AAResultsWrapperPass>();
     MachineFunctionPass::getAnalysisUsage(AU);
   }
 
@@ -452,7 +468,7 @@ class SIInsertWaitcnts : public MachineFunctionPass {
   // FLAT instruction.
   WaitEventType getVmemWaitEventType(const MachineInstr &Inst) const {
     assert(SIInstrInfo::isVMEM(Inst) || SIInstrInfo::isFLAT(Inst));
-    if (!ST->hasVscnt())
+    if (!ST->hasVscnt() || SIInstrInfo::isLDSDMA(Inst))
       return VMEM_ACCESS;
     if (Inst.mayStore() && !SIInstrInfo::isAtomicRet(Inst)) {
       // FLAT and SCRATCH instructions may access scratch. Other VMEM
@@ -547,8 +563,7 @@ void WaitcntBrackets::setExpScore(const MachineInstr *MI,
 // MUBUF and FLAT LDS DMA operations need a wait on vmcnt before LDS written
 // can be accessed. A load from LDS to VMEM does not need a wait.
 static bool mayWriteLDSThroughDMA(const MachineInstr &MI) {
-  return SIInstrInfo::isVALU(MI) &&
-         (SIInstrInfo::isMUBUF(MI) || SIInstrInfo::isFLAT(MI)) &&
+  return SIInstrInfo::isLDSDMA(MI) &&
          MI.getOpcode() != AMDGPU::BUFFER_STORE_LDS_DWORD;
 }
 
@@ -704,7 +719,36 @@ void WaitcntBrackets::updateByEvent(const SIInstrInfo *TII,
       }
     }
     if (Inst.mayStore() && (TII->isDS(Inst) || mayWriteLDSThroughDMA(Inst))) {
-      setRegScore(SQ_MAX_PGM_VGPRS + EXTRA_VGPR_LDS, T, CurrScore);
+      unsigned Slot = 0;
+      for (const auto *MemOp : Inst.memoperands()) {
+        if (MemOp->isStore() &&
+            MemOp->getAddrSpace() == AMDGPUAS::LOCAL_ADDRESS) {
+          // Comparing just AA info does not guarantee memoperands are equal
+          // in general, but this holds in practice for LDS DMA.
+          auto AAI = MemOp->getAAInfo();
+          if (!AAI)
+            break;
+          auto I = llvm::find_if(LDSDMAStores, [&AAI](const MachineInstr *I) {
+            for (const auto *MemOp : I->memoperands()) {
+              if (MemOp->isStore())
+                return AAI == MemOp->getAAInfo();
+            }
+            return false;
+          });
+          if (I != LDSDMAStores.end()) {
+            Slot = I - LDSDMAStores.begin() + 1;
+            break;
+          }
+          if (LDSDMAStores.size() == NUM_EXTRA_VGPRS - 1)
+            break;
+          LDSDMAStores.push_back(&Inst);
+          Slot = LDSDMAStores.size();
+          break;
+        }
+      }
+      setRegScore(SQ_MAX_PGM_VGPRS + EXTRA_VGPR_LDS + Slot, T, CurrScore);
+      if (Slot)
+        setRegScore(SQ_MAX_PGM_VGPRS + EXTRA_VGPR_LDS, T, CurrScore);
     }
   }
 }
@@ -1180,9 +1224,21 @@ bool SIInsertWaitcnts::generateWaitcntInstBefore(MachineInstr &MI,
         // No need to wait before load from VMEM to LDS.
         if (mayWriteLDSThroughDMA(MI))
           continue;
-        unsigned RegNo = SQ_MAX_PGM_VGPRS + EXTRA_VGPR_LDS;
+
         // VM_CNT is only relevant to vgpr or LDS.
-        ScoreBrackets.determineWait(VM_CNT, RegNo, Wait);
+        unsigned RegNo = SQ_MAX_PGM_VGPRS + EXTRA_VGPR_LDS;
+        bool FoundAliasingStore = false;
+        if (Ptr && Memop->getAAInfo()) {
+          const auto &LDSDMAStores = ScoreBrackets.getLDSDMAStores();
+          for (unsigned I = 0, E = LDSDMAStores.size(); I != E; ++I) {
+            if (MI.mayAlias(AA, *LDSDMAStores[I], true)) {
+              FoundAliasingStore = true;
+              ScoreBrackets.determineWait(VM_CNT, RegNo + I + 1, Wait);
+            }
+          }
+        }
+        if (!FoundAliasingStore)
+          ScoreBrackets.determineWait(VM_CNT, RegNo, Wait);
         if (Memop->isStore()) {
           ScoreBrackets.determineWait(EXP_CNT, RegNo, Wait);
         }
@@ -1818,6 +1874,7 @@ bool SIInsertWaitcnts::runOnMachineFunction(MachineFunction &MF) {
   const SIMachineFunctionInfo *MFI = MF.getInfo<SIMachineFunctionInfo>();
   MLI = &getAnalysis<MachineLoopInfo>();
   PDT = &getAnalysis<MachinePostDominatorTree>();
+  AA = &getAnalysis<AAResultsWrapperPass>().getAAResults();
 
   ForceEmitZeroWaitcnts = ForceEmitZeroFlag;
   for (auto T : inst_counter_types())
diff --git a/llvm/lib/Target/AMDGPU/SIInstrInfo.cpp b/llvm/lib/Target/AMDGPU/SIInstrInfo.cpp
index b5b456d691254..17befd6c0e346 100644
--- a/llvm/lib/Target/AMDGPU/SIInstrInfo.cpp
+++ b/llvm/lib/Target/AMDGPU/SIInstrInfo.cpp
@@ -3654,8 +3654,8 @@ bool SIInstrInfo::areMemAccessesTriviallyDisjoint(const MachineInstr &MIa,
   // underlying address space, even if it was lowered to a different one,
   // e.g. private accesses lowered to use MUBUF instructions on a scratch
   // buffer.
-  if (isDS(MIa)) {
-    if (isDS(MIb))
+  if (isDS(MIa) || isLDSDMA(MIa)) {
+    if (isDS(MIb) || isLDSDMA(MIb))
       return checkInstOffsetsDoNotOverlap(MIa, MIb);
 
     return !isFLAT(MIb) || isSegmentSpecificFLAT(MIb);
diff --git a/llvm/lib/Target/AMDGPU/SIInstrInfo.h b/llvm/lib/Target/AMDGPU/SIInstrInfo.h
index e388b5550cb10..456d8d835986f 100644
--- a/llvm/lib/Target/AMDGPU/SIInstrInfo.h
+++ b/llvm/lib/Target/AMDGPU/SIInstrInfo.h
@@ -547,6 +547,14 @@ class SIInstrInfo final : public AMDGPUGenInstrInfo {
     return get(Opcode).TSFlags & SIInstrFlags::DS;
   }
 
+  static bool isLDSDMA(const MachineInstr &MI) {
+    return isVALU(MI) && (isMUBUF(MI) || isFLAT(MI));
+  }
+
+  bool isLDSDMA(uint16_t Opcode) const {
+    return isVALU(Opcode) && (isMUBUF(Opcode) || isFLAT(Opcode));
+  }
+
   static bool isGWS(const MachineInstr &MI) {
     return MI.getDesc().TSFlags & SIInstrFlags::GWS;
   }
diff --git a/llvm/lib/Target/AMDGPU/lds-dma-waits.ll b/llvm/lib/Target/AMDGPU/lds-dma-waits.ll
new file mode 100644
index 0000000000000..31155290216d0
--- /dev/null
+++ b/llvm/lib/Target/AMDGPU/lds-dma-waits.ll
@@ -0,0 +1,154 @@
+; RUN: llc -march=amdgcn -mcpu=gfx900 -verify-machineinstrs < %s | FileCheck %s --check-prefixes=GCN,GFX9
+; RUN: llc -march=amdgcn -mcpu=gfx1030 -verify-machineinstrs < %s | FileCheck %s --check-prefixes=GCN,GFX10
+
+@lds.0 = internal addrspace(3) global [64 x float] poison, align 16
+@lds.1 = internal addrspace(3) global [64 x float] poison, align 16
+@lds.2 = internal addrspace(3) global [64 x float] poison, align 16
+@lds.3 = internal addrspace(3) global [64 x float] poison, align 16
+@lds.4 = internal addrspace(3) global [64 x float] poison, align 16
+@lds.5 = internal addrspace(3) global [64 x float] poison, align 16
+@lds.6 = internal addrspace(3) global [64 x float] poison, align 16
+@lds.7 = internal addrspace(3) global [64 x float] poison, align 16
+@lds.8 = internal addrspace(3) global [64 x float] poison, align 16
+@lds.9 = internal addrspace(3) global [64 x float] poison, align 16
+
+declare void @llvm.amdgcn.raw.buffer.load.lds(<4 x i32> %rsrc, ptr addrspace(3) nocapture, i32 %size, i32 %voffset, i32 %soffset, i32 %offset, i32 %aux)
+declare void @llvm.amdgcn.global.load.lds(ptr addrspace(1) nocapture %gptr, ptr addrspace(3) nocapture %lptr, i32 %size, i32 %offset, i32 %aux)
+
+; GCN-LABEL: {{^}}buffer_load_lds_dword_2_arrays:
+; GCN-COUNT-4: buffer_load_dword
+; GCN: s_waitcnt vmcnt(2)
+; GCN: ds_read_b32
+; GCN: s_waitcnt vmcnt(0)
+; GCN: ds_read_b32
+define amdgpu_kernel void @buffer_load_lds_dword_2_arrays(<4 x i32> %rsrc, i32 %i1, i32 %i2, ptr addrspace(1) %out) {
+main_body:
+  call void @llvm.amdgcn.raw.buffer.load.lds(<4 x i32> %rsrc, ptr addrspace(3) @lds.0, i32 4, i32 0, i32 0, i32 0, i32 0)
+  call void @llvm.amdgcn.raw.buffer.load.lds(<4 x i32> %rsrc, ptr addrspace(3) @lds.0, i32 4, i32 4, i32 0, i32 0, i32 0)
+  call void @llvm.amdgcn.raw.buffer.load.lds(<4 x i32> %rsrc, ptr addrspace(3) @lds.1, i32 4, i32 8, i32 0, i32 0, i32 0)
+  call void @llvm.amdgcn.raw.buffer.load.lds(<4 x i32> %rsrc, ptr addrspace(3) @lds.1, i32 4, i32 12, i32 0, i32 0, i32 0)
+  %gep.0 = getelementptr float, ptr addrspace(3) @lds.0, i32 %i1
+  %gep.1 = getelementptr float, ptr addrspace(3) @lds.1, i32 %i2
+  %val.0 = load float, ptr addrspace(3) %gep.0, align 4
+  call void @llvm.amdgcn.wave.barrier()
+  %val.1 = load float, ptr addrspace(3) %gep.1, align 4
+  %tmp.0 = insertelement <2 x float> undef, float %val.0, i32 0
+  %res = insertelement <2 x float> %tmp.0, float %val.1, i32 1
+  store <2 x float> %res, ptr addrspace(1) %out
+  ret void
+}
+
+; On gfx9 if there is a pending FLAT operation, and this is a VMem or LGKM
+; waitcnt and the target can report early completion, then we need to force a waitcnt 0.
+
+; GCN-LABEL: {{^}}global_load_lds_dword_2_arrays:
+; GCN-COUNT-4: global_load_dword
+; GFX9: s_waitcnt vmcnt(0)
+; GFX9-COUNT-2: ds_read_b32
+; GFX10: s_waitcnt vmcnt(2)
+; GFX10: ds_read_b32
+; GFX10: s_waitcnt vmcnt(0)
+; GFX10: ds_read_b32
+define amdgpu_kernel void @global_load_lds_dword_2_arrays(ptr addrspace(1) nocapture %gptr, i32 %i1, i32 %i2, ptr addrspace(1) %out) {
+main_body:
+  call void @llvm.amdgcn.global.load.lds(ptr addrspace(1) %gptr, ptr addrspace(3) @lds.0, i32 4, i32 0, i32 0)
+  call void @llvm.amdgcn.global.load.lds(ptr addrspace(1) %gptr, ptr addrspace(3) @lds.0, i32 4, i32 4, i32 0)
+  call void @llvm.amdgcn.global.load.lds(ptr addrspace(1) %gptr, ptr addrspace(3) @lds.1, i32 4, i32 8, i32 0)
+  call void @llvm.amdgcn.global.load.lds(ptr addrspace(1) %gptr, ptr addrspace(3) @lds.1, i32 4, i32 12, i32 0)
+  %gep.0 = getelementptr float, ptr addrspace(3) @lds.0, i32 %i1
+  %gep.1 = getelementptr float, ptr addrspace(3) @lds.1, i32 %i2
+  %val.0 = load float, ptr addrspace(3) %gep.0, align 4
+  call void @llvm.amdgcn.wave.barrier()
+  %val.1 = load float, ptr addrspace(3) %gep.1, align 4
+  %tmp.0 = insertelement <2 x float> undef, float %val.0, i32 0
+  %res = insertelement <2 x float> %tmp.0, float %val.1, i32 1
+  store <2 x float> %res, ptr addrspace(1) %out
+  ret void
+}
+
+; There are 8 pseudo registers defined to track LDS DMA dependencies.
+; When exhausted we default to vmcnt(0).
+
+; GCN-LABEL: {{^}}buffer_load_lds_dword_10_arrays:
+; GCN-COUNT-10: buffer_load_dword
+; GCN: s_waitcnt vmcnt(8)
+; GCN: ds_read_b32
+; GCN: s_waitcnt vmcnt(7)
+; GCN: ds_read_b32
+; GCN: s_waitcnt vmcnt(6)
+; GCN: ds_read_b32
+; GCN: s_waitcnt vmcnt(5)
+; GCN: ds_read_b32
+; GCN: s_waitcnt vmcnt(4)
+; GCN: ds_read_b32
+; GCN: s_waitcnt vmcnt(3)
+; GCN: ds_read_b32
+; GCN: s_waitcnt vmcnt(2)
+; GCN-NOT: s_waitcnt vmcnt
+; GCN: ds_read_b32
+; GCN: s_waitcnt vmcnt(0)
+; GCN: ds_read_b32
+define amdgpu_kernel void @buffer_load_lds_dword_10_arrays(<4 x i32> %rsrc, i32 %i1, i32 %i2, i32 %i3, i32 %i4, i32 %i5, i32 %i6, i32 %i7, i32 %i8, i32 %i9, ptr addrspace(1) %out) {
+main_body:
+  call void @llvm.amdgcn.raw.buffer.load.lds(<4 x i32> %rsrc, ptr addrspace(3) @lds.0, i32 4, i32 0, i32 0, i32 0, i32 0)
+  call void @llvm.amdgcn.raw.buffer.load.lds(<4 x i32> %rsrc, ptr addrspace(3) @lds.1, i32 4, i32 0, i32 0, i32 0, i32 0)
+  call void @llvm.amdgcn.raw.buffer.load.lds(<4 x i32> %rsrc, ptr addrspace(3) @lds.2, i32 4, i32 0, i32 0, i32 0, i32 0)
+  call void @llvm.amdgcn.raw.buffer.load.lds(<4 x i32> %rsrc, ptr addrspace(3) @lds.3, i32 4, i32 0, i32 0, i32 0, i32 0)
+  call void @llvm.amdgcn.raw.buffer.load.lds(<4 x i32> %rsrc, ptr addrspace(3) @lds.4, i32 4, i32 0, i32 0, i32 0, i32 0)
+  call void @llvm.amdgcn.raw.buffer.load.lds(<4 x i32> %rsrc, ptr addrspace(3) @lds.5, i32 4, i32 0, i32 0, i32 0, i32 0)
+  call void @llvm.amdgcn.raw.buffer.load.lds(<4 x i32> %rsrc, ptr addrspace(3) @lds.6, i32 4, i32 0, i32 0, i32 0, i32 0)
+  call void @llvm.amdgcn.raw.buffer.load.lds(<4 x i32> %rsrc, ptr addrspace(3) @lds.7, i32 4, i32 0, i32 0, i32 0, i32 0)
+  call void @llvm.amdgcn.raw.buffer.load.lds(<4 x i32> %rsrc, ptr addrspace(3) @lds.8, i32 4, i32 0, i32 0, i32 0, i32 0)
+  call void @llvm.amdgcn.raw.buffer.load.lds(<4 x i32> %rsrc, ptr addrspace(3) @lds.9, i32 4, i32 0, i32 0, i32 0, i32 0)
+  %gep.0 = getelementptr float, ptr addrspace(3) @lds.0, i32 %i1
+  %gep.1 = getelementptr float, ptr addrspace(3) @lds.1, i32 %i2
+  %gep.2 = getelementptr float, ptr addrspace(3) @lds.2, i32 %i2
+  %gep.3 = getelementptr float, ptr addrspace(3) @lds.3, i32 %i2
+  %gep.4 = getelementptr float, ptr addrspace(3) @lds.4, i32 %i2
+  %gep.5 = getelementptr float, ptr addrspace(3) @lds.5, i32 %i2
+  %gep.6 = getelementptr float, ptr addrspace(3) @lds.6, i32 %i2
+  %gep.7 = getelementptr float, ptr addrspace(3) @lds.7, i32 %i2
+  %gep.8 = getelementptr float, ptr addrspace(3) @lds.8, i32 %i2
+  %gep.9 = getelementptr float, ptr addrspace(3) @lds.9, i32 %i2
+  %val.0 = load float, ptr addrspace(3) %gep.0, align 4
+  call void @llvm.amdgcn.wave.barrier()
+  %val.1 = load float, ptr addrspace(3) %gep.1, align 4
+  call void @llvm.amdgcn.wave.barrier()
+  %val.2 = load float, ptr addrspace(3) %gep.2, align 4
+  call void @llvm.amdgcn.wave.barrier()
+  %val.3 = load float, ptr addrspace(3) %gep.3, align 4
+  call void @llvm.amdgcn.wave.barrier()
+  %val.4 = load float, ptr addrspace(3) %gep.4, align 4
+  call void @llvm.amdgcn.wave.barrier()
+  %val.5 = load float, ptr addrspace(3) %gep.5, align 4
+  call void @llvm.amdgcn.wave.barrier()
+  %val.6 = load float, ptr addrspace(3) %gep.6, align 4
+  call void @llvm.amdgcn.wave.barrier()
+  %val.7 = load float, ptr addrspace(3) %gep.7, align 4
+  call void @llvm.amdgcn.wave.barrier()
+  %val.8 = load float, ptr addrspace(3) %gep.8, align 4
+  call void @llvm.amdgcn.wave.barrier()
+  %val.9 = load float, ptr addrspace(3) %gep.9, align 4
+  %out.gep.1 = getelementptr float, ptr addrspace(1) %out, i32 1
+  %out.gep.2 = getelementptr float, ptr addrspace(1) %out, i32 2
+  %out.gep.3 = getelementptr float, ptr addrspace(1) %out, i32 3
+  %out.gep.4 = getelementptr float, ptr addrspace(1) %out, i32 4
+  %out.gep.5 = getelementptr float, ptr addrspace(1) %out, i32 5
+  %out.gep.6 = getelementptr float, ptr addrspace(1) %out, i32 6
+  %out.gep.7 = getelementptr float, ptr addrspace(1) %out, i32 7
+  %out.gep.8 = getelementptr float, ptr addrspace(1) %out, i32 8
+  %out.gep.9 = getelementptr float, ptr addrspace(1) %out, i32 9
+  store float %val.0, ptr addrspace(1) %out
+  store float %val.1, ptr addrspace(1) %out.gep.1
+  store float %val.2, ptr addrspace(1) %out.gep.2
+  store float %val.3, ptr addrspace(1) %out.gep.3
+  store float %val.4, ptr addrspace(1) %out.gep.4
+  store float %val.5, ptr addrspace(1) %out.gep.5
+  store float %val.6, ptr addrspace(1) %out.gep.6
+  store float %val.7, ptr addrspace(1) %out.gep.7
+  store float %val.8, ptr addrspace(1) %out.gep.8
+  store float %val.9, ptr addrspace(1) %out.gep.9
+  ret void
+}
+
+declare void @llvm.amdgcn.wave.barrier()
diff --git a/llvm/test/CodeGen/AMDGPU/llc-pipeline.ll b/llvm/test/CodeGen/AMDGPU/llc-pipeline.ll
index 7abb789019f1f..9787b8b6c6fbe 100644
--- a/llvm/test/CodeGen/AMDGPU/llc-pipeline.ll
+++ b/llvm/test/CodeGen/AMDGPU/llc-pipeline.ll
@@ -130,6 +130,8 @@
 ; GCN-O0-NEXT:        MachineDominator Tree Construction
 ; GCN-O0-NEXT:        Machine Natural Loop Construction
 ; GCN-O0-NEXT:        MachinePostDominator Tree Construction
+; GCN-O0-NEXT:        Basic Alias Analysis (stateless AA impl)
+; GCN-O0-NEXT:        Function Alias Analysis Results
 ; GCN-O0-NEXT:        SI insert wait instructions
 ; GCN-O0-NEXT:        Insert...
[truncated]


github-actions bot commented Dec 5, 2023

✅ With the latest revision this PR passed the C/C++ code formatter.

return false;
});
if (I != LDSDMAStores.end()) {
Slot = I - LDSDMAStores.begin() + 1;
Contributor

I don't understand this iterator logic

Collaborator Author

I just need the index of the instruction in the list. One is added because the first pseudo register is reserved for the general case, when we cannot find a memory location.

Collaborator Author

I have simplified it and just replaced it with a regular loop.

Changed getVmemWaitEventType() to use mayWriteLDSThroughDMA instead
of isLDSDMA as this is more sound and added a comment.
break;
for (unsigned I = 0, E = LDSDMAStores.size(); I != E && !Slot; ++I) {
for (const auto *MemOp : LDSDMAStores[I]->memoperands()) {
if (MemOp->isStore() && AAI == MemOp->getAAInfo()) {
Contributor

I'm not sure I understand the equality check here; you perform an actual mayAlias query later? What is the point of this filter? Don't you need to consider any possibly aliasing write as an event?

Collaborator Author

The alias check is done later, at line 1229, and that is also the place where any aliasing access gets actual waits, read or write. Anyway, that part did not change here. This search is just to find whether we already have a slot allocated for this memory location. In reality we do not need the instruction, we just need the LDS memory location, but since mayAlias is an interface which needs a MachineInstr, I am searching for and keeping instructions in the list. All we really need from these instructions is their AA tags.
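A small sketch of that find-or-allocate scheme, with plain STL types standing in for the pass's structures (AATag stands in for the AAMDNodes tags; all names here are hypothetical):

```cpp
#include <cstddef>
#include <vector>

struct AATag { // stand-in for the AA metadata carried by a memoperand
  const void *Scope = nullptr;
  bool operator==(const AATag &O) const { return Scope == O.Scope; }
};
struct Store { AATag Tags; }; // stand-in for a representative MachineInstr

// Returns the 1-based slot for this location, allocating one if needed,
// or 0 when the table is full (callers fall back to the generic slot).
unsigned findOrAllocSlot(std::vector<const Store *> &LDSDMAStores,
                         const Store &S, std::size_t MaxSlots) {
  for (std::size_t I = 0, E = LDSDMAStores.size(); I != E; ++I)
    if (LDSDMAStores[I]->Tags == S.Tags)
      return static_cast<unsigned>(I) + 1; // location already has a slot
  if (LDSDMAStores.size() < MaxSlots) {
    LDSDMAStores.push_back(&S);            // one representative per AA tag
    return static_cast<unsigned>(LDSDMAStores.size());
  }
  return 0; // slots exhausted: generic slot only
}
```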

if (!MemOp->isStore() ||
MemOp->getAddrSpace() != AMDGPUAS::LOCAL_ADDRESS)
continue;
// Comparing just AA info does not guarantee memoperands are equal
Contributor

If you want exact equality of location, you would use the Value/PseudoSourceValue?

Collaborator Author

MI::mayAlias is extremely conservative and does not work with PseudoSourceValue; it just ignores AA tags in this case. The Value is also unreliable because it is really a GEP, always a different GEP.

Collaborator Author

Speaking of Value, this is how a real program's MIR looks. These are all loads and stores of the same arrays. You may see that the Value there is not helpful, only the alias scope is (look at the second memory operand, the LDS store):

  BUFFER_LOAD_DWORD_LDS_OFFEN %51:vgpr_32, %48:sgpr_128, %52:sreg_32, 0, 0, 0, implicit $exec, implicit $m0 :: (dereferenceable load (s32) from `ptr addrspace(1) poison`, align 1, !alias.scope !7, !noalias !10, addrspace 1), (dereferenceable store (s32) into %ir.10, align 1, !alias.scope !7, !noalias !10, addrspace 3)
  BUFFER_LOAD_DWORD_LDS_OFFEN killed %54:vgpr_32, %48:sgpr_128, %52:sreg_32, 0, 0, 0, implicit $exec, implicit $m0 :: (dereferenceable load (s32) from `ptr addrspace(1) poison`, align 1, !alias.scope !7, !noalias !10, addrspace 1), (dereferenceable store (s32) into %ir.12, align 1, !alias.scope !7, !noalias !10, addrspace 3)
  BUFFER_LOAD_DWORD_LDS_OFFEN killed %56:vgpr_32, %48:sgpr_128, %52:sreg_32, 0, 0, 0, implicit $exec, implicit $m0 :: (dereferenceable load (s32) from `ptr addrspace(1) poison`, align 1, !alias.scope !7, !noalias !10, addrspace 1), (dereferenceable store (s32) into %ir.14, align 1, !alias.scope !7, !noalias !10, addrspace 3)
  BUFFER_LOAD_DWORD_LDS_OFFEN killed %58:vgpr_32, %48:sgpr_128, %52:sreg_32, 0, 0, 0, implicit $exec, implicit $m0 :: (dereferenceable load (s32) from `ptr addrspace(1) poison`, align 1, !alias.scope !7, !noalias !10, addrspace 1), (dereferenceable store (s32) into %ir.16, align 1, !alias.scope !7, !noalias !10, addrspace 3)

Contributor

PseudoSourceValue::mayAlias is supposed to report aliasing to possible IR values. It looks like it's layered weirdly, and expects you to go through MachineInstr::mayAlias. MachineInstr::mayAlias ought to be using the AA tags, it shouldn't be a fundamental limitation

Contributor

It looks to me like it does use it if you pass UseTBAA=true. Not sure why this would be a parameter in the first place

Collaborator Author

PseudoSourceValue::mayAlias is supposed to report aliasing to possible IR values. It looks like it's layered weirdly, and expects you to go through MachineInstr::mayAlias. MachineInstr::mayAlias ought to be using the AA tags, it shouldn't be a fundamental limitation

This is all PSV::mayAlias() does:

bool PseudoSourceValue::mayAlias(const MachineFrameInfo *) const {
  return !(isGOT() || isConstantPool() || isJumpTable());
}

Not very useful. Then, even to get to the AA-tags check, MI::mayAlias() must first go through all the IR Value checks.

Collaborator Author

It looks to me like it does use it if you pass UseTBAA=true. Not sure why this would be a parameter in the first place

I am passing it, but to get to that check it must first go through all the Value and offset checks. Using AA is the last thing it does: https://llvm.org/doxygen/MachineInstr_8cpp_source.html#l01285

Collaborator Author

The values don't need to be identical, that's the point of the AA query. BasicAA will parse through the offsets

I also think that the values don't need to be identical. But that is what MI::mayAlias() does before it checks AA: https://llvm.org/doxygen/MachineInstr_8cpp_source.html#l01285
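To summarize the ordering being debated, here is a paraphrase of the control flow of MachineInstr::mayAlias from the source linked above, as of this discussion (a reader's summary, not a verbatim copy):

```cpp
// Paraphrased shape of MachineInstr::mayAlias(AA, Other, UseTBAA):
//
// 1. Conservatively return true ("may alias") for calls and for
//    volatile/ordered accesses, and when either instruction does not
//    carry analyzable memoperands.
// 2. For each pair of memoperands, first try to disambiguate via the
//    underlying Value / PseudoSourceValue plus offset and size checks.
// 3. Only as the final step -- and only when UseTBAA is set and both
//    operands carry AA metadata -- issue the AliasAnalysis query that
//    can actually see the !alias.scope / !noalias and TBAA tags.
```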

Contributor

But there's no PseudoSourceValue in this example, it should be a straightforward Value-to-Value comparison

Collaborator Author

Right, there is no PSV. I mentioned PSV because you had earlier suggested using it. As for the real IR value: it is not helpful to compare it. The IR value is a GEP, and this GEP is always different, i.e. these values never compare equal. The rest of the IR is already gone and unavailable for the analysis. Even if it were available, this GEP would address the kernel module LDS variable, a single huge LDS array, and would be useless again: in that case it would tell you that any LDS operation aliases any other. That is why, during module LDS lowering, I am creating alias scope info specifically to disambiguate aliasing after the pass has squashed all LDS variables.
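For illustration, a hand-written IR sketch of the kind of metadata this refers to (the scope and domain names are invented here; the real metadata comes from the module LDS lowering pass): two accesses into the merged LDS object whose GEP Values never compare equal, but whose scope lists keep them distinguishable:

```llvm
define void @sketch(ptr addrspace(3) %gep.a, ptr addrspace(3) %gep.b, float %v) {
  ; Both pointers land in the single merged LDS struct after lowering, so
  ; the Values are useless for disambiguation; only the scopes differ.
  %val = load float, ptr addrspace(3) %gep.a, align 4, !alias.scope !3, !noalias !4
  store float %v, ptr addrspace(3) %gep.b, align 4, !alias.scope !4, !noalias !3
  ret void
}

!0 = distinct !{!0, !"lds-lowering domain"} ; invented domain name
!1 = distinct !{!1, !0, !"lds.a"}           ; scope of the first LDS variable
!2 = distinct !{!2, !0, !"lds.b"}           ; scope of the second LDS variable
!3 = !{!1}                                  ; scope list for lds.a accesses
!4 = !{!2}                                  ; scope list for lds.b accesses
```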

@rampitec (Collaborator Author)

Ping

@rampitec (Collaborator Author)

To make it easier I am splitting the patch. I have pre-committed the test, and there is a part which fixes a missing wait on GFX10: #75245

@rampitec (Collaborator Author)

Another part is improving memoperand info: #75247. This is NFCI just by itself.

@rampitec (Collaborator Author)

Yet another part to fix disjoint memory checks with LDS DMA: #75249

@rampitec (Collaborator Author)

All split-off parts were merged and this patch is merged with main. Only the waitcount insertion pass changes remain here.

@jayfoad (Contributor) commented Dec 19, 2023

How does this work in a case like this?

call void @llvm.amdgcn.raw.buffer.load.lds(<4 x i32> %rsrc, ptr addrspace(3) @lds.3, i32 4, i32 0, i32 0, i32 0, i32 0)
call void @llvm.amdgcn.raw.buffer.load.lds(<4 x i32> %rsrc, ptr addrspace(3) %ptr, i32 4, i32 0, i32 0, i32 0, i32 0)
%val.3 = load float, ptr addrspace(3) @lds.3, align 4

i.e.

  • store to known lds address @lds.3 (this will use slot 0 and another slot e.g. slot 3?)
  • store to unknown lds address (this will use slot 0?)
  • load from known lds address @lds.3 (this will use slot 3?)

@rampitec (Collaborator Author)

> How does this work in a case like this? [full example quoted above]

It does not know the pointer, so it uses the default slot 0 and waits down to vmcnt(0). I have to tell anyone interested here: before I even wrote this code, it did not know of the dependency and did not wait for anything at all. Everyone was happy.

@jayfoad (Contributor) commented Dec 19, 2023

> It does not know the pointer, so it uses the default slot 0 and waits down to vmcnt(0).

Test case:

@lds.0 = internal addrspace(3) global [64 x float] poison, align 16
@lds.1 = internal addrspace(3) global [64 x float] poison, align 16

declare void @llvm.amdgcn.raw.buffer.load.lds(<4 x i32> %rsrc, ptr addrspace(3) nocapture, i32 %size, i32 %voffset, i32 %soffset, i32 %offset, i32 %aux)

define amdgpu_kernel void @f(<4 x i32> %rsrc, i32 %i1, i32 %i2, ptr addrspace(1) %out, ptr addrspace(3) %ptr) {
main_body:
  call void @llvm.amdgcn.raw.buffer.load.lds(<4 x i32> %rsrc, ptr addrspace(3) @lds.0, i32 4, i32 0, i32 0, i32 0, i32 0)
  call void @llvm.amdgcn.raw.buffer.load.lds(<4 x i32> %rsrc, ptr addrspace(3) %ptr, i32 4, i32 0, i32 0, i32 0, i32 0)
  %gep.0 = getelementptr float, ptr addrspace(3) @lds.0, i32 %i1
  %gep.1 = getelementptr float, ptr addrspace(3) @lds.1, i32 %i2
  %val.0 = load volatile float, ptr addrspace(3) %gep.0, align 4
  %val.1 = load volatile float, ptr addrspace(3) %gep.1, align 4
  %out.gep.1 = getelementptr float, ptr addrspace(1) %out, i32 1
  store float %val.0, ptr addrspace(1) %out
  store float %val.1, ptr addrspace(1) %out.gep.1
  ret void
}

Generates:

	s_load_dwordx8 s[4:11], s[0:1], 0x24
	s_load_dword s2, s[0:1], 0x44
	s_mov_b32 m0, 0
	v_mov_b32_e32 v2, 0
	s_waitcnt lgkmcnt(0)
	buffer_load_dword off, s[4:7], 0 lds
	s_mov_b32 m0, s2
	s_lshl_b32 s0, s8, 2
	buffer_load_dword off, s[4:7], 0 lds
	s_lshl_b32 s1, s9, 2
	v_mov_b32_e32 v0, s0
	v_mov_b32_e32 v1, s1
	s_waitcnt vmcnt(1)
	ds_read_b32 v0, v0
	s_waitcnt vmcnt(0)
	ds_read_b32 v1, v1 offset:256
	s_waitcnt lgkmcnt(0)
	global_store_dwordx2 v2, v[0:1], s[10:11]
	s_endpgm

The s_waitcnt vmcnt(1) seems incorrect, because the second buffer-load-to-lds might clobber @lds.0.

> I have to tell anyone interested here: before I even wrote this code, it did not know of the dependency and did not wait for anything at all. Everyone was happy.

I am still happy, because buffer/flat/global-load-to-lds was removed in GFX11.

@rampitec (Collaborator Author)

> The s_waitcnt vmcnt(1) seems incorrect, because the second buffer-load-to-lds might clobber @lds.0.

This is still correct: a pointer argument cannot alias a module global. A pointer argument to a kernel is an external LDS allocation requested by the host side, and the host cannot see module LDS.

@rampitec (Collaborator Author)

That is really the point of the patch: if we are able to definitively identify the LDS object targeted by both the load and the store, we only wait on that store or stores. And the only way to definitively identify the object at this stage is via the alias.scope info which we generate ourselves during module LDS lowering.

@rampitec (Collaborator Author)

I have added a check for the presence of alias scope info just in case we get a rogue AA. The testcase with a pointer argument still produces correct code with vmcnt(1).

@rampitec (Collaborator Author)

Actually, since I am only using the alias scope, I can avoid alias analysis altogether and just compare alias scopes. This needs no analysis pass, no calls to mayAlias, and is in general simpler code. You can see the alternative PR if you like it more: #75974
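A hedged sketch of what "compare only the alias scope" could look like; sameTrackedLDSLocation is a hypothetical helper for illustration, the real alternative lives in #75974:

```cpp
#include "llvm/CodeGen/MachineMemOperand.h"

// Treat two LDS DMA memoperands as the same tracked location exactly
// when they carry the same non-null !alias.scope list; no AA pass and
// no mayAlias call are needed for this.
static bool sameTrackedLDSLocation(const llvm::MachineMemOperand *A,
                                   const llvm::MachineMemOperand *B) {
  const llvm::MDNode *SA = A->getAAInfo().Scope;
  const llvm::MDNode *SB = B->getAAInfo().Scope;
  return SA && SA == SB; // identical scope list => same LDS object
}
```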

@rampitec (Collaborator Author) commented Jan 2, 2024

Ping

@rampitec (Collaborator Author)

Ping


Comment on lines 133 to 134
; GCN-O0-NEXT: Basic Alias Analysis (stateless AA impl)
; GCN-O0-NEXT: Function Alias Analysis Results
Contributor

Can you skip querying for AA at -O0?

Collaborator Author

If I just skip the getAnalysis call it does not help, since the analysis is requested in getAnalysisUsage. If I do not request it, it is not available at any opt level. This is the benefit of the alternative #75974: it does not request the full analysis.

Contributor

You can check the opt level in getAnalysisUsage too, from the TargetMachine

ScoreBrackets.determineWait(VM_CNT, RegNo, Wait);
unsigned RegNo = SQ_MAX_PGM_VGPRS + EXTRA_VGPR_LDS;
bool FoundAliasingStore = false;
if (Ptr && Memop->getAAInfo() && Memop->getAAInfo().Scope) {
Contributor

I still don't understand the usage of scope; scope isn't special, isn't common, and I do not at all like treating it specially. I think you should just let the AA query figure out what to do with it.

Collaborator Author

I have reserved just 8 pseudo registers to track it, and I do not want to fill them with unrelated stuff. I know that the only way AA will be able to handle this very specific situation is if there is scope info; otherwise there is no reason to waste a slot and compile time. If I do not enter this 'if', the pass will just do the conservatively correct thing and wait for this memory regardless of aliasing or the lack of it.

Collaborator Author

I have added more comments to explain this. The place which fills the LDS DMA slots bails out if there is no scope info, so as not to waste the limited tracking slots. In that case the generic first slot is still used for such an operation (it is always used, regardless of whether we can be more specific about the underlying object). Here AA would be unable to disambiguate aliasing without scope info, so this condition is simply a shortcut to avoid an expensive loop and AA query. I could remove this part of the condition and nothing would change, except it would work slower. Note that not entering this 'if' statement always produces a conservatively correct wait using the first, generic tracking slot, which always gets a score regardless of our ability to track a specific object. The condition guards the relaxation code, avoiding the generic and conservative 'wait for everything' path below.

@arsenm (Contributor) left a comment

lgtm, but can still fix the -O0 thing


@rampitec (Collaborator Author)

lgtm, but can still fix the -O0 thing

But where do I get TM in the getAnalysisUsage?

@rampitec (Collaborator Author)

> But where do I get TM in the getAnalysisUsage?

Found addUsedIfAvailable() which does the trick.
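For reference, a sketch of the pattern that comment points at (an assumed shape, not a quote of the committed change): addUsedIfAvailable makes the AA dependency optional, and the pass then tolerates its absence at -O0:

```cpp
void getAnalysisUsage(AnalysisUsage &AU) const override {
  AU.setPreservesCFG();
  AU.addRequired<MachineLoopInfo>();
  AU.addRequired<MachinePostDominatorTree>();
  // Optional, unlike addRequired: -O0 pipelines that never schedule
  // BasicAA are not forced to run it just for this pass.
  AU.addUsedIfAvailable<AAResultsWrapperPass>();
  MachineFunctionPass::getAnalysisUsage(AU);
}

// In runOnMachineFunction: getAnalysisIfAvailable returns null instead
// of asserting when the analysis was not scheduled.
if (auto *AAR = getAnalysisIfAvailable<AAResultsWrapperPass>())
  AA = &AAR->getAAResults();
```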

@arsenm (Contributor) commented Jan 18, 2024

> But where do I get TM in the getAnalysisUsage?

MF.getTarget() (or maybe a pass parameter is necessary?)

@rampitec (Collaborator Author)

> MF.getTarget() (or maybe a pass parameter is necessary?)

There is no MF there of course.

@rampitec rampitec merged commit 021def6 into llvm:main Jan 18, 2024
@rampitec rampitec deleted the lds-dma-wait branch January 18, 2024 07:44
ampandey-1995 pushed a commit to ampandey-1995/llvm-project that referenced this pull request Jan 19, 2024
searlmc1 pushed a commit to ROCm/llvm-project that referenced this pull request Feb 6, 2024