
[AMDGPU] Use alias info to relax waitcounts for LDS DMA #74537


Merged
21 commits merged into llvm:main from the lds-dma-wait branch on Jan 18, 2024

Conversation

@rampitec (Collaborator) commented Dec 5, 2023

LDS DMA loads increment VMCNT, and a load from the LDS location stored to must wait on this counter so that it reads memory only after it has been written. The waitcount insertion pass does not track memory dependencies, it tracks register dependencies. To model the LDS dependency, a pseudo register is used in the scoreboard, acting as if the LDS DMA writes it and the LDS load reads it.

This patch adds 8 more pseudo registers, used for independent LDS locations when alias analysis can prove they are disjoint.

Fixes: SWDEV-433427
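To make the slot scheme concrete, here is a minimal, runnable sketch of the numbering described above. The enum values mirror the patch; the helpers and the stand-ins for the pass internals are illustrative assumptions, not the actual pass code:

```cpp
#include <cstdio>

// Values mirrored from SIInsertWaitcnts.cpp in this patch.
enum { SQ_MAX_PGM_VGPRS = 512, NUM_EXTRA_VGPRS = 9, EXTRA_VGPR_LDS = 0 };

static unsigned CurrScore = 1;                      // stand-in for pass state
static void setRegScore(unsigned Reg, unsigned S) { // stand-in for the pass
  std::printf("pseudo reg %u <- score %u\n", Reg, S);
}

// Slot 0 tracks "any LDS location"; slots 1..8 track DMA stores whose
// LDS target is known to be distinct via alias information.
static unsigned ldsTrackingReg(unsigned Slot) {
  return SQ_MAX_PGM_VGPRS + EXTRA_VGPR_LDS + Slot;
}

static void scoreLDSDMAStore(unsigned Slot) {
  setRegScore(ldsTrackingReg(Slot), CurrScore);
  // The generic slot is always scored too, so a later load with an
  // unknown location still waits on every outstanding LDS DMA.
  if (Slot)
    setRegScore(ldsTrackingReg(0), CurrScore);
}

int main() { scoreLDSDMAStore(3); } // a store tracked in slot 3
```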

@llvmbot (Member) commented Dec 5, 2023

@llvm/pr-subscribers-backend-amdgpu

Author: Stanislav Mekhanoshin (rampitec)

Changes

LDS DMA loads increment VMCNT, and a load from the LDS location stored to must wait on this counter so that it reads memory only after it has been written. The waitcount insertion pass does not track memory dependencies, it tracks register dependencies. To model the LDS dependency, a pseudo register is used in the scoreboard, acting as if the LDS DMA writes it and the LDS load reads it.

This patch adds 8 more pseudo registers to use for independent LDS locations if we can prove they are disjoint using alias analysis.

Fixes: SWDEV-433427


Patch is 20.08 KiB, truncated to 20.00 KiB below, full version: https://github.com/llvm/llvm-project/pull/74537.diff

6 Files Affected:

  • (modified) llvm/lib/Target/AMDGPU/SIISelLowering.cpp (+10-6)
  • (modified) llvm/lib/Target/AMDGPU/SIInsertWaitcnts.cpp (+65-8)
  • (modified) llvm/lib/Target/AMDGPU/SIInstrInfo.cpp (+2-2)
  • (modified) llvm/lib/Target/AMDGPU/SIInstrInfo.h (+8)
  • (added) llvm/lib/Target/AMDGPU/lds-dma-waits.ll (+154)
  • (modified) llvm/test/CodeGen/AMDGPU/llc-pipeline.ll (+2)
diff --git a/llvm/lib/Target/AMDGPU/SIISelLowering.cpp b/llvm/lib/Target/AMDGPU/SIISelLowering.cpp
index a7f4d63229b7e..2e079404b087f 100644
--- a/llvm/lib/Target/AMDGPU/SIISelLowering.cpp
+++ b/llvm/lib/Target/AMDGPU/SIISelLowering.cpp
@@ -1128,11 +1128,10 @@ bool SITargetLowering::getTgtMemIntrinsic(IntrinsicInfo &Info,
                     MachineMemOperand::MOStore |
                     MachineMemOperand::MODereferenceable;
 
-      // XXX - Should this be volatile without known ordering?
-      Info.flags |= MachineMemOperand::MOVolatile;
-
       switch (IntrID) {
       default:
+        // XXX - Should this be volatile without known ordering?
+        Info.flags |= MachineMemOperand::MOVolatile;
         break;
       case Intrinsic::amdgcn_raw_buffer_load_lds:
       case Intrinsic::amdgcn_raw_ptr_buffer_load_lds:
@@ -1140,6 +1139,7 @@ bool SITargetLowering::getTgtMemIntrinsic(IntrinsicInfo &Info,
       case Intrinsic::amdgcn_struct_ptr_buffer_load_lds: {
         unsigned Width = cast<ConstantInt>(CI.getArgOperand(2))->getZExtValue();
         Info.memVT = EVT::getIntegerVT(CI.getContext(), Width * 8);
+        Info.ptrVal = CI.getArgOperand(1);
         return true;
       }
       }
@@ -1268,8 +1268,8 @@ bool SITargetLowering::getTgtMemIntrinsic(IntrinsicInfo &Info,
     Info.opc = ISD::INTRINSIC_VOID;
     unsigned Width = cast<ConstantInt>(CI.getArgOperand(2))->getZExtValue();
     Info.memVT = EVT::getIntegerVT(CI.getContext(), Width * 8);
-    Info.flags |= MachineMemOperand::MOLoad | MachineMemOperand::MOStore |
-                  MachineMemOperand::MOVolatile;
+    Info.ptrVal = CI.getArgOperand(1);
+    Info.flags |= MachineMemOperand::MOLoad | MachineMemOperand::MOStore;
     return true;
   }
   case Intrinsic::amdgcn_ds_bvh_stack_rtn: {
@@ -9084,7 +9084,9 @@ SDValue SITargetLowering::LowerINTRINSIC_VOID(SDValue Op,
     MachinePointerInfo LoadPtrI = LoadMMO->getPointerInfo();
 
     MachinePointerInfo StorePtrI = LoadPtrI;
-    StorePtrI.V = nullptr;
+    LoadPtrI.V = UndefValue::get(
+        PointerType::get(*DAG.getContext(), AMDGPUAS::GLOBAL_ADDRESS));
+    LoadPtrI.AddrSpace = AMDGPUAS::GLOBAL_ADDRESS;
     StorePtrI.AddrSpace = AMDGPUAS::LOCAL_ADDRESS;
 
     auto F = LoadMMO->getFlags() &
@@ -9162,6 +9164,8 @@ SDValue SITargetLowering::LowerINTRINSIC_VOID(SDValue Op,
     MachinePointerInfo LoadPtrI = LoadMMO->getPointerInfo();
     LoadPtrI.Offset = Op->getConstantOperandVal(5);
     MachinePointerInfo StorePtrI = LoadPtrI;
+    LoadPtrI.V = UndefValue::get(
+        PointerType::get(*DAG.getContext(), AMDGPUAS::GLOBAL_ADDRESS));
     LoadPtrI.AddrSpace = AMDGPUAS::GLOBAL_ADDRESS;
     StorePtrI.AddrSpace = AMDGPUAS::LOCAL_ADDRESS;
     auto F = LoadMMO->getFlags() &
diff --git a/llvm/lib/Target/AMDGPU/SIInsertWaitcnts.cpp b/llvm/lib/Target/AMDGPU/SIInsertWaitcnts.cpp
index ede4841b8a5fd..50ad22130e939 100644
--- a/llvm/lib/Target/AMDGPU/SIInsertWaitcnts.cpp
+++ b/llvm/lib/Target/AMDGPU/SIInsertWaitcnts.cpp
@@ -31,6 +31,7 @@
 #include "llvm/ADT/MapVector.h"
 #include "llvm/ADT/PostOrderIterator.h"
 #include "llvm/ADT/Sequence.h"
+#include "llvm/Analysis/AliasAnalysis.h"
 #include "llvm/CodeGen/MachineLoopInfo.h"
 #include "llvm/CodeGen/MachinePostDominators.h"
 #include "llvm/InitializePasses.h"
@@ -121,8 +122,13 @@ enum RegisterMapping {
   SQ_MAX_PGM_VGPRS = 512, // Maximum programmable VGPRs across all targets.
   AGPR_OFFSET = 256,      // Maximum programmable ArchVGPRs across all targets.
   SQ_MAX_PGM_SGPRS = 256, // Maximum programmable SGPRs across all targets.
-  NUM_EXTRA_VGPRS = 1,    // A reserved slot for DS.
-  EXTRA_VGPR_LDS = 0,     // An artificial register to track LDS writes.
+  NUM_EXTRA_VGPRS = 9,    // Reserved slots for DS.
+  // Artificial register slots to track LDS writes into specific LDS locations
+  // if a location is known. When slots are exhausted or a location is
+  // unknown, use the first slot. The first slot is also always updated in
+  // addition to the known location's slot to properly generate waits if the
+  // dependent instruction's location is unknown.
+  EXTRA_VGPR_LDS = 0,
   NUM_ALL_VGPRS = SQ_MAX_PGM_VGPRS + NUM_EXTRA_VGPRS, // Where SGPR starts.
 };
 
@@ -292,6 +298,10 @@ class WaitcntBrackets {
     VgprVmemTypes[GprNo] = 0;
   }
 
+  const SmallVectorImpl<const MachineInstr *>& getLDSDMAStores() const {
+    return LDSDMAStores;
+  }
+
   void print(raw_ostream &);
   void dump() { print(dbgs()); }
 
@@ -354,6 +364,9 @@ class WaitcntBrackets {
   // Bitmask of the VmemTypes of VMEM instructions that might have a pending
   // write to each vgpr.
   unsigned char VgprVmemTypes[NUM_ALL_VGPRS] = {0};
+  // Store representative LDS DMA operations. The only useful info here is
+  // alias info. One store is kept per unique AAInfo.
+  SmallVector<const MachineInstr *, NUM_EXTRA_VGPRS - 1> LDSDMAStores;
 };
 
 class SIInsertWaitcnts : public MachineFunctionPass {
@@ -369,6 +382,7 @@ class SIInsertWaitcnts : public MachineFunctionPass {
   DenseMap<MachineBasicBlock *, bool> PreheadersToFlush;
   MachineLoopInfo *MLI;
   MachinePostDominatorTree *PDT;
+  AliasAnalysis *AA = nullptr;
 
   struct BlockInfo {
     std::unique_ptr<WaitcntBrackets> Incoming;
@@ -411,6 +425,8 @@ class SIInsertWaitcnts : public MachineFunctionPass {
     AU.setPreservesCFG();
     AU.addRequired<MachineLoopInfo>();
     AU.addRequired<MachinePostDominatorTree>();
+    AU.addRequired<AAResultsWrapperPass>();
+    AU.addPreserved<AAResultsWrapperPass>();
     MachineFunctionPass::getAnalysisUsage(AU);
   }
 
@@ -452,7 +468,7 @@ class SIInsertWaitcnts : public MachineFunctionPass {
   // FLAT instruction.
   WaitEventType getVmemWaitEventType(const MachineInstr &Inst) const {
     assert(SIInstrInfo::isVMEM(Inst) || SIInstrInfo::isFLAT(Inst));
-    if (!ST->hasVscnt())
+    if (!ST->hasVscnt() || SIInstrInfo::isLDSDMA(Inst))
       return VMEM_ACCESS;
     if (Inst.mayStore() && !SIInstrInfo::isAtomicRet(Inst)) {
       // FLAT and SCRATCH instructions may access scratch. Other VMEM
@@ -547,8 +563,7 @@ void WaitcntBrackets::setExpScore(const MachineInstr *MI,
 // MUBUF and FLAT LDS DMA operations need a wait on vmcnt before LDS written
 // can be accessed. A load from LDS to VMEM does not need a wait.
 static bool mayWriteLDSThroughDMA(const MachineInstr &MI) {
-  return SIInstrInfo::isVALU(MI) &&
-         (SIInstrInfo::isMUBUF(MI) || SIInstrInfo::isFLAT(MI)) &&
+  return SIInstrInfo::isLDSDMA(MI) &&
          MI.getOpcode() != AMDGPU::BUFFER_STORE_LDS_DWORD;
 }
 
@@ -704,7 +719,36 @@ void WaitcntBrackets::updateByEvent(const SIInstrInfo *TII,
       }
     }
     if (Inst.mayStore() && (TII->isDS(Inst) || mayWriteLDSThroughDMA(Inst))) {
-      setRegScore(SQ_MAX_PGM_VGPRS + EXTRA_VGPR_LDS, T, CurrScore);
+      unsigned Slot = 0;
+      for (const auto *MemOp : Inst.memoperands()) {
+        if (MemOp->isStore() &&
+            MemOp->getAddrSpace() == AMDGPUAS::LOCAL_ADDRESS) {
+          // Comparing just AA info does not guarantee memoperands are equal
+          // in general, but this holds in practice for LDS DMA.
+          auto AAI = MemOp->getAAInfo();
+          if (!AAI)
+            break;
+          auto I = llvm::find_if(LDSDMAStores, [&AAI](const MachineInstr *I) {
+            for (const auto *MemOp : I->memoperands()) {
+              if (MemOp->isStore())
+                return AAI == MemOp->getAAInfo();
+            }
+            return false;
+          });
+          if (I != LDSDMAStores.end()) {
+            Slot = I - LDSDMAStores.begin() + 1;
+            break;
+          }
+          if (LDSDMAStores.size() == NUM_EXTRA_VGPRS - 1)
+            break;
+          LDSDMAStores.push_back(&Inst);
+          Slot = LDSDMAStores.size();
+          break;
+        }
+      }
+      setRegScore(SQ_MAX_PGM_VGPRS + EXTRA_VGPR_LDS + Slot, T, CurrScore);
+      if (Slot)
+        setRegScore(SQ_MAX_PGM_VGPRS + EXTRA_VGPR_LDS, T, CurrScore);
     }
   }
 }
@@ -1180,9 +1224,21 @@ bool SIInsertWaitcnts::generateWaitcntInstBefore(MachineInstr &MI,
         // No need to wait before load from VMEM to LDS.
         if (mayWriteLDSThroughDMA(MI))
           continue;
-        unsigned RegNo = SQ_MAX_PGM_VGPRS + EXTRA_VGPR_LDS;
+
         // VM_CNT is only relevant to vgpr or LDS.
-        ScoreBrackets.determineWait(VM_CNT, RegNo, Wait);
+        unsigned RegNo = SQ_MAX_PGM_VGPRS + EXTRA_VGPR_LDS;
+        bool FoundAliasingStore = false;
+        if (Ptr && Memop->getAAInfo()) {
+          const auto &LDSDMAStores = ScoreBrackets.getLDSDMAStores();
+          for (unsigned I = 0, E = LDSDMAStores.size(); I != E; ++I) {
+            if (MI.mayAlias(AA, *LDSDMAStores[I], true)) {
+              FoundAliasingStore = true;
+              ScoreBrackets.determineWait(VM_CNT, RegNo + I + 1, Wait);
+            }
+          }
+        }
+        if (!FoundAliasingStore)
+          ScoreBrackets.determineWait(VM_CNT, RegNo, Wait);
         if (Memop->isStore()) {
           ScoreBrackets.determineWait(EXP_CNT, RegNo, Wait);
         }
@@ -1818,6 +1874,7 @@ bool SIInsertWaitcnts::runOnMachineFunction(MachineFunction &MF) {
   const SIMachineFunctionInfo *MFI = MF.getInfo<SIMachineFunctionInfo>();
   MLI = &getAnalysis<MachineLoopInfo>();
   PDT = &getAnalysis<MachinePostDominatorTree>();
+  AA = &getAnalysis<AAResultsWrapperPass>().getAAResults();
 
   ForceEmitZeroWaitcnts = ForceEmitZeroFlag;
   for (auto T : inst_counter_types())
diff --git a/llvm/lib/Target/AMDGPU/SIInstrInfo.cpp b/llvm/lib/Target/AMDGPU/SIInstrInfo.cpp
index b5b456d691254..17befd6c0e346 100644
--- a/llvm/lib/Target/AMDGPU/SIInstrInfo.cpp
+++ b/llvm/lib/Target/AMDGPU/SIInstrInfo.cpp
@@ -3654,8 +3654,8 @@ bool SIInstrInfo::areMemAccessesTriviallyDisjoint(const MachineInstr &MIa,
   // underlying address space, even if it was lowered to a different one,
   // e.g. private accesses lowered to use MUBUF instructions on a scratch
   // buffer.
-  if (isDS(MIa)) {
-    if (isDS(MIb))
+  if (isDS(MIa) || isLDSDMA(MIa)) {
+    if (isDS(MIb) || isLDSDMA(MIb))
       return checkInstOffsetsDoNotOverlap(MIa, MIb);
 
     return !isFLAT(MIb) || isSegmentSpecificFLAT(MIb);
diff --git a/llvm/lib/Target/AMDGPU/SIInstrInfo.h b/llvm/lib/Target/AMDGPU/SIInstrInfo.h
index e388b5550cb10..456d8d835986f 100644
--- a/llvm/lib/Target/AMDGPU/SIInstrInfo.h
+++ b/llvm/lib/Target/AMDGPU/SIInstrInfo.h
@@ -547,6 +547,14 @@ class SIInstrInfo final : public AMDGPUGenInstrInfo {
     return get(Opcode).TSFlags & SIInstrFlags::DS;
   }
 
+  static bool isLDSDMA(const MachineInstr &MI) {
+    return isVALU(MI) && (isMUBUF(MI) || isFLAT(MI));
+  }
+
+  bool isLDSDMA(uint16_t Opcode) const {
+    return isVALU(Opcode) && (isMUBUF(Opcode) || isFLAT(Opcode));
+  }
+
   static bool isGWS(const MachineInstr &MI) {
     return MI.getDesc().TSFlags & SIInstrFlags::GWS;
   }
diff --git a/llvm/lib/Target/AMDGPU/lds-dma-waits.ll b/llvm/lib/Target/AMDGPU/lds-dma-waits.ll
new file mode 100644
index 0000000000000..31155290216d0
--- /dev/null
+++ b/llvm/lib/Target/AMDGPU/lds-dma-waits.ll
@@ -0,0 +1,154 @@
+; RUN: llc -march=amdgcn -mcpu=gfx900 -verify-machineinstrs < %s | FileCheck %s --check-prefixes=GCN,GFX9
+; RUN: llc -march=amdgcn -mcpu=gfx1030 -verify-machineinstrs < %s | FileCheck %s --check-prefixes=GCN,GFX10
+
+@lds.0 = internal addrspace(3) global [64 x float] poison, align 16
+@lds.1 = internal addrspace(3) global [64 x float] poison, align 16
+@lds.2 = internal addrspace(3) global [64 x float] poison, align 16
+@lds.3 = internal addrspace(3) global [64 x float] poison, align 16
+@lds.4 = internal addrspace(3) global [64 x float] poison, align 16
+@lds.5 = internal addrspace(3) global [64 x float] poison, align 16
+@lds.6 = internal addrspace(3) global [64 x float] poison, align 16
+@lds.7 = internal addrspace(3) global [64 x float] poison, align 16
+@lds.8 = internal addrspace(3) global [64 x float] poison, align 16
+@lds.9 = internal addrspace(3) global [64 x float] poison, align 16
+
+declare void @llvm.amdgcn.raw.buffer.load.lds(<4 x i32> %rsrc, ptr addrspace(3) nocapture, i32 %size, i32 %voffset, i32 %soffset, i32 %offset, i32 %aux)
+declare void @llvm.amdgcn.global.load.lds(ptr addrspace(1) nocapture %gptr, ptr addrspace(3) nocapture %lptr, i32 %size, i32 %offset, i32 %aux)
+
+; GCN-LABEL: {{^}}buffer_load_lds_dword_2_arrays:
+; GCN-COUNT-4: buffer_load_dword
+; GCN: s_waitcnt vmcnt(2)
+; GCN: ds_read_b32
+; GCN: s_waitcnt vmcnt(0)
+; GCN: ds_read_b32
+define amdgpu_kernel void @buffer_load_lds_dword_2_arrays(<4 x i32> %rsrc, i32 %i1, i32 %i2, ptr addrspace(1) %out) {
+main_body:
+  call void @llvm.amdgcn.raw.buffer.load.lds(<4 x i32> %rsrc, ptr addrspace(3) @lds.0, i32 4, i32 0, i32 0, i32 0, i32 0)
+  call void @llvm.amdgcn.raw.buffer.load.lds(<4 x i32> %rsrc, ptr addrspace(3) @lds.0, i32 4, i32 4, i32 0, i32 0, i32 0)
+  call void @llvm.amdgcn.raw.buffer.load.lds(<4 x i32> %rsrc, ptr addrspace(3) @lds.1, i32 4, i32 8, i32 0, i32 0, i32 0)
+  call void @llvm.amdgcn.raw.buffer.load.lds(<4 x i32> %rsrc, ptr addrspace(3) @lds.1, i32 4, i32 12, i32 0, i32 0, i32 0)
+  %gep.0 = getelementptr float, ptr addrspace(3) @lds.0, i32 %i1
+  %gep.1 = getelementptr float, ptr addrspace(3) @lds.1, i32 %i2
+  %val.0 = load float, ptr addrspace(3) %gep.0, align 4
+  call void @llvm.amdgcn.wave.barrier()
+  %val.1 = load float, ptr addrspace(3) %gep.1, align 4
+  %tmp.0 = insertelement <2 x float> undef, float %val.0, i32 0
+  %res = insertelement <2 x float> %tmp.0, float %val.1, i32 1
+  store <2 x float> %res, ptr addrspace(1) %out
+  ret void
+}
+
+; On gfx9 if there is a pending FLAT operation, and this is a VMem or LGKM
+; waitcnt and the target can report early completion, then we need to force a waitcnt 0.
+
+; GCN-LABEL: {{^}}global_load_lds_dword_2_arrays:
+; GCN-COUNT-4: global_load_dword
+; GFX9: s_waitcnt vmcnt(0)
+; GFX9-COUNT-2: ds_read_b32
+; GFX10: s_waitcnt vmcnt(2)
+; GFX10: ds_read_b32
+; GFX10: s_waitcnt vmcnt(0)
+; GFX10: ds_read_b32
+define amdgpu_kernel void @global_load_lds_dword_2_arrays(ptr addrspace(1) nocapture %gptr, i32 %i1, i32 %i2, ptr addrspace(1) %out) {
+main_body:
+  call void @llvm.amdgcn.global.load.lds(ptr addrspace(1) %gptr, ptr addrspace(3) @lds.0, i32 4, i32 0, i32 0)
+  call void @llvm.amdgcn.global.load.lds(ptr addrspace(1) %gptr, ptr addrspace(3) @lds.0, i32 4, i32 4, i32 0)
+  call void @llvm.amdgcn.global.load.lds(ptr addrspace(1) %gptr, ptr addrspace(3) @lds.1, i32 4, i32 8, i32 0)
+  call void @llvm.amdgcn.global.load.lds(ptr addrspace(1) %gptr, ptr addrspace(3) @lds.1, i32 4, i32 12, i32 0)
+  %gep.0 = getelementptr float, ptr addrspace(3) @lds.0, i32 %i1
+  %gep.1 = getelementptr float, ptr addrspace(3) @lds.1, i32 %i2
+  %val.0 = load float, ptr addrspace(3) %gep.0, align 4
+  call void @llvm.amdgcn.wave.barrier()
+  %val.1 = load float, ptr addrspace(3) %gep.1, align 4
+  %tmp.0 = insertelement <2 x float> undef, float %val.0, i32 0
+  %res = insertelement <2 x float> %tmp.0, float %val.1, i32 1
+  store <2 x float> %res, ptr addrspace(1) %out
+  ret void
+}
+
+; There are 8 pseudo registers defined to track LDS DMA dependencies.
+; When exhausted we default to vmcnt(0).
+
+; GCN-LABEL: {{^}}buffer_load_lds_dword_10_arrays:
+; GCN-COUNT-10: buffer_load_dword
+; GCN: s_waitcnt vmcnt(8)
+; GCN: ds_read_b32
+; GCN: s_waitcnt vmcnt(7)
+; GCN: ds_read_b32
+; GCN: s_waitcnt vmcnt(6)
+; GCN: ds_read_b32
+; GCN: s_waitcnt vmcnt(5)
+; GCN: ds_read_b32
+; GCN: s_waitcnt vmcnt(4)
+; GCN: ds_read_b32
+; GCN: s_waitcnt vmcnt(3)
+; GCN: ds_read_b32
+; GCN: s_waitcnt vmcnt(2)
+; GCN-NOT: s_waitcnt vmcnt
+; GCN: ds_read_b32
+; GCN: s_waitcnt vmcnt(0)
+; GCN: ds_read_b32
+define amdgpu_kernel void @buffer_load_lds_dword_10_arrays(<4 x i32> %rsrc, i32 %i1, i32 %i2, i32 %i3, i32 %i4, i32 %i5, i32 %i6, i32 %i7, i32 %i8, i32 %i9, ptr addrspace(1) %out) {
+main_body:
+  call void @llvm.amdgcn.raw.buffer.load.lds(<4 x i32> %rsrc, ptr addrspace(3) @lds.0, i32 4, i32 0, i32 0, i32 0, i32 0)
+  call void @llvm.amdgcn.raw.buffer.load.lds(<4 x i32> %rsrc, ptr addrspace(3) @lds.1, i32 4, i32 0, i32 0, i32 0, i32 0)
+  call void @llvm.amdgcn.raw.buffer.load.lds(<4 x i32> %rsrc, ptr addrspace(3) @lds.2, i32 4, i32 0, i32 0, i32 0, i32 0)
+  call void @llvm.amdgcn.raw.buffer.load.lds(<4 x i32> %rsrc, ptr addrspace(3) @lds.3, i32 4, i32 0, i32 0, i32 0, i32 0)
+  call void @llvm.amdgcn.raw.buffer.load.lds(<4 x i32> %rsrc, ptr addrspace(3) @lds.4, i32 4, i32 0, i32 0, i32 0, i32 0)
+  call void @llvm.amdgcn.raw.buffer.load.lds(<4 x i32> %rsrc, ptr addrspace(3) @lds.5, i32 4, i32 0, i32 0, i32 0, i32 0)
+  call void @llvm.amdgcn.raw.buffer.load.lds(<4 x i32> %rsrc, ptr addrspace(3) @lds.6, i32 4, i32 0, i32 0, i32 0, i32 0)
+  call void @llvm.amdgcn.raw.buffer.load.lds(<4 x i32> %rsrc, ptr addrspace(3) @lds.7, i32 4, i32 0, i32 0, i32 0, i32 0)
+  call void @llvm.amdgcn.raw.buffer.load.lds(<4 x i32> %rsrc, ptr addrspace(3) @lds.8, i32 4, i32 0, i32 0, i32 0, i32 0)
+  call void @llvm.amdgcn.raw.buffer.load.lds(<4 x i32> %rsrc, ptr addrspace(3) @lds.9, i32 4, i32 0, i32 0, i32 0, i32 0)
+  %gep.0 = getelementptr float, ptr addrspace(3) @lds.0, i32 %i1
+  %gep.1 = getelementptr float, ptr addrspace(3) @lds.1, i32 %i2
+  %gep.2 = getelementptr float, ptr addrspace(3) @lds.2, i32 %i2
+  %gep.3 = getelementptr float, ptr addrspace(3) @lds.3, i32 %i2
+  %gep.4 = getelementptr float, ptr addrspace(3) @lds.4, i32 %i2
+  %gep.5 = getelementptr float, ptr addrspace(3) @lds.5, i32 %i2
+  %gep.6 = getelementptr float, ptr addrspace(3) @lds.6, i32 %i2
+  %gep.7 = getelementptr float, ptr addrspace(3) @lds.7, i32 %i2
+  %gep.8 = getelementptr float, ptr addrspace(3) @lds.8, i32 %i2
+  %gep.9 = getelementptr float, ptr addrspace(3) @lds.9, i32 %i2
+  %val.0 = load float, ptr addrspace(3) %gep.0, align 4
+  call void @llvm.amdgcn.wave.barrier()
+  %val.1 = load float, ptr addrspace(3) %gep.1, align 4
+  call void @llvm.amdgcn.wave.barrier()
+  %val.2 = load float, ptr addrspace(3) %gep.2, align 4
+  call void @llvm.amdgcn.wave.barrier()
+  %val.3 = load float, ptr addrspace(3) %gep.3, align 4
+  call void @llvm.amdgcn.wave.barrier()
+  %val.4 = load float, ptr addrspace(3) %gep.4, align 4
+  call void @llvm.amdgcn.wave.barrier()
+  %val.5 = load float, ptr addrspace(3) %gep.5, align 4
+  call void @llvm.amdgcn.wave.barrier()
+  %val.6 = load float, ptr addrspace(3) %gep.6, align 4
+  call void @llvm.amdgcn.wave.barrier()
+  %val.7 = load float, ptr addrspace(3) %gep.7, align 4
+  call void @llvm.amdgcn.wave.barrier()
+  %val.8 = load float, ptr addrspace(3) %gep.8, align 4
+  call void @llvm.amdgcn.wave.barrier()
+  %val.9 = load float, ptr addrspace(3) %gep.9, align 4
+  %out.gep.1 = getelementptr float, ptr addrspace(1) %out, i32 1
+  %out.gep.2 = getelementptr float, ptr addrspace(1) %out, i32 2
+  %out.gep.3 = getelementptr float, ptr addrspace(1) %out, i32 3
+  %out.gep.4 = getelementptr float, ptr addrspace(1) %out, i32 4
+  %out.gep.5 = getelementptr float, ptr addrspace(1) %out, i32 5
+  %out.gep.6 = getelementptr float, ptr addrspace(1) %out, i32 6
+  %out.gep.7 = getelementptr float, ptr addrspace(1) %out, i32 7
+  %out.gep.8 = getelementptr float, ptr addrspace(1) %out, i32 8
+  %out.gep.9 = getelementptr float, ptr addrspace(1) %out, i32 9
+  store float %val.0, ptr addrspace(1) %out
+  store float %val.1, ptr addrspace(1) %out.gep.1
+  store float %val.2, ptr addrspace(1) %out.gep.2
+  store float %val.3, ptr addrspace(1) %out.gep.3
+  store float %val.4, ptr addrspace(1) %out.gep.4
+  store float %val.5, ptr addrspace(1) %out.gep.5
+  store float %val.6, ptr addrspace(1) %out.gep.6
+  store float %val.7, ptr addrspace(1) %out.gep.7
+  store float %val.8, ptr addrspace(1) %out.gep.8
+  store float %val.9, ptr addrspace(1) %out.gep.9
+  ret void
+}
+
+declare void @llvm.amdgcn.wave.barrier()
diff --git a/llvm/test/CodeGen/AMDGPU/llc-pipeline.ll b/llvm/test/CodeGen/AMDGPU/llc-pipeline.ll
index 7abb789019f1f..9787b8b6c6fbe 100644
--- a/llvm/test/CodeGen/AMDGPU/llc-pipeline.ll
+++ b/llvm/test/CodeGen/AMDGPU/llc-pipeline.ll
@@ -130,6 +130,8 @@
 ; GCN-O0-NEXT:        MachineDominator Tree Construction
 ; GCN-O0-NEXT:        Machine Natural Loop Construction
 ; GCN-O0-NEXT:        MachinePostDominator Tree Construction
+; GCN-O0-NEXT:        Basic Alias Analysis (stateless AA impl)
+; GCN-O0-NEXT:        Function Alias Analysis Results
 ; GCN-O0-NEXT:        SI insert wait instructions
 ; GCN-O0-NEXT:        Insert...
[truncated]


github-actions bot commented Dec 5, 2023

✅ With the latest revision this PR passed the C/C++ code formatter.

return false;
});
if (I != LDSDMAStores.end()) {
Slot = I - LDSDMAStores.begin() + 1;
Contributor

I don't understand this iterator logic

Collaborator Author

I just need the index of the instruction in the list. One is added because the first pseudo register is reserved for the general case, when we cannot find a memory location.

Collaborator Author

I have simplified it and just replaced it with a regular loop.

Changed getVmemWaitEventType() to use mayWriteLDSThroughDMA instead
of isLDSDMA as this is more sound and added a comment.
break;
for (unsigned I = 0, E = LDSDMAStores.size(); I != E && !Slot; ++I) {
for (const auto *MemOp : LDSDMAStores[I]->memoperands()) {
if (MemOp->isStore() && AAI == MemOp->getAAInfo()) {
Contributor

I'm not sure I understand the equality check here; you perform an actual mayAlias query later? What is the point of this filter? Don't you need to consider any possibly aliasing write as an event?

Collaborator Author

The alias check is done later, at line 1229, and that is also the place where any aliasing access gets actual waits, read or write. Anyway, that part did not change here. This search is just to find whether we already have a slot allocated for this memory location. In reality we do not need the instruction, we just need the LDS memory location, but since mayAlias is an interface which needs a MachineInstr, I am searching for and keeping instructions in the list. All we really need from these instructions is their AA tags.
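A small sketch of that find-or-allocate scheme, with plain STL types standing in for the pass's structures (AATag stands in for the AAMDNodes tags; all names here are hypothetical):

```cpp
#include <cstddef>
#include <vector>

struct AATag { // stand-in for the AA metadata carried by a memoperand
  const void *Scope = nullptr;
  bool operator==(const AATag &O) const { return Scope == O.Scope; }
};
struct Store { AATag Tags; }; // stand-in for a representative MachineInstr

// Returns the 1-based slot for this location, allocating one if needed,
// or 0 when the table is full (callers fall back to the generic slot).
unsigned findOrAllocSlot(std::vector<const Store *> &LDSDMAStores,
                         const Store &S, std::size_t MaxSlots) {
  for (std::size_t I = 0, E = LDSDMAStores.size(); I != E; ++I)
    if (LDSDMAStores[I]->Tags == S.Tags)
      return static_cast<unsigned>(I) + 1; // location already has a slot
  if (LDSDMAStores.size() < MaxSlots) {
    LDSDMAStores.push_back(&S);            // one representative per AA tag
    return static_cast<unsigned>(LDSDMAStores.size());
  }
  return 0; // slots exhausted: generic slot only
}
```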

if (!MemOp->isStore() ||
MemOp->getAddrSpace() != AMDGPUAS::LOCAL_ADDRESS)
continue;
// Comparing just AA info does not guarantee memoperands are equal
Contributor

If you want exact equality of location, you would use the Value/PseudoSourceValue?

Collaborator Author

MI::mayAlias is extremely conservative and does not work with PseudoSourceValue; it just ignores AA tags in this case. The Value is also unreliable because it is really a GEP, always a different GEP.

Collaborator Author

Speaking of Value, this is how a real program's MIR looks. These are all loads and stores of the same arrays. You may see that the Value there is not helpful, only the alias scope is (look at the second memory operand, the LDS store):

  BUFFER_LOAD_DWORD_LDS_OFFEN %51:vgpr_32, %48:sgpr_128, %52:sreg_32, 0, 0, 0, implicit $exec, implicit $m0 :: (dereferenceable load (s32) from `ptr addrspace(1) poison`, align 1, !alias.scope !7, !noalias !10, addrspace 1), (dereferenceable store (s32) into %ir.10, align 1, !alias.scope !7, !noalias !10, addrspace 3)
  BUFFER_LOAD_DWORD_LDS_OFFEN killed %54:vgpr_32, %48:sgpr_128, %52:sreg_32, 0, 0, 0, implicit $exec, implicit $m0 :: (dereferenceable load (s32) from `ptr addrspace(1) poison`, align 1, !alias.scope !7, !noalias !10, addrspace 1), (dereferenceable store (s32) into %ir.12, align 1, !alias.scope !7, !noalias !10, addrspace 3)
  BUFFER_LOAD_DWORD_LDS_OFFEN killed %56:vgpr_32, %48:sgpr_128, %52:sreg_32, 0, 0, 0, implicit $exec, implicit $m0 :: (dereferenceable load (s32) from `ptr addrspace(1) poison`, align 1, !alias.scope !7, !noalias !10, addrspace 1), (dereferenceable store (s32) into %ir.14, align 1, !alias.scope !7, !noalias !10, addrspace 3)
  BUFFER_LOAD_DWORD_LDS_OFFEN killed %58:vgpr_32, %48:sgpr_128, %52:sreg_32, 0, 0, 0, implicit $exec, implicit $m0 :: (dereferenceable load (s32) from `ptr addrspace(1) poison`, align 1, !alias.scope !7, !noalias !10, addrspace 1), (dereferenceable store (s32) into %ir.16, align 1, !alias.scope !7, !noalias !10, addrspace 3)

Contributor

PseudoSourceValue::mayAlias is supposed to report aliasing to possible IR values. It looks like it's layered weirdly, and expects you to go through MachineInstr::mayAlias. MachineInstr::mayAlias ought to be using the AA tags, it shouldn't be a fundamental limitation

Contributor

It looks to me like it does use it if you pass UseTBAA=true. Not sure why this would be a parameter in the first place

Collaborator Author

PseudoSourceValue::mayAlias is supposed to report aliasing to possible IR values. It looks like it's layered weirdly, and expects you to go through MachineInstr::mayAlias. MachineInstr::mayAlias ought to be using the AA tags, it shouldn't be a fundamental limitation

This is all PSV::mayAlias() does:

bool PseudoSourceValue::mayAlias(const MachineFrameInfo *) const {
  return !(isGOT() || isConstantPool() || isJumpTable());
}

Not very useful. Then, even to get to the AA-tags check, MI::mayAlias() must first go through all the IR Value checks.

Collaborator Author

It looks to me like it does use it if you pass UseTBAA=true. Not sure why this would be a parameter in the first place

I am passing it, but to get to that check it must first go through all the Value and offset checks. Using AA is the last thing it does: https://llvm.org/doxygen/MachineInstr_8cpp_source.html#l01285

Collaborator Author

The values don't need to be identical, that's the point of the AA query. BasicAA will parse through the offsets

I also think that the values don't need to be identical. But that is what MI::mayAlias() does before it checks AA: https://llvm.org/doxygen/MachineInstr_8cpp_source.html#l01285
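To summarize the ordering being debated, here is a paraphrase of the control flow of MachineInstr::mayAlias from the source linked above, as of this discussion (a reader's summary, not a verbatim copy):

```cpp
// Paraphrased shape of MachineInstr::mayAlias(AA, Other, UseTBAA):
//
// 1. Conservatively return true ("may alias") for calls and for
//    volatile/ordered accesses, and when either instruction does not
//    carry analyzable memoperands.
// 2. For each pair of memoperands, first try to disambiguate via the
//    underlying Value / PseudoSourceValue plus offset and size checks.
// 3. Only as the final step -- and only when UseTBAA is set and both
//    operands carry AA metadata -- issue the AliasAnalysis query that
//    can actually see the !alias.scope / !noalias and TBAA tags.
```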

Contributor

But there's no PseudoSourceValue in this example, it should be a straightforward Value-to-Value comparison

Collaborator Author

Right, there is no PSV. I mentioned PSV because you had earlier suggested using it. As for the real IR value: it is not helpful to compare it. The IR value is a GEP, and this GEP is always different, i.e. these values never compare equal. The rest of the IR is already gone and unavailable for the analysis. Even if it were available, this GEP would address the kernel module LDS variable, a single huge LDS array, and would be useless again: in that case it would tell you that any LDS operation aliases any other. That is why, during module LDS lowering, I am creating alias scope info specifically to disambiguate aliasing after the pass has squashed all LDS variables.
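For illustration, a hand-written IR sketch of the kind of metadata this refers to (the scope and domain names are invented here; the real metadata comes from the module LDS lowering pass): two accesses into the merged LDS object whose GEP Values never compare equal, but whose scope lists keep them distinguishable:

```llvm
define void @sketch(ptr addrspace(3) %gep.a, ptr addrspace(3) %gep.b, float %v) {
  ; Both pointers land in the single merged LDS struct after lowering, so
  ; the Values are useless for disambiguation; only the scopes differ.
  %val = load float, ptr addrspace(3) %gep.a, align 4, !alias.scope !3, !noalias !4
  store float %v, ptr addrspace(3) %gep.b, align 4, !alias.scope !4, !noalias !3
  ret void
}

!0 = distinct !{!0, !"lds-lowering domain"} ; invented domain name
!1 = distinct !{!1, !0, !"lds.a"}           ; scope of the first LDS variable
!2 = distinct !{!2, !0, !"lds.b"}           ; scope of the second LDS variable
!3 = !{!1}                                  ; scope list for lds.a accesses
!4 = !{!2}                                  ; scope list for lds.b accesses
```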

@rampitec (Collaborator Author)

Ping

@rampitec (Collaborator Author)

To make it easier I am splitting the patch. I have pre-committed the test, and there is a part which fixes a missing wait on GFX10: #75245

@rampitec (Collaborator Author)

Another part is improving memoperand info: #75247. This is NFCI just by itself.

@rampitec (Collaborator Author)

Yet another part to fix disjoint memory checks with LDS DMA: #75249

@rampitec (Collaborator Author)

All split-off parts were merged and this patch is merged with main. Only the waitcount insertion pass changes remain here.

@jayfoad (Contributor) commented Dec 19, 2023

How does this work in a case like this?

call void @llvm.amdgcn.raw.buffer.load.lds(<4 x i32> %rsrc, ptr addrspace(3) @lds.3, i32 4, i32 0, i32 0, i32 0, i32 0)
call void @llvm.amdgcn.raw.buffer.load.lds(<4 x i32> %rsrc, ptr addrspace(3) %ptr, i32 4, i32 0, i32 0, i32 0, i32 0)
%val.3 = load float, ptr addrspace(3) @lds.3, align 4

i.e.

  • store to known lds address @lds.3 (this will use slot 0 and another slot e.g. slot 3?)
  • store to unknown lds address (this will use slot 0?)
  • load from known lds address @lds.3 (this will use slot 3?)

@rampitec (Collaborator Author)

> How does this work in a case like this? [full example quoted above]

It does not know the pointer, so it uses the default slot 0 and waits down to vmcnt(0). I have to tell anyone interested here: before I even wrote this code, it did not know of the dependency and did not wait for anything at all. Everyone was happy.

@jayfoad (Contributor) commented Dec 19, 2023

> It does not know the pointer, so it uses the default slot 0 and waits down to vmcnt(0).

Test case:

@lds.0 = internal addrspace(3) global [64 x float] poison, align 16
@lds.1 = internal addrspace(3) global [64 x float] poison, align 16

declare void @llvm.amdgcn.raw.buffer.load.lds(<4 x i32> %rsrc, ptr addrspace(3) nocapture, i32 %size, i32 %voffset, i32 %soffset, i32 %offset, i32 %aux)

define amdgpu_kernel void @f(<4 x i32> %rsrc, i32 %i1, i32 %i2, ptr addrspace(1) %out, ptr addrspace(3) %ptr) {
main_body:
  call void @llvm.amdgcn.raw.buffer.load.lds(<4 x i32> %rsrc, ptr addrspace(3) @lds.0, i32 4, i32 0, i32 0, i32 0, i32 0)
  call void @llvm.amdgcn.raw.buffer.load.lds(<4 x i32> %rsrc, ptr addrspace(3) %ptr, i32 4, i32 0, i32 0, i32 0, i32 0)
  %gep.0 = getelementptr float, ptr addrspace(3) @lds.0, i32 %i1
  %gep.1 = getelementptr float, ptr addrspace(3) @lds.1, i32 %i2
  %val.0 = load volatile float, ptr addrspace(3) %gep.0, align 4
  %val.1 = load volatile float, ptr addrspace(3) %gep.1, align 4
  %out.gep.1 = getelementptr float, ptr addrspace(1) %out, i32 1
  store float %val.0, ptr addrspace(1) %out
  store float %val.1, ptr addrspace(1) %out.gep.1
  ret void
}

Generates:

	s_load_dwordx8 s[4:11], s[0:1], 0x24
	s_load_dword s2, s[0:1], 0x44
	s_mov_b32 m0, 0
	v_mov_b32_e32 v2, 0
	s_waitcnt lgkmcnt(0)
	buffer_load_dword off, s[4:7], 0 lds
	s_mov_b32 m0, s2
	s_lshl_b32 s0, s8, 2
	buffer_load_dword off, s[4:7], 0 lds
	s_lshl_b32 s1, s9, 2
	v_mov_b32_e32 v0, s0
	v_mov_b32_e32 v1, s1
	s_waitcnt vmcnt(1)
	ds_read_b32 v0, v0
	s_waitcnt vmcnt(0)
	ds_read_b32 v1, v1 offset:256
	s_waitcnt lgkmcnt(0)
	global_store_dwordx2 v2, v[0:1], s[10:11]
	s_endpgm

The s_waitcnt vmcnt(1) seems incorrect, because the second buffer-load-to-lds might clobber @lds.0.

> I have to tell anyone interested here: before I even wrote this code, it did not know of the dependency and did not wait for anything at all. Everyone was happy.

I am still happy, because buffer/flat/global-load-to-lds was removed in GFX11.

@rampitec (Collaborator Author)

> The s_waitcnt vmcnt(1) seems incorrect, because the second buffer-load-to-lds might clobber @lds.0.

This is still correct: a pointer argument cannot alias a module global. A pointer argument to a kernel is an external LDS allocation requested by the host side, and the host cannot see module LDS.

@rampitec (Collaborator Author)

That is really the point of the patch: if we are able to definitively identify the LDS object targeted by both the load and the store, we only wait on that store or stores. And the only way to definitively identify the object at this stage is via the alias.scope info which we generate ourselves during module LDS lowering.

@rampitec (Collaborator Author)

I have added a check for the presence of alias scope info just in case we get a rogue AA. The testcase with a pointer argument still produces correct code with vmcnt(1).

@rampitec (Collaborator Author)

Actually, since I am only using the alias scope, I can avoid alias analysis altogether and just compare alias scopes. This needs no analysis pass, no calls to mayAlias, and is in general simpler code. You can see the alternative PR if you like it more: #75974
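A hedged sketch of what "compare only the alias scope" could look like; sameTrackedLDSLocation is a hypothetical helper for illustration, the real alternative lives in #75974:

```cpp
#include "llvm/CodeGen/MachineMemOperand.h"

// Treat two LDS DMA memoperands as the same tracked location exactly
// when they carry the same non-null !alias.scope list; no AA pass and
// no mayAlias call are needed for this.
static bool sameTrackedLDSLocation(const llvm::MachineMemOperand *A,
                                   const llvm::MachineMemOperand *B) {
  const llvm::MDNode *SA = A->getAAInfo().Scope;
  const llvm::MDNode *SB = B->getAAInfo().Scope;
  return SA && SA == SB; // identical scope list => same LDS object
}
```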

@rampitec (Collaborator Author) commented Jan 2, 2024

Ping

@rampitec (Collaborator Author)

Ping


Comment on lines 133 to 134
; GCN-O0-NEXT: Basic Alias Analysis (stateless AA impl)
; GCN-O0-NEXT: Function Alias Analysis Results
Contributor

Can you skip querying for AA at -O0?

Collaborator Author

If I just skip the getAnalysis call it does not help, since the analysis is requested in getAnalysisUsage. If I do not request it, it is not available at any opt level. This is the benefit of the alternative #75974: it does not request the full analysis.

Contributor

You can check the opt level in getAnalysisUsage too, from the TargetMachine

ScoreBrackets.determineWait(VM_CNT, RegNo, Wait);
unsigned RegNo = SQ_MAX_PGM_VGPRS + EXTRA_VGPR_LDS;
bool FoundAliasingStore = false;
if (Ptr && Memop->getAAInfo() && Memop->getAAInfo().Scope) {
Contributor

I still don't understand the usage of scope; scope isn't special, isn't common, and I do not at all like treating it specially. I think you should just let the AA query figure out what to do with it.

Collaborator Author

I have reserved just 8 pseudo registers to track it, and I do not want to fill them with unrelated stuff. I know that the only way AA will be able to handle this very specific situation is if there is scope info; otherwise there is no reason to waste a slot and compile time. If I do not enter this 'if', the pass will just do the conservatively correct thing and wait for this memory regardless of aliasing or the lack of it.

Collaborator Author

I have added more comments to explain this. The place which fills the LDS DMA slots bails out if there is no scope info, so as not to waste the limited tracking slots. In that case the generic first slot is still used for such an operation (it is always used, regardless of whether we can be more specific about the underlying object). Here AA would be unable to disambiguate aliasing without scope info, so this condition is simply a shortcut to avoid an expensive loop and AA query. I could remove this part of the condition and nothing would change, except it would work slower. Note that not entering this 'if' statement always produces a conservatively correct wait using the first, generic tracking slot, which always gets a score regardless of our ability to track a specific object. The condition guards the relaxation code, avoiding the generic and conservative 'wait for everything' path below.

@arsenm (Contributor) left a comment

lgtm, but can still fix the -O0 thing


@rampitec (Collaborator Author)

lgtm, but can still fix the -O0 thing

But where do I get TM in the getAnalysisUsage?

@rampitec (Collaborator Author)

> But where do I get TM in the getAnalysisUsage?

Found addUsedIfAvailable() which does the trick.
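For reference, a sketch of the pattern that comment points at (an assumed shape, not a quote of the committed change): addUsedIfAvailable makes the AA dependency optional, and the pass then tolerates its absence at -O0:

```cpp
void getAnalysisUsage(AnalysisUsage &AU) const override {
  AU.setPreservesCFG();
  AU.addRequired<MachineLoopInfo>();
  AU.addRequired<MachinePostDominatorTree>();
  // Optional, unlike addRequired: -O0 pipelines that never schedule
  // BasicAA are not forced to run it just for this pass.
  AU.addUsedIfAvailable<AAResultsWrapperPass>();
  MachineFunctionPass::getAnalysisUsage(AU);
}

// In runOnMachineFunction: getAnalysisIfAvailable returns null instead
// of asserting when the analysis was not scheduled.
if (auto *AAR = getAnalysisIfAvailable<AAResultsWrapperPass>())
  AA = &AAR->getAAResults();
```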

@arsenm (Contributor) commented Jan 18, 2024

> But where do I get TM in the getAnalysisUsage?

MF.getTarget() (or maybe a pass parameter is necessary?)

@rampitec (Collaborator Author)

> MF.getTarget() (or maybe a pass parameter is necessary?)

There is no MF there of course.

@rampitec rampitec merged commit 021def6 into llvm:main Jan 18, 2024
@rampitec rampitec deleted the lds-dma-wait branch January 18, 2024 07:44
ampandey-1995 pushed a commit to ampandey-1995/llvm-project that referenced this pull request Jan 19, 2024
searlmc1 pushed a commit to ROCm/llvm-project that referenced this pull request Feb 6, 2024