[AMDGPU] Mitigate GFX12 VALU read SGPR hazard #100067
Conversation
@llvm/pr-subscribers-backend-amdgpu

Author: Carl Ritson (perlfu)

Changes

Any SGPR read by a VALU can potentially obscure SALU writes to the same register. Compute a global cache of SGPRs with any VALU reads and use this to avoid inserting mitigation for SGPRs never accessed by VALUs. To avoid excessive search when compile time is a priority, implement a secondary mode where all SALU writes are mitigated.

Co-authored-by: Shilei Tian <[email protected]>

Patch is 1.67 MiB, truncated to 20.00 KiB below, full version: https://github.com/llvm/llvm-project/pull/100067.diff

92 Files Affected:
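For context, a minimal sketch of the hazard sequence being mitigated, in GFX12 assembly; the instruction and register choices are illustrative, but the s_wait_alu 0xfffe encoding of sa_sdst(0) matches the waits inserted in the tests below:

    v_cndmask_b32_e32 v0, v1, v2, vcc_lo   ; (1) VALU reads an SGPR pair (VCC)
    s_or_b32 vcc_lo, s0, s1                ; (2) SALU writes the same SGPR
    s_wait_alu 0xfffe                      ; inserted: sa_sdst(0) flushes the SALU write
    v_cndmask_b32_e32 v3, v4, v5, vcc_lo   ; (3) VALU reads the SGPR again, now safe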
diff --git a/llvm/lib/Target/AMDGPU/GCNHazardRecognizer.cpp b/llvm/lib/Target/AMDGPU/GCNHazardRecognizer.cpp
index a402fc6d7e611..45c2624e43d4c 100644
--- a/llvm/lib/Target/AMDGPU/GCNHazardRecognizer.cpp
+++ b/llvm/lib/Target/AMDGPU/GCNHazardRecognizer.cpp
@@ -14,6 +14,8 @@
#include "GCNSubtarget.h"
#include "MCTargetDesc/AMDGPUMCTargetDesc.h"
#include "SIMachineFunctionInfo.h"
+#include "llvm/ADT/PostOrderIterator.h"
+#include "llvm/CodeGen/MachineFrameInfo.h"
#include "llvm/CodeGen/MachineFunction.h"
#include "llvm/CodeGen/ScheduleDAG.h"
#include "llvm/TargetParser/TargetParser.h"
@@ -43,6 +45,10 @@ static cl::opt<unsigned, false, MFMAPaddingRatioParser>
cl::desc("Fill a percentage of the latency between "
"neighboring MFMA with s_nops."));
+static cl::opt<unsigned> MaxExhaustiveHazardSearch(
+ "amdgpu-max-exhaustive-hazard-search", cl::init(128), cl::Hidden,
+ cl::desc("Maximum function size for exhaustive hazard search"));
+
//===----------------------------------------------------------------------===//
// Hazard Recognizer Implementation
//===----------------------------------------------------------------------===//
@@ -50,15 +56,11 @@ static cl::opt<unsigned, false, MFMAPaddingRatioParser>
static bool shouldRunLdsBranchVmemWARHazardFixup(const MachineFunction &MF,
const GCNSubtarget &ST);
-GCNHazardRecognizer::GCNHazardRecognizer(const MachineFunction &MF) :
- IsHazardRecognizerMode(false),
- CurrCycleInstr(nullptr),
- MF(MF),
- ST(MF.getSubtarget<GCNSubtarget>()),
- TII(*ST.getInstrInfo()),
- TRI(TII.getRegisterInfo()),
- ClauseUses(TRI.getNumRegUnits()),
- ClauseDefs(TRI.getNumRegUnits()) {
+GCNHazardRecognizer::GCNHazardRecognizer(const MachineFunction &MF)
+ : IsHazardRecognizerMode(false), CurrCycleInstr(nullptr), MF(MF),
+ ST(MF.getSubtarget<GCNSubtarget>()), TII(*ST.getInstrInfo()),
+ TRI(TII.getRegisterInfo()), UseVALUReadHazardExhaustiveSearch(false),
+ ClauseUses(TRI.getNumRegUnits()), ClauseDefs(TRI.getNumRegUnits()) {
MaxLookAhead = MF.getRegInfo().isPhysRegUsed(AMDGPU::AGPR0) ? 19 : 5;
TSchedModel.init(&ST);
RunLdsBranchVmemWARHazardFixup = shouldRunLdsBranchVmemWARHazardFixup(MF, ST);
@@ -1104,6 +1106,7 @@ void GCNHazardRecognizer::fixHazards(MachineInstr *MI) {
fixWMMAHazards(MI);
fixShift64HighRegBug(MI);
fixVALUMaskWriteHazard(MI);
+ fixVALUReadSGPRHazard(MI);
}
bool GCNHazardRecognizer::fixVcmpxPermlaneHazards(MachineInstr *MI) {
@@ -2759,6 +2762,36 @@ bool GCNHazardRecognizer::ShouldPreferAnother(SUnit *SU) {
return false;
}
+// Adjust global offsets for instructions bundled with S_GETPC_B64 after
+// insertion of a new instruction.
+static void updateGetPCBundle(MachineInstr *NewMI) {
+ if (!NewMI->isBundled())
+ return;
+
+ // Find start of bundle.
+ auto I = NewMI->getIterator();
+ while (I->isBundledWithPred())
+ I--;
+ if (I->isBundle())
+ I++;
+
+ // Bail if this is not an S_GETPC bundle.
+ if (I->getOpcode() != AMDGPU::S_GETPC_B64)
+ return;
+
+ // Update offsets of any references in the bundle.
+ const unsigned NewBytes = NewMI->getDesc().getSize();
+ auto NextMI = std::next(NewMI->getIterator());
+ auto End = NewMI->getParent()->end();
+ while (NextMI != End && NextMI->isBundledWithPred()) {
+ for (auto &Operand : NextMI->operands()) {
+ if (Operand.isGlobal())
+ Operand.setOffset(Operand.getOffset() + NewBytes);
+ }
+ NextMI++;
+ }
+}
+
bool GCNHazardRecognizer::fixVALUMaskWriteHazard(MachineInstr *MI) {
if (!ST.hasVALUMaskWriteHazard())
return false;
@@ -2876,22 +2909,269 @@ bool GCNHazardRecognizer::fixVALUMaskWriteHazard(MachineInstr *MI) {
auto NextMI = std::next(MI->getIterator());
// Add s_waitcnt_depctr sa_sdst(0) after SALU write.
- BuildMI(*MI->getParent(), NextMI, MI->getDebugLoc(),
- TII.get(AMDGPU::S_WAITCNT_DEPCTR))
- .addImm(AMDGPU::DepCtr::encodeFieldSaSdst(0));
+ auto NewMI = BuildMI(*MI->getParent(), NextMI, MI->getDebugLoc(),
+ TII.get(AMDGPU::S_WAITCNT_DEPCTR))
+ .addImm(AMDGPU::DepCtr::encodeFieldSaSdst(0));
// SALU write may be s_getpc in a bundle.
- if (MI->getOpcode() == AMDGPU::S_GETPC_B64) {
- // Update offsets of any references in the bundle.
- while (NextMI != MI->getParent()->end() &&
- NextMI->isBundledWithPred()) {
- for (auto &Operand : NextMI->operands()) {
- if (Operand.isGlobal())
- Operand.setOffset(Operand.getOffset() + 4);
+ updateGetPCBundle(NewMI);
+
+ return true;
+}
+
+static unsigned baseSGPRNumber(Register Reg, const SIRegisterInfo &TRI) {
+ unsigned RegN = TRI.getEncodingValue(Reg);
+ assert(RegN <= 127);
+ return (RegN >> 1) & 0x3f;
+}
+
+// For VALUReadSGPRHazard: pre-compute a bit vector of all SGPRs used by VALUs.
+void GCNHazardRecognizer::computeVALUHazardSGPRs(MachineFunction *MMF) {
+ assert(MMF == &MF);
+
+ // Assume non-empty vector means it has already been computed.
+ if (!VALUReadHazardSGPRs.empty())
+ return;
+
+ auto CallingConv = MF.getFunction().getCallingConv();
+ bool IsCallFree =
+ AMDGPU::isEntryFunctionCC(CallingConv) && !MF.getFrameInfo().hasCalls();
+
+ // Exhaustive search is only viable in non-caller/callee functions where
+ // VALUs will be exposed to the hazard recognizer.
+ UseVALUReadHazardExhaustiveSearch =
+ IsCallFree && MF.getTarget().getOptLevel() > CodeGenOptLevel::None &&
+ MF.getInstructionCount() <= MaxExhaustiveHazardSearch;
+
+ // Consider all SGPRs hazards if the shader uses function calls or is callee.
+ bool UseVALUUseCache =
+ IsCallFree && MF.getTarget().getOptLevel() > CodeGenOptLevel::None;
+ VALUReadHazardSGPRs.resize(64, !UseVALUUseCache);
+ if (!UseVALUUseCache)
+ return;
+
+ // Perform a post ordered reverse scan to find VALUs which read an SGPR
+ // before a SALU write to the same SGPR. This provides a reduction in
+ // hazard insertion when all VALU access to an SGPR occurs after its last
+ // SALU write, when compared to a linear scan.
+ const unsigned SGPR_NULL = TRI.getEncodingValue(AMDGPU::SGPR_NULL_gfx11plus);
+ const MachineRegisterInfo &MRI = MF.getRegInfo();
+ BitVector SALUWriteSGPRs(64), ReadSGPRs(64);
+ MachineCycleInfo CI;
+ CI.compute(*MMF);
+
+ for (auto *MBB : post_order(&MF)) {
+ bool InCycle = CI.getCycle(MBB) != nullptr;
+ for (auto &MI : reverse(MBB->instrs())) {
+ bool IsVALU = SIInstrInfo::isVALU(MI);
+ bool IsSALU = SIInstrInfo::isSALU(MI);
+ if (!(IsVALU || IsSALU))
+ continue;
+
+ for (const MachineOperand &Op : MI.operands()) {
+ if (!Op.isReg())
+ continue;
+ Register Reg = Op.getReg();
+ // Only consider implicit operands of VCC.
+ if (Op.isImplicit() && !(Reg == AMDGPU::VCC_LO ||
+ Reg == AMDGPU::VCC_HI || Reg == AMDGPU::VCC))
+ continue;
+ if (!TRI.isSGPRReg(MRI, Reg))
+ continue;
+ if (TRI.getEncodingValue(Reg) >= SGPR_NULL)
+ continue;
+ unsigned RegN = baseSGPRNumber(Reg, TRI);
+ if (IsVALU && Op.isUse()) {
+ // Note: any access within a cycle must be considered a hazard.
+ if (InCycle || (ReadSGPRs[RegN] && SALUWriteSGPRs[RegN]))
+ VALUReadHazardSGPRs.set(RegN);
+ ReadSGPRs.set(RegN);
+ } else if (IsSALU) {
+ if (Op.isDef())
+ SALUWriteSGPRs.set(RegN);
+ else
+ ReadSGPRs.set(RegN);
+ }
}
- NextMI++;
}
}
+}
+
+bool GCNHazardRecognizer::fixVALUReadSGPRHazard(MachineInstr *MI) {
+ if (!ST.hasVALUReadSGPRHazard())
+ return false;
+
+ // The hazard sequence is fundamentally three instructions:
+ // 1. VALU reads SGPR
+ // 2. SALU writes SGPR
+ // 3. VALU/SALU reads SGPR
+ // Try to avoid searching for (1) because the expiry point of the hazard is
+ // indeterminate; however, the hazard between (2) and (3) can expire if the
+ // gap contains sufficient SALU instructions with no usage of SGPR from (1).
+ // Note: SGPRs must be considered as 64-bit pairs as hazard exists
+ // even if individual SGPRs are accessed.
+
+ bool MIIsSALU = SIInstrInfo::isSALU(*MI);
+ bool MIIsVALU = SIInstrInfo::isVALU(*MI);
+ if (!(MIIsSALU || MIIsVALU))
+ return false;
+
+ // Avoid expensive search when compile time is priority by
+ // mitigating every SALU which writes an SGPR.
+ if (MF.getTarget().getOptLevel() == CodeGenOptLevel::None) {
+ if (!SIInstrInfo::isSALU(*MI) || SIInstrInfo::isSOPP(*MI))
+ return false;
+
+ const MachineOperand *SDSTOp =
+ TII.getNamedOperand(*MI, AMDGPU::OpName::sdst);
+ if (!SDSTOp || !SDSTOp->isReg())
+ return false;
+
+ const Register HazardReg = SDSTOp->getReg();
+ if (HazardReg == AMDGPU::EXEC || HazardReg == AMDGPU::EXEC_LO ||
+ HazardReg == AMDGPU::EXEC_HI || HazardReg == AMDGPU::M0)
+ return false;
+
+ // Add s_wait_alu sa_sdst(0) after SALU write.
+ auto NextMI = std::next(MI->getIterator());
+ auto NewMI = BuildMI(*MI->getParent(), NextMI, MI->getDebugLoc(),
+ TII.get(AMDGPU::S_WAITCNT_DEPCTR))
+ .addImm(AMDGPU::DepCtr::encodeFieldSaSdst(0));
+
+ // SALU write may be s_getpc in a bundle.
+ updateGetPCBundle(NewMI);
+
+ return true;
+ }
+
+ // Pre-compute set of SGPR pairs read by VALUs.
+ // Note: pass mutable pointer to MachineFunction for CycleInfo.
+ computeVALUHazardSGPRs(MI->getMF());
+
+ // If no VALUs hazard SGPRs exist then nothing to do.
+ if (VALUReadHazardSGPRs.none())
+ return false;
+
+ // All SGPR writes before a call/return must be flushed as the callee/caller
+ // will not see the hazard chain, i.e. (2) to (3) described above.
+ const bool IsSetPC = (MI->getOpcode() == AMDGPU::S_SETPC_B64 ||
+ MI->getOpcode() == AMDGPU::S_SETPC_B64_return ||
+ MI->getOpcode() == AMDGPU::S_SWAPPC_B64 ||
+ MI->getOpcode() == AMDGPU::S_CALL_B64);
+
+ // Collect all SGPR sources for MI which are read by a VALU.
+ const unsigned SGPR_NULL = TRI.getEncodingValue(AMDGPU::SGPR_NULL_gfx11plus);
+ const MachineRegisterInfo &MRI = MF.getRegInfo();
+ SmallSet<Register, 4> SGPRsUsed;
+
+ if (!IsSetPC) {
+ for (const MachineOperand &Op : MI->all_uses()) {
+ Register OpReg = Op.getReg();
+
+ // Only consider VCC implicit uses on VALUs.
+ // The only expected SALU implicit access is SCC which is no hazard.
+ if (MIIsSALU && Op.isImplicit())
+ continue;
+
+ if (!TRI.isSGPRReg(MRI, OpReg))
+ continue;
+
+ // Ignore special purpose registers such as NULL, EXEC, and M0.
+ if (TRI.getEncodingValue(OpReg) >= SGPR_NULL)
+ continue;
+
+ unsigned RegN = baseSGPRNumber(OpReg, TRI);
+ if (!VALUReadHazardSGPRs[RegN])
+ continue;
+
+ SGPRsUsed.insert(OpReg);
+ }
+
+ // No SGPRs -> nothing to do.
+ if (SGPRsUsed.empty())
+ return false;
+ }
+
+ // A hazard is any SALU which writes one of the SGPRs read by MI.
+ auto IsHazardFn = [this, IsSetPC, &SGPRsUsed](const MachineInstr &I) {
+ if (!SIInstrInfo::isSALU(I))
+ return false;
+ // Ensure SGPR flush before call/return by conservatively assuming every
+ // SALU writes an SGPR.
+ if (IsSetPC && I.getNumDefs() > 0)
+ return true;
+ // Check for any register writes.
+ return llvm::any_of(SGPRsUsed, [this, &I](Register Reg) {
+ return I.modifiesRegister(Reg, &TRI);
+ });
+ };
+
+ const int SALUExpiryCount = SIInstrInfo::isSALU(*MI) ? 10 : 11;
+ auto IsExpiredFn = [&](const MachineInstr &I, int Count) {
+ if (Count >= SALUExpiryCount)
+ return true;
+ // s_wait_alu sa_sdst(0) on path mitigates hazard.
+ if (I.getOpcode() == AMDGPU::S_WAITCNT_DEPCTR &&
+ AMDGPU::DepCtr::decodeFieldSaSdst(I.getOperand(0).getImm()) == 0)
+ return true;
+ return false;
+ };
+
+ auto WaitStatesFn = [this, &SGPRsUsed](const MachineInstr &I) {
+ // Only count true SALUs as wait states.
+ if (!SIInstrInfo::isSALU(I) || SIInstrInfo::isSOPP(I))
+ return 0;
+ // SALU must be unrelated to any hazard registers.
+ if (llvm::any_of(SGPRsUsed, [this, &I](Register Reg) {
+ return I.readsRegister(Reg, &TRI);
+ }))
+ return 0;
+ return 1;
+ };
+
+ // Check for the hazard.
+ DenseSet<const MachineBasicBlock *> Visited;
+ int WaitStates = ::getWaitStatesSince(IsHazardFn, MI->getParent(),
+ std::next(MI->getReverseIterator()), 0,
+ IsExpiredFn, Visited, WaitStatesFn);
+
+ if (WaitStates >= SALUExpiryCount)
+ return false;
+
+ // Validate hazard through an exhaustive search.
+ if (UseVALUReadHazardExhaustiveSearch) {
+ // A hazard is any VALU which reads one of the paired SGPRs read by MI.
+ // This is searching for (1) in the hazard description.
+ auto hazardPair = [this](Register Reg) {
+ if (Reg == AMDGPU::VCC || Reg == AMDGPU::VCC_LO || Reg == AMDGPU::VCC_HI)
+ return Register(AMDGPU::VCC);
+ // TODO: handle TTMP?
+ return Register(AMDGPU::SGPR0_SGPR1 + baseSGPRNumber(Reg, TRI));
+ };
+ auto SearchHazardFn = [this, hazardPair,
+ &SGPRsUsed](const MachineInstr &I) {
+ if (!SIInstrInfo::isVALU(I))
+ return false;
+ // Check for any register reads.
+ return llvm::any_of(SGPRsUsed, [this, hazardPair, &I](Register Reg) {
+ return I.readsRegister(hazardPair(Reg), &TRI);
+ });
+ };
+ auto SearchExpiredFn = [&](const MachineInstr &I, int Count) {
+ return false;
+ };
+ if (::getWaitStatesSince(SearchHazardFn, MI, SearchExpiredFn) ==
+ std::numeric_limits<int>::max())
+ return false;
+ }
+
+ // Add s_wait_alu sa_sdst(0) before SALU read.
+ auto NewMI = BuildMI(*MI->getParent(), MI, MI->getDebugLoc(),
+ TII.get(AMDGPU::S_WAITCNT_DEPCTR))
+ .addImm(AMDGPU::DepCtr::encodeFieldSaSdst(0));
+
+ // SALU read may be after s_getpc in a bundle.
+ updateGetPCBundle(NewMI);
return true;
}
diff --git a/llvm/lib/Target/AMDGPU/GCNHazardRecognizer.h b/llvm/lib/Target/AMDGPU/GCNHazardRecognizer.h
index 3ccca527c626b..93b4b3771434b 100644
--- a/llvm/lib/Target/AMDGPU/GCNHazardRecognizer.h
+++ b/llvm/lib/Target/AMDGPU/GCNHazardRecognizer.h
@@ -48,6 +48,8 @@ class GCNHazardRecognizer final : public ScheduleHazardRecognizer {
const SIRegisterInfo &TRI;
TargetSchedModel TSchedModel;
bool RunLdsBranchVmemWARHazardFixup;
+ BitVector VALUReadHazardSGPRs;
+ bool UseVALUReadHazardExhaustiveSearch;
/// RegUnits of uses in the current soft memory clause.
BitVector ClauseUses;
@@ -107,6 +109,8 @@ class GCNHazardRecognizer final : public ScheduleHazardRecognizer {
bool fixWMMAHazards(MachineInstr *MI);
bool fixShift64HighRegBug(MachineInstr *MI);
bool fixVALUMaskWriteHazard(MachineInstr *MI);
+ void computeVALUHazardSGPRs(MachineFunction *MMF);
+ bool fixVALUReadSGPRHazard(MachineInstr *MI);
int checkMAIHazards(MachineInstr *MI);
int checkMAIHazards908(MachineInstr *MI);
diff --git a/llvm/lib/Target/AMDGPU/GCNSubtarget.h b/llvm/lib/Target/AMDGPU/GCNSubtarget.h
index e5817594a4521..1d151432f20b8 100644
--- a/llvm/lib/Target/AMDGPU/GCNSubtarget.h
+++ b/llvm/lib/Target/AMDGPU/GCNSubtarget.h
@@ -1245,6 +1245,8 @@ class GCNSubtarget final : public AMDGPUGenSubtargetInfo,
bool hasVALUMaskWriteHazard() const { return getGeneration() == GFX11; }
+ bool hasVALUReadSGPRHazard() const { return getGeneration() == GFX12; }
+
/// Return if operations acting on VGPR tuples require even alignment.
bool needsAlignedVGPRs() const { return GFX90AInsts; }
diff --git a/llvm/test/CodeGen/AMDGPU/GlobalISel/atomicrmw_fmax.ll b/llvm/test/CodeGen/AMDGPU/GlobalISel/atomicrmw_fmax.ll
index c701e873fdd2c..95089d4ddbb18 100644
--- a/llvm/test/CodeGen/AMDGPU/GlobalISel/atomicrmw_fmax.ll
+++ b/llvm/test/CodeGen/AMDGPU/GlobalISel/atomicrmw_fmax.ll
@@ -334,13 +334,15 @@ define float @global_agent_atomic_fmax_ret_f32__amdgpu_no_fine_grained_memory(pt
; GFX12-NEXT: s_wait_loadcnt 0x0
; GFX12-NEXT: global_inv scope:SCOPE_DEV
; GFX12-NEXT: v_cmp_eq_u32_e32 vcc_lo, v3, v4
+; GFX12-NEXT: s_wait_alu 0xfffe
; GFX12-NEXT: s_or_b32 s0, vcc_lo, s0
-; GFX12-NEXT: s_delay_alu instid0(SALU_CYCLE_1)
+; GFX12-NEXT: s_wait_alu 0xfffe
; GFX12-NEXT: s_and_not1_b32 exec_lo, exec_lo, s0
; GFX12-NEXT: s_cbranch_execnz .LBB4_1
; GFX12-NEXT: ; %bb.2: ; %atomicrmw.end
; GFX12-NEXT: s_or_b32 exec_lo, exec_lo, s0
; GFX12-NEXT: v_mov_b32_e32 v0, v3
+; GFX12-NEXT: s_wait_alu 0xfffe
; GFX12-NEXT: s_setpc_b64 s[30:31]
;
; GFX940-LABEL: global_agent_atomic_fmax_ret_f32__amdgpu_no_fine_grained_memory:
@@ -550,12 +552,14 @@ define void @global_agent_atomic_fmax_noret_f32__amdgpu_no_fine_grained_memory(p
; GFX12-NEXT: global_inv scope:SCOPE_DEV
; GFX12-NEXT: v_cmp_eq_u32_e32 vcc_lo, v2, v3
; GFX12-NEXT: v_mov_b32_e32 v3, v2
+; GFX12-NEXT: s_wait_alu 0xfffe
; GFX12-NEXT: s_or_b32 s0, vcc_lo, s0
-; GFX12-NEXT: s_delay_alu instid0(SALU_CYCLE_1)
+; GFX12-NEXT: s_wait_alu 0xfffe
; GFX12-NEXT: s_and_not1_b32 exec_lo, exec_lo, s0
; GFX12-NEXT: s_cbranch_execnz .LBB5_1
; GFX12-NEXT: ; %bb.2: ; %atomicrmw.end
; GFX12-NEXT: s_or_b32 exec_lo, exec_lo, s0
+; GFX12-NEXT: s_wait_alu 0xfffe
; GFX12-NEXT: s_setpc_b64 s[30:31]
;
; GFX940-LABEL: global_agent_atomic_fmax_noret_f32__amdgpu_no_fine_grained_memory:
@@ -758,13 +762,15 @@ define double @global_agent_atomic_fmax_ret_f64__amdgpu_no_fine_grained_memory(p
; GFX12-NEXT: s_wait_loadcnt 0x0
; GFX12-NEXT: global_inv scope:SCOPE_DEV
; GFX12-NEXT: v_cmp_eq_u64_e32 vcc_lo, v[4:5], v[6:7]
+; GFX12-NEXT: s_wait_alu 0xfffe
; GFX12-NEXT: s_or_b32 s0, vcc_lo, s0
-; GFX12-NEXT: s_delay_alu instid0(SALU_CYCLE_1)
+; GFX12-NEXT: s_wait_alu 0xfffe
; GFX12-NEXT: s_and_not1_b32 exec_lo, exec_lo, s0
; GFX12-NEXT: s_cbranch_execnz .LBB6_1
; GFX12-NEXT: ; %bb.2: ; %atomicrmw.end
; GFX12-NEXT: s_or_b32 exec_lo, exec_lo, s0
; GFX12-NEXT: v_dual_mov_b32 v0, v4 :: v_dual_mov_b32 v1, v5
+; GFX12-NEXT: s_wait_alu 0xfffe
; GFX12-NEXT: s_setpc_b64 s[30:31]
;
; GFX940-LABEL: global_agent_atomic_fmax_ret_f64__amdgpu_no_fine_grained_memory:
@@ -986,12 +992,14 @@ define void @global_agent_atomic_fmax_noret_f64__amdgpu_no_fine_grained_memory(p
; GFX12-NEXT: global_inv scope:SCOPE_DEV
; GFX12-NEXT: v_cmp_eq_u64_e32 vcc_lo, v[2:3], v[4:5]
; GFX12-NEXT: v_dual_mov_b32 v5, v3 :: v_dual_mov_b32 v4, v2
+; GFX12-NEXT: s_wait_alu 0xfffe
; GFX12-NEXT: s_or_b32 s0, vcc_lo, s0
-; GFX12-NEXT: s_delay_alu instid0(SALU_CYCLE_1)
+; GFX12-NEXT: s_wait_alu 0xfffe
; GFX12-NEXT: s_and_not1_b32 exec_lo, exec_lo, s0
; GFX12-NEXT: s_cbranch_execnz .LBB7_1
; GFX12-NEXT: ; %bb.2: ; %atomicrmw.end
; GFX12-NEXT: s_or_b32 exec_lo, exec_lo, s0
+; GFX12-NEXT: s_wait_alu 0xfffe
; GFX12-NEXT: s_setpc_b64 s[30:31]
;
; GFX940-LABEL: global_agent_atomic_fmax_noret_f64__amdgpu_no_fine_grained_memory:
@@ -1200,13 +1208,15 @@ define float @flat_agent_atomic_fmax_ret_f32__amdgpu_no_fine_grained_memory(ptr
; GFX12-NEXT: s_wait_loadcnt_dscnt 0x0
; GFX12-NEXT: global_inv scope:SCOPE_DEV
; GFX12-NEXT: v_cmp_eq_u32_e32 vcc_lo, v3, v4
+; GFX12-NEXT: s_wait_alu 0xfffe
; GFX12-NEXT: s_or_b32 s0, vcc_lo, s0
-; GFX12-NEXT: s_delay_alu instid0(SALU_CYCLE_1)
+; GFX12-NEXT: s_wait_alu 0xfffe
; GFX12-NEXT: s_and_not1_b32 exec_lo, exec_lo, s0
; GFX12-NEXT: s_cbranch_execnz .LBB8_1
; GFX12-NEXT: ; %bb.2: ; %atomicrmw.end
; GFX12-NEXT: s_or_b32 exec_lo, exec_lo, s0
; GFX12-NEXT: v_mov_b32_e32 v0, v3
+; GFX12-NEXT: s_wait_alu 0xfffe
; GFX12-NEXT: s_setpc_b64 s[30:31]
;
; GFX940-LABEL: flat_agent_atomic_fmax_ret_f32__amdgpu_no_fine_grained_memory:
@@ -1411,12 +1421,14 @@ define void @flat_agent_atomic_fmax_noret_f32__amdgpu_no_fine_grained_memory(ptr
; GFX12-NEXT: global_inv scope:SCOPE_DEV
; GFX12-NEXT: v_cmp_eq_u32_e32 vcc_lo, v2, v3
; GFX12-NEXT: v_mov_b32_e32 v3, v2
+; GFX12-NEXT: s_wait_alu 0xfffe
; GFX12-NEXT: s_or_b32 s0,...
[truncated]
@llvm/pr-subscribers-llvm-globalisel
Force-pushed from b80da10 to a6326c7
Ping
As suggested in review for PR #100067. Refactor code for S_GETPC_B64 bundle updates for use with multiple hazard mitigations.
Any SGPR read by a VALU can potentially obscure SALU writes to the same register. Insert s_wait_alu instructions to mitigate the hazard on affected paths. Compute a global cache of SGPRs with any VALU reads and use this to avoid inserting mitigation for SGPRs never accessed by VALUs. To avoid excessive search when compile time is a priority, implement a secondary mode where all SALU writes are mitigated.
- Encapsulate all use of encoding values in sgprPairNumber
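For context on the S_GETPC_B64 bundle refactor mentioned above, a hedged sketch of the case it handles (symbol name and offsets are illustrative): inserting a 4-byte wait inside an s_getpc_b64 bundle means the rel32 offsets of the following adds must grow by the inserted instruction's size.

    s_getpc_b64 s[0:1]
    s_wait_alu 0xfffe                      ; newly inserted 4-byte instruction
    s_add_u32 s0, s0, sym@rel32@lo+8       ; was sym@rel32@lo+4 before insertion
    s_addc_u32 s1, s1, sym@rel32@hi+16     ; was sym@rel32@hi+12 before insertion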
Force-pushed from a6326c7 to d44d2f7
✅ With the latest revision this PR passed the C/C++ code formatter.
As suggested in review for PR llvm#100067. Refactor code for S_GETPC_B64 bundle updates for use with multiple hazard mitigations.
Ping
As suggested in review for PR llvm#100067. Refactor code for S_GETPC_B64 bundle updates for use with multiple hazard mitigations. (cherry picked from commit 987ffc3)
Any SGPR read by a VALU can potentially obscure SALU writes to the same register. Insert s_wait_alu instructions to mitigate the hazard on affected paths. Compute a global cache of SGPRs with any VALU reads and use this to avoid inserting mitigation for SGPRs never accessed by VALUs. To avoid excessive search when compile time is a priority, implement a secondary mode where all SALU writes are mitigated. (cherry picked from commit 8662714)
As suggested in review for PR llvm#100067. Refactor code for S_GETPC_B64 bundle updates for use with multiple hazard mitigations. (cherry picked from commit 987ffc3)
Any SGPR read by a VALU can potentially obscure SALU writes to the same register. Insert s_wait_alu instructions to mitigate the hazard on affected paths. Compute a global cache of SGPRs with any VALU reads and use this to avoid inserting mitigation for SGPRs never accessed by VALUs. To avoid excessive search when compile time is a priority, implement a secondary mode where all SALU writes are mitigated. (cherry picked from commit 8662714)
Any SGPR read by a VALU can potentially obscure SALU writes to the same register.
Insert s_wait_alu instructions to mitigate the hazard on affected paths.
Compute a global cache of SGPRs with any VALU reads and use this to avoid inserting mitigation for SGPRs never accessed by VALUs.
To avoid excessive search when compile time is a priority, implement a secondary mode where all SALU writes are mitigated.
Co-authored-by: Shilei Tian [email protected]
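A minimal sketch of that compile-time-priority mode, assuming -O0 and illustrative registers: every SALU write to an SGPR other than EXEC/M0 is simply followed by the wait, with no hazard search.

    s_mov_b32 s4, s5                       ; SALU writes an SGPR (not EXEC/M0)
    s_wait_alu 0xfffe                      ; inserted unconditionally after the write at -O0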