AMDGPU: Expand remaining system atomic operations #122137

Open · arsenm wants to merge 1 commit into main from users/arsenm/amdgpu-expand-system-atomics

Conversation

arsenm (Contributor) commented Jan 8, 2025

System scope atomics need to use cmpxchg loops if we know
nothing about the allocation the address is from.
aea5980 started this; this change
expands the set to cover the remaining integer operations.

Don't expand xchg and add; those should, in theory, work over PCIe.
This is a pre-commit that will introduce performance regressions.
Subsequent changes will add handling of new atomicrmw metadata, which
will avoid the expansion.

Note that this still isn't conservative enough: we also need to expand
some device-scope atomics if the memory is in fine-grained remote
memory.
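
For illustration, here is a minimal IR sketch (assumed, not taken from this patch) of what the cmpxchg-loop expansion looks like for one of the newly covered operations. The loop shape follows LLVM's generic AtomicExpand lowering, but all names and block labels here are placeholders:

; Before: a system-scope (no syncscope) max, which this patch now expands.
define i32 @system_max(ptr addrspace(1) %ptr, i32 %val) {
  %res = atomicrmw max ptr addrspace(1) %ptr, i32 %val seq_cst
  ret i32 %res
}

; After AtomicExpand (sketch, assumed shape): an initial load feeding a
; compare-and-swap loop that retries until the exchange succeeds.
define i32 @system_max_expanded(ptr addrspace(1) %ptr, i32 %val) {
entry:
  %init = load i32, ptr addrspace(1) %ptr, align 4
  br label %atomicrmw.start

atomicrmw.start:
  %loaded = phi i32 [ %init, %entry ], [ %newloaded, %atomicrmw.start ]
  %cmp = icmp sgt i32 %loaded, %val
  %new = select i1 %cmp, i32 %loaded, i32 %val
  %pair = cmpxchg ptr addrspace(1) %ptr, i32 %loaded, i32 %new seq_cst seq_cst
  %newloaded = extractvalue { i32, i1 } %pair, 0
  %success = extractvalue { i32, i1 } %pair, 1
  br i1 %success, label %atomicrmw.end, label %atomicrmw.start

atomicrmw.end:
  ret i32 %newloaded
}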


llvmbot (Member) commented Jan 8, 2025

@llvm/pr-subscribers-llvm-globalisel

@llvm/pr-subscribers-backend-amdgpu

Author: Matt Arsenault (arsenm)


Patch is 1.58 MiB, truncated to 20.00 KiB below, full version: https://github.com/llvm/llvm-project/pull/122137.diff

13 Files Affected:

  • (modified) llvm/lib/Target/AMDGPU/SIISelLowering.cpp (+25-25)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/atomicrmw_udec_wrap.ll (+818-164)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/atomicrmw_uinc_wrap.ll (+1032-234)
  • (modified) llvm/test/CodeGen/AMDGPU/flat_atomics_i32_system.ll (+2865-327)
  • (modified) llvm/test/CodeGen/AMDGPU/flat_atomics_i64_system.ll (+4921-1617)
  • (modified) llvm/test/CodeGen/AMDGPU/flat_atomics_i64_system_noprivate.ll (+3888-444)
  • (modified) llvm/test/CodeGen/AMDGPU/global_atomics_i32_system.ll (+3154-502)
  • (modified) llvm/test/CodeGen/AMDGPU/global_atomics_i64_system.ll (+3848-608)
  • (modified) llvm/test/Transforms/AtomicExpand/AMDGPU/expand-atomic-i16-system.ll (+33-6)
  • (modified) llvm/test/Transforms/AtomicExpand/AMDGPU/expand-atomic-i32-system.ll (+264-24)
  • (modified) llvm/test/Transforms/AtomicExpand/AMDGPU/expand-atomic-i64-system.ll (+264-24)
  • (modified) llvm/test/Transforms/AtomicExpand/AMDGPU/expand-atomic-i8-system.ll (+33-6)
  • (modified) llvm/test/Transforms/AtomicExpand/AMDGPU/expand-atomicrmw-integer-ops-0-to-add-0.ll (+20-2)
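
The last test above exercises the special case called out in the description: InstCombine canonicalizes an idempotent atomicrmw add of 0 into an or of 0, but atomic or does not work over PCIe while atomic add does, so the backend undoes the canonicalization rather than emitting a cmpxchg loop. A hedged IR sketch of the rewrite (illustrative, not copied from the test):

; As canonicalized by InstCombine: a system-scope "or 0".
define i32 @system_or_zero(ptr addrspace(1) %p) {
  %old = atomicrmw or ptr addrspace(1) %p, i32 0 seq_cst
  ret i32 %old
}

; What the AMDGPU expansion turns it back into (sketch), since PCIe
; supports system-scope atomic add but not or:
define i32 @system_or_zero_lowered(ptr addrspace(1) %p) {
  %old = atomicrmw add ptr addrspace(1) %p, i32 0 seq_cst
  ret i32 %old
}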
diff --git a/llvm/lib/Target/AMDGPU/SIISelLowering.cpp b/llvm/lib/Target/AMDGPU/SIISelLowering.cpp
index 0ac84f4e1f02af..513251e398ad4d 100644
--- a/llvm/lib/Target/AMDGPU/SIISelLowering.cpp
+++ b/llvm/lib/Target/AMDGPU/SIISelLowering.cpp
@@ -16601,26 +16601,39 @@ SITargetLowering::shouldExpandAtomicRMWInIR(AtomicRMWInst *RMW) const {
 
   auto Op = RMW->getOperation();
   switch (Op) {
-  case AtomicRMWInst::Xchg: {
+  case AtomicRMWInst::Xchg:
     // PCIe supports add and xchg for system atomics.
     return isAtomicRMWLegalXChgTy(RMW)
                ? TargetLowering::AtomicExpansionKind::None
                : TargetLowering::AtomicExpansionKind::CmpXChg;
-  }
   case AtomicRMWInst::Add:
-  case AtomicRMWInst::And:
-  case AtomicRMWInst::UIncWrap:
-  case AtomicRMWInst::UDecWrap:
+    // PCIe supports add and xchg for system atomics.
     return atomicSupportedIfLegalIntType(RMW);
   case AtomicRMWInst::Sub:
+  case AtomicRMWInst::And:
   case AtomicRMWInst::Or:
-  case AtomicRMWInst::Xor: {
-    // Atomic sub/or/xor do not work over PCI express, but atomic add
-    // does. InstCombine transforms these with 0 to or, so undo that.
-    if (HasSystemScope && AMDGPU::isFlatGlobalAddrSpace(AS)) {
-      if (Constant *ConstVal = dyn_cast<Constant>(RMW->getValOperand());
-          ConstVal && ConstVal->isNullValue())
-        return AtomicExpansionKind::Expand;
+  case AtomicRMWInst::Xor:
+  case AtomicRMWInst::Max:
+  case AtomicRMWInst::Min:
+  case AtomicRMWInst::UMax:
+  case AtomicRMWInst::UMin:
+  case AtomicRMWInst::UIncWrap:
+  case AtomicRMWInst::UDecWrap: {
+    if (AMDGPU::isFlatGlobalAddrSpace(AS) ||
+        AS == AMDGPUAS::BUFFER_FAT_POINTER) {
+      // Always expand system scope atomics.
+      if (HasSystemScope) {
+        if (Op == AtomicRMWInst::Sub || Op == AtomicRMWInst::Or ||
+            Op == AtomicRMWInst::Xor) {
+          // Atomic sub/or/xor do not work over PCI express, but atomic add
+          // does. InstCombine transforms these with 0 to or, so undo that.
+          if (Constant *ConstVal = dyn_cast<Constant>(RMW->getValOperand());
+              ConstVal && ConstVal->isNullValue())
+            return AtomicExpansionKind::Expand;
+        }
+
+        return AtomicExpansionKind::CmpXChg;
+      }
     }
 
     return atomicSupportedIfLegalIntType(RMW);
@@ -16775,19 +16788,6 @@ SITargetLowering::shouldExpandAtomicRMWInIR(AtomicRMWInst *RMW) const {
 
     return AtomicExpansionKind::CmpXChg;
   }
-  case AtomicRMWInst::Min:
-  case AtomicRMWInst::Max:
-  case AtomicRMWInst::UMin:
-  case AtomicRMWInst::UMax: {
-    if (AMDGPU::isFlatGlobalAddrSpace(AS) ||
-        AS == AMDGPUAS::BUFFER_FAT_POINTER) {
-      // Always expand system scope min/max atomics.
-      if (HasSystemScope)
-        return AtomicExpansionKind::CmpXChg;
-    }
-
-    return atomicSupportedIfLegalIntType(RMW);
-  }
   case AtomicRMWInst::Nand:
   case AtomicRMWInst::FSub:
   default:
diff --git a/llvm/test/CodeGen/AMDGPU/GlobalISel/atomicrmw_udec_wrap.ll b/llvm/test/CodeGen/AMDGPU/GlobalISel/atomicrmw_udec_wrap.ll
index b96fc71be057e7..35aa3cfbc841c8 100644
--- a/llvm/test/CodeGen/AMDGPU/GlobalISel/atomicrmw_udec_wrap.ll
+++ b/llvm/test/CodeGen/AMDGPU/GlobalISel/atomicrmw_udec_wrap.ll
@@ -436,72 +436,161 @@ define amdgpu_kernel void @global_atomic_dec_ret_i32_offset_system(ptr addrspace
 ; CI-LABEL: global_atomic_dec_ret_i32_offset_system:
 ; CI:       ; %bb.0:
 ; CI-NEXT:    s_load_dwordx4 s[0:3], s[8:9], 0x0
-; CI-NEXT:    v_mov_b32_e32 v2, 42
+; CI-NEXT:    v_not_b32_e32 v2, 41
 ; CI-NEXT:    s_waitcnt lgkmcnt(0)
-; CI-NEXT:    s_add_u32 s2, s2, 16
-; CI-NEXT:    s_addc_u32 s3, s3, 0
-; CI-NEXT:    v_mov_b32_e32 v0, s2
-; CI-NEXT:    v_mov_b32_e32 v1, s3
-; CI-NEXT:    flat_atomic_dec v2, v[0:1], v2 glc
+; CI-NEXT:    s_load_dword s6, s[2:3], 0x4
+; CI-NEXT:    s_add_u32 s4, s2, 16
+; CI-NEXT:    s_addc_u32 s5, s3, 0
+; CI-NEXT:    v_mov_b32_e32 v0, s4
+; CI-NEXT:    s_mov_b64 s[2:3], 0
+; CI-NEXT:    v_mov_b32_e32 v1, s5
+; CI-NEXT:    s_waitcnt lgkmcnt(0)
+; CI-NEXT:    v_mov_b32_e32 v3, s6
+; CI-NEXT:  .LBB6_1: ; %atomicrmw.start
+; CI-NEXT:    ; =>This Inner Loop Header: Depth=1
+; CI-NEXT:    v_mov_b32_e32 v4, v3
+; CI-NEXT:    v_add_i32_e32 v3, vcc, -1, v4
+; CI-NEXT:    v_add_i32_e32 v5, vcc, 0xffffffd5, v4
+; CI-NEXT:    v_cmp_lt_u32_e32 vcc, v5, v2
+; CI-NEXT:    v_cndmask_b32_e64 v3, v3, 42, vcc
+; CI-NEXT:    flat_atomic_cmpswap v3, v[0:1], v[3:4] glc
 ; CI-NEXT:    s_waitcnt vmcnt(0)
 ; CI-NEXT:    buffer_wbinvl1_vol
+; CI-NEXT:    v_cmp_eq_u32_e32 vcc, v3, v4
+; CI-NEXT:    s_or_b64 s[2:3], vcc, s[2:3]
+; CI-NEXT:    s_andn2_b64 exec, exec, s[2:3]
+; CI-NEXT:    s_cbranch_execnz .LBB6_1
+; CI-NEXT:  ; %bb.2: ; %atomicrmw.end
+; CI-NEXT:    s_or_b64 exec, exec, s[2:3]
 ; CI-NEXT:    v_mov_b32_e32 v0, s0
 ; CI-NEXT:    v_mov_b32_e32 v1, s1
-; CI-NEXT:    flat_store_dword v[0:1], v2
+; CI-NEXT:    flat_store_dword v[0:1], v3
 ; CI-NEXT:    s_endpgm
 ;
 ; VI-LABEL: global_atomic_dec_ret_i32_offset_system:
 ; VI:       ; %bb.0:
 ; VI-NEXT:    s_load_dwordx4 s[0:3], s[8:9], 0x0
-; VI-NEXT:    v_mov_b32_e32 v2, 42
+; VI-NEXT:    v_not_b32_e32 v2, 41
 ; VI-NEXT:    s_waitcnt lgkmcnt(0)
-; VI-NEXT:    s_add_u32 s2, s2, 16
-; VI-NEXT:    s_addc_u32 s3, s3, 0
-; VI-NEXT:    v_mov_b32_e32 v0, s2
-; VI-NEXT:    v_mov_b32_e32 v1, s3
-; VI-NEXT:    flat_atomic_dec v2, v[0:1], v2 glc
+; VI-NEXT:    s_load_dword s6, s[2:3], 0x10
+; VI-NEXT:    s_add_u32 s4, s2, 16
+; VI-NEXT:    s_addc_u32 s5, s3, 0
+; VI-NEXT:    v_mov_b32_e32 v0, s4
+; VI-NEXT:    s_mov_b64 s[2:3], 0
+; VI-NEXT:    v_mov_b32_e32 v1, s5
+; VI-NEXT:    s_waitcnt lgkmcnt(0)
+; VI-NEXT:    v_mov_b32_e32 v3, s6
+; VI-NEXT:  .LBB6_1: ; %atomicrmw.start
+; VI-NEXT:    ; =>This Inner Loop Header: Depth=1
+; VI-NEXT:    v_mov_b32_e32 v4, v3
+; VI-NEXT:    v_add_u32_e32 v3, vcc, -1, v4
+; VI-NEXT:    v_add_u32_e32 v5, vcc, 0xffffffd5, v4
+; VI-NEXT:    v_cmp_lt_u32_e32 vcc, v5, v2
+; VI-NEXT:    v_cndmask_b32_e64 v3, v3, 42, vcc
+; VI-NEXT:    flat_atomic_cmpswap v3, v[0:1], v[3:4] glc
 ; VI-NEXT:    s_waitcnt vmcnt(0)
 ; VI-NEXT:    buffer_wbinvl1_vol
+; VI-NEXT:    v_cmp_eq_u32_e32 vcc, v3, v4
+; VI-NEXT:    s_or_b64 s[2:3], vcc, s[2:3]
+; VI-NEXT:    s_andn2_b64 exec, exec, s[2:3]
+; VI-NEXT:    s_cbranch_execnz .LBB6_1
+; VI-NEXT:  ; %bb.2: ; %atomicrmw.end
+; VI-NEXT:    s_or_b64 exec, exec, s[2:3]
 ; VI-NEXT:    v_mov_b32_e32 v0, s0
 ; VI-NEXT:    v_mov_b32_e32 v1, s1
-; VI-NEXT:    flat_store_dword v[0:1], v2
+; VI-NEXT:    flat_store_dword v[0:1], v3
 ; VI-NEXT:    s_endpgm
 ;
 ; GFX9-LABEL: global_atomic_dec_ret_i32_offset_system:
 ; GFX9:       ; %bb.0:
 ; GFX9-NEXT:    s_load_dwordx4 s[0:3], s[8:9], 0x0
-; GFX9-NEXT:    v_mov_b32_e32 v0, 42
+; GFX9-NEXT:    s_mov_b64 s[4:5], 0
+; GFX9-NEXT:    v_not_b32_e32 v0, 41
 ; GFX9-NEXT:    v_mov_b32_e32 v1, 0
 ; GFX9-NEXT:    s_waitcnt lgkmcnt(0)
-; GFX9-NEXT:    global_atomic_dec v0, v1, v0, s[2:3] offset:16 glc
+; GFX9-NEXT:    s_load_dword s6, s[2:3], 0x10
+; GFX9-NEXT:    s_waitcnt lgkmcnt(0)
+; GFX9-NEXT:    v_mov_b32_e32 v2, s6
+; GFX9-NEXT:  .LBB6_1: ; %atomicrmw.start
+; GFX9-NEXT:    ; =>This Inner Loop Header: Depth=1
+; GFX9-NEXT:    v_mov_b32_e32 v3, v2
+; GFX9-NEXT:    v_add_u32_e32 v4, 0xffffffd5, v3
+; GFX9-NEXT:    v_add_u32_e32 v2, -1, v3
+; GFX9-NEXT:    v_cmp_lt_u32_e32 vcc, v4, v0
+; GFX9-NEXT:    v_cndmask_b32_e64 v2, v2, 42, vcc
+; GFX9-NEXT:    global_atomic_cmpswap v2, v1, v[2:3], s[2:3] offset:16 glc
 ; GFX9-NEXT:    s_waitcnt vmcnt(0)
 ; GFX9-NEXT:    buffer_wbinvl1_vol
-; GFX9-NEXT:    global_store_dword v1, v0, s[0:1]
+; GFX9-NEXT:    v_cmp_eq_u32_e32 vcc, v2, v3
+; GFX9-NEXT:    s_or_b64 s[4:5], vcc, s[4:5]
+; GFX9-NEXT:    s_andn2_b64 exec, exec, s[4:5]
+; GFX9-NEXT:    s_cbranch_execnz .LBB6_1
+; GFX9-NEXT:  ; %bb.2: ; %atomicrmw.end
+; GFX9-NEXT:    s_or_b64 exec, exec, s[4:5]
+; GFX9-NEXT:    v_mov_b32_e32 v0, 0
+; GFX9-NEXT:    global_store_dword v0, v2, s[0:1]
 ; GFX9-NEXT:    s_endpgm
 ;
 ; GFX10-LABEL: global_atomic_dec_ret_i32_offset_system:
 ; GFX10:       ; %bb.0:
 ; GFX10-NEXT:    s_load_dwordx4 s[0:3], s[8:9], 0x0
-; GFX10-NEXT:    v_mov_b32_e32 v0, 42
-; GFX10-NEXT:    v_mov_b32_e32 v1, 0
+; GFX10-NEXT:    v_mov_b32_e32 v0, 0
 ; GFX10-NEXT:    s_waitcnt lgkmcnt(0)
-; GFX10-NEXT:    global_atomic_dec v0, v1, v0, s[2:3] offset:16 glc
+; GFX10-NEXT:    s_load_dword s4, s[2:3], 0x10
+; GFX10-NEXT:    s_waitcnt lgkmcnt(0)
+; GFX10-NEXT:    v_mov_b32_e32 v1, s4
+; GFX10-NEXT:    s_mov_b32 s4, 0
+; GFX10-NEXT:  .LBB6_1: ; %atomicrmw.start
+; GFX10-NEXT:    ; =>This Inner Loop Header: Depth=1
+; GFX10-NEXT:    v_mov_b32_e32 v2, v1
+; GFX10-NEXT:    v_add_nc_u32_e32 v1, 0xffffffd5, v2
+; GFX10-NEXT:    v_add_nc_u32_e32 v3, -1, v2
+; GFX10-NEXT:    v_cmp_gt_u32_e32 vcc_lo, 0xffffffd6, v1
+; GFX10-NEXT:    v_cndmask_b32_e64 v1, v3, 42, vcc_lo
+; GFX10-NEXT:    global_atomic_cmpswap v1, v0, v[1:2], s[2:3] offset:16 glc
 ; GFX10-NEXT:    s_waitcnt vmcnt(0)
 ; GFX10-NEXT:    buffer_gl1_inv
 ; GFX10-NEXT:    buffer_gl0_inv
-; GFX10-NEXT:    global_store_dword v1, v0, s[0:1]
+; GFX10-NEXT:    v_cmp_eq_u32_e32 vcc_lo, v1, v2
+; GFX10-NEXT:    s_or_b32 s4, vcc_lo, s4
+; GFX10-NEXT:    s_andn2_b32 exec_lo, exec_lo, s4
+; GFX10-NEXT:    s_cbranch_execnz .LBB6_1
+; GFX10-NEXT:  ; %bb.2: ; %atomicrmw.end
+; GFX10-NEXT:    s_or_b32 exec_lo, exec_lo, s4
+; GFX10-NEXT:    v_mov_b32_e32 v0, 0
+; GFX10-NEXT:    global_store_dword v0, v1, s[0:1]
 ; GFX10-NEXT:    s_endpgm
 ;
 ; GFX11-LABEL: global_atomic_dec_ret_i32_offset_system:
 ; GFX11:       ; %bb.0:
 ; GFX11-NEXT:    s_load_b128 s[0:3], s[4:5], 0x0
-; GFX11-NEXT:    v_dual_mov_b32 v0, 42 :: v_dual_mov_b32 v1, 0
 ; GFX11-NEXT:    s_waitcnt lgkmcnt(0)
-; GFX11-NEXT:    global_atomic_dec_u32 v0, v1, v0, s[2:3] offset:16 glc
+; GFX11-NEXT:    s_load_b32 s4, s[2:3], 0x10
+; GFX11-NEXT:    s_waitcnt lgkmcnt(0)
+; GFX11-NEXT:    v_dual_mov_b32 v0, 0 :: v_dual_mov_b32 v1, s4
+; GFX11-NEXT:    s_mov_b32 s4, 0
+; GFX11-NEXT:  .LBB6_1: ; %atomicrmw.start
+; GFX11-NEXT:    ; =>This Inner Loop Header: Depth=1
+; GFX11-NEXT:    s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
+; GFX11-NEXT:    v_mov_b32_e32 v2, v1
+; GFX11-NEXT:    v_add_nc_u32_e32 v1, 0xffffffd5, v2
+; GFX11-NEXT:    v_add_nc_u32_e32 v3, -1, v2
+; GFX11-NEXT:    s_delay_alu instid0(VALU_DEP_2) | instskip(NEXT) | instid1(VALU_DEP_2)
+; GFX11-NEXT:    v_cmp_gt_u32_e32 vcc_lo, 0xffffffd6, v1
+; GFX11-NEXT:    v_cndmask_b32_e64 v1, v3, 42, vcc_lo
+; GFX11-NEXT:    global_atomic_cmpswap_b32 v1, v0, v[1:2], s[2:3] offset:16 glc
 ; GFX11-NEXT:    s_waitcnt vmcnt(0)
 ; GFX11-NEXT:    buffer_gl1_inv
 ; GFX11-NEXT:    buffer_gl0_inv
-; GFX11-NEXT:    global_store_b32 v1, v0, s[0:1]
+; GFX11-NEXT:    v_cmp_eq_u32_e32 vcc_lo, v1, v2
+; GFX11-NEXT:    s_or_b32 s4, vcc_lo, s4
+; GFX11-NEXT:    s_delay_alu instid0(SALU_CYCLE_1)
+; GFX11-NEXT:    s_and_not1_b32 exec_lo, exec_lo, s4
+; GFX11-NEXT:    s_cbranch_execnz .LBB6_1
+; GFX11-NEXT:  ; %bb.2: ; %atomicrmw.end
+; GFX11-NEXT:    s_or_b32 exec_lo, exec_lo, s4
+; GFX11-NEXT:    v_mov_b32_e32 v0, 0
+; GFX11-NEXT:    global_store_b32 v0, v1, s[0:1]
 ; GFX11-NEXT:    s_endpgm
   %gep = getelementptr i32, ptr addrspace(1) %ptr, i32 4
   %result = atomicrmw udec_wrap ptr addrspace(1) %gep, i32 42 seq_cst, align 4
@@ -642,63 +731,144 @@ define amdgpu_kernel void @global_atomic_dec_noret_i32_offset_system(ptr addrspa
 ; CI-LABEL: global_atomic_dec_noret_i32_offset_system:
 ; CI:       ; %bb.0:
 ; CI-NEXT:    s_load_dwordx2 s[0:1], s[8:9], 0x0
-; CI-NEXT:    v_mov_b32_e32 v2, 42
+; CI-NEXT:    v_not_b32_e32 v4, 41
 ; CI-NEXT:    s_waitcnt lgkmcnt(0)
-; CI-NEXT:    s_add_u32 s0, s0, 16
-; CI-NEXT:    s_addc_u32 s1, s1, 0
-; CI-NEXT:    v_mov_b32_e32 v0, s0
-; CI-NEXT:    v_mov_b32_e32 v1, s1
-; CI-NEXT:    flat_atomic_dec v[0:1], v2
+; CI-NEXT:    s_load_dword s4, s[0:1], 0x4
+; CI-NEXT:    s_add_u32 s2, s0, 16
+; CI-NEXT:    s_addc_u32 s3, s1, 0
+; CI-NEXT:    v_mov_b32_e32 v0, s2
+; CI-NEXT:    s_mov_b64 s[0:1], 0
+; CI-NEXT:    v_mov_b32_e32 v1, s3
+; CI-NEXT:    s_waitcnt lgkmcnt(0)
+; CI-NEXT:    v_mov_b32_e32 v3, s4
+; CI-NEXT:  .LBB9_1: ; %atomicrmw.start
+; CI-NEXT:    ; =>This Inner Loop Header: Depth=1
+; CI-NEXT:    v_add_i32_e32 v2, vcc, -1, v3
+; CI-NEXT:    v_add_i32_e32 v5, vcc, 0xffffffd5, v3
+; CI-NEXT:    v_cmp_lt_u32_e32 vcc, v5, v4
+; CI-NEXT:    v_cndmask_b32_e64 v2, v2, 42, vcc
+; CI-NEXT:    flat_atomic_cmpswap v2, v[0:1], v[2:3] glc
 ; CI-NEXT:    s_waitcnt vmcnt(0)
 ; CI-NEXT:    buffer_wbinvl1_vol
+; CI-NEXT:    v_cmp_eq_u32_e32 vcc, v2, v3
+; CI-NEXT:    s_or_b64 s[0:1], vcc, s[0:1]
+; CI-NEXT:    v_mov_b32_e32 v3, v2
+; CI-NEXT:    s_andn2_b64 exec, exec, s[0:1]
+; CI-NEXT:    s_cbranch_execnz .LBB9_1
+; CI-NEXT:  ; %bb.2: ; %atomicrmw.end
 ; CI-NEXT:    s_endpgm
 ;
 ; VI-LABEL: global_atomic_dec_noret_i32_offset_system:
 ; VI:       ; %bb.0:
 ; VI-NEXT:    s_load_dwordx2 s[0:1], s[8:9], 0x0
-; VI-NEXT:    v_mov_b32_e32 v2, 42
+; VI-NEXT:    v_not_b32_e32 v4, 41
 ; VI-NEXT:    s_waitcnt lgkmcnt(0)
-; VI-NEXT:    s_add_u32 s0, s0, 16
-; VI-NEXT:    s_addc_u32 s1, s1, 0
-; VI-NEXT:    v_mov_b32_e32 v0, s0
-; VI-NEXT:    v_mov_b32_e32 v1, s1
-; VI-NEXT:    flat_atomic_dec v[0:1], v2
+; VI-NEXT:    s_load_dword s4, s[0:1], 0x10
+; VI-NEXT:    s_add_u32 s2, s0, 16
+; VI-NEXT:    s_addc_u32 s3, s1, 0
+; VI-NEXT:    v_mov_b32_e32 v0, s2
+; VI-NEXT:    s_mov_b64 s[0:1], 0
+; VI-NEXT:    v_mov_b32_e32 v1, s3
+; VI-NEXT:    s_waitcnt lgkmcnt(0)
+; VI-NEXT:    v_mov_b32_e32 v3, s4
+; VI-NEXT:  .LBB9_1: ; %atomicrmw.start
+; VI-NEXT:    ; =>This Inner Loop Header: Depth=1
+; VI-NEXT:    v_add_u32_e32 v2, vcc, -1, v3
+; VI-NEXT:    v_add_u32_e32 v5, vcc, 0xffffffd5, v3
+; VI-NEXT:    v_cmp_lt_u32_e32 vcc, v5, v4
+; VI-NEXT:    v_cndmask_b32_e64 v2, v2, 42, vcc
+; VI-NEXT:    flat_atomic_cmpswap v2, v[0:1], v[2:3] glc
 ; VI-NEXT:    s_waitcnt vmcnt(0)
 ; VI-NEXT:    buffer_wbinvl1_vol
+; VI-NEXT:    v_cmp_eq_u32_e32 vcc, v2, v3
+; VI-NEXT:    s_or_b64 s[0:1], vcc, s[0:1]
+; VI-NEXT:    v_mov_b32_e32 v3, v2
+; VI-NEXT:    s_andn2_b64 exec, exec, s[0:1]
+; VI-NEXT:    s_cbranch_execnz .LBB9_1
+; VI-NEXT:  ; %bb.2: ; %atomicrmw.end
 ; VI-NEXT:    s_endpgm
 ;
 ; GFX9-LABEL: global_atomic_dec_noret_i32_offset_system:
 ; GFX9:       ; %bb.0:
 ; GFX9-NEXT:    s_load_dwordx2 s[0:1], s[8:9], 0x0
-; GFX9-NEXT:    v_mov_b32_e32 v0, 42
-; GFX9-NEXT:    v_mov_b32_e32 v1, 0
+; GFX9-NEXT:    s_mov_b64 s[2:3], 0
+; GFX9-NEXT:    v_not_b32_e32 v2, 41
+; GFX9-NEXT:    v_mov_b32_e32 v3, 0
 ; GFX9-NEXT:    s_waitcnt lgkmcnt(0)
-; GFX9-NEXT:    global_atomic_dec v1, v0, s[0:1] offset:16
+; GFX9-NEXT:    s_load_dword s4, s[0:1], 0x10
+; GFX9-NEXT:    s_waitcnt lgkmcnt(0)
+; GFX9-NEXT:    v_mov_b32_e32 v1, s4
+; GFX9-NEXT:  .LBB9_1: ; %atomicrmw.start
+; GFX9-NEXT:    ; =>This Inner Loop Header: Depth=1
+; GFX9-NEXT:    v_add_u32_e32 v4, 0xffffffd5, v1
+; GFX9-NEXT:    v_add_u32_e32 v0, -1, v1
+; GFX9-NEXT:    v_cmp_lt_u32_e32 vcc, v4, v2
+; GFX9-NEXT:    v_cndmask_b32_e64 v0, v0, 42, vcc
+; GFX9-NEXT:    global_atomic_cmpswap v0, v3, v[0:1], s[0:1] offset:16 glc
 ; GFX9-NEXT:    s_waitcnt vmcnt(0)
 ; GFX9-NEXT:    buffer_wbinvl1_vol
+; GFX9-NEXT:    v_cmp_eq_u32_e32 vcc, v0, v1
+; GFX9-NEXT:    s_or_b64 s[2:3], vcc, s[2:3]
+; GFX9-NEXT:    v_mov_b32_e32 v1, v0
+; GFX9-NEXT:    s_andn2_b64 exec, exec, s[2:3]
+; GFX9-NEXT:    s_cbranch_execnz .LBB9_1
+; GFX9-NEXT:  ; %bb.2: ; %atomicrmw.end
 ; GFX9-NEXT:    s_endpgm
 ;
 ; GFX10-LABEL: global_atomic_dec_noret_i32_offset_system:
 ; GFX10:       ; %bb.0:
 ; GFX10-NEXT:    s_load_dwordx2 s[0:1], s[8:9], 0x0
-; GFX10-NEXT:    v_mov_b32_e32 v0, 42
-; GFX10-NEXT:    v_mov_b32_e32 v1, 0
+; GFX10-NEXT:    v_mov_b32_e32 v2, 0
 ; GFX10-NEXT:    s_waitcnt lgkmcnt(0)
-; GFX10-NEXT:    global_atomic_dec v1, v0, s[0:1] offset:16
-; GFX10-NEXT:    s_waitcnt_vscnt null, 0x0
+; GFX10-NEXT:    s_load_dword s2, s[0:1], 0x10
+; GFX10-NEXT:    s_waitcnt lgkmcnt(0)
+; GFX10-NEXT:    v_mov_b32_e32 v1, s2
+; GFX10-NEXT:    s_mov_b32 s2, 0
+; GFX10-NEXT:  .LBB9_1: ; %atomicrmw.start
+; GFX10-NEXT:    ; =>This Inner Loop Header: Depth=1
+; GFX10-NEXT:    v_add_nc_u32_e32 v0, 0xffffffd5, v1
+; GFX10-NEXT:    v_add_nc_u32_e32 v3, -1, v1
+; GFX10-NEXT:    v_cmp_gt_u32_e32 vcc_lo, 0xffffffd6, v0
+; GFX10-NEXT:    v_cndmask_b32_e64 v0, v3, 42, vcc_lo
+; GFX10-NEXT:    global_atomic_cmpswap v0, v2, v[0:1], s[0:1] offset:16 glc
+; GFX10-NEXT:    s_waitcnt vmcnt(0)
 ; GFX10-NEXT:    buffer_gl1_inv
 ; GFX10-NEXT:    buffer_gl0_inv
+; GFX10-NEXT:    v_cmp_eq_u32_e32 vcc_lo, v0, v1
+; GFX10-NEXT:    v_mov_b32_e32 v1, v0
+; GFX10-NEXT:    s_or_b32 s2, vcc_lo, s2
+; GFX10-NEXT:    s_andn2_b32 exec_lo, exec_lo, s2
+; GFX10-NEXT:    s_cbranch_execnz .LBB9_1
+; GFX10-NEXT:  ; %bb.2: ; %atomicrmw.end
 ; GFX10-NEXT:    s_endpgm
 ;
 ; GFX11-LABEL: global_atomic_dec_noret_i32_offset_system:
 ; GFX11:       ; %bb.0:
 ; GFX11-NEXT:    s_load_b64 s[0:1], s[4:5], 0x0
-; GFX11-NEXT:    v_dual_mov_b32 v0, 42 :: v_dual_mov_b32 v1, 0
 ; GFX11-NEXT:    s_waitcnt lgkmcnt(0)
-; GFX11-NEXT:    global_atomic_dec_u32 v1, v0, s[0:1] offset:16
-; GFX11-NEXT:    s_waitcnt_vscnt null, 0x0
+; GFX11-NEXT:    s_load_b32 s2, s[0:1], 0x10
+; GFX11-NEXT:    s_waitcnt lgkmcnt(0)
+; GFX11-NEXT:    v_dual_mov_b32 v2, 0 :: v_dual_mov_b32 v1, s2
+; GFX11-NEXT:    s_mov_b32 s2, 0
+; GFX11-NEXT:  .LBB9_1: ; %atomicrmw.start
+; GFX11-NEXT:    ; =>This Inner Loop Header: Depth=1
+; GFX11-NEXT:    s_delay_alu instid0(VALU_DEP_1) | instskip(SKIP_1) | instid1(VALU_DEP_2)
+; GFX11-NEXT:    v_add_nc_u32_e32 v0, 0xffffffd5, v1
+; GFX11-NEXT:    v_add_nc_u32_e32 v3, -1, v1
+; GFX11-NEXT:    v_cmp_gt_u32_e32 vcc_lo, 0xffffffd6, v0
+; GFX11-NEXT:    s_delay_alu instid0(VALU_DEP_2)
+; GFX11-NEXT:    v_cndmask_b32_e64 v0, v3, 42, vcc_lo
+; GFX11-NEXT:    global_atomic_cmpswap_b32 v0, v2, v[0:1], s[0:1] offset:16 glc
+; GFX11-NEXT:    s_waitcnt vmcnt(0)
 ; GFX11-NEXT:    buffer_gl1_inv
 ; GFX11-NEXT:    buffer_gl0_inv
+; GFX11-NEXT:    v_cmp_eq_u32_e32 vcc_lo, v0, v1
+; GFX11-NEXT:    v_mov_b32_e32 v1, v0
+; GFX11-NEXT:    s_or_b32 s2, vcc_lo, s2
+; GFX11-NEXT:    s_delay_alu instid0(SALU_CYCLE_1)
+; GFX11-NEXT:    s_and_not1_b32 exec_lo, exec_lo, s2
+; GFX11-NEXT:    s_cbranch_execnz .LBB9_1
+; GFX11-NEXT:  ; %bb.2: ; %atomicrmw.end
 ; GFX11-NEXT:    s_endpgm
   %gep = getelementptr i32, ptr addrspace(1) %ptr, i32 4
   %result = atomicrmw udec_wrap ptr addrspace(1) %gep, i32 42 seq_cst, align 4
@@ -1045,65 +1215,128 @@ define amdgpu_kernel void @flat_atomic_dec_ret_i32_offset_system(ptr %out, ptr %
 ; CI-LABEL: flat_atomic_dec_ret_i32_offset_system:
 ; CI:       ; %bb.0:
 ; CI-NEXT:    s_load_dwordx4 s[0:3], s[8:9], 0x0
-; CI-NEXT:    v_mov_b32_e32 v2, 42
+; CI-NEXT:    v_not_b32_e32 v2, 41
 ; CI-NEXT:    s_waitcnt lgkmcnt(0)
 ; CI-NEXT:    s_add_u32 s2, s2, 16
 ; CI-NEXT:    s_addc_u32 s3, s3, 0
 ; CI-NEXT:    v_mov_b32_e32 v0, s2
 ; CI-NEXT:    v_mov_b32_e32 v1, s3
-; CI-NEXT:    flat_atomic_dec v2, v[0:1], v2 glc
+; CI-NEXT:    flat_load_dword v3, v[0:1]
+; CI-NEXT:    s_mov_b64 s[2:3], 0
+; CI-NEXT:  .LBB14_1: ; %atomicrmw.start
+; CI-NEXT:    ; =>This Inner Loop Header: Depth=1
+; CI-NEXT:    s_waitcnt vmcnt(0) lgkmcnt(0)
+; CI-NEXT:    v_mov_b32_e32 v4, v3
+; CI-NEXT:    v_add_i32_e32 v3, vcc, -1, v4
+; CI-NEXT:    v_add_i32_e32 v5, vcc, 0xffffffd5, v4
+; CI-NEXT:    v_cmp_lt_u32_e32 vcc, v5, v2
+; CI-NEXT:    v_cndmask_b32_e64 v3, v3, 42, vcc
+; CI-NEXT:    flat_atomic_cmpswap v3, v[0:1], v[3:4] glc
 ; CI-NEXT:    s_waitcnt vmcnt(0) lgkmcnt(0)
 ; CI-NEXT:    buffer_wbinvl1_vol
+; CI-NEXT:    v_cmp_eq_u32_e32 vcc, v3, v4
+; CI-NEXT:    s_or_b64 s[2:3], vcc, s[2:3]
+; CI-NEXT:    s_andn2_b64 exec, exec, s[2:3]
+; CI-NEXT:    s_cbranch_execnz .LBB14_1
+; CI-NEXT:  ; %bb.2: ; %atomicrmw.end
+; CI-NEXT:    s_or_b64 exec, exec, s[2:3]
 ; CI-NEXT:    v_mov_b32_e32 v0, s0
 ; CI-NEXT:    v_mov_b32_e32 v1, s1
-; CI-NEXT:    flat_store_dword v[0:1], v2
+; CI-NEXT:    flat_store_dword v[0:1], v3
 ; CI-NEXT:    s_endpgm
 ;
 ; VI-LABEL: flat_atomic_dec_ret_i32_offset_system:
 ; VI:       ; %bb.0:
 ; VI-NEXT:    s_load_dwordx4 s[0:3], s[8:9], 0x0
-; VI-NEXT:    v_mov_b32_e32 v2, 42
+; VI-NEXT:    v_not_b32_e32 v2, 41
 ; VI-NEXT:    s_waitcnt lgkmcnt(0)
 ; VI-NEXT:    s_add_u32 s2, s2, 16
 ; VI-NEXT:    s_addc_u32 s3, s3, 0
 ; VI-NEXT:    v_mov_b32_e32 v0, s2
 ; VI-NEXT:    v_mov_b32_e32 v1, s3
-; VI-NEXT:    flat_atomic_dec v2, v[0:1], v2 glc
+; VI-N...
[truncated]
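
To connect the assembly above back to the IR: each of these tests runs atomicrmw udec_wrap with i32 42 at seq_cst, whose LangRef semantics are (old == 0 || old u> val) ? val : old - 1. Below is a hedged sketch of the cmpxchg loop AtomicExpand builds for it; value and block names are illustrative:

define i32 @udec_wrap_system(ptr addrspace(1) %gep) {
entry:
  %init = load i32, ptr addrspace(1) %gep, align 4
  br label %atomicrmw.start

atomicrmw.start:
  %loaded = phi i32 [ %init, %entry ], [ %newloaded, %atomicrmw.start ]
  %dec = add i32 %loaded, -1
  %iszero = icmp eq i32 %loaded, 0
  %above = icmp ugt i32 %loaded, 42
  %wrap = or i1 %iszero, %above
  %new = select i1 %wrap, i32 42, i32 %dec
  %pair = cmpxchg ptr addrspace(1) %gep, i32 %loaded, i32 %new seq_cst seq_cst
  %newloaded = extractvalue { i32, i1 } %pair, 0
  %success = extractvalue { i32, i1 } %pair, 1
  br i1 %success, label %atomicrmw.end, label %atomicrmw.start

atomicrmw.end:
  ret i32 %newloaded
}

In the generated code, the v_not_b32 41 / add of 0xffffffd5 / unsigned-compare sequence is the backend folding the %iszero and %above checks into a single test of (old - 43) u< 0xffffffd6, and v_cndmask then selects between old - 1 and 42.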

arsenm force-pushed the users/arsenm/amdgpu-expand-system-atomics branch from 9ba2a99 to e2b5611 on February 24, 2025 15:02
arsenm force-pushed the users/arsenm/amdgpu-expand-system-atomics branch from e2b5611 to 493d7e3 on March 3, 2025 16:23
arsenm force-pushed the users/arsenm/amdgpu-expand-system-atomics branch from 493d7e3 to 976f7b9 on March 17, 2025 04:29
arsenm force-pushed the users/arsenm/amdgpu-expand-system-atomics branch from 976f7b9 to f7226e9 on May 20, 2025 16:00