Skip to content

[AMDGPULowerBufferFatPointers] Fix offset-only ptrtoint #95543

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 1 commit into from
Jun 14, 2024

Conversation

nikic
Copy link
Contributor

@nikic nikic commented Jun 14, 2024

For ptrtoint that truncates to the offset only, the expansion generated a shift by the bit width, which is poison. Instead, we should return the offset directly.

(The same problem exists for the constant expression case, but I plan to address that separately, and more comprehensively.)

For ptrtoint that truncates to the offset only, the expansion
generated a shift by the bit width, which is poison. Instead, we
should return the offset directly.

(The same problem exists for the constant expression case, but I
plan to address that differently.)
@llvmbot
Copy link
Member

llvmbot commented Jun 14, 2024

@llvm/pr-subscribers-backend-amdgpu

Author: Nikita Popov (nikic)

Changes

For ptrtoint that truncates to the offset only, the expansion generated a shift by the bit width, which is poison. Instead, we should return the offset directly.

(The same problem exists for the constant expression case, but I plan to address that separately, and more comprehensively.)


Patch is 399.43 KiB, truncated to 20.00 KiB below, full version: https://github.com/llvm/llvm-project/pull/95543.diff

5 Files Affected:

  • (modified) llvm/lib/Target/AMDGPU/AMDGPULowerBufferFatPointers.cpp (+16-14)
  • (modified) llvm/test/CodeGen/AMDGPU/buffer-fat-pointer-atomicrmw-fadd.ll (+706-497)
  • (modified) llvm/test/CodeGen/AMDGPU/buffer-fat-pointer-atomicrmw-fmax.ll (+721-514)
  • (modified) llvm/test/CodeGen/AMDGPU/buffer-fat-pointer-atomicrmw-fmin.ll (+721-514)
  • (modified) llvm/test/CodeGen/AMDGPU/lower-buffer-fat-pointers-pointer-ops.ll (+1-2)
diff --git a/llvm/lib/Target/AMDGPU/AMDGPULowerBufferFatPointers.cpp b/llvm/lib/Target/AMDGPU/AMDGPULowerBufferFatPointers.cpp
index 0b261d8e33907..dfe0583767313 100644
--- a/llvm/lib/Target/AMDGPU/AMDGPULowerBufferFatPointers.cpp
+++ b/llvm/lib/Target/AMDGPU/AMDGPULowerBufferFatPointers.cpp
@@ -1435,20 +1435,22 @@ PtrParts SplitPtrStructs::visitPtrToIntInst(PtrToIntInst &PI) {
   const DataLayout &DL = PI.getModule()->getDataLayout();
   unsigned FatPtrWidth = DL.getPointerSizeInBits(AMDGPUAS::BUFFER_FAT_POINTER);
 
-  Value *RsrcInt;
-  if (Width <= BufferOffsetWidth)
-    RsrcInt = ConstantExpr::getIntegerValue(ResTy, APInt::getZero(Width));
-  else
-    RsrcInt = IRB.CreatePtrToInt(Rsrc, ResTy, PI.getName() + ".rsrc");
-  copyMetadata(RsrcInt, &PI);
-
-  Value *Shl = IRB.CreateShl(
-      RsrcInt,
-      ConstantExpr::getIntegerValue(ResTy, APInt(Width, BufferOffsetWidth)), "",
-      Width >= FatPtrWidth, Width > FatPtrWidth);
-  Value *OffCast =
-      IRB.CreateIntCast(Off, ResTy, /*isSigned=*/false, PI.getName() + ".off");
-  Value *Res = IRB.CreateOr(Shl, OffCast);
+  Value *Res;
+  if (Width <= BufferOffsetWidth) {
+    Res = IRB.CreateIntCast(Off, ResTy, /*isSigned=*/false,
+                            PI.getName() + ".off");
+  } else {
+    Value *RsrcInt = IRB.CreatePtrToInt(Rsrc, ResTy, PI.getName() + ".rsrc");
+    Value *Shl = IRB.CreateShl(
+        RsrcInt,
+        ConstantExpr::getIntegerValue(ResTy, APInt(Width, BufferOffsetWidth)),
+        "", Width >= FatPtrWidth, Width > FatPtrWidth);
+    Value *OffCast = IRB.CreateIntCast(Off, ResTy, /*isSigned=*/false,
+                                       PI.getName() + ".off");
+    Res = IRB.CreateOr(Shl, OffCast);
+  }
+
+  copyMetadata(Res, &PI);
   Res->takeName(&PI);
   SplitUsers.insert(&PI);
   PI.replaceAllUsesWith(Res);
diff --git a/llvm/test/CodeGen/AMDGPU/buffer-fat-pointer-atomicrmw-fadd.ll b/llvm/test/CodeGen/AMDGPU/buffer-fat-pointer-atomicrmw-fadd.ll
index 2f4606035376d..b81730803d4a9 100644
--- a/llvm/test/CodeGen/AMDGPU/buffer-fat-pointer-atomicrmw-fadd.ll
+++ b/llvm/test/CodeGen/AMDGPU/buffer-fat-pointer-atomicrmw-fadd.ll
@@ -1814,22 +1814,27 @@ define half @buffer_fat_ptr_agent_atomic_fadd_ret_f16__offset(ptr addrspace(7) i
 ; GFX12-NEXT:    s_wait_bvhcnt 0x0
 ; GFX12-NEXT:    s_wait_kmcnt 0x0
 ; GFX12-NEXT:    s_addk_co_i32 s4, 0x200
-; GFX12-NEXT:    s_delay_alu instid0(SALU_CYCLE_1) | instskip(NEXT) | instid1(SALU_CYCLE_1)
-; GFX12-NEXT:    s_and_b32 s4, s4, -4
-; GFX12-NEXT:    v_mov_b32_e32 v5, s4
-; GFX12-NEXT:    s_mov_b32 s4, 0
+; GFX12-NEXT:    s_delay_alu instid0(SALU_CYCLE_1) | instskip(SKIP_3) | instid1(SALU_CYCLE_1)
+; GFX12-NEXT:    s_and_b32 s5, s4, -4
+; GFX12-NEXT:    s_and_b32 s4, s4, 3
+; GFX12-NEXT:    v_mov_b32_e32 v5, s5
+; GFX12-NEXT:    s_lshl_b32 s4, s4, 3
+; GFX12-NEXT:    s_lshl_b32 s5, 0xffff, s4
+; GFX12-NEXT:    s_delay_alu instid0(SALU_CYCLE_1)
+; GFX12-NEXT:    s_not_b32 s6, s5
 ; GFX12-NEXT:    buffer_load_b32 v2, v5, s[0:3], null offen
+; GFX12-NEXT:    s_mov_b32 s5, 0
 ; GFX12-NEXT:  .LBB6_1: ; %atomicrmw.start
 ; GFX12-NEXT:    ; =>This Inner Loop Header: Depth=1
 ; GFX12-NEXT:    s_wait_loadcnt 0x0
-; GFX12-NEXT:    v_lshrrev_b32_e32 v1, 24, v2
+; GFX12-NEXT:    v_lshrrev_b32_e32 v1, s4, v2
 ; GFX12-NEXT:    s_wait_storecnt 0x0
 ; GFX12-NEXT:    s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
 ; GFX12-NEXT:    v_add_f16_e32 v1, v1, v0
 ; GFX12-NEXT:    v_and_b32_e32 v1, 0xffff, v1
 ; GFX12-NEXT:    s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
-; GFX12-NEXT:    v_lshlrev_b32_e32 v1, 24, v1
-; GFX12-NEXT:    v_and_or_b32 v1, 0xffffff, v2, v1
+; GFX12-NEXT:    v_lshlrev_b32_e32 v1, s4, v1
+; GFX12-NEXT:    v_and_or_b32 v1, v2, s6, v1
 ; GFX12-NEXT:    s_delay_alu instid0(VALU_DEP_1)
 ; GFX12-NEXT:    v_dual_mov_b32 v4, v2 :: v_dual_mov_b32 v3, v1
 ; GFX12-NEXT:    buffer_atomic_cmpswap_b32 v[3:4], v5, s[0:3], null offen th:TH_ATOMIC_RETURN
@@ -1837,31 +1842,36 @@ define half @buffer_fat_ptr_agent_atomic_fadd_ret_f16__offset(ptr addrspace(7) i
 ; GFX12-NEXT:    global_inv scope:SCOPE_DEV
 ; GFX12-NEXT:    v_cmp_eq_u32_e32 vcc_lo, v3, v2
 ; GFX12-NEXT:    v_mov_b32_e32 v2, v3
-; GFX12-NEXT:    s_or_b32 s4, vcc_lo, s4
+; GFX12-NEXT:    s_or_b32 s5, vcc_lo, s5
 ; GFX12-NEXT:    s_delay_alu instid0(SALU_CYCLE_1)
-; GFX12-NEXT:    s_and_not1_b32 exec_lo, exec_lo, s4
+; GFX12-NEXT:    s_and_not1_b32 exec_lo, exec_lo, s5
 ; GFX12-NEXT:    s_cbranch_execnz .LBB6_1
 ; GFX12-NEXT:  ; %bb.2: ; %atomicrmw.end
-; GFX12-NEXT:    s_or_b32 exec_lo, exec_lo, s4
-; GFX12-NEXT:    v_lshrrev_b32_e32 v0, 24, v3
+; GFX12-NEXT:    s_or_b32 exec_lo, exec_lo, s5
+; GFX12-NEXT:    v_lshrrev_b32_e32 v0, s4, v3
 ; GFX12-NEXT:    s_setpc_b64 s[30:31]
 ;
 ; GFX940-LABEL: buffer_fat_ptr_agent_atomic_fadd_ret_f16__offset:
 ; GFX940:       ; %bb.0:
 ; GFX940-NEXT:    s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
 ; GFX940-NEXT:    s_addk_i32 s4, 0x200
-; GFX940-NEXT:    s_and_b32 s4, s4, -4
-; GFX940-NEXT:    v_mov_b32_e32 v1, s4
+; GFX940-NEXT:    s_and_b32 s5, s4, -4
+; GFX940-NEXT:    v_mov_b32_e32 v1, s5
 ; GFX940-NEXT:    buffer_load_dword v3, v1, s[0:3], 0 offen
-; GFX940-NEXT:    s_mov_b32 s6, 0xffffff
+; GFX940-NEXT:    s_and_b32 s4, s4, 3
+; GFX940-NEXT:    s_lshl_b32 s6, s4, 3
+; GFX940-NEXT:    s_lshl_b32 s4, 0xffff, s6
+; GFX940-NEXT:    s_not_b32 s7, s4
 ; GFX940-NEXT:    s_mov_b64 s[4:5], 0
 ; GFX940-NEXT:  .LBB6_1: ; %atomicrmw.start
 ; GFX940-NEXT:    ; =>This Inner Loop Header: Depth=1
 ; GFX940-NEXT:    s_waitcnt vmcnt(0)
-; GFX940-NEXT:    v_add_f16_sdwa v2, v3, v0 dst_sel:BYTE_3 dst_unused:UNUSED_PAD src0_sel:BYTE_3 src1_sel:DWORD
-; GFX940-NEXT:    buffer_wbl2 sc1
-; GFX940-NEXT:    v_and_or_b32 v2, v3, s6, v2
+; GFX940-NEXT:    v_lshrrev_b32_e32 v2, s6, v3
+; GFX940-NEXT:    v_add_f16_e32 v2, v2, v0
+; GFX940-NEXT:    v_lshlrev_b32_e32 v2, s6, v2
+; GFX940-NEXT:    v_and_or_b32 v2, v3, s7, v2
 ; GFX940-NEXT:    v_mov_b64_e32 v[4:5], v[2:3]
+; GFX940-NEXT:    buffer_wbl2 sc1
 ; GFX940-NEXT:    buffer_atomic_cmpswap v[4:5], v1, s[0:3], 0 offen sc0
 ; GFX940-NEXT:    s_waitcnt vmcnt(0)
 ; GFX940-NEXT:    buffer_inv sc1
@@ -1872,30 +1882,34 @@ define half @buffer_fat_ptr_agent_atomic_fadd_ret_f16__offset(ptr addrspace(7) i
 ; GFX940-NEXT:    s_cbranch_execnz .LBB6_1
 ; GFX940-NEXT:  ; %bb.2: ; %atomicrmw.end
 ; GFX940-NEXT:    s_or_b64 exec, exec, s[4:5]
-; GFX940-NEXT:    v_lshrrev_b32_e32 v0, 24, v4
+; GFX940-NEXT:    v_lshrrev_b32_e32 v0, s6, v4
 ; GFX940-NEXT:    s_setpc_b64 s[30:31]
 ;
 ; GFX11-LABEL: buffer_fat_ptr_agent_atomic_fadd_ret_f16__offset:
 ; GFX11:       ; %bb.0:
 ; GFX11-NEXT:    s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
 ; GFX11-NEXT:    s_addk_i32 s4, 0x200
-; GFX11-NEXT:    s_delay_alu instid0(SALU_CYCLE_1) | instskip(NEXT) | instid1(SALU_CYCLE_1)
-; GFX11-NEXT:    s_and_b32 s4, s4, -4
-; GFX11-NEXT:    v_mov_b32_e32 v5, s4
-; GFX11-NEXT:    s_mov_b32 s4, 0
+; GFX11-NEXT:    s_delay_alu instid0(SALU_CYCLE_1) | instskip(SKIP_3) | instid1(SALU_CYCLE_1)
+; GFX11-NEXT:    s_and_b32 s5, s4, -4
+; GFX11-NEXT:    s_and_b32 s4, s4, 3
+; GFX11-NEXT:    v_mov_b32_e32 v5, s5
+; GFX11-NEXT:    s_lshl_b32 s4, s4, 3
+; GFX11-NEXT:    s_lshl_b32 s5, 0xffff, s4
+; GFX11-NEXT:    s_delay_alu instid0(SALU_CYCLE_1)
+; GFX11-NEXT:    s_not_b32 s6, s5
 ; GFX11-NEXT:    buffer_load_b32 v2, v5, s[0:3], 0 offen
-; GFX11-NEXT:    .p2align 6
+; GFX11-NEXT:    s_mov_b32 s5, 0
 ; GFX11-NEXT:  .LBB6_1: ; %atomicrmw.start
 ; GFX11-NEXT:    ; =>This Inner Loop Header: Depth=1
 ; GFX11-NEXT:    s_waitcnt vmcnt(0)
-; GFX11-NEXT:    v_lshrrev_b32_e32 v1, 24, v2
+; GFX11-NEXT:    v_lshrrev_b32_e32 v1, s4, v2
 ; GFX11-NEXT:    s_waitcnt_vscnt null, 0x0
 ; GFX11-NEXT:    s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
 ; GFX11-NEXT:    v_add_f16_e32 v1, v1, v0
 ; GFX11-NEXT:    v_and_b32_e32 v1, 0xffff, v1
 ; GFX11-NEXT:    s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
-; GFX11-NEXT:    v_lshlrev_b32_e32 v1, 24, v1
-; GFX11-NEXT:    v_and_or_b32 v1, 0xffffff, v2, v1
+; GFX11-NEXT:    v_lshlrev_b32_e32 v1, s4, v1
+; GFX11-NEXT:    v_and_or_b32 v1, v2, s6, v1
 ; GFX11-NEXT:    s_delay_alu instid0(VALU_DEP_1)
 ; GFX11-NEXT:    v_dual_mov_b32 v4, v2 :: v_dual_mov_b32 v3, v1
 ; GFX11-NEXT:    buffer_atomic_cmpswap_b32 v[3:4], v5, s[0:3], 0 offen glc
@@ -1904,31 +1918,35 @@ define half @buffer_fat_ptr_agent_atomic_fadd_ret_f16__offset(ptr addrspace(7) i
 ; GFX11-NEXT:    buffer_gl0_inv
 ; GFX11-NEXT:    v_cmp_eq_u32_e32 vcc_lo, v3, v2
 ; GFX11-NEXT:    v_mov_b32_e32 v2, v3
-; GFX11-NEXT:    s_or_b32 s4, vcc_lo, s4
+; GFX11-NEXT:    s_or_b32 s5, vcc_lo, s5
 ; GFX11-NEXT:    s_delay_alu instid0(SALU_CYCLE_1)
-; GFX11-NEXT:    s_and_not1_b32 exec_lo, exec_lo, s4
+; GFX11-NEXT:    s_and_not1_b32 exec_lo, exec_lo, s5
 ; GFX11-NEXT:    s_cbranch_execnz .LBB6_1
 ; GFX11-NEXT:  ; %bb.2: ; %atomicrmw.end
-; GFX11-NEXT:    s_or_b32 exec_lo, exec_lo, s4
-; GFX11-NEXT:    v_lshrrev_b32_e32 v0, 24, v3
+; GFX11-NEXT:    s_or_b32 exec_lo, exec_lo, s5
+; GFX11-NEXT:    v_lshrrev_b32_e32 v0, s4, v3
 ; GFX11-NEXT:    s_setpc_b64 s[30:31]
 ;
 ; GFX10-LABEL: buffer_fat_ptr_agent_atomic_fadd_ret_f16__offset:
 ; GFX10:       ; %bb.0:
 ; GFX10-NEXT:    s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
 ; GFX10-NEXT:    s_addk_i32 s8, 0x200
-; GFX10-NEXT:    s_and_b32 s8, s8, -4
-; GFX10-NEXT:    v_mov_b32_e32 v5, s8
-; GFX10-NEXT:    s_mov_b32 s8, 0
+; GFX10-NEXT:    s_and_b32 s9, s8, -4
+; GFX10-NEXT:    s_and_b32 s8, s8, 3
+; GFX10-NEXT:    v_mov_b32_e32 v5, s9
+; GFX10-NEXT:    s_lshl_b32 s8, s8, 3
+; GFX10-NEXT:    s_lshl_b32 s9, 0xffff, s8
+; GFX10-NEXT:    s_not_b32 s10, s9
 ; GFX10-NEXT:    buffer_load_dword v2, v5, s[4:7], 0 offen
+; GFX10-NEXT:    s_mov_b32 s9, 0
 ; GFX10-NEXT:  .LBB6_1: ; %atomicrmw.start
 ; GFX10-NEXT:    ; =>This Inner Loop Header: Depth=1
 ; GFX10-NEXT:    s_waitcnt vmcnt(0)
-; GFX10-NEXT:    v_add_f16_sdwa v1, v2, v0 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_3 src1_sel:DWORD
+; GFX10-NEXT:    v_lshrrev_b32_e32 v1, s8, v2
 ; GFX10-NEXT:    s_waitcnt_vscnt null, 0x0
-; GFX10-NEXT:    v_and_b32_e32 v1, 0xffff, v1
-; GFX10-NEXT:    v_lshlrev_b32_e32 v1, 24, v1
-; GFX10-NEXT:    v_and_or_b32 v1, 0xffffff, v2, v1
+; GFX10-NEXT:    v_add_f16_e32 v1, v1, v0
+; GFX10-NEXT:    v_lshlrev_b32_sdwa v1, s8, v1 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:WORD_0
+; GFX10-NEXT:    v_and_or_b32 v1, v2, s10, v1
 ; GFX10-NEXT:    v_mov_b32_e32 v4, v2
 ; GFX10-NEXT:    v_mov_b32_e32 v3, v1
 ; GFX10-NEXT:    buffer_atomic_cmpswap v[3:4], v5, s[4:7], 0 offen glc
@@ -1937,28 +1955,33 @@ define half @buffer_fat_ptr_agent_atomic_fadd_ret_f16__offset(ptr addrspace(7) i
 ; GFX10-NEXT:    buffer_gl0_inv
 ; GFX10-NEXT:    v_cmp_eq_u32_e32 vcc_lo, v3, v2
 ; GFX10-NEXT:    v_mov_b32_e32 v2, v3
-; GFX10-NEXT:    s_or_b32 s8, vcc_lo, s8
-; GFX10-NEXT:    s_andn2_b32 exec_lo, exec_lo, s8
+; GFX10-NEXT:    s_or_b32 s9, vcc_lo, s9
+; GFX10-NEXT:    s_andn2_b32 exec_lo, exec_lo, s9
 ; GFX10-NEXT:    s_cbranch_execnz .LBB6_1
 ; GFX10-NEXT:  ; %bb.2: ; %atomicrmw.end
-; GFX10-NEXT:    s_or_b32 exec_lo, exec_lo, s8
-; GFX10-NEXT:    v_lshrrev_b32_e32 v0, 24, v3
+; GFX10-NEXT:    s_or_b32 exec_lo, exec_lo, s9
+; GFX10-NEXT:    v_lshrrev_b32_e32 v0, s8, v3
 ; GFX10-NEXT:    s_setpc_b64 s[30:31]
 ;
 ; GFX90A-LABEL: buffer_fat_ptr_agent_atomic_fadd_ret_f16__offset:
 ; GFX90A:       ; %bb.0:
 ; GFX90A-NEXT:    s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
 ; GFX90A-NEXT:    s_addk_i32 s8, 0x200
-; GFX90A-NEXT:    s_and_b32 s8, s8, -4
-; GFX90A-NEXT:    v_mov_b32_e32 v1, s8
+; GFX90A-NEXT:    s_and_b32 s9, s8, -4
+; GFX90A-NEXT:    v_mov_b32_e32 v1, s9
 ; GFX90A-NEXT:    buffer_load_dword v3, v1, s[4:7], 0 offen
-; GFX90A-NEXT:    s_mov_b32 s10, 0xffffff
+; GFX90A-NEXT:    s_and_b32 s8, s8, 3
+; GFX90A-NEXT:    s_lshl_b32 s10, s8, 3
+; GFX90A-NEXT:    s_lshl_b32 s8, 0xffff, s10
+; GFX90A-NEXT:    s_not_b32 s11, s8
 ; GFX90A-NEXT:    s_mov_b64 s[8:9], 0
 ; GFX90A-NEXT:  .LBB6_1: ; %atomicrmw.start
 ; GFX90A-NEXT:    ; =>This Inner Loop Header: Depth=1
 ; GFX90A-NEXT:    s_waitcnt vmcnt(0)
-; GFX90A-NEXT:    v_add_f16_sdwa v2, v3, v0 dst_sel:BYTE_3 dst_unused:UNUSED_PAD src0_sel:BYTE_3 src1_sel:DWORD
-; GFX90A-NEXT:    v_and_or_b32 v2, v3, s10, v2
+; GFX90A-NEXT:    v_lshrrev_b32_e32 v2, s10, v3
+; GFX90A-NEXT:    v_add_f16_e32 v2, v2, v0
+; GFX90A-NEXT:    v_lshlrev_b32_e32 v2, s10, v2
+; GFX90A-NEXT:    v_and_or_b32 v2, v3, s11, v2
 ; GFX90A-NEXT:    v_pk_mov_b32 v[4:5], v[2:3], v[2:3] op_sel:[0,1]
 ; GFX90A-NEXT:    buffer_atomic_cmpswap v[4:5], v1, s[4:7], 0 offen glc
 ; GFX90A-NEXT:    s_waitcnt vmcnt(0)
@@ -1970,23 +1993,28 @@ define half @buffer_fat_ptr_agent_atomic_fadd_ret_f16__offset(ptr addrspace(7) i
 ; GFX90A-NEXT:    s_cbranch_execnz .LBB6_1
 ; GFX90A-NEXT:  ; %bb.2: ; %atomicrmw.end
 ; GFX90A-NEXT:    s_or_b64 exec, exec, s[8:9]
-; GFX90A-NEXT:    v_lshrrev_b32_e32 v0, 24, v4
+; GFX90A-NEXT:    v_lshrrev_b32_e32 v0, s10, v4
 ; GFX90A-NEXT:    s_setpc_b64 s[30:31]
 ;
 ; GFX908-LABEL: buffer_fat_ptr_agent_atomic_fadd_ret_f16__offset:
 ; GFX908:       ; %bb.0:
 ; GFX908-NEXT:    s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
 ; GFX908-NEXT:    s_addk_i32 s8, 0x200
-; GFX908-NEXT:    s_and_b32 s8, s8, -4
-; GFX908-NEXT:    v_mov_b32_e32 v5, s8
+; GFX908-NEXT:    s_and_b32 s9, s8, -4
+; GFX908-NEXT:    v_mov_b32_e32 v5, s9
 ; GFX908-NEXT:    buffer_load_dword v2, v5, s[4:7], 0 offen
-; GFX908-NEXT:    s_mov_b32 s10, 0xffffff
+; GFX908-NEXT:    s_and_b32 s8, s8, 3
+; GFX908-NEXT:    s_lshl_b32 s10, s8, 3
+; GFX908-NEXT:    s_lshl_b32 s8, 0xffff, s10
+; GFX908-NEXT:    s_not_b32 s11, s8
 ; GFX908-NEXT:    s_mov_b64 s[8:9], 0
 ; GFX908-NEXT:  .LBB6_1: ; %atomicrmw.start
 ; GFX908-NEXT:    ; =>This Inner Loop Header: Depth=1
 ; GFX908-NEXT:    s_waitcnt vmcnt(0)
-; GFX908-NEXT:    v_add_f16_sdwa v1, v2, v0 dst_sel:BYTE_3 dst_unused:UNUSED_PAD src0_sel:BYTE_3 src1_sel:DWORD
-; GFX908-NEXT:    v_and_or_b32 v1, v2, s10, v1
+; GFX908-NEXT:    v_lshrrev_b32_e32 v1, s10, v2
+; GFX908-NEXT:    v_add_f16_e32 v1, v1, v0
+; GFX908-NEXT:    v_lshlrev_b32_e32 v1, s10, v1
+; GFX908-NEXT:    v_and_or_b32 v1, v2, s11, v1
 ; GFX908-NEXT:    v_mov_b32_e32 v4, v2
 ; GFX908-NEXT:    v_mov_b32_e32 v3, v1
 ; GFX908-NEXT:    buffer_atomic_cmpswap v[3:4], v5, s[4:7], 0 offen glc
@@ -1999,22 +2027,28 @@ define half @buffer_fat_ptr_agent_atomic_fadd_ret_f16__offset(ptr addrspace(7) i
 ; GFX908-NEXT:    s_cbranch_execnz .LBB6_1
 ; GFX908-NEXT:  ; %bb.2: ; %atomicrmw.end
 ; GFX908-NEXT:    s_or_b64 exec, exec, s[8:9]
-; GFX908-NEXT:    v_lshrrev_b32_e32 v0, 24, v3
+; GFX908-NEXT:    v_lshrrev_b32_e32 v0, s10, v3
 ; GFX908-NEXT:    s_setpc_b64 s[30:31]
 ;
 ; GFX8-LABEL: buffer_fat_ptr_agent_atomic_fadd_ret_f16__offset:
 ; GFX8:       ; %bb.0:
 ; GFX8-NEXT:    s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
 ; GFX8-NEXT:    s_addk_i32 s8, 0x200
-; GFX8-NEXT:    s_and_b32 s8, s8, -4
-; GFX8-NEXT:    v_mov_b32_e32 v5, s8
+; GFX8-NEXT:    s_and_b32 s9, s8, -4
+; GFX8-NEXT:    v_mov_b32_e32 v5, s9
 ; GFX8-NEXT:    buffer_load_dword v2, v5, s[4:7], 0 offen
+; GFX8-NEXT:    s_and_b32 s8, s8, 3
+; GFX8-NEXT:    s_lshl_b32 s10, s8, 3
+; GFX8-NEXT:    s_lshl_b32 s8, 0xffff, s10
+; GFX8-NEXT:    s_not_b32 s11, s8
 ; GFX8-NEXT:    s_mov_b64 s[8:9], 0
 ; GFX8-NEXT:  .LBB6_1: ; %atomicrmw.start
 ; GFX8-NEXT:    ; =>This Inner Loop Header: Depth=1
 ; GFX8-NEXT:    s_waitcnt vmcnt(0)
-; GFX8-NEXT:    v_add_f16_sdwa v1, v2, v0 dst_sel:BYTE_3 dst_unused:UNUSED_PAD src0_sel:BYTE_3 src1_sel:DWORD
-; GFX8-NEXT:    v_and_b32_e32 v3, 0xffffff, v2
+; GFX8-NEXT:    v_lshrrev_b32_e32 v1, s10, v2
+; GFX8-NEXT:    v_add_f16_e32 v1, v1, v0
+; GFX8-NEXT:    v_and_b32_e32 v3, s11, v2
+; GFX8-NEXT:    v_lshlrev_b32_e32 v1, s10, v1
 ; GFX8-NEXT:    v_or_b32_e32 v1, v3, v1
 ; GFX8-NEXT:    v_mov_b32_e32 v4, v2
 ; GFX8-NEXT:    v_mov_b32_e32 v3, v1
@@ -2028,28 +2062,32 @@ define half @buffer_fat_ptr_agent_atomic_fadd_ret_f16__offset(ptr addrspace(7) i
 ; GFX8-NEXT:    s_cbranch_execnz .LBB6_1
 ; GFX8-NEXT:  ; %bb.2: ; %atomicrmw.end
 ; GFX8-NEXT:    s_or_b64 exec, exec, s[8:9]
-; GFX8-NEXT:    v_lshrrev_b32_e32 v0, 24, v3
+; GFX8-NEXT:    v_lshrrev_b32_e32 v0, s10, v3
 ; GFX8-NEXT:    s_setpc_b64 s[30:31]
 ;
 ; GFX7-LABEL: buffer_fat_ptr_agent_atomic_fadd_ret_f16__offset:
 ; GFX7:       ; %bb.0:
 ; GFX7-NEXT:    s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
 ; GFX7-NEXT:    s_addk_i32 s8, 0x200
-; GFX7-NEXT:    s_and_b32 s8, s8, -4
-; GFX7-NEXT:    v_mov_b32_e32 v4, s8
+; GFX7-NEXT:    s_and_b32 s9, s8, -4
+; GFX7-NEXT:    v_mov_b32_e32 v4, s9
 ; GFX7-NEXT:    buffer_load_dword v1, v4, s[4:7], 0 offen
 ; GFX7-NEXT:    v_cvt_f16_f32_e32 v0, v0
-; GFX7-NEXT:    s_mov_b64 s[8:9], 0
+; GFX7-NEXT:    s_and_b32 s8, s8, 3
+; GFX7-NEXT:    s_lshl_b32 s10, s8, 3
+; GFX7-NEXT:    s_lshl_b32 s8, 0xffff, s10
 ; GFX7-NEXT:    v_cvt_f32_f16_e32 v5, v0
+; GFX7-NEXT:    s_not_b32 s11, s8
+; GFX7-NEXT:    s_mov_b64 s[8:9], 0
 ; GFX7-NEXT:  .LBB6_1: ; %atomicrmw.start
 ; GFX7-NEXT:    ; =>This Inner Loop Header: Depth=1
 ; GFX7-NEXT:    s_waitcnt vmcnt(0)
-; GFX7-NEXT:    v_lshrrev_b32_e32 v0, 24, v1
+; GFX7-NEXT:    v_lshrrev_b32_e32 v0, s10, v1
 ; GFX7-NEXT:    v_cvt_f32_f16_e32 v0, v0
-; GFX7-NEXT:    v_and_b32_e32 v2, 0xffffff, v1
+; GFX7-NEXT:    v_and_b32_e32 v2, s11, v1
 ; GFX7-NEXT:    v_add_f32_e32 v0, v0, v5
 ; GFX7-NEXT:    v_cvt_f16_f32_e32 v0, v0
-; GFX7-NEXT:    v_lshlrev_b32_e32 v0, 24, v0
+; GFX7-NEXT:    v_lshlrev_b32_e32 v0, s10, v0
 ; GFX7-NEXT:    v_or_b32_e32 v0, v2, v0
 ; GFX7-NEXT:    v_mov_b32_e32 v3, v1
 ; GFX7-NEXT:    v_mov_b32_e32 v2, v0
@@ -2063,7 +2101,7 @@ define half @buffer_fat_ptr_agent_atomic_fadd_ret_f16__offset(ptr addrspace(7) i
 ; GFX7-NEXT:    s_cbranch_execnz .LBB6_1
 ; GFX7-NEXT:  ; %bb.2: ; %atomicrmw.end
 ; GFX7-NEXT:    s_or_b64 exec, exec, s[8:9]
-; GFX7-NEXT:    v_lshrrev_b32_e32 v0, 24, v2
+; GFX7-NEXT:    v_lshrrev_b32_e32 v0, s10, v2
 ; GFX7-NEXT:    v_cvt_f32_f16_e32 v0, v0
 ; GFX7-NEXT:    s_setpc_b64 s[30:31]
 ;
@@ -2071,22 +2109,26 @@ define half @buffer_fat_ptr_agent_atomic_fadd_ret_f16__offset(ptr addrspace(7) i
 ; GFX6:       ; %bb.0:
 ; GFX6-NEXT:    s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
 ; GFX6-NEXT:    s_addk_i32 s8, 0x200
-; GFX6-NEXT:    s_and_b32 s8, s8, -4
-; GFX6-NEXT:    v_mov_b32_e32 v4, s8
+; GFX6-NEXT:    s_and_b32 s9, s8, -4
+; GFX6-NEXT:    v_mov_b32_e32 v4, s9
 ; GFX6-NEXT:    buffer_load_dword v1, v4, s[4:7], 0 offen
 ; GFX6-NEXT:    v_cvt_f16_f32_e32 v0, v0
-; GFX6-NEXT:    s_mov_b64 s[8:9], 0
+; GFX6-NEXT:    s_and_b32 s8, s8, 3
+; GFX6-NEXT:    s_lshl_b32 s10, s8, 3
+; GFX6-NEXT:    s_lshl_b32 s8, 0xffff, s10
 ; GFX6-NEXT:    v_cvt_f32_f16_e32 v5, v0
+; GFX6-NEXT:    s_not_b32 s11, s8
+; GFX6-NEXT:    s_mov_b64 s[8:9], 0
 ; GFX6-NEXT:  .LBB6_1: ; %atomicrmw.start
 ; GFX6-NEXT:    ; =>This Inner Loop Header: Depth=1
 ; GFX6-NEXT:    s_waitcnt vmcnt(0)
-; GFX6-NEXT:    v_lshrrev_b32_e32 v0, 24, v1
+; GFX6-NEXT:    v_lshrrev_b32_e32 v0, s10, v1
 ; GFX6-NEXT:    v_cvt_f32_f16_e32 v0, v0
 ; GFX6-NEXT:    s_waitcnt expcnt(0)
-; GFX6-NEXT:    v_and_b32_e32 v2, 0xffffff, v1
+; GFX6-NEXT:    v_and_b32_e32 v2, s11, v1
 ; GFX6-NEXT:    v_add_f32_e32 v0, v0, v5
 ; GFX6-NEXT:    v_cvt_f16_f32_e32 v0, v0
-; GFX6-NEXT:    v_lshlrev_b32_e32 v0, 24, v0
+; GFX6-NEXT:    v_lshlrev_b32_e32 v0, s10, v0
 ; GFX6-NEXT:    v_or_b32_e32 v0, v2, v0
 ; GFX6-NEXT:    v_mov_b32_e32 v3, v1
 ; GFX6-NEXT:    v_mov_b32_e32 v2, v0
@@ -2100,7 +2142,7 @@ define half @buffer_fat_ptr_agent_atomic_fadd_ret_f16__offset(ptr addrspace(7) i
 ; GFX6-NEXT:    s_cbranch_execnz .LBB6_1
 ; GFX6-NEXT:  ; %bb.2: ; %atomicrmw.end
 ; GFX6-NEXT:    s_or_b64 exec, exec, s[8:9]
-; GFX6-NEXT:    v_lshrrev_b32_e32 v0, 24, v2
+; GFX6-NEXT:    v_lshrrev_b32_e32 v0, s10, v2
 ; GFX6-NEXT:    v_cvt_f32_f16_e32 v0, v0
 ; GFX6-NEXT:    s_waitcnt expcnt(0)
 ; GFX6-NEXT:    s_setpc_b64 s[30:31]
@@ -2118,22 +2160,27 @@ define void @buffer_fat_ptr_agent_atomic_fadd_noret_f16__offset(ptr addrspace(7)
 ; GFX12-NEXT:    s_wait_bvhcnt 0x0
 ; GFX12-NEXT:    s_wait_kmcnt 0x0
 ; GFX12-NEXT:    s_addk_co_i32 s4, 0x200
-; GFX12-NEXT:    s_delay_alu instid0(SALU_CYCLE_1) | instskip(NEXT) | instid1(SALU_CYCLE_1)
-; GFX12-NEXT:    s_and_b32 s4, s4, -4
-; GFX12-NEXT:    v_mov_b32_e32 v3, s4
-; GFX12-NEXT:    s_mov_b32 s4, 0
+; GFX12-NEXT:    s_delay_alu instid0(SALU_CYCLE_1) | instskip(SKIP_3) | instid1(SALU_CYCLE_1)
+; GFX12-NEXT:    s_and_b32 s...
[truncated]

@nikic nikic merged commit 0774000 into llvm:main Jun 14, 2024
7 of 8 checks passed
@nikic nikic deleted the amdfpu-fat-ptr-ptrtoint branch June 14, 2024 14:38
Copy link
Contributor

@krzysz00 krzysz00 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice catch!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants