[AMDGPU] Fix edge case of buffer OOB handling #115479

Merged: 10 commits into llvm:main on Mar 7, 2025

Conversation

@piotrAMD (Collaborator) commented Nov 8, 2024

Strengthen out-of-bounds guarantees for buffer accesses by disallowing buffer accesses with alignment lower than natural alignment.

This is needed specifically to address the edge case where an access starts out-of-bounds and then enters in-bounds, as the hardware would treat the entire access as being out-of-bounds. This is normally not needed for most users, but at least one graphics device extension (VK_EXT_robustness2) has very strict requirements: in-bounds accesses must return the correct value, and out-of-bounds accesses must return zero.

The direct consequence of the patch is that a buffer access at a negative address is not merged by the load-store-vectorizer with one at a positive address, which fixes a CTS test.

Targets that do not care about the new behavior are advised to use the new target feature relaxed-buffer-oob-mode, which maintains the behavior from before the patch.

Add a new target feature require-naturally-aligned-buffer-access to guarantee robust
out-of-bounds behavior. When set, it disallows buffer accesses with alignment
lower than the natural alignment.

This is needed specifically to address the edge case where an access starts
out-of-bounds and then enters in-bounds, as the hardware would treat the entire access
as out-of-bounds. This is normally not needed for most users (hence the target
feature), but at least one graphics device extension (VK_EXT_robustness2) has
very strict requirements: in-bounds accesses must return the correct value,
and out-of-bounds accesses must return zero.

The direct result of the patch is that, when the new target feature is set, a buffer
access at a negative address will not be merged with one at a positive address.
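To make the edge case concrete, below is a minimal sketch in plain C++ (a simplified host-side model, not AMDGPU or LLVM code) of the behavior described above: any access that starts out-of-bounds is treated as entirely out-of-bounds and returns zeros, so merging an in-bounds dword with one at a negative offset silently zeroes data that was valid before the merge. The buffer size and offsets are illustrative only.

#include <cstdint>
#include <cstdio>
#include <vector>

// Simplified model of the behavior described above: an access whose start is
// out-of-bounds returns zeros for the entire access. (The model also zeroes
// accesses running past the end; only the "starts out-of-bounds" case matters
// for this patch.)
static std::vector<uint8_t> bufferLoad(const std::vector<uint8_t> &Buf,
                                       int64_t Offset, unsigned Size) {
  if (Offset < 0 || (uint64_t)Offset + Size > Buf.size())
    return std::vector<uint8_t>(Size, 0);
  return std::vector<uint8_t>(Buf.begin() + Offset, Buf.begin() + Offset + Size);
}

int main() {
  std::vector<uint8_t> Buf(64, 0xAA); // 64 valid bytes

  // Two separate dword loads: the one at offset -4 is out-of-bounds and
  // correctly returns zeros; the one at offset 0 returns real data.
  std::vector<uint8_t> OOB = bufferLoad(Buf, -4, 4);
  std::vector<uint8_t> InBounds = bufferLoad(Buf, 0, 4);

  // After an aggressive merge, the same bytes are fetched by one 8-byte load
  // at offset -4. That access starts out-of-bounds, so all of it returns
  // zeros, including the dword at offset 0 that was valid before the merge.
  std::vector<uint8_t> Merged = bufferLoad(Buf, -4, 8);

  std::printf("oob: %02x, separate: %02x, merged: %02x\n",
              OOB[0], InBounds[0], Merged[4]);
  return 0; // prints "oob: 00, separate: aa, merged: 00"
}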
@llvmbot (Member) commented Nov 8, 2024

@llvm/pr-subscribers-llvm-transforms

@llvm/pr-subscribers-backend-amdgpu

Author: Piotr Sobczak (piotrAMD)

Changes

Add a new target feature require-naturally-aligned-buffer-access to guarantee robust out-of-bounds behavior. When set, it will disallow buffer accesses with alignment lower than natural alignment.

This is needed specifically to address the edge case where an access starts out-of-bounds and then enters in-bounds, as the hardware would treat the entire access as out-of-bounds. This is normally not needed for most users (hence the target feature), but at least one graphics device extension (VK_EXT_robustness2) has very strict requirements: in-bounds accesses must return the correct value, and out-of-bounds accesses must return zero.

The direct result of the patch is that, when the new target feature is set, a buffer access at a negative address will not be merged with one at a positive address.


Full diff: https://github.com/llvm/llvm-project/pull/115479.diff

4 Files Affected:

  • (modified) llvm/lib/Target/AMDGPU/AMDGPU.td (+6)
  • (modified) llvm/lib/Target/AMDGPU/GCNSubtarget.h (+5)
  • (modified) llvm/lib/Target/AMDGPU/SIISelLowering.cpp (+14)
  • (added) llvm/test/Transforms/LoadStoreVectorizer/AMDGPU/unaligned-buffer.ll (+80)
diff --git a/llvm/lib/Target/AMDGPU/AMDGPU.td b/llvm/lib/Target/AMDGPU/AMDGPU.td
index bde61a1f7e58df..8a184a92f016e0 100644
--- a/llvm/lib/Target/AMDGPU/AMDGPU.td
+++ b/llvm/lib/Target/AMDGPU/AMDGPU.td
@@ -119,6 +119,12 @@ def FeatureUnalignedDSAccess : SubtargetFeature<"unaligned-ds-access",
   "Hardware supports unaligned local and region loads and stores"
 >;
 
+def FeatureRequireNaturallyAlignedBufferAccess : SubtargetFeature<"require-naturally-aligned-buffer-access",
+  "RequireNaturallyAlignedBufferAccess",
+  "true",
+  "Requires natural alignment of buffer accesses"
+>;
+
 def FeatureApertureRegs : SubtargetFeature<"aperture-regs",
   "HasApertureRegs",
   "true",
diff --git a/llvm/lib/Target/AMDGPU/GCNSubtarget.h b/llvm/lib/Target/AMDGPU/GCNSubtarget.h
index 6ff964077d8fd0..541e3c0f399e30 100644
--- a/llvm/lib/Target/AMDGPU/GCNSubtarget.h
+++ b/llvm/lib/Target/AMDGPU/GCNSubtarget.h
@@ -76,6 +76,7 @@ class GCNSubtarget final : public AMDGPUGenSubtargetInfo,
   bool BackOffBarrier = false;
   bool UnalignedScratchAccess = false;
   bool UnalignedAccessMode = false;
+  bool RequireNaturallyAlignedBufferAccess = false;
   bool HasApertureRegs = false;
   bool SupportsXNACK = false;
   bool KernargPreload = false;
@@ -600,6 +601,10 @@ class GCNSubtarget final : public AMDGPUGenSubtargetInfo,
     return UnalignedAccessMode;
   }
 
+  bool requiresNaturallyAlignedBufferAccess() const {
+    return RequireNaturallyAlignedBufferAccess;
+  }
+
   bool hasApertureRegs() const {
     return HasApertureRegs;
   }
diff --git a/llvm/lib/Target/AMDGPU/SIISelLowering.cpp b/llvm/lib/Target/AMDGPU/SIISelLowering.cpp
index 419414e5bd993d..d4321eb682dd9b 100644
--- a/llvm/lib/Target/AMDGPU/SIISelLowering.cpp
+++ b/llvm/lib/Target/AMDGPU/SIISelLowering.cpp
@@ -1840,6 +1840,20 @@ bool SITargetLowering::allowsMisalignedMemoryAccessesImpl(
            Subtarget->hasUnalignedBufferAccessEnabled();
   }
 
+  // Check natural alignment of buffer if the target requires it. This is needed
+  // only if robust out-of-bounds guarantees are needed. Normally hardware will
+  // ensure proper out-of-bounds behavior, but in the edge case where an access
+  // starts out-of-bounds and then enters in-bounds, the entire access would be
+  // treated as out-of-bounds. Requiring the natural alignment avoids the
+  // problem.
+  if (AddrSpace == AMDGPUAS::BUFFER_FAT_POINTER ||
+      AddrSpace == AMDGPUAS::BUFFER_RESOURCE ||
+      AddrSpace == AMDGPUAS::BUFFER_STRIDED_POINTER) {
+    if (Subtarget->requiresNaturallyAlignedBufferAccess() &&
+        Alignment < Align(PowerOf2Ceil(divideCeil(Size, 8))))
+      return false;
+  }
+
   // Smaller than dword value must be aligned.
   if (Size < 32)
     return false;
diff --git a/llvm/test/Transforms/LoadStoreVectorizer/AMDGPU/unaligned-buffer.ll b/llvm/test/Transforms/LoadStoreVectorizer/AMDGPU/unaligned-buffer.ll
new file mode 100644
index 00000000000000..0d8a98feecb82b
--- /dev/null
+++ b/llvm/test/Transforms/LoadStoreVectorizer/AMDGPU/unaligned-buffer.ll
@@ -0,0 +1,80 @@
+; NOTE: Assertions have been autogenerated by utils/update_test_checks.py UTC_ARGS: --version 5
+; RUN: opt -mtriple=amdgcn--amdpal -passes=load-store-vectorizer -mattr=+require-naturally-aligned-buffer-access -S -o - %s | FileCheck --check-prefix=ALIGNED %s
+; RUN: opt -mtriple=amdgcn--amdpal -passes=load-store-vectorizer -S -o - %s | FileCheck --check-prefixes=UNALIGNED %s
+
+; The test checks that require-naturally-aligned-buffer-access target feature prevents merging loads if the target load would not be naturally aligned.
+
+define amdgpu_kernel void @merge_align_4(ptr addrspace(7) nocapture %p, ptr addrspace(7) nocapture %p2) #0 {
+;
+; ALIGNED-LABEL: define amdgpu_kernel void @merge_align_4(
+; ALIGNED-SAME: ptr addrspace(7) nocapture [[P:%.*]], ptr addrspace(7) nocapture [[P2:%.*]]) #[[ATTR0:[0-9]+]] {
+; ALIGNED-NEXT:  [[ENTRY:.*:]]
+; ALIGNED-NEXT:    [[GEP_M8:%.*]] = getelementptr i8, ptr addrspace(7) [[P]], i32 -8
+; ALIGNED-NEXT:    [[LD_M8:%.*]] = load i32, ptr addrspace(7) [[GEP_M8]], align 4
+; ALIGNED-NEXT:    [[GEP_M4:%.*]] = getelementptr i8, ptr addrspace(7) [[P]], i32 -4
+; ALIGNED-NEXT:    [[LD_M4:%.*]] = load i32, ptr addrspace(7) [[GEP_M4]], align 4
+; ALIGNED-NEXT:    [[GEP_0:%.*]] = getelementptr i8, ptr addrspace(7) [[P]], i32 0
+; ALIGNED-NEXT:    [[LD_0:%.*]] = load i32, ptr addrspace(7) [[GEP_0]], align 4
+; ALIGNED-NEXT:    [[GEP_4:%.*]] = getelementptr i8, ptr addrspace(7) [[P]], i64 4
+; ALIGNED-NEXT:    [[LD_4:%.*]] = load i32, ptr addrspace(7) [[GEP_4]], align 4
+; ALIGNED-NEXT:    ret void
+;
+; UNALIGNED-LABEL: define amdgpu_kernel void @merge_align_4(
+; UNALIGNED-SAME: ptr addrspace(7) nocapture [[P:%.*]], ptr addrspace(7) nocapture [[P2:%.*]]) {
+; UNALIGNED-NEXT:  [[ENTRY:.*:]]
+; UNALIGNED-NEXT:    [[GEP_M8:%.*]] = getelementptr i8, ptr addrspace(7) [[P]], i32 -8
+; UNALIGNED-NEXT:    [[TMP0:%.*]] = load <4 x i32>, ptr addrspace(7) [[GEP_M8]], align 4
+; UNALIGNED-NEXT:    [[LD_M81:%.*]] = extractelement <4 x i32> [[TMP0]], i32 0
+; UNALIGNED-NEXT:    [[LD_M42:%.*]] = extractelement <4 x i32> [[TMP0]], i32 1
+; UNALIGNED-NEXT:    [[LD_03:%.*]] = extractelement <4 x i32> [[TMP0]], i32 2
+; UNALIGNED-NEXT:    [[LD_44:%.*]] = extractelement <4 x i32> [[TMP0]], i32 3
+; UNALIGNED-NEXT:    ret void
+;
+entry:
+  %gep_m8 = getelementptr i8, ptr addrspace(7) %p, i32 -8
+  %ld_m8 = load i32, ptr addrspace(7) %gep_m8, align 4
+  %gep_m4 = getelementptr i8, ptr addrspace(7) %p, i32 -4
+  %ld_m4 = load i32, ptr addrspace(7) %gep_m4, align 4
+  %gep_0 = getelementptr i8, ptr addrspace(7) %p, i32 0
+  %ld_0 = load i32, ptr addrspace(7) %gep_0, align 4
+  %gep_4 = getelementptr i8, ptr addrspace(7) %p, i64 4
+  %ld_4 = load i32, ptr addrspace(7) %gep_4, align 4
+  ret void
+}
+
+; The test checks that require-naturally-aligned-buffer-access target feature does not prevent merging loads if the target load would be naturally aligned.
+
+define amdgpu_kernel void @merge_align_16(ptr addrspace(7) nocapture %p, ptr addrspace(7) nocapture %p2) #0 {
+; ALIGNED-LABEL: define amdgpu_kernel void @merge_align_16(
+; ALIGNED-SAME: ptr addrspace(7) nocapture [[P:%.*]], ptr addrspace(7) nocapture [[P2:%.*]]) #[[ATTR0]] {
+; ALIGNED-NEXT:  [[ENTRY:.*:]]
+; ALIGNED-NEXT:    [[GEP_M8:%.*]] = getelementptr i8, ptr addrspace(7) [[P]], i32 -8
+; ALIGNED-NEXT:    [[TMP0:%.*]] = load <4 x i32>, ptr addrspace(7) [[GEP_M8]], align 16
+; ALIGNED-NEXT:    [[LD_M81:%.*]] = extractelement <4 x i32> [[TMP0]], i32 0
+; ALIGNED-NEXT:    [[LD_M42:%.*]] = extractelement <4 x i32> [[TMP0]], i32 1
+; ALIGNED-NEXT:    [[LD_03:%.*]] = extractelement <4 x i32> [[TMP0]], i32 2
+; ALIGNED-NEXT:    [[LD_44:%.*]] = extractelement <4 x i32> [[TMP0]], i32 3
+; ALIGNED-NEXT:    ret void
+;
+; UNALIGNED-LABEL: define amdgpu_kernel void @merge_align_16(
+; UNALIGNED-SAME: ptr addrspace(7) nocapture [[P:%.*]], ptr addrspace(7) nocapture [[P2:%.*]]) {
+; UNALIGNED-NEXT:  [[ENTRY:.*:]]
+; UNALIGNED-NEXT:    [[GEP_M8:%.*]] = getelementptr i8, ptr addrspace(7) [[P]], i32 -8
+; UNALIGNED-NEXT:    [[TMP0:%.*]] = load <4 x i32>, ptr addrspace(7) [[GEP_M8]], align 16
+; UNALIGNED-NEXT:    [[LD_M81:%.*]] = extractelement <4 x i32> [[TMP0]], i32 0
+; UNALIGNED-NEXT:    [[LD_M42:%.*]] = extractelement <4 x i32> [[TMP0]], i32 1
+; UNALIGNED-NEXT:    [[LD_03:%.*]] = extractelement <4 x i32> [[TMP0]], i32 2
+; UNALIGNED-NEXT:    [[LD_44:%.*]] = extractelement <4 x i32> [[TMP0]], i32 3
+; UNALIGNED-NEXT:    ret void
+;
+entry:
+  %gep_m8 = getelementptr i8, ptr addrspace(7) %p, i32 -8
+  %ld_m8 = load i32, ptr addrspace(7) %gep_m8, align 16
+  %gep_m4 = getelementptr i8, ptr addrspace(7) %p, i32 -4
+  %ld_m4 = load i32, ptr addrspace(7) %gep_m4, align 4
+  %gep_0 = getelementptr i8, ptr addrspace(7) %p, i32 0
+  %ld_0 = load i32, ptr addrspace(7) %gep_0, align 8
+  %gep_4 = getelementptr i8, ptr addrspace(7) %p, i64 4
+  %ld_4 = load i32, ptr addrspace(7) %gep_4, align 4
+  ret void
+}

@arsenm (Contributor) left a comment

This needs codegen tests. I would expect this to touch addressing mode matching. I don't believe this is sufficient for correct handling, and will only partially avoid introducing new violations

@piotrAMD (Collaborator, Author)

This needs codegen tests. I would expect this to touch addressing mode matching. I don't believe this is sufficient for correct handling, and will only partially avoid introducing new violations

Right, the current patch already fixes the CTS issue I came across, but as you say, there might be more places that need fixing. I will take a look and will at least add an extra test.

@piotrAMD (Collaborator, Author)

Added a codegen test that demonstrates a problem with under-aligned memory accesses (and the new target feature). This will need a separate patch to fix, so I am adding a FIXME in the test.
I would expect the loads/stores to get split in that case. We already have a path in the lowering for regular loads (SplitVectorLoad), but it is too late for buffer loads because they get converted to intrinsics in amdgpu-lower-buffer-fat-pointers. It looks like that pass would need to be extended to handle this case.

@arsenm (Contributor) left a comment

Also handle globalisel

@piotrAMD (Collaborator, Author)

Added GlobalISel coverage to the codegen test and added more descriptive text.

@piotrAMD (Collaborator, Author)

Can this be merged?
The only unaddressed point is potentially inverting the feature so that naturally aligned accesses are required by default, but I do not think that is necessary. As far as I know, no API apart from the Vulkan VK_EXT_robustness2 extension mentioned earlier mandates this specific out-of-bounds behavior, and I do not want to penalize the others, as requiring stricter alignment by default will likely produce suboptimal code.

@piotrAMD piotrAMD requested a review from jayfoad January 15, 2025 14:07
@jayfoad (Contributor) commented Jan 29, 2025

When set, it will disallow buffer accesses with alignment lower than natural alignment.

To be more precise, it will prevent passes like SILoadStoreOptimizer from introducing new loads/stores (by combining existing loads/stores) that are not naturally aligned, right? Is this based on known alignment of the access, or just by checking the alignment of the offset from the start of the buffer?

I think this will fix the bounds checking problem that you explain, but only if we can guarantee that the start of the buffer is at least as aligned as the largest buffer load/store we can generate, which is 16 bytes for a dwordx4 access.
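To sanity-check that reasoning, here is a standalone arithmetic sketch (plain C++, not LLVM code). It assumes the known IR alignment of a naturally aligned access reflects its byte offset from the buffer start, which is exactly what the 16-byte base-alignment guarantee mentioned above provides for access sizes up to a dwordx4:

#include <cassert>
#include <cstdint>

int main() {
  // For power-of-two access sizes up to 16 bytes (dwordx4), walk over
  // naturally aligned byte offsets around the start of the buffer and check
  // that every access lies either entirely before offset 0 or entirely at or
  // after it. In other words, a naturally aligned access can never start
  // out-of-bounds and then enter in-bounds at offset 0.
  for (int64_t Size = 1; Size <= 16; Size *= 2)
    for (int64_t Offset = -64; Offset <= 64; Offset += Size)
      assert(Offset >= 0 || Offset + Size <= 0); // never straddles offset 0
  return 0;
}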

@piotrAMD (Collaborator, Author)

When set, it will disallow buffer accesses with alignment lower than natural alignment.

To be more precise, it will prevent passes like SILoadStoreOptimizer from introducing new loads/stores (by combining existing loads/stores) that are not naturally aligned, right? Is this based on known alignment of the access, or just by checking the alignment of the offset from the start of the buffer?

Right, this is about the resulting merged load being naturally aligned, and it is based on the known alignment of the access. See also the test llvm/test/Transforms/LoadStoreVectorizer/AMDGPU/unaligned-buffer.ll.

I think this will fix the bounds checking problem that you explain, but only if we can guarantee that the start of the buffer is at least as aligned as the largest buffer load/store we can generate, which is 16 bytes for a dwordx4 access.

My understanding is that the Vulkan API enforces that for buffers ("A structure has a base alignment equal to the largest base alignment of any of its members", and there are more rules).

If we agree this is the right approach, I will invert the target feature so the stricter alignment is the base requirement, as Matt suggested (this can cause some test churn).
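For reference, a minimal standalone sketch of what the inverted logic could look like, reusing the natural-alignment computation from the posted diff (Size is in bits; natural alignment is PowerOf2Ceil(divideCeil(Size, 8)) bytes). The function and parameter names below are placeholders for illustration, not the actual LLVM API, and the final shape of the check is up to the patch:

#include <cstdint>

// Natural alignment in bytes for an access of SizeInBits, mirroring
// PowerOf2Ceil(divideCeil(Size, 8)) from the posted diff.
static uint64_t naturalAlignBytes(uint64_t SizeInBits) {
  uint64_t Bytes = (SizeInBits + 7) / 8; // divideCeil(Size, 8)
  uint64_t Pow2 = 1;
  while (Pow2 < Bytes)                   // PowerOf2Ceil
    Pow2 <<= 1;
  return Pow2;
}

// With the feature inverted, the strict natural-alignment requirement is the
// default for buffer address spaces, and relaxed-buffer-oob-mode opts out.
static bool allowsBufferAlignment(bool HasRelaxedBufferOOBMode,
                                  uint64_t AlignBytes, uint64_t SizeInBits) {
  if (!HasRelaxedBufferOOBMode && AlignBytes < naturalAlignBytes(SizeInBits))
    return false; // under-aligned buffer access rejected by default
  return true;    // remaining target-specific checks omitted in this sketch
}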

@piotrAMD piotrAMD changed the title [AMDGPU] Add target feature require-naturally-aligned-buffer-access [AMDGPU] Fix edge case of buffer OOB handling Feb 14, 2025
@piotrAMD (Collaborator, Author)

Inverted target feature.

def FeatureRelaxedBufferOOBMode : SubtargetFeature<"relaxed-buffer-oob-mode",
"RelaxedBufferOOBMode",
"true",
"Enable relaxed out-of-bounds behavior for buffer accesses"
Review comment (Contributor):

Describe what this means. I'm not sure relaxed is the best name for this

Reply (Collaborator, Author):

I have extended the description. Would it be better to replace "relaxed" with "non-strict"?

@piotrAMD (Collaborator, Author) commented Mar 3, 2025

Ping.

@piotrAMD (Collaborator, Author) commented Mar 6, 2025

Just to elaborate on the motivation for this change.

I discovered a compiler bug where an out-of-bounds access can cause an adjacent, valid access to return zeros (as if it were also out-of-bounds).
This patch fixes it by disallowing aggressive merges.
At the same time, I provide a target feature to opt out of this conservative behavior and keep the status quo for targets that do not care about strict out-of-bounds behavior. In out-of-bounds scenarios, the default code is correct.

If we do not submit this patch one way or another, the bug will persist.

@piotrAMD piotrAMD merged commit 170c0da into llvm:main Mar 7, 2025
11 checks passed
jph-13 pushed a commit to jph-13/llvm-project that referenced this pull request Mar 21, 2025