[AMDGPU] Fix edge case of buffer OOB handling #115479

Merged: 10 commits into llvm:main on Mar 7, 2025

Conversation

@piotrAMD (Collaborator) commented Nov 8, 2024

Strengthen out-of-bounds guarantees for buffer accesses by disallowing buffer accesses with alignment lower than natural alignment.

This is needed specifically to address the edge case where an access starts out-of-bounds and then enters in-bounds, as the hardware would treat the entire access as being out-of-bounds. This is normally not needed for most users, but at least one graphics device extension (VK_EXT_robustness2) has very strict requirements: in-bounds accesses must return the correct value, and out-of-bounds accesses must return zero.

The direct consequence of the patch is that a buffer access at a negative address is not merged by the load-store-vectorizer with one at a positive address, which fixes a CTS test.

Targets that do not care about the new behavior are advised to use the new target feature relaxed-buffer-oob-mode, which maintains the behavior from before the patch.

Add a new target feature require-naturally-aligned-buffer-access to guarantee robust
out-of-bounds behavior. When set, it disallows buffer accesses with alignment
lower than the natural alignment.

This is needed specifically to address the edge case where an access starts
out-of-bounds and then enters in-bounds, as the hardware would treat the entire access
as out-of-bounds. This is normally not needed for most users (hence the target
feature), but at least one graphics device extension (VK_EXT_robustness2) has
very strict requirements: in-bounds accesses must return the correct value,
and out-of-bounds accesses must return zero.

The direct result of the patch is that, when the new target feature is set, a buffer
access at a negative address will not be merged with one at a positive address.
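To make the edge case concrete, below is a minimal sketch in plain C++ (a simplified host-side model, not AMDGPU or LLVM code) of the behavior described above: any access that starts out-of-bounds is treated as entirely out-of-bounds and returns zeros, so merging an in-bounds dword with one at a negative offset silently zeroes data that was valid before the merge. The buffer size and offsets are illustrative only.

#include <cstdint>
#include <cstdio>
#include <vector>

// Simplified model of the behavior described above: an access whose start is
// out-of-bounds returns zeros for the entire access. (The model also zeroes
// accesses running past the end; only the "starts out-of-bounds" case matters
// for this patch.)
static std::vector<uint8_t> bufferLoad(const std::vector<uint8_t> &Buf,
                                       int64_t Offset, unsigned Size) {
  if (Offset < 0 || (uint64_t)Offset + Size > Buf.size())
    return std::vector<uint8_t>(Size, 0);
  return std::vector<uint8_t>(Buf.begin() + Offset, Buf.begin() + Offset + Size);
}

int main() {
  std::vector<uint8_t> Buf(64, 0xAA); // 64 valid bytes

  // Two separate dword loads: the one at offset -4 is out-of-bounds and
  // correctly returns zeros; the one at offset 0 returns real data.
  std::vector<uint8_t> OOB = bufferLoad(Buf, -4, 4);
  std::vector<uint8_t> InBounds = bufferLoad(Buf, 0, 4);

  // After an aggressive merge, the same bytes are fetched by one 8-byte load
  // at offset -4. That access starts out-of-bounds, so all of it returns
  // zeros, including the dword at offset 0 that was valid before the merge.
  std::vector<uint8_t> Merged = bufferLoad(Buf, -4, 8);

  std::printf("oob: %02x, separate: %02x, merged: %02x\n",
              OOB[0], InBounds[0], Merged[4]);
  return 0; // prints "oob: 00, separate: aa, merged: 00"
}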
@llvmbot (Member) commented Nov 8, 2024

@llvm/pr-subscribers-llvm-transforms

@llvm/pr-subscribers-backend-amdgpu

Author: Piotr Sobczak (piotrAMD)

Changes

Add a new target feature require-naturally-aligned-buffer-access to guarantee robust out-of-bounds behavior. When set, it will disallow buffer accesses with alignment lower than natural alignment.

This is needed specifically to address the edge case where an access starts out-of-bounds and then enters in-bounds, as the hardware would treat the entire access as out-of-bounds. This is normally not needed for most users (hence the target feature), but at least one graphics device extension (VK_EXT_robustness2) has very strict requirements: in-bounds accesses must return the correct value, and out-of-bounds accesses must return zero.

The direct result of the patch is that, when the new target feature is set, a buffer access at a negative address will not be merged with one at a positive address.


Full diff: https://github.com/llvm/llvm-project/pull/115479.diff

4 Files Affected:

  • (modified) llvm/lib/Target/AMDGPU/AMDGPU.td (+6)
  • (modified) llvm/lib/Target/AMDGPU/GCNSubtarget.h (+5)
  • (modified) llvm/lib/Target/AMDGPU/SIISelLowering.cpp (+14)
  • (added) llvm/test/Transforms/LoadStoreVectorizer/AMDGPU/unaligned-buffer.ll (+80)
diff --git a/llvm/lib/Target/AMDGPU/AMDGPU.td b/llvm/lib/Target/AMDGPU/AMDGPU.td
index bde61a1f7e58df..8a184a92f016e0 100644
--- a/llvm/lib/Target/AMDGPU/AMDGPU.td
+++ b/llvm/lib/Target/AMDGPU/AMDGPU.td
@@ -119,6 +119,12 @@ def FeatureUnalignedDSAccess : SubtargetFeature<"unaligned-ds-access",
   "Hardware supports unaligned local and region loads and stores"
 >;
 
+def FeatureRequireNaturallyAlignedBufferAccess : SubtargetFeature<"require-naturally-aligned-buffer-access",
+  "RequireNaturallyAlignedBufferAccess",
+  "true",
+  "Requires natural alignment of buffer accesses"
+>;
+
 def FeatureApertureRegs : SubtargetFeature<"aperture-regs",
   "HasApertureRegs",
   "true",
diff --git a/llvm/lib/Target/AMDGPU/GCNSubtarget.h b/llvm/lib/Target/AMDGPU/GCNSubtarget.h
index 6ff964077d8fd0..541e3c0f399e30 100644
--- a/llvm/lib/Target/AMDGPU/GCNSubtarget.h
+++ b/llvm/lib/Target/AMDGPU/GCNSubtarget.h
@@ -76,6 +76,7 @@ class GCNSubtarget final : public AMDGPUGenSubtargetInfo,
   bool BackOffBarrier = false;
   bool UnalignedScratchAccess = false;
   bool UnalignedAccessMode = false;
+  bool RequireNaturallyAlignedBufferAccess = false;
   bool HasApertureRegs = false;
   bool SupportsXNACK = false;
   bool KernargPreload = false;
@@ -600,6 +601,10 @@ class GCNSubtarget final : public AMDGPUGenSubtargetInfo,
     return UnalignedAccessMode;
   }
 
+  bool requiresNaturallyAlignedBufferAccess() const {
+    return RequireNaturallyAlignedBufferAccess;
+  }
+
   bool hasApertureRegs() const {
     return HasApertureRegs;
   }
diff --git a/llvm/lib/Target/AMDGPU/SIISelLowering.cpp b/llvm/lib/Target/AMDGPU/SIISelLowering.cpp
index 419414e5bd993d..d4321eb682dd9b 100644
--- a/llvm/lib/Target/AMDGPU/SIISelLowering.cpp
+++ b/llvm/lib/Target/AMDGPU/SIISelLowering.cpp
@@ -1840,6 +1840,20 @@ bool SITargetLowering::allowsMisalignedMemoryAccessesImpl(
            Subtarget->hasUnalignedBufferAccessEnabled();
   }
 
+  // Check natural alignment of buffer if the target requires it. This is needed
+  // only if robust out-of-bounds guarantees are needed. Normally hardware will
+  // ensure proper out-of-bounds behavior, but in the edge case where an access
+  // starts out-of-bounds and then enters in-bounds, the entire access would be
+  // treated as out-of-bounds. Requiring the natural alignment avoids the
+  // problem.
+  if (AddrSpace == AMDGPUAS::BUFFER_FAT_POINTER ||
+      AddrSpace == AMDGPUAS::BUFFER_RESOURCE ||
+      AddrSpace == AMDGPUAS::BUFFER_STRIDED_POINTER) {
+    if (Subtarget->requiresNaturallyAlignedBufferAccess() &&
+        Alignment < Align(PowerOf2Ceil(divideCeil(Size, 8))))
+      return false;
+  }
+
   // Smaller than dword value must be aligned.
   if (Size < 32)
     return false;
diff --git a/llvm/test/Transforms/LoadStoreVectorizer/AMDGPU/unaligned-buffer.ll b/llvm/test/Transforms/LoadStoreVectorizer/AMDGPU/unaligned-buffer.ll
new file mode 100644
index 00000000000000..0d8a98feecb82b
--- /dev/null
+++ b/llvm/test/Transforms/LoadStoreVectorizer/AMDGPU/unaligned-buffer.ll
@@ -0,0 +1,80 @@
+; NOTE: Assertions have been autogenerated by utils/update_test_checks.py UTC_ARGS: --version 5
+; RUN: opt -mtriple=amdgcn--amdpal -passes=load-store-vectorizer -mattr=+require-naturally-aligned-buffer-access -S -o - %s | FileCheck --check-prefix=ALIGNED %s
+; RUN: opt -mtriple=amdgcn--amdpal -passes=load-store-vectorizer -S -o - %s | FileCheck --check-prefixes=UNALIGNED %s
+
+; The test checks that require-naturally-aligned-buffer-access target feature prevents merging loads if the target load would not be naturally aligned.
+
+define amdgpu_kernel void @merge_align_4(ptr addrspace(7) nocapture %p, ptr addrspace(7) nocapture %p2) #0 {
+;
+; ALIGNED-LABEL: define amdgpu_kernel void @merge_align_4(
+; ALIGNED-SAME: ptr addrspace(7) nocapture [[P:%.*]], ptr addrspace(7) nocapture [[P2:%.*]]) #[[ATTR0:[0-9]+]] {
+; ALIGNED-NEXT:  [[ENTRY:.*:]]
+; ALIGNED-NEXT:    [[GEP_M8:%.*]] = getelementptr i8, ptr addrspace(7) [[P]], i32 -8
+; ALIGNED-NEXT:    [[LD_M8:%.*]] = load i32, ptr addrspace(7) [[GEP_M8]], align 4
+; ALIGNED-NEXT:    [[GEP_M4:%.*]] = getelementptr i8, ptr addrspace(7) [[P]], i32 -4
+; ALIGNED-NEXT:    [[LD_M4:%.*]] = load i32, ptr addrspace(7) [[GEP_M4]], align 4
+; ALIGNED-NEXT:    [[GEP_0:%.*]] = getelementptr i8, ptr addrspace(7) [[P]], i32 0
+; ALIGNED-NEXT:    [[LD_0:%.*]] = load i32, ptr addrspace(7) [[GEP_0]], align 4
+; ALIGNED-NEXT:    [[GEP_4:%.*]] = getelementptr i8, ptr addrspace(7) [[P]], i64 4
+; ALIGNED-NEXT:    [[LD_4:%.*]] = load i32, ptr addrspace(7) [[GEP_4]], align 4
+; ALIGNED-NEXT:    ret void
+;
+; UNALIGNED-LABEL: define amdgpu_kernel void @merge_align_4(
+; UNALIGNED-SAME: ptr addrspace(7) nocapture [[P:%.*]], ptr addrspace(7) nocapture [[P2:%.*]]) {
+; UNALIGNED-NEXT:  [[ENTRY:.*:]]
+; UNALIGNED-NEXT:    [[GEP_M8:%.*]] = getelementptr i8, ptr addrspace(7) [[P]], i32 -8
+; UNALIGNED-NEXT:    [[TMP0:%.*]] = load <4 x i32>, ptr addrspace(7) [[GEP_M8]], align 4
+; UNALIGNED-NEXT:    [[LD_M81:%.*]] = extractelement <4 x i32> [[TMP0]], i32 0
+; UNALIGNED-NEXT:    [[LD_M42:%.*]] = extractelement <4 x i32> [[TMP0]], i32 1
+; UNALIGNED-NEXT:    [[LD_03:%.*]] = extractelement <4 x i32> [[TMP0]], i32 2
+; UNALIGNED-NEXT:    [[LD_44:%.*]] = extractelement <4 x i32> [[TMP0]], i32 3
+; UNALIGNED-NEXT:    ret void
+;
+entry:
+  %gep_m8 = getelementptr i8, ptr addrspace(7) %p, i32 -8
+  %ld_m8 = load i32, ptr addrspace(7) %gep_m8, align 4
+  %gep_m4 = getelementptr i8, ptr addrspace(7) %p, i32 -4
+  %ld_m4 = load i32, ptr addrspace(7) %gep_m4, align 4
+  %gep_0 = getelementptr i8, ptr addrspace(7) %p, i32 0
+  %ld_0 = load i32, ptr addrspace(7) %gep_0, align 4
+  %gep_4 = getelementptr i8, ptr addrspace(7) %p, i64 4
+  %ld_4 = load i32, ptr addrspace(7) %gep_4, align 4
+  ret void
+}
+
+; The test checks that require-naturally-aligned-buffer-access target feature does not prevent merging loads if the target load would be naturally aligned.
+
+define amdgpu_kernel void @merge_align_16(ptr addrspace(7) nocapture %p, ptr addrspace(7) nocapture %p2) #0 {
+; ALIGNED-LABEL: define amdgpu_kernel void @merge_align_16(
+; ALIGNED-SAME: ptr addrspace(7) nocapture [[P:%.*]], ptr addrspace(7) nocapture [[P2:%.*]]) #[[ATTR0]] {
+; ALIGNED-NEXT:  [[ENTRY:.*:]]
+; ALIGNED-NEXT:    [[GEP_M8:%.*]] = getelementptr i8, ptr addrspace(7) [[P]], i32 -8
+; ALIGNED-NEXT:    [[TMP0:%.*]] = load <4 x i32>, ptr addrspace(7) [[GEP_M8]], align 16
+; ALIGNED-NEXT:    [[LD_M81:%.*]] = extractelement <4 x i32> [[TMP0]], i32 0
+; ALIGNED-NEXT:    [[LD_M42:%.*]] = extractelement <4 x i32> [[TMP0]], i32 1
+; ALIGNED-NEXT:    [[LD_03:%.*]] = extractelement <4 x i32> [[TMP0]], i32 2
+; ALIGNED-NEXT:    [[LD_44:%.*]] = extractelement <4 x i32> [[TMP0]], i32 3
+; ALIGNED-NEXT:    ret void
+;
+; UNALIGNED-LABEL: define amdgpu_kernel void @merge_align_16(
+; UNALIGNED-SAME: ptr addrspace(7) nocapture [[P:%.*]], ptr addrspace(7) nocapture [[P2:%.*]]) {
+; UNALIGNED-NEXT:  [[ENTRY:.*:]]
+; UNALIGNED-NEXT:    [[GEP_M8:%.*]] = getelementptr i8, ptr addrspace(7) [[P]], i32 -8
+; UNALIGNED-NEXT:    [[TMP0:%.*]] = load <4 x i32>, ptr addrspace(7) [[GEP_M8]], align 16
+; UNALIGNED-NEXT:    [[LD_M81:%.*]] = extractelement <4 x i32> [[TMP0]], i32 0
+; UNALIGNED-NEXT:    [[LD_M42:%.*]] = extractelement <4 x i32> [[TMP0]], i32 1
+; UNALIGNED-NEXT:    [[LD_03:%.*]] = extractelement <4 x i32> [[TMP0]], i32 2
+; UNALIGNED-NEXT:    [[LD_44:%.*]] = extractelement <4 x i32> [[TMP0]], i32 3
+; UNALIGNED-NEXT:    ret void
+;
+entry:
+  %gep_m8 = getelementptr i8, ptr addrspace(7) %p, i32 -8
+  %ld_m8 = load i32, ptr addrspace(7) %gep_m8, align 16
+  %gep_m4 = getelementptr i8, ptr addrspace(7) %p, i32 -4
+  %ld_m4 = load i32, ptr addrspace(7) %gep_m4, align 4
+  %gep_0 = getelementptr i8, ptr addrspace(7) %p, i32 0
+  %ld_0 = load i32, ptr addrspace(7) %gep_0, align 8
+  %gep_4 = getelementptr i8, ptr addrspace(7) %p, i64 4
+  %ld_4 = load i32, ptr addrspace(7) %gep_4, align 4
+  ret void
+}

@arsenm (Contributor) left a comment

This needs codegen tests. I would expect this to touch addressing mode matching. I don't believe this is sufficient for correct handling, and will only partially avoid introducing new violations

@piotrAMD (Collaborator, Author)

This needs codegen tests. I would expect this to touch addressing mode matching. I don't believe this is sufficient for correct handling, and will only partially avoid introducing new violations

Right, the current patch already fixes the CTS issue I came across, but as you say, there might be more places that need fixing. I will take a look and will at least add an extra test.

@piotrAMD (Collaborator, Author)

Added a codegen test that demonstrates a problem with under-aligned memory accesses (and the new target feature). This will need a separate patch to fix, so I am adding a FIXME in the test.
I would expect the loads/stores to get split in that case. We already have a path in the lowering for regular loads (SplitVectorLoad), but it is too late for buffer loads because they get converted to intrinsics in amdgpu-lower-buffer-fat-pointers. It looks like that pass would need to be extended to handle this case.

@arsenm (Contributor) left a comment

Also handle globalisel

@piotrAMD (Collaborator, Author)

Added GlobalISel coverage to the codegen test and added more descriptive text.

@piotrAMD (Collaborator, Author)

Can this be merged?
The only unaddressed point is potentially inverting the feature so that naturally aligned accesses are required by default, but I do not think that is necessary. As far as I know, no API apart from the Vulkan VK_EXT_robustness2 extension mentioned earlier mandates this specific out-of-bounds behavior, and I do not want to penalize the others, as requiring stricter alignment by default will likely produce suboptimal code.

@piotrAMD piotrAMD requested a review from jayfoad January 15, 2025 14:07
@jayfoad (Contributor) commented Jan 29, 2025

When set, it will disallow buffer accesses with alignment lower than natural alignment.

To be more precise, it will prevent passes like SILoadStoreOptimizer from introducing new loads/stores (by combining existing loads/stores) that are not naturally aligned, right? Is this based on known alignment of the access, or just by checking the alignment of the offset from the start of the buffer?

I think this will fix the bounds checking problem that you explain, but only if we can guarantee that the start of the buffer is at least as aligned as the largest buffer load/store we can generate, which is 16 bytes for a dwordx4 access.
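To sanity-check that reasoning, here is a standalone arithmetic sketch (plain C++, not LLVM code). It assumes the known IR alignment of a naturally aligned access reflects its byte offset from the buffer start, which is exactly what the 16-byte base-alignment guarantee mentioned above provides for access sizes up to a dwordx4:

#include <cassert>
#include <cstdint>

int main() {
  // For power-of-two access sizes up to 16 bytes (dwordx4), walk over
  // naturally aligned byte offsets around the start of the buffer and check
  // that every access lies either entirely before offset 0 or entirely at or
  // after it. In other words, a naturally aligned access can never start
  // out-of-bounds and then enter in-bounds at offset 0.
  for (int64_t Size = 1; Size <= 16; Size *= 2)
    for (int64_t Offset = -64; Offset <= 64; Offset += Size)
      assert(Offset >= 0 || Offset + Size <= 0); // never straddles offset 0
  return 0;
}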

@piotrAMD (Collaborator, Author)

When set, it will disallow buffer accesses with alignment lower than natural alignment.

To be more precise, it will prevent passes like SILoadStoreOptimizer from introducing new loads/stores (by combining existing loads/stores) that are not naturally aligned, right? Is this based on known alignment of the access, or just by checking the alignment of the offset from the start of the buffer?

Right, this is about the resulting merged load being naturally aligned, and it is based on the known alignment of the access. See also the test llvm/test/Transforms/LoadStoreVectorizer/AMDGPU/unaligned-buffer.ll.

I think this will fix the bounds checking problem that you explain, but only if we can guarantee that the start of the buffer is at least as aligned as the largest buffer load/store we can generate, which is 16 bytes for a dwordx4 access.

My understanding is that the Vulkan API enforces that for buffers ("A structure has a base alignment equal to the largest base alignment of any of its members", and there are more rules).

If we agree this is the right approach, I will invert the target feature so the stricter alignment is the base requirement, as Matt suggested (this can cause some test churn).
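For reference, a minimal standalone sketch of what the inverted logic could look like, reusing the natural-alignment computation from the posted diff (Size is in bits; natural alignment is PowerOf2Ceil(divideCeil(Size, 8)) bytes). The function and parameter names below are placeholders for illustration, not the actual LLVM API, and the final shape of the check is up to the patch:

#include <cstdint>

// Natural alignment in bytes for an access of SizeInBits, mirroring
// PowerOf2Ceil(divideCeil(Size, 8)) from the posted diff.
static uint64_t naturalAlignBytes(uint64_t SizeInBits) {
  uint64_t Bytes = (SizeInBits + 7) / 8; // divideCeil(Size, 8)
  uint64_t Pow2 = 1;
  while (Pow2 < Bytes)                   // PowerOf2Ceil
    Pow2 <<= 1;
  return Pow2;
}

// With the feature inverted, the strict natural-alignment requirement is the
// default for buffer address spaces, and relaxed-buffer-oob-mode opts out.
static bool allowsBufferAlignment(bool HasRelaxedBufferOOBMode,
                                  uint64_t AlignBytes, uint64_t SizeInBits) {
  if (!HasRelaxedBufferOOBMode && AlignBytes < naturalAlignBytes(SizeInBits))
    return false; // under-aligned buffer access rejected by default
  return true;    // remaining target-specific checks omitted in this sketch
}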

@piotrAMD piotrAMD changed the title [AMDGPU] Add target feature require-naturally-aligned-buffer-access [AMDGPU] Fix edge case of buffer OOB handling Feb 14, 2025
@piotrAMD (Collaborator, Author)

Inverted target feature.

def FeatureRelaxedBufferOOBMode : SubtargetFeature<"relaxed-buffer-oob-mode",
"RelaxedBufferOOBMode",
"true",
"Enable relaxed out-of-bounds behavior for buffer accesses"
Review comment (Contributor):

Describe what this means. I'm not sure relaxed is the best name for this

Reply (Collaborator, Author):

I have extended the description. Would it be better to replace "relaxed" with "non-strict"?

@piotrAMD (Collaborator, Author) commented Mar 3, 2025

Ping.

@piotrAMD (Collaborator, Author) commented Mar 6, 2025

Just to elaborate on the motivation for this change.

I discovered a compiler bug where an out-of-bounds access can cause an adjacent, valid access to return zeros (as if it were also out-of-bounds).
This patch fixes it by disallowing aggressive merges.
At the same time, I provide a target feature to opt out of this conservative behavior and keep the status quo for targets that do not care about strict out-of-bounds behavior. In out-of-bounds scenarios, the default code is correct.

If we do not submit this patch one way or another, the bug will persist.

@piotrAMD piotrAMD merged commit 170c0da into llvm:main Mar 7, 2025
11 checks passed
jph-13 pushed a commit to jph-13/llvm-project that referenced this pull request Mar 21, 2025