[AMDGPU] Fix edge case of buffer OOB handling #115479
Conversation
Add a new target feature require-naturally-aligned-buffer-access to guarantee robust out-of-bounds behavior. When set, it disallows buffer accesses with alignment lower than the natural alignment. This is needed specifically to address the edge case where an access starts out-of-bounds and then enters in-bounds, as the hardware would treat the entire access as out-of-bounds. Most users do not need this (hence the target feature), but at least one graphics device extension (VK_EXT_robustness2) has very strict requirements: in-bounds accesses must return the correct value, and out-of-bounds accesses must return zero. The direct result of the patch is that, when the new target feature is set, a buffer access at a negative address will not be merged with one at a positive address.
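To make the edge case concrete, here is a minimal LLVM IR sketch of the kind of merge the feature prevents. The function name @oob_merge_sketch is hypothetical and %p is assumed to point at byte 0 of the buffer; this is not taken from the patch's tests.
define amdgpu_kernel void @oob_merge_sketch(ptr addrspace(7) %p) {
entry:
  ; Starts 4 bytes before the buffer: out-of-bounds, so the hardware returns zero.
  %gep_m4 = getelementptr i8, ptr addrspace(7) %p, i32 -4
  %ld_m4 = load i32, ptr addrspace(7) %gep_m4, align 4
  ; Starts at the buffer base: in-bounds, so it must return the stored value.
  %ld_0 = load i32, ptr addrspace(7) %p, align 4
  ; If a pass merges these into one 8-byte load at %gep_m4 with only align 4, the
  ; combined access starts out-of-bounds, and the hardware treats all of it as
  ; out-of-bounds, zeroing the half that was in-bounds.
  ret void
}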
@llvm/pr-subscribers-llvm-transforms @llvm/pr-subscribers-backend-amdgpu Author: Piotr Sobczak (piotrAMD). Changes: same as the PR description above. Full diff: https://github.com/llvm/llvm-project/pull/115479.diff 4 Files Affected:
diff --git a/llvm/lib/Target/AMDGPU/AMDGPU.td b/llvm/lib/Target/AMDGPU/AMDGPU.td
index bde61a1f7e58df..8a184a92f016e0 100644
--- a/llvm/lib/Target/AMDGPU/AMDGPU.td
+++ b/llvm/lib/Target/AMDGPU/AMDGPU.td
@@ -119,6 +119,12 @@ def FeatureUnalignedDSAccess : SubtargetFeature<"unaligned-ds-access",
"Hardware supports unaligned local and region loads and stores"
>;
+def FeatureRequireNaturallyAlignedBufferAccess : SubtargetFeature<"require-naturally-aligned-buffer-access",
+ "RequireNaturallyAlignedBufferAccess",
+ "true",
+ "Requires natural alignment of buffer accesses"
+>;
+
def FeatureApertureRegs : SubtargetFeature<"aperture-regs",
"HasApertureRegs",
"true",
diff --git a/llvm/lib/Target/AMDGPU/GCNSubtarget.h b/llvm/lib/Target/AMDGPU/GCNSubtarget.h
index 6ff964077d8fd0..541e3c0f399e30 100644
--- a/llvm/lib/Target/AMDGPU/GCNSubtarget.h
+++ b/llvm/lib/Target/AMDGPU/GCNSubtarget.h
@@ -76,6 +76,7 @@ class GCNSubtarget final : public AMDGPUGenSubtargetInfo,
bool BackOffBarrier = false;
bool UnalignedScratchAccess = false;
bool UnalignedAccessMode = false;
+ bool RequireNaturallyAlignedBufferAccess = false;
bool HasApertureRegs = false;
bool SupportsXNACK = false;
bool KernargPreload = false;
@@ -600,6 +601,10 @@ class GCNSubtarget final : public AMDGPUGenSubtargetInfo,
return UnalignedAccessMode;
}
+ bool requiresNaturallyAlignedBufferAccess() const {
+ return RequireNaturallyAlignedBufferAccess;
+ }
+
bool hasApertureRegs() const {
return HasApertureRegs;
}
diff --git a/llvm/lib/Target/AMDGPU/SIISelLowering.cpp b/llvm/lib/Target/AMDGPU/SIISelLowering.cpp
index 419414e5bd993d..d4321eb682dd9b 100644
--- a/llvm/lib/Target/AMDGPU/SIISelLowering.cpp
+++ b/llvm/lib/Target/AMDGPU/SIISelLowering.cpp
@@ -1840,6 +1840,20 @@ bool SITargetLowering::allowsMisalignedMemoryAccessesImpl(
Subtarget->hasUnalignedBufferAccessEnabled();
}
+ // Check natural alignment of buffer if the target requires it. This is needed
+ // only if robust out-of-bounds guarantees are needed. Normally hardware will
+ // ensure proper out-of-bounds behavior, but in the edge case where an access
+ // starts out-of-bounds and then enters in-bounds, the entire access would be
+ // treated as out-of-bounds. Requiring the natural alignment avoids the
+ // problem.
+ if (AddrSpace == AMDGPUAS::BUFFER_FAT_POINTER ||
+ AddrSpace == AMDGPUAS::BUFFER_RESOURCE ||
+ AddrSpace == AMDGPUAS::BUFFER_STRIDED_POINTER) {
+ if (Subtarget->requiresNaturallyAlignedBufferAccess() &&
+ Alignment < Align(PowerOf2Ceil(divideCeil(Size, 8))))
+ return false;
+ }
+
// Smaller than dword value must be aligned.
if (Size < 32)
return false;
diff --git a/llvm/test/Transforms/LoadStoreVectorizer/AMDGPU/unaligned-buffer.ll b/llvm/test/Transforms/LoadStoreVectorizer/AMDGPU/unaligned-buffer.ll
new file mode 100644
index 00000000000000..0d8a98feecb82b
--- /dev/null
+++ b/llvm/test/Transforms/LoadStoreVectorizer/AMDGPU/unaligned-buffer.ll
@@ -0,0 +1,80 @@
+; NOTE: Assertions have been autogenerated by utils/update_test_checks.py UTC_ARGS: --version 5
+; RUN: opt -mtriple=amdgcn--amdpal -passes=load-store-vectorizer -mattr=+require-naturally-aligned-buffer-access -S -o - %s | FileCheck --check-prefix=ALIGNED %s
+; RUN: opt -mtriple=amdgcn--amdpal -passes=load-store-vectorizer -S -o - %s | FileCheck --check-prefixes=UNALIGNED %s
+
+; The test checks that require-naturally-aligned-buffer-access target feature prevents merging loads if the target load would not be naturally aligned.
+
+define amdgpu_kernel void @merge_align_4(ptr addrspace(7) nocapture %p, ptr addrspace(7) nocapture %p2) #0 {
+;
+; ALIGNED-LABEL: define amdgpu_kernel void @merge_align_4(
+; ALIGNED-SAME: ptr addrspace(7) nocapture [[P:%.*]], ptr addrspace(7) nocapture [[P2:%.*]]) #[[ATTR0:[0-9]+]] {
+; ALIGNED-NEXT: [[ENTRY:.*:]]
+; ALIGNED-NEXT: [[GEP_M8:%.*]] = getelementptr i8, ptr addrspace(7) [[P]], i32 -8
+; ALIGNED-NEXT: [[LD_M8:%.*]] = load i32, ptr addrspace(7) [[GEP_M8]], align 4
+; ALIGNED-NEXT: [[GEP_M4:%.*]] = getelementptr i8, ptr addrspace(7) [[P]], i32 -4
+; ALIGNED-NEXT: [[LD_M4:%.*]] = load i32, ptr addrspace(7) [[GEP_M4]], align 4
+; ALIGNED-NEXT: [[GEP_0:%.*]] = getelementptr i8, ptr addrspace(7) [[P]], i32 0
+; ALIGNED-NEXT: [[LD_0:%.*]] = load i32, ptr addrspace(7) [[GEP_0]], align 4
+; ALIGNED-NEXT: [[GEP_4:%.*]] = getelementptr i8, ptr addrspace(7) [[P]], i64 4
+; ALIGNED-NEXT: [[LD_4:%.*]] = load i32, ptr addrspace(7) [[GEP_4]], align 4
+; ALIGNED-NEXT: ret void
+;
+; UNALIGNED-LABEL: define amdgpu_kernel void @merge_align_4(
+; UNALIGNED-SAME: ptr addrspace(7) nocapture [[P:%.*]], ptr addrspace(7) nocapture [[P2:%.*]]) {
+; UNALIGNED-NEXT: [[ENTRY:.*:]]
+; UNALIGNED-NEXT: [[GEP_M8:%.*]] = getelementptr i8, ptr addrspace(7) [[P]], i32 -8
+; UNALIGNED-NEXT: [[TMP0:%.*]] = load <4 x i32>, ptr addrspace(7) [[GEP_M8]], align 4
+; UNALIGNED-NEXT: [[LD_M81:%.*]] = extractelement <4 x i32> [[TMP0]], i32 0
+; UNALIGNED-NEXT: [[LD_M42:%.*]] = extractelement <4 x i32> [[TMP0]], i32 1
+; UNALIGNED-NEXT: [[LD_03:%.*]] = extractelement <4 x i32> [[TMP0]], i32 2
+; UNALIGNED-NEXT: [[LD_44:%.*]] = extractelement <4 x i32> [[TMP0]], i32 3
+; UNALIGNED-NEXT: ret void
+;
+entry:
+ %gep_m8 = getelementptr i8, ptr addrspace(7) %p, i32 -8
+ %ld_m8 = load i32, ptr addrspace(7) %gep_m8, align 4
+ %gep_m4 = getelementptr i8, ptr addrspace(7) %p, i32 -4
+ %ld_m4 = load i32, ptr addrspace(7) %gep_m4, align 4
+ %gep_0 = getelementptr i8, ptr addrspace(7) %p, i32 0
+ %ld_0 = load i32, ptr addrspace(7) %gep_0, align 4
+ %gep_4 = getelementptr i8, ptr addrspace(7) %p, i64 4
+ %ld_4 = load i32, ptr addrspace(7) %gep_4, align 4
+ ret void
+}
+
+; The test checks that require-naturally-aligned-buffer-access target feature does not prevent merging loads if the target load would be naturally aligned.
+
+define amdgpu_kernel void @merge_align_16(ptr addrspace(7) nocapture %p, ptr addrspace(7) nocapture %p2) #0 {
+; ALIGNED-LABEL: define amdgpu_kernel void @merge_align_16(
+; ALIGNED-SAME: ptr addrspace(7) nocapture [[P:%.*]], ptr addrspace(7) nocapture [[P2:%.*]]) #[[ATTR0]] {
+; ALIGNED-NEXT: [[ENTRY:.*:]]
+; ALIGNED-NEXT: [[GEP_M8:%.*]] = getelementptr i8, ptr addrspace(7) [[P]], i32 -8
+; ALIGNED-NEXT: [[TMP0:%.*]] = load <4 x i32>, ptr addrspace(7) [[GEP_M8]], align 16
+; ALIGNED-NEXT: [[LD_M81:%.*]] = extractelement <4 x i32> [[TMP0]], i32 0
+; ALIGNED-NEXT: [[LD_M42:%.*]] = extractelement <4 x i32> [[TMP0]], i32 1
+; ALIGNED-NEXT: [[LD_03:%.*]] = extractelement <4 x i32> [[TMP0]], i32 2
+; ALIGNED-NEXT: [[LD_44:%.*]] = extractelement <4 x i32> [[TMP0]], i32 3
+; ALIGNED-NEXT: ret void
+;
+; UNALIGNED-LABEL: define amdgpu_kernel void @merge_align_16(
+; UNALIGNED-SAME: ptr addrspace(7) nocapture [[P:%.*]], ptr addrspace(7) nocapture [[P2:%.*]]) {
+; UNALIGNED-NEXT: [[ENTRY:.*:]]
+; UNALIGNED-NEXT: [[GEP_M8:%.*]] = getelementptr i8, ptr addrspace(7) [[P]], i32 -8
+; UNALIGNED-NEXT: [[TMP0:%.*]] = load <4 x i32>, ptr addrspace(7) [[GEP_M8]], align 16
+; UNALIGNED-NEXT: [[LD_M81:%.*]] = extractelement <4 x i32> [[TMP0]], i32 0
+; UNALIGNED-NEXT: [[LD_M42:%.*]] = extractelement <4 x i32> [[TMP0]], i32 1
+; UNALIGNED-NEXT: [[LD_03:%.*]] = extractelement <4 x i32> [[TMP0]], i32 2
+; UNALIGNED-NEXT: [[LD_44:%.*]] = extractelement <4 x i32> [[TMP0]], i32 3
+; UNALIGNED-NEXT: ret void
+;
+entry:
+ %gep_m8 = getelementptr i8, ptr addrspace(7) %p, i32 -8
+ %ld_m8 = load i32, ptr addrspace(7) %gep_m8, align 16
+ %gep_m4 = getelementptr i8, ptr addrspace(7) %p, i32 -4
+ %ld_m4 = load i32, ptr addrspace(7) %gep_m4, align 4
+ %gep_0 = getelementptr i8, ptr addrspace(7) %p, i32 0
+ %ld_0 = load i32, ptr addrspace(7) %gep_0, align 8
+ %gep_4 = getelementptr i8, ptr addrspace(7) %p, i64 4
+ %ld_4 = load i32, ptr addrspace(7) %gep_4, align 4
+ ret void
+}
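As a rough illustration of the new check in allowsMisalignedMemoryAccessesImpl: the required natural alignment is the access size in bytes rounded up to a power of two. The sketch below uses a hypothetical function @natural_alignment_sketch and pointer %q with illustrative values; it is not part of the patch's tests.
define void @natural_alignment_sketch(ptr addrspace(7) %q) {
entry:
  ; 96-bit access: divideCeil(96, 8) = 12 bytes, PowerOf2Ceil(12) = 16, so with the
  ; feature enabled an align 8 version of this load would be rejected; align 16 passes.
  %a = load <3 x i32>, ptr addrspace(7) %q, align 16
  ; 32-bit access: natural alignment is 4 bytes, so an ordinary align 4 load is unaffected.
  %b = load i32, ptr addrspace(7) %q, align 4
  ret void
}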
This needs codegen tests. I would expect this to touch addressing mode matching. I don't believe this is sufficient for correct handling, and will only partially avoid introducing new violations
Right, the current patch already fixes the CTS issue I came across, but as you say, there might be more places that need fixing. I will take a look and at least will add an extra test.
Added a codegen test that demonstrates a problem with under-aligned memory accesses (and the new target feature). This will need a different patch to fix; adding a FIXME in the test.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also handle globalisel
Added global-isel to the codegen test and added more descriptive text.
Can this be merged?
To be more precise, it will prevent passes like SILoadStoreOptimizer from introducing new loads/stores (by combining existing loads/stores) that are not naturally aligned, right? Is this based on the known alignment of the access, or just on the alignment of the offset from the start of the buffer? I think this will fix the bounds-checking problem that you explain, but only if we can guarantee that the start of the buffer is at least as aligned as the largest buffer load/store we can generate, which is 16 bytes for a dwordx4 access.
Right, this is about the target load being naturally aligned. This is based on the known alignment of the access. See also the test llvm/test/Transforms/LoadStoreVectorizer/AMDGPU/unaligned-buffer.ll.
My understanding is that the Vulkan API enforces that for buffers ("A structure has a base alignment equal to the largest base alignment of any of its members", and there are more rules). If we agree this is the right approach, I will invert the target feature so that the stricter alignment is the base requirement, as Matt suggested (this can cause some test churn).
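To illustrate the base-alignment concern with made-up numbers: if the buffer's first byte sits at a 16-byte-aligned address B, a naturally aligned 16-byte access covers either [B-16, B) or [B, B+16), so it can never start out-of-bounds and run into the buffer. If instead the buffer started at B = 16k + 4, a 16-byte access at address 16k would cover [16k, 16k+16), straddling the boundary with its first 4 bytes out-of-bounds, which is exactly the case the alignment check alone cannot rule out without the base-alignment guarantee.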
Inverted the target feature.
llvm/lib/Target/AMDGPU/AMDGPU.td
Outdated
def FeatureRelaxedBufferOOBMode : SubtargetFeature<"relaxed-buffer-oob-mode",
  "RelaxedBufferOOBMode",
  "true",
  "Enable relaxed out-of-bounds behavior for buffer accesses"
Describe what this means. I'm not sure relaxed is the best name for this
I have extended the description. Would it be better to replace "relaxed" with "non-strict"?
Ping.
Just to elaborate on the motivation for this change: I had discovered a compiler bug where an out-of-bounds access can cause an adjacent, valid access to return zeros (as if it were also out-of-bounds). If we do not submit this patch in one form or another, the bug will persist.
Strengthen out-of-bounds guarantees for buffer accesses by disallowing buffer accesses with alignment lower than natural alignment.
This specifically addresses the edge case where an access starts out-of-bounds and then enters in-bounds, as the hardware would treat the entire access as out-of-bounds. Most users do not need this, but at least one graphics device extension (VK_EXT_robustness2) has very strict requirements: in-bounds accesses must return the correct value, and out-of-bounds accesses must return zero.
The direct consequence of the patch is that a buffer access at a negative address is not merged by the load-store vectorizer with one at a positive address, which fixes a CTS test.
Targets that do not need the new guarantees can use the new target feature relaxed-buffer-oob-mode, which maintains the behavior from before the patch.
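For such targets, the feature would be passed the same way the tests in this patch pass target features; a sketch of the invocation (input.ll is a placeholder file name, and the feature name is as proposed above):
opt -mtriple=amdgcn--amdpal -passes=load-store-vectorizer -mattr=+relaxed-buffer-oob-mode -S -o - input.ll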