[AArch64] Use dupq (SVE2.1) for segmented lane splats #144482
Conversation
✅ With the latest revision this PR passed the C/C++ code formatter.
%splat.lanes = shufflevector <32 x i8> %load, <32 x i8> poison, <32 x i32> <i32 11, i32 11, i32 11, i32 11, i32 11, i32 11, i32 11, i32 11, i32 11, i32 11, i32 11, i32 11, i32 11, i32 11, i32 11, i32 11,
                                                                            i32 27, i32 27, i32 27, i32 27, i32 27, i32 27, i32 27, i32 27, i32 27, i32 27, i32 27, i32 27, i32 27, i32 27, i32 27, i32 27>
Maybe choose 15 and 31 here as indices to test boundaries?
I figured a mix would work, as I have the 3,7 indices for i32 values, but sure.
done
* Remove sdnode and use intrinsic
for (unsigned I = 0; I < Segments; ++I) {
  unsigned Broadcast = (unsigned)M[I * SegmentElts];
  if (Broadcast - (I * SegmentElts) > SegmentElts)
    return std::nullopt;
  for (unsigned J = 0; J < SegmentElts; ++J) {
    int Idx = M[(I * SegmentElts) + J];
    if ((unsigned)Idx != Broadcast)
      return std::nullopt;
  }
}
nit: it might be easier to follow if we'd write this with a single loop instead, using something like this:
for (unsigned I = 0; I < M.size(); ++I) {
  if (M[I] != (M[0] + ((I / SegmentElts) * SegmentElts)))
    return std::nullopt;
}
return M[0];
what do you think?
That discards the range check (i.e. whether the splatted index value refers to a lane within the current segment). I guess we could have a second loop to check that afterwards.
Honestly, I wanted to use all_equal() on a std::span instead of the inner loop, but we don't have span yet (C++20). I could perhaps make a second ArrayRef instead, since it's just a pair of iterators.
For the range check, I guess you'd only need to check M[0] (because we're not supporting poison values)?
done
@llvm/pr-subscribers-backend-aarch64

Author: Graham Hunter (huntergr-arm)

Changes

Use the dupq instructions (when available) to represent a splat of the same lane within each 128b segment of a wider fixed vector.

Full diff: https://github.com/llvm/llvm-project/pull/144482.diff

2 Files Affected:
diff --git a/llvm/lib/Target/AArch64/AArch64ISelLowering.cpp b/llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
index 7519ac5260a64..1e7e40ab663be 100644
--- a/llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
+++ b/llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
@@ -13430,6 +13430,30 @@ static bool isUZP_v_undef_Mask(ArrayRef<int> M, EVT VT, unsigned &WhichResult) {
return true;
}
+/// isDUPQMask - matches a splat of equivalent lanes within 128b segments in
+/// the first vector operand.
+static std::optional<unsigned> isDUPQMask(ArrayRef<int> M, EVT VT) {
+ unsigned Lane = (unsigned)M[0];
+ unsigned Segments = VT.getFixedSizeInBits() / 128;
+ unsigned SegmentElts = VT.getVectorNumElements() / Segments;
+
+ // Make sure there's no size changes.
+ if (SegmentElts * Segments != M.size())
+ return std::nullopt;
+
+ // Check that the first index corresponds to one of the lanes in the first
+ // segment.
+ if ((unsigned)M[0] >= SegmentElts)
+ return std::nullopt;
+
+ // Check that all lanes match the first, adjusted for segment.
+ for (unsigned I = 0; I < M.size(); ++I)
+ if ((unsigned)M[I] != ((unsigned)M[0] + ((I / SegmentElts) * SegmentElts)))
+ return std::nullopt;
+
+ return Lane;
+}
+
/// isTRN_v_undef_Mask - Special case of isTRNMask for canonical form of
/// "vector_shuffle v, v", i.e., "vector_shuffle v, undef".
/// Mask is e.g., <0, 0, 2, 2> instead of <0, 4, 2, 6>.
@@ -30013,6 +30037,19 @@ SDValue AArch64TargetLowering::LowerFixedLengthVECTOR_SHUFFLEToSVE(
return convertFromScalableVector(
DAG, VT, DAG.getNode(Opc, DL, ContainerVT, Op1, Op1));
}
+
+ if (Subtarget->hasSVE2p1()) {
+ if (std::optional<unsigned> Lane = isDUPQMask(ShuffleMask, VT)) {
+ SDValue IID =
+ DAG.getConstant(Intrinsic::aarch64_sve_dup_laneq, DL, MVT::i64);
+ return convertFromScalableVector(
+ DAG, VT,
+ DAG.getNode(ISD::INTRINSIC_WO_CHAIN, DL, {ContainerVT, MVT::i64},
+ {IID, Op1,
+ DAG.getConstant(*Lane, DL, MVT::i64,
+ /*isTarget=*/true)}));
+ }
+ }
}
// Try to widen the shuffle before generating a possibly expensive SVE TBL.
diff --git a/llvm/test/CodeGen/AArch64/sve2p1-vector-shuffles.ll b/llvm/test/CodeGen/AArch64/sve2p1-vector-shuffles.ll
new file mode 100644
index 0000000000000..40d4d0ff60148
--- /dev/null
+++ b/llvm/test/CodeGen/AArch64/sve2p1-vector-shuffles.ll
@@ -0,0 +1,115 @@
+; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py UTC_ARGS: --version 5
+; RUN: llc -mtriple=aarch64-linux-gnu < %s | FileCheck %s
+
+define void @dupq_i8_256b(ptr %addr) #0 {
+; CHECK-LABEL: dupq_i8_256b:
+; CHECK: // %bb.0:
+; CHECK-NEXT: ldr z0, [x0]
+; CHECK-NEXT: dupq z0.b, z0.b[15]
+; CHECK-NEXT: str z0, [x0]
+; CHECK-NEXT: ret
+ %load = load <32 x i8>, ptr %addr
+ %splat.lanes = shufflevector <32 x i8> %load, <32 x i8> poison, <32 x i32> <i32 15, i32 15, i32 15, i32 15, i32 15, i32 15, i32 15, i32 15, i32 15, i32 15, i32 15, i32 15, i32 15, i32 15, i32 15, i32 15,
+ i32 31, i32 31, i32 31, i32 31, i32 31, i32 31, i32 31, i32 31, i32 31, i32 31, i32 31, i32 31, i32 31, i32 31, i32 31, i32 31>
+ store <32 x i8> %splat.lanes, ptr %addr
+ ret void
+}
+
+define void @dupq_i16_256b(ptr %addr) #0 {
+; CHECK-LABEL: dupq_i16_256b:
+; CHECK: // %bb.0:
+; CHECK-NEXT: ldr z0, [x0]
+; CHECK-NEXT: dupq z0.h, z0.h[2]
+; CHECK-NEXT: str z0, [x0]
+; CHECK-NEXT: ret
+ %load = load <16 x i16>, ptr %addr
+ %splat.lanes = shufflevector <16 x i16> %load, <16 x i16> poison, <16 x i32> <i32 2, i32 2, i32 2, i32 2, i32 2, i32 2, i32 2, i32 2,
+ i32 10, i32 10, i32 10, i32 10, i32 10, i32 10, i32 10, i32 10>
+ store <16 x i16> %splat.lanes, ptr %addr
+ ret void
+}
+
+define void @dupq_i32_256b(ptr %addr) #0 {
+; CHECK-LABEL: dupq_i32_256b:
+; CHECK: // %bb.0:
+; CHECK-NEXT: ldr z0, [x0]
+; CHECK-NEXT: dupq z0.s, z0.s[3]
+; CHECK-NEXT: str z0, [x0]
+; CHECK-NEXT: ret
+ %load = load <8 x i32>, ptr %addr
+ %splat.lanes = shufflevector <8 x i32> %load, <8 x i32> poison, <8 x i32> <i32 3, i32 3, i32 3, i32 3,
+ i32 7, i32 7, i32 7, i32 7>
+ store <8 x i32> %splat.lanes, ptr %addr
+ ret void
+}
+
+define void @dupq_i64_256b(ptr %addr) #0 {
+; CHECK-LABEL: dupq_i64_256b:
+; CHECK: // %bb.0:
+; CHECK-NEXT: ldr z0, [x0]
+; CHECK-NEXT: trn1 z0.d, z0.d, z0.d
+; CHECK-NEXT: str z0, [x0]
+; CHECK-NEXT: ret
+ %load = load <4 x i64>, ptr %addr
+ %splat.lanes = shufflevector <4 x i64> %load, <4 x i64> poison, <4 x i32> <i32 0, i32 0, i32 2, i32 2>
+ store <4 x i64> %splat.lanes, ptr %addr
+ ret void
+}
+
+define void @dupq_f16_256b(ptr %addr) #0 {
+; CHECK-LABEL: dupq_f16_256b:
+; CHECK: // %bb.0:
+; CHECK-NEXT: ldr z0, [x0]
+; CHECK-NEXT: dupq z0.h, z0.h[2]
+; CHECK-NEXT: str z0, [x0]
+; CHECK-NEXT: ret
+ %load = load <16 x half>, ptr %addr
+ %splat.lanes = shufflevector <16 x half> %load, <16 x half> poison, <16 x i32> <i32 2, i32 2, i32 2, i32 2, i32 2, i32 2, i32 2, i32 2,
+ i32 10, i32 10, i32 10, i32 10, i32 10, i32 10, i32 10, i32 10>
+ store <16 x half> %splat.lanes, ptr %addr
+ ret void
+}
+
+define void @dupq_bf16_256b(ptr %addr) #0 {
+; CHECK-LABEL: dupq_bf16_256b:
+; CHECK: // %bb.0:
+; CHECK-NEXT: ldp q0, q1, [x0]
+; CHECK-NEXT: dup v0.8h, v0.h[2]
+; CHECK-NEXT: dup v1.8h, v1.h[2]
+; CHECK-NEXT: stp q0, q1, [x0]
+; CHECK-NEXT: ret
+ %load = load <16 x bfloat>, ptr %addr
+ %splat.lanes = shufflevector <16 x bfloat> %load, <16 x bfloat> poison, <16 x i32> <i32 2, i32 2, i32 2, i32 2, i32 2, i32 2, i32 2, i32 2,
+ i32 10, i32 10, i32 10, i32 10, i32 10, i32 10, i32 10, i32 10>
+ store <16 x bfloat> %splat.lanes, ptr %addr
+ ret void
+}
+
+define void @dupq_f32_256b(ptr %addr) #0 {
+; CHECK-LABEL: dupq_f32_256b:
+; CHECK: // %bb.0:
+; CHECK-NEXT: ldr z0, [x0]
+; CHECK-NEXT: dupq z0.s, z0.s[3]
+; CHECK-NEXT: str z0, [x0]
+; CHECK-NEXT: ret
+ %load = load <8 x float>, ptr %addr
+ %splat.lanes = shufflevector <8 x float> %load, <8 x float> poison, <8 x i32> <i32 3, i32 3, i32 3, i32 3,
+ i32 7, i32 7, i32 7, i32 7>
+ store <8 x float> %splat.lanes, ptr %addr
+ ret void
+}
+
+define void @dupq_f64_256b(ptr %addr) #0 {
+; CHECK-LABEL: dupq_f64_256b:
+; CHECK: // %bb.0:
+; CHECK-NEXT: ldr z0, [x0]
+; CHECK-NEXT: trn1 z0.d, z0.d, z0.d
+; CHECK-NEXT: str z0, [x0]
+; CHECK-NEXT: ret
+ %load = load <4 x double>, ptr %addr
+ %splat.lanes = shufflevector <4 x double> %load, <4 x double> poison, <4 x i32> <i32 0, i32 0, i32 2, i32 2>
+ store <4 x double> %splat.lanes, ptr %addr
+ ret void
+}
+
+attributes #0 = { noinline vscale_range(2,2) "target-features"="+sve2p1,+bf16" }
A couple of recommendations but otherwise looks good to me.
* Use Lane variable instead of repeatedly dereferencing M[0]
* Wording change for comment
Use the dupq instructions (when available) to represent a splat of the same lane within each 128b segment of a wider fixed vector.