[AArch64] Use dupq (SVE2.1) for segmented lane splats #144482
Conversation
✅ With the latest revision this PR passed the C/C++ code formatter.
%splat.lanes = shufflevector <32 x i8> %load, <32 x i8> poison, <32 x i32> <i32 11, i32 11, i32 11, i32 11, i32 11, i32 11, i32 11, i32 11, i32 11, i32 11, i32 11, i32 11, i32 11, i32 11, i32 11, i32 11,
                                                                            i32 27, i32 27, i32 27, i32 27, i32 27, i32 27, i32 27, i32 27, i32 27, i32 27, i32 27, i32 27, i32 27, i32 27, i32 27, i32 27>
Maybe choose 15 and 31 here as indices to test boundaries?
I figured a mix would work, as I have the 3,7 indices for i32 values, but sure.
done
* Remove sdnode and use intrinsic
for (unsigned I = 0; I < Segments; ++I) {
  unsigned Broadcast = (unsigned)M[I * SegmentElts];
  if (Broadcast - (I * SegmentElts) > SegmentElts)
    return std::nullopt;
  for (unsigned J = 0; J < SegmentElts; ++J) {
    int Idx = M[(I * SegmentElts) + J];
    if ((unsigned)Idx != Broadcast)
      return std::nullopt;
  }
}
nit: it might be easier to follow if we'd write this with a single loop instead, using something like this:
for (unsigned I = 0; I < M.size(); ++I) {
  if (M[I] != (M[0] + ((I / SegmentElts) * SegmentElts)))
    return std::nullopt;
}
return M[0];
what do you think?
That discards the range check (i.e. whether the splatted index value refers to a lane within the current segment). I guess we could have a second loop to check that afterwards.
Honestly, I wanted to use all_equal() on a std::span instead of the inner loop, but we don't have span yet (C++20). I could perhaps make a second ArrayRef instead, since it's just a pair of iterators.
For the range check, I guess you'd only need to check M[0] (because we're not supporting poison values)?
done
@llvm/pr-subscribers-backend-aarch64

Author: Graham Hunter (huntergr-arm)

Changes

Use the dupq instructions (when available) to represent a splat of the same lane within each 128b segment of a wider fixed vector.

Full diff: https://github.com/llvm/llvm-project/pull/144482.diff

2 Files Affected:
diff --git a/llvm/lib/Target/AArch64/AArch64ISelLowering.cpp b/llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
index 7519ac5260a64..1e7e40ab663be 100644
--- a/llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
+++ b/llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
@@ -13430,6 +13430,30 @@ static bool isUZP_v_undef_Mask(ArrayRef<int> M, EVT VT, unsigned &WhichResult) {
return true;
}
+/// isDUPQMask - matches a splat of equivalent lanes within 128b segments in
+/// the first vector operand.
+static std::optional<unsigned> isDUPQMask(ArrayRef<int> M, EVT VT) {
+ unsigned Lane = (unsigned)M[0];
+ unsigned Segments = VT.getFixedSizeInBits() / 128;
+ unsigned SegmentElts = VT.getVectorNumElements() / Segments;
+
+ // Make sure there's no size changes.
+ if (SegmentElts * Segments != M.size())
+ return std::nullopt;
+
+ // Check that the first index corresponds to one of the lanes in the first
+ // segment.
+ if ((unsigned)M[0] >= SegmentElts)
+ return std::nullopt;
+
+ // Check that all lanes match the first, adjusted for segment.
+ for (unsigned I = 0; I < M.size(); ++I)
+ if ((unsigned)M[I] != ((unsigned)M[0] + ((I / SegmentElts) * SegmentElts)))
+ return std::nullopt;
+
+ return Lane;
+}
+
/// isTRN_v_undef_Mask - Special case of isTRNMask for canonical form of
/// "vector_shuffle v, v", i.e., "vector_shuffle v, undef".
/// Mask is e.g., <0, 0, 2, 2> instead of <0, 4, 2, 6>.
@@ -30013,6 +30037,19 @@ SDValue AArch64TargetLowering::LowerFixedLengthVECTOR_SHUFFLEToSVE(
return convertFromScalableVector(
DAG, VT, DAG.getNode(Opc, DL, ContainerVT, Op1, Op1));
}
+
+ if (Subtarget->hasSVE2p1()) {
+ if (std::optional<unsigned> Lane = isDUPQMask(ShuffleMask, VT)) {
+ SDValue IID =
+ DAG.getConstant(Intrinsic::aarch64_sve_dup_laneq, DL, MVT::i64);
+ return convertFromScalableVector(
+ DAG, VT,
+ DAG.getNode(ISD::INTRINSIC_WO_CHAIN, DL, {ContainerVT, MVT::i64},
+ {IID, Op1,
+ DAG.getConstant(*Lane, DL, MVT::i64,
+ /*isTarget=*/true)}));
+ }
+ }
}
// Try to widen the shuffle before generating a possibly expensive SVE TBL.
diff --git a/llvm/test/CodeGen/AArch64/sve2p1-vector-shuffles.ll b/llvm/test/CodeGen/AArch64/sve2p1-vector-shuffles.ll
new file mode 100644
index 0000000000000..40d4d0ff60148
--- /dev/null
+++ b/llvm/test/CodeGen/AArch64/sve2p1-vector-shuffles.ll
@@ -0,0 +1,115 @@
+; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py UTC_ARGS: --version 5
+; RUN: llc -mtriple=aarch64-linux-gnu < %s | FileCheck %s
+
+define void @dupq_i8_256b(ptr %addr) #0 {
+; CHECK-LABEL: dupq_i8_256b:
+; CHECK: // %bb.0:
+; CHECK-NEXT: ldr z0, [x0]
+; CHECK-NEXT: dupq z0.b, z0.b[15]
+; CHECK-NEXT: str z0, [x0]
+; CHECK-NEXT: ret
+ %load = load <32 x i8>, ptr %addr
+ %splat.lanes = shufflevector <32 x i8> %load, <32 x i8> poison, <32 x i32> <i32 15, i32 15, i32 15, i32 15, i32 15, i32 15, i32 15, i32 15, i32 15, i32 15, i32 15, i32 15, i32 15, i32 15, i32 15, i32 15,
+ i32 31, i32 31, i32 31, i32 31, i32 31, i32 31, i32 31, i32 31, i32 31, i32 31, i32 31, i32 31, i32 31, i32 31, i32 31, i32 31>
+ store <32 x i8> %splat.lanes, ptr %addr
+ ret void
+}
+
+define void @dupq_i16_256b(ptr %addr) #0 {
+; CHECK-LABEL: dupq_i16_256b:
+; CHECK: // %bb.0:
+; CHECK-NEXT: ldr z0, [x0]
+; CHECK-NEXT: dupq z0.h, z0.h[2]
+; CHECK-NEXT: str z0, [x0]
+; CHECK-NEXT: ret
+ %load = load <16 x i16>, ptr %addr
+ %splat.lanes = shufflevector <16 x i16> %load, <16 x i16> poison, <16 x i32> <i32 2, i32 2, i32 2, i32 2, i32 2, i32 2, i32 2, i32 2,
+ i32 10, i32 10, i32 10, i32 10, i32 10, i32 10, i32 10, i32 10>
+ store <16 x i16> %splat.lanes, ptr %addr
+ ret void
+}
+
+define void @dupq_i32_256b(ptr %addr) #0 {
+; CHECK-LABEL: dupq_i32_256b:
+; CHECK: // %bb.0:
+; CHECK-NEXT: ldr z0, [x0]
+; CHECK-NEXT: dupq z0.s, z0.s[3]
+; CHECK-NEXT: str z0, [x0]
+; CHECK-NEXT: ret
+ %load = load <8 x i32>, ptr %addr
+ %splat.lanes = shufflevector <8 x i32> %load, <8 x i32> poison, <8 x i32> <i32 3, i32 3, i32 3, i32 3,
+ i32 7, i32 7, i32 7, i32 7>
+ store <8 x i32> %splat.lanes, ptr %addr
+ ret void
+}
+
+define void @dupq_i64_256b(ptr %addr) #0 {
+; CHECK-LABEL: dupq_i64_256b:
+; CHECK: // %bb.0:
+; CHECK-NEXT: ldr z0, [x0]
+; CHECK-NEXT: trn1 z0.d, z0.d, z0.d
+; CHECK-NEXT: str z0, [x0]
+; CHECK-NEXT: ret
+ %load = load <4 x i64>, ptr %addr
+ %splat.lanes = shufflevector <4 x i64> %load, <4 x i64> poison, <4 x i32> <i32 0, i32 0, i32 2, i32 2>
+ store <4 x i64> %splat.lanes, ptr %addr
+ ret void
+}
+
+define void @dupq_f16_256b(ptr %addr) #0 {
+; CHECK-LABEL: dupq_f16_256b:
+; CHECK: // %bb.0:
+; CHECK-NEXT: ldr z0, [x0]
+; CHECK-NEXT: dupq z0.h, z0.h[2]
+; CHECK-NEXT: str z0, [x0]
+; CHECK-NEXT: ret
+ %load = load <16 x half>, ptr %addr
+ %splat.lanes = shufflevector <16 x half> %load, <16 x half> poison, <16 x i32> <i32 2, i32 2, i32 2, i32 2, i32 2, i32 2, i32 2, i32 2,
+ i32 10, i32 10, i32 10, i32 10, i32 10, i32 10, i32 10, i32 10>
+ store <16 x half> %splat.lanes, ptr %addr
+ ret void
+}
+
+define void @dupq_bf16_256b(ptr %addr) #0 {
+; CHECK-LABEL: dupq_bf16_256b:
+; CHECK: // %bb.0:
+; CHECK-NEXT: ldp q0, q1, [x0]
+; CHECK-NEXT: dup v0.8h, v0.h[2]
+; CHECK-NEXT: dup v1.8h, v1.h[2]
+; CHECK-NEXT: stp q0, q1, [x0]
+; CHECK-NEXT: ret
+ %load = load <16 x bfloat>, ptr %addr
+ %splat.lanes = shufflevector <16 x bfloat> %load, <16 x bfloat> poison, <16 x i32> <i32 2, i32 2, i32 2, i32 2, i32 2, i32 2, i32 2, i32 2,
+ i32 10, i32 10, i32 10, i32 10, i32 10, i32 10, i32 10, i32 10>
+ store <16 x bfloat> %splat.lanes, ptr %addr
+ ret void
+}
+
+define void @dupq_f32_256b(ptr %addr) #0 {
+; CHECK-LABEL: dupq_f32_256b:
+; CHECK: // %bb.0:
+; CHECK-NEXT: ldr z0, [x0]
+; CHECK-NEXT: dupq z0.s, z0.s[3]
+; CHECK-NEXT: str z0, [x0]
+; CHECK-NEXT: ret
+ %load = load <8 x float>, ptr %addr
+ %splat.lanes = shufflevector <8 x float> %load, <8 x float> poison, <8 x i32> <i32 3, i32 3, i32 3, i32 3,
+ i32 7, i32 7, i32 7, i32 7>
+ store <8 x float> %splat.lanes, ptr %addr
+ ret void
+}
+
+define void @dupq_f64_256b(ptr %addr) #0 {
+; CHECK-LABEL: dupq_f64_256b:
+; CHECK: // %bb.0:
+; CHECK-NEXT: ldr z0, [x0]
+; CHECK-NEXT: trn1 z0.d, z0.d, z0.d
+; CHECK-NEXT: str z0, [x0]
+; CHECK-NEXT: ret
+ %load = load <4 x double>, ptr %addr
+ %splat.lanes = shufflevector <4 x double> %load, <4 x double> poison, <4 x i32> <i32 0, i32 0, i32 2, i32 2>
+ store <4 x double> %splat.lanes, ptr %addr
+ ret void
+}
+
+attributes #0 = { noinline vscale_range(2,2) "target-features"="+sve2p1,+bf16" }
A couple of recommendations but otherwise looks good to me.
* Use Lane variable instead of repeatedly dereferencing M[0]
* Wording change for comment
Use the dupq instructions (when available) to represent a splat of the same lane within each 128b segment of a wider fixed vector.