[RISCV] Improve lowering of spread(2) shuffles #118658

Merged
merged 2 commits into llvm:main from preames:pr-riscv-shuffle-spread2 on Dec 4, 2024

Conversation

@preames preames (Collaborator) commented Dec 4, 2024

A spread(2) shuffle is just an interleave with an undef lane. The existing lowering was reusing the even lane for the undef value. This was entirely legal, but non-optimal.
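To make the mask shapes concrete, here is a minimal standalone sketch of the even/odd stream classification the patch adds. `classifyStreams` and the sample masks are illustrative, not in-tree code; SelectionDAG shuffle masks use -1 for undef lanes, as the sketch does.

```cpp
#include <cassert>
#include <vector>

// Classify which stream of an interleave-shaped shuffle mask is entirely
// undef (-1). A spread(2) mask is an interleave in which exactly one of
// the two streams is all-undef.
static void classifyStreams(const std::vector<int> &Mask, bool &EvenIsUndef,
                            bool &OddIsUndef) {
  EvenIsUndef = OddIsUndef = true;
  for (unsigned i = 0; i < Mask.size(); i++) {
    if (i % 2 == 0)
      EvenIsUndef &= (Mask[i] == -1);
    else
      OddIsUndef &= (Mask[i] == -1);
  }
}

int main() {
  bool Even, Odd;

  // spread(2) at index 0: <0, u, 1, u, 2, u, 3, u>. The odd stream is undef.
  classifyStreams({0, -1, 1, -1, 2, -1, 3, -1}, Even, Odd);
  assert(!Even && Odd);

  // A full interleave <0, 4, 1, 5>: neither stream is undef.
  classifyStreams({0, 4, 1, 5}, Even, Odd);
  assert(!Even && !Odd);
  return 0;
}
```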

@llvmbot (Member) commented Dec 4, 2024

@llvm/pr-subscribers-backend-risc-v

Author: Philip Reames (preames)

Changes

A spread(2) shuffle is just an interleave with an undef lane. The existing lowering was reusing the even lane for the undef value. This was entirely legal, but non-optimal.


Full diff: https://github.com/llvm/llvm-project/pull/118658.diff

4 Files Affected:

  • (modified) llvm/lib/Target/RISCV/RISCVISelLowering.cpp (+28-9)
  • (modified) llvm/test/CodeGen/RISCV/rvv/fixed-vectors-fp-interleave.ll (+15-21)
  • (modified) llvm/test/CodeGen/RISCV/rvv/fixed-vectors-int-interleave.ll (+29-41)
  • (modified) llvm/test/CodeGen/RISCV/rvv/fixed-vectors-int-shuffles.ll (+7-10)
diff --git a/llvm/lib/Target/RISCV/RISCVISelLowering.cpp b/llvm/lib/Target/RISCV/RISCVISelLowering.cpp
index 4544a922def1a3..1efb9e7d52dbd2 100644
--- a/llvm/lib/Target/RISCV/RISCVISelLowering.cpp
+++ b/llvm/lib/Target/RISCV/RISCVISelLowering.cpp
@@ -5331,17 +5331,36 @@ static SDValue lowerVECTOR_SHUFFLE(SDValue Op, SelectionDAG &DAG,
     // Extract the halves of the vectors.
     MVT HalfVT = VT.getHalfNumVectorElementsVT();
 
+    // Recognize if one half is actually undef; the matching above will
+    // otherwise reuse the even stream for the undef one.  This improves
+    // spread(2) shuffles.
+    bool EvenIsUndef = true, OddIsUndef = true;
+    for (unsigned i = 0; i < Mask.size(); i++) {
+      if (i % 2 == 0)
+        EvenIsUndef &= (Mask[i] == -1);
+      else
+        OddIsUndef &= (Mask[i] == -1);
+    }
+
     int Size = Mask.size();
     SDValue EvenV, OddV;
-    assert(EvenSrc >= 0 && "Undef source?");
-    EvenV = (EvenSrc / Size) == 0 ? V1 : V2;
-    EvenV = DAG.getNode(ISD::EXTRACT_SUBVECTOR, DL, HalfVT, EvenV,
-                        DAG.getVectorIdxConstant(EvenSrc % Size, DL));
-
-    assert(OddSrc >= 0 && "Undef source?");
-    OddV = (OddSrc / Size) == 0 ? V1 : V2;
-    OddV = DAG.getNode(ISD::EXTRACT_SUBVECTOR, DL, HalfVT, OddV,
-                       DAG.getVectorIdxConstant(OddSrc % Size, DL));
+    if (EvenIsUndef) {
+      EvenV = DAG.getUNDEF(HalfVT);
+    } else {
+      assert(EvenSrc >= 0 && "Undef source?");
+      EvenV = (EvenSrc / Size) == 0 ? V1 : V2;
+      EvenV = DAG.getNode(ISD::EXTRACT_SUBVECTOR, DL, HalfVT, EvenV,
+                          DAG.getVectorIdxConstant(EvenSrc % Size, DL));
+    }
+
+    if (OddIsUndef) {
+      OddV = DAG.getUNDEF(HalfVT);
+    } else {
+      assert(OddSrc >= 0 && "Undef source?");
+      OddV = (OddSrc / Size) == 0 ? V1 : V2;
+      OddV = DAG.getNode(ISD::EXTRACT_SUBVECTOR, DL, HalfVT, OddV,
+                         DAG.getVectorIdxConstant(OddSrc % Size, DL));
+    }
 
     return getWideningInterleave(EvenV, OddV, DL, DAG, Subtarget);
   }
diff --git a/llvm/test/CodeGen/RISCV/rvv/fixed-vectors-fp-interleave.ll b/llvm/test/CodeGen/RISCV/rvv/fixed-vectors-fp-interleave.ll
index e4b8e9debad271..97e458e70565ce 100644
--- a/llvm/test/CodeGen/RISCV/rvv/fixed-vectors-fp-interleave.ll
+++ b/llvm/test/CodeGen/RISCV/rvv/fixed-vectors-fp-interleave.ll
@@ -242,33 +242,27 @@ define <64 x float> @interleave_v32f32(<32 x float> %x, <32 x float> %y) {
 ; V128-NEXT:    slli a0, a0, 3
 ; V128-NEXT:    sub sp, sp, a0
 ; V128-NEXT:    .cfi_escape 0x0f, 0x0d, 0x72, 0x00, 0x11, 0x10, 0x22, 0x11, 0x08, 0x92, 0xa2, 0x38, 0x00, 0x1e, 0x22 # sp + 16 + 8 * vlenb
-; V128-NEXT:    vmv8r.v v24, v16
-; V128-NEXT:    vmv8r.v v16, v8
-; V128-NEXT:    vmv8r.v v8, v24
 ; V128-NEXT:    addi a0, sp, 16
-; V128-NEXT:    vs8r.v v24, (a0) # Unknown-size Folded Spill
+; V128-NEXT:    vs8r.v v8, (a0) # Unknown-size Folded Spill
 ; V128-NEXT:    vsetivli zero, 16, e32, m8, ta, ma
-; V128-NEXT:    vslidedown.vi v0, v24, 16
-; V128-NEXT:    li a0, -1
-; V128-NEXT:    vsetivli zero, 16, e32, m4, ta, ma
-; V128-NEXT:    vwaddu.vv v24, v8, v0
-; V128-NEXT:    vwmaccu.vx v24, a0, v0
-; V128-NEXT:    vsetivli zero, 16, e32, m8, ta, ma
-; V128-NEXT:    vslidedown.vi v0, v16, 16
+; V128-NEXT:    vslidedown.vi v24, v16, 16
+; V128-NEXT:    li a0, 32
+; V128-NEXT:    vslidedown.vi v0, v8, 16
 ; V128-NEXT:    lui a1, 699051
-; V128-NEXT:    li a2, 32
-; V128-NEXT:    vsetivli zero, 16, e32, m4, ta, ma
-; V128-NEXT:    vwaddu.vv v8, v0, v16
+; V128-NEXT:    vsetivli zero, 16, e64, m8, ta, ma
+; V128-NEXT:    vzext.vf2 v8, v24
+; V128-NEXT:    vzext.vf2 v24, v0
 ; V128-NEXT:    addi a1, a1, -1366
 ; V128-NEXT:    vmv.s.x v0, a1
-; V128-NEXT:    vwmaccu.vx v8, a0, v16
-; V128-NEXT:    vsetvli zero, a2, e32, m8, ta, ma
-; V128-NEXT:    vmerge.vvm v24, v8, v24, v0
-; V128-NEXT:    addi a1, sp, 16
-; V128-NEXT:    vl8r.v v8, (a1) # Unknown-size Folded Reload
+; V128-NEXT:    vsll.vx v8, v8, a0
+; V128-NEXT:    vsetvli zero, a0, e32, m8, ta, ma
+; V128-NEXT:    vmerge.vvm v24, v24, v8, v0
+; V128-NEXT:    addi a0, sp, 16
+; V128-NEXT:    vl8r.v v8, (a0) # Unknown-size Folded Reload
 ; V128-NEXT:    vsetivli zero, 16, e32, m4, ta, ma
-; V128-NEXT:    vwaddu.vv v0, v16, v8
-; V128-NEXT:    vwmaccu.vx v0, a0, v8
+; V128-NEXT:    vwaddu.vv v0, v8, v16
+; V128-NEXT:    li a0, -1
+; V128-NEXT:    vwmaccu.vx v0, a0, v16
 ; V128-NEXT:    vmv8r.v v8, v0
 ; V128-NEXT:    vmv8r.v v16, v24
 ; V128-NEXT:    csrr a0, vlenb
diff --git a/llvm/test/CodeGen/RISCV/rvv/fixed-vectors-int-interleave.ll b/llvm/test/CodeGen/RISCV/rvv/fixed-vectors-int-interleave.ll
index 66af5718fb9dc5..a8eb1f97fd1a2c 100644
--- a/llvm/test/CodeGen/RISCV/rvv/fixed-vectors-int-interleave.ll
+++ b/llvm/test/CodeGen/RISCV/rvv/fixed-vectors-int-interleave.ll
@@ -186,35 +186,29 @@ define <4 x i32> @interleave_v4i32_offset_2(<4 x i32> %x, <4 x i32> %y) {
 define <4 x i32> @interleave_v4i32_offset_1(<4 x i32> %x, <4 x i32> %y) {
 ; V128-LABEL: interleave_v4i32_offset_1:
 ; V128:       # %bb.0:
-; V128-NEXT:    vsetivli zero, 2, e32, mf2, ta, ma
-; V128-NEXT:    vwaddu.vv v10, v8, v8
-; V128-NEXT:    li a0, -1
 ; V128-NEXT:    vsetivli zero, 4, e32, m1, ta, ma
-; V128-NEXT:    vid.v v11
-; V128-NEXT:    vsetivli zero, 2, e32, mf2, ta, ma
-; V128-NEXT:    vwmaccu.vx v10, a0, v8
-; V128-NEXT:    vsetivli zero, 4, e32, m1, ta, mu
-; V128-NEXT:    vsrl.vi v8, v11, 1
+; V128-NEXT:    vid.v v10
 ; V128-NEXT:    vmv.v.i v0, 10
-; V128-NEXT:    vadd.vi v8, v8, 1
-; V128-NEXT:    vrgather.vv v10, v9, v8, v0.t
+; V128-NEXT:    vsrl.vi v10, v10, 1
+; V128-NEXT:    vadd.vi v11, v10, 1
+; V128-NEXT:    vsetivli zero, 2, e64, m1, ta, ma
+; V128-NEXT:    vzext.vf2 v10, v8
+; V128-NEXT:    vsetivli zero, 4, e32, m1, ta, mu
+; V128-NEXT:    vrgather.vv v10, v9, v11, v0.t
 ; V128-NEXT:    vmv.v.v v8, v10
 ; V128-NEXT:    ret
 ;
 ; V512-LABEL: interleave_v4i32_offset_1:
 ; V512:       # %bb.0:
-; V512-NEXT:    vsetivli zero, 2, e32, mf2, ta, ma
-; V512-NEXT:    vwaddu.vv v10, v8, v8
-; V512-NEXT:    li a0, -1
 ; V512-NEXT:    vsetivli zero, 4, e32, mf2, ta, ma
-; V512-NEXT:    vid.v v11
-; V512-NEXT:    vsetivli zero, 2, e32, mf2, ta, ma
-; V512-NEXT:    vwmaccu.vx v10, a0, v8
-; V512-NEXT:    vsetivli zero, 4, e32, mf2, ta, mu
-; V512-NEXT:    vsrl.vi v8, v11, 1
+; V512-NEXT:    vid.v v10
 ; V512-NEXT:    vmv.v.i v0, 10
-; V512-NEXT:    vadd.vi v8, v8, 1
-; V512-NEXT:    vrgather.vv v10, v9, v8, v0.t
+; V512-NEXT:    vsrl.vi v10, v10, 1
+; V512-NEXT:    vadd.vi v11, v10, 1
+; V512-NEXT:    vsetivli zero, 2, e64, m1, ta, ma
+; V512-NEXT:    vzext.vf2 v10, v8
+; V512-NEXT:    vsetivli zero, 4, e32, mf2, ta, mu
+; V512-NEXT:    vrgather.vv v10, v9, v11, v0.t
 ; V512-NEXT:    vmv1r.v v8, v10
 ; V512-NEXT:    ret
   %a = shufflevector <4 x i32> %x, <4 x i32> %y, <4 x i32> <i32 0, i32 5, i32 1, i32 6>
@@ -411,33 +405,27 @@ define <64 x i32> @interleave_v32i32(<32 x i32> %x, <32 x i32> %y) {
 ; V128-NEXT:    slli a0, a0, 3
 ; V128-NEXT:    sub sp, sp, a0
 ; V128-NEXT:    .cfi_escape 0x0f, 0x0d, 0x72, 0x00, 0x11, 0x10, 0x22, 0x11, 0x08, 0x92, 0xa2, 0x38, 0x00, 0x1e, 0x22 # sp + 16 + 8 * vlenb
-; V128-NEXT:    vmv8r.v v24, v16
-; V128-NEXT:    vmv8r.v v16, v8
-; V128-NEXT:    vmv8r.v v8, v24
 ; V128-NEXT:    addi a0, sp, 16
-; V128-NEXT:    vs8r.v v24, (a0) # Unknown-size Folded Spill
-; V128-NEXT:    vsetivli zero, 16, e32, m8, ta, ma
-; V128-NEXT:    vslidedown.vi v0, v24, 16
-; V128-NEXT:    li a0, -1
-; V128-NEXT:    vsetivli zero, 16, e32, m4, ta, ma
-; V128-NEXT:    vwaddu.vv v24, v8, v0
-; V128-NEXT:    vwmaccu.vx v24, a0, v0
+; V128-NEXT:    vs8r.v v8, (a0) # Unknown-size Folded Spill
 ; V128-NEXT:    vsetivli zero, 16, e32, m8, ta, ma
-; V128-NEXT:    vslidedown.vi v0, v16, 16
+; V128-NEXT:    vslidedown.vi v24, v16, 16
+; V128-NEXT:    li a0, 32
+; V128-NEXT:    vslidedown.vi v0, v8, 16
 ; V128-NEXT:    lui a1, 699051
-; V128-NEXT:    li a2, 32
-; V128-NEXT:    vsetivli zero, 16, e32, m4, ta, ma
-; V128-NEXT:    vwaddu.vv v8, v0, v16
+; V128-NEXT:    vsetivli zero, 16, e64, m8, ta, ma
+; V128-NEXT:    vzext.vf2 v8, v24
+; V128-NEXT:    vzext.vf2 v24, v0
 ; V128-NEXT:    addi a1, a1, -1366
 ; V128-NEXT:    vmv.s.x v0, a1
-; V128-NEXT:    vwmaccu.vx v8, a0, v16
-; V128-NEXT:    vsetvli zero, a2, e32, m8, ta, ma
-; V128-NEXT:    vmerge.vvm v24, v8, v24, v0
-; V128-NEXT:    addi a1, sp, 16
-; V128-NEXT:    vl8r.v v8, (a1) # Unknown-size Folded Reload
+; V128-NEXT:    vsll.vx v8, v8, a0
+; V128-NEXT:    vsetvli zero, a0, e32, m8, ta, ma
+; V128-NEXT:    vmerge.vvm v24, v24, v8, v0
+; V128-NEXT:    addi a0, sp, 16
+; V128-NEXT:    vl8r.v v8, (a0) # Unknown-size Folded Reload
 ; V128-NEXT:    vsetivli zero, 16, e32, m4, ta, ma
-; V128-NEXT:    vwaddu.vv v0, v16, v8
-; V128-NEXT:    vwmaccu.vx v0, a0, v8
+; V128-NEXT:    vwaddu.vv v0, v8, v16
+; V128-NEXT:    li a0, -1
+; V128-NEXT:    vwmaccu.vx v0, a0, v16
 ; V128-NEXT:    vmv8r.v v8, v0
 ; V128-NEXT:    vmv8r.v v16, v24
 ; V128-NEXT:    csrr a0, vlenb
diff --git a/llvm/test/CodeGen/RISCV/rvv/fixed-vectors-int-shuffles.ll b/llvm/test/CodeGen/RISCV/rvv/fixed-vectors-int-shuffles.ll
index 5d307211ead6e6..a9ae2181333e88 100644
--- a/llvm/test/CodeGen/RISCV/rvv/fixed-vectors-int-shuffles.ll
+++ b/llvm/test/CodeGen/RISCV/rvv/fixed-vectors-int-shuffles.ll
@@ -801,11 +801,9 @@ define <8 x i32> @shuffle_compress_singlesrc_gaps_e32(<8 x i32> %v) {
 define <8 x i32> @shuffle_spread2_singlesrc_e32(<8 x i32> %v) {
 ; CHECK-LABEL: shuffle_spread2_singlesrc_e32:
 ; CHECK:       # %bb.0:
-; CHECK-NEXT:    vsetivli zero, 4, e32, m1, ta, ma
-; CHECK-NEXT:    vwaddu.vv v10, v8, v8
-; CHECK-NEXT:    li a0, -1
-; CHECK-NEXT:    vwmaccu.vx v10, a0, v8
-; CHECK-NEXT:    vmv2r.v v8, v10
+; CHECK-NEXT:    vsetivli zero, 4, e64, m2, ta, ma
+; CHECK-NEXT:    vzext.vf2 v10, v8
+; CHECK-NEXT:    vmv.v.v v8, v10
 ; CHECK-NEXT:    ret
   %out = shufflevector <8 x i32> %v, <8 x i32> poison, <8 x i32> <i32 0, i32 undef, i32 1, i32 undef, i32 2, i32 undef, i32 3, i32 undef>
   ret <8 x i32> %out
@@ -814,11 +812,10 @@ define <8 x i32> @shuffle_spread2_singlesrc_e32(<8 x i32> %v) {
 define <8 x i32> @shuffle_spread2_singlesrc_e32_index1(<8 x i32> %v) {
 ; CHECK-LABEL: shuffle_spread2_singlesrc_e32_index1:
 ; CHECK:       # %bb.0:
-; CHECK-NEXT:    vsetivli zero, 4, e32, m1, ta, ma
-; CHECK-NEXT:    vwaddu.vv v10, v8, v8
-; CHECK-NEXT:    li a0, -1
-; CHECK-NEXT:    vwmaccu.vx v10, a0, v8
-; CHECK-NEXT:    vmv2r.v v8, v10
+; CHECK-NEXT:    vsetivli zero, 4, e64, m2, ta, ma
+; CHECK-NEXT:    vzext.vf2 v10, v8
+; CHECK-NEXT:    li a0, 32
+; CHECK-NEXT:    vsll.vx v8, v10, a0
 ; CHECK-NEXT:    ret
   %out = shufflevector <8 x i32> %v, <8 x i32> poison, <8 x i32> <i32 undef, i32 0, i32 undef, i32 1, i32 undef, i32 2, i32 undef, i32 3>
   ret <8 x i32> %out
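The test updates above show the payoff. A spread(2) of 32-bit elements at index 0 now lowers to a single vzext.vf2 (the value lands in the even lane, and the odd lane gets zero, a legal instantiation of undef), and the index-1 form adds one vsll.vx by 32, replacing the old vwaddu.vv/vwmaccu.vx pair that fed the same source into both streams. Below is a minimal scalar model of the per-pair lane arithmetic, assuming RVV's little-endian lane numbering (each pair of e32 output lanes viewed as one e64 lane); it is a sketch, not in-tree code.

```cpp
#include <cassert>
#include <cstdint>

// spread(2) at index 0: value in the even (low) lane, zero in the odd lane.
// Models vzext.vf2 on e32 elements viewed at e64.
static uint64_t spread2_index0(uint32_t x) { return (uint64_t)x; }

// spread(2) at index 1: value in the odd (high) lane.
// Models vzext.vf2 followed by vsll.vx with a shift amount of 32.
static uint64_t spread2_index1(uint32_t x) { return (uint64_t)x << 32; }

int main() {
  assert(spread2_index0(0xAABBCCDDu) == 0x00000000AABBCCDDull);
  assert(spread2_index1(0xAABBCCDDu) == 0xAABBCCDD00000000ull);
  return 0;
}
```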

OddV = (OddSrc / Size) == 0 ? V1 : V2;
OddV = DAG.getNode(ISD::EXTRACT_SUBVECTOR, DL, HalfVT, OddV,
DAG.getVectorIdxConstant(OddSrc % Size, DL));
}

return getWideningInterleave(EvenV, OddV, DL, DAG, Subtarget);
preames (Collaborator, Author)

Note that we already have the required undef handling in getWideningInterleave because it's used by the deinterleave2 intrinsic lowering.

; V128-NEXT: vl8r.v v8, (a1) # Unknown-size Folded Reload
; V128-NEXT: vsll.vx v8, v8, a0
; V128-NEXT: vsetvli zero, a0, e32, m8, ta, ma
; V128-NEXT: vmerge.vvm v24, v24, v8, v0
preames (Collaborator, Author)

The odd pattern of two spread(2) operations followed by a merge, which could instead be a single interleave, comes from lowering this:
t24: v32i32 = vector_shuffle<16,48,17,49,18,50,19,51,20,52,21,53,22,54,23,55,24,56,25,57,26,58,27,59,28,60,29,61,30,62,31,63> t4, t7

This fails our current restrictions in the definition of isInterleaveShuffle, but we could probably relax that. I'm going to glance at that separately.
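For reference, the equivalence the comment relies on falls out of the arithmetic that the vwaddu.vv/vwmaccu.vx sequence implements: with a multiplier of -1 (0xFFFFFFFF at SEW=32), each 64-bit output lane accumulates zext(e) + zext(o) * 2^32, i.e. the even element in the low half and the odd element in the high half, which is exactly the bitwise OR of the two spread(2) forms that the merge combines. A scalar sketch of one output lane (illustrative, not in-tree code):

```cpp
#include <cassert>
#include <cstdint>

// One 64-bit output lane of a 32-bit interleave, as computed by
// vwaddu.vv (zext(e) + zext(o)) followed by vwmaccu.vx with x = -1,
// which adds zext(o) * (2^32 - 1). Total: zext(e) + zext(o) * 2^32.
static uint64_t interleaveLane(uint32_t e, uint32_t o) {
  uint64_t acc = (uint64_t)e + (uint64_t)o; // vwaddu.vv
  acc += (uint64_t)o * 0xFFFFFFFFull;       // vwmaccu.vx with -1
  return acc;
}

int main() {
  uint32_t e = 0x11112222u, o = 0x33334444u;
  uint64_t spreadEven = (uint64_t)e;      // spread(2) of e at index 0
  uint64_t spreadOdd = (uint64_t)o << 32; // spread(2) of o at index 1
  assert(interleaveLane(e, o) == (spreadEven | spreadOdd));
  return 0;
}
```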

preames (Collaborator, Author)

I did glance at this, and decided not to pursue it for now. The change itself is simple, but it exposes two unrelated issues I don't have cycles to fix: 1) isShuffleMaskLegal causes unprofitable transforms somewhere in the generic DAG, and 2) MachineScheduling creates a truly adversarial schedule, and the register allocator is forced to greatly increase spilling.

@topperc (Collaborator) left a comment

LGTM

@preames preames merged commit a6e7749 into llvm:main Dec 4, 2024
3 of 6 checks passed
@preames preames deleted the pr-riscv-shuffle-spread2 branch December 4, 2024 20:21