
[RISCV] Decompose LMUL > 1 reverses into LMUL * M1 vrgather.vv #104574


Merged: 6 commits from vector_reverse-decompose into llvm:main on Aug 29, 2024

Conversation

lukel97
Contributor

@lukel97 lukel97 commented Aug 16, 2024

As far as I'm aware, vrgather.vv is quadratic in LMUL on most microarchitectures today due to each output register needing to read from each input register in the group.

For example, the reciprocal throughput for vrgather.vv on the spacemit-x60 is listed on https://camel-cdr.github.io/rvv-bench-results/bpi_f3 as:

LMUL1   LMUL2   LMUL4   LMUL8
4.0     16.0    64.0    256.1

Vector reverses are commonly emitted by the loop vectorizer and are lowered as vrgather.vvs, but since the loop vectorizer uses LMUL 2 by default they end up being quadratic.

Each output register in a reverse only needs to read from one input register, though, so we can decompose this into LMUL * M1 vrgather.vvs to get linear performance.

This gives a 0.43% runtime improvement on 526.blender_r at rva22u64_v O3 on the Banana Pi F3.
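
As a rough illustration, taking the X60 reciprocal throughputs above at face value (a simplification that ignores the extra vslidedown needed for fixed-length vectors):

one m8 vrgather.vv:              ~256
eight m1 vrgather.vv:   8 * 4 =   ~32

i.e. linear rather than quadratic in LMUL.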

@llvmbot
Member

llvmbot commented Aug 16, 2024

@llvm/pr-subscribers-backend-risc-v

Author: Luke Lau (lukel97)

Patch is 117.13 KiB, truncated to 20.00 KiB below, full version: https://github.com/llvm/llvm-project/pull/104574.diff

5 Files Affected:

  • (modified) llvm/lib/Target/RISCV/RISCVISelLowering.cpp (+48-3)
  • (modified) llvm/test/CodeGen/RISCV/rvv/fixed-vectors-shuffle-reverse.ll (+430-196)
  • (modified) llvm/test/CodeGen/RISCV/rvv/named-vector-shuffle-reverse.ll (+590-381)
  • (modified) llvm/test/CodeGen/RISCV/rvv/vp-reverse-int.ll (+26-16)
  • (modified) llvm/test/CodeGen/RISCV/rvv/vp-reverse-mask.ll (+28-18)
diff --git a/llvm/lib/Target/RISCV/RISCVISelLowering.cpp b/llvm/lib/Target/RISCV/RISCVISelLowering.cpp
index 434f22bb9c8df..c4bb3e86ad65f 100644
--- a/llvm/lib/Target/RISCV/RISCVISelLowering.cpp
+++ b/llvm/lib/Target/RISCV/RISCVISelLowering.cpp
@@ -10320,6 +10320,54 @@ SDValue RISCVTargetLowering::lowerVECTOR_REVERSE(SDValue Op,
     Vec = convertToScalableVector(ContainerVT, Vec, DAG, Subtarget);
   }
 
+  MVT XLenVT = Subtarget.getXLenVT();
+  auto [Mask, VL] = getDefaultVLOps(VecVT, ContainerVT, DL, DAG, Subtarget);
+
+  // On most uarchs vrgather.vv is quadratic in LMUL because each output
+  // register may read from LMUL registers. However to reverse a vector each
+  // output register only needs to read from one register. So decompose it into
+  // LMUL * M1 vrgather.vvs, so we get O(LMUL) performance instead of O(LMUL^2).
+  //
+  // vsetvli a1, zero, e64, m4, ta, ma
+  // vrgatherei16.vv v12, v8, v16
+  // ->
+  // vsetvli a1, zero, e64, m1, ta, ma
+  // vrgather.vv v15, v8, v16
+  // vrgather.vv v14, v9, v16
+  // vrgather.vv v13, v10, v16
+  // vrgather.vv v12, v11, v16
+  if (ContainerVT.bitsGT(getLMUL1VT(ContainerVT)) &&
+      ContainerVT.getVectorElementCount().isKnownMultipleOf(2)) {
+    MVT HalfVT = ContainerVT.getHalfNumVectorElementsVT();
+    SDValue Lo = DAG.getNode(ISD::EXTRACT_SUBVECTOR, DL, HalfVT, Vec,
+                             DAG.getVectorIdxConstant(0, DL));
+    SDValue Hi = DAG.getNode(
+        ISD::EXTRACT_SUBVECTOR, DL, HalfVT, Vec,
+        DAG.getVectorIdxConstant(HalfVT.getVectorMinNumElements(), DL));
+    Lo = DAG.getNode(ISD::VECTOR_REVERSE, DL, HalfVT, Lo);
+    Hi = DAG.getNode(ISD::VECTOR_REVERSE, DL, HalfVT, Hi);
+    SDValue Concat = DAG.getNode(ISD::CONCAT_VECTORS, DL, ContainerVT, Hi, Lo);
+
+    // Fixed length vectors might not fit exactly into their container, and so
+    // leave a gap in the front of the vector after being reversed. Slide this
+    // away.
+    //
+    // x x x x 3 2 1 0 <- v4i16 @ vlen=128
+    // 0 1 2 3 x x x x <- reverse
+    // x x x x 0 1 2 3 <- vslidedown.vx
+    if (VecVT.isFixedLengthVector()) {
+      SDValue Offset = DAG.getNode(
+          ISD::SUB, DL, XLenVT,
+          DAG.getElementCount(DL, XLenVT, ContainerVT.getVectorElementCount()),
+          DAG.getElementCount(DL, XLenVT, VecVT.getVectorElementCount()));
+      Concat =
+          getVSlidedown(DAG, Subtarget, DL, ContainerVT,
+                        DAG.getUNDEF(ContainerVT), Concat, Offset, Mask, VL);
+      Concat = convertFromScalableVector(VecVT, Concat, DAG, Subtarget);
+    }
+    return Concat;
+  }
+
   unsigned EltSize = ContainerVT.getScalarSizeInBits();
   unsigned MinSize = ContainerVT.getSizeInBits().getKnownMinValue();
   unsigned VectorBitsMax = Subtarget.getRealMaxVLen();
@@ -10367,9 +10415,6 @@ SDValue RISCVTargetLowering::lowerVECTOR_REVERSE(SDValue Op,
     IntVT = IntVT.changeVectorElementType(MVT::i16);
   }
 
-  MVT XLenVT = Subtarget.getXLenVT();
-  auto [Mask, VL] = getDefaultVLOps(VecVT, ContainerVT, DL, DAG, Subtarget);
-
   // Calculate VLMAX-1 for the desired SEW.
   SDValue VLMinus1 = DAG.getNode(
       ISD::SUB, DL, XLenVT,
diff --git a/llvm/test/CodeGen/RISCV/rvv/fixed-vectors-shuffle-reverse.ll b/llvm/test/CodeGen/RISCV/rvv/fixed-vectors-shuffle-reverse.ll
index e773d93fad479..7fe8465e2aec8 100644
--- a/llvm/test/CodeGen/RISCV/rvv/fixed-vectors-shuffle-reverse.ll
+++ b/llvm/test/CodeGen/RISCV/rvv/fixed-vectors-shuffle-reverse.ll
@@ -94,13 +94,21 @@ define <32 x i1> @reverse_v32i1(<32 x i1> %a) {
 ; NO-ZVBB:       # %bb.0:
 ; NO-ZVBB-NEXT:    li a0, 32
 ; NO-ZVBB-NEXT:    vsetvli zero, a0, e8, m2, ta, ma
-; NO-ZVBB-NEXT:    vid.v v8
-; NO-ZVBB-NEXT:    li a0, 31
-; NO-ZVBB-NEXT:    vrsub.vx v8, v8, a0
-; NO-ZVBB-NEXT:    vmv.v.i v10, 0
-; NO-ZVBB-NEXT:    vmerge.vim v10, v10, 1, v0
-; NO-ZVBB-NEXT:    vrgather.vv v12, v10, v8
-; NO-ZVBB-NEXT:    vmsne.vi v0, v12, 0
+; NO-ZVBB-NEXT:    vmv.v.i v8, 0
+; NO-ZVBB-NEXT:    vmerge.vim v8, v8, 1, v0
+; NO-ZVBB-NEXT:    csrr a1, vlenb
+; NO-ZVBB-NEXT:    addi a2, a1, -1
+; NO-ZVBB-NEXT:    vsetvli a3, zero, e16, m2, ta, ma
+; NO-ZVBB-NEXT:    vid.v v10
+; NO-ZVBB-NEXT:    vrsub.vx v10, v10, a2
+; NO-ZVBB-NEXT:    vsetvli zero, zero, e8, m1, ta, ma
+; NO-ZVBB-NEXT:    vrgatherei16.vv v13, v8, v10
+; NO-ZVBB-NEXT:    vrgatherei16.vv v12, v9, v10
+; NO-ZVBB-NEXT:    slli a1, a1, 1
+; NO-ZVBB-NEXT:    addi a1, a1, -32
+; NO-ZVBB-NEXT:    vsetvli zero, a0, e8, m2, ta, ma
+; NO-ZVBB-NEXT:    vslidedown.vx v8, v12, a1
+; NO-ZVBB-NEXT:    vmsne.vi v0, v8, 0
 ; NO-ZVBB-NEXT:    ret
 ;
 ; ZVBB-LABEL: reverse_v32i1:
@@ -117,13 +125,23 @@ define <64 x i1> @reverse_v64i1(<64 x i1> %a) {
 ; NO-ZVBB:       # %bb.0:
 ; NO-ZVBB-NEXT:    li a0, 64
 ; NO-ZVBB-NEXT:    vsetvli zero, a0, e8, m4, ta, ma
+; NO-ZVBB-NEXT:    vmv.v.i v8, 0
+; NO-ZVBB-NEXT:    vmerge.vim v12, v8, 1, v0
+; NO-ZVBB-NEXT:    csrr a1, vlenb
+; NO-ZVBB-NEXT:    addi a2, a1, -1
+; NO-ZVBB-NEXT:    vsetvli a3, zero, e16, m2, ta, ma
 ; NO-ZVBB-NEXT:    vid.v v8
-; NO-ZVBB-NEXT:    li a0, 63
-; NO-ZVBB-NEXT:    vrsub.vx v8, v8, a0
-; NO-ZVBB-NEXT:    vmv.v.i v12, 0
-; NO-ZVBB-NEXT:    vmerge.vim v12, v12, 1, v0
-; NO-ZVBB-NEXT:    vrgather.vv v16, v12, v8
-; NO-ZVBB-NEXT:    vmsne.vi v0, v16, 0
+; NO-ZVBB-NEXT:    vrsub.vx v16, v8, a2
+; NO-ZVBB-NEXT:    vsetvli zero, zero, e8, m1, ta, ma
+; NO-ZVBB-NEXT:    vrgatherei16.vv v11, v12, v16
+; NO-ZVBB-NEXT:    vrgatherei16.vv v10, v13, v16
+; NO-ZVBB-NEXT:    vrgatherei16.vv v9, v14, v16
+; NO-ZVBB-NEXT:    vrgatherei16.vv v8, v15, v16
+; NO-ZVBB-NEXT:    slli a1, a1, 2
+; NO-ZVBB-NEXT:    addi a1, a1, -64
+; NO-ZVBB-NEXT:    vsetvli zero, a0, e8, m4, ta, ma
+; NO-ZVBB-NEXT:    vslidedown.vx v8, v8, a1
+; NO-ZVBB-NEXT:    vmsne.vi v0, v8, 0
 ; NO-ZVBB-NEXT:    ret
 ;
 ; ZVBB-LABEL: reverse_v64i1:
@@ -140,13 +158,27 @@ define <128 x i1> @reverse_v128i1(<128 x i1> %a) {
 ; CHECK:       # %bb.0:
 ; CHECK-NEXT:    li a0, 128
 ; CHECK-NEXT:    vsetvli zero, a0, e8, m8, ta, ma
+; CHECK-NEXT:    vmv.v.i v8, 0
+; CHECK-NEXT:    vmerge.vim v16, v8, 1, v0
+; CHECK-NEXT:    csrr a1, vlenb
+; CHECK-NEXT:    addi a2, a1, -1
+; CHECK-NEXT:    vsetvli a3, zero, e16, m2, ta, ma
 ; CHECK-NEXT:    vid.v v8
-; CHECK-NEXT:    li a0, 127
-; CHECK-NEXT:    vrsub.vx v8, v8, a0
-; CHECK-NEXT:    vmv.v.i v16, 0
-; CHECK-NEXT:    vmerge.vim v16, v16, 1, v0
-; CHECK-NEXT:    vrgather.vv v24, v16, v8
-; CHECK-NEXT:    vmsne.vi v0, v24, 0
+; CHECK-NEXT:    vrsub.vx v24, v8, a2
+; CHECK-NEXT:    vsetvli zero, zero, e8, m1, ta, ma
+; CHECK-NEXT:    vrgatherei16.vv v15, v16, v24
+; CHECK-NEXT:    vrgatherei16.vv v14, v17, v24
+; CHECK-NEXT:    vrgatherei16.vv v13, v18, v24
+; CHECK-NEXT:    vrgatherei16.vv v12, v19, v24
+; CHECK-NEXT:    vrgatherei16.vv v11, v20, v24
+; CHECK-NEXT:    vrgatherei16.vv v10, v21, v24
+; CHECK-NEXT:    vrgatherei16.vv v9, v22, v24
+; CHECK-NEXT:    vrgatherei16.vv v8, v23, v24
+; CHECK-NEXT:    slli a1, a1, 3
+; CHECK-NEXT:    addi a1, a1, -128
+; CHECK-NEXT:    vsetvli zero, a0, e8, m8, ta, ma
+; CHECK-NEXT:    vslidedown.vx v8, v8, a1
+; CHECK-NEXT:    vmsne.vi v0, v8, 0
 ; CHECK-NEXT:    ret
   %res = shufflevector <128 x i1> %a, <128 x i1> poison, <128 x i32> <i32 127, i32 126, i32 125, i32 124, i32 123, i32 122, i32 121, i32 120, i32 119, i32 118, i32 117, i32 116, i32 115, i32 114, i32 113, i32 112, i32 111, i32 110, i32 109, i32 108, i32 107, i32 106, i32 105, i32 104, i32 103, i32 102, i32 101, i32 100, i32 99, i32 98, i32 97, i32 96, i32 95, i32 94, i32 93, i32 92, i32 91, i32 90, i32 89, i32 88, i32 87, i32 86, i32 85, i32 84, i32 83, i32 82, i32 81, i32 80, i32 79, i32 78, i32 77, i32 76, i32 75, i32 74, i32 73, i32 72, i32 71, i32 70, i32 69, i32 68, i32 67, i32 66, i32 65, i32 64, i32 63, i32 62, i32 61, i32 60, i32 59, i32 58, i32 57, i32 56, i32 55, i32 54, i32 53, i32 52, i32 51, i32 50, i32 49, i32 48, i32 47, i32 46, i32 45, i32 44, i32 43, i32 42, i32 41, i32 40, i32 39, i32 38, i32 37, i32 36, i32 35, i32 34, i32 33, i32 32, i32 31, i32 30, i32 29, i32 28, i32 27, i32 26, i32 25, i32 24, i32 23, i32 22, i32 21, i32 20, i32 19, i32 18, i32 17, i32 16, i32 15, i32 14, i32 13, i32 12, i32 11, i32 10, i32 9, i32 8, i32 7, i32 6, i32 5, i32 4, i32 3, i32 2, i32 1, i32 0>
   ret <128 x i1> %res
@@ -220,13 +252,19 @@ define <16 x i8> @reverse_v16i8(<16 x i8> %a) {
 define <32 x i8> @reverse_v32i8(<32 x i8> %a) {
 ; CHECK-LABEL: reverse_v32i8:
 ; CHECK:       # %bb.0:
-; CHECK-NEXT:    li a0, 32
-; CHECK-NEXT:    vsetvli zero, a0, e8, m2, ta, ma
+; CHECK-NEXT:    csrr a0, vlenb
+; CHECK-NEXT:    addi a1, a0, -1
+; CHECK-NEXT:    vsetvli a2, zero, e16, m2, ta, ma
 ; CHECK-NEXT:    vid.v v10
-; CHECK-NEXT:    li a0, 31
-; CHECK-NEXT:    vrsub.vx v12, v10, a0
-; CHECK-NEXT:    vrgather.vv v10, v8, v12
-; CHECK-NEXT:    vmv.v.v v8, v10
+; CHECK-NEXT:    vrsub.vx v10, v10, a1
+; CHECK-NEXT:    vsetvli zero, zero, e8, m1, ta, ma
+; CHECK-NEXT:    vrgatherei16.vv v13, v8, v10
+; CHECK-NEXT:    vrgatherei16.vv v12, v9, v10
+; CHECK-NEXT:    slli a0, a0, 1
+; CHECK-NEXT:    addi a0, a0, -32
+; CHECK-NEXT:    li a1, 32
+; CHECK-NEXT:    vsetvli zero, a1, e8, m2, ta, ma
+; CHECK-NEXT:    vslidedown.vx v8, v12, a0
 ; CHECK-NEXT:    ret
   %res = shufflevector <32 x i8> %a, <32 x i8> poison, <32 x i32> <i32 31, i32 30, i32 29, i32 28, i32 27, i32 26, i32 25, i32 24, i32 23, i32 22, i32 21, i32 20, i32 19, i32 18, i32 17, i32 16, i32 15, i32 14, i32 13, i32 12, i32 11, i32 10, i32 9, i32 8, i32 7, i32 6, i32 5, i32 4, i32 3, i32 2, i32 1, i32 0>
   ret <32 x i8> %res
@@ -235,13 +273,21 @@ define <32 x i8> @reverse_v32i8(<32 x i8> %a) {
 define <64 x i8> @reverse_v64i8(<64 x i8> %a) {
 ; CHECK-LABEL: reverse_v64i8:
 ; CHECK:       # %bb.0:
-; CHECK-NEXT:    li a0, 64
-; CHECK-NEXT:    vsetvli zero, a0, e8, m4, ta, ma
+; CHECK-NEXT:    csrr a0, vlenb
+; CHECK-NEXT:    addi a1, a0, -1
+; CHECK-NEXT:    vsetvli a2, zero, e16, m2, ta, ma
 ; CHECK-NEXT:    vid.v v12
-; CHECK-NEXT:    li a0, 63
-; CHECK-NEXT:    vrsub.vx v16, v12, a0
-; CHECK-NEXT:    vrgather.vv v12, v8, v16
-; CHECK-NEXT:    vmv.v.v v8, v12
+; CHECK-NEXT:    vrsub.vx v16, v12, a1
+; CHECK-NEXT:    vsetvli zero, zero, e8, m1, ta, ma
+; CHECK-NEXT:    vrgatherei16.vv v15, v8, v16
+; CHECK-NEXT:    vrgatherei16.vv v14, v9, v16
+; CHECK-NEXT:    vrgatherei16.vv v13, v10, v16
+; CHECK-NEXT:    vrgatherei16.vv v12, v11, v16
+; CHECK-NEXT:    slli a0, a0, 2
+; CHECK-NEXT:    addi a0, a0, -64
+; CHECK-NEXT:    li a1, 64
+; CHECK-NEXT:    vsetvli zero, a1, e8, m4, ta, ma
+; CHECK-NEXT:    vslidedown.vx v8, v12, a0
 ; CHECK-NEXT:    ret
   %res = shufflevector <64 x i8> %a, <64 x i8> poison, <64 x i32> <i32 63, i32 62, i32 61, i32 60, i32 59, i32 58, i32 57, i32 56, i32 55, i32 54, i32 53, i32 52, i32 51, i32 50, i32 49, i32 48, i32 47, i32 46, i32 45, i32 44, i32 43, i32 42, i32 41, i32 40, i32 39, i32 38, i32 37, i32 36, i32 35, i32 34, i32 33, i32 32, i32 31, i32 30, i32 29, i32 28, i32 27, i32 26, i32 25, i32 24, i32 23, i32 22, i32 21, i32 20, i32 19, i32 18, i32 17, i32 16, i32 15, i32 14, i32 13, i32 12, i32 11, i32 10, i32 9, i32 8, i32 7, i32 6, i32 5, i32 4, i32 3, i32 2, i32 1, i32 0>
   ret <64 x i8> %res
@@ -302,11 +348,17 @@ define <8 x i16> @reverse_v8i16(<8 x i16> %a) {
 define <16 x i16> @reverse_v16i16(<16 x i16> %a) {
 ; CHECK-LABEL: reverse_v16i16:
 ; CHECK:       # %bb.0:
-; CHECK-NEXT:    vsetivli zero, 16, e16, m2, ta, ma
+; CHECK-NEXT:    csrr a0, vlenb
+; CHECK-NEXT:    srli a1, a0, 1
+; CHECK-NEXT:    addi a1, a1, -1
+; CHECK-NEXT:    vsetvli a2, zero, e16, m1, ta, ma
 ; CHECK-NEXT:    vid.v v10
-; CHECK-NEXT:    vrsub.vi v12, v10, 15
-; CHECK-NEXT:    vrgather.vv v10, v8, v12
-; CHECK-NEXT:    vmv.v.v v8, v10
+; CHECK-NEXT:    vrsub.vx v10, v10, a1
+; CHECK-NEXT:    vrgather.vv v13, v8, v10
+; CHECK-NEXT:    vrgather.vv v12, v9, v10
+; CHECK-NEXT:    addi a0, a0, -16
+; CHECK-NEXT:    vsetivli zero, 16, e16, m2, ta, ma
+; CHECK-NEXT:    vslidedown.vx v8, v12, a0
 ; CHECK-NEXT:    ret
   %res = shufflevector <16 x i16> %a, <16 x i16> poison, <16 x i32> <i32 15, i32 14, i32 13, i32 12, i32 11, i32 10, i32 9, i32 8, i32 7, i32 6, i32 5, i32 4, i32 3, i32 2, i32 1, i32 0>
   ret <16 x i16> %res
@@ -315,13 +367,21 @@ define <16 x i16> @reverse_v16i16(<16 x i16> %a) {
 define <32 x i16> @reverse_v32i16(<32 x i16> %a) {
 ; CHECK-LABEL: reverse_v32i16:
 ; CHECK:       # %bb.0:
-; CHECK-NEXT:    li a0, 32
-; CHECK-NEXT:    vsetvli zero, a0, e16, m4, ta, ma
+; CHECK-NEXT:    csrr a0, vlenb
+; CHECK-NEXT:    srli a1, a0, 1
+; CHECK-NEXT:    addi a1, a1, -1
+; CHECK-NEXT:    vsetvli a2, zero, e16, m1, ta, ma
 ; CHECK-NEXT:    vid.v v12
-; CHECK-NEXT:    li a0, 31
-; CHECK-NEXT:    vrsub.vx v16, v12, a0
-; CHECK-NEXT:    vrgather.vv v12, v8, v16
-; CHECK-NEXT:    vmv.v.v v8, v12
+; CHECK-NEXT:    vrsub.vx v16, v12, a1
+; CHECK-NEXT:    vrgather.vv v15, v8, v16
+; CHECK-NEXT:    vrgather.vv v14, v9, v16
+; CHECK-NEXT:    vrgather.vv v13, v10, v16
+; CHECK-NEXT:    vrgather.vv v12, v11, v16
+; CHECK-NEXT:    slli a0, a0, 1
+; CHECK-NEXT:    addi a0, a0, -32
+; CHECK-NEXT:    li a1, 32
+; CHECK-NEXT:    vsetvli zero, a1, e16, m4, ta, ma
+; CHECK-NEXT:    vslidedown.vx v8, v12, a0
 ; CHECK-NEXT:    ret
   %res = shufflevector <32 x i16> %a, <32 x i16> poison, <32 x i32> <i32 31, i32 30, i32 29, i32 28, i32 27, i32 26, i32 25, i32 24, i32 23, i32 22, i32 21, i32 20, i32 19, i32 18, i32 17, i32 16, i32 15, i32 14, i32 13, i32 12, i32 11, i32 10, i32 9, i32 8, i32 7, i32 6, i32 5, i32 4, i32 3, i32 2, i32 1, i32 0>
   ret <32 x i16> %res
@@ -369,12 +429,18 @@ define <4 x i32> @reverse_v4i32(<4 x i32> %a) {
 define <8 x i32> @reverse_v8i32(<8 x i32> %a) {
 ; CHECK-LABEL: reverse_v8i32:
 ; CHECK:       # %bb.0:
-; CHECK-NEXT:    vsetivli zero, 8, e16, m1, ta, ma
+; CHECK-NEXT:    csrr a0, vlenb
+; CHECK-NEXT:    srli a1, a0, 2
+; CHECK-NEXT:    addi a1, a1, -1
+; CHECK-NEXT:    vsetvli a2, zero, e32, m1, ta, ma
 ; CHECK-NEXT:    vid.v v10
-; CHECK-NEXT:    vrsub.vi v12, v10, 7
-; CHECK-NEXT:    vsetvli zero, zero, e32, m2, ta, ma
-; CHECK-NEXT:    vrgatherei16.vv v10, v8, v12
-; CHECK-NEXT:    vmv.v.v v8, v10
+; CHECK-NEXT:    vrsub.vx v10, v10, a1
+; CHECK-NEXT:    vrgather.vv v13, v8, v10
+; CHECK-NEXT:    vrgather.vv v12, v9, v10
+; CHECK-NEXT:    srli a0, a0, 1
+; CHECK-NEXT:    addi a0, a0, -8
+; CHECK-NEXT:    vsetivli zero, 8, e32, m2, ta, ma
+; CHECK-NEXT:    vslidedown.vx v8, v12, a0
 ; CHECK-NEXT:    ret
   %res = shufflevector <8 x i32> %a, <8 x i32> poison, <8 x i32> <i32 7, i32 6, i32 5, i32 4, i32 3, i32 2, i32 1, i32 0>
   ret <8 x i32> %res
@@ -383,12 +449,19 @@ define <8 x i32> @reverse_v8i32(<8 x i32> %a) {
 define <16 x i32> @reverse_v16i32(<16 x i32> %a) {
 ; CHECK-LABEL: reverse_v16i32:
 ; CHECK:       # %bb.0:
-; CHECK-NEXT:    vsetivli zero, 16, e16, m2, ta, ma
+; CHECK-NEXT:    csrr a0, vlenb
+; CHECK-NEXT:    srli a1, a0, 2
+; CHECK-NEXT:    addi a1, a1, -1
+; CHECK-NEXT:    vsetvli a2, zero, e32, m1, ta, ma
 ; CHECK-NEXT:    vid.v v12
-; CHECK-NEXT:    vrsub.vi v16, v12, 15
-; CHECK-NEXT:    vsetvli zero, zero, e32, m4, ta, ma
-; CHECK-NEXT:    vrgatherei16.vv v12, v8, v16
-; CHECK-NEXT:    vmv.v.v v8, v12
+; CHECK-NEXT:    vrsub.vx v16, v12, a1
+; CHECK-NEXT:    vrgather.vv v15, v8, v16
+; CHECK-NEXT:    vrgather.vv v14, v9, v16
+; CHECK-NEXT:    vrgather.vv v13, v10, v16
+; CHECK-NEXT:    vrgather.vv v12, v11, v16
+; CHECK-NEXT:    addi a0, a0, -16
+; CHECK-NEXT:    vsetivli zero, 16, e32, m4, ta, ma
+; CHECK-NEXT:    vslidedown.vx v8, v12, a0
 ; CHECK-NEXT:    ret
   %res = shufflevector <16 x i32> %a, <16 x i32> poison, <16 x i32> <i32 15, i32 14, i32 13, i32 12, i32 11, i32 10, i32 9, i32 8, i32 7, i32 6, i32 5, i32 4, i32 3, i32 2, i32 1, i32 0>
   ret <16 x i32> %res
@@ -417,12 +490,18 @@ define <2 x i64> @reverse_v2i64(<2 x i64> %a) {
 define <4 x i64> @reverse_v4i64(<4 x i64> %a) {
 ; CHECK-LABEL: reverse_v4i64:
 ; CHECK:       # %bb.0:
-; CHECK-NEXT:    vsetivli zero, 4, e16, mf2, ta, ma
+; CHECK-NEXT:    csrr a0, vlenb
+; CHECK-NEXT:    srli a1, a0, 3
+; CHECK-NEXT:    addi a1, a1, -1
+; CHECK-NEXT:    vsetvli a2, zero, e64, m1, ta, ma
 ; CHECK-NEXT:    vid.v v10
-; CHECK-NEXT:    vrsub.vi v12, v10, 3
-; CHECK-NEXT:    vsetvli zero, zero, e64, m2, ta, ma
-; CHECK-NEXT:    vrgatherei16.vv v10, v8, v12
-; CHECK-NEXT:    vmv.v.v v8, v10
+; CHECK-NEXT:    vrsub.vx v10, v10, a1
+; CHECK-NEXT:    vrgather.vv v13, v8, v10
+; CHECK-NEXT:    vrgather.vv v12, v9, v10
+; CHECK-NEXT:    srli a0, a0, 2
+; CHECK-NEXT:    addi a0, a0, -4
+; CHECK-NEXT:    vsetivli zero, 4, e64, m2, ta, ma
+; CHECK-NEXT:    vslidedown.vx v8, v12, a0
 ; CHECK-NEXT:    ret
   %res = shufflevector <4 x i64> %a, <4 x i64> poison, <4 x i32> <i32 3, i32 2, i32 1, i32 0>
   ret <4 x i64> %res
@@ -431,12 +510,20 @@ define <4 x i64> @reverse_v4i64(<4 x i64> %a) {
 define <8 x i64> @reverse_v8i64(<8 x i64> %a) {
 ; CHECK-LABEL: reverse_v8i64:
 ; CHECK:       # %bb.0:
-; CHECK-NEXT:    vsetivli zero, 8, e16, m1, ta, ma
+; CHECK-NEXT:    csrr a0, vlenb
+; CHECK-NEXT:    srli a1, a0, 3
+; CHECK-NEXT:    addi a1, a1, -1
+; CHECK-NEXT:    vsetvli a2, zero, e64, m1, ta, ma
 ; CHECK-NEXT:    vid.v v12
-; CHECK-NEXT:    vrsub.vi v16, v12, 7
-; CHECK-NEXT:    vsetvli zero, zero, e64, m4, ta, ma
-; CHECK-NEXT:    vrgatherei16.vv v12, v8, v16
-; CHECK-NEXT:    vmv.v.v v8, v12
+; CHECK-NEXT:    vrsub.vx v16, v12, a1
+; CHECK-NEXT:    vrgather.vv v15, v8, v16
+; CHECK-NEXT:    vrgather.vv v14, v9, v16
+; CHECK-NEXT:    vrgather.vv v13, v10, v16
+; CHECK-NEXT:    vrgather.vv v12, v11, v16
+; CHECK-NEXT:    srli a0, a0, 1
+; CHECK-NEXT:    addi a0, a0, -8
+; CHECK-NEXT:    vsetivli zero, 8, e64, m4, ta, ma
+; CHECK-NEXT:    vslidedown.vx v8, v12, a0
 ; CHECK-NEXT:    ret
   %res = shufflevector <8 x i64> %a, <8 x i64> poison, <8 x i32> <i32 7, i32 6, i32 5, i32 4, i32 3, i32 2, i32 1, i32 0>
   ret <8 x i64> %res
@@ -498,11 +585,17 @@ define <8 x half> @reverse_v8f16(<8 x half> %a) {
 define <16 x half> @reverse_v16f16(<16 x half> %a) {
 ; CHECK-LABEL: reverse_v16f16:
 ; CHECK:       # %bb.0:
-; CHECK-NEXT:    vsetivli zero, 16, e16, m2, ta, ma
+; CHECK-NEXT:    csrr a0, vlenb
+; CHECK-NEXT:    srli a1, a0, 1
+; CHECK-NEXT:    addi a1, a1, -1
+; CHECK-NEXT:    vsetvli a2, zero, e16, m1, ta, ma
 ; CHECK-NEXT:    vid.v v10
-; CHECK-NEXT:    vrsub.vi v12, v10, 15
-; CHECK-NEXT:    vrgather.vv v10, v8, v12
-; CHECK-NEXT:    vmv.v.v v8, v10
+; CHECK-NEXT:    vrsub.vx v10, v10, a1
+; CHECK-NEXT:    vrgather.vv v13, v8, v10
+; CHECK-NEXT:    vrgather.vv v12, v9, v10
+; CHECK-NEXT:    addi a0, a0, -16
+; CHECK-NEXT:    vsetivli zero, 16, e16, m2, ta, ma
+; CHECK-NEXT:    vslidedown.vx v8, v12, a0
 ; CHECK-NEXT:    ret
   %res = shufflevector <16 x half> %a, <16 x half> poison, <16 x i32> <i32 15, i32 14, i32 13, i32 12, i32 11, i32 10, i32 9, i32 8, i32 7, i32 6, i32 5, i32 4, i32 3, i32 2, i32 1, i32 0>
   ret <16 x half> %res
@@ -511,13 +604,21 @@ define <16 x half> @reverse_v16f16(<16 x half> %a) {
 define <32 x half> @reverse_v32f16(<32 x half> %a) {
 ; CHECK-LABEL: reverse_v32f16:
 ; CHECK:       # %bb.0:
-; CHECK-NEXT:    li a0, 32
-; CHECK-NEXT:    vsetvli zero, a0, e16, m4, ta, ma
+; CHECK-NEXT:    csrr a0, vlenb
+; CHECK-NEXT:    srli a1, a0, 1
+; CHECK-NEXT:    addi a1, a1, -1
+; CHECK-NEXT:    vsetvli a2, zero, e16, m1, ta, ma
 ; CHECK-NEXT:    vid.v v12
-; CHECK-NEXT:    li a0, 31
-; CHECK-NEXT:    vrsub.vx v16, v12, a0
-; CHECK-NEXT:    vrgather.vv v12, v8, v16
-; CHECK-NEXT:    vmv.v.v v8, v12
+; CHECK-NEXT:    vrsub.vx v16, v12, a1
+; CHECK-NEXT:    vrgather.vv v15, v8, v16
+; CHECK-NEXT:    vrgather.vv v14, v9, v16
+; CHECK-NEXT:    vrgather.vv v13, v10, v16
+; CHECK-NEXT:    vrgather.vv v12, v11, v16
+; CHECK-NEXT:    slli a0, a0, 1
+; CHECK-NEXT:    addi a0, a0, -32
+; CHECK-NEXT:    li a1, 32
+; CHECK-NEXT:    vsetvli zero, a1, e16, m4, ta, ma
+; CHECK-NEXT:    vslidedown.vx v8, v12, a0
 ; CHECK-NEXT:    ret
   %res = shufflevector <32 x half> %a, <32 x half> poison, <32 x i32> <i32 31, i32 30, i32 29, i32 28, i32 27, i32 26, i32 25, i32 24, i32 23, i32 22, i32 21, i32 20, i32 19, i32 18, i32 17, i32 16, i32 15, i32 14, i32 13, i32 12, i32 11, i32 10, i32 9, i32 8, i32 7, i32 6, i32 5, i32 4, i32 3, i32 2, i32 1, i32 0>
   ret <32 x half> %res
@@ -565,12 +666,18 @@ define <4 x float> @reverse_v4f32(<4 x float> %a) {
 define <8 x float> @reverse_v8f32(<8 x float> %a) {
 ; CHECK-LABEL: reverse_v8f32:
 ; CHECK:       # %bb.0:
...
[truncated]

@topperc
Collaborator

topperc commented Aug 16, 2024

As far as I'm aware, vrgather.vv is quadratic in LMUL on most microarchitectures today due to each output register needing to read from each input register in the group.

SiFive p470 and p670 are quadratic in the worst case, but will skip reading input registers when they aren't used.

Earlier versions of x280 were one element per cycle, but newer generations will improve.

For smaller VLEN and large EEW it's impossible to read all sources at high LMUL. For example, VLEN=128 SEW=64 has only 2 elements per register so can only depend on 2 source registers in the worst case. Is known hardware still quadratic in LMUL for this case?

@W-angler

For reverse operations in the vectorizer, I think we can generate strided load/store with negative stride for some cases. This is what llvm-epi has done, and as far as I know, many downstreams do the same thing.

@preames
Collaborator

preames commented Aug 16, 2024

As far as I'm aware, vrgather.vv is quadratic in LMUL on most microarchitectures today due to each output register needing to read from each input register in the group.

SiFive p470 and p670 are quadratic in the worst case, but will skip reading input registers when they aren't used.

Earlier versions of x280 were one element per cycle, but newer generations will improve.

For these processors, how does the lowering here compare to the default vrgather.vv? Is it at least neutral? Or do we need a tuning flag?

For smaller VLEN and large EEW it's impossible to read all sources at high LMUL. For example, VLEN=128 SEW=64 has only 2 elements per register so can only depend on 2 source registers in the worst case. Is known hardware still quadratic in LMUL for this case?

It can only depend on two source registers, but which two source registers is not known until runtime. Depending on where/how vector operations are split, I could see reasonable implementations which both were and weren't able to exploit this.

We should definitely run this test on the BP3. From the fact Luke saw an improvement on the BP3, I'd guess it either doesn't implement this optimization, or that VLEN=256, implying potentially four sources, is enough to decrease the impact.

@preames
Collaborator

preames commented Aug 16, 2024

For reverse operations in the vectorizer, I think we can generate strided load/store with negative stride for some cases. This is what llvm-epi has done, and as far as I know, many downstreams do the same thing.

Agreed, but this is not a reason to not improve the lowering. It's more orthogonal work in the same area.

; CHECK-NEXT: srli a0, a0, 1
; CHECK-NEXT: addi a0, a0, -8
; CHECK-NEXT: vsetivli zero, 8, e64, m4, ta, ma
; CHECK-NEXT: vslidedown.vx v8, v16, a0
Collaborator


In the fixed vector tests, we should have an exact VLEN test for (a) precisely full registers (where the vslidedown can be pruned entirely), and (b) a vector prefix (where the slide amount is a known constant). I skimmed and didn't see these, but there's enough test churn I may have just missed them.

Contributor Author


Weirdly enough we never actually hit the vrgather expansion in lowerVECTOR_REVERSE with an exact VLEN, since lowerShuffleViaVRegSplitting will kick in beforehand and already split it up into LMUL * M1 shuffles.
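
For reference, a minimal sketch of what such exact-VLEN fixed-vector reverse tests could look like (the function names and the vscale_range(2,2) choice, i.e. an assumed VLEN of 128, are illustrative and not taken from the patch):

; (a) precisely full registers: <16 x i16> is exactly two full registers at VLEN=128,
;     so the trailing vslidedown could be pruned entirely.
define <16 x i16> @reverse_v16i16_exact_vlen(<16 x i16> %a) vscale_range(2,2) {
  %res = shufflevector <16 x i16> %a, <16 x i16> poison, <16 x i32> <i32 15, i32 14, i32 13, i32 12, i32 11, i32 10, i32 9, i32 8, i32 7, i32 6, i32 5, i32 4, i32 3, i32 2, i32 1, i32 0>
  ret <16 x i16> %res
}

; (b) a vector prefix: <12 x i16> only partially fills its container at VLEN=128,
;     so the slide amount is a known constant.
define <12 x i16> @reverse_v12i16_exact_vlen(<12 x i16> %a) vscale_range(2,2) {
  %res = shufflevector <12 x i16> %a, <12 x i16> poison, <12 x i32> <i32 11, i32 10, i32 9, i32 8, i32 7, i32 6, i32 5, i32 4, i32 3, i32 2, i32 1, i32 0>
  ret <12 x i16> %res
}

As noted above, with an exact VLEN these would be expected to go through lowerShuffleViaVRegSplitting rather than this expansion.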

@camel-cdr

camel-cdr commented Aug 16, 2024

Is known hardware still quadratic in LMUL for this case?

Yes, it behaves like that: at least the C906/C908/C910 and the X60 seem to perform the same regardless of SEW.

SiFive p470 and p670 are quadratic in the worst case, but will skip reading input registers when they aren't used.

That isn't reflected in the scheduling models, though, unless I don't understand how they work. The P670 scheduling model reports 1/2/4/8 scaling, and the P470 1/12/16/24. See RISCVSchedSiFiveP600.td and the P470 PR.

I think it is unlikely that unrolling to LMUL=1, when possible, will perform worse by more than a few percent on any processor, while it has already been shown that there can be huge benefits:

  • SpacemiT X60
  • XuanTie C908
  • XuanTie C910 (the measurements are all over the place; I'm not sure what happened there)
  • XiangShanV3 (there were some performance problems with the vsetvli implementation, which should be fixed now; however, I wasn't able to find a commit on which I could run the benchmark again)

There was no difference on saturn-vector; it's probably close to the X280 implementations, since it's also one element per cycle and uses a lot of chaining.

@topperc
Collaborator

topperc commented Aug 16, 2024

That isn't reflected in the scheduling models, though, unless I don't understand how they work. The P670 scheduling model reports 1/2/4/8 scaling, and the P470 1/12/16/24. See RISCVSchedSiFiveP600.td and the P470 PR.

The scheduling model numbers came from llvm-exegesis, but we realized yesterday that llvm-exegesis wasn't controlling the input indices to produce the worst case.

@lukel97
Contributor Author

lukel97 commented Aug 16, 2024

For reverse operations in the vectorizer, I think we can generate strided load/store with negative stride for some cases.

I was thinking that as well; I haven't fully looked into why the loop vectorizer doesn't just do that.

SiFive p470 and p670 are quadratic in the worst case, but will skip reading input registers when they aren't used.

I would be happy to gate this behind a "HasOptimizedVrgather" feature, enabled for the spacemit-x60 by default.

@topperc
Collaborator

topperc commented Aug 16, 2024

I was thinking that as well; I haven't fully looked into why the loop vectorizer doesn't just do that.

I don't think the vectorizer knows how to make a strided load. So it would have to be a masked.gather with vscale minus step vector?

I have code in my downstream for vp.load+vp.reverse -> vp.strided.load (might have come from BSC), but not for load+reverse.
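
For illustration, one possible shape of that gather form (the element type, the i64 index width, and the %n trip-count operand are assumptions for the sketch, not what the vectorizer currently emits):

; Hypothetical sketch: a reversed load expressed as a masked gather whose indices
; count down from %n - 1, so the decreasing addresses encode the reversal without
; a separate vector.reverse shuffle. Lanes past %n are expected to be masked off.
define <vscale x 2 x i32> @reverse_load_as_gather(ptr %p, <vscale x 2 x i1> %mask, i64 %n) {
  %step = call <vscale x 2 x i64> @llvm.experimental.stepvector.nxv2i64()
  %last = add i64 %n, -1
  %last.ins = insertelement <vscale x 2 x i64> poison, i64 %last, i64 0
  %last.splat = shufflevector <vscale x 2 x i64> %last.ins, <vscale x 2 x i64> poison, <vscale x 2 x i32> zeroinitializer
  %idx = sub <vscale x 2 x i64> %last.splat, %step
  %ptrs = getelementptr inbounds i32, ptr %p, <vscale x 2 x i64> %idx
  %v = call <vscale x 2 x i32> @llvm.masked.gather.nxv2i32.nxv2p0(<vscale x 2 x ptr> %ptrs, i32 4, <vscale x 2 x i1> %mask, <vscale x 2 x i32> poison)
  ret <vscale x 2 x i32> %v
}

declare <vscale x 2 x i64> @llvm.experimental.stepvector.nxv2i64()
declare <vscale x 2 x i32> @llvm.masked.gather.nxv2i32.nxv2p0(<vscale x 2 x ptr>, i32 immarg, <vscale x 2 x i1>, <vscale x 2 x i32>)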

@wangpc-pp
Contributor

I was thinking that as well; I haven't fully looked into why the loop vectorizer doesn't just do that.

I don't think the vectorizer knows how to make a strided load. So it would have to be a masked.gather with vscale minus step vector?

I have code in my downstream for vp.load+vp.reverse -> vp.strided.load (might have come from BSC), but not for load+reverse.

We don't generate strided loads/stores in the vectorizer. For example:

if (auto *VPMask = getMask()) {
  // Mask reversal is only needed for non-all-one (null) masks, as reverse
  // of a null all-one mask is a null mask.
  Mask = State.get(VPMask, Part);
  if (isReverse())
    Mask = Builder.CreateVectorReverse(Mask, "reverse");
}
Value *Addr = State.get(getAddr(), Part, /*IsScalar*/ !CreateGather);
if (CreateGather) {
  NewLI = Builder.CreateMaskedGather(DataTy, Addr, Alignment, Mask, nullptr,
                                     "wide.masked.gather");
} else if (Mask) {
  NewLI = Builder.CreateMaskedLoad(DataTy, Addr, Alignment, Mask,
                                   PoisonValue::get(DataTy),
                                   "wide.masked.load");
} else {
  NewLI = Builder.CreateAlignedLoad(DataTy, Addr, Alignment, "wide.load");
}
// Add metadata to the load, but setVectorValue to the reverse shuffle.
State.addMetadata(NewLI, LI);
if (Reverse)
  NewLI = Builder.CreateVectorReverse(NewLI, "reverse");
State.set(this, NewLI, Part);

But it could simply generate vp.strided.load/vp.strided.store instead:
https://llvm.org/docs/LangRef.html#llvm-experimental-vp-strided-load-intrinsic
The change won't be large, but it may be too specific.
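
For illustration, a sketch of that form with a negative stride (the element type, the byte stride of -4, and the function name are assumptions for the example, not vectorizer output):

; Hypothetical sketch: load the last %evl i32 elements of a buffer in reversed order
; by starting at the last active element and stepping backwards one i32 (-4 bytes).
; A non-all-ones %mask would also need to be reversed to line up with the flipped order.
define <vscale x 2 x i32> @reverse_load_as_strided(ptr %p, <vscale x 2 x i1> %mask, i32 %evl) {
  %evl.zext = zext i32 %evl to i64
  %last.idx = add i64 %evl.zext, -1
  %last = getelementptr inbounds i32, ptr %p, i64 %last.idx
  %v = call <vscale x 2 x i32> @llvm.experimental.vp.strided.load.nxv2i32.p0.i64(ptr %last, i64 -4, <vscale x 2 x i1> %mask, i32 %evl)
  ret <vscale x 2 x i32> %v
}

declare <vscale x 2 x i32> @llvm.experimental.vp.strided.load.nxv2i32.p0.i64(ptr, i64, <vscale x 2 x i1>, i32)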

; RV32-BITS-UNKNOWN-NEXT: vrgatherei16.vv v13, v10, v8
; RV32-BITS-UNKNOWN-NEXT: vrgatherei16.vv v12, v11, v8
; RV32-BITS-UNKNOWN-NEXT: vsetvli a0, zero, e8, m2, ta, ma
; RV32-BITS-UNKNOWN-NEXT: vand.vi v8, v12, 1
Collaborator


Observation: the vand.vi here is unnecessary.

@@ -269,7 +269,8 @@ def SIFIVE_P470 : RISCVProcessorModel<"sifive-p470", SiFiveP400Model,
                                        FeatureUnalignedScalarMem,
                                        FeatureUnalignedVectorMem]),
                           !listconcat(SiFiveP400TuneFeatures,
-                                      [TuneNoSinkSplatOperands])>;
+                                      [TuneNoSinkSplatOperands,
+                                       TuneOptimizedVectorGather])>;
Collaborator


I looked back over our internal documentation. I think this might still be profitable on these CPUs. There's an additional constant overhead for LMUL>1 even when the indices only read a single source.

Collaborator

@preames preames left a comment


LGTM

It seems like the profitability discussion has settled out with a "better in some cases, and not significantly worse in others" answer, and thus this is reasonable to enable by default. Please chime in if I'm off in believing we have consensus this is okay.

On the vectorizer discussion, that seems like an unrelated work item. Fair warning: I had looked into that a long time ago and it was more complicated than it first appears.

This reverts commit a2d5a537a98111c8659dad1d85dc4f3a2b4a7786.
@lukel97 lukel97 force-pushed the vector_reverse-decompose branch from b975e68 to b5c5953 on August 29, 2024 at 05:47
@lukel97 lukel97 merged commit 3b64ede into llvm:main Aug 29, 2024
6 of 8 checks passed