
Commit 3b64ede (parent: 2adc94c)

[RISCV] Decompose LMUL > 1 reverses into LMUL * M1 vrgather.vv (llvm#104574)
As far as I'm aware, vrgather.vv is quadratic in LMUL on most microarchitectures today due to each output register needing to read from each input register in the group. For example, the reciprocal throughput for vrgather.vv on the spacemit-x60 is listed on https://camel-cdr.github.io/rvv-bench-results/bpi_f3 as:

LMUL1  LMUL2  LMUL4  LMUL8
4.0    16.0   64.0   256.1

Vector reverses are commonly emitted by the loop vectorizer and are lowered as vrgather.vvs, but since the loop vectorizer uses LMUL 2 by default they end up being quadratic. The output registers in a reverse only need to read from one input register though, so we can decompose this into LMUL * M1 vrgather.vvs to get linear performance.

This gives a 0.43% runtime improvement on 526.blender_r at rva22u64_v O3 on the Banana Pi F3.
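To put numbers on that with the spacemit-x60 figures above: at the loop vectorizer's default LMUL 2, one m2 vrgather.vv costs about 16, while two m1 vrgather.vvs cost about 2 * 4 = 8; at LMUL 8 the gap is roughly 256 versus 8 * 4 = 32. The snippet below is a scalar model, not RVV code and not part of the patch (ElemsPerReg and NumRegs are made-up stand-ins for VLEN/SEW and LMUL), showing why a per-register decomposition is sufficient for a reverse: destination register i only ever reads source register NumRegs - 1 - i, with the elements inside it reversed.

    #include <array>
    #include <cstdio>

    // Illustrative scalar model only (not RVV, not part of the patch).
    // ElemsPerReg and NumRegs are made-up stand-ins for VLEN/SEW and LMUL.
    constexpr int ElemsPerReg = 4; // one "m1 register" worth of elements
    constexpr int NumRegs = 4;     // the register group size, i.e. "LMUL"
    constexpr int N = ElemsPerReg * NumRegs;

    int main() {
      std::array<int, N> Src{}, Dst{};
      for (int I = 0; I < N; ++I)
        Src[I] = I;

      // One "m1 vrgather.vv" per destination register: register Out only reads
      // the single source register NumRegs - 1 - Out, so the total work is
      // linear in the number of registers instead of quadratic.
      for (int Out = 0; Out < NumRegs; ++Out) {
        int In = NumRegs - 1 - Out;
        for (int E = 0; E < ElemsPerReg; ++E)
          Dst[Out * ElemsPerReg + E] = Src[In * ElemsPerReg + (ElemsPerReg - 1 - E)];
      }

      for (int I = 0; I < N; ++I)
        std::printf("%d ", Dst[I]); // 15 14 13 ... 1 0
      std::printf("\n");
      return 0;
    }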

File tree

5 files changed (+1146 / -614 lines)

llvm/lib/Target/RISCV/RISCVISelLowering.cpp

Lines changed: 44 additions & 3 deletions
@@ -10328,6 +10328,50 @@ SDValue RISCVTargetLowering::lowerVECTOR_REVERSE(SDValue Op,
     Vec = convertToScalableVector(ContainerVT, Vec, DAG, Subtarget);
   }

+  MVT XLenVT = Subtarget.getXLenVT();
+  auto [Mask, VL] = getDefaultVLOps(VecVT, ContainerVT, DL, DAG, Subtarget);
+
+  // On some uarchs vrgather.vv will read from every input register for each
+  // output register, regardless of the indices. However to reverse a vector
+  // each output register only needs to read from one register. So decompose it
+  // into LMUL * M1 vrgather.vvs, so we get O(LMUL) performance instead of
+  // O(LMUL^2).
+  //
+  // vsetvli a1, zero, e64, m4, ta, ma
+  // vrgatherei16.vv v12, v8, v16
+  // ->
+  // vsetvli a1, zero, e64, m1, ta, ma
+  // vrgather.vv v15, v8, v16
+  // vrgather.vv v14, v9, v16
+  // vrgather.vv v13, v10, v16
+  // vrgather.vv v12, v11, v16
+  if (ContainerVT.bitsGT(getLMUL1VT(ContainerVT)) &&
+      ContainerVT.getVectorElementCount().isKnownMultipleOf(2)) {
+    auto [Lo, Hi] = DAG.SplitVector(Vec, DL);
+    Lo = DAG.getNode(ISD::VECTOR_REVERSE, DL, Lo.getSimpleValueType(), Lo);
+    Hi = DAG.getNode(ISD::VECTOR_REVERSE, DL, Hi.getSimpleValueType(), Hi);
+    SDValue Concat = DAG.getNode(ISD::CONCAT_VECTORS, DL, ContainerVT, Hi, Lo);
+
+    // Fixed length vectors might not fit exactly into their container, and so
+    // leave a gap in the front of the vector after being reversed. Slide this
+    // away.
+    //
+    // x x x x 3 2 1 0 <- v4i16 @ vlen=128
+    // 0 1 2 3 x x x x <- reverse
+    // x x x x 0 1 2 3 <- vslidedown.vx
+    if (VecVT.isFixedLengthVector()) {
+      SDValue Offset = DAG.getNode(
+          ISD::SUB, DL, XLenVT,
+          DAG.getElementCount(DL, XLenVT, ContainerVT.getVectorElementCount()),
+          DAG.getElementCount(DL, XLenVT, VecVT.getVectorElementCount()));
+      Concat =
+          getVSlidedown(DAG, Subtarget, DL, ContainerVT,
+                        DAG.getUNDEF(ContainerVT), Concat, Offset, Mask, VL);
+      Concat = convertFromScalableVector(VecVT, Concat, DAG, Subtarget);
+    }
+    return Concat;
+  }
+
   unsigned EltSize = ContainerVT.getScalarSizeInBits();
   unsigned MinSize = ContainerVT.getSizeInBits().getKnownMinValue();
   unsigned VectorBitsMax = Subtarget.getRealMaxVLen();

@@ -10375,9 +10419,6 @@ SDValue RISCVTargetLowering::lowerVECTOR_REVERSE(SDValue Op,
     IntVT = IntVT.changeVectorElementType(MVT::i16);
   }

-  MVT XLenVT = Subtarget.getXLenVT();
-  auto [Mask, VL] = getDefaultVLOps(VecVT, ContainerVT, DL, DAG, Subtarget);
-
   // Calculate VLMAX-1 for the desired SEW.
   SDValue VLMinus1 = DAG.getNode(
       ISD::SUB, DL, XLenVT,
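A note on how the new block handles larger groups: it only splits once, but the halves are emitted as fresh VECTOR_REVERSE nodes, so an LMUL 4 or LMUL 8 source should come back through this same lowering and keep halving until it reaches LMUL 1. For fixed-length types the result is then slid down by (container element count - fixed element count) to close the gap left at the front of the container, as the in-code diagram shows. Below is a scalar sketch of that control flow on plain std::vector data; RegElems and the 6-in-8 container are invented for illustration, and none of this is SelectionDAG API.

    #include <cstdio>
    #include <vector>

    // Illustrative scalar model only (not SelectionDAG code, not part of the
    // patch). RegElems is a made-up stand-in for the LMUL 1 element count.
    constexpr size_t RegElems = 4;

    // Mirrors the shape of the new lowering: split in half, reverse each half
    // (the recursion stands in for the halves being re-lowered as their own
    // VECTOR_REVERSE nodes), then concatenate the reversed high half in front
    // of the reversed low half.
    std::vector<int> reverseVec(const std::vector<int> &V) {
      if (V.size() <= RegElems)
        return {V.rbegin(), V.rend()}; // base case: a single "m1 vrgather.vv"
      size_t Half = V.size() / 2;
      std::vector<int> Lo = reverseVec({V.begin(), V.begin() + Half});
      std::vector<int> Hi = reverseVec({V.begin() + Half, V.end()});
      Hi.insert(Hi.end(), Lo.begin(), Lo.end());
      return Hi;
    }

    int main() {
      // Fixed-length case: 6 real elements inside a container of 8 (-1 = junk).
      std::vector<int> Container = {0, 1, 2, 3, 4, 5, -1, -1};
      std::vector<int> Rev = reverseVec(Container);

      // The real data now sits at the top of the container, so slide it down by
      // (container elements - fixed elements), like the vslidedown.vx.
      size_t Offset = Container.size() - 6;
      for (size_t I = 0; I < 6; ++I)
        std::printf("%d ", Rev[I + Offset]); // 5 4 3 2 1 0
      std::printf("\n");
      return 0;
    }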