Commit 487686b

[SDAG][RISCV] Don't promote VP_REDUCE_{FADD,FMUL} (#111000)
In https://reviews.llvm.org/D153848, promotion was added for a variety of f16 ops with zvfhmin, including VP reductions.

However I don't believe it's correct to promote f16 fadd or fmul reductions to f32 since we need to round the intermediate results.

Today if we lower @llvm.vp.reduce.fadd.nxv1f16 on RISC-V, we'll get two different results depending on whether we compiled with +zvfh or +zvfhmin, for example with a 3 element reduction:

    ; v9 = [0.1563, 5.97e-8, 0.00006104]

    ; zvfh
    vsetivli x0, 3, e16, m1, ta, ma
    vmv.v.i v8, 0
    vfredosum.vs v8, v9, v8
    vfmv.f.s fa0, v8
    ; fa0 = 0.1563

    ; zvfhmin
    vsetivli x0, 3, e16, m1, ta, ma
    vfwcvt.f.f.v v10, v9
    vsetivli x0, 3, e32, m1, ta, ma
    vmv.v.i v8, 0
    vfredosum.vs v8, v10, v8
    vfmv.f.s fa0, v8
    fcvt.h.s fa0, fa0
    ; fa0 = 0.1564

This same thing happens with reassociative reductions e.g. vfredusum.vs, and this also applies for bf16.

I couldn't find anything in the LangRef for reductions that suggests the excess precision is allowed. There may be something we can do in Clang with -fexcess-precision=fast, but I haven't looked into this yet.

I presume the same precision issue occurs with fmul, but not with fmin/fmax/fminimum/fmaximum.

I can't think of another way of lowering these other than scalarizing, and we can't scalarize scalable vectors, so this just removes the promotion and adjusts the cost model to return an invalid cost. (It looks like we also don't currently cost fmul reductions, so presumably they also have an invalid cost?)

I think this should be enough to stop the loop vectorizer or SLP from emitting these intrinsics.
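The discrepancy in the 3 element example can be reproduced with a small software model. The sketch below is not LLVM code: `roundToF16`, `reduceInF16` and `reducePromoted` are hypothetical helper names, and it assumes the decimal values in the commit message stand for the exact binary16 values 0.15625, 2^-24 and 2^-14. It compares an ordered sum that rounds every partial result to f16 against one that accumulates in f32 and rounds once at the end.

```cpp
#include <algorithm>
#include <cmath>
#include <vector>

// Round a double to the nearest IEEE binary16 value, ties to even.
// Assumption: overflow to infinity is not modelled; inputs stay in range.
double roundToF16(double x) {
  if (x == 0.0 || !std::isfinite(x))
    return x;
  int exp = 0;
  std::frexp(x, &exp);                      // |x| lies in [2^(exp-1), 2^exp)
  int e = std::max(exp, -13);               // clamp so subnormals use 2^-24 spacing
  double quantum = std::ldexp(1.0, e - 11); // spacing of f16 values around x
  return std::nearbyint(x / quantum) * quantum;
}

// Ordered sum where each partial sum is rounded to f16
// (models vfredosum.vs at e16 with a zero start value).
double reduceInF16(const std::vector<double> &v) {
  double acc = 0.0;
  for (double x : v)
    acc = roundToF16(acc + x);
  return acc;
}

// Ordered sum accumulated in f32 and narrowed to f16 once at the end
// (models the widen, reduce at e32, fcvt.h.s sequence).
double reducePromoted(const std::vector<double> &v) {
  float acc = 0.0f;
  for (double x : v)
    acc += static_cast<float>(x);
  return roundToF16(acc);
}
```

With the inputs `{0.15625, 2^-24, 2^-14}` the f16 path stays at 0.15625, because the last addition is an exact tie that rounds to even, while the promoted path keeps the extra 2^-24 and rounds up to 0.1563720703125, which matches the 0.1563 versus 0.1564 results quoted above.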
1 parent caa265e commit 487686b

6 files changed: +170 -726 lines changed

llvm/lib/CodeGen/SelectionDAG/LegalizeDAG.cpp

Lines changed: 0 additions & 3 deletions
@@ -5851,13 +5851,10 @@ void SelectionDAGLegalize::PromoteNode(SDNode *Node) {
                         DAG.getIntPtrConstant(0, dl, /*isTarget=*/true)));
     break;
   }
-  case ISD::VP_REDUCE_FADD:
-  case ISD::VP_REDUCE_FMUL:
   case ISD::VP_REDUCE_FMAX:
   case ISD::VP_REDUCE_FMIN:
   case ISD::VP_REDUCE_FMAXIMUM:
   case ISD::VP_REDUCE_FMINIMUM:
-  case ISD::VP_REDUCE_SEQ_FADD:
     Results.push_back(PromoteReduction(Node));
     break;
   }
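The commit message presumes that fmul has the same precision problem while fmin/fmax do not, which is why only the latter stay in this promotion case. A rough software model can illustrate that split. Everything here is a hypothetical sketch rather than LLVM code: the helper names are invented, inputs are assumed to be exact f16 values, and a start value of 1.0 is assumed for the product.

```cpp
#include <algorithm>
#include <cmath>
#include <limits>
#include <vector>

// Round a double to the nearest IEEE binary16 value, ties to even.
// Assumption: overflow to infinity is not modelled.
double roundToF16(double x) {
  if (x == 0.0 || !std::isfinite(x))
    return x;
  int exp = 0;
  std::frexp(x, &exp);
  int e = std::max(exp, -13);
  double quantum = std::ldexp(1.0, e - 11);
  return std::nearbyint(x / quantum) * quantum;
}

// Ordered product with every partial product rounded to f16.
double fmulInF16(const std::vector<double> &v) {
  double acc = 1.0;
  for (double x : v)
    acc = roundToF16(acc * x);
  return acc;
}

// Ordered product accumulated in f32, narrowed to f16 once at the end.
double fmulPromoted(const std::vector<double> &v) {
  float acc = 1.0f;
  for (double x : v)
    acc *= static_cast<float>(x);
  return roundToF16(acc);
}

// Maximum with every partial result rounded to f16.
double fmaxInF16(const std::vector<double> &v) {
  double acc = -std::numeric_limits<double>::infinity();
  for (double x : v)
    acc = roundToF16(std::max(acc, x));
  return acc;
}

// Maximum computed in f32, narrowed to f16 once at the end.
double fmaxPromoted(const std::vector<double> &v) {
  float acc = -std::numeric_limits<float>::infinity();
  for (double x : v)
    acc = std::max(acc, static_cast<float>(x));
  return roundToF16(acc);
}
```

In this model, a vector of 33 copies of 1 + 2^-10 gives 1057/1024 when each product is rounded to f16 and 1058/1024 when the product is kept in f32, so the two lowerings disagree by one f16 ulp. The maximum always selects one of the inputs, so widening and narrowing around it cannot change the result.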

llvm/lib/Target/RISCV/RISCVISelLowering.cpp

Lines changed: 0 additions & 2 deletions
@@ -957,8 +957,6 @@ RISCVTargetLowering::RISCVTargetLowering(const TargetMachine &TM,
         ISD::VP_FMUL,
         ISD::VP_FDIV,
         ISD::VP_FMA,
-        ISD::VP_REDUCE_FADD,
-        ISD::VP_REDUCE_SEQ_FADD,
         ISD::VP_REDUCE_FMIN,
         ISD::VP_REDUCE_FMAX,
         ISD::VP_SQRT,

llvm/lib/Target/RISCV/RISCVTargetTransformInfo.cpp

Lines changed: 5 additions & 0 deletions
@@ -1531,6 +1531,11 @@ RISCVTTIImpl::getArithmeticReductionCost(unsigned Opcode, VectorType *Ty,
     Opcodes = {RISCV::VMV_S_X, RISCV::VREDAND_VS, RISCV::VMV_X_S};
     break;
   case ISD::FADD:
+    // We can't promote f16/bf16 fadd reductions.
+    if ((LT.second.getVectorElementType() == MVT::f16 &&
+         !ST->hasVInstructionsF16()) ||
+        LT.second.getVectorElementType() == MVT::bf16)
+      return InstructionCost::getInvalid();
     SplitOp = RISCV::VFADD_VV;
     Opcodes = {RISCV::VFMV_S_F, RISCV::VFREDUSUM_VS, RISCV::VFMV_F_S};
     break;
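The new guard in the hunk above can be read as a small predicate over the element type and the subtarget. The following standalone sketch restates it outside LLVM; `ElemTy`, `SubtargetModel` and `canCostFAddReduction` are hypothetical stand-ins (not LLVM types), and it assumes `hasVInstructionsF16()` is true only when full zvfh is available, which is what the updated cost-model test suggests.

```cpp
// Hypothetical stand-in for the MVT element types that matter here.
enum class ElemTy { F16, BF16, F32, F64 };

// Hypothetical stand-in for the one RISCVSubtarget query the guard uses.
struct SubtargetModel {
  bool HasVInstructionsF16; // assumed: true with +zvfh, false with only +zvfhmin
};

// Returns false in exactly the cases where the patch returns
// InstructionCost::getInvalid() for an fadd reduction.
bool canCostFAddReduction(ElemTy Elt, const SubtargetModel &ST) {
  if ((Elt == ElemTy::F16 && !ST.HasVInstructionsF16) || Elt == ElemTy::BF16)
    return false; // would need promotion to f32, which changes the result
  return true;
}
```

Under these assumptions an f16 reduction is costed only with zvfh, a bf16 reduction is never costed, and f32/f64 reductions are unaffected, which lines up with the "Invalid cost" lines the zvfhmin RUN line now expects.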

llvm/test/Analysis/CostModel/RISCV/reduce-fadd.ll

Lines changed: 23 additions & 11 deletions
@@ -1,18 +1,30 @@
 ; NOTE: Assertions have been autogenerated by utils/update_analyze_test_checks.py
-; RUN: opt < %s -mtriple=riscv64 -mattr=+v,+zfh,+zvfh -passes="print<cost-model>" -cost-kind=throughput 2>&1 -disable-output | FileCheck %s --check-prefix=FP-REDUCE
+; RUN: opt < %s -mtriple=riscv64 -mattr=+v,+zfh,+zvfh -passes="print<cost-model>" -cost-kind=throughput 2>&1 -disable-output | FileCheck %s --check-prefixes=FP-REDUCE,FP-REDUCE-ZVFH
+; RUN: opt < %s -mtriple=riscv64 -mattr=+v,+zfh,+zvfhmin -passes="print<cost-model>" -cost-kind=throughput 2>&1 -disable-output | FileCheck %s --check-prefixes=FP-REDUCE,FP-REDUCE-ZVFHMIN
 ; RUN: opt < %s -mtriple=riscv64 -mattr=+v,+zfh,+zvfh -passes="print<cost-model>" -cost-kind=code-size 2>&1 -disable-output | FileCheck %s --check-prefix=SIZE
 
 define void @reduce_fadd_half() {
-; FP-REDUCE-LABEL: 'reduce_fadd_half'
-; FP-REDUCE-NEXT: Cost Model: Found an estimated cost of 2 for instruction: %V1 = call fast half @llvm.vector.reduce.fadd.v1f16(half 0xH0000, <1 x half> undef)
-; FP-REDUCE-NEXT: Cost Model: Found an estimated cost of 3 for instruction: %V2 = call fast half @llvm.vector.reduce.fadd.v2f16(half 0xH0000, <2 x half> undef)
-; FP-REDUCE-NEXT: Cost Model: Found an estimated cost of 4 for instruction: %V4 = call fast half @llvm.vector.reduce.fadd.v4f16(half 0xH0000, <4 x half> undef)
-; FP-REDUCE-NEXT: Cost Model: Found an estimated cost of 5 for instruction: %V8 = call fast half @llvm.vector.reduce.fadd.v8f16(half 0xH0000, <8 x half> undef)
-; FP-REDUCE-NEXT: Cost Model: Found an estimated cost of 6 for instruction: %V16 = call fast half @llvm.vector.reduce.fadd.v16f16(half 0xH0000, <16 x half> undef)
-; FP-REDUCE-NEXT: Cost Model: Found an estimated cost of 7 for instruction: %v32 = call fast half @llvm.vector.reduce.fadd.v32f16(half 0xH0000, <32 x half> undef)
-; FP-REDUCE-NEXT: Cost Model: Found an estimated cost of 8 for instruction: %V64 = call fast half @llvm.vector.reduce.fadd.v64f16(half 0xH0000, <64 x half> undef)
-; FP-REDUCE-NEXT: Cost Model: Found an estimated cost of 16 for instruction: %V128 = call fast half @llvm.vector.reduce.fadd.v128f16(half 0xH0000, <128 x half> undef)
-; FP-REDUCE-NEXT: Cost Model: Found an estimated cost of 0 for instruction: ret void
+; FP-REDUCE-ZVFH-LABEL: 'reduce_fadd_half'
+; FP-REDUCE-ZVFH-NEXT: Cost Model: Found an estimated cost of 2 for instruction: %V1 = call fast half @llvm.vector.reduce.fadd.v1f16(half 0xH0000, <1 x half> undef)
+; FP-REDUCE-ZVFH-NEXT: Cost Model: Found an estimated cost of 3 for instruction: %V2 = call fast half @llvm.vector.reduce.fadd.v2f16(half 0xH0000, <2 x half> undef)
+; FP-REDUCE-ZVFH-NEXT: Cost Model: Found an estimated cost of 4 for instruction: %V4 = call fast half @llvm.vector.reduce.fadd.v4f16(half 0xH0000, <4 x half> undef)
+; FP-REDUCE-ZVFH-NEXT: Cost Model: Found an estimated cost of 5 for instruction: %V8 = call fast half @llvm.vector.reduce.fadd.v8f16(half 0xH0000, <8 x half> undef)
+; FP-REDUCE-ZVFH-NEXT: Cost Model: Found an estimated cost of 6 for instruction: %V16 = call fast half @llvm.vector.reduce.fadd.v16f16(half 0xH0000, <16 x half> undef)
+; FP-REDUCE-ZVFH-NEXT: Cost Model: Found an estimated cost of 7 for instruction: %v32 = call fast half @llvm.vector.reduce.fadd.v32f16(half 0xH0000, <32 x half> undef)
+; FP-REDUCE-ZVFH-NEXT: Cost Model: Found an estimated cost of 8 for instruction: %V64 = call fast half @llvm.vector.reduce.fadd.v64f16(half 0xH0000, <64 x half> undef)
+; FP-REDUCE-ZVFH-NEXT: Cost Model: Found an estimated cost of 16 for instruction: %V128 = call fast half @llvm.vector.reduce.fadd.v128f16(half 0xH0000, <128 x half> undef)
+; FP-REDUCE-ZVFH-NEXT: Cost Model: Found an estimated cost of 0 for instruction: ret void
+;
+; FP-REDUCE-ZVFHMIN-LABEL: 'reduce_fadd_half'
+; FP-REDUCE-ZVFHMIN-NEXT: Cost Model: Invalid cost for instruction: %V1 = call fast half @llvm.vector.reduce.fadd.v1f16(half 0xH0000, <1 x half> undef)
+; FP-REDUCE-ZVFHMIN-NEXT: Cost Model: Invalid cost for instruction: %V2 = call fast half @llvm.vector.reduce.fadd.v2f16(half 0xH0000, <2 x half> undef)
+; FP-REDUCE-ZVFHMIN-NEXT: Cost Model: Invalid cost for instruction: %V4 = call fast half @llvm.vector.reduce.fadd.v4f16(half 0xH0000, <4 x half> undef)
+; FP-REDUCE-ZVFHMIN-NEXT: Cost Model: Invalid cost for instruction: %V8 = call fast half @llvm.vector.reduce.fadd.v8f16(half 0xH0000, <8 x half> undef)
+; FP-REDUCE-ZVFHMIN-NEXT: Cost Model: Invalid cost for instruction: %V16 = call fast half @llvm.vector.reduce.fadd.v16f16(half 0xH0000, <16 x half> undef)
+; FP-REDUCE-ZVFHMIN-NEXT: Cost Model: Invalid cost for instruction: %v32 = call fast half @llvm.vector.reduce.fadd.v32f16(half 0xH0000, <32 x half> undef)
+; FP-REDUCE-ZVFHMIN-NEXT: Cost Model: Invalid cost for instruction: %V64 = call fast half @llvm.vector.reduce.fadd.v64f16(half 0xH0000, <64 x half> undef)
+; FP-REDUCE-ZVFHMIN-NEXT: Cost Model: Invalid cost for instruction: %V128 = call fast half @llvm.vector.reduce.fadd.v128f16(half 0xH0000, <128 x half> undef)
+; FP-REDUCE-ZVFHMIN-NEXT: Cost Model: Found an estimated cost of 0 for instruction: ret void
 ;
 ; SIZE-LABEL: 'reduce_fadd_half'
 ; SIZE-NEXT: Cost Model: Found an estimated cost of 3 for instruction: %V1 = call fast half @llvm.vector.reduce.fadd.v1f16(half 0xH0000, <1 x half> undef)

llvm/test/CodeGen/RISCV/rvv/fixed-vectors-reduction-fp-vp.ll

Lines changed: 34 additions & 90 deletions
@@ -1,117 +1,61 @@
 ; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py
 ; RUN: llc -mtriple=riscv32 -mattr=+d,+zfh,+zvfh,+v -target-abi=ilp32d \
-; RUN: -verify-machineinstrs < %s | FileCheck %s --check-prefixes=CHECK,ZVFH
+; RUN: -verify-machineinstrs < %s | FileCheck %s
 ; RUN: llc -mtriple=riscv64 -mattr=+d,+zfh,+zvfh,+v -target-abi=lp64d \
-; RUN: -verify-machineinstrs < %s | FileCheck %s --check-prefixes=CHECK,ZVFH
-; RUN: llc -mtriple=riscv32 -mattr=+d,+zfh,+zvfhmin,+v -target-abi=ilp32d \
-; RUN: -verify-machineinstrs < %s | FileCheck %s --check-prefixes=CHECK,ZVFHMIN
-; RUN: llc -mtriple=riscv64 -mattr=+d,+zfh,+zvfhmin,+v -target-abi=lp64d \
-; RUN: -verify-machineinstrs < %s | FileCheck %s --check-prefixes=CHECK,ZVFHMIN
+; RUN: -verify-machineinstrs < %s | FileCheck %s
 
 declare half @llvm.vp.reduce.fadd.v2f16(half, <2 x half>, <2 x i1>, i32)
 
 define half @vpreduce_fadd_v2f16(half %s, <2 x half> %v, <2 x i1> %m, i32 zeroext %evl) {
-; ZVFH-LABEL: vpreduce_fadd_v2f16:
-; ZVFH: # %bb.0:
-; ZVFH-NEXT: vsetivli zero, 1, e16, m1, ta, ma
-; ZVFH-NEXT: vfmv.s.f v9, fa0
-; ZVFH-NEXT: vsetvli zero, a0, e16, mf4, ta, ma
-; ZVFH-NEXT: vfredusum.vs v9, v8, v9, v0.t
-; ZVFH-NEXT: vfmv.f.s fa0, v9
-; ZVFH-NEXT: ret
-;
-; ZVFHMIN-LABEL: vpreduce_fadd_v2f16:
-; ZVFHMIN: # %bb.0:
-; ZVFHMIN-NEXT: vsetivli zero, 2, e16, mf4, ta, ma
-; ZVFHMIN-NEXT: vfwcvt.f.f.v v9, v8
-; ZVFHMIN-NEXT: fcvt.s.h fa5, fa0
-; ZVFHMIN-NEXT: vsetvli zero, zero, e32, mf2, ta, ma
-; ZVFHMIN-NEXT: vfmv.s.f v8, fa5
-; ZVFHMIN-NEXT: vsetvli zero, a0, e32, mf2, ta, ma
-; ZVFHMIN-NEXT: vfredusum.vs v8, v9, v8, v0.t
-; ZVFHMIN-NEXT: vfmv.f.s fa5, v8
-; ZVFHMIN-NEXT: fcvt.h.s fa0, fa5
-; ZVFHMIN-NEXT: ret
+; CHECK-LABEL: vpreduce_fadd_v2f16:
+; CHECK: # %bb.0:
+; CHECK-NEXT: vsetivli zero, 1, e16, m1, ta, ma
+; CHECK-NEXT: vfmv.s.f v9, fa0
+; CHECK-NEXT: vsetvli zero, a0, e16, mf4, ta, ma
+; CHECK-NEXT: vfredusum.vs v9, v8, v9, v0.t
+; CHECK-NEXT: vfmv.f.s fa0, v9
+; CHECK-NEXT: ret
   %r = call reassoc half @llvm.vp.reduce.fadd.v2f16(half %s, <2 x half> %v, <2 x i1> %m, i32 %evl)
   ret half %r
 }
 
 define half @vpreduce_ord_fadd_v2f16(half %s, <2 x half> %v, <2 x i1> %m, i32 zeroext %evl) {
-; ZVFH-LABEL: vpreduce_ord_fadd_v2f16:
-; ZVFH: # %bb.0:
-; ZVFH-NEXT: vsetivli zero, 1, e16, m1, ta, ma
-; ZVFH-NEXT: vfmv.s.f v9, fa0
-; ZVFH-NEXT: vsetvli zero, a0, e16, mf4, ta, ma
-; ZVFH-NEXT: vfredosum.vs v9, v8, v9, v0.t
-; ZVFH-NEXT: vfmv.f.s fa0, v9
-; ZVFH-NEXT: ret
-;
-; ZVFHMIN-LABEL: vpreduce_ord_fadd_v2f16:
-; ZVFHMIN: # %bb.0:
-; ZVFHMIN-NEXT: vsetivli zero, 2, e16, mf4, ta, ma
-; ZVFHMIN-NEXT: vfwcvt.f.f.v v9, v8
-; ZVFHMIN-NEXT: fcvt.s.h fa5, fa0
-; ZVFHMIN-NEXT: vsetvli zero, zero, e32, mf2, ta, ma
-; ZVFHMIN-NEXT: vfmv.s.f v8, fa5
-; ZVFHMIN-NEXT: vsetvli zero, a0, e32, mf2, ta, ma
-; ZVFHMIN-NEXT: vfredosum.vs v8, v9, v8, v0.t
-; ZVFHMIN-NEXT: vfmv.f.s fa5, v8
-; ZVFHMIN-NEXT: fcvt.h.s fa0, fa5
-; ZVFHMIN-NEXT: ret
+; CHECK-LABEL: vpreduce_ord_fadd_v2f16:
+; CHECK: # %bb.0:
+; CHECK-NEXT: vsetivli zero, 1, e16, m1, ta, ma
+; CHECK-NEXT: vfmv.s.f v9, fa0
+; CHECK-NEXT: vsetvli zero, a0, e16, mf4, ta, ma
+; CHECK-NEXT: vfredosum.vs v9, v8, v9, v0.t
+; CHECK-NEXT: vfmv.f.s fa0, v9
+; CHECK-NEXT: ret
   %r = call half @llvm.vp.reduce.fadd.v2f16(half %s, <2 x half> %v, <2 x i1> %m, i32 %evl)
   ret half %r
 }
 
 declare half @llvm.vp.reduce.fadd.v4f16(half, <4 x half>, <4 x i1>, i32)
 
 define half @vpreduce_fadd_v4f16(half %s, <4 x half> %v, <4 x i1> %m, i32 zeroext %evl) {
-; ZVFH-LABEL: vpreduce_fadd_v4f16:
-; ZVFH: # %bb.0:
-; ZVFH-NEXT: vsetivli zero, 1, e16, m1, ta, ma
-; ZVFH-NEXT: vfmv.s.f v9, fa0
-; ZVFH-NEXT: vsetvli zero, a0, e16, mf2, ta, ma
-; ZVFH-NEXT: vfredusum.vs v9, v8, v9, v0.t
-; ZVFH-NEXT: vfmv.f.s fa0, v9
-; ZVFH-NEXT: ret
-;
-; ZVFHMIN-LABEL: vpreduce_fadd_v4f16:
-; ZVFHMIN: # %bb.0:
-; ZVFHMIN-NEXT: vsetivli zero, 4, e16, mf2, ta, ma
-; ZVFHMIN-NEXT: vfwcvt.f.f.v v9, v8
-; ZVFHMIN-NEXT: fcvt.s.h fa5, fa0
-; ZVFHMIN-NEXT: vsetvli zero, zero, e32, m1, ta, ma
-; ZVFHMIN-NEXT: vfmv.s.f v8, fa5
-; ZVFHMIN-NEXT: vsetvli zero, a0, e32, m1, ta, ma
-; ZVFHMIN-NEXT: vfredusum.vs v8, v9, v8, v0.t
-; ZVFHMIN-NEXT: vfmv.f.s fa5, v8
-; ZVFHMIN-NEXT: fcvt.h.s fa0, fa5
-; ZVFHMIN-NEXT: ret
+; CHECK-LABEL: vpreduce_fadd_v4f16:
+; CHECK: # %bb.0:
+; CHECK-NEXT: vsetivli zero, 1, e16, m1, ta, ma
+; CHECK-NEXT: vfmv.s.f v9, fa0
+; CHECK-NEXT: vsetvli zero, a0, e16, mf2, ta, ma
+; CHECK-NEXT: vfredusum.vs v9, v8, v9, v0.t
+; CHECK-NEXT: vfmv.f.s fa0, v9
+; CHECK-NEXT: ret
   %r = call reassoc half @llvm.vp.reduce.fadd.v4f16(half %s, <4 x half> %v, <4 x i1> %m, i32 %evl)
   ret half %r
 }
 
 define half @vpreduce_ord_fadd_v4f16(half %s, <4 x half> %v, <4 x i1> %m, i32 zeroext %evl) {
-; ZVFH-LABEL: vpreduce_ord_fadd_v4f16:
-; ZVFH: # %bb.0:
-; ZVFH-NEXT: vsetivli zero, 1, e16, m1, ta, ma
-; ZVFH-NEXT: vfmv.s.f v9, fa0
-; ZVFH-NEXT: vsetvli zero, a0, e16, mf2, ta, ma
-; ZVFH-NEXT: vfredosum.vs v9, v8, v9, v0.t
-; ZVFH-NEXT: vfmv.f.s fa0, v9
-; ZVFH-NEXT: ret
-;
-; ZVFHMIN-LABEL: vpreduce_ord_fadd_v4f16:
-; ZVFHMIN: # %bb.0:
-; ZVFHMIN-NEXT: vsetivli zero, 4, e16, mf2, ta, ma
-; ZVFHMIN-NEXT: vfwcvt.f.f.v v9, v8
-; ZVFHMIN-NEXT: fcvt.s.h fa5, fa0
-; ZVFHMIN-NEXT: vsetvli zero, zero, e32, m1, ta, ma
-; ZVFHMIN-NEXT: vfmv.s.f v8, fa5
-; ZVFHMIN-NEXT: vsetvli zero, a0, e32, m1, ta, ma
-; ZVFHMIN-NEXT: vfredosum.vs v8, v9, v8, v0.t
-; ZVFHMIN-NEXT: vfmv.f.s fa5, v8
-; ZVFHMIN-NEXT: fcvt.h.s fa0, fa5
-; ZVFHMIN-NEXT: ret
+; CHECK-LABEL: vpreduce_ord_fadd_v4f16:
+; CHECK: # %bb.0:
+; CHECK-NEXT: vsetivli zero, 1, e16, m1, ta, ma
+; CHECK-NEXT: vfmv.s.f v9, fa0
+; CHECK-NEXT: vsetvli zero, a0, e16, mf2, ta, ma
+; CHECK-NEXT: vfredosum.vs v9, v8, v9, v0.t
+; CHECK-NEXT: vfmv.f.s fa0, v9
+; CHECK-NEXT: ret
   %r = call half @llvm.vp.reduce.fadd.v4f16(half %s, <4 x half> %v, <4 x i1> %m, i32 %evl)
   ret half %r
 }
