
Commit 63e8a1b

[SLP] Enable reordering for non-power-of-two vectors (#106638)
This change tries to enable vector reordering during vectorization for non-power-of-two vectors. Specifically, my goal is to be able to vectorize reductions whose operands appear in other than identity order (e.g. a[1] + a[0] + a[2]). Our standard pass pipeline (Reassociation in particular) effectively canonicalizes towards this form, so for reduction vectorization to be widely applicable, we need this feature.

This change enables the use of a non-empty ReorderIndices structure - which is effectively required for out-of-order loads or gathers - while leaving the ReuseShuffleIndices mechanism unused and disabled. If I've understood the code structure correctly, the former is used to describe implicit shuffles required by the vectorization strategy (e.g. loading elements 0, 1, 3, 2 in the order 0, 1, 2, 3 and shuffling later), while the latter is used when trying to optimize explode/buildvectors (called gathers in this code).

I audited all the code enabled by this change, but can't claim to deeply understand most of it. I added a couple of bailouts in places which appeared difficult to audit and were optional optimizations. I've tried to do so in the least risky way I can, but am not completely confident in this change. Careful review appreciated.
1 parent bded3b3 commit 63e8a1b
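For illustration, the reordered-reduction pattern described above corresponds to source along the lines of the sketch below. This is a hypothetical C++ rendering of the dot_product_i32_reorder tests updated in this commit; the tests themselves are written directly in IR, and the names here are only illustrative.

// Hypothetical source-level sketch: the reduction operands appear in
// non-identity order (element 1 before element 0), the form Reassociation
// tends to produce.
int dot_product_i32_reorder(const int *a, const int *b) {
  // With reordering enabled, SLP can load a[0..2] and b[0..2] as <3 x i32>,
  // multiply, and emit a single @llvm.vector.reduce.add.v3i32 call despite
  // the perturbed order of the scalar adds.
  return (a[1] * b[1] + a[0] * b[0]) + a[2] * b[2];
}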

4 files changed: 89 additions & 71 deletions


llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp

Lines changed: 23 additions & 25 deletions
@@ -3388,6 +3388,10 @@ class BoUpSLP {
     TreeEntry *Last = VectorizableTree.back().get();
     Last->Idx = VectorizableTree.size() - 1;
     Last->State = EntryState;
+    // FIXME: Remove once support for ReuseShuffleIndices has been implemented
+    // for non-power-of-two vectors.
+    assert((has_single_bit(VL.size()) || ReuseShuffleIndices.empty()) &&
+           "Reshuffling scalars not yet supported for nodes with padding");
     Last->ReuseShuffleIndices.append(ReuseShuffleIndices.begin(),
                                      ReuseShuffleIndices.end());
     if (ReorderIndices.empty()) {
@@ -3452,11 +3456,8 @@ class BoUpSLP {
       MustGather.insert(VL.begin(), VL.end());
     }
 
-    if (UserTreeIdx.UserTE) {
+    if (UserTreeIdx.UserTE)
       Last->UserTreeIndices.push_back(UserTreeIdx);
-      assert((!Last->isNonPowOf2Vec() || Last->ReorderIndices.empty()) &&
-             "Reordering isn't implemented for non-power-of-2 nodes yet");
-    }
     return Last;
   }
 
@@ -4731,12 +4732,6 @@ BoUpSLP::LoadsState BoUpSLP::canVectorizeLoads(
   auto *VecTy = getWidenedType(ScalarTy, Sz);
   // Check the order of pointer operands or that all pointers are the same.
   bool IsSorted = sortPtrAccesses(PointerOps, ScalarTy, *DL, *SE, Order);
-  // FIXME: Reordering isn't implemented for non-power-of-2 nodes yet.
-  if (!Order.empty() && !has_single_bit(VL.size())) {
-    assert(VectorizeNonPowerOf2 && "non-power-of-2 number of loads only "
-                                   "supported with VectorizeNonPowerOf2");
-    return LoadsState::Gather;
-  }
 
   Align CommonAlignment = computeCommonAlignment<LoadInst>(VL);
   if (!IsSorted && Sz > MinProfitableStridedLoads && TTI->isTypeLegal(VecTy) &&
@@ -4824,6 +4819,12 @@ BoUpSLP::LoadsState BoUpSLP::canVectorizeLoads(
   // representation is better than just gather.
   auto CheckForShuffledLoads = [&, &TTI = *TTI](Align CommonAlignment,
                                                 bool ProfitableGatherPointers) {
+    // FIXME: The following code has not been updated for non-power-of-2
+    // vectors. The splitting logic here does not cover the original
+    // vector if the vector factor is not a power of two. FIXME
+    if (!has_single_bit(VL.size()))
+      return false;
+
     // Compare masked gather cost and loads + insert subvector costs.
     TTI::TargetCostKind CostKind = TTI::TCK_RecipThroughput;
     auto [ScalarGEPCost, VectorGEPCost] =
@@ -5195,13 +5196,13 @@ static bool areTwoInsertFromSameBuildVector(
 
 std::optional<BoUpSLP::OrdersType>
 BoUpSLP::getReorderingData(const TreeEntry &TE, bool TopToBottom) {
-  // FIXME: Vectorizing is not supported yet for non-power-of-2 ops.
-  if (TE.isNonPowOf2Vec())
-    return std::nullopt;
-
   // No need to reorder if need to shuffle reuses, still need to shuffle the
   // node.
   if (!TE.ReuseShuffleIndices.empty()) {
+    // FIXME: Support ReuseShuffleIndices for non-power-of-two vectors.
+    assert(!TE.isNonPowOf2Vec() &&
+           "Reshuffling scalars not yet supported for nodes with padding");
+
     if (isSplat(TE.Scalars))
       return std::nullopt;
     // Check if reuse shuffle indices can be improved by reordering.
@@ -5424,11 +5425,15 @@ BoUpSLP::getReorderingData(const TreeEntry &TE, bool TopToBottom) {
     }
     if (isSplat(TE.Scalars))
       return std::nullopt;
-    if (TE.Scalars.size() >= 4)
+    if (TE.Scalars.size() >= 3)
       if (std::optional<OrdersType> Order = findPartiallyOrderedLoads(TE))
         return Order;
-    if (std::optional<OrdersType> CurrentOrder = findReusedOrderedScalars(TE))
-      return CurrentOrder;
+
+    // FIXME: Remove the non-power-of-two check once findReusedOrderedScalars
+    // has been auditted for correctness with non-power-of-two vectors.
+    if (!TE.isNonPowOf2Vec())
+      if (std::optional<OrdersType> CurrentOrder = findReusedOrderedScalars(TE))
+        return CurrentOrder;
   }
   return std::nullopt;
 }
@@ -5580,7 +5585,7 @@ void BoUpSLP::reorderTopToBottom() {
 
   // Reorder the graph nodes according to their vectorization factor.
   for (unsigned VF = VectorizableTree.front()->getVectorFactor(); VF > 1;
-       VF /= 2) {
+       VF = bit_ceil(VF) / 2) {
     auto It = VFToOrderedEntries.find(VF);
     if (It == VFToOrderedEntries.end())
       continue;
@@ -5752,10 +5757,6 @@ bool BoUpSLP::canReorderOperands(
     TreeEntry *UserTE, SmallVectorImpl<std::pair<unsigned, TreeEntry *>> &Edges,
     ArrayRef<TreeEntry *> ReorderableGathers,
    SmallVectorImpl<TreeEntry *> &GatherOps) {
-  // FIXME: Reordering isn't implemented for non-power-of-2 nodes yet.
-  if (UserTE->isNonPowOf2Vec())
-    return false;
-
   for (unsigned I = 0, E = UserTE->getNumOperands(); I < E; ++I) {
     if (any_of(Edges, [I](const std::pair<unsigned, TreeEntry *> &OpData) {
           return OpData.first == I &&
@@ -5927,9 +5928,6 @@ void BoUpSLP::reorderBottomToTop(bool IgnoreReorder) {
       }
       auto Res = OrdersUses.insert(std::make_pair(OrdersType(), 0));
       const auto AllowsReordering = [&](const TreeEntry *TE) {
-        // FIXME: Reordering isn't implemented for non-power-of-2 nodes yet.
-        if (TE->isNonPowOf2Vec())
-          return false;
         if (!TE->ReorderIndices.empty() || !TE->ReuseShuffleIndices.empty() ||
             (TE->State == TreeEntry::Vectorize && TE->isAltShuffle()) ||
             (IgnoreReorder && TE->Idx == 0))
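One detail worth noting from the reorderTopToBottom hunk above: the loop over vector factors now steps with VF = bit_ceil(VF) / 2 instead of VF /= 2, so a root entry with a non-power-of-two vector factor (e.g. 3) still visits the power-of-two factors below it. A minimal standalone sketch of the difference, assuming std::bit_ceil as a stand-in for llvm::bit_ceil (same semantics for the values involved); this is illustration only, not part of the patch:

#include <bit>
#include <cstdio>

int main() {
  // Old update (VF /= 2): starting at VF = 3 the loop body runs only for
  // VF = 3, since 3 / 2 == 1 ends the loop and VF = 2 is skipped.
  // New update: bit_ceil(3) / 2 == 2, so the loop visits 3, then 2.
  for (unsigned VF = 3; VF > 1; VF = std::bit_ceil(VF) / 2)
    std::printf("visiting VF = %u\n", VF);
  return 0;
}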

llvm/test/Transforms/SLPVectorizer/AArch64/vec3-reorder-reshuffle.ll

Lines changed: 8 additions & 7 deletions
@@ -191,12 +191,12 @@ define i32 @reorder_indices_1(float %0) {
 ; NON-POW2-NEXT: entry:
 ; NON-POW2-NEXT: [[NOR1:%.*]] = alloca [0 x [3 x float]], i32 0, align 4
 ; NON-POW2-NEXT: [[TMP1:%.*]] = load <3 x float>, ptr [[NOR1]], align 4
-; NON-POW2-NEXT: [[TMP2:%.*]] = shufflevector <3 x float> [[TMP1]], <3 x float> poison, <3 x i32> <i32 1, i32 2, i32 0>
-; NON-POW2-NEXT: [[TMP3:%.*]] = fneg <3 x float> [[TMP2]]
+; NON-POW2-NEXT: [[TMP3:%.*]] = fneg <3 x float> [[TMP1]]
 ; NON-POW2-NEXT: [[TMP4:%.*]] = insertelement <3 x float> poison, float [[TMP0]], i32 0
 ; NON-POW2-NEXT: [[TMP5:%.*]] = shufflevector <3 x float> [[TMP4]], <3 x float> poison, <3 x i32> zeroinitializer
 ; NON-POW2-NEXT: [[TMP6:%.*]] = fmul <3 x float> [[TMP3]], [[TMP5]]
-; NON-POW2-NEXT: [[TMP7:%.*]] = call <3 x float> @llvm.fmuladd.v3f32(<3 x float> [[TMP1]], <3 x float> zeroinitializer, <3 x float> [[TMP6]])
+; NON-POW2-NEXT: [[TMP10:%.*]] = shufflevector <3 x float> [[TMP6]], <3 x float> poison, <3 x i32> <i32 1, i32 2, i32 0>
+; NON-POW2-NEXT: [[TMP7:%.*]] = call <3 x float> @llvm.fmuladd.v3f32(<3 x float> [[TMP1]], <3 x float> zeroinitializer, <3 x float> [[TMP10]])
 ; NON-POW2-NEXT: [[TMP8:%.*]] = call <3 x float> @llvm.fmuladd.v3f32(<3 x float> [[TMP5]], <3 x float> [[TMP7]], <3 x float> zeroinitializer)
 ; NON-POW2-NEXT: [[TMP9:%.*]] = fmul <3 x float> [[TMP8]], zeroinitializer
 ; NON-POW2-NEXT: store <3 x float> [[TMP9]], ptr [[NOR1]], align 4
@@ -263,7 +263,8 @@ define void @reorder_indices_2(ptr %spoint) {
 ; NON-POW2-NEXT: [[DSCO:%.*]] = getelementptr float, ptr [[SPOINT]], i64 0
 ; NON-POW2-NEXT: [[TMP0:%.*]] = call <3 x float> @llvm.fmuladd.v3f32(<3 x float> zeroinitializer, <3 x float> zeroinitializer, <3 x float> zeroinitializer)
 ; NON-POW2-NEXT: [[TMP1:%.*]] = fmul <3 x float> [[TMP0]], zeroinitializer
-; NON-POW2-NEXT: store <3 x float> [[TMP1]], ptr [[DSCO]], align 4
+; NON-POW2-NEXT: [[TMP2:%.*]] = shufflevector <3 x float> [[TMP1]], <3 x float> poison, <3 x i32> <i32 1, i32 2, i32 0>
+; NON-POW2-NEXT: store <3 x float> [[TMP2]], ptr [[DSCO]], align 4
 ; NON-POW2-NEXT: ret void
 ;
 ; POW2-ONLY-LABEL: define void @reorder_indices_2(
@@ -566,11 +567,11 @@ define void @can_reorder_vec3_op_with_padding(ptr %A, <3 x float> %in) {
 ; NON-POW2-LABEL: define void @can_reorder_vec3_op_with_padding(
 ; NON-POW2-SAME: ptr [[A:%.*]], <3 x float> [[IN:%.*]]) {
 ; NON-POW2-NEXT: entry:
-; NON-POW2-NEXT: [[TMP0:%.*]] = shufflevector <3 x float> [[IN]], <3 x float> poison, <3 x i32> <i32 1, i32 2, i32 0>
-; NON-POW2-NEXT: [[TMP1:%.*]] = fsub <3 x float> [[TMP0]], [[TMP0]]
+; NON-POW2-NEXT: [[TMP1:%.*]] = fsub <3 x float> [[IN]], [[IN]]
 ; NON-POW2-NEXT: [[TMP2:%.*]] = call <3 x float> @llvm.fmuladd.v3f32(<3 x float> [[TMP1]], <3 x float> <float 2.000000e+00, float 2.000000e+00, float 2.000000e+00>, <3 x float> <float 3.000000e+00, float 3.000000e+00, float 3.000000e+00>)
 ; NON-POW2-NEXT: [[TMP3:%.*]] = fmul <3 x float> [[TMP2]], <float 3.000000e+00, float 3.000000e+00, float 3.000000e+00>
-; NON-POW2-NEXT: store <3 x float> [[TMP3]], ptr [[A]], align 4
+; NON-POW2-NEXT: [[TMP4:%.*]] = shufflevector <3 x float> [[TMP3]], <3 x float> poison, <3 x i32> <i32 1, i32 2, i32 0>
+; NON-POW2-NEXT: store <3 x float> [[TMP4]], ptr [[A]], align 4
 ; NON-POW2-NEXT: ret void
 ;
 ; POW2-ONLY-LABEL: define void @can_reorder_vec3_op_with_padding(

llvm/test/Transforms/SLPVectorizer/RISCV/vec3-base.ll

Lines changed: 53 additions & 35 deletions
@@ -557,25 +557,34 @@ define i32 @dot_product_i32(ptr %a, ptr %b) {
 ; Same as above, except the reduction order has been perturbed. This
 ; is checking for our ability to reorder.
 define i32 @dot_product_i32_reorder(ptr %a, ptr %b) {
-; CHECK-LABEL: @dot_product_i32_reorder(
-; CHECK-NEXT: [[GEP_A_0:%.*]] = getelementptr inbounds i32, ptr [[A:%.*]], i32 0
-; CHECK-NEXT: [[L_A_0:%.*]] = load i32, ptr [[GEP_A_0]], align 4
-; CHECK-NEXT: [[GEP_A_1:%.*]] = getelementptr inbounds i32, ptr [[A]], i32 1
-; CHECK-NEXT: [[L_A_1:%.*]] = load i32, ptr [[GEP_A_1]], align 4
-; CHECK-NEXT: [[GEP_A_2:%.*]] = getelementptr inbounds i32, ptr [[A]], i32 2
-; CHECK-NEXT: [[L_A_2:%.*]] = load i32, ptr [[GEP_A_2]], align 4
-; CHECK-NEXT: [[GEP_B_0:%.*]] = getelementptr inbounds i32, ptr [[B:%.*]], i32 0
-; CHECK-NEXT: [[L_B_0:%.*]] = load i32, ptr [[GEP_B_0]], align 4
-; CHECK-NEXT: [[GEP_B_1:%.*]] = getelementptr inbounds i32, ptr [[B]], i32 1
-; CHECK-NEXT: [[L_B_1:%.*]] = load i32, ptr [[GEP_B_1]], align 4
-; CHECK-NEXT: [[GEP_B_2:%.*]] = getelementptr inbounds i32, ptr [[B]], i32 2
-; CHECK-NEXT: [[L_B_2:%.*]] = load i32, ptr [[GEP_B_2]], align 4
-; CHECK-NEXT: [[MUL_0:%.*]] = mul nsw i32 [[L_A_0]], [[L_B_0]]
-; CHECK-NEXT: [[MUL_1:%.*]] = mul nsw i32 [[L_A_1]], [[L_B_1]]
-; CHECK-NEXT: [[MUL_2:%.*]] = mul nsw i32 [[L_A_2]], [[L_B_2]]
-; CHECK-NEXT: [[ADD_0:%.*]] = add i32 [[MUL_1]], [[MUL_0]]
-; CHECK-NEXT: [[ADD_1:%.*]] = add i32 [[ADD_0]], [[MUL_2]]
-; CHECK-NEXT: ret i32 [[ADD_1]]
+; NON-POW2-LABEL: @dot_product_i32_reorder(
+; NON-POW2-NEXT: [[GEP_A_0:%.*]] = getelementptr inbounds i32, ptr [[A:%.*]], i32 0
+; NON-POW2-NEXT: [[GEP_B_0:%.*]] = getelementptr inbounds i32, ptr [[B:%.*]], i32 0
+; NON-POW2-NEXT: [[TMP1:%.*]] = load <3 x i32>, ptr [[GEP_A_0]], align 4
+; NON-POW2-NEXT: [[TMP2:%.*]] = load <3 x i32>, ptr [[GEP_B_0]], align 4
+; NON-POW2-NEXT: [[TMP3:%.*]] = mul nsw <3 x i32> [[TMP1]], [[TMP2]]
+; NON-POW2-NEXT: [[TMP4:%.*]] = call i32 @llvm.vector.reduce.add.v3i32(<3 x i32> [[TMP3]])
+; NON-POW2-NEXT: ret i32 [[TMP4]]
+;
+; POW2-ONLY-LABEL: @dot_product_i32_reorder(
+; POW2-ONLY-NEXT: [[GEP_A_0:%.*]] = getelementptr inbounds i32, ptr [[A:%.*]], i32 0
+; POW2-ONLY-NEXT: [[L_A_0:%.*]] = load i32, ptr [[GEP_A_0]], align 4
+; POW2-ONLY-NEXT: [[GEP_A_1:%.*]] = getelementptr inbounds i32, ptr [[A]], i32 1
+; POW2-ONLY-NEXT: [[L_A_1:%.*]] = load i32, ptr [[GEP_A_1]], align 4
+; POW2-ONLY-NEXT: [[GEP_A_2:%.*]] = getelementptr inbounds i32, ptr [[A]], i32 2
+; POW2-ONLY-NEXT: [[L_A_2:%.*]] = load i32, ptr [[GEP_A_2]], align 4
+; POW2-ONLY-NEXT: [[GEP_B_0:%.*]] = getelementptr inbounds i32, ptr [[B:%.*]], i32 0
+; POW2-ONLY-NEXT: [[L_B_0:%.*]] = load i32, ptr [[GEP_B_0]], align 4
+; POW2-ONLY-NEXT: [[GEP_B_1:%.*]] = getelementptr inbounds i32, ptr [[B]], i32 1
+; POW2-ONLY-NEXT: [[L_B_1:%.*]] = load i32, ptr [[GEP_B_1]], align 4
+; POW2-ONLY-NEXT: [[GEP_B_2:%.*]] = getelementptr inbounds i32, ptr [[B]], i32 2
+; POW2-ONLY-NEXT: [[L_B_2:%.*]] = load i32, ptr [[GEP_B_2]], align 4
+; POW2-ONLY-NEXT: [[MUL_0:%.*]] = mul nsw i32 [[L_A_0]], [[L_B_0]]
+; POW2-ONLY-NEXT: [[MUL_1:%.*]] = mul nsw i32 [[L_A_1]], [[L_B_1]]
+; POW2-ONLY-NEXT: [[MUL_2:%.*]] = mul nsw i32 [[L_A_2]], [[L_B_2]]
+; POW2-ONLY-NEXT: [[ADD_0:%.*]] = add i32 [[MUL_1]], [[MUL_0]]
+; POW2-ONLY-NEXT: [[ADD_1:%.*]] = add i32 [[ADD_0]], [[MUL_2]]
+; POW2-ONLY-NEXT: ret i32 [[ADD_1]]
 ;
 %gep.a.0 = getelementptr inbounds i32, ptr %a, i32 0
 %l.a.0 = load i32, ptr %gep.a.0, align 4
@@ -653,22 +662,31 @@ define float @dot_product_fp32(ptr %a, ptr %b) {
 ; Same as above, except the reduction order has been perturbed. This
 ; is checking for our ability to reorder.
 define float @dot_product_fp32_reorder(ptr %a, ptr %b) {
-; CHECK-LABEL: @dot_product_fp32_reorder(
-; CHECK-NEXT: [[GEP_A_0:%.*]] = getelementptr inbounds float, ptr [[A:%.*]], i32 0
-; CHECK-NEXT: [[GEP_A_2:%.*]] = getelementptr inbounds float, ptr [[A]], i32 2
-; CHECK-NEXT: [[L_A_2:%.*]] = load float, ptr [[GEP_A_2]], align 4
-; CHECK-NEXT: [[GEP_B_0:%.*]] = getelementptr inbounds float, ptr [[B:%.*]], i32 0
-; CHECK-NEXT: [[GEP_B_2:%.*]] = getelementptr inbounds float, ptr [[B]], i32 2
-; CHECK-NEXT: [[L_B_2:%.*]] = load float, ptr [[GEP_B_2]], align 4
-; CHECK-NEXT: [[TMP1:%.*]] = load <2 x float>, ptr [[GEP_A_0]], align 4
-; CHECK-NEXT: [[TMP2:%.*]] = load <2 x float>, ptr [[GEP_B_0]], align 4
-; CHECK-NEXT: [[TMP3:%.*]] = fmul fast <2 x float> [[TMP1]], [[TMP2]]
-; CHECK-NEXT: [[MUL_2:%.*]] = fmul fast float [[L_A_2]], [[L_B_2]]
-; CHECK-NEXT: [[TMP4:%.*]] = extractelement <2 x float> [[TMP3]], i32 0
-; CHECK-NEXT: [[TMP5:%.*]] = extractelement <2 x float> [[TMP3]], i32 1
-; CHECK-NEXT: [[ADD_0:%.*]] = fadd fast float [[TMP5]], [[TMP4]]
-; CHECK-NEXT: [[ADD_1:%.*]] = fadd fast float [[ADD_0]], [[MUL_2]]
-; CHECK-NEXT: ret float [[ADD_1]]
+; NON-POW2-LABEL: @dot_product_fp32_reorder(
+; NON-POW2-NEXT: [[GEP_A_0:%.*]] = getelementptr inbounds float, ptr [[A:%.*]], i32 0
+; NON-POW2-NEXT: [[GEP_B_0:%.*]] = getelementptr inbounds float, ptr [[B:%.*]], i32 0
+; NON-POW2-NEXT: [[TMP1:%.*]] = load <3 x float>, ptr [[GEP_A_0]], align 4
+; NON-POW2-NEXT: [[TMP2:%.*]] = load <3 x float>, ptr [[GEP_B_0]], align 4
+; NON-POW2-NEXT: [[TMP3:%.*]] = fmul fast <3 x float> [[TMP1]], [[TMP2]]
+; NON-POW2-NEXT: [[TMP4:%.*]] = call fast float @llvm.vector.reduce.fadd.v3f32(float 0.000000e+00, <3 x float> [[TMP3]])
+; NON-POW2-NEXT: ret float [[TMP4]]
+;
+; POW2-ONLY-LABEL: @dot_product_fp32_reorder(
+; POW2-ONLY-NEXT: [[GEP_A_0:%.*]] = getelementptr inbounds float, ptr [[A:%.*]], i32 0
+; POW2-ONLY-NEXT: [[GEP_A_2:%.*]] = getelementptr inbounds float, ptr [[A]], i32 2
+; POW2-ONLY-NEXT: [[L_A_2:%.*]] = load float, ptr [[GEP_A_2]], align 4
+; POW2-ONLY-NEXT: [[GEP_B_0:%.*]] = getelementptr inbounds float, ptr [[B:%.*]], i32 0
+; POW2-ONLY-NEXT: [[GEP_B_2:%.*]] = getelementptr inbounds float, ptr [[B]], i32 2
+; POW2-ONLY-NEXT: [[L_B_2:%.*]] = load float, ptr [[GEP_B_2]], align 4
+; POW2-ONLY-NEXT: [[TMP1:%.*]] = load <2 x float>, ptr [[GEP_A_0]], align 4
+; POW2-ONLY-NEXT: [[TMP2:%.*]] = load <2 x float>, ptr [[GEP_B_0]], align 4
+; POW2-ONLY-NEXT: [[TMP3:%.*]] = fmul fast <2 x float> [[TMP1]], [[TMP2]]
+; POW2-ONLY-NEXT: [[MUL_2:%.*]] = fmul fast float [[L_A_2]], [[L_B_2]]
+; POW2-ONLY-NEXT: [[TMP4:%.*]] = extractelement <2 x float> [[TMP3]], i32 0
+; POW2-ONLY-NEXT: [[TMP5:%.*]] = extractelement <2 x float> [[TMP3]], i32 1
+; POW2-ONLY-NEXT: [[ADD_0:%.*]] = fadd fast float [[TMP5]], [[TMP4]]
+; POW2-ONLY-NEXT: [[ADD_1:%.*]] = fadd fast float [[ADD_0]], [[MUL_2]]
+; POW2-ONLY-NEXT: ret float [[ADD_1]]
 ;
 %gep.a.0 = getelementptr inbounds float, ptr %a, i32 0
 %l.a.0 = load float, ptr %gep.a.0, align 4

llvm/test/Transforms/SLPVectorizer/X86/vec3-reorder-reshuffle.ll

Lines changed: 5 additions & 4 deletions
@@ -190,12 +190,12 @@ define i32 @reorder_indices_1(float %0) {
 ; NON-POW2-NEXT: entry:
 ; NON-POW2-NEXT: [[NOR1:%.*]] = alloca [0 x [3 x float]], i32 0, align 4
 ; NON-POW2-NEXT: [[TMP1:%.*]] = load <3 x float>, ptr [[NOR1]], align 4
-; NON-POW2-NEXT: [[TMP2:%.*]] = shufflevector <3 x float> [[TMP1]], <3 x float> poison, <3 x i32> <i32 1, i32 2, i32 0>
-; NON-POW2-NEXT: [[TMP3:%.*]] = fneg <3 x float> [[TMP2]]
+; NON-POW2-NEXT: [[TMP3:%.*]] = fneg <3 x float> [[TMP1]]
 ; NON-POW2-NEXT: [[TMP4:%.*]] = insertelement <3 x float> poison, float [[TMP0]], i32 0
 ; NON-POW2-NEXT: [[TMP5:%.*]] = shufflevector <3 x float> [[TMP4]], <3 x float> poison, <3 x i32> zeroinitializer
 ; NON-POW2-NEXT: [[TMP6:%.*]] = fmul <3 x float> [[TMP3]], [[TMP5]]
-; NON-POW2-NEXT: [[TMP7:%.*]] = call <3 x float> @llvm.fmuladd.v3f32(<3 x float> [[TMP1]], <3 x float> zeroinitializer, <3 x float> [[TMP6]])
+; NON-POW2-NEXT: [[TMP10:%.*]] = shufflevector <3 x float> [[TMP6]], <3 x float> poison, <3 x i32> <i32 1, i32 2, i32 0>
+; NON-POW2-NEXT: [[TMP7:%.*]] = call <3 x float> @llvm.fmuladd.v3f32(<3 x float> [[TMP1]], <3 x float> zeroinitializer, <3 x float> [[TMP10]])
 ; NON-POW2-NEXT: [[TMP8:%.*]] = call <3 x float> @llvm.fmuladd.v3f32(<3 x float> [[TMP5]], <3 x float> [[TMP7]], <3 x float> zeroinitializer)
 ; NON-POW2-NEXT: [[TMP9:%.*]] = fmul <3 x float> [[TMP8]], zeroinitializer
 ; NON-POW2-NEXT: store <3 x float> [[TMP9]], ptr [[NOR1]], align 4
@@ -262,7 +262,8 @@ define void @reorder_indices_2(ptr %spoint) {
 ; NON-POW2-NEXT: [[DSCO:%.*]] = getelementptr float, ptr [[SPOINT]], i64 0
 ; NON-POW2-NEXT: [[TMP0:%.*]] = call <3 x float> @llvm.fmuladd.v3f32(<3 x float> zeroinitializer, <3 x float> zeroinitializer, <3 x float> zeroinitializer)
 ; NON-POW2-NEXT: [[TMP1:%.*]] = fmul <3 x float> [[TMP0]], zeroinitializer
-; NON-POW2-NEXT: store <3 x float> [[TMP1]], ptr [[DSCO]], align 4
+; NON-POW2-NEXT: [[TMP2:%.*]] = shufflevector <3 x float> [[TMP1]], <3 x float> poison, <3 x i32> <i32 1, i32 2, i32 0>
+; NON-POW2-NEXT: store <3 x float> [[TMP2]], ptr [[DSCO]], align 4
 ; NON-POW2-NEXT: ret void
 ;
 ; POW2-ONLY-LABEL: define void @reorder_indices_2(
