This repository was archived by the owner on Mar 28, 2020. It is now read-only.

Commit dfdada0

Author: Hal Finkel (committed)
[LoopVectorize] Don't vectorize loops when everything will be scalarized
This change prevents the loop vectorizer from vectorizing when all of the vector types it generates will be scalarized. I've run into this problem on the PPC's QPX vector ISA, which only holds floating-point vector types. The loop vectorizer will, however, happily vectorize loops with purely integer computation. Here's an example:

  LV: The Smallest and Widest types: 32 / 32 bits.
  LV: The Widest register is: 256 bits.
  LV: Found an estimated cost of 0 for VF 1 For instruction: %indvars.iv25 = phi i64 [ 0, %entry ], [ %indvars.iv.next26, %for.body ]
  LV: Found an estimated cost of 0 for VF 1 For instruction: %arrayidx = getelementptr inbounds [1600 x i32], [1600 x i32]* %a, i64 0, i64 %indvars.iv25
  LV: Found an estimated cost of 0 for VF 1 For instruction: %2 = trunc i64 %indvars.iv25 to i32
  LV: Found an estimated cost of 1 for VF 1 For instruction: store i32 %2, i32* %arrayidx, align 4
  LV: Found an estimated cost of 1 for VF 1 For instruction: %indvars.iv.next26 = add nuw nsw i64 %indvars.iv25, 1
  LV: Found an estimated cost of 1 for VF 1 For instruction: %exitcond27 = icmp eq i64 %indvars.iv.next26, 1600
  LV: Found an estimated cost of 0 for VF 1 For instruction: br i1 %exitcond27, label %for.cond.cleanup, label %for.body
  LV: Scalar loop costs: 3.
  LV: Found an estimated cost of 0 for VF 2 For instruction: %indvars.iv25 = phi i64 [ 0, %entry ], [ %indvars.iv.next26, %for.body ]
  LV: Found an estimated cost of 0 for VF 2 For instruction: %arrayidx = getelementptr inbounds [1600 x i32], [1600 x i32]* %a, i64 0, i64 %indvars.iv25
  LV: Found an estimated cost of 0 for VF 2 For instruction: %2 = trunc i64 %indvars.iv25 to i32
  LV: Found an estimated cost of 2 for VF 2 For instruction: store i32 %2, i32* %arrayidx, align 4
  LV: Found an estimated cost of 1 for VF 2 For instruction: %indvars.iv.next26 = add nuw nsw i64 %indvars.iv25, 1
  LV: Found an estimated cost of 1 for VF 2 For instruction: %exitcond27 = icmp eq i64 %indvars.iv.next26, 1600
  LV: Found an estimated cost of 0 for VF 2 For instruction: br i1 %exitcond27, label %for.cond.cleanup, label %for.body
  LV: Vector loop of width 2 costs: 2.
  LV: Found an estimated cost of 0 for VF 4 For instruction: %indvars.iv25 = phi i64 [ 0, %entry ], [ %indvars.iv.next26, %for.body ]
  LV: Found an estimated cost of 0 for VF 4 For instruction: %arrayidx = getelementptr inbounds [1600 x i32], [1600 x i32]* %a, i64 0, i64 %indvars.iv25
  LV: Found an estimated cost of 0 for VF 4 For instruction: %2 = trunc i64 %indvars.iv25 to i32
  LV: Found an estimated cost of 4 for VF 4 For instruction: store i32 %2, i32* %arrayidx, align 4
  LV: Found an estimated cost of 1 for VF 4 For instruction: %indvars.iv.next26 = add nuw nsw i64 %indvars.iv25, 1
  LV: Found an estimated cost of 1 for VF 4 For instruction: %exitcond27 = icmp eq i64 %indvars.iv.next26, 1600
  LV: Found an estimated cost of 0 for VF 4 For instruction: br i1 %exitcond27, label %for.cond.cleanup, label %for.body
  LV: Vector loop of width 4 costs: 1.
  ...
  LV: Selecting VF: 8.
  LV: The target has 32 registers
  LV(REG): Calculating max register usage:
  LV(REG): At #0 Interval # 0
  LV(REG): At #1 Interval # 1
  LV(REG): At #2 Interval # 2
  LV(REG): At #4 Interval # 1
  LV(REG): At #5 Interval # 1
  LV(REG): VF = 8

The problem is that the cost model here is not wrong, exactly. Since all of these operations are scalarized, their cost (aside from the uniform ones) is indeed VF*(scalar cost), just as the model suggests. In fact, the larger the VF picked, the lower the relative overhead from the loop itself (and the induction-variable update and check), so in a sense, picking the largest VF here is the right thing to do.

The problem is that vectorizing like this, where all of the vectors will be scalarized in the backend, isn't really vectorizing, but rather interleaving. By itself, this would be okay, but the vectorizer itself also interleaves, and that's where the problem manifests itself: there aren't actually enough scalar registers to support the normal interleave factor multiplied by a factor of VF (8 in this example). In other words, the problem is that our register-pressure heuristic does not account for scalarization.

While we might want to improve our register-pressure heuristic, I don't think this is the right motivating case for that work. Here we have a more basic problem: the job of the vectorizer is to vectorize things (interleaving aside), and if the IR it generates won't produce any actual vector code, then something is wrong. Thus, if every type looks like it will be scalarized (i.e. will be split into VF or more parts), then don't consider that VF.

This is not a problem specific to PPC/QPX, however. The problem comes up under SSE on x86 too, and as such, this change fixes PR26837 too. I've added Sanjay's reduced test case from PR26837 to this commit.

Differential Revision: http://reviews.llvm.org/D18537

git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@264904 91177308-0d34-0410-b5e6-96231b3b80d8
1 parent 3744401 commit dfdada0

File tree

3 files changed: +150 additions, -18 deletions


lib/Transforms/Vectorize/LoopVectorize.cpp

Lines changed: 49 additions & 18 deletions
@@ -1532,15 +1532,26 @@ class LoopVectorizationCostModel {
   calculateRegisterUsage(const SmallVector<unsigned, 8> &VFs);
 
 private:
+  /// The vectorization cost is a combination of the cost itself and a boolean
+  /// indicating whether any of the contributing operations will actually operate on
+  /// vector values after type legalization in the backend. If this latter value is
+  /// false, then all operations will be scalarized (i.e. no vectorization has
+  /// actually taken place).
+  typedef std::pair<unsigned, bool> VectorizationCostTy;
+
   /// Returns the expected execution cost. The unit of the cost does
   /// not matter because we use the 'cost' units to compare different
   /// vector widths. The cost that is returned is *not* normalized by
   /// the factor width.
-  unsigned expectedCost(unsigned VF);
+  VectorizationCostTy expectedCost(unsigned VF);
 
   /// Returns the execution time cost of an instruction for a given vector
   /// width. Vector width of one means scalar.
-  unsigned getInstructionCost(Instruction *I, unsigned VF);
+  VectorizationCostTy getInstructionCost(Instruction *I, unsigned VF);
+
+  /// The cost-computation logic from getInstructionCost which provides
+  /// the vector type as an output parameter.
+  unsigned getInstructionCost(Instruction *I, unsigned VF, Type *&VectorTy);
 
   /// Returns whether the instruction is a load or store and will be a emitted
   /// as a vector operation.
@@ -5145,7 +5156,7 @@ LoopVectorizationCostModel::selectVectorizationFactor(bool OptForSize) {
     return Factor;
   }
 
-  float Cost = expectedCost(1);
+  float Cost = expectedCost(1).first;
 #ifndef NDEBUG
   const float ScalarCost = Cost;
 #endif /* NDEBUG */
@@ -5156,16 +5167,22 @@ LoopVectorizationCostModel::selectVectorizationFactor(bool OptForSize) {
   // Ignore scalar width, because the user explicitly wants vectorization.
   if (ForceVectorization && VF > 1) {
     Width = 2;
-    Cost = expectedCost(Width) / (float)Width;
+    Cost = expectedCost(Width).first / (float)Width;
   }
 
   for (unsigned i=2; i <= VF; i*=2) {
     // Notice that the vector loop needs to be executed less times, so
     // we need to divide the cost of the vector loops by the width of
     // the vector elements.
-    float VectorCost = expectedCost(i) / (float)i;
+    VectorizationCostTy C = expectedCost(i);
+    float VectorCost = C.first / (float)i;
     DEBUG(dbgs() << "LV: Vector loop of width " << i << " costs: " <<
           (int)VectorCost << ".\n");
+    if (!C.second && !ForceVectorization) {
+      DEBUG(dbgs() << "LV: Not considering vector loop of width " << i <<
+            " because it will not generate any vector instructions.\n");
+      continue;
+    }
     if (VectorCost < Cost) {
       Cost = VectorCost;
       Width = i;
@@ -5313,7 +5330,7 @@ unsigned LoopVectorizationCostModel::selectInterleaveCount(bool OptForSize,
   // If we did not calculate the cost for VF (because the user selected the VF)
   // then we calculate the cost of VF here.
   if (LoopCost == 0)
-    LoopCost = expectedCost(VF);
+    LoopCost = expectedCost(VF).first;
 
   // Clamp the calculated IC to be between the 1 and the max interleave count
   // that the target allows.
@@ -5540,13 +5557,14 @@ LoopVectorizationCostModel::calculateRegisterUsage(
   return RUs;
 }
 
-unsigned LoopVectorizationCostModel::expectedCost(unsigned VF) {
-  unsigned Cost = 0;
+LoopVectorizationCostModel::VectorizationCostTy
+LoopVectorizationCostModel::expectedCost(unsigned VF) {
+  VectorizationCostTy Cost;
 
   // For each block.
   for (Loop::block_iterator bb = TheLoop->block_begin(),
        be = TheLoop->block_end(); bb != be; ++bb) {
-    unsigned BlockCost = 0;
+    VectorizationCostTy BlockCost;
     BasicBlock *BB = *bb;
 
     // For each instruction in the old loop.
@@ -5559,24 +5577,26 @@ unsigned LoopVectorizationCostModel::expectedCost(unsigned VF) {
       if (ValuesToIgnore.count(&*it))
         continue;
 
-      unsigned C = getInstructionCost(&*it, VF);
+      VectorizationCostTy C = getInstructionCost(&*it, VF);
 
       // Check if we should override the cost.
       if (ForceTargetInstructionCost.getNumOccurrences() > 0)
-        C = ForceTargetInstructionCost;
+        C.first = ForceTargetInstructionCost;
 
-      BlockCost += C;
-      DEBUG(dbgs() << "LV: Found an estimated cost of " << C << " for VF " <<
-            VF << " For instruction: " << *it << '\n');
+      BlockCost.first += C.first;
+      BlockCost.second |= C.second;
+      DEBUG(dbgs() << "LV: Found an estimated cost of " << C.first <<
+            " for VF " << VF << " For instruction: " << *it << '\n');
     }
 
    // We assume that if-converted blocks have a 50% chance of being executed.
    // When the code is scalar then some of the blocks are avoided due to CF.
    // When the code is vectorized we execute all code paths.
    if (VF == 1 && Legal->blockNeedsPredication(*bb))
-      BlockCost /= 2;
+      BlockCost.first /= 2;
 
-    Cost += BlockCost;
+    Cost.first += BlockCost.first;
+    Cost.second |= BlockCost.second;
  }
 
  return Cost;
@@ -5653,17 +5673,28 @@ static bool isStrideMul(Instruction *I, LoopVectorizationLegality *Legal) {
          Legal->hasStride(I->getOperand(1));
 }
 
-unsigned
+LoopVectorizationCostModel::VectorizationCostTy
 LoopVectorizationCostModel::getInstructionCost(Instruction *I, unsigned VF) {
   // If we know that this instruction will remain uniform, check the cost of
   // the scalar version.
   if (Legal->isUniformAfterVectorization(I))
     VF = 1;
 
+  Type *VectorTy;
+  unsigned C = getInstructionCost(I, VF, VectorTy);
+
+  bool TypeNotScalarized = VF > 1 && !VectorTy->isVoidTy() &&
+      TTI.getNumberOfParts(VectorTy) < VF;
+  return VectorizationCostTy(C, TypeNotScalarized);
+}
+
+unsigned
+LoopVectorizationCostModel::getInstructionCost(Instruction *I, unsigned VF,
+                                               Type *&VectorTy) {
   Type *RetTy = I->getType();
   if (VF > 1 && MinBWs.count(I))
     RetTy = IntegerType::get(RetTy->getContext(), MinBWs[I]);
-  Type *VectorTy = ToVectorTy(RetTy, VF);
+  VectorTy = ToVectorTy(RetTy, VF);
 
   // TODO: We need to estimate the cost of intrinsic calls.
   switch (I->getOpcode()) {
Lines changed: 62 additions & 0 deletions
@@ -0,0 +1,62 @@ (new file)

; RUN: opt -S -loop-vectorize < %s | FileCheck %s
target datalayout = "E-m:e-i64:64-n32:64"
target triple = "powerpc64-bgq-linux"

; Function Attrs: nounwind
define zeroext i32 @test() #0 {
; CHECK-LABEL: @test
; CHECK-NOT: x i32>

entry:
  %a = alloca [1600 x i32], align 4
  %c = alloca [1600 x i32], align 4
  %0 = bitcast [1600 x i32]* %a to i8*
  call void @llvm.lifetime.start(i64 6400, i8* %0) #3
  br label %for.body

for.cond.cleanup:                                 ; preds = %for.body
  %1 = bitcast [1600 x i32]* %c to i8*
  call void @llvm.lifetime.start(i64 6400, i8* %1) #3
  %arraydecay = getelementptr inbounds [1600 x i32], [1600 x i32]* %a, i64 0, i64 0
  %arraydecay1 = getelementptr inbounds [1600 x i32], [1600 x i32]* %c, i64 0, i64 0
  %call = call signext i32 @bar(i32* %arraydecay, i32* %arraydecay1) #3
  br label %for.body6

for.body:                                         ; preds = %for.body, %entry
  %indvars.iv25 = phi i64 [ 0, %entry ], [ %indvars.iv.next26, %for.body ]
  %arrayidx = getelementptr inbounds [1600 x i32], [1600 x i32]* %a, i64 0, i64 %indvars.iv25
  %2 = trunc i64 %indvars.iv25 to i32
  store i32 %2, i32* %arrayidx, align 4
  %indvars.iv.next26 = add nuw nsw i64 %indvars.iv25, 1
  %exitcond27 = icmp eq i64 %indvars.iv.next26, 1600
  br i1 %exitcond27, label %for.cond.cleanup, label %for.body

for.cond.cleanup5:                                ; preds = %for.body6
  call void @llvm.lifetime.end(i64 6400, i8* nonnull %1) #3
  call void @llvm.lifetime.end(i64 6400, i8* %0) #3
  ret i32 %add

for.body6:                                        ; preds = %for.body6, %for.cond.cleanup
  %indvars.iv = phi i64 [ 0, %for.cond.cleanup ], [ %indvars.iv.next, %for.body6 ]
  %s.022 = phi i32 [ 0, %for.cond.cleanup ], [ %add, %for.body6 ]
  %arrayidx8 = getelementptr inbounds [1600 x i32], [1600 x i32]* %c, i64 0, i64 %indvars.iv
  %3 = load i32, i32* %arrayidx8, align 4
  %add = add i32 %3, %s.022
  %indvars.iv.next = add nuw nsw i64 %indvars.iv, 1
  %exitcond = icmp eq i64 %indvars.iv.next, 1600
  br i1 %exitcond, label %for.cond.cleanup5, label %for.body6
}

; Function Attrs: argmemonly nounwind
declare void @llvm.lifetime.start(i64, i8* nocapture) #1

; Function Attrs: argmemonly nounwind
declare void @llvm.lifetime.end(i64, i8* nocapture) #1

declare signext i32 @bar(i32*, i32*) #2

attributes #0 = { nounwind "target-cpu"="a2q" "target-features"="+qpx,-altivec,-bpermd,-crypto,-direct-move,-extdiv,-power8-vector,-vsx" }
attributes #1 = { argmemonly nounwind }
attributes #2 = { "target-cpu"="a2q" "target-features"="+qpx,-altivec,-bpermd,-crypto,-direct-move,-extdiv,-power8-vector,-vsx" }
attributes #3 = { nounwind }
Lines changed: 39 additions & 0 deletions
@@ -0,0 +1,39 @@ (new file)

; RUN: opt -S -basicaa -loop-vectorize < %s | FileCheck %s
target datalayout = "e-m:o-i64:64-f80:128-n8:16:32:64-S128"
target triple = "x86_64-apple-macosx10.11.0"

define i32 @accum(i32* nocapture readonly %x, i32 %N) #0 {
entry:
; CHECK-LABEL: @accum
; CHECK-NOT: x i32>

  %cmp1 = icmp sgt i32 %N, 0
  br i1 %cmp1, label %for.inc.preheader, label %for.end

for.inc.preheader:
  br label %for.inc

for.inc:
  %indvars.iv = phi i64 [ %indvars.iv.next, %for.inc ], [ 0, %for.inc.preheader ]
  %sum.02 = phi i32 [ %add, %for.inc ], [ 0, %for.inc.preheader ]
  %arrayidx = getelementptr inbounds i32, i32* %x, i64 %indvars.iv
  %0 = load i32, i32* %arrayidx, align 4
  %add = add nsw i32 %0, %sum.02
  %indvars.iv.next = add nuw nsw i64 %indvars.iv, 1
  %lftr.wideiv = trunc i64 %indvars.iv.next to i32
  %exitcond = icmp eq i32 %lftr.wideiv, %N
  br i1 %exitcond, label %for.end.loopexit, label %for.inc

for.end.loopexit:
  %add.lcssa = phi i32 [ %add, %for.inc ]
  br label %for.end

for.end:
  %sum.0.lcssa = phi i32 [ 0, %entry ], [ %add.lcssa, %for.end.loopexit ]
  ret i32 %sum.0.lcssa

; CHECK: ret i32
}

attributes #0 = { "target-cpu"="core2" "target-features"="+sse,-avx,-avx2,-sse2" }
