[AArch64][CostModel] Improve cost estimate of scalarizing a vector di… #118055

Open: wants to merge 2 commits into main
35 changes: 35 additions & 0 deletions llvm/lib/Target/AArch64/AArch64TargetTransformInfo.cpp
@@ -18,6 +18,7 @@
#include "llvm/CodeGen/BasicTTIImpl.h"
#include "llvm/CodeGen/CostTable.h"
#include "llvm/CodeGen/TargetLowering.h"
#include "llvm/IR/Constants.h"
Contributor Author

I will remove this in the next version; it was added accidentally.

#include "llvm/IR/IntrinsicInst.h"
#include "llvm/IR/Intrinsics.h"
#include "llvm/IR/IntrinsicsAArch64.h"
@@ -3572,6 +3573,40 @@ InstructionCost AArch64TTIImpl::getArithmeticInstrCost(
Cost *= 4;
return Cost;
} else {
// If the information about individual scalars being vectorized is
// available, this yields better cost estimation.
if (auto *VTy = dyn_cast<FixedVectorType>(Ty); VTy && !Args.empty()) {
assert(Args.size() % 2 == 0 && "Args size should be even");
InstructionCost InsertExtractCost =
ST->getVectorInsertExtractBaseCost();
// The cost of a single sdiv is being queried through the cost model.
// FIXME: remove the isa checks once the PR 122236 lands.
if (Args.size() == 2 &&
!(isa<ConstantVector>(Args[1]) ||
isa<ConstantDataVector>(Args[1]) ||
isa<ConstantExpr>(Args[1])) &&
Comment on lines +3584 to +3587
Member

I assume you do not need to pass Args for this; instead, you can rely on Op1Info and Op2Info.

Contributor Author

For the tests where only Op1 or only Op2 is constant (see the added tests), relying on Op1Info and Op2Info alone does not work. In such cases, we need to pass Args.

Member

Why?

Contributor Author

The cost of a division is lower when the divisor is constant than when the divisor is unknown.

When estimating the scalarization cost, we consider the div cost of each lane plus the additional insert/extract cost. This is where the patch yields a lower cost, by using the extra knowledge about the values that go into those lanes.
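
The arithmetic described in this comment can be sketched with a small standalone model of the patch's two branches. This is only an illustration: the cost constants and all names below (`Lane`, `scalarizedCostGeneric`, `scalarizedCostPerLane`) are hypothetical placeholders, not values or functions from the AArch64 cost model.

```cpp
#include <cassert>
#include <vector>

// Hypothetical per-lane description: is this lane's divisor a known constant?
struct Lane {
  bool DivisorIsConstant;
};

// Placeholder costs chosen only for illustration.
constexpr int kInsertExtractCost = 2;
constexpr int kDivUnknownCost = 4;
constexpr int kDivConstCost = 1;

// No per-lane knowledge: every lane is charged the generic division cost
// plus three insert/extract operations, then scaled by the lane count.
int scalarizedCostGeneric(int NumLanes) {
  return (kDivUnknownCost + 3 * kInsertExtractCost) * NumLanes;
}

// Per-lane knowledge: the insert/extract overhead is the same, but each
// lane's division is priced according to its own divisor.
int scalarizedCostPerLane(const std::vector<Lane> &Lanes) {
  int Cost = 3 * kInsertExtractCost * static_cast<int>(Lanes.size());
  for (const Lane &L : Lanes)
    Cost += L.DivisorIsConstant ? kDivConstCost : kDivUnknownCost;
  return Cost;
}
```

With these placeholder values, a two-lane division with one constant divisor costs 17 under the per-lane estimate and 20 under the generic one; when no divisor is constant, both estimates agree.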

Member

Op2Info provides information about the constantness of the divisor.

Member

Also, as I have said before, if the node is scalarized, it must be represented as a buildvector node.

Contributor Author

Please check the case @sdiv_v2i32_Op1_unknown_Op2_const, where only one of the divisors is constant. In that case, Op2Info would be something generic and doesn't give sufficient information. I hope this makes sense.
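
A simplified model of the point being made here: when a whole-vector summary is computed for the divisor operand, a single unknown lane makes the summary generic, so the constant in the other lane is no longer visible. The enum and function below are hypothetical stand-ins written for this sketch, and the assumption that the summary behaves this way is taken from the comment above rather than from the LLVM sources.

```cpp
#include <cassert>
#include <optional>
#include <vector>

// Hypothetical, simplified classification of a vector operand.
enum class OperandKind { AnyValue, UniformConstant, NonUniformConstant };

// A lane's divisor: a known constant, or std::nullopt when it is unknown.
using LaneDivisor = std::optional<int>;

// Summarize all lanes into one kind. Any unknown lane forces AnyValue,
// which hides the constants present in the remaining lanes.
OperandKind summarize(const std::vector<LaneDivisor> &Divisors) {
  bool AllSame = true;
  for (const LaneDivisor &D : Divisors) {
    if (!D.has_value())
      return OperandKind::AnyValue;
    if (*D != *Divisors.front())
      AllSame = false;
  }
  return AllSame ? OperandKind::UniformConstant
                 : OperandKind::NonUniformConstant;
}
```

For divisors shaped like those in @sdiv_v2i32_Op1_unknown_Op2_const (one unknown lane, one lane equal to 4), this model returns `AnyValue`, whereas `<2, 4>` is classified as a non-uniform constant.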

Contributor Author

> Also, I told this before already, if the node is scalarized, it must be represented as a buildvector node.

I have already tried to explain why this would not be a better option.

none_of(Args, IsaPred<UndefValue, PoisonValue>)) {
unsigned NElts = VTy->getNumElements();
// Compute per element cost
Cost = getArithmeticInstrCost(Opcode, VTy->getScalarType(),
CostKind, Op1Info.getNoProps(),
Op2Info.getNoProps());
Cost += 3 * InsertExtractCost;
Cost *= NElts;
return Cost;
} else if (Args.size() > 2) { // The vectorization cost is being queried.
Cost = (3 * InsertExtractCost) * VTy->getNumElements();
for (int i = 0, Sz = Args.size(); i < Sz; i += 2) {
Cost +=
getArithmeticInstrCost(Opcode, VTy->getScalarType(), CostKind,
TTI::getOperandInfo(Args[i]),
TTI::getOperandInfo(Args[i + 1]));
}
return Cost;
}
}

// If one of the operands is a uniform constant then the cost for each
// element is Cost for insertion, extraction and division.
// Insertion cost = 2, Extraction Cost = 2, Division = cost for the
17 changes: 14 additions & 3 deletions llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
@@ -11650,9 +11650,20 @@ BoUpSLP::getEntryCost(const TreeEntry *E, ArrayRef<Value *> VectorizedVals,
unsigned OpIdx = isa<UnaryOperator>(VL0) ? 0 : 1;
TTI::OperandValueInfo Op1Info = getOperandInfo(E->getOperand(0));
TTI::OperandValueInfo Op2Info = getOperandInfo(E->getOperand(OpIdx));
return TTI->getArithmeticInstrCost(ShuffleOrOp, VecTy, CostKind, Op1Info,
Op2Info, {}, nullptr, TLI) +
CommonCost;
SmallVector<Value *, 16> Operands;
if (all_of(E->Scalars, [ShuffleOrOp](Value *V) {
return !IsaPred<UndefValue, PoisonValue>(V) &&
Member

return !isa<UndefValue, PoisonValue>(V) &&

cast<Instruction>(V)->getOpcode() == ShuffleOrOp;
})) {
for (auto *Scalar : E->Scalars) {
Instruction *I = cast<Instruction>(Scalar);
auto IOperands = I->operand_values();
Operands.insert(Operands.end(), IOperands.begin(), IOperands.end());
}
}
return CommonCost +
TTI->getArithmeticInstrCost(ShuffleOrOp, VecTy, CostKind, Op1Info,
Op2Info, Operands, nullptr, TLI);
};
return GetCostDiff(GetScalarCost, GetVectorCost);
}
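
The hunk above flattens the operands of every scalar into one list, and the AArch64 side then walks that list two entries at a time. A minimal standalone sketch of that layout follows, assuming binary operations only; `ScalarDiv`, `flattenOperands`, and `lanePairs` are hypothetical names used for illustration, with strings standing in for `Value *`.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for one scalar binary instruction in the bundle.
struct ScalarDiv {
  std::string Dividend;
  std::string Divisor;
};

// Producer side: append both operands of each scalar, in lane order, so
// the list reads [lhs0, rhs0, lhs1, rhs1, ...].
std::vector<std::string> flattenOperands(const std::vector<ScalarDiv> &Scalars) {
  std::vector<std::string> Args;
  for (const ScalarDiv &S : Scalars) {
    Args.push_back(S.Dividend);
    Args.push_back(S.Divisor);
  }
  return Args;
}

// Consumer side: step through the flat list in pairs, one pair per lane.
std::vector<std::pair<std::string, std::string>>
lanePairs(const std::vector<std::string> &Args) {
  assert(Args.size() % 2 == 0 && "Args size should be even");
  std::vector<std::pair<std::string, std::string>> Pairs;
  for (std::size_t I = 0; I < Args.size(); I += 2)
    Pairs.emplace_back(Args[I], Args[I + 1]);
  return Pairs;
}
```

The even-size assertion mirrors the one in the patch: because each lane contributes exactly two entries, an odd-length list would mean the producer and consumer disagree about the layout.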
110 changes: 22 additions & 88 deletions llvm/test/Transforms/SLPVectorizer/AArch64/div.ll
@@ -553,35 +553,13 @@ define <4 x i32> @slp_v4i32_Op1_unknown_Op2_const_pow2(<4 x i32> %a)
}

define <2 x i32> @sdiv_v2i32_unknown_divisor(<2 x i32> %a, <2 x i32> %x, <2 x i32> %y, <2 x i32> %z)
; NO-SVE-LABEL: define <2 x i32> @sdiv_v2i32_unknown_divisor(
; NO-SVE-SAME: <2 x i32> [[A:%.*]], <2 x i32> [[X:%.*]], <2 x i32> [[Y:%.*]], <2 x i32> [[Z:%.*]]) #[[ATTR0]] {
; NO-SVE-NEXT: [[A0:%.*]] = extractelement <2 x i32> [[A]], i64 0
; NO-SVE-NEXT: [[A1:%.*]] = extractelement <2 x i32> [[A]], i64 1
; NO-SVE-NEXT: [[X0:%.*]] = extractelement <2 x i32> [[X]], i64 0
; NO-SVE-NEXT: [[X1:%.*]] = extractelement <2 x i32> [[X]], i64 1
; NO-SVE-NEXT: [[TMP1:%.*]] = sdiv i32 [[A0]], [[X0]]
; NO-SVE-NEXT: [[TMP2:%.*]] = sdiv i32 [[A1]], [[X1]]
; NO-SVE-NEXT: [[TMP3:%.*]] = add i32 [[TMP1]], [[X0]]
; NO-SVE-NEXT: [[TMP4:%.*]] = add i32 [[TMP2]], [[X1]]
; NO-SVE-NEXT: [[Y0:%.*]] = extractelement <2 x i32> [[Y]], i64 0
; NO-SVE-NEXT: [[Y1:%.*]] = extractelement <2 x i32> [[Y]], i64 1
; NO-SVE-NEXT: [[TMP5:%.*]] = sub i32 [[TMP3]], [[Y0]]
; NO-SVE-NEXT: [[TMP6:%.*]] = sub i32 [[TMP4]], [[Y1]]
; NO-SVE-NEXT: [[Z0:%.*]] = extractelement <2 x i32> [[Z]], i64 0
; NO-SVE-NEXT: [[Z1:%.*]] = extractelement <2 x i32> [[Z]], i64 1
; NO-SVE-NEXT: [[TMP7:%.*]] = mul i32 [[TMP5]], [[Z0]]
; NO-SVE-NEXT: [[TMP8:%.*]] = mul i32 [[TMP6]], [[Z1]]
; NO-SVE-NEXT: [[RES0:%.*]] = insertelement <2 x i32> poison, i32 [[TMP7]], i32 0
; NO-SVE-NEXT: [[RES1:%.*]] = insertelement <2 x i32> [[RES0]], i32 [[TMP8]], i32 1
; NO-SVE-NEXT: ret <2 x i32> [[RES1]]
;
; SVE-LABEL: define <2 x i32> @sdiv_v2i32_unknown_divisor(
; SVE-SAME: <2 x i32> [[A:%.*]], <2 x i32> [[X:%.*]], <2 x i32> [[Y:%.*]], <2 x i32> [[Z:%.*]]) #[[ATTR0]] {
; SVE-NEXT: [[TMP2:%.*]] = sdiv <2 x i32> [[A]], [[X]]
; SVE-NEXT: [[TMP3:%.*]] = add <2 x i32> [[TMP2]], [[X]]
; SVE-NEXT: [[TMP4:%.*]] = sub <2 x i32> [[TMP3]], [[Y]]
; SVE-NEXT: [[TMP5:%.*]] = mul <2 x i32> [[TMP4]], [[Z]]
; SVE-NEXT: ret <2 x i32> [[TMP5]]
; CHECK-LABEL: define <2 x i32> @sdiv_v2i32_unknown_divisor(
; CHECK-SAME: <2 x i32> [[A:%.*]], <2 x i32> [[X:%.*]], <2 x i32> [[Y:%.*]], <2 x i32> [[Z:%.*]]) #[[ATTR0]] {
; CHECK-NEXT: [[TMP1:%.*]] = sdiv <2 x i32> [[A]], [[X]]
; CHECK-NEXT: [[TMP2:%.*]] = add <2 x i32> [[TMP1]], [[X]]
; CHECK-NEXT: [[TMP3:%.*]] = sub <2 x i32> [[TMP2]], [[Y]]
; CHECK-NEXT: [[TMP4:%.*]] = mul <2 x i32> [[TMP3]], [[Z]]
; CHECK-NEXT: ret <2 x i32> [[TMP4]]
;
{
%a0 = extractelement <2 x i32> %a, i64 0
@@ -607,35 +585,13 @@ define <2 x i32> @sdiv_v2i32_unknown_divisor(<2 x i32> %a, <2 x i32> %x, <2 x i3

; computes (a/const + x - y) * z
define <2 x i32> @sdiv_v2i32_const_divisor(<2 x i32> %a, <2 x i32> %x, <2 x i32> %y, <2 x i32> %z)
; NO-SVE-LABEL: define <2 x i32> @sdiv_v2i32_const_divisor(
; NO-SVE-SAME: <2 x i32> [[A:%.*]], <2 x i32> [[X:%.*]], <2 x i32> [[Y:%.*]], <2 x i32> [[Z:%.*]]) #[[ATTR0]] {
; NO-SVE-NEXT: [[A0:%.*]] = extractelement <2 x i32> [[A]], i64 0
; NO-SVE-NEXT: [[A1:%.*]] = extractelement <2 x i32> [[A]], i64 1
; NO-SVE-NEXT: [[TMP1:%.*]] = sdiv i32 [[A0]], 2
; NO-SVE-NEXT: [[TMP2:%.*]] = sdiv i32 [[A1]], 4
; NO-SVE-NEXT: [[X0:%.*]] = extractelement <2 x i32> [[X]], i64 0
; NO-SVE-NEXT: [[X1:%.*]] = extractelement <2 x i32> [[X]], i64 1
; NO-SVE-NEXT: [[TMP3:%.*]] = add i32 [[TMP1]], [[X0]]
; NO-SVE-NEXT: [[TMP4:%.*]] = add i32 [[TMP2]], [[X1]]
; NO-SVE-NEXT: [[Y0:%.*]] = extractelement <2 x i32> [[Y]], i64 0
; NO-SVE-NEXT: [[Y1:%.*]] = extractelement <2 x i32> [[Y]], i64 1
; NO-SVE-NEXT: [[TMP5:%.*]] = sub i32 [[TMP3]], [[Y0]]
; NO-SVE-NEXT: [[TMP6:%.*]] = sub i32 [[TMP4]], [[Y1]]
; NO-SVE-NEXT: [[Z0:%.*]] = extractelement <2 x i32> [[Z]], i64 0
; NO-SVE-NEXT: [[Z1:%.*]] = extractelement <2 x i32> [[Z]], i64 1
; NO-SVE-NEXT: [[TMP7:%.*]] = mul i32 [[TMP5]], [[Z0]]
; NO-SVE-NEXT: [[TMP8:%.*]] = mul i32 [[TMP6]], [[Z1]]
; NO-SVE-NEXT: [[RES0:%.*]] = insertelement <2 x i32> poison, i32 [[TMP7]], i32 0
; NO-SVE-NEXT: [[RES1:%.*]] = insertelement <2 x i32> [[RES0]], i32 [[TMP8]], i32 1
; NO-SVE-NEXT: ret <2 x i32> [[RES1]]
;
; SVE-LABEL: define <2 x i32> @sdiv_v2i32_const_divisor(
; SVE-SAME: <2 x i32> [[A:%.*]], <2 x i32> [[X:%.*]], <2 x i32> [[Y:%.*]], <2 x i32> [[Z:%.*]]) #[[ATTR0]] {
; SVE-NEXT: [[TMP1:%.*]] = sdiv <2 x i32> [[A]], <i32 2, i32 4>
; SVE-NEXT: [[TMP2:%.*]] = add <2 x i32> [[TMP1]], [[X]]
; SVE-NEXT: [[TMP3:%.*]] = sub <2 x i32> [[TMP2]], [[Y]]
; SVE-NEXT: [[TMP4:%.*]] = mul <2 x i32> [[TMP3]], [[Z]]
; SVE-NEXT: ret <2 x i32> [[TMP4]]
; CHECK-LABEL: define <2 x i32> @sdiv_v2i32_const_divisor(
; CHECK-SAME: <2 x i32> [[A:%.*]], <2 x i32> [[X:%.*]], <2 x i32> [[Y:%.*]], <2 x i32> [[Z:%.*]]) #[[ATTR0]] {
; CHECK-NEXT: [[TMP1:%.*]] = sdiv <2 x i32> [[A]], <i32 2, i32 4>
; CHECK-NEXT: [[TMP2:%.*]] = add <2 x i32> [[TMP1]], [[X]]
; CHECK-NEXT: [[TMP3:%.*]] = sub <2 x i32> [[TMP2]], [[Y]]
; CHECK-NEXT: [[TMP4:%.*]] = mul <2 x i32> [[TMP3]], [[Z]]
; CHECK-NEXT: ret <2 x i32> [[TMP4]]
;
{
%a0 = extractelement <2 x i32> %a, i64 0
@@ -660,36 +616,14 @@ define <2 x i32> @sdiv_v2i32_const_divisor(<2 x i32> %a, <2 x i32> %x, <2 x i32>
}

define <2 x i32> @sdiv_v2i32_Op1_unknown_Op2_const(<2 x i32> %a, <2 x i32> %x, <2 x i32> %y, <2 x i32> %z)
; NO-SVE-LABEL: define <2 x i32> @sdiv_v2i32_Op1_unknown_Op2_const(
; NO-SVE-SAME: <2 x i32> [[A:%.*]], <2 x i32> [[X:%.*]], <2 x i32> [[Y:%.*]], <2 x i32> [[Z:%.*]]) #[[ATTR0]] {
; NO-SVE-NEXT: [[A0:%.*]] = extractelement <2 x i32> [[A]], i64 0
; NO-SVE-NEXT: [[A1:%.*]] = extractelement <2 x i32> [[A]], i64 1
; NO-SVE-NEXT: [[TMP1:%.*]] = sdiv i32 [[A0]], [[A0]]
; NO-SVE-NEXT: [[TMP2:%.*]] = sdiv i32 [[A1]], 4
; NO-SVE-NEXT: [[X0:%.*]] = extractelement <2 x i32> [[X]], i64 0
; NO-SVE-NEXT: [[X1:%.*]] = extractelement <2 x i32> [[X]], i64 1
; NO-SVE-NEXT: [[TMP3:%.*]] = add i32 [[TMP1]], [[X0]]
; NO-SVE-NEXT: [[TMP4:%.*]] = add i32 [[TMP2]], [[X1]]
; NO-SVE-NEXT: [[Y0:%.*]] = extractelement <2 x i32> [[Y]], i64 0
; NO-SVE-NEXT: [[Y1:%.*]] = extractelement <2 x i32> [[Y]], i64 1
; NO-SVE-NEXT: [[TMP5:%.*]] = sub i32 [[TMP3]], [[Y0]]
; NO-SVE-NEXT: [[TMP6:%.*]] = sub i32 [[TMP4]], [[Y1]]
; NO-SVE-NEXT: [[Z0:%.*]] = extractelement <2 x i32> [[Z]], i64 0
; NO-SVE-NEXT: [[Z1:%.*]] = extractelement <2 x i32> [[Z]], i64 1
; NO-SVE-NEXT: [[TMP7:%.*]] = mul i32 [[TMP5]], [[Z0]]
; NO-SVE-NEXT: [[TMP8:%.*]] = mul i32 [[TMP6]], [[Z1]]
; NO-SVE-NEXT: [[RES0:%.*]] = insertelement <2 x i32> poison, i32 [[TMP7]], i32 0
; NO-SVE-NEXT: [[RES1:%.*]] = insertelement <2 x i32> [[RES0]], i32 [[TMP8]], i32 1
; NO-SVE-NEXT: ret <2 x i32> [[RES1]]
;
; SVE-LABEL: define <2 x i32> @sdiv_v2i32_Op1_unknown_Op2_const(
; SVE-SAME: <2 x i32> [[A:%.*]], <2 x i32> [[X:%.*]], <2 x i32> [[Y:%.*]], <2 x i32> [[Z:%.*]]) #[[ATTR0]] {
; SVE-NEXT: [[TMP1:%.*]] = shufflevector <2 x i32> [[A]], <2 x i32> <i32 poison, i32 4>, <2 x i32> <i32 0, i32 3>
; SVE-NEXT: [[TMP2:%.*]] = sdiv <2 x i32> [[A]], [[TMP1]]
; SVE-NEXT: [[TMP3:%.*]] = add <2 x i32> [[TMP2]], [[X]]
; SVE-NEXT: [[TMP4:%.*]] = sub <2 x i32> [[TMP3]], [[Y]]
; SVE-NEXT: [[TMP5:%.*]] = mul <2 x i32> [[TMP4]], [[Z]]
; SVE-NEXT: ret <2 x i32> [[TMP5]]
; CHECK-LABEL: define <2 x i32> @sdiv_v2i32_Op1_unknown_Op2_const(
; CHECK-SAME: <2 x i32> [[A:%.*]], <2 x i32> [[X:%.*]], <2 x i32> [[Y:%.*]], <2 x i32> [[Z:%.*]]) #[[ATTR0]] {
; CHECK-NEXT: [[TMP1:%.*]] = shufflevector <2 x i32> [[A]], <2 x i32> <i32 poison, i32 4>, <2 x i32> <i32 0, i32 3>
; CHECK-NEXT: [[TMP2:%.*]] = sdiv <2 x i32> [[A]], [[TMP1]]
; CHECK-NEXT: [[TMP3:%.*]] = add <2 x i32> [[TMP2]], [[X]]
; CHECK-NEXT: [[TMP4:%.*]] = sub <2 x i32> [[TMP3]], [[Y]]
; CHECK-NEXT: [[TMP5:%.*]] = mul <2 x i32> [[TMP4]], [[Z]]
; CHECK-NEXT: ret <2 x i32> [[TMP5]]
;
{
%a0 = extractelement <2 x i32> %a, i64 0