[LoopCacheAnalysis] Fix loop cache cost to always round the cost up to the nearest integer number #88915
Conversation
@llvm/pr-subscribers-backend-powerpc @llvm/pr-subscribers-llvm-transforms

Author: Rouzbeh (RouzbehPaktinat)

Changes

Currently loop cache analysis uses the following formula to evaluate the cost of a RefGroup for a consecutive memory access:

RefCost = (TripCount * Stride) / CLS

This cost evaluates to zero when TripCount * Stride is smaller than the cache line size, which produces a wrong cost value for the loop and misleads loopInterchange decisions. This patch fixes the problem by rounding the cost up to 1 when that happens.

Full diff: https://github.com/llvm/llvm-project/pull/88915.diff

3 Files Affected:
diff --git a/llvm/lib/Analysis/LoopCacheAnalysis.cpp b/llvm/lib/Analysis/LoopCacheAnalysis.cpp
index c3a56639b5c8f8..9d1878b4cd3085 100644
--- a/llvm/lib/Analysis/LoopCacheAnalysis.cpp
+++ b/llvm/lib/Analysis/LoopCacheAnalysis.cpp
@@ -299,8 +299,15 @@ CacheCostTy IndexedReference::computeRefCost(const Loop &L,
Stride = SE.getNoopOrAnyExtend(Stride, WiderType);
TripCount = SE.getNoopOrZeroExtend(TripCount, WiderType);
const SCEV *Numerator = SE.getMulExpr(Stride, TripCount);
- RefCost = SE.getUDivExpr(Numerator, CacheLineSize);
-
+ ConstantInt *One =
+ ConstantInt::get(TripCount->getType()->getContext(), APInt(32, 1));
+ bool IsZero =
+ SE.isKnownPredicate(ICmpInst::ICMP_ULT, Numerator, CacheLineSize);
+ // When result is zero, round it to one because at least one cache line must
+ // be used. It does not make sense to output the result that zero cache line
+ // is used
+ RefCost =
+ IsZero ? SE.getSCEV(One) : SE.getUDivExpr(Numerator, CacheLineSize);
LLVM_DEBUG(dbgs().indent(4)
<< "Access is consecutive: RefCost=(TripCount*Stride)/CLS="
<< *RefCost << "\n");
diff --git a/llvm/test/Analysis/LoopCacheAnalysis/interchange-cost-beneficial.ll b/llvm/test/Analysis/LoopCacheAnalysis/interchange-cost-beneficial.ll
new file mode 100644
index 00000000000000..3086224c582048
--- /dev/null
+++ b/llvm/test/Analysis/LoopCacheAnalysis/interchange-cost-beneficial.ll
@@ -0,0 +1,62 @@
+; RUN: opt < %s -cache-line-size=64 -passes='print<loop-cache-cost>' -disable-output 2>&1 | FileCheck %s
+
+;; This test checks the effect of rounding cache cost to 1 when it is
+;; evaluated to 0 because at least 1 cache line is accessed by the loopnest.
+;; It does not make sense to output that zero cache lines are used.
+;; The costs of the reference groups for B[j], C[j], D[j] and E[j] were
+;; calculated as 0 before, but are now 1, which makes each loop cost more reasonable.
+;
+; void test(int n, int m, int o, int A[2][3], int B[2], int C[2], int D[2], int E[2]) {
+;   for (int i = 0; i < 3; i++)
+;     for (int j = 0; j < 2; j++) {
+;       A[j][i] = 1;
+;       B[j] = 1;
+;       C[j] = 1;
+;       D[j] = 1;
+;       E[j] = 1;
+;     }
+; }
+
+; CHECK: Loop 'for.j' has cost = 18
+; CHECK-NEXT: Loop 'for.i' has cost = 10
+
+define void @test(ptr %A, ptr %B, ptr %C, ptr %D, ptr %E) {
+
+entry:
+ br label %for.i.preheader.split
+
+for.i.preheader.split: ; preds = %for.i.preheader
+ br label %for.i
+
+for.i: ; preds = %for.inci, %for.i.preheader.split
+ %i = phi i64 [ %inci, %for.inci ], [ 0, %for.i.preheader.split ]
+ br label %for.j
+
+for.j: ; preds = %for.incj, %for.i
+ %j = phi i64 [ %incj, %for.j ], [ 0, %for.i ]
+ %mul_j = mul nsw i64 %j, 3
+ %index_j = add i64 %mul_j, %i
+ %arrayidxA = getelementptr inbounds [2 x [ 3 x i32]], ptr %A, i64 %j, i64 %i
+ store i32 1, ptr %arrayidxA, align 4
+ %arrayidxB = getelementptr inbounds i32, ptr %B, i64 %j
+ store i32 1, ptr %arrayidxB, align 4
+ %arrayidxC = getelementptr inbounds i32, ptr %C, i64 %j
+ store i32 1, ptr %arrayidxC, align 4
+ %arrayidxD = getelementptr inbounds i32, ptr %D, i64 %j
+ store i32 1, ptr %arrayidxD, align 4
+ %arrayidxE = getelementptr inbounds i32, ptr %E, i64 %j
+ store i32 1, ptr %arrayidxE, align 4
+ %incj = add nsw i64 %j, 1
+ %exitcond.us = icmp eq i64 %incj, 2
+ br i1 %exitcond.us, label %for.inci, label %for.j
+
+for.inci: ; preds = %for.incj
+ %inci = add nsw i64 %i, 1
+ %exitcond55.us = icmp eq i64 %inci, 3
+ br i1 %exitcond55.us, label %for.end.loopexit, label %for.i
+
+for.end.loopexit: ; preds = %for.inci
+ br label %for.end
+
+for.end: ; preds = %for.end.loopexit, %for.cond1.preheader.lr.ph, %entry
+ ret void
+}
diff --git a/llvm/test/Transforms/LoopInterchange/pr43176-move-to-new-latch.ll b/llvm/test/Transforms/LoopInterchange/pr43176-move-to-new-latch.ll
index 965d95110da466..cc787fa55600a6 100644
--- a/llvm/test/Transforms/LoopInterchange/pr43176-move-to-new-latch.ll
+++ b/llvm/test/Transforms/LoopInterchange/pr43176-move-to-new-latch.ll
@@ -1,42 +1,25 @@
; NOTE: Assertions have been autogenerated by utils/update_test_checks.py
-; RUN: opt -passes=loop-interchange -cache-line-size=64 -verify-loop-lcssa -verify-dom-info -S %s | FileCheck %s
+; RUN: opt < %s -passes=loop-interchange -cache-line-size=64 -pass-remarks-missed='loop-interchange' -pass-remarks-output=%t -S
+; RUN: FileCheck --input-file=%t %s
@b = external dso_local global [5 x i32], align 16
+;; Not profitable to interchange, because the access is invariant to j loop.
+;;
+;; for(int i=0;i<4;i++) {
+;; for(int j=1;j<4;j++) {
+;; b[i] = ....
+;; }
+;; }
+
+; CHECK: --- !Missed
+; CHECK-NEXT: Pass: loop-interchange
+; CHECK-NEXT: Name: InterchangeNotProfitable
+; CHECK-NEXT: Function: test1
+; CHECK-NEXT: Args:
+; CHECK-NEXT: - String: Interchanging loops is not considered to improve cache locality nor vectorization.
+
define void @test1() {
-; CHECK-LABEL: @test1(
-; CHECK-NEXT: entry:
-; CHECK-NEXT: br label [[FOR_BODY2_PREHEADER:%.*]]
-; CHECK: for.body.preheader:
-; CHECK-NEXT: br label [[FOR_BODY:%.*]]
-; CHECK: for.body:
-; CHECK-NEXT: [[INC41:%.*]] = phi i32 [ [[INC4:%.*]], [[FOR_INC3:%.*]] ], [ undef, [[FOR_BODY_PREHEADER:%.*]] ]
-; CHECK-NEXT: [[IDXPROM:%.*]] = sext i32 [[INC41]] to i64
-; CHECK-NEXT: [[ARRAYIDX:%.*]] = getelementptr inbounds [5 x i32], ptr @b, i64 0, i64 [[IDXPROM]]
-; CHECK-NEXT: br label [[FOR_INC:%.*]]
-; CHECK: for.body2.preheader:
-; CHECK-NEXT: br label [[FOR_BODY2:%.*]]
-; CHECK: for.body2:
-; CHECK-NEXT: [[LSR_IV:%.*]] = phi i32 [ [[TMP1:%.*]], [[FOR_INC_SPLIT:%.*]] ], [ 1, [[FOR_BODY2_PREHEADER]] ]
-; CHECK-NEXT: br label [[FOR_BODY_PREHEADER]]
-; CHECK: for.inc:
-; CHECK-NEXT: [[TMP0:%.*]] = load i32, ptr [[ARRAYIDX]], align 4
-; CHECK-NEXT: store i32 undef, ptr [[ARRAYIDX]], align 4
-; CHECK-NEXT: [[CMP:%.*]] = icmp slt i32 [[LSR_IV]], 4
-; CHECK-NEXT: [[LSR_IV_NEXT:%.*]] = add nuw nsw i32 [[LSR_IV]], 1
-; CHECK-NEXT: br label [[FOR_COND1_FOR_END_CRIT_EDGE:%.*]]
-; CHECK: for.inc.split:
-; CHECK-NEXT: [[TMP1]] = add nuw nsw i32 [[LSR_IV]], 1
-; CHECK-NEXT: [[TMP2:%.*]] = icmp slt i32 [[LSR_IV]], 4
-; CHECK-NEXT: br i1 [[TMP2]], label [[FOR_BODY2]], label [[FOR_COND_FOR_END5_CRIT_EDGE:%.*]]
-; CHECK: for.cond1.for.end_crit_edge:
-; CHECK-NEXT: br label [[FOR_INC3]]
-; CHECK: for.inc3:
-; CHECK-NEXT: [[INC4]] = add nsw i32 [[INC41]], 1
-; CHECK-NEXT: br i1 false, label [[FOR_BODY]], label [[FOR_INC_SPLIT]]
-; CHECK: for.cond.for.end5_crit_edge:
-; CHECK-NEXT: ret void
-;
entry:
br label %for.body
@@ -68,41 +51,15 @@ for.cond.for.end5_crit_edge: ; preds = %for.inc3
ret void
}
+
+; CHECK: --- !Missed
+; CHECK-NEXT: Pass: loop-interchange
+; CHECK-NEXT: Name: InterchangeNotProfitable
+; CHECK-NEXT: Function: test2
+; CHECK-NEXT: Args:
+; CHECK-NEXT: - String: Interchanging loops is not considered to improve cache locality nor vectorization.
+
define void @test2() {
-; CHECK-LABEL: @test2(
-; CHECK-NEXT: entry:
-; CHECK-NEXT: br label [[FOR_BODY2_PREHEADER:%.*]]
-; CHECK: for.body.preheader:
-; CHECK-NEXT: br label [[FOR_BODY:%.*]]
-; CHECK: for.body:
-; CHECK-NEXT: [[INC41:%.*]] = phi i32 [ [[INC4:%.*]], [[FOR_INC3:%.*]] ], [ undef, [[FOR_BODY_PREHEADER:%.*]] ]
-; CHECK-NEXT: [[IDXPROM:%.*]] = sext i32 [[INC41]] to i64
-; CHECK-NEXT: [[ARRAYIDX:%.*]] = getelementptr inbounds [5 x i32], ptr @b, i64 0, i64 [[IDXPROM]]
-; CHECK-NEXT: br label [[FOR_INC:%.*]]
-; CHECK: for.body2.preheader:
-; CHECK-NEXT: br label [[FOR_BODY2:%.*]]
-; CHECK: for.body2:
-; CHECK-NEXT: [[LSR_IV:%.*]] = phi i32 [ [[TMP1:%.*]], [[FOR_INC_SPLIT:%.*]] ], [ 1, [[FOR_BODY2_PREHEADER]] ]
-; CHECK-NEXT: br label [[FOR_BODY_PREHEADER]]
-; CHECK: for.inc:
-; CHECK-NEXT: [[TMP0:%.*]] = load i32, ptr [[ARRAYIDX]], align 4
-; CHECK-NEXT: [[CMP:%.*]] = icmp slt i32 [[LSR_IV]], 4
-; CHECK-NEXT: [[CMP_ZEXT:%.*]] = zext i1 [[CMP]] to i32
-; CHECK-NEXT: store i32 [[CMP_ZEXT]], ptr [[ARRAYIDX]], align 4
-; CHECK-NEXT: [[LSR_IV_NEXT:%.*]] = add nuw nsw i32 [[LSR_IV]], 1
-; CHECK-NEXT: br label [[FOR_COND1_FOR_END_CRIT_EDGE:%.*]]
-; CHECK: for.inc.split:
-; CHECK-NEXT: [[TMP1]] = add nuw nsw i32 [[LSR_IV]], 1
-; CHECK-NEXT: [[TMP2:%.*]] = icmp slt i32 [[LSR_IV]], 4
-; CHECK-NEXT: br i1 [[TMP2]], label [[FOR_BODY2]], label [[FOR_COND_FOR_END5_CRIT_EDGE:%.*]]
-; CHECK: for.cond1.for.end_crit_edge:
-; CHECK-NEXT: br label [[FOR_INC3]]
-; CHECK: for.inc3:
-; CHECK-NEXT: [[INC4]] = add nsw i32 [[INC41]], 1
-; CHECK-NEXT: br i1 false, label [[FOR_BODY]], label [[FOR_INC_SPLIT]]
-; CHECK: for.cond.for.end5_crit_edge:
-; CHECK-NEXT: ret void
-;
entry:
br label %for.body
ConstantInt *One =
    ConstantInt::get(TripCount->getType()->getContext(), APInt(32, 1));
bool IsZero =
    SE.isKnownPredicate(ICmpInst::ICMP_ULT, Numerator, CacheLineSize);
Wouldn't it be simpler / more obvious to check whether the result is zero directly?
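A minimal sketch of that suggestion (illustrative only; it assumes SCEV::isZero() and ScalarEvolution::getOne() are the right helpers in this context, and the patch ultimately switches to ceiling division instead, as discussed below):

  RefCost = SE.getUDivExpr(Numerator, CacheLineSize);
  // If the floored division came out as zero, clamp the cost to one cache
  // line, since any access touches at least one line.
  if (RefCost->isZero())
    RefCost = SE.getOne(WiderType);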
// be used. It does not make sense to output the result that zero cache line
// is used
RefCost =
    IsZero ? SE.getSCEV(One) : SE.getUDivExpr(Numerator, CacheLineSize);
I am not familiar with this code, but possibly what you are looking for is actually getUDivCeilSCEV? That is, round the division up instead of down.
Hi @nikic
Thank you for your reviews. I made both changes. I used getUDivCeilSCEV only when RefCost evaluates to zero. If I completely replace getUDivExpr with getUDivCeilSCEV, many tests fail because it will round all cost values, not just zeros, which is not the purpose of this PR.
Using getUDivCeil after checking for zero doesn't really make sense -- at that point, just directly setting it to one would be better.
That many tests require changes when using getUDivCeil is expected -- the question is whether those changes are incorrect or not. Without being familiar with the code, I would expect that the round-up division is what is correct in this context.
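Sketched out, the unconditional round-up replaces the floored division on the consecutive-access path (a sketch assuming ScalarEvolution::getUDivCeilSCEV is usable in this context; this mirrors the direction the patch takes below):

  const SCEV *Numerator = SE.getMulExpr(Stride, TripCount);
  // Ceiling division: accessing any part of a cache line pulls in the whole
  // line, so 12.5 lines becomes 13 and a sub-line access costs 1, not 0.
  RefCost = SE.getUDivCeilSCEV(Numerator, CacheLineSize);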
There are test case failures. Can you please check?
Force-pushed from 2b6692a to c2c97a4.
It's fixed now.
LGTM. Please check if other reviewers are okay with your updates, before you merge the patch.
Gently ping @nikic
Force-pushed from 7e4041a to 580b684.
@@ -8,6 +8,9 @@
; Check IndexedReference::computeRefCost can handle type differences between
; Stride and TripCount

; Round costs up to the nearest whole number i.e. in 'for.cond5' cost is calculated 12.5 and
@nikic I changed it to use getUDivCeil for all values as you suggested and made the required changes to the impacted tests, similar to this one.
This looks reasonable to me, but it's probably best if @CongzheUalberta takes another look.
RefCost = SE.getUDivExpr(Numerator, CacheLineSize);
// Round the costs up to the nearest whole number i.e. when cost is
// calculated 12.5, it makes more sense to say 13 cache lines are used
// rather than 12 cache lines.
// Round the fractional cost up to the nearest integer number. The impact is the most
// significant when cost is calculated to be a number less than one, because
// it makes more sense to say one cache line is used rather than zero cache line is used.
Please squash your commits, and also update your commit messages to reflect your most recent changes (you are now rounding up the cost, not only from 0 to 1). There's a typo in your commit message as well (e.g., "stride"), please double check.
Force-pushed from 580b684 to 73b2593.
Thank you for your reviews @nikic @CongzheUalberta
Thanks for the updates, LGTM.
Force-pushed from 7836418 to fca3278.
Force-pushed from fca3278 to f001e8f.
Would you mind waiting a few days for this patch, @RouzbehPaktinat? Most of the changes happen inside the PowerPC target. I will add some PowerPC reviewers and ping them for review. Thanks.
I didn't see anything in the PPC LIT changes that I think is alarming. I think this is okay.
Force-pushed from f001e8f to 7227448.
Thanks for checking, Roland.
Thanks very much for your patience.
Thank you all for the reviews. We'll merge it when Rouzbeh is back from vacation.
…o the nearest integer number. Currently loop cache analysis uses the following formula to evaluate the cost of a RefGroup for a consecutive memory access: "RefCost=(TripCount*Stride)/CLS". When the cost is fractional, it is always rounded down to the smaller integer number, but this is problematic especially when the cost is calculated to be a number less than one, which will be rounded to zero. It makes more sense to say one cache line is used in this case rather than zero cache lines. This patch fixes the problem by always rounding the cost up to the nearest larger integer number.
Force-pushed from 7227448 to 3e3fc24.
Thank you everyone for your reviews and comments.
@RouzbehPaktinat Congratulations on having your first Pull Request (PR) merged into the LLVM Project! Your changes will be combined with recent changes from other authors, then tested by the build bots. Please check whether problems have been caused by your change specifically, as the builds can include changes from many authors. How to do this, and the rest of the post-merge process, is covered in detail here. If your change does cause a problem, it may be reverted, or you can revert it yourself. If you don't get any reports, no action is required from you. Your changes are working as expected, well done!
Currently loop cache analysis uses the following formula to evaluate the cost of a RefGroup for a consecutive memory access:

RefCost = (TripCount * Stride) / CLS

This cost evaluates to zero when TripCount * Stride is smaller than the cache line size. This results in a wrong cost value for the loop and misleads loopInterchange decisions, as shown in this case. This patch fixes the problem by rounding the cost up to 1 when that happens.
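To see the arithmetic concretely, here is a standalone sketch (not part of the patch; the trip-count and stride values are illustrative, chosen against the 64-byte cache line used in the tests):

#include <cstdint>
#include <cstdio>

// RefCost = (TripCount * Stride) / CLS under floored vs. ceiling division.
static uint64_t floorCost(uint64_t TripCount, uint64_t Stride, uint64_t CLS) {
  return (TripCount * Stride) / CLS;
}
static uint64_t ceilCost(uint64_t TripCount, uint64_t Stride, uint64_t CLS) {
  return (TripCount * Stride + CLS - 1) / CLS;
}

int main() {
  // Two 4-byte stores with a 64-byte cache line: 8/64 floors to 0 lines,
  // but at least one line is touched, so the rounded-up cost is 1.
  std::printf("small:  floor=%llu ceil=%llu\n",
              (unsigned long long)floorCost(2, 4, 64),
              (unsigned long long)ceilCost(2, 4, 64));
  // A fractional case: 200 iterations * 4 bytes = 800 bytes = 12.5 lines,
  // which rounds up to 13 rather than down to 12.
  std::printf("larger: floor=%llu ceil=%llu\n",
              (unsigned long long)floorCost(200, 4, 64),
              (unsigned long long)ceilCost(200, 4, 64));
  return 0;
}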