Skip to content

[LoopCacheAnalysis] Fix loop cache cost to always round the cost up to the nearest integer number #88915

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 1 commit into from
May 27, 2024

Conversation

RouzbehPaktinat
Copy link
Contributor

Currently loop cache analysis uses following formula to evaluate cost of an RefGroup for a consecutive memory access:

RefCost=(TripCount*Stride)/CLS

This cost evaluates to zero when TripCount*Stride is smaller than cache-line-size. This results in wrong cost value for a loop and misleads loopInterchange decisions as shown in this case.

This patch fixes the problem by rounding the cost to 1 once this problem happens.

Copy link

Thank you for submitting a Pull Request (PR) to the LLVM Project!

This PR will be automatically labeled and the relevant teams will be
notified.

If you wish to, you can add reviewers by using the "Reviewers" section on this page.

If this is not working for you, it is probably because you do not have write
permissions for the repository. In which case you can instead tag reviewers by
name in a comment by using @ followed by their GitHub username.

If you have received no comments on your PR for a week, you can request a review
by "ping"ing the PR by adding a comment “Ping”. The common courtesy "ping" rate
is once a week. Please remember that you are asking for valuable time from other developers.

If you have further questions, they may be answered by the LLVM GitHub User Guide.

You can also ask questions in a comment on this PR, on the LLVM Discord or on the forums.

@llvmbot
Copy link
Member

llvmbot commented Apr 16, 2024

@llvm/pr-subscribers-backend-powerpc
@llvm/pr-subscribers-llvm-analysis

@llvm/pr-subscribers-llvm-transforms

Author: Rouzbeh (RouzbehPaktinat)

Changes

Currently loop cache analysis uses following formula to evaluate cost of an RefGroup for a consecutive memory access:

RefCost=(TripCount*Stride)/CLS

This cost evaluates to zero when TripCount*Stride is smaller than cache-line-size. This results in wrong cost value for a loop and misleads loopInterchange decisions as shown in this case.

This patch fixes the problem by rounding the cost to 1 once this problem happens.


Full diff: https://github.com/llvm/llvm-project/pull/88915.diff

3 Files Affected:

  • (modified) llvm/lib/Analysis/LoopCacheAnalysis.cpp (+9-2)
  • (added) llvm/test/Analysis/LoopCacheAnalysis/interchange-cost-beneficial.ll (+62)
  • (modified) llvm/test/Transforms/LoopInterchange/pr43176-move-to-new-latch.ll (+25-68)
diff --git a/llvm/lib/Analysis/LoopCacheAnalysis.cpp b/llvm/lib/Analysis/LoopCacheAnalysis.cpp
index c3a56639b5c8f8..9d1878b4cd3085 100644
--- a/llvm/lib/Analysis/LoopCacheAnalysis.cpp
+++ b/llvm/lib/Analysis/LoopCacheAnalysis.cpp
@@ -299,8 +299,15 @@ CacheCostTy IndexedReference::computeRefCost(const Loop &L,
     Stride = SE.getNoopOrAnyExtend(Stride, WiderType);
     TripCount = SE.getNoopOrZeroExtend(TripCount, WiderType);
     const SCEV *Numerator = SE.getMulExpr(Stride, TripCount);
-    RefCost = SE.getUDivExpr(Numerator, CacheLineSize);
-
+    ConstantInt *One =
+        ConstantInt::get(TripCount->getType()->getContext(), APInt(32, 1));
+    bool IsZero =
+        SE.isKnownPredicate(ICmpInst::ICMP_ULT, Numerator, CacheLineSize);
+    // When result is zero, round it to one because at least one cache line must
+    // be used. It does not make sense to output the result that zero cache line
+    // is used
+    RefCost =
+        IsZero ? SE.getSCEV(One) : SE.getUDivExpr(Numerator, CacheLineSize);
     LLVM_DEBUG(dbgs().indent(4)
                << "Access is consecutive: RefCost=(TripCount*Stride)/CLS="
                << *RefCost << "\n");
diff --git a/llvm/test/Analysis/LoopCacheAnalysis/interchange-cost-beneficial.ll b/llvm/test/Analysis/LoopCacheAnalysis/interchange-cost-beneficial.ll
new file mode 100644
index 00000000000000..3086224c582048
--- /dev/null
+++ b/llvm/test/Analysis/LoopCacheAnalysis/interchange-cost-beneficial.ll
@@ -0,0 +1,62 @@
+; RUN: opt <  %s  -cache-line-size=64 -passes='print<loop-cache-cost>' -disable-output 2>&1 | FileCheck  %s
+
+;; This test checks the effect of rounding cache cost to 1 when it is 
+;; evaluated to 0 because at least 1 cache line is accessed by the loopnest.
+;; It does not make sense to output that zero cache lines are used.
+;; The cost of reference group for B[j], C[j], D[j] and E[j] were 
+;; calculted 0 before but now they are 1 which makes each loop cost more reasonable.
+;
+; void test(int n, int m, int o, int A[2][3], int B[2], int C[2], int D[2], int E[2]) {
+;   for (int i = 0; i < 3; i++)
+;     for (int j = 0; j < 2; j++)
+;        A[j][i] = 1;
+;        B[j] = 1;
+;        C[j] = 1;
+;        D[j] = 1
+;        E[j] = 1
+; }
+
+; CHECK: Loop 'for.j' has cost = 18
+; CHECK-NEXT: Loop 'for.i' has cost = 10
+
+define void @test(ptr %A, ptr %B, ptr %C, ptr %D, ptr %E) {
+
+entry:
+  br label %for.i.preheader.split
+
+for.i.preheader.split:                            ; preds = %for.i.preheader
+  br label %for.i
+
+for.i:                                            ; preds = %for.inci, %for.i.preheader.split
+  %i = phi i64 [ %inci, %for.inci ], [ 0, %for.i.preheader.split ]
+  br label %for.j
+
+for.j:                                            ; preds = %for.incj, %for.i
+  %j = phi i64 [ %incj, %for.j ], [ 0, %for.i ]
+  %mul_j = mul nsw i64 %j, 3
+  %index_j = add i64 %mul_j, %i
+  %arrayidxA = getelementptr inbounds [2 x [ 3 x i32]], ptr %A, i64 %j, i64 %i
+  store i32 1, ptr %arrayidxA, align 4
+  %arrayidxB = getelementptr inbounds i32, ptr %B, i64 %j
+  store i32 1, ptr %arrayidxB, align 4
+  %arrayidxC = getelementptr inbounds i32, ptr %C, i64 %j
+  store i32 1, ptr %arrayidxC, align 4
+  %arrayidxD = getelementptr inbounds i32, ptr %D, i64 %j
+  store i32 1, ptr %arrayidxD, align 4
+  %arrayidxE = getelementptr inbounds i32, ptr %E, i64 %j
+  store i32 1, ptr %arrayidxE, align 4
+  %incj = add nsw i64 %j, 1
+  %exitcond.us = icmp eq i64 %incj, 2
+  br i1 %exitcond.us, label %for.inci, label %for.j
+
+for.inci:                                         ; preds = %for.incj
+  %inci = add nsw i64 %i, 1
+  %exitcond55.us = icmp eq i64 %inci, 3
+  br i1 %exitcond55.us, label %for.end.loopexit, label %for.i
+
+for.end.loopexit:                                 ; preds = %for.inci
+  br label %for.end
+
+for.end:                                          ; preds = %for.end.loopexit, %for.cond1.preheader.lr.ph, %entry
+  ret void
+}
diff --git a/llvm/test/Transforms/LoopInterchange/pr43176-move-to-new-latch.ll b/llvm/test/Transforms/LoopInterchange/pr43176-move-to-new-latch.ll
index 965d95110da466..cc787fa55600a6 100644
--- a/llvm/test/Transforms/LoopInterchange/pr43176-move-to-new-latch.ll
+++ b/llvm/test/Transforms/LoopInterchange/pr43176-move-to-new-latch.ll
@@ -1,42 +1,25 @@
 ; NOTE: Assertions have been autogenerated by utils/update_test_checks.py
-; RUN: opt -passes=loop-interchange -cache-line-size=64 -verify-loop-lcssa -verify-dom-info -S %s | FileCheck %s
+; RUN: opt < %s -passes=loop-interchange -cache-line-size=64 -pass-remarks-missed='loop-interchange' -pass-remarks-output=%t -S
+; RUN: FileCheck --input-file=%t %s
 
 @b = external dso_local global [5 x i32], align 16
 
+;; Not profitable to interchange, because the access is invariant to j loop.
+;;
+;;  for(int i=0;i<4;i++) {
+;;    for(int j=1;j<4;j++) {
+;;      b[i] = ....
+;;    }
+;; }
+
+; CHECK: --- !Missed
+; CHECK-NEXT: Pass:            loop-interchange
+; CHECK-NEXT: Name:            InterchangeNotProfitable
+; CHECK-NEXT: Function:        test1
+; CHECK-NEXT: Args:
+; CHECK-NEXT:  - String:          Interchanging loops is not considered to improve cache locality nor vectorization.
+
 define void @test1() {
-; CHECK-LABEL: @test1(
-; CHECK-NEXT:  entry:
-; CHECK-NEXT:    br label [[FOR_BODY2_PREHEADER:%.*]]
-; CHECK:       for.body.preheader:
-; CHECK-NEXT:    br label [[FOR_BODY:%.*]]
-; CHECK:       for.body:
-; CHECK-NEXT:    [[INC41:%.*]] = phi i32 [ [[INC4:%.*]], [[FOR_INC3:%.*]] ], [ undef, [[FOR_BODY_PREHEADER:%.*]] ]
-; CHECK-NEXT:    [[IDXPROM:%.*]] = sext i32 [[INC41]] to i64
-; CHECK-NEXT:    [[ARRAYIDX:%.*]] = getelementptr inbounds [5 x i32], ptr @b, i64 0, i64 [[IDXPROM]]
-; CHECK-NEXT:    br label [[FOR_INC:%.*]]
-; CHECK:       for.body2.preheader:
-; CHECK-NEXT:    br label [[FOR_BODY2:%.*]]
-; CHECK:       for.body2:
-; CHECK-NEXT:    [[LSR_IV:%.*]] = phi i32 [ [[TMP1:%.*]], [[FOR_INC_SPLIT:%.*]] ], [ 1, [[FOR_BODY2_PREHEADER]] ]
-; CHECK-NEXT:    br label [[FOR_BODY_PREHEADER]]
-; CHECK:       for.inc:
-; CHECK-NEXT:    [[TMP0:%.*]] = load i32, ptr [[ARRAYIDX]], align 4
-; CHECK-NEXT:    store i32 undef, ptr [[ARRAYIDX]], align 4
-; CHECK-NEXT:    [[CMP:%.*]] = icmp slt i32 [[LSR_IV]], 4
-; CHECK-NEXT:    [[LSR_IV_NEXT:%.*]] = add nuw nsw i32 [[LSR_IV]], 1
-; CHECK-NEXT:    br label [[FOR_COND1_FOR_END_CRIT_EDGE:%.*]]
-; CHECK:       for.inc.split:
-; CHECK-NEXT:    [[TMP1]] = add nuw nsw i32 [[LSR_IV]], 1
-; CHECK-NEXT:    [[TMP2:%.*]] = icmp slt i32 [[LSR_IV]], 4
-; CHECK-NEXT:    br i1 [[TMP2]], label [[FOR_BODY2]], label [[FOR_COND_FOR_END5_CRIT_EDGE:%.*]]
-; CHECK:       for.cond1.for.end_crit_edge:
-; CHECK-NEXT:    br label [[FOR_INC3]]
-; CHECK:       for.inc3:
-; CHECK-NEXT:    [[INC4]] = add nsw i32 [[INC41]], 1
-; CHECK-NEXT:    br i1 false, label [[FOR_BODY]], label [[FOR_INC_SPLIT]]
-; CHECK:       for.cond.for.end5_crit_edge:
-; CHECK-NEXT:    ret void
-;
 entry:
   br label %for.body
 
@@ -68,41 +51,15 @@ for.cond.for.end5_crit_edge:                      ; preds = %for.inc3
   ret void
 }
 
+
+; CHECK: --- !Missed
+; CHECK-NEXT: Pass:            loop-interchange
+; CHECK-NEXT: Name:            InterchangeNotProfitable
+; CHECK-NEXT: Function:        test2
+; CHECK-NEXT: Args:
+; CHECK-NEXT:  - String:          Interchanging loops is not considered to improve cache locality nor vectorization.
+
 define void @test2() {
-; CHECK-LABEL: @test2(
-; CHECK-NEXT:  entry:
-; CHECK-NEXT:    br label [[FOR_BODY2_PREHEADER:%.*]]
-; CHECK:       for.body.preheader:
-; CHECK-NEXT:    br label [[FOR_BODY:%.*]]
-; CHECK:       for.body:
-; CHECK-NEXT:    [[INC41:%.*]] = phi i32 [ [[INC4:%.*]], [[FOR_INC3:%.*]] ], [ undef, [[FOR_BODY_PREHEADER:%.*]] ]
-; CHECK-NEXT:    [[IDXPROM:%.*]] = sext i32 [[INC41]] to i64
-; CHECK-NEXT:    [[ARRAYIDX:%.*]] = getelementptr inbounds [5 x i32], ptr @b, i64 0, i64 [[IDXPROM]]
-; CHECK-NEXT:    br label [[FOR_INC:%.*]]
-; CHECK:       for.body2.preheader:
-; CHECK-NEXT:    br label [[FOR_BODY2:%.*]]
-; CHECK:       for.body2:
-; CHECK-NEXT:    [[LSR_IV:%.*]] = phi i32 [ [[TMP1:%.*]], [[FOR_INC_SPLIT:%.*]] ], [ 1, [[FOR_BODY2_PREHEADER]] ]
-; CHECK-NEXT:    br label [[FOR_BODY_PREHEADER]]
-; CHECK:       for.inc:
-; CHECK-NEXT:    [[TMP0:%.*]] = load i32, ptr [[ARRAYIDX]], align 4
-; CHECK-NEXT:    [[CMP:%.*]] = icmp slt i32 [[LSR_IV]], 4
-; CHECK-NEXT:    [[CMP_ZEXT:%.*]] = zext i1 [[CMP]] to i32
-; CHECK-NEXT:    store i32 [[CMP_ZEXT]], ptr [[ARRAYIDX]], align 4
-; CHECK-NEXT:    [[LSR_IV_NEXT:%.*]] = add nuw nsw i32 [[LSR_IV]], 1
-; CHECK-NEXT:    br label [[FOR_COND1_FOR_END_CRIT_EDGE:%.*]]
-; CHECK:       for.inc.split:
-; CHECK-NEXT:    [[TMP1]] = add nuw nsw i32 [[LSR_IV]], 1
-; CHECK-NEXT:    [[TMP2:%.*]] = icmp slt i32 [[LSR_IV]], 4
-; CHECK-NEXT:    br i1 [[TMP2]], label [[FOR_BODY2]], label [[FOR_COND_FOR_END5_CRIT_EDGE:%.*]]
-; CHECK:       for.cond1.for.end_crit_edge:
-; CHECK-NEXT:    br label [[FOR_INC3]]
-; CHECK:       for.inc3:
-; CHECK-NEXT:    [[INC4]] = add nsw i32 [[INC41]], 1
-; CHECK-NEXT:    br i1 false, label [[FOR_BODY]], label [[FOR_INC_SPLIT]]
-; CHECK:       for.cond.for.end5_crit_edge:
-; CHECK-NEXT:    ret void
-;
 entry:
   br label %for.body
 

ConstantInt *One =
ConstantInt::get(TripCount->getType()->getContext(), APInt(32, 1));
bool IsZero =
SE.isKnownPredicate(ICmpInst::ICMP_ULT, Numerator, CacheLineSize);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wouldn't it be simpler / more obvious to check whether the result is zero directly?

// be used. It does not make sense to output the result that zero cache line
// is used
RefCost =
IsZero ? SE.getSCEV(One) : SE.getUDivExpr(Numerator, CacheLineSize);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not familiar with this code, but possibly what you are looking for is actually getUDivCeilSCEV? That is round the division up instead of down.

Copy link
Contributor Author

@RouzbehPaktinat RouzbehPaktinat Apr 17, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @nikic
Thank you for your reviews.
I made both changes. I used getUDivCeilSCEV only when refCost evaluates to zero. If I completely replace getUDivExpr with getUDivCeilSCEV many tests fail because it will round all cost values not just zeros which is not the purpose of this PR.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Using getUDivCeil after checking for zero doesn't really make sense -- at that point, just directly setting it to one would be better.

That many tests require changes when using getUDivCeil is expected -- the question is whether those changes are incorrect or not. Without being familiar with the code, I would expect that the round-up division is what is correct in this context.

Copy link
Contributor

@CongzheUalberta CongzheUalberta left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There are test case failures. Can you please check?

@RouzbehPaktinat RouzbehPaktinat force-pushed the fix-loop-cache-cost branch 2 times, most recently from 2b6692a to c2c97a4 Compare April 17, 2024 19:29
@RouzbehPaktinat
Copy link
Contributor Author

There are test case failures. Can you please check?

It's fixed now.

Copy link
Contributor

@CongzheUalberta CongzheUalberta left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. Please check if other reviewers are okay with your updates, before you merge the patch.

@RouzbehPaktinat
Copy link
Contributor Author

Gently ping @nikic

@@ -8,6 +8,9 @@
; Check IndexedReference::computeRefCost can handle type differences between
; Stride and TripCount

; Round costs up to the nearest whole number i.e. in 'for.cond5' cost is calculated 12.5 and
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@nikic
I changed it to use getUDivCeil for all values as you suggested and made required changes to the impacted tests similar to this one.

Copy link
Contributor

@nikic nikic left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks reasonable to me, but it's probably best if @CongzheUalberta takes another look.

RefCost = SE.getUDivExpr(Numerator, CacheLineSize);
// Round the costs up to the nearest whole number i.e. when cost is
// calculated 12.5, it makes more sense to say 13 cache lines are used
// rather than 12 cache lines.
Copy link
Contributor

@CongzheUalberta CongzheUalberta Apr 24, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

// Round the fractional cost up to the nearest integer number. The impact is the most
// significant when cost is calculated to be a number less than one, because
// it makes more sense to say one cache line is used rather than zero cache line is used.

@CongzheUalberta
Copy link
Contributor

Please squash your commits, and also update your commit messages to reflect your most recent changes (you are now rounding up the cost not only from 0 to 1). There's typo in your commit message as well (e.g., "stride"), please double check.

@RouzbehPaktinat RouzbehPaktinat changed the title Fix loop cache cost to avoid cost of zero for refgroups. [LoopCacheAnalysis] Fix loop cache cost to always round the cost up to the nearest integer number Apr 24, 2024
@RouzbehPaktinat
Copy link
Contributor Author

Thank you for your reviews @nikic @CongzheUalberta
I addressed new comments.

Copy link
Contributor

@CongzheUalberta CongzheUalberta left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the updates, LGTM.

@RouzbehPaktinat RouzbehPaktinat force-pushed the fix-loop-cache-cost branch 2 times, most recently from 7836418 to fca3278 Compare April 25, 2024 17:06
@Narutoworld Narutoworld force-pushed the fix-loop-cache-cost branch from fca3278 to f001e8f Compare April 29, 2024 18:12
Copy link
Collaborator

@chenzheng1030 chenzheng1030 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would you mind wait for some days for this patch? @RouzbehPaktinat
Most changes happen inside PowerPC target. I will add some PowerPC reviewers and ping them for review. Thanks.

@chenzheng1030 chenzheng1030 requested a review from scui-ibm April 30, 2024 08:53
@RolandF77
Copy link
Collaborator

I didn't see anything in the PPC LIT changes that I think is alarming. I think this is okay.

@Narutoworld Narutoworld force-pushed the fix-loop-cache-cost branch 2 times, most recently from f001e8f to 7227448 Compare May 3, 2024 00:39
@chenzheng1030
Copy link
Collaborator

I didn't see anything in the PPC LIT changes that I think is alarming. I think this is okay.

Thanks for checking, Roland.

Copy link
Collaborator

@chenzheng1030 chenzheng1030 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks very much for your patience.

@Narutoworld
Copy link
Contributor

Thank you all for the reviews. We'll merge it when Rouzbeh is back from vacation.

…o the nearest integer number.

Currently loop cache analysis uses following formula to evaluate cost of an RefGroup for a consecutive memory access: "RefCost=(TripCount*Stride)/CLS". When the cost is fractional, it's always rounded to the smaller integer number but this is problematic specially when the cost is calculated to be a number less than one which will be rounded to zero. But it makes more sense to say one cache line is used in this case rather than zero cache lines.

This patch fixes the problem by always rounding the cost up to the nearest larger integer number.
@Narutoworld Narutoworld force-pushed the fix-loop-cache-cost branch from 7227448 to 3e3fc24 Compare May 22, 2024 16:04
@RouzbehPaktinat
Copy link
Contributor Author

Thank you everyone for your reviews and comments.

@Narutoworld Narutoworld merged commit 6702594 into llvm:main May 27, 2024
4 checks passed
Copy link

@RouzbehPaktinat Congratulations on having your first Pull Request (PR) merged into the LLVM Project!

Your changes will be combined with recent changes from other authors, then tested
by our build bots. If there is a problem with a build, you may receive a report in an email or a comment on this PR.

Please check whether problems have been caused by your change specifically, as
the builds can include changes from many authors. It is not uncommon for your
change to be included in a build that fails due to someone else's changes, or
infrastructure issues.

How to do this, and the rest of the post-merge process, is covered in detail here.

If your change does cause a problem, it may be reverted, or you can revert it yourself.
This is a normal part of LLVM development. You can fix your changes and open a new PR to merge them again.

If you don't get any reports, no action is required from you. Your changes are working as expected, well done!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants