[LoopInterchange] Add an option to prioritize vectorization #131988


Merged
merged 4 commits into llvm:main on Mar 21, 2025

Conversation

kasuga-fj
Contributor

The LoopInterchange cost model consists of several decision rules. They are called one by one, and if some rule can determine the profitability, the subsequent rules aren't called. In the current implementation, the rule for CacheCostAnalysis is called first, and if it fails to determine the profitability, the rule for vectorization is called. However, there are cases where interchanging loops for vectorization makes the code faster even if the exchange is detrimental to the cache. For example, exchanging the inner two loops in the following code runs about 3x faster on my machine (compiled with -O3 -mcpu=neoverse-v2 -mllvm -cache-line-size=64), even though the interchange is rejected by the cache-cost rule. (NOTE: LoopInterchange currently cannot exchange these loops due to legality checks. This should also be improved.)

```c
__attribute__((aligned(64))) float aa[256][256],bb[256][256],cc[256][256],
                                   dd[256][256],ee[256][256],ff[256][256];

// Alternative of TSVC s231 with more array accesses than the original.
void s231_alternative() {
  for (int nl = 0; nl < 100*(100000/256); nl++) {
    for (int i = 0; i < 256; ++i) {
      for (int j = 1; j < 256; j++) {
        aa[j][i] = aa[j-1][i] + bb[j][i] + cc[i][j]
                 + dd[i][j] + ee[i][j] + ff[i][j];
      }
    }
  }
}
```

This patch introduces a new option to prioritize the vectorization rule over the cache cost rule.

Related issue: #131130

@llvmbot
Member

llvmbot commented Mar 19, 2025

@llvm/pr-subscribers-llvm-transforms

Author: Ryotaro Kasuga (kasuga-fj)

Changes

Full diff: https://github.com/llvm/llvm-project/pull/131988.diff

2 Files Affected:

  • (modified) llvm/lib/Transforms/Scalar/LoopInterchange.cpp (+49-13)
  • (added) llvm/test/Transforms/LoopInterchange/profitability-vectorization.ll (+81)
diff --git a/llvm/lib/Transforms/Scalar/LoopInterchange.cpp b/llvm/lib/Transforms/Scalar/LoopInterchange.cpp
index 61b86d2aecef0..364acb0ebc344 100644
--- a/llvm/lib/Transforms/Scalar/LoopInterchange.cpp
+++ b/llvm/lib/Transforms/Scalar/LoopInterchange.cpp
@@ -84,6 +84,11 @@ static cl::opt<unsigned int> MaxLoopNestDepth(
     "loop-interchange-max-loop-nest-depth", cl::init(10), cl::Hidden,
     cl::desc("Maximum depth of loop nest considered for the transform"));
 
+static cl::opt<bool> PrioritizeVectorization(
+    "loop-interchange-prioritize-vectorization", cl::init(false), cl::Hidden,
+    cl::desc("Prioritize increasing vectorization opportunity over cache cost "
+             "when determining profitability"));
+
 #ifndef NDEBUG
 static void printDepMatrix(CharMatrix &DepMatrix) {
   for (auto &Row : DepMatrix) {
@@ -1193,22 +1198,53 @@ bool LoopInterchangeProfitability::isProfitable(
     unsigned OuterLoopId, CharMatrix &DepMatrix,
     const DenseMap<const Loop *, unsigned> &CostMap,
     std::unique_ptr<CacheCost> &CC) {
-  // isProfitable() is structured to avoid endless loop interchange.
-  // If loop cache analysis could decide the profitability then,
-  // profitability check will stop and return the analysis result.
-  // If cache analysis failed to analyze the loopnest (e.g.,
-  // due to delinearization issues) then only check whether it is
-  // profitable for InstrOrderCost. Likewise, if InstrOrderCost failed to
-  // analysis the profitability then only, isProfitableForVectorization
-  // will decide.
-  std::optional<bool> shouldInterchange =
-      isProfitablePerLoopCacheAnalysis(CostMap, CC);
-  if (!shouldInterchange.has_value()) {
-    shouldInterchange = isProfitablePerInstrOrderCost();
-    if (!shouldInterchange.has_value())
+  // isProfitable() is structured to avoid endless loop interchange. If the
+  // highest priority rule (isProfitablePerLoopCacheAnalysis by default) could
+  // decide the profitability then, profitability check will stop and return the
+  // analysis result. If it failed to determine it (e.g., cache analysis failed
+  // to analyze the loopnest due to delinearization issues) then go ahead the
+  // second highest priority rule (isProfitablePerInstrOrderCost by default).
+  // Likewise, if it failed to analysis the profitability then only, the last
+  // rule (isProfitableForVectorization by default) will decide.
+  enum class RuleTy {
+    PerLoopCacheAnalysis,
+    PerInstrOrderCost,
+    ForVectorization,
+  };
+
+  // We prefer cache cost to vectorization by default.
+  RuleTy RuleOrder[3] = {RuleTy::PerLoopCacheAnalysis,
+                         RuleTy::PerInstrOrderCost, RuleTy::ForVectorization};
+
+  // If we prefer vectorization to cache cost, change the order of application
+  // of each rule.
+  if (PrioritizeVectorization) {
+    RuleOrder[0] = RuleTy::ForVectorization;
+    RuleOrder[1] = RuleTy::PerLoopCacheAnalysis;
+    RuleOrder[2] = RuleTy::PerInstrOrderCost;
+  }
+
+  std::optional<bool> shouldInterchange;
+  for (RuleTy RT : RuleOrder) {
+    switch (RT) {
+    case RuleTy::PerLoopCacheAnalysis:
+      shouldInterchange = isProfitablePerLoopCacheAnalysis(CostMap, CC);
+      break;
+    case RuleTy::PerInstrOrderCost:
+      shouldInterchange = isProfitablePerInstrOrderCost();
+      break;
+    case RuleTy::ForVectorization:
       shouldInterchange =
           isProfitableForVectorization(InnerLoopId, OuterLoopId, DepMatrix);
+      break;
+    }
+
+    // If this rule could determine the profitability, don't call subsequent
+    // rules.
+    if (shouldInterchange.has_value())
+      break;
   }
+
   if (!shouldInterchange.has_value()) {
     ORE->emit([&]() {
       return OptimizationRemarkMissed(DEBUG_TYPE, "InterchangeNotProfitable",
diff --git a/llvm/test/Transforms/LoopInterchange/profitability-vectorization.ll b/llvm/test/Transforms/LoopInterchange/profitability-vectorization.ll
new file mode 100644
index 0000000000000..0018aa0308f28
--- /dev/null
+++ b/llvm/test/Transforms/LoopInterchange/profitability-vectorization.ll
@@ -0,0 +1,81 @@
+; RUN: opt < %s -passes=loop-interchange -cache-line-size=64 \
+; RUN:     -pass-remarks-output=%t -disable-output
+; RUN: FileCheck -input-file %t --check-prefix=PROFIT-CACHE %s
+
+; RUN: opt < %s -passes=loop-interchange -cache-line-size=64 \
+; RUN:     -pass-remarks-output=%t -disable-output -loop-interchange-prioritize-vectorization=1
+; RUN: FileCheck -input-file %t --check-prefix=PROFIT-VEC %s
+
+@A = dso_local global [256 x [256 x float]] zeroinitializer
+@B = dso_local global [256 x [256 x float]] zeroinitializer
+@C = dso_local global [256 x [256 x float]] zeroinitializer
+@D = dso_local global [256 x [256 x float]] zeroinitializer
+@E = dso_local global [256 x [256 x float]] zeroinitializer
+@F = dso_local global [256 x [256 x float]] zeroinitializer
+
+; Check the behavior of the LoopInterchange cost-model. In the below code,
+; exchanging the loops is not profitable in terms of cache, but it is necessary
+; to vectorize the innermost loop.
+;
+; for (int i = 0; i < 256; i++)
+;   for (int j = 1; j < 256; j++)
+;     A[j][i] = A[j-1][i] + B[j][i] + C[i][j] + D[i][j] + E[i][j] + F[i][j];
+;
+
+; PROFIT-CACHE:      --- !Missed
+; PROFIT-CACHE-NEXT: Pass:            loop-interchange
+; PROFIT-CACHE-NEXT: Name:            InterchangeNotProfitable
+; PROFIT-CACHE-NEXT: Function:        f
+; PROFIT-CACHE-NEXT: Args:
+; PROFIT-CACHE-NEXT:   - String:          Interchanging loops is not considered to improve cache locality nor vectorization.
+; PROFIT-CACHE-NEXT: ...
+
+; PROFIT-VEC:      --- !Passed
+; PROFIT-VEC-NEXT: Pass:            loop-interchange
+; PROFIT-VEC-NEXT: Name:            Interchanged
+; PROFIT-VEC-NEXT: Function:        f
+; PROFIT-VEC-NEXT: Args:
+; PROFIT-VEC-NEXT:   - String:          Loop interchanged with enclosing loop.
+; PROFIT-VEC-NEXT: ...
+define void @f() {
+entry:
+  br label %for.i.header
+
+for.i.header:
+  %i = phi i64 [ 0, %entry ], [ %i.next, %for.i.inc ]
+  br label %for.j.body
+
+for.j.body:
+  %j = phi i64 [ 1, %for.i.header ], [ %j.next, %for.j.body ]
+  %j.dec = add nsw i64 %j, -1
+  %a.0.index = getelementptr nuw inbounds [256 x [256 x float]], ptr @A, i64 %j.dec, i64 %i
+  %b.index = getelementptr nuw inbounds [256 x [256 x float]], ptr @B, i64 %j, i64 %i
+  %c.index = getelementptr nuw inbounds [256 x [256 x float]], ptr @C, i64 %i, i64 %j
+  %d.index = getelementptr nuw inbounds [256 x [256 x float]], ptr @D, i64 %i, i64 %j
+  %e.index = getelementptr nuw inbounds [256 x [256 x float]], ptr @E, i64 %i, i64 %j
+  %f.index = getelementptr nuw inbounds [256 x [256 x float]], ptr @F, i64 %i, i64 %j
+  %a.0 = load float, ptr %a.0.index, align 4
+  %b = load float, ptr %b.index, align 4
+  %c = load float, ptr %c.index, align 4
+  %d = load float, ptr %d.index, align 4
+  %e = load float, ptr %e.index, align 4
+  %f = load float, ptr %f.index, align 4
+  %add.0 = fadd float %a.0, %b
+  %add.1 = fadd float %add.0, %c
+  %add.2 = fadd float %add.1, %d
+  %add.3 = fadd float %add.2, %e
+  %add.4 = fadd float %add.3, %f
+  %a.1.index = getelementptr nuw inbounds [256 x [256 x float]], ptr @A, i64 %j, i64 %i
+  store float %add.4, ptr %a.1.index, align 4
+  %j.next = add nuw nsw i64 %j, 1
+  %cmp.j = icmp eq i64 %j.next, 256
+  br i1 %cmp.j, label %for.i.inc, label %for.j.body
+
+for.i.inc:
+  %i.next = add nuw nsw i64 %i, 1
+  %cmp.i = icmp eq i64 %i.next, 256
+  br i1 %cmp.i, label %exit, label %for.i.header
+
+exit:
+  ret void
+}

Member

@Meinersbur Meinersbur left a comment

Not required for this review, but have you considered exposing the generality of the implementation to the user? Something like -loop-interchange-priorities=cache,instcost,vector (one could also intentionally remove one of them)?

Member

@Meinersbur Meinersbur left a comment
LGTM. If you would like to address my suggestion, feel free to create another PR.

@kasuga-fj
Contributor Author

Not required for this review, but have you considered exposing the generality of the implementation to the user? Something like -loop-interchange-priorities=cache,instcost,vector (one could also intentionally remove one of them)?

I had thought of passing it as a constructor argument like LoopUnroll does, but had not thought to add such an option. Supporting it sounds good to me, though, especially for the case where I want to disable some of the rules. Thanks for the idea; I'll add it.

@kasuga-fj
Contributor Author

Ah, the timing overlapped. So I'll create another PR later.

}

if (!shouldInterchange.has_value()) {
Contributor

I think you will need to update various other places to avoid build failures?

Contributor Author

@kasuga-fj kasuga-fj Mar 19, 2025

I pressed the wrong button on GitHub...

@sjoerdmeijer
Collaborator

Sorry for picking this up late. I don't have any problems with this patch, and it looks good as a first step, but it is not ideal that it relies on a user option. I need to look again at how this works, but at the risk of saying something silly: why is this not an "or"? That is, if interchange is profitable for cache or vectorisation, do it?

@Meinersbur
Member

Meinersbur commented Mar 19, 2025

@sjoerdmeijer Each cost concern can also return "the order is optimal as-is; if it were in a different order I would suggest changing it". Let's say vectorization prefers to swap, but cache analysis wants to keep the current order. Since we are in a loop that runs until a fixpoint is reached¹, we would alternate: vectorization wants to swap, then cache analysis swaps back, then vectorization wants to swap again, and so on.

Changing the default cost model is difficult, as it must not cause major regressions, or users will rightfully complain that their code got slower. Ideally, the compiler could determine whether, e.g., vectorization or a better cache access pattern will enable the larger speedup. If you know a heuristic for determining that, we would be very interested.

Footnotes

  1. Actually, a bubble sort; so theoretically there is a maximum number of iterations if the compare function is consistent. 'or' would not be a consistent compare function, just as [](auto lhs, auto rhs) { return lhs.first < rhs.first || lhs.second < rhs.second; } would not be a consistent compare function for tuples.

@sjoerdmeijer
Collaborator

Thanks for explaining, @Meinersbur , makes sense.

So maybe we need to review what the better default is, but that is for another day. Same for a heuristic. My instinct says vectorisation would be more beneficial, but maybe I am biased from looking more at compute-bound problems.

@kasuga-fj
Contributor Author

Ideally, the compiler could determine whether, e.g., vectorization or a better cache access pattern will enable the larger speedup. If you know a heuristic for determining that, we would be very interested.

I completely agree with this, and I don't have any idea at this moment.

My instinct says vectorisation would be more beneficial, but maybe I am biased from looking more at compute bound problems.

There are artificial (maybe not realistic) and less interesting cases where cache is more advantageous than vectorization, as follows.

```c
// Swapping the innermost two loops makes it possible to vectorize the
// innermost one. However, it also has a negative impact on the cache.
for (int nl = 0; nl < 100*(100000/256); nl++) {
  for (int i = 0; i < 256; ++i) {
    for (int j = 1; j < 256; j++) {
      aa[j][i] = aa[j-1][i] + bb[j][i] + cc[i][j]
               + dd[i][j] + ee[i][j] + ff[i][j]
               + gg[i][j] + hh[i][j] + kk[i][j] + ...;  // More additions.
    }
  }
}
```

@kasuga-fj kasuga-fj merged commit 17b202f into llvm:main Mar 21, 2025
11 checks passed
@llvm-ci
Collaborator

llvm-ci commented Mar 21, 2025

LLVM Buildbot has detected a new failure on builder lldb-aarch64-ubuntu running on linaro-lldb-aarch64-ubuntu while building llvm at step 6 "test".

Full details are available at: https://lab.llvm.org/buildbot/#/builders/59/builds/14681

Here is the relevant piece of the build log for reference:
Step 6 (test) failure: build (failure)
...
PASS: lldb-api :: lang/cpp/stl/TestSTL.py (862 of 2109)
XFAIL: lldb-api :: lang/cpp/thread_local/TestThreadLocal.py (863 of 2109)
PASS: lldb-api :: lang/cpp/typedef-to-outer-fwd/TestTypedefToOuterFwd.py (864 of 2109)
PASS: lldb-api :: lang/cpp/type_lookup_duplicate/TestCppTypeLookupDuplicate.py (865 of 2109)
XFAIL: lldb-api :: lang/cpp/trivial_abi/TestTrivialABI.py (866 of 2109)
PASS: lldb-api :: lang/cpp/type_lookup/TestCppTypeLookup.py (867 of 2109)
PASS: lldb-api :: lang/cpp/typeof/TestTypeOfDeclTypeExpr.py (868 of 2109)
PASS: lldb-api :: lang/cpp/thunk/TestThunk.py (869 of 2109)
PASS: lldb-api :: lang/cpp/typedef/TestCppTypedef.py (870 of 2109)
PASS: lldb-api :: lang/cpp/template/TestTemplateArgs.py (871 of 2109)
FAIL: lldb-api :: lang/cpp/unique-types/TestUniqueTypes.py (872 of 2109)
******************** TEST 'lldb-api :: lang/cpp/unique-types/TestUniqueTypes.py' FAILED ********************
Script:
--
/usr/bin/python3.10 /home/tcwg-buildbot/worker/lldb-aarch64-ubuntu/llvm-project/lldb/test/API/dotest.py -u CXXFLAGS -u CFLAGS --env LLVM_LIBS_DIR=/home/tcwg-buildbot/worker/lldb-aarch64-ubuntu/build/./lib --env LLVM_INCLUDE_DIR=/home/tcwg-buildbot/worker/lldb-aarch64-ubuntu/build/include --env LLVM_TOOLS_DIR=/home/tcwg-buildbot/worker/lldb-aarch64-ubuntu/build/./bin --arch aarch64 --build-dir /home/tcwg-buildbot/worker/lldb-aarch64-ubuntu/build/lldb-test-build.noindex --lldb-module-cache-dir /home/tcwg-buildbot/worker/lldb-aarch64-ubuntu/build/lldb-test-build.noindex/module-cache-lldb/lldb-api --clang-module-cache-dir /home/tcwg-buildbot/worker/lldb-aarch64-ubuntu/build/lldb-test-build.noindex/module-cache-clang/lldb-api --executable /home/tcwg-buildbot/worker/lldb-aarch64-ubuntu/build/./bin/lldb --compiler /home/tcwg-buildbot/worker/lldb-aarch64-ubuntu/build/./bin/clang --dsymutil /home/tcwg-buildbot/worker/lldb-aarch64-ubuntu/build/./bin/dsymutil --make /usr/bin/gmake --llvm-tools-dir /home/tcwg-buildbot/worker/lldb-aarch64-ubuntu/build/./bin --lldb-obj-root /home/tcwg-buildbot/worker/lldb-aarch64-ubuntu/build/tools/lldb --lldb-libs-dir /home/tcwg-buildbot/worker/lldb-aarch64-ubuntu/build/./lib /home/tcwg-buildbot/worker/lldb-aarch64-ubuntu/llvm-project/lldb/test/API/lang/cpp/unique-types -p TestUniqueTypes.py
--
Exit Code: -11

Command Output (stdout):
--
lldb version 21.0.0git (https://github.com/llvm/llvm-project.git revision 17b202fc17f2b4b1df3e8fc842226597a0ed364e)
  clang revision 17b202fc17f2b4b1df3e8fc842226597a0ed364e
  llvm revision 17b202fc17f2b4b1df3e8fc842226597a0ed364e
Skipping the following test categories: ['libc++', 'dsym', 'gmodules', 'debugserver', 'objc']

--
Command Output (stderr):
--
UNSUPPORTED: LLDB (/home/tcwg-buildbot/worker/lldb-aarch64-ubuntu/build/bin/clang-aarch64) :: test_dsym (TestUniqueTypes.UniqueTypesTestCase) (test case does not fall in any category of interest for this run) 
PASS: LLDB (/home/tcwg-buildbot/worker/lldb-aarch64-ubuntu/build/bin/clang-aarch64) :: test_dwarf (TestUniqueTypes.UniqueTypesTestCase)
PASS: LLDB (/home/tcwg-buildbot/worker/lldb-aarch64-ubuntu/build/bin/clang-aarch64) :: test_dwo (TestUniqueTypes.UniqueTypesTestCase)
----------------------------------------------------------------------
Ran 3 tests in 0.969s

OK (skipped=1)

--

********************
PASS: lldb-api :: lang/cpp/unique-types2/TestUniqueTypes2.py (873 of 2109)
PASS: lldb-api :: lang/cpp/unsigned_types/TestUnsignedTypes.py (874 of 2109)
PASS: lldb-api :: lang/cpp/unique-types3/TestUniqueTypes3.py (875 of 2109)
PASS: lldb-api :: lang/cpp/unicode-literals/TestUnicodeLiterals.py (876 of 2109)
PASS: lldb-api :: lang/cpp/unique-types4/TestUniqueTypes4.py (877 of 2109)
PASS: lldb-api :: lang/cpp/union-static-data-members/TestCppUnionStaticMembers.py (878 of 2109)
PASS: lldb-api :: lang/cpp/virtual-overload/TestVirtualOverload.py (879 of 2109)
UNSUPPORTED: lldb-api :: lang/objc/bitfield_ivars/TestBitfieldIvars.py (880 of 2109)
UNSUPPORTED: lldb-api :: lang/objc/blocks/TestObjCIvarsInBlocks.py (881 of 2109)
UNSUPPORTED: lldb-api :: lang/objc/charstar_dyntype/TestCharStarDynType.py (882 of 2109)
