
[AArch64][LoopVectorize] Enable tail-folding on neoverse-v2 #135357


Closed

Conversation

c-rhodes
Collaborator

This patch enables tail-folding of simple loops by default when targeting the neoverse-v2 CPU. This was done for neoverse-v1 in c7dbe32.

For SPEC2017 with "-Ofast -mcpu=neoverse-v2 -flto" this gives some small wins:

549.fotonik3d_r: ~3.2%
525.x264_r: ~2.7%
554.roms_r: ~1.2%

@llvmbot
Member

llvmbot commented Apr 11, 2025

@llvm/pr-subscribers-backend-aarch64

@llvm/pr-subscribers-llvm-transforms

Author: Cullen Rhodes (c-rhodes)


Full diff: https://github.com/llvm/llvm-project/pull/135357.diff

2 Files Affected:

  • (modified) llvm/lib/Target/AArch64/AArch64Subtarget.cpp (+2)
  • (modified) llvm/test/Transforms/LoopVectorize/AArch64/sve-tail-folding-option.ll (+2)
diff --git a/llvm/lib/Target/AArch64/AArch64Subtarget.cpp b/llvm/lib/Target/AArch64/AArch64Subtarget.cpp
index 7b4ded6322098..adee9899f7fd8 100644
--- a/llvm/lib/Target/AArch64/AArch64Subtarget.cpp
+++ b/llvm/lib/Target/AArch64/AArch64Subtarget.cpp
@@ -268,6 +268,8 @@ void AArch64Subtarget::initializeProperties(bool HasMinSize) {
     MaxBytesForLoopAlignment = 16;
     break;
   case NeoverseV2:
+    DefaultSVETFOpts = TailFoldingOpts::Simple;
+    LLVM_FALLTHROUGH;
   case NeoverseV3:
     EpilogueVectorizationMinVF = 8;
     MaxInterleaveFactor = 4;
diff --git a/llvm/test/Transforms/LoopVectorize/AArch64/sve-tail-folding-option.ll b/llvm/test/Transforms/LoopVectorize/AArch64/sve-tail-folding-option.ll
index 7dd0f0c0ad8e0..d2b8dd9c2be48 100644
--- a/llvm/test/Transforms/LoopVectorize/AArch64/sve-tail-folding-option.ll
+++ b/llvm/test/Transforms/LoopVectorize/AArch64/sve-tail-folding-option.ll
@@ -11,6 +11,8 @@
 ; RUN: opt < %s -passes=loop-vectorize -sve-tail-folding-insn-threshold=0 -S -sve-tail-folding=default -mcpu=neoverse-v1 | FileCheck %s -check-prefix=CHECK-NEOVERSE-V1
 ; RUN: opt < %s -passes=loop-vectorize -sve-tail-folding-insn-threshold=0 -S -mcpu=neoverse-v1 -sve-tail-folding=default | FileCheck %s -check-prefix=CHECK-NEOVERSE-V1
 ; RUN: opt < %s -passes=loop-vectorize -sve-tail-folding-insn-threshold=0 -S -mcpu=neoverse-v1 | FileCheck %s -check-prefix=CHECK-NEOVERSE-V1
+; Simple tail-folding is also enabled by default on neoverse-v2. Use same check prefix.
+; RUN: opt < %s -passes=loop-vectorize -sve-tail-folding-insn-threshold=0 -S -mcpu=neoverse-v2 | FileCheck %s -check-prefix=CHECK-NEOVERSE-V1
 
 target triple = "aarch64-unknown-linux-gnu"
 

@davemgreen davemgreen requested a review from sjoerdmeijer April 11, 2025 13:01
@davemgreen
Collaborator

I thought we decided not to do this because it was bad for performance in too many cases?

@c-rhodes
Collaborator Author

> I thought we decided not to do this because it was bad for performance in too many cases?

I wasn't aware this had been looked at previously, apologies. I see no regressions in SPEC2017 (int + fp), but I will do some further benchmarking to make sure.

@davemgreen
Collaborator

I should have given more details, but there are quite a few things going on with it, and some things could have changed since we last looked. As far as I understand, the vectorizer still decides very early whether to tail fold, and if we return true from preferPredicateOverEpilogue then we essentially force the vectorizer to predicate the loop it sees. For AArch64 I believe this currently still means forcing scalable vectorization (as masked loads/stores are not given a cheap cost for fixed-length vectors), and it forces the interleave factor to 1, as it is difficult to predicate loops well when they are also unrolled.

In recent times on Neoverse V2 there has been a push in the opposite direction. It has preferred fixed-width to scalable vectors when the costs are equal (#95819) and allowed larger vector bodies to make use of all the vector pipelines available on the V2 (#100385).

Tail predication has some efficiency bonuses of its own, especially for loops with low trip counts that are called often, but it makes it difficult to get the most out of the hardware for loops with high trip counts. Saturating 4 vector pipelines sometimes requires some interleaving and making sure that the predication does not become a bottleneck. So whilst this might help on certain benchmarks, it can hurt in other domains like ML, HPC and DSP. (We know, for example, that x264 has certain low-trip-count loops that can be helped by forcing the vectorizer to pick a lower trip count.) Currently there are some heuristics to disable tail folding for reductions, small loops and a few other cases, but the problem AFAIU isn't really predication+reductions or predication+small loops; it is that interleaving can be so important for performance in these loops.

The way GCC approaches this is to generate a fast unpredicated loop with some interleaving, and use a predicated remainder to handle the tail. In VPlan this would mean generating multiple vplans, with and without predication, and costing them against one another. So long as it had a way to detect bottlenecks in the loop, it should then be able to produce the big unpredicated vector body with a predicated remainder where that is beneficial, otherwise choosing to tail fold where that is more efficient. This requires the loop vectorizer to not opt into tail predication so early, which might still require some quite major surgery.

So maybe the tuning is just right for V2 and this doesn't conflict with #95819 and #100385, but I worry it will currently make some cases better and some (important, high-trip-count) cases worse, limiting the top-end performance when we want things to go as fast as they can.

@david-arm
Contributor

I can absolutely believe this is an overall win for SPEC2017, and for neoverse-v1 it made sense because the 256-bit vector length generally gave SVE an advantage anyway. However, as @davemgreen said, we're now effectively forcing the compiler to use SVE on neoverse-v2, where it no longer has the vector-length advantage. And as @davemgreen says, the ideal situation is to have an unpredicated main vector body where you are free to interleave (since interleaving is very expensive with tail-folding), followed by a predicated vector epilogue to handle the remainder. In the right circumstances the vector tail will not even be a loop, but a single iteration.

@c-rhodes
Collaborator Author

Thanks both for the detailed comments. It seems like this one is a bad idea, so I'm going to close it.

@c-rhodes c-rhodes closed this Apr 15, 2025