
[AArch64][LoopVectorize] Enable tail-folding on neoverse-v2 #135357


Closed

Conversation

c-rhodes
Collaborator

This patch enables tail-folding of simple loops by default when targeting the neoverse-v2 CPU. This was done for neoverse-v1 in c7dbe32.

For SPEC2017 with "-Ofast -mcpu=neoverse-v2 -flto" this gives some small wins:

549.fotonik3d_r: ~3.2%
525.x264_r: ~2.7%
554.roms_r: ~1.2%

@llvmbot
Member

llvmbot commented Apr 11, 2025

@llvm/pr-subscribers-backend-aarch64

@llvm/pr-subscribers-llvm-transforms

Author: Cullen Rhodes (c-rhodes)


Full diff: https://github.com/llvm/llvm-project/pull/135357.diff

2 Files Affected:

  • (modified) llvm/lib/Target/AArch64/AArch64Subtarget.cpp (+2)
  • (modified) llvm/test/Transforms/LoopVectorize/AArch64/sve-tail-folding-option.ll (+2)
diff --git a/llvm/lib/Target/AArch64/AArch64Subtarget.cpp b/llvm/lib/Target/AArch64/AArch64Subtarget.cpp
index 7b4ded6322098..adee9899f7fd8 100644
--- a/llvm/lib/Target/AArch64/AArch64Subtarget.cpp
+++ b/llvm/lib/Target/AArch64/AArch64Subtarget.cpp
@@ -268,6 +268,8 @@ void AArch64Subtarget::initializeProperties(bool HasMinSize) {
     MaxBytesForLoopAlignment = 16;
     break;
   case NeoverseV2:
+    DefaultSVETFOpts = TailFoldingOpts::Simple;
+    LLVM_FALLTHROUGH;
   case NeoverseV3:
     EpilogueVectorizationMinVF = 8;
     MaxInterleaveFactor = 4;
diff --git a/llvm/test/Transforms/LoopVectorize/AArch64/sve-tail-folding-option.ll b/llvm/test/Transforms/LoopVectorize/AArch64/sve-tail-folding-option.ll
index 7dd0f0c0ad8e0..d2b8dd9c2be48 100644
--- a/llvm/test/Transforms/LoopVectorize/AArch64/sve-tail-folding-option.ll
+++ b/llvm/test/Transforms/LoopVectorize/AArch64/sve-tail-folding-option.ll
@@ -11,6 +11,8 @@
 ; RUN: opt < %s -passes=loop-vectorize -sve-tail-folding-insn-threshold=0 -S -sve-tail-folding=default -mcpu=neoverse-v1 | FileCheck %s -check-prefix=CHECK-NEOVERSE-V1
 ; RUN: opt < %s -passes=loop-vectorize -sve-tail-folding-insn-threshold=0 -S -mcpu=neoverse-v1 -sve-tail-folding=default | FileCheck %s -check-prefix=CHECK-NEOVERSE-V1
 ; RUN: opt < %s -passes=loop-vectorize -sve-tail-folding-insn-threshold=0 -S -mcpu=neoverse-v1 | FileCheck %s -check-prefix=CHECK-NEOVERSE-V1
+; Simple tail-folding is also enabled by default on neoverse-v2. Use same check prefix.
+; RUN: opt < %s -passes=loop-vectorize -sve-tail-folding-insn-threshold=0 -S -mcpu=neoverse-v2 | FileCheck %s -check-prefix=CHECK-NEOVERSE-V1
 
 target triple = "aarch64-unknown-linux-gnu"
 

@davemgreen davemgreen requested a review from sjoerdmeijer April 11, 2025 13:01
@davemgreen
Collaborator

I thought we decided not to do this because it was bad for performance in too many cases?

@c-rhodes
Collaborator Author

> I thought we decided not to do this because it was bad for performance in too many cases?

I wasn't aware this had been looked at previously, apologies. I see no regressions in SPEC2017 (int + fp), but I will do some further benchmarking to make sure.

@davemgreen
Collaborator

I should have given more details, but there are quite a few things going on with it, and some things could have changed since we last looked. As far as I understand, the vectorizer still decides very early whether to tail fold, and if we return true from preferPredicateOverEpilogue then we essentially force the vectorizer to predicate the loop it sees. For AArch64 I believe this currently still means forcing scalable vectorization (as masked loads/stores are not given a cheap cost for fixed-length vectors), and it forces the interleave factor to 1, as it is difficult to predicate loops well when they are also unrolled.

In recent times on Neoverse V2 there has been a push in the opposite direction. It has preferred fixed-width to scalable vectors when the costs are equal (#95819) and allowed larger vector bodies to make use of all the vector pipelines available on the V2 (#100385).

Tail predication has some efficiency bonuses of its own, especially for loops with low trip counts that are called often, but it makes it difficult to get the most out of the hardware for loops with high trip counts. Saturating 4 vector pipelines sometimes requires some interleaving and making sure that the predication does not become a bottleneck. So whilst this might help on certain benchmarks, it can hurt in other domains like ML, HPC and DSP. (We know, for example, that x264 has certain low-trip-count loops that can be helped by forcing the vectorizer to pick a lower trip count.) Currently there are some heuristics to disable tail folding for reductions, small loops and a few other cases, but the problem AFAIU isn't really predication+reductions or predication+small loops; it is that interleaving can be so important for performance in these loops.

The way GCC approaches this is to generate a fast unpredicated loop with some interleaving, and use a predicated remainder to handle the tail. In VPlan this would mean generating multiple vplans, with and without predication, and costing them against one another. So long as it had a way to detect bottlenecks in the loop, it should then be able to produce the big unpredicated vector body with a predicated remainder where that is beneficial, otherwise choosing to tail fold where that is more efficient. This requires the loop vectorizer to not opt into tail predication so early, which might still require some quite major surgery.

So maybe the tuning is just right for V2 and this doesn't conflict with #95819 and #100385, but I worry it will currently make some cases better and some (important, high-trip-count) cases worse, limiting the top-end performance when we want things to go as fast as they can.

@david-arm
Contributor

I can absolutely believe this is an overall win for SPEC2017, and for neoverse-v1 it made sense because the 256-bit vector length generally gave SVE an advantage anyway. However, as @davemgreen said, we're now effectively forcing the compiler to use SVE on neoverse-v2, where it no longer has the vector-length advantage. And as @davemgreen says, the ideal situation is to have an unpredicated main vector body where you are free to interleave (since interleaving is very expensive with tail-folding), followed by a predicated vector epilogue to handle the remainder. In the right circumstances the vector tail will not even be a loop, but a single iteration.

@c-rhodes
Collaborator Author

Thanks both for the detailed comments. It seems like this one is a bad idea, so I'm going to close it.

@c-rhodes c-rhodes closed this Apr 15, 2025