
[AArch64] Add an AArch64 pass for loop idiom transformations #72273


Merged · 8 commits · Jan 9, 2024

Conversation

david-arm (Contributor) commented Nov 14, 2023

We have added a new pass that looks for loops such as the following:

  while (i != max_len)
      if (a[i] != b[i])
          break;

  ... use index i ...

Although similar to a memcmp, this is slightly different because instead of returning the difference between the values of the first non-matching pair of bytes, it returns the index of the first mismatch. As such, we are not able to lower this to a memcmp call.
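To make the semantic difference concrete, here is a minimal scalar sketch of the idiom (the function name is ours, not part of the patch): it returns the index of the first mismatching byte, whereas memcmp reports the sign of the byte difference.

```c
#include <stddef.h>

/* Scalar sketch of the idiom: return the index of the first mismatching
 * byte, or max_len if the first max_len bytes are identical. The result
 * is an index, not the byte difference that memcmp would report. */
static size_t first_mismatch(const unsigned char *a, const unsigned char *b,
                             size_t max_len) {
  size_t i = 0;
  while (i != max_len) {
    if (a[i] != b[i])
      break;
    i++;
  }
  return i;
}
```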

The new pass spots such idioms and transforms them into a specialised predicated loop that gives a significant performance improvement on AArch64. It is intended as a stop-gap solution until this can be handled by the vectoriser, which does not currently deal with early exits.

This specialised loop makes use of a generic intrinsic that counts the trailing zero elements in a predicate vector. This was added in https://reviews.llvm.org/D159283 and for SVE we end up with brkb & incp instructions.

Although we have added this pass only for AArch64, it was written in a generic way so that in theory it could be used by other targets. Currently the pass requires scalable vector support and needs to know the minimum page size for the target; however, it's possible to make it work for fixed-width vectors too. Also, the llvm.experimental.cttz.elts intrinsic used by the pass has a generic lowering, but can be made efficient for targets with instructions similar to SVE's brkb, cntp and incp.
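As a rough mental model (a scalar sketch, not the actual SVE lowering), llvm.experimental.cttz.elts counts the inactive lanes that precede the first active lane of a predicate — which is exactly the within-vector index of the first mismatch:

```c
/* Scalar sketch of llvm.experimental.cttz.elts on a predicate: count the
 * zero elements before the first set element, scanning from lane 0.
 * Returns n when no lane is set (i.e. the zero-is-poison flag is false). */
static unsigned cttz_elts(const _Bool *pred, unsigned n) {
  unsigned i = 0;
  while (i < n && !pred[i])
    i++;
  return i;
}
```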

The original version of this patch was posted on Phabricator:

https://reviews.llvm.org/D158291

Patch co-authored by Kerry McLaughlin (@kmclaughlin-arm) and David Sherwood (@david-arm)

See the original discussion on Discourse:
https://discourse.llvm.org/t/aarch64-target-specific-loop-idiom-recognition/72383

@llvmbot llvmbot added backend:AArch64 llvm:analysis Includes value tracking, cost tables and constant folding llvm:transforms labels Nov 14, 2023
llvmbot (Member) commented Nov 14, 2023

@llvm/pr-subscribers-llvm-analysis

@llvm/pr-subscribers-llvm-transforms

Author: David Sherwood (david-arm)

Changes

(PR description as above.)


Patch is 138.04 KiB, truncated to 20.00 KiB below, full version: https://github.com/llvm/llvm-project/pull/72273.diff

12 Files Affected:

  • (modified) llvm/include/llvm/Analysis/TargetTransformInfo.h (+8)
  • (modified) llvm/include/llvm/Analysis/TargetTransformInfoImpl.h (+2)
  • (modified) llvm/lib/Analysis/TargetTransformInfo.cpp (+4)
  • (modified) llvm/lib/Target/AArch64/AArch64.h (+1)
  • (added) llvm/lib/Target/AArch64/AArch64LoopIdiomTransform.cpp (+726)
  • (added) llvm/lib/Target/AArch64/AArch64LoopIdiomTransform.h (+25)
  • (modified) llvm/lib/Target/AArch64/AArch64TargetMachine.cpp (+10)
  • (modified) llvm/lib/Target/AArch64/AArch64TargetMachine.h (+3)
  • (modified) llvm/lib/Target/AArch64/AArch64TargetTransformInfo.h (+2)
  • (modified) llvm/lib/Target/AArch64/CMakeLists.txt (+1)
  • (added) llvm/test/Transforms/LoopIdiom/AArch64/byte-compare-index.ll (+1640)
  • (modified) llvm/utils/gn/secondary/llvm/lib/Target/AArch64/BUILD.gn (+1)
diff --git a/llvm/include/llvm/Analysis/TargetTransformInfo.h b/llvm/include/llvm/Analysis/TargetTransformInfo.h
index c18e0acdb759a8d..10d178a73b0fcdd 100644
--- a/llvm/include/llvm/Analysis/TargetTransformInfo.h
+++ b/llvm/include/llvm/Analysis/TargetTransformInfo.h
@@ -1155,6 +1155,9 @@ class TargetTransformInfo {
   /// \return The associativity of the cache level, if available.
   std::optional<unsigned> getCacheAssociativity(CacheLevel Level) const;
 
+  /// \return The minimum architectural page size for the target.
+  std::optional<unsigned> getMinPageSize() const;
+
   /// \return How much before a load we should place the prefetch
   /// instruction.  This is currently measured in number of
   /// instructions.
@@ -1889,6 +1892,7 @@ class TargetTransformInfo::Concept {
   virtual std::optional<unsigned> getCacheSize(CacheLevel Level) const = 0;
   virtual std::optional<unsigned> getCacheAssociativity(CacheLevel Level)
       const = 0;
+  virtual std::optional<unsigned> getMinPageSize() const = 0;
 
   /// \return How much before a load we should place the prefetch
   /// instruction.  This is currently measured in number of
@@ -2475,6 +2479,10 @@ class TargetTransformInfo::Model final : public TargetTransformInfo::Concept {
     return Impl.getCacheAssociativity(Level);
   }
 
+  std::optional<unsigned> getMinPageSize() const override {
+    return Impl.getMinPageSize();
+  }
+
   /// Return the preferred prefetch distance in terms of instructions.
   ///
   unsigned getPrefetchDistance() const override {
diff --git a/llvm/include/llvm/Analysis/TargetTransformInfoImpl.h b/llvm/include/llvm/Analysis/TargetTransformInfoImpl.h
index 2ccf57c22234f9a..13030cb9fe46825 100644
--- a/llvm/include/llvm/Analysis/TargetTransformInfoImpl.h
+++ b/llvm/include/llvm/Analysis/TargetTransformInfoImpl.h
@@ -494,6 +494,8 @@ class TargetTransformInfoImplBase {
     llvm_unreachable("Unknown TargetTransformInfo::CacheLevel");
   }
 
+  std::optional<unsigned> getMinPageSize() const { return {}; }
+
   unsigned getPrefetchDistance() const { return 0; }
   unsigned getMinPrefetchStride(unsigned NumMemAccesses,
                                 unsigned NumStridedMemAccesses,
diff --git a/llvm/lib/Analysis/TargetTransformInfo.cpp b/llvm/lib/Analysis/TargetTransformInfo.cpp
index caa9b17ae695e49..dfe8d004bb97899 100644
--- a/llvm/lib/Analysis/TargetTransformInfo.cpp
+++ b/llvm/lib/Analysis/TargetTransformInfo.cpp
@@ -753,6 +753,10 @@ TargetTransformInfo::getCacheAssociativity(CacheLevel Level) const {
   return TTIImpl->getCacheAssociativity(Level);
 }
 
+std::optional<unsigned> TargetTransformInfo::getMinPageSize() const {
+  return TTIImpl->getMinPageSize();
+}
+
 unsigned TargetTransformInfo::getPrefetchDistance() const {
   return TTIImpl->getPrefetchDistance();
 }
diff --git a/llvm/lib/Target/AArch64/AArch64.h b/llvm/lib/Target/AArch64/AArch64.h
index 901769c54b6ef59..d20ef63a72e8f62 100644
--- a/llvm/lib/Target/AArch64/AArch64.h
+++ b/llvm/lib/Target/AArch64/AArch64.h
@@ -88,6 +88,7 @@ void initializeAArch64DeadRegisterDefinitionsPass(PassRegistry&);
 void initializeAArch64ExpandPseudoPass(PassRegistry &);
 void initializeAArch64GlobalsTaggingPass(PassRegistry &);
 void initializeAArch64LoadStoreOptPass(PassRegistry&);
+void initializeAArch64LoopIdiomTransformLegacyPassPass(PassRegistry &);
 void initializeAArch64LowerHomogeneousPrologEpilogPass(PassRegistry &);
 void initializeAArch64MIPeepholeOptPass(PassRegistry &);
 void initializeAArch64O0PreLegalizerCombinerPass(PassRegistry &);
diff --git a/llvm/lib/Target/AArch64/AArch64LoopIdiomTransform.cpp b/llvm/lib/Target/AArch64/AArch64LoopIdiomTransform.cpp
new file mode 100644
index 000000000000000..c4d2529a3e53e7d
--- /dev/null
+++ b/llvm/lib/Target/AArch64/AArch64LoopIdiomTransform.cpp
@@ -0,0 +1,726 @@
+
+//===- AArch64LoopIdiomTransform.cpp - Loop idiom recognition -------------===//
+//
+// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.
+// See https://llvm.org/LICENSE.txt for license information.
+// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
+//
+//===----------------------------------------------------------------------===//
+
+#include "AArch64LoopIdiomTransform.h"
+#include "llvm/Analysis/DomTreeUpdater.h"
+#include "llvm/Analysis/LoopPass.h"
+#include "llvm/Analysis/TargetTransformInfo.h"
+#include "llvm/IR/Dominators.h"
+#include "llvm/IR/IRBuilder.h"
+#include "llvm/IR/Intrinsics.h"
+#include "llvm/IR/MDBuilder.h"
+#include "llvm/IR/PatternMatch.h"
+#include "llvm/InitializePasses.h"
+#include "llvm/Transforms/Utils/BasicBlockUtils.h"
+
+using namespace llvm;
+
+#define DEBUG_TYPE "aarch64-lit"
+
+static cl::opt<bool>
+    DisableAll("disable-aarch64-lit-all", cl::Hidden, cl::init(false),
+               cl::desc("Disable AArch64 Loop Idiom Transform Pass."));
+
+static cl::opt<bool> DisableByteCmp(
+    "disable-aarch64-lit-bytecmp", cl::Hidden, cl::init(false),
+    cl::desc("Proceed with AArch64 Loop Idiom Transform Pass, but do "
+             "not convert byte-compare loop(s)."));
+
+namespace llvm {
+
+void initializeAArch64LoopIdiomTransformLegacyPassPass(PassRegistry &);
+Pass *createAArch64LoopIdiomTransformPass();
+
+} // end namespace llvm
+
+namespace {
+
+class AArch64LoopIdiomTransform {
+  Loop *CurLoop = nullptr;
+  DominatorTree *DT;
+  LoopInfo *LI;
+  const TargetTransformInfo *TTI;
+  const DataLayout *DL;
+
+public:
+  explicit AArch64LoopIdiomTransform(DominatorTree *DT, LoopInfo *LI,
+                                     const TargetTransformInfo *TTI,
+                                     const DataLayout *DL)
+      : DT(DT), LI(LI), TTI(TTI), DL(DL) {}
+
+  bool run(Loop *L);
+
+private:
+  /// \name Countable Loop Idiom Handling
+  /// @{
+
+  bool runOnCountableLoop();
+  bool runOnLoopBlock(BasicBlock *BB, const SCEV *BECount,
+                      SmallVectorImpl<BasicBlock *> &ExitBlocks);
+
+  bool recognizeByteCompare();
+  Value *expandFindMismatch(IRBuilder<> &Builder, GetElementPtrInst *GEPA,
+                            GetElementPtrInst *GEPB, Value *Start,
+                            Value *MaxLen);
+  void transformByteCompare(GetElementPtrInst *GEPA, GetElementPtrInst *GEPB,
+                            Value *MaxLen, Value *Index, Value *Start,
+                            bool IncIdx, BasicBlock *FoundBB,
+                            BasicBlock *EndBB);
+  /// @}
+};
+
+class AArch64LoopIdiomTransformLegacyPass : public LoopPass {
+public:
+  static char ID;
+
+  explicit AArch64LoopIdiomTransformLegacyPass() : LoopPass(ID) {
+    initializeAArch64LoopIdiomTransformLegacyPassPass(
+        *PassRegistry::getPassRegistry());
+  }
+
+  StringRef getPassName() const override {
+    return "Recognize AArch64-specific loop idioms";
+  }
+
+  void getAnalysisUsage(AnalysisUsage &AU) const override {
+    AU.addRequired<LoopInfoWrapperPass>();
+    AU.addRequired<DominatorTreeWrapperPass>();
+    AU.addRequired<TargetTransformInfoWrapperPass>();
+  }
+
+  bool runOnLoop(Loop *L, LPPassManager &LPM) override;
+};
+
+bool AArch64LoopIdiomTransformLegacyPass::runOnLoop(Loop *L,
+                                                    LPPassManager &LPM) {
+
+  if (skipLoop(L))
+    return false;
+
+  auto *DT = &getAnalysis<DominatorTreeWrapperPass>().getDomTree();
+  auto *LI = &getAnalysis<LoopInfoWrapperPass>().getLoopInfo();
+  auto &TTI = getAnalysis<TargetTransformInfoWrapperPass>().getTTI(
+      *L->getHeader()->getParent());
+  return AArch64LoopIdiomTransform(
+             DT, LI, &TTI, &L->getHeader()->getModule()->getDataLayout())
+      .run(L);
+}
+
+} // end anonymous namespace
+
+char AArch64LoopIdiomTransformLegacyPass::ID = 0;
+
+INITIALIZE_PASS_BEGIN(
+    AArch64LoopIdiomTransformLegacyPass, "aarch64-lit",
+    "Transform specific loop idioms into optimised vector forms", false, false)
+INITIALIZE_PASS_DEPENDENCY(LoopInfoWrapperPass)
+INITIALIZE_PASS_DEPENDENCY(LoopSimplify)
+INITIALIZE_PASS_DEPENDENCY(LCSSAWrapperPass)
+INITIALIZE_PASS_DEPENDENCY(DominatorTreeWrapperPass)
+INITIALIZE_PASS_DEPENDENCY(TargetTransformInfoWrapperPass)
+INITIALIZE_PASS_END(
+    AArch64LoopIdiomTransformLegacyPass, "aarch64-lit",
+    "Transform specific loop idioms into optimised vector forms", false, false)
+
+Pass *llvm::createAArch64LoopIdiomTransformPass() {
+  return new AArch64LoopIdiomTransformLegacyPass();
+}
+
+PreservedAnalyses
+AArch64LoopIdiomTransformPass::run(Loop &L, LoopAnalysisManager &AM,
+                                   LoopStandardAnalysisResults &AR,
+                                   LPMUpdater &) {
+  if (DisableAll)
+    return PreservedAnalyses::all();
+
+  const auto *DL = &L.getHeader()->getModule()->getDataLayout();
+
+  AArch64LoopIdiomTransform LIT(&AR.DT, &AR.LI, &AR.TTI, DL);
+  if (!LIT.run(&L))
+    return PreservedAnalyses::all();
+
+  return PreservedAnalyses::none();
+}
+
+//===----------------------------------------------------------------------===//
+//
+//          Implementation of AArch64LoopIdiomTransform
+//
+//===----------------------------------------------------------------------===//
+
+bool AArch64LoopIdiomTransform::run(Loop *L) {
+  CurLoop = L;
+
+  if (DisableAll)
+    return false;
+
+  // If the loop could not be converted to canonical form, it must have an
+  // indirectbr in it, just give up.
+  if (!L->getLoopPreheader())
+    return false;
+
+  LLVM_DEBUG(dbgs() << DEBUG_TYPE " Scanning: F["
+                    << CurLoop->getHeader()->getParent()->getName()
+                    << "] Loop %" << CurLoop->getHeader()->getName() << "\n");
+
+  return recognizeByteCompare();
+}
+
+/// Match loop-invariant value.
+template <typename SubPattern_t> struct match_LoopInvariant {
+  SubPattern_t SubPattern;
+  const Loop *L;
+
+  match_LoopInvariant(const SubPattern_t &SP, const Loop *L)
+      : SubPattern(SP), L(L) {}
+
+  template <typename ITy> bool match(ITy *V) {
+    return L->isLoopInvariant(V) && SubPattern.match(V);
+  }
+};
+
+/// Matches if the value is loop-invariant.
+template <typename Ty>
+inline match_LoopInvariant<Ty> m_LoopInvariant(const Ty &M, const Loop *L) {
+  return match_LoopInvariant<Ty>(M, L);
+}
+
+bool AArch64LoopIdiomTransform::recognizeByteCompare() {
+  if (!TTI->supportsScalableVectors() || !TTI->getMinPageSize().has_value() ||
+      DisableByteCmp)
+    return false;
+
+  BasicBlock *Header = CurLoop->getHeader();
+  BasicBlock *PH = CurLoop->getLoopPreheader();
+
+  // In AArch64LoopIdiomTransform::run we have already checked that the loop
+  // has a preheader so we can assume it's in a canonical form.
+  auto *EntryBI = cast<BranchInst>(PH->getTerminator());
+
+  if (CurLoop->getNumBackEdges() != 1 || CurLoop->getNumBlocks() != 2)
+    return false;
+
+  PHINode *PN = dyn_cast<PHINode>(&Header->front());
+  if (!PN || PN->getNumIncomingValues() != 2)
+    return false;
+
+  auto LoopBlocks = CurLoop->getBlocks();
+  // The first block in the loop should contain only 4 instructions, e.g.
+  //
+  //  while.cond:
+  //   %res.phi = phi i32 [ %start, %ph ], [ %inc, %while.body ]
+  //   %inc = add i32 %res.phi, 1
+  //   %cmp.not = icmp eq i32 %inc, %n
+  //   br i1 %cmp.not, label %while.end, label %while.body
+  //
+  auto CondBBInsts = LoopBlocks[0]->instructionsWithoutDebug();
+  if (std::distance(CondBBInsts.begin(), CondBBInsts.end()) > 4)
+    return false;
+
+  // The second block should contain 7 instructions, e.g.
+  //
+  // while.body:
+  //   %idx = zext i32 %inc to i64
+  //   %idx.a = getelementptr inbounds i8, ptr %a, i64 %idx
+  //   %load.a = load i8, ptr %idx.a
+  //   %idx.b = getelementptr inbounds i8, ptr %b, i64 %idx
+  //   %load.b = load i8, ptr %idx.b
+  //   %cmp.not.ld = icmp eq i8 %load.a, %load.b
+  //   br i1 %cmp.not.ld, label %while.cond, label %while.end
+  //
+  auto LoopBBInsts = LoopBlocks[1]->instructionsWithoutDebug();
+  if (std::distance(LoopBBInsts.begin(), LoopBBInsts.end()) > 7)
+    return false;
+
+  using namespace PatternMatch;
+
+  // The incoming value to the PHI node from the loop should be an add of 1.
+  Instruction *Index = nullptr;
+  Value *StartIdx = nullptr;
+  for (BasicBlock *BB : PN->blocks()) {
+    if (!CurLoop->contains(BB)) {
+      StartIdx = PN->getIncomingValueForBlock(BB);
+      continue;
+    }
+    Index = dyn_cast<Instruction>(PN->getIncomingValueForBlock(BB));
+    // Limit to 32-bit types for now
+    if (!Index || !Index->getType()->isIntegerTy(32) ||
+        !match(Index, m_c_Add(m_Specific(PN), m_One())))
+      return false;
+  }
+
+  // If we match the pattern, PN and Index will be replaced with the result of
+  // the cttz.elts intrinsic. If any other instructions are used outside of
+  // the loop, we cannot replace it.
+  for (BasicBlock *BB : LoopBlocks)
+    for (Instruction &I : *BB)
+      if (&I != PN && &I != Index)
+        for (User *U : I.users()) {
+          auto UI = cast<Instruction>(U);
+          if (!CurLoop->contains(UI))
+            return false;
+        }
+
+  // Don't replace the loop if the add has a wrap flag.
+  if (Index->hasNoSignedWrap() || Index->hasNoUnsignedWrap())
+    return false;
+
+  // Match the branch instruction for the header
+  ICmpInst::Predicate Pred;
+  Value *MaxLen;
+  BasicBlock *EndBB, *WhileBB;
+  if (!match(Header->getTerminator(),
+             m_Br(m_ICmp(Pred, m_Specific(Index), m_Value(MaxLen)),
+                  m_BasicBlock(EndBB), m_BasicBlock(WhileBB))) ||
+      Pred != ICmpInst::Predicate::ICMP_EQ || !CurLoop->contains(WhileBB))
+    return false;
+
+  // WhileBB should contain the pattern of load & compare instructions. Match
+  // the pattern and find the GEP instructions used by the loads.
+  ICmpInst::Predicate WhilePred;
+  BasicBlock *FoundBB;
+  BasicBlock *TrueBB;
+  Value *LoadA, *LoadB;
+  if (!match(WhileBB->getTerminator(),
+             m_Br(m_ICmp(WhilePred, m_Value(LoadA), m_Value(LoadB)),
+                  m_BasicBlock(TrueBB), m_BasicBlock(FoundBB))) ||
+      WhilePred != ICmpInst::Predicate::ICMP_EQ || !CurLoop->contains(TrueBB))
+    return false;
+
+  Value *A, *B;
+  if (!match(LoadA, m_Load(m_Value(A))) || !match(LoadB, m_Load(m_Value(B))))
+    return false;
+
+  GetElementPtrInst *GEPA = dyn_cast<GetElementPtrInst>(A);
+  GetElementPtrInst *GEPB = dyn_cast<GetElementPtrInst>(B);
+
+  if (!GEPA || !GEPB)
+    return false;
+
+  Value *PtrA = GEPA->getPointerOperand();
+  Value *PtrB = GEPB->getPointerOperand();
+
+  // Check we are loading i8 values from two loop invariant pointers
+  if (!CurLoop->isLoopInvariant(PtrA) || !CurLoop->isLoopInvariant(PtrB) ||
+      !GEPA->getResultElementType()->isIntegerTy(8) ||
+      !GEPB->getResultElementType()->isIntegerTy(8) ||
+      !cast<LoadInst>(LoadA)->getType()->isIntegerTy(8) ||
+      !cast<LoadInst>(LoadB)->getType()->isIntegerTy(8) || PtrA == PtrB)
+    return false;
+
+  // Check that the index to the GEPs is the index we found earlier
+  if (GEPA->getNumIndices() > 1 || GEPB->getNumIndices() > 1)
+    return false;
+
+  Value *IdxA = GEPA->getOperand(GEPA->getNumIndices());
+  Value *IdxB = GEPB->getOperand(GEPB->getNumIndices());
+  if (IdxA != IdxB || !match(IdxA, m_ZExt(m_Specific(Index))))
+    return false;
+
+  LLVM_DEBUG(dbgs() << "FOUND IDIOM IN LOOP: \n"
+                    << *(EndBB->getParent()) << "\n\n");
+
+  // The index is incremented before the GEP/Load pair so we need to
+  // add 1 to the start value.
+  transformByteCompare(GEPA, GEPB, MaxLen, Index, StartIdx, /*IncIdx=*/true, FoundBB,
+                       EndBB);
+  return true;
+}
+
+Value *AArch64LoopIdiomTransform::expandFindMismatch(IRBuilder<> &Builder,
+                                                     GetElementPtrInst *GEPA,
+                                                     GetElementPtrInst *GEPB,
+                                                     Value *Start,
+                                                     Value *MaxLen) {
+  Value *PtrA = GEPA->getPointerOperand();
+  Value *PtrB = GEPB->getPointerOperand();
+
+  // Get the arguments and types for the intrinsic.
+  BasicBlock *Preheader = CurLoop->getLoopPreheader();
+  BranchInst *PHBranch = cast<BranchInst>(Preheader->getTerminator());
+  LLVMContext &Ctx = PHBranch->getContext();
+  Type *LoadType = Type::getInt8Ty(Ctx);
+  Type *ResType = Builder.getInt32Ty();
+
+  // Split block in the original loop preheader.
+  DomTreeUpdater DTU(DT, DomTreeUpdater::UpdateStrategy::Lazy);
+  BasicBlock *EndBlock =
+      SplitBlock(Preheader, PHBranch, DT, LI, nullptr, "mismatch_end");
+
+  // Create the blocks that we're going to need:
+  //  1. A block for checking the zero-extended length exceeds 0
+  //  2. A block to check that the start and end addresses of a given array
+  //     lie on the same page.
+  //  3. The SVE loop preheader.
+  //  4. The first SVE loop block.
+  //  5. The SVE loop increment block.
+  //  6. A block we can jump to from the SVE loop when a mismatch is found.
+  //  7. The first block of the scalar loop itself, containing PHIs, loads,
+  //     and compares.
+  //  8. A scalar loop increment block to increment the PHIs and branch
+  //     back around the loop.
+
+  BasicBlock *MinItCheckBlock = BasicBlock::Create(
+      Ctx, "mismatch_min_it_check", EndBlock->getParent(), EndBlock);
+
+  // Update the terminator added by SplitBlock to branch to the first block
+  Preheader->getTerminator()->setSuccessor(0, MinItCheckBlock);
+
+  BasicBlock *MemCheckBlock = BasicBlock::Create(
+      Ctx, "mismatch_mem_check", EndBlock->getParent(), EndBlock);
+
+  BasicBlock *SVELoopPreheaderBlock = BasicBlock::Create(
+      Ctx, "mismatch_sve_loop_preheader", EndBlock->getParent(), EndBlock);
+
+  BasicBlock *SVELoopStartBlock = BasicBlock::Create(
+      Ctx, "mismatch_sve_loop", EndBlock->getParent(), EndBlock);
+
+  BasicBlock *SVELoopIncBlock = BasicBlock::Create(
+      Ctx, "mismatch_sve_loop_inc", EndBlock->getParent(), EndBlock);
+
+  BasicBlock *SVELoopMismatchBlock = BasicBlock::Create(
+      Ctx, "mismatch_sve_loop_found", EndBlock->getParent(), EndBlock);
+
+  BasicBlock *LoopPreHeaderBlock = BasicBlock::Create(
+      Ctx, "mismatch_loop_pre", EndBlock->getParent(), EndBlock);
+
+  BasicBlock *LoopStartBlock =
+      BasicBlock::Create(Ctx, "mismatch_loop", EndBlock->getParent(), EndBlock);
+
+  BasicBlock *LoopIncBlock = BasicBlock::Create(
+      Ctx, "mismatch_loop_inc", EndBlock->getParent(), EndBlock);
+
+  DTU.applyUpdates({{DominatorTree::Insert, Preheader, MinItCheckBlock},
+                    {DominatorTree::Delete, Preheader, EndBlock}});
+
+  // Update LoopInfo with the new SVE & scalar loops.
+  auto SVELoop = LI->AllocateLoop();
+  auto ScalarLoop = LI->AllocateLoop();
+  if (CurLoop->getParentLoop()) {
+    CurLoop->getParentLoop()->addChildLoop(SVELoop);
+    CurLoop->getParentLoop()->addChildLoop(ScalarLoop);
+  } else {
+    LI->addTopLevelLoop(SVELoop);
+    LI->addTopLevelLoop(ScalarLoop);
+  }
+
+  // Add the new basic blocks to their associated loops.
+  SVELoop->addBasicBlockToLoop(MinItCheckBlock, *LI);
+  SVELoop->addBasicBlockToLoop(MemCheckBlock, *LI);
+  SVELoop->addBasicBlockToLoop(SVELoopPreheaderBlock, *LI);
+  SVELoop->addBasicBlockToLoop(SVELoopStartBlock, *LI);
+  SVELoop->addBasicBlockToLoop(SVELoopIncBlock, *LI);
+  SVELoop->addBasicBlockToLoop(SVELoopMismatchBlock, *LI);
+
+  ScalarLoop->addBasicBlockToLoop(LoopPreHeaderBlock, *LI);
+  ScalarLoop->addBasicBlockToLoop(LoopStartBlock, *LI);
+  ScalarLoop->addBasicBlockToLoop(LoopIncBlock, *LI);
+
+  // Set up some types and constants that we intend to reuse.
+  Type *I64Type = Builder.getInt64Ty();
+
+  // Check the zero-extended iteration count > 0
+  Builder.SetInsertPoint(MinItCheckBlock);
+  Value *ExtStart = Builder.CreateZExt(Start, I64Type);
+  Value *ExtEnd = Builder.CreateZExt(MaxLen, I64Type);
+  // This check doesn't really cost us very much.
+
+  Value *LimitCheck = Builder.CreateICmpULE(Start, MaxLen);
+  BranchInst *MinItCheckBr =
+      BranchInst::Create(MemCheckBlock, LoopPreHeaderBlock, LimitCheck);
+  MinItCheckBr->setMetadata(
+      LLVMContext::MD_prof,
+      MDBuilder(MinItCheckBr->getContext()).createBranchWeights(99, 1));
+  Builder.Insert(MinItCheckBr);
+
+  DTU.applyUpdates(
+      {{DominatorTree::Insert, MinItCheckBlock, MemCheckBlock},
+       {DominatorTree::Insert, MinItCheckBlock, LoopPreHeaderBlock}});
+
+  // For each of the arrays, check the start/end addresses are...
[truncated]

github-actions bot commented Nov 14, 2023

✅ With the latest revision this PR passed the C/C++ code formatter.


using namespace PatternMatch;

// The incoming value to the PHI node from the loop should be an add of 1.
Collaborator:

Can we use one of the Loop APIs for this? For example:

`PHINode *getCanonicalInductionVariable() const` — check to see if the loop has a canonical induction variable: an integer recurrence that starts at 0 and increments by one each time through the loop.

Contributor Author:

Sadly not! That's because our induction variable does not start at 0.


using namespace llvm;

#define DEBUG_TYPE "aarch64-lit"
Collaborator:

Bikeshedding: `lit` as a name could perhaps be confusing. Something more verbose but descriptive with "loopidiom" in it would probably have my preference.

Contributor Author:

I've updated the DEBUG_TYPE to use something more descriptive. I wasn't sure if we should also update the flags, since I was worried the names would be too long!

Contributor Author:

Sorry, I just realised I have lost this change somehow. I'll fix it.

// br i1 %cmp.not, label %while.end, label %while.body
//
auto CondBBInsts = LoopBlocks[0]->instructionsWithoutDebug();
if (std::distance(CondBBInsts.begin(), CondBBInsts.end()) > 4)
@sjoerdmeijer (Collaborator) commented Nov 16, 2023:

Counting the number of instructions might be a bit too "specific". Do we not need to inspect the supported instructions instead?

Contributor Author:

This is more like a quick initial filter of loops we don't care about. The code does inspect uses of the index, the GEPs, loads and comparisons.

Contributor:

Is there a Godbolt link for the C code leading to these instructions? (I couldn't reproduce the same sequence for compare_bytes_simple.) I know this is to discriminate specific loop structures so that the later pattern matching is more likely to succeed, but this looks very specific to me too, and may be subject to quick changes in the future. Would dropping this lead to bottlenecks?

Contributor Author:

Sure. https://godbolt.org/z/e5MnnGT84

As you can see, the first block corresponds to this — while (++start != end) — so the block is likely to be tiny. It's possible the compiler could increment start before the loop, in which case it will look more like while (start++ != end). We've seen the former idiom in xz and 7zip. In either case the first block will have 4 or fewer instructions. Even if we permit more instructions we still have to match the right pattern, and it would likely fail anyway because we would have more than the expected uses of a value. We also couldn't, for example, permit random stores in the loop to account for the extra instructions.
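To make the shape concrete, here is a sketch of the pre-increment form mentioned above (a hypothetical function mirroring the xz/7zip-style idiom, not code from the patch). Note that the pre-increment means the element at the initial value of start is never compared:

```c
/* Pre-increment form: the index is bumped before the loads, so the
 * element at the initial value of `start` is never compared. */
static unsigned pre_inc_mismatch(const unsigned char *p1,
                                 const unsigned char *p2,
                                 unsigned start, unsigned end) {
  while (++start != end) {
    if (p1[start] != p2[start])
      break;
  }
  return start;
}
```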

I'm a bit reluctant to expand the algorithm to cover hypothetical cases that we can't prove exist. I agree it's a good idea to make it robust based on real-world examples. Do you have any examples of other specific variations of this idiom that you are worried about?

Contributor Author:

It's worth noting that the LoopIdiomRecognize pass also uses std::distance as a way to filter out loops.

For what it's worth, I do have downstream code that increases the number of idioms we recognise. This led to me finding a number of bugs, so the exercise was very useful! However, I'd prefer to land this patch first and build upon this incrementally. I can create a patch later on that adds these extra idioms once I've done sufficient testing. I think the idioms in this patch have the biggest impact in terms of performance.

The extra idioms involve: 1) supporting 64-bit induction variable types, 2) permitting add increments without nuw and nsw flags, 3) supporting loops that increment the counter after loading the values (as opposed to currently incrementing before loading).
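For illustration, variants 1) and 3) combined might look like the following in source form (a hypothetical sketch; not necessarily the exact shape the follow-up patch will match):

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical source-level shape of variants 1) and 3) above: a 64-bit
// induction variable, with the counter incremented *after* loading and
// comparing the values. Unlike the pre-increment form currently
// matched, this variant also compares the bytes at the initial index.
uint64_t firstMismatchPost(const unsigned char *p1, const unsigned char *p2,
                           uint64_t start, uint64_t end) {
  while (start != end) {
    if (p1[start] != p2[start])
      break;
    ++start;
  }
  return start;
}
```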

Contributor

Sorry for replying only now, and thanks for the explanation. If I understand correctly, the snippet you provided is not meant to be transformed, as the indices of the two GEPs do not match, so we would bail out. I'm confused, though: why shouldn't the following snippet be supported (as per the comment here)? Doesn't it match what the pass expects (the index is simply incremented before loading the values), or am I missing something?

int test(unsigned char *p1, unsigned char *p2, unsigned start, unsigned end) {
  while (++start != end) {
    if (p1[start] != p2[start])
      break;
  }
  return start;
}

Also, if we were to support 1), 2), 3) we would still not be catching samples like https://godbolt.org/z/vPGqde4zj, due to the and on the induction variable (not sure why it has been converted to an i64 here, but I guess it's an orthogonal problem – provided we want to catch this).

Contributor Author

Oh, my apologies. By "the snippet you provided" are you referring to the godbolt link I posted? If so, I made a mistake there and it should be

int test(unsigned char *p1, unsigned char *p2, unsigned start, unsigned end) {
  while (++start != end) {
    if (p1[start] != p2[start])
      break;
  }
  return start;
}

For the new godbolt link you provided, you're right that the IR contains an add of 1 and an and instruction that isn't currently recognised. Again, the purpose of this pass is not to find as many loops as possible that we can recognise, but to focus on those that are of genuine value to users. Adding support for loops like the one you posted would add a lot more complexity, and expandFindMismatch would need to generate code for all flavours of loops, such as potentially adding the add and and instructions in your example.

I have identified some more loops that we can recognise in the xz benchmark and have a WIP patch downstream, but these only give us tiny gains and the codegen is also poor due to LoopStrengthReduce making some poor choices for the SVE loops. Once I can be sure that recognising these extra loops is going to be a definite win I can follow up with another patch that adds them.

Contributor

Thank you. I'm still slightly unsure about the logic here (between 336-368), which seems to prevent transformation of simple loops such as the aforementioned one, since the incoming values of the phi are not the same (which should be valid), but it's fine if this is going to be added/reviewed at a later stage.

Contributor Author

This logic is designed to prevent recognising loops such as @compare_bytes_simple2, i.e.

while.cond:
...
  br i1 %cmp.not, label %while.end, label %while.body

while.body:
...
  br i1 %cmp.not2, label %while.cond, label %while.end

while.end:
...
  %final_ptr = phi ptr [ %c, %while.body ], [ %d, %while.cond ]

where the found and end blocks are the same, and that block contains a PHI node with a unique incoming value for each block in the loop. However, neither incoming value is defined in the loop itself. Currently we don't handle this case in transformByteCompare because it would require extra work in the byte.compare block to select between these two incoming values. Rather than let transformByteCompare generate incorrect code and lead to bugs, I've explicitly disabled these cases for now. I haven't encountered any examples of such loops in the benchmarks I have looked at, so these are not critical to us. I only discovered this accidentally while writing more test cases.
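A source-level shape that can produce this rejected pattern might look like the following (my own reconstruction from the IR above, so treat the details as an assumption):

```cpp
#include <cassert>

// Both exit edges lead to the same block, and the PHI there selects
// between c and d - two values defined *outside* the loop - based
// purely on which exit edge was taken. Transforming this would require
// an extra select in the byte.compare block.
const char *pickByExit(const unsigned char *p1, const unsigned char *p2,
                       unsigned start, unsigned end,
                       const char *c, const char *d) {
  while (++start != end) {
    if (p1[start] != p2[start])
      return c; // exit taken from the loop body
  }
  return d;     // exit taken from the loop header
}
```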

// At this point we know two things must be true:
// 1. Start <= End
// 2. ExtMaxLen <= 4096 due to the page checks.
// Therefore, we know that we can use a 64-bit induction variable that
Collaborator

I am missing some higher level descriptions of the transformation that we are exactly trying to achieve, and under which conditions this is a valid thing to do. For example, I don't understand yet the purpose of the page size check.

Contributor Author

I've added comments at the top of the file, as well as detailed comments further up explaining how the page size is used to generate runtime memory checks. These ensure we don't fault midway through a vector load due to reading ahead.
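My reading of that check, as a sketch (the names are mine, not the pass's): a speculative vector load cannot fault partway through as long as its first and last byte land on the same minimum-sized page, because the first byte is known to be dereferenceable.

```cpp
#include <cassert>
#include <cstdint>

// Sketch of the runtime memory check: if [Ptr, Ptr + VL) stays within a
// single page of at least MinPageSize bytes, a vector load starting at
// Ptr cannot fault mid-load when the byte at Ptr is dereferenceable.
// When the pages differ, the pass falls back to a scalar loop instead.
bool mayCrossPage(uint64_t Ptr, uint64_t VL, uint64_t MinPageSize) {
  uint64_t StartPage = Ptr / MinPageSize;
  uint64_t EndPage = (Ptr + VL - 1) / MinPageSize;
  return StartPage != EndPage;
}
```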

@xgupta
Contributor

xgupta commented Nov 21, 2023

Sorry, I was supposed to push to my repo, but changes got into this pull request, I will revert it immediately.

* Added option to override the target's minimum page size.
* Added more comments/documentation explaining what the pass
is doing.
* Rewrote the code for detecting the indices.

david-arm and others added 2 commits December 12, 2023 09:53
As an experiment I tried adding support for more loop idioms.
As a result of greater test coverage I also discovered issues with
how we create the loops, which led to me fixing these bugs:

* The loops were not in the expected LCSSA format, which I have
now fixed. I've added an option for testing that allows us to
verify the loops after we create them.
* The scalar fallback loop could potentially read one byte past the
end of the array. Curiously it would always have given a correct
result, but had the potential to seg fault.
* I also added code to reject an unsupported case that would have
led to incorrect code generation. I have added a test for this.

I also amended some comments and cleaned up some unused variables.

I haven't added support for more idioms in this patch yet, since
I would like to do that in a follow-on patch after I've had chance
to do more testing. It would be good to build on this incrementally.
Value *RhsPageCmp = Builder.CreateICmpNE(RhsStartPage, RhsEndPage);

Value *CombinedPageCmp = Builder.CreateOr(LhsPageCmp, RhsPageCmp);
BranchInst *CombinedPageCmpCmpBr = BranchInst::Create(
Collaborator

Did you consider having this set the maximum number of iterations for the new loop, as opposed to just skipping it? Or does that lead to worse performance?

Contributor Author

So a while ago I did try writing a simple SVE ACLE implementation that processed the data in chunks, constantly walking up to the nearest page boundary in the SVE loop and then restarting again. The performance wasn't any better (presumably due to the extra calculations of the end index) and I figured it would add even more complexity to expandFindMismatch, so I stuck with this approach for now. But I'm happy to be proved wrong and this is certainly something we could revisit.
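For what it's worth, the chunked alternative described above could be sketched like this (my names and structure, assuming a known minimum page size):

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>

// Sketch of the rejected chunked approach: instead of a scalar fallback
// when the range may cross a page, clamp each run of the vector loop to
// the next page boundary and restart from there. Computing this end
// index on every chunk is the extra work suspected of eating the gains.
uint64_t nextChunkEnd(uint64_t Base, uint64_t Idx, uint64_t End,
                      uint64_t MinPageSize) {
  uint64_t ToBoundary = MinPageSize - ((Base + Idx) % MinPageSize);
  return std::min(End, Idx + ToBoundary);
}
```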

* Bail out of transformation if optimising for size
* Simple code tidy-ups
* Add support for incrementing the index with wrap flags
Collaborator

@davemgreen left a comment

Thanks. I think it is worth trying to get this in. I already see it triggering in a number of places, it might be worth working on making it a little more generic in followup patches if we can, but there is already quite a bit going on.


// In AArch64LoopIdiomTransform::run we have already checked that the loop
// has a preheader so we can assume it's in a canonical form.
if (CurLoop->getNumBackEdges() != 1 || CurLoop->getNumBlocks() != 2)
Collaborator

I suspect getNumBackEdges == 1 is probably redundant with the other checks, but it sounds OK to keep to be careful.

Contributor Author

Yeah, this is typically what LoopIdiomRecognize.cpp does as well for many of the loop idioms it analyses.


Value *IdxA = GEPA->getOperand(GEPA->getNumIndices());
Value *IdxB = GEPB->getOperand(GEPB->getNumIndices());
if (IdxA != IdxB || !match(IdxA, m_ZExt(m_Specific(Index))))
Collaborator

It might be nice to generalize this to values that are not zext, but that is likely done in a separate patch. There is already quite a bit going on in this one.


// The index is incremented before the GEP/Load pair so we need to
// add 1 to the start value.
transformByteCompare(GEPA, GEPB, PN, MaxLen, Index, StartIdx, /*IncIdx=*/true,
Collaborator

Is IncIdx always true for later use?

Contributor Author

Yeah, the original plan was to support both forms of the loop:

  1. First loop block: increment the index, second block: compare bytes.
  2. First loop block: compare bytes, second block: increment the index.

IncIdx=true refers to 1), but in a follow-on patch I intend to add support for 2) as well. I did consider deleting the function argument in this first patch because I know it looks odd, but wasn't sure if there is much point given I plan to use this later.

CurLoop->getParentLoop()->addBasicBlockToLoop(CmpBB, *LI);

// Update the dominator tree with the new block.
DT->addNewBlock(CmpBB, Preheader);
Collaborator

Do CmpBB->EndBB / FoundBB edges need to be added? It's hard to tell with so many edges being added/removed! Is it worth adding DT verification to the checks below?

Contributor Author

I've tried to tighten up the DominatorTree updates and I've added a flag to one of the test file RUN lines to verify the tree.

DTU.applyUpdates({{DominatorTree::Insert, SVELoopMismatchBlock, EndBlock}});

// Generate code for scalar loop.
Builder.SetInsertPoint(LoopPreHeaderBlock);
Collaborator

Why is the scalar loop recreated as opposed to reused? Is it the same but rotated?

Contributor Author

Yeah, it's the same but we've already incremented the value by 1 before entering the loop. I guess that's what you mean by rotated? You're right in theory that we could have reused the existing loop, but given we wanted to increment the index before the vector loop I thought it was good to be consistent with the scalar loop as well.


// If no mismatch was found, we can jump to the end block. Create a
// new basic block for the compare instruction.
auto *CmpBB = BasicBlock::Create(Preheader->getContext(), "byte.compare",
Collaborator

Is CmpBB needed, or is this really part of the end of PreHeader, after the split?

Contributor Author

Given the way that expandFindMismatch funnels all paths from both the scalar and vector loops into a single mismatch.end block, we then need a final byte.compare block to check whether the result is the end or not. That decides which successor block to jump to - FoundBB or EndBB. I admit for the case when FoundBB == EndBB we don't need to do this, but I think it's simpler just to be consistent with the FoundBB != EndBB case.

Collaborator

I think my point was that the (old) preheader branches unconditionally to CmpBB, and CmpBB branches to EndBB/FoundBB. The preheader could branch to EndBB/FoundBB directly. The condition for the preheader is Builder.CreateCondBr(Builder.getTrue(), CmpBB, Header); though, so it might make sense to keep it as is and let simplify-cfg clean up the branches.

* Do more work to ensure DominatorTree is up to date.
* Add more checks to ensure only one use of the index PHI -
transformByteCompare doesn't currently handle this case as it
would require subtracting one from the result. It's not a
typical use case for loops in this form.
* General clean-up.
* Add the hasOptSize check back in as it was somehow lost in
an intermediate commit.
Collaborator

@davemgreen left a comment

Thanks for the updates. From what I can tell this LGTM, but it will need a rebase.

You might want to commit it with the option disabled, and then flip the switch in a followup to avoid the commit-revert cycles in case there are any issues.



* Fixed an issue after rebase with change in
registerPassBuilderCallbacks definition.
* Disabled the new pass by default - we can enable this in
a later patch.
@david-arm david-arm merged commit c714846 into llvm:main Jan 9, 2024
@dyung
Collaborator

dyung commented Jan 9, 2024

Hi @david-arm, the test you added, byte-compare-index.ll, seems to be failing on several bots - can you take a look?

For example on https://lab.llvm.org/buildbot/#/builders/139/builds/56740:

******************** TEST 'LLVM :: Transforms/LoopIdiom/AArch64/byte-compare-index.ll' FAILED ********************
Exit Code: 2
Command Output (stderr):
--
RUN: at line 2: /home/buildbot/buildbot-root/llvm-clang-x86_64-sie-ubuntu-fast/build/bin/opt -aarch64-lit -disable-aarch64-lit-all=false -aarch64-lit-verify -verify-dom-info -mtriple aarch64-unknown-linux-gnu -mattr=+sve -S < /home/buildbot/buildbot-root/llvm-clang-x86_64-sie-ubuntu-fast/llvm-project/llvm/test/Transforms/LoopIdiom/AArch64/byte-compare-index.ll | /home/buildbot/buildbot-root/llvm-clang-x86_64-sie-ubuntu-fast/build/bin/FileCheck /home/buildbot/buildbot-root/llvm-clang-x86_64-sie-ubuntu-fast/llvm-project/llvm/test/Transforms/LoopIdiom/AArch64/byte-compare-index.ll
+ /home/buildbot/buildbot-root/llvm-clang-x86_64-sie-ubuntu-fast/build/bin/FileCheck /home/buildbot/buildbot-root/llvm-clang-x86_64-sie-ubuntu-fast/llvm-project/llvm/test/Transforms/LoopIdiom/AArch64/byte-compare-index.ll
+ /home/buildbot/buildbot-root/llvm-clang-x86_64-sie-ubuntu-fast/build/bin/opt -aarch64-lit -disable-aarch64-lit-all=false -aarch64-lit-verify -verify-dom-info -mtriple aarch64-unknown-linux-gnu -mattr=+sve -S
opt: Unknown command line argument '-aarch64-lit'.  Try: '/home/buildbot/buildbot-root/llvm-clang-x86_64-sie-ubuntu-fast/build/bin/opt --help'
opt: Did you mean '--march'?
opt: Unknown command line argument '-disable-aarch64-lit-all=false'.  Try: '/home/buildbot/buildbot-root/llvm-clang-x86_64-sie-ubuntu-fast/build/bin/opt --help'
opt: Did you mean '--disable-sched-stalls=false'?
opt: Unknown command line argument '-aarch64-lit-verify'.  Try: '/home/buildbot/buildbot-root/llvm-clang-x86_64-sie-ubuntu-fast/build/bin/opt --help'
opt: Did you mean '--tail-dup-verify'?
FileCheck error: '<stdin>' is empty.
FileCheck command line:  /home/buildbot/buildbot-root/llvm-clang-x86_64-sie-ubuntu-fast/build/bin/FileCheck /home/buildbot/buildbot-root/llvm-clang-x86_64-sie-ubuntu-fast/llvm-project/llvm/test/Transforms/LoopIdiom/AArch64/byte-compare-index.ll
--

@david-arm
Contributor Author

Hi @dyung, sorry about this! It passed for me locally. It sounds like it needs a REQUIRES: aarch64-registered-target somewhere then.

I'll try to fix it asap.

@david-arm
Contributor Author

@dyung - fix pending here #77467


david-arm added a commit to david-arm/llvm-project that referenced this pull request Jan 9, 2024
Following on from

llvm#72273

which added the new AArch64 loop idiom transformation pass,
this patch enables the pass by default for AArch64.
david-arm added a commit that referenced this pull request Jan 10, 2024
Following on from

#72273

which added the new AArch64 loop idiom transformation pass, this patch
enables the pass by default for AArch64.
@david-arm david-arm deleted the aarch64_idiom branch January 10, 2024 10:07
justinfargnoli pushed a commit to justinfargnoli/llvm-project that referenced this pull request Jan 28, 2024

Original version of patch was posted on Phabricator:

 https://reviews.llvm.org/D158291

Patch co-authored by Kerry McLaughlin (@kmclaughlin-arm) and David
Sherwood (@david-arm)

See the original discussion on Discourse:

https://discourse.llvm.org/t/aarch64-target-specific-loop-idiom-recognition/72383
justinfargnoli pushed a commit to justinfargnoli/llvm-project that referenced this pull request Jan 28, 2024
Following on from

llvm#72273

which added the new AArch64 loop idiom transformation pass, this patch
enables the pass by default for AArch64.
@shaojingzhi
Contributor

Hi! I wonder whether you have conducted any tests to determine the potential performance increase from this pass on the SPEC2017 557.xz benchmark? I attempted to apply it to the xz benchmark, but only the single-copy run (--copies=1) demonstrated a significant increase (about 3%); there was no increase when I set --copies=128 or higher. Do you have any suggestions or test results that you could share?

@david-arm
Contributor Author

> Hi! I wonder whether you have conducted any tests to determine the potential performance increase from this pass on the SPEC2017 557.xz benchmark? I attempted to apply it to the xz benchmark, but only the single-copy run (--copies=1) demonstrated a significant increase (about 3%); there was no increase when I set --copies=128 or higher. Do you have any suggestions or test results that you could share?

The most significant gains with xz have already been achieved when #77480 and #77480 landed, which improved performance by 6-7% for neoverse-v1. This PR is an NFC refactoring patch, so it won't improve performance further. My follow-on patch (not yet posted) will trigger more cases in xz, but I don't expect any substantial performance gains there. The main goal of extending this pass further is to improve code coverage and testing, and hopefully there will be other applications besides xz that benefit too.
