
[AArch64] Add an AArch64 pass for loop idiom transformations #72273


Merged · 8 commits · Jan 9, 2024

Conversation

david-arm (Contributor) commented Nov 14, 2023

We have added a new pass that looks for loops such as the following:

  while (i != max_len)
      if (a[i] != b[i])
          break;

  ... use index i ...

Although similar to a memcmp, this is slightly different because instead of returning the difference between the values of the first non-matching pair of bytes, it returns the index of the first mismatch. As such, we are not able to lower this to a memcmp call.
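To make the semantic difference concrete, here is a minimal scalar sketch of the idiom (the function name is ours, not part of the patch): it returns the index of the first mismatching byte, whereas memcmp reports the sign of the byte difference.

```c
#include <stddef.h>

/* Scalar sketch of the idiom: return the index of the first mismatching
 * byte, or max_len if the first max_len bytes are identical. The result
 * is an index, not the byte difference that memcmp would report. */
static size_t first_mismatch(const unsigned char *a, const unsigned char *b,
                             size_t max_len) {
  size_t i = 0;
  while (i != max_len) {
    if (a[i] != b[i])
      break;
    i++;
  }
  return i;
}
```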

The new pass spots such idioms and transforms them into a specialised predicated loop that gives a significant performance improvement on AArch64. It is intended as a stop-gap solution until this can be handled by the vectoriser, which does not currently deal with early exits.

This specialised loop makes use of a generic intrinsic that counts the trailing zero elements in a predicate vector. This was added in https://reviews.llvm.org/D159283 and for SVE we end up with brkb & incp instructions.

Although we have added this pass only for AArch64, it was written in a generic way so that in theory it could be used by other targets. Currently the pass requires scalable vector support and needs to know the minimum page size for the target; however, it's possible to make it work for fixed-width vectors too. Also, the llvm.experimental.cttz.elts intrinsic used by the pass has a generic lowering, but can be made efficient for targets with instructions similar to SVE's brkb, cntp and incp.
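As a rough mental model (a scalar sketch, not the actual SVE lowering), llvm.experimental.cttz.elts counts the inactive lanes that precede the first active lane of a predicate — which is exactly the within-vector index of the first mismatch:

```c
/* Scalar sketch of llvm.experimental.cttz.elts on a predicate: count the
 * zero elements before the first set element, scanning from lane 0.
 * Returns n when no lane is set (i.e. the zero-is-poison flag is false). */
static unsigned cttz_elts(const _Bool *pred, unsigned n) {
  unsigned i = 0;
  while (i < n && !pred[i])
    i++;
  return i;
}
```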

The original version of this patch was posted on Phabricator:

https://reviews.llvm.org/D158291

Patch co-authored by Kerry McLaughlin (@kmclaughlin-arm) and David Sherwood (@david-arm)

See the original discussion on Discourse:
https://discourse.llvm.org/t/aarch64-target-specific-loop-idiom-recognition/72383

@llvmbot llvmbot added backend:AArch64 llvm:analysis Includes value tracking, cost tables and constant folding llvm:transforms labels Nov 14, 2023
llvmbot (Member) commented Nov 14, 2023

@llvm/pr-subscribers-llvm-analysis

@llvm/pr-subscribers-llvm-transforms

Author: David Sherwood (david-arm)

Changes

(PR description as above.)


Patch is 138.04 KiB, truncated to 20.00 KiB below, full version: https://github.com/llvm/llvm-project/pull/72273.diff

12 Files Affected:

  • (modified) llvm/include/llvm/Analysis/TargetTransformInfo.h (+8)
  • (modified) llvm/include/llvm/Analysis/TargetTransformInfoImpl.h (+2)
  • (modified) llvm/lib/Analysis/TargetTransformInfo.cpp (+4)
  • (modified) llvm/lib/Target/AArch64/AArch64.h (+1)
  • (added) llvm/lib/Target/AArch64/AArch64LoopIdiomTransform.cpp (+726)
  • (added) llvm/lib/Target/AArch64/AArch64LoopIdiomTransform.h (+25)
  • (modified) llvm/lib/Target/AArch64/AArch64TargetMachine.cpp (+10)
  • (modified) llvm/lib/Target/AArch64/AArch64TargetMachine.h (+3)
  • (modified) llvm/lib/Target/AArch64/AArch64TargetTransformInfo.h (+2)
  • (modified) llvm/lib/Target/AArch64/CMakeLists.txt (+1)
  • (added) llvm/test/Transforms/LoopIdiom/AArch64/byte-compare-index.ll (+1640)
  • (modified) llvm/utils/gn/secondary/llvm/lib/Target/AArch64/BUILD.gn (+1)
diff --git a/llvm/include/llvm/Analysis/TargetTransformInfo.h b/llvm/include/llvm/Analysis/TargetTransformInfo.h
index c18e0acdb759a8d..10d178a73b0fcdd 100644
--- a/llvm/include/llvm/Analysis/TargetTransformInfo.h
+++ b/llvm/include/llvm/Analysis/TargetTransformInfo.h
@@ -1155,6 +1155,9 @@ class TargetTransformInfo {
   /// \return The associativity of the cache level, if available.
   std::optional<unsigned> getCacheAssociativity(CacheLevel Level) const;
 
+  /// \return The minimum architectural page size for the target.
+  std::optional<unsigned> getMinPageSize() const;
+
   /// \return How much before a load we should place the prefetch
   /// instruction.  This is currently measured in number of
   /// instructions.
@@ -1889,6 +1892,7 @@ class TargetTransformInfo::Concept {
   virtual std::optional<unsigned> getCacheSize(CacheLevel Level) const = 0;
   virtual std::optional<unsigned> getCacheAssociativity(CacheLevel Level)
       const = 0;
+  virtual std::optional<unsigned> getMinPageSize() const = 0;
 
   /// \return How much before a load we should place the prefetch
   /// instruction.  This is currently measured in number of
@@ -2475,6 +2479,10 @@ class TargetTransformInfo::Model final : public TargetTransformInfo::Concept {
     return Impl.getCacheAssociativity(Level);
   }
 
+  std::optional<unsigned> getMinPageSize() const override {
+    return Impl.getMinPageSize();
+  }
+
   /// Return the preferred prefetch distance in terms of instructions.
   ///
   unsigned getPrefetchDistance() const override {
diff --git a/llvm/include/llvm/Analysis/TargetTransformInfoImpl.h b/llvm/include/llvm/Analysis/TargetTransformInfoImpl.h
index 2ccf57c22234f9a..13030cb9fe46825 100644
--- a/llvm/include/llvm/Analysis/TargetTransformInfoImpl.h
+++ b/llvm/include/llvm/Analysis/TargetTransformInfoImpl.h
@@ -494,6 +494,8 @@ class TargetTransformInfoImplBase {
     llvm_unreachable("Unknown TargetTransformInfo::CacheLevel");
   }
 
+  std::optional<unsigned> getMinPageSize() const { return {}; }
+
   unsigned getPrefetchDistance() const { return 0; }
   unsigned getMinPrefetchStride(unsigned NumMemAccesses,
                                 unsigned NumStridedMemAccesses,
diff --git a/llvm/lib/Analysis/TargetTransformInfo.cpp b/llvm/lib/Analysis/TargetTransformInfo.cpp
index caa9b17ae695e49..dfe8d004bb97899 100644
--- a/llvm/lib/Analysis/TargetTransformInfo.cpp
+++ b/llvm/lib/Analysis/TargetTransformInfo.cpp
@@ -753,6 +753,10 @@ TargetTransformInfo::getCacheAssociativity(CacheLevel Level) const {
   return TTIImpl->getCacheAssociativity(Level);
 }
 
+std::optional<unsigned> TargetTransformInfo::getMinPageSize() const {
+  return TTIImpl->getMinPageSize();
+}
+
 unsigned TargetTransformInfo::getPrefetchDistance() const {
   return TTIImpl->getPrefetchDistance();
 }
diff --git a/llvm/lib/Target/AArch64/AArch64.h b/llvm/lib/Target/AArch64/AArch64.h
index 901769c54b6ef59..d20ef63a72e8f62 100644
--- a/llvm/lib/Target/AArch64/AArch64.h
+++ b/llvm/lib/Target/AArch64/AArch64.h
@@ -88,6 +88,7 @@ void initializeAArch64DeadRegisterDefinitionsPass(PassRegistry&);
 void initializeAArch64ExpandPseudoPass(PassRegistry &);
 void initializeAArch64GlobalsTaggingPass(PassRegistry &);
 void initializeAArch64LoadStoreOptPass(PassRegistry&);
+void initializeAArch64LoopIdiomTransformLegacyPassPass(PassRegistry &);
 void initializeAArch64LowerHomogeneousPrologEpilogPass(PassRegistry &);
 void initializeAArch64MIPeepholeOptPass(PassRegistry &);
 void initializeAArch64O0PreLegalizerCombinerPass(PassRegistry &);
diff --git a/llvm/lib/Target/AArch64/AArch64LoopIdiomTransform.cpp b/llvm/lib/Target/AArch64/AArch64LoopIdiomTransform.cpp
new file mode 100644
index 000000000000000..c4d2529a3e53e7d
--- /dev/null
+++ b/llvm/lib/Target/AArch64/AArch64LoopIdiomTransform.cpp
@@ -0,0 +1,726 @@
+
+//===- AArch64LoopIdiomTransform.cpp - Loop idiom recognition -------------===//
+//
+// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.
+// See https://llvm.org/LICENSE.txt for license information.
+// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
+//
+//===----------------------------------------------------------------------===//
+
+#include "AArch64LoopIdiomTransform.h"
+#include "llvm/Analysis/DomTreeUpdater.h"
+#include "llvm/Analysis/LoopPass.h"
+#include "llvm/Analysis/TargetTransformInfo.h"
+#include "llvm/IR/Dominators.h"
+#include "llvm/IR/IRBuilder.h"
+#include "llvm/IR/Intrinsics.h"
+#include "llvm/IR/MDBuilder.h"
+#include "llvm/IR/PatternMatch.h"
+#include "llvm/InitializePasses.h"
+#include "llvm/Transforms/Utils/BasicBlockUtils.h"
+
+using namespace llvm;
+
+#define DEBUG_TYPE "aarch64-lit"
+
+static cl::opt<bool>
+    DisableAll("disable-aarch64-lit-all", cl::Hidden, cl::init(false),
+               cl::desc("Disable AArch64 Loop Idiom Transform Pass."));
+
+static cl::opt<bool> DisableByteCmp(
+    "disable-aarch64-lit-bytecmp", cl::Hidden, cl::init(false),
+    cl::desc("Proceed with AArch64 Loop Idiom Transform Pass, but do "
+             "not convert byte-compare loop(s)."));
+
+namespace llvm {
+
+void initializeAArch64LoopIdiomTransformLegacyPassPass(PassRegistry &);
+Pass *createAArch64LoopIdiomTransformPass();
+
+} // end namespace llvm
+
+namespace {
+
+class AArch64LoopIdiomTransform {
+  Loop *CurLoop = nullptr;
+  DominatorTree *DT;
+  LoopInfo *LI;
+  const TargetTransformInfo *TTI;
+  const DataLayout *DL;
+
+public:
+  explicit AArch64LoopIdiomTransform(DominatorTree *DT, LoopInfo *LI,
+                                     const TargetTransformInfo *TTI,
+                                     const DataLayout *DL)
+      : DT(DT), LI(LI), TTI(TTI), DL(DL) {}
+
+  bool run(Loop *L);
+
+private:
+  /// \name Countable Loop Idiom Handling
+  /// @{
+
+  bool runOnCountableLoop();
+  bool runOnLoopBlock(BasicBlock *BB, const SCEV *BECount,
+                      SmallVectorImpl<BasicBlock *> &ExitBlocks);
+
+  bool recognizeByteCompare();
+  Value *expandFindMismatch(IRBuilder<> &Builder, GetElementPtrInst *GEPA,
+                            GetElementPtrInst *GEPB, Value *Start,
+                            Value *MaxLen);
+  void transformByteCompare(GetElementPtrInst *GEPA, GetElementPtrInst *GEPB,
+                            Value *MaxLen, Value *Index, Value *Start,
+                            bool IncIdx, BasicBlock *FoundBB,
+                            BasicBlock *EndBB);
+  /// @}
+};
+
+class AArch64LoopIdiomTransformLegacyPass : public LoopPass {
+public:
+  static char ID;
+
+  explicit AArch64LoopIdiomTransformLegacyPass() : LoopPass(ID) {
+    initializeAArch64LoopIdiomTransformLegacyPassPass(
+        *PassRegistry::getPassRegistry());
+  }
+
+  StringRef getPassName() const override {
+    return "Recognize AArch64-specific loop idioms";
+  }
+
+  void getAnalysisUsage(AnalysisUsage &AU) const override {
+    AU.addRequired<LoopInfoWrapperPass>();
+    AU.addRequired<DominatorTreeWrapperPass>();
+    AU.addRequired<TargetTransformInfoWrapperPass>();
+  }
+
+  bool runOnLoop(Loop *L, LPPassManager &LPM) override;
+};
+
+bool AArch64LoopIdiomTransformLegacyPass::runOnLoop(Loop *L,
+                                                    LPPassManager &LPM) {
+
+  if (skipLoop(L))
+    return false;
+
+  auto *DT = &getAnalysis<DominatorTreeWrapperPass>().getDomTree();
+  auto *LI = &getAnalysis<LoopInfoWrapperPass>().getLoopInfo();
+  auto &TTI = getAnalysis<TargetTransformInfoWrapperPass>().getTTI(
+      *L->getHeader()->getParent());
+  return AArch64LoopIdiomTransform(
+             DT, LI, &TTI, &L->getHeader()->getModule()->getDataLayout())
+      .run(L);
+}
+
+} // end anonymous namespace
+
+char AArch64LoopIdiomTransformLegacyPass::ID = 0;
+
+INITIALIZE_PASS_BEGIN(
+    AArch64LoopIdiomTransformLegacyPass, "aarch64-lit",
+    "Transform specific loop idioms into optimised vector forms", false, false)
+INITIALIZE_PASS_DEPENDENCY(LoopInfoWrapperPass)
+INITIALIZE_PASS_DEPENDENCY(LoopSimplify)
+INITIALIZE_PASS_DEPENDENCY(LCSSAWrapperPass)
+INITIALIZE_PASS_DEPENDENCY(DominatorTreeWrapperPass)
+INITIALIZE_PASS_DEPENDENCY(TargetTransformInfoWrapperPass)
+INITIALIZE_PASS_END(
+    AArch64LoopIdiomTransformLegacyPass, "aarch64-lit",
+    "Transform specific loop idioms into optimised vector forms", false, false)
+
+Pass *llvm::createAArch64LoopIdiomTransformPass() {
+  return new AArch64LoopIdiomTransformLegacyPass();
+}
+
+PreservedAnalyses
+AArch64LoopIdiomTransformPass::run(Loop &L, LoopAnalysisManager &AM,
+                                   LoopStandardAnalysisResults &AR,
+                                   LPMUpdater &) {
+  if (DisableAll)
+    return PreservedAnalyses::all();
+
+  const auto *DL = &L.getHeader()->getModule()->getDataLayout();
+
+  AArch64LoopIdiomTransform LIT(&AR.DT, &AR.LI, &AR.TTI, DL);
+  if (!LIT.run(&L))
+    return PreservedAnalyses::all();
+
+  return PreservedAnalyses::none();
+}
+
+//===----------------------------------------------------------------------===//
+//
+//          Implementation of AArch64LoopIdiomTransform
+//
+//===----------------------------------------------------------------------===//
+
+bool AArch64LoopIdiomTransform::run(Loop *L) {
+  CurLoop = L;
+
+  if (DisableAll)
+    return false;
+
+  // If the loop could not be converted to canonical form, it must have an
+  // indirectbr in it, just give up.
+  if (!L->getLoopPreheader())
+    return false;
+
+  LLVM_DEBUG(dbgs() << DEBUG_TYPE " Scanning: F["
+                    << CurLoop->getHeader()->getParent()->getName()
+                    << "] Loop %" << CurLoop->getHeader()->getName() << "\n");
+
+  return recognizeByteCompare();
+}
+
+/// Match loop-invariant value.
+template <typename SubPattern_t> struct match_LoopInvariant {
+  SubPattern_t SubPattern;
+  const Loop *L;
+
+  match_LoopInvariant(const SubPattern_t &SP, const Loop *L)
+      : SubPattern(SP), L(L) {}
+
+  template <typename ITy> bool match(ITy *V) {
+    return L->isLoopInvariant(V) && SubPattern.match(V);
+  }
+};
+
+/// Matches if the value is loop-invariant.
+template <typename Ty>
+inline match_LoopInvariant<Ty> m_LoopInvariant(const Ty &M, const Loop *L) {
+  return match_LoopInvariant<Ty>(M, L);
+}
+
+bool AArch64LoopIdiomTransform::recognizeByteCompare() {
+  if (!TTI->supportsScalableVectors() || !TTI->getMinPageSize().has_value() ||
+      DisableByteCmp)
+    return false;
+
+  BasicBlock *Header = CurLoop->getHeader();
+  BasicBlock *PH = CurLoop->getLoopPreheader();
+
+  // In AArch64LoopIdiomTransform::run we have already checked that the loop
+  // has a preheader so we can assume it's in a canonical form.
+  auto *EntryBI = cast<BranchInst>(PH->getTerminator());
+
+  if (CurLoop->getNumBackEdges() != 1 || CurLoop->getNumBlocks() != 2)
+    return false;
+
+  PHINode *PN = dyn_cast<PHINode>(&Header->front());
+  if (!PN || PN->getNumIncomingValues() != 2)
+    return false;
+
+  auto LoopBlocks = CurLoop->getBlocks();
+  // The first block in the loop should contain only 4 instructions, e.g.
+  //
+  //  while.cond:
+  //   %res.phi = phi i32 [ %start, %ph ], [ %inc, %while.body ]
+  //   %inc = add i32 %res.phi, 1
+  //   %cmp.not = icmp eq i32 %inc, %n
+  //   br i1 %cmp.not, label %while.end, label %while.body
+  //
+  auto CondBBInsts = LoopBlocks[0]->instructionsWithoutDebug();
+  if (std::distance(CondBBInsts.begin(), CondBBInsts.end()) > 4)
+    return false;
+
+  // The second block should contain 7 instructions, e.g.
+  //
+  // while.body:
+  //   %idx = zext i32 %inc to i64
+  //   %idx.a = getelementptr inbounds i8, ptr %a, i64 %idx
+  //   %load.a = load i8, ptr %idx.a
+  //   %idx.b = getelementptr inbounds i8, ptr %b, i64 %idx
+  //   %load.b = load i8, ptr %idx.b
+  //   %cmp.not.ld = icmp eq i8 %load.a, %load.b
+  //   br i1 %cmp.not.ld, label %while.cond, label %while.end
+  //
+  auto LoopBBInsts = LoopBlocks[1]->instructionsWithoutDebug();
+  if (std::distance(LoopBBInsts.begin(), LoopBBInsts.end()) > 7)
+    return false;
+
+  using namespace PatternMatch;
+
+  // The incoming value to the PHI node from the loop should be an add of 1.
+  Instruction *Index = nullptr;
+  Value *StartIdx = nullptr;
+  for (BasicBlock *BB : PN->blocks()) {
+    if (!CurLoop->contains(BB)) {
+      StartIdx = PN->getIncomingValueForBlock(BB);
+      continue;
+    }
+    Index = dyn_cast<Instruction>(PN->getIncomingValueForBlock(BB));
+    // Limit to 32-bit types for now
+    if (!Index || !Index->getType()->isIntegerTy(32) ||
+        !match(Index, m_c_Add(m_Specific(PN), m_One())))
+      return false;
+  }
+
+  // If we match the pattern, PN and Index will be replaced with the result of
+  // the cttz.elts intrinsic. If any other instructions are used outside of
+  // the loop, we cannot replace it.
+  for (BasicBlock *BB : LoopBlocks)
+    for (Instruction &I : *BB)
+      if (&I != PN && &I != Index)
+        for (User *U : I.users()) {
+          auto UI = cast<Instruction>(U);
+          if (!CurLoop->contains(UI))
+            return false;
+        }
+
+  // Don't replace the loop if the add has a wrap flag.
+  if (Index->hasNoSignedWrap() || Index->hasNoUnsignedWrap())
+    return false;
+
+  // Match the branch instruction for the header
+  ICmpInst::Predicate Pred;
+  Value *MaxLen;
+  BasicBlock *EndBB, *WhileBB;
+  if (!match(Header->getTerminator(),
+             m_Br(m_ICmp(Pred, m_Specific(Index), m_Value(MaxLen)),
+                  m_BasicBlock(EndBB), m_BasicBlock(WhileBB))) ||
+      Pred != ICmpInst::Predicate::ICMP_EQ || !CurLoop->contains(WhileBB))
+    return false;
+
+  // WhileBB should contain the pattern of load & compare instructions. Match
+  // the pattern and find the GEP instructions used by the loads.
+  ICmpInst::Predicate WhilePred;
+  BasicBlock *FoundBB;
+  BasicBlock *TrueBB;
+  Value *LoadA, *LoadB;
+  if (!match(WhileBB->getTerminator(),
+             m_Br(m_ICmp(WhilePred, m_Value(LoadA), m_Value(LoadB)),
+                  m_BasicBlock(TrueBB), m_BasicBlock(FoundBB))) ||
+      WhilePred != ICmpInst::Predicate::ICMP_EQ || !CurLoop->contains(TrueBB))
+    return false;
+
+  Value *A, *B;
+  if (!match(LoadA, m_Load(m_Value(A))) || !match(LoadB, m_Load(m_Value(B))))
+    return false;
+
+  GetElementPtrInst *GEPA = dyn_cast<GetElementPtrInst>(A);
+  GetElementPtrInst *GEPB = dyn_cast<GetElementPtrInst>(B);
+
+  if (!GEPA || !GEPB)
+    return false;
+
+  Value *PtrA = GEPA->getPointerOperand();
+  Value *PtrB = GEPB->getPointerOperand();
+
+  // Check we are loading i8 values from two loop invariant pointers
+  if (!CurLoop->isLoopInvariant(PtrA) || !CurLoop->isLoopInvariant(PtrB) ||
+      !GEPA->getResultElementType()->isIntegerTy(8) ||
+      !GEPB->getResultElementType()->isIntegerTy(8) ||
+      !cast<LoadInst>(LoadA)->getType()->isIntegerTy(8) ||
+      !cast<LoadInst>(LoadB)->getType()->isIntegerTy(8) || PtrA == PtrB)
+    return false;
+
+  // Check that the index to the GEPs is the index we found earlier
+  if (GEPA->getNumIndices() > 1 || GEPB->getNumIndices() > 1)
+    return false;
+
+  Value *IdxA = GEPA->getOperand(GEPA->getNumIndices());
+  Value *IdxB = GEPB->getOperand(GEPB->getNumIndices());
+  if (IdxA != IdxB || !match(IdxA, m_ZExt(m_Specific(Index))))
+    return false;
+
+  LLVM_DEBUG(dbgs() << "FOUND IDIOM IN LOOP: \n"
+                    << *(EndBB->getParent()) << "\n\n");
+
+  // The index is incremented before the GEP/Load pair so we need to
+  // add 1 to the start value.
+  transformByteCompare(GEPA, GEPB, MaxLen, Index, StartIdx, /*IncIdx=*/true, FoundBB,
+                       EndBB);
+  return true;
+}
+
+Value *AArch64LoopIdiomTransform::expandFindMismatch(IRBuilder<> &Builder,
+                                                     GetElementPtrInst *GEPA,
+                                                     GetElementPtrInst *GEPB,
+                                                     Value *Start,
+                                                     Value *MaxLen) {
+  Value *PtrA = GEPA->getPointerOperand();
+  Value *PtrB = GEPB->getPointerOperand();
+
+  // Get the arguments and types for the intrinsic.
+  BasicBlock *Preheader = CurLoop->getLoopPreheader();
+  BranchInst *PHBranch = cast<BranchInst>(Preheader->getTerminator());
+  LLVMContext &Ctx = PHBranch->getContext();
+  Type *LoadType = Type::getInt8Ty(Ctx);
+  Type *ResType = Builder.getInt32Ty();
+
+  // Split block in the original loop preheader.
+  DomTreeUpdater DTU(DT, DomTreeUpdater::UpdateStrategy::Lazy);
+  BasicBlock *EndBlock =
+      SplitBlock(Preheader, PHBranch, DT, LI, nullptr, "mismatch_end");
+
+  // Create the blocks that we're going to need:
+  //  1. A block for checking the zero-extended length exceeds 0
+  //  2. A block to check that the start and end addresses of a given array
+  //     lie on the same page.
+  //  3. The SVE loop preheader.
+  //  4. The first SVE loop block.
+  //  5. The SVE loop increment block.
+  //  6. A block we can jump to from the SVE loop when a mismatch is found.
+  //  7. The first block of the scalar loop itself, containing PHIs, loads,
+  //     and compares.
+  //  8. A scalar loop increment block to increment the PHIs and branch
+  //     back around the loop.
+
+  BasicBlock *MinItCheckBlock = BasicBlock::Create(
+      Ctx, "mismatch_min_it_check", EndBlock->getParent(), EndBlock);
+
+  // Update the terminator added by SplitBlock to branch to the first block
+  Preheader->getTerminator()->setSuccessor(0, MinItCheckBlock);
+
+  BasicBlock *MemCheckBlock = BasicBlock::Create(
+      Ctx, "mismatch_mem_check", EndBlock->getParent(), EndBlock);
+
+  BasicBlock *SVELoopPreheaderBlock = BasicBlock::Create(
+      Ctx, "mismatch_sve_loop_preheader", EndBlock->getParent(), EndBlock);
+
+  BasicBlock *SVELoopStartBlock = BasicBlock::Create(
+      Ctx, "mismatch_sve_loop", EndBlock->getParent(), EndBlock);
+
+  BasicBlock *SVELoopIncBlock = BasicBlock::Create(
+      Ctx, "mismatch_sve_loop_inc", EndBlock->getParent(), EndBlock);
+
+  BasicBlock *SVELoopMismatchBlock = BasicBlock::Create(
+      Ctx, "mismatch_sve_loop_found", EndBlock->getParent(), EndBlock);
+
+  BasicBlock *LoopPreHeaderBlock = BasicBlock::Create(
+      Ctx, "mismatch_loop_pre", EndBlock->getParent(), EndBlock);
+
+  BasicBlock *LoopStartBlock =
+      BasicBlock::Create(Ctx, "mismatch_loop", EndBlock->getParent(), EndBlock);
+
+  BasicBlock *LoopIncBlock = BasicBlock::Create(
+      Ctx, "mismatch_loop_inc", EndBlock->getParent(), EndBlock);
+
+  DTU.applyUpdates({{DominatorTree::Insert, Preheader, MinItCheckBlock},
+                    {DominatorTree::Delete, Preheader, EndBlock}});
+
+  // Update LoopInfo with the new SVE & scalar loops.
+  auto SVELoop = LI->AllocateLoop();
+  auto ScalarLoop = LI->AllocateLoop();
+  if (CurLoop->getParentLoop()) {
+    CurLoop->getParentLoop()->addChildLoop(SVELoop);
+    CurLoop->getParentLoop()->addChildLoop(ScalarLoop);
+  } else {
+    LI->addTopLevelLoop(SVELoop);
+    LI->addTopLevelLoop(ScalarLoop);
+  }
+
+  // Add the new basic blocks to their associated loops.
+  SVELoop->addBasicBlockToLoop(MinItCheckBlock, *LI);
+  SVELoop->addBasicBlockToLoop(MemCheckBlock, *LI);
+  SVELoop->addBasicBlockToLoop(SVELoopPreheaderBlock, *LI);
+  SVELoop->addBasicBlockToLoop(SVELoopStartBlock, *LI);
+  SVELoop->addBasicBlockToLoop(SVELoopIncBlock, *LI);
+  SVELoop->addBasicBlockToLoop(SVELoopMismatchBlock, *LI);
+
+  ScalarLoop->addBasicBlockToLoop(LoopPreHeaderBlock, *LI);
+  ScalarLoop->addBasicBlockToLoop(LoopStartBlock, *LI);
+  ScalarLoop->addBasicBlockToLoop(LoopIncBlock, *LI);
+
+  // Set up some types and constants that we intend to reuse.
+  Type *I64Type = Builder.getInt64Ty();
+
+  // Check the zero-extended iteration count > 0
+  Builder.SetInsertPoint(MinItCheckBlock);
+  Value *ExtStart = Builder.CreateZExt(Start, I64Type);
+  Value *ExtEnd = Builder.CreateZExt(MaxLen, I64Type);
+  // This check doesn't really cost us very much.
+
+  Value *LimitCheck = Builder.CreateICmpULE(Start, MaxLen);
+  BranchInst *MinItCheckBr =
+      BranchInst::Create(MemCheckBlock, LoopPreHeaderBlock, LimitCheck);
+  MinItCheckBr->setMetadata(
+      LLVMContext::MD_prof,
+      MDBuilder(MinItCheckBr->getContext()).createBranchWeights(99, 1));
+  Builder.Insert(MinItCheckBr);
+
+  DTU.applyUpdates(
+      {{DominatorTree::Insert, MinItCheckBlock, MemCheckBlock},
+       {DominatorTree::Insert, MinItCheckBlock, LoopPreHeaderBlock}});
+
+  // For each of the arrays, check the start/end addresses are...
[truncated]

github-actions bot commented Nov 14, 2023

✅ With the latest revision this PR passed the C/C++ code formatter.


using namespace PatternMatch;

// The incoming value to the PHI node from the loop should be an add of 1.
Collaborator:

Can we use one of the Loop APIs for this? For example:

`PHINode *getCanonicalInductionVariable() const` — check to see if the loop has a canonical induction variable: an integer recurrence that starts at 0 and increments by one each time through the loop.

Contributor Author:

Sadly not! That's because our induction variable does not start at 0.


using namespace llvm;

#define DEBUG_TYPE "aarch64-lit"
Collaborator:

Bikeshedding: `lit` as a name could perhaps be confusing. Something more verbose but descriptive with "loopidiom" in it would probably have my preference.

Contributor Author:

I've updated the DEBUG_TYPE to use something more descriptive. I wasn't sure if we should also update the flags, since I was worried the names would be too long!

Contributor Author:

Sorry, I just realised I have lost this change somehow. I'll fix it.

// br i1 %cmp.not, label %while.end, label %while.body
//
auto CondBBInsts = LoopBlocks[0]->instructionsWithoutDebug();
if (std::distance(CondBBInsts.begin(), CondBBInsts.end()) > 4)
@sjoerdmeijer (Collaborator) commented Nov 16, 2023:

Counting the number of instructions might be a bit too "specific". Do we not need to inspect the supported instructions instead?

Contributor Author:

This is more like a quick initial filter of loops we don't care about. The code does inspect uses of the index, the GEPs, loads and comparisons.

Contributor:

Is there a Godbolt link for the C code leading to these instructions? (I couldn't reproduce the same sequence for compare_bytes_simple.) I know this is to discriminate specific loop structures so that the later pattern matching is more likely to succeed, but this looks very specific to me too, and may be subject to quick changes in the future. Would dropping this lead to bottlenecks?

Contributor Author:

Sure. https://godbolt.org/z/e5MnnGT84

As you can see, the first block corresponds to this — while (++start != end) — so the block is likely to be tiny. It's possible the compiler could increment start before the loop, in which case it will look more like while (start++ != end). We've seen the former idiom in xz and 7zip. In either case the first block will have 4 or fewer instructions. Even if we permit more instructions we still have to match the right pattern, and it would likely fail anyway because we would have more than the expected uses of a value. We also couldn't, for example, permit random stores in the loop to account for the extra instructions.
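To make the shape concrete, here is a sketch of the pre-increment form mentioned above (a hypothetical function mirroring the xz/7zip-style idiom, not code from the patch). Note that the pre-increment means the element at the initial value of start is never compared:

```c
/* Pre-increment form: the index is bumped before the loads, so the
 * element at the initial value of `start` is never compared. */
static unsigned pre_inc_mismatch(const unsigned char *p1,
                                 const unsigned char *p2,
                                 unsigned start, unsigned end) {
  while (++start != end) {
    if (p1[start] != p2[start])
      break;
  }
  return start;
}
```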

I'm a bit reluctant to expand the algorithm to cover hypothetical cases that we can't prove exist. I agree it's a good idea to make it robust based on real-world examples. Do you have any examples of other specific variations of this idiom that you are worried about?

Contributor Author:

It's worth noting that the LoopIdiomRecognize pass also uses std::distance as a way to filter out loops.

For what it's worth, I do have downstream code that increases the number of idioms we recognise. This led to me finding a number of bugs, so the exercise was very useful! However, I'd prefer to land this patch first and build upon this incrementally. I can create a patch later on that adds these extra idioms once I've done sufficient testing. I think the idioms in this patch have the biggest impact in terms of performance.

The extra idioms involve: 1) supporting 64-bit induction variable types, 2) permitting add increments without nuw and nsw flags, 3) supporting loops that increment the counter after loading the values (as opposed to currently incrementing before loading).
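For illustration, variants 1) and 3) combined might look like the following in source form (a hypothetical sketch; not necessarily the exact shape the follow-up patch will match):

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical source-level shape of variants 1) and 3) above: a 64-bit
// induction variable, with the counter incremented *after* loading and
// comparing the values. Unlike the pre-increment form currently
// matched, this variant also compares the bytes at the initial index.
uint64_t firstMismatchPost(const unsigned char *p1, const unsigned char *p2,
                           uint64_t start, uint64_t end) {
  while (start != end) {
    if (p1[start] != p2[start])
      break;
    ++start;
  }
  return start;
}
```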

Contributor

Sorry for replying only now, and thanks for the explanation. If I understand correctly, the snippet you provided is not meant to be transformed, as the indices of the two GEPs do not match, so we would bail out. I'm confused, though: why shouldn't the following snippet be supported (as per the comment here)? Doesn't it match what the pass expects (the index is simply incremented before loading the values), or am I missing something?

int test(unsigned char *p1, unsigned char *p2, unsigned start, unsigned end) {
  while (++start != end) {
    if (p1[start] != p2[start])
      break;
  }
  return start;
}

Also, if we were to support 1), 2), 3) we would still not be catching samples like https://godbolt.org/z/vPGqde4zj, due to the and on the induction variable (not sure why it has been converted to an i64 here, but I guess it's an orthogonal problem – provided we want to catch this).

Contributor Author

Oh, my apologies. By "the snippet you provided" are you referring to the godbolt link I posted? If so, I made a mistake there and it should be

int test(unsigned char *p1, unsigned char *p2, unsigned start, unsigned end) {
  while (++start != end) {
    if (p1[start] != p2[start])
      break;
  }
  return start;
}

For the new godbolt link you provided, you're right that the IR contains an add of 1 and an and instruction that isn't currently recognised. Again, the purpose of this pass is not to find as many loops as possible that we can recognise, but to focus on those that are of genuine value to users. Adding support for loops like the one you posted would add a lot more complexity, and expandFindMismatch would need to generate code for all flavours of loops, such as potentially adding the add and and instructions in your example.

I have identified some more loops that we can recognise in the xz benchmark and have a WIP patch downstream, but these only give us tiny gains and the codegen is also poor due to LoopStrengthReduce making some poor choices for the SVE loops. Once I can be sure that recognising these extra loops is going to be a definite win I can follow up with another patch that adds them.

Contributor

Thank you. I'm still slightly unsure about the logic here (between 336-368), which seems to prevent transformation of simple loops such as the aforementioned one, since the incoming values of the phi are not the same (which should be valid), but it's fine if this is going to be added/reviewed at a later stage.

Contributor Author

This logic is designed to prevent recognising loops such as @compare_bytes_simple2, i.e.

while.cond:
...
  br i1 %cmp.not, label %while.end, label %while.body

while.body:
...
  br i1 %cmp.not2, label %while.cond, label %while.end

while.end:
...
  %final_ptr = phi ptr [ %c, %while.body ], [ %d, %while.cond ]

where the found and end blocks are the same, and that block contains a PHI node with a unique incoming value for each block in the loop. However, neither incoming value is defined in the loop itself. Currently we don't handle this case in transformByteCompare because it would require extra work in the byte.compare block to select between these two incoming values. Rather than let transformByteCompare generate incorrect code and lead to bugs, I've explicitly disabled these cases for now. I haven't encountered any examples of such loops in the benchmarks I have looked at, so these are not critical to us. I only discovered this accidentally while writing more test cases.
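A source-level shape that can produce this rejected pattern might look like the following (my own reconstruction from the IR above, so treat the details as an assumption):

```cpp
#include <cassert>

// Both exit edges lead to the same block, and the PHI there selects
// between c and d - two values defined *outside* the loop - based
// purely on which exit edge was taken. Transforming this would require
// an extra select in the byte.compare block.
const char *pickByExit(const unsigned char *p1, const unsigned char *p2,
                       unsigned start, unsigned end,
                       const char *c, const char *d) {
  while (++start != end) {
    if (p1[start] != p2[start])
      return c; // exit taken from the loop body
  }
  return d;     // exit taken from the loop header
}
```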

// At this point we know two things must be true:
// 1. Start <= End
// 2. ExtMaxLen <= 4096 due to the page checks.
// Therefore, we know that we can use a 64-bit induction variable that
Collaborator

I am missing some higher level descriptions of the transformation that we are exactly trying to achieve, and under which conditions this is a valid thing to do. For example, I don't understand yet the purpose of the page size check.

Contributor Author

I've added comments at the top of the file, as well as detailed comments further up explaining how the page size is used to generate runtime memory checks. These ensure we don't fault midway through a vector load due to reading ahead.
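My reading of that check, as a sketch (the names are mine, not the pass's): a speculative vector load cannot fault partway through as long as its first and last byte land on the same minimum-sized page, because the first byte is known to be dereferenceable.

```cpp
#include <cassert>
#include <cstdint>

// Sketch of the runtime memory check: if [Ptr, Ptr + VL) stays within a
// single page of at least MinPageSize bytes, a vector load starting at
// Ptr cannot fault mid-load when the byte at Ptr is dereferenceable.
// When the pages differ, the pass falls back to a scalar loop instead.
bool mayCrossPage(uint64_t Ptr, uint64_t VL, uint64_t MinPageSize) {
  uint64_t StartPage = Ptr / MinPageSize;
  uint64_t EndPage = (Ptr + VL - 1) / MinPageSize;
  return StartPage != EndPage;
}
```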

@xgupta
Contributor

xgupta commented Nov 21, 2023

Sorry, I was supposed to push to my repo, but changes got into this pull request, I will revert it immediately.

* Added option to override the target's minimum page size.
* Added more comments/documentation explaining what the pass
is doing.
* Rewrote the code for detecting the indices.

david-arm and others added 2 commits December 12, 2023 09:53
As an experiment I tried adding support for more loop idioms.
As a result of greater test coverage I also discovered issues with
how we create the loops, which led to me fixing these bugs:

* The loops were not in the expected LCSSA format, which I have
now fixed. I've added an option for testing that allows us to
verify the loops after we create them.
* The scalar fallback loop could potentially read one byte past the
end of the array. Curiously it would always have given a correct
result, but had the potential to seg fault.
* I also added code to reject an unsupported case that would have
led to incorrect code generation. I have added a test for this.

I also amended some comments and cleaned up some unused variables.

I haven't added support for more idioms in this patch yet, since
I would like to do that in a follow-on patch after I've had chance
to do more testing. It would be good to build on this incrementally.
Value *RhsPageCmp = Builder.CreateICmpNE(RhsStartPage, RhsEndPage);

Value *CombinedPageCmp = Builder.CreateOr(LhsPageCmp, RhsPageCmp);
BranchInst *CombinedPageCmpCmpBr = BranchInst::Create(
Collaborator

Did you consider having this set the maximum number of iterations for the new loop, as opposed to just skipping it? Or does that lead to worse performance?

Contributor Author

So a while ago I did try writing a simple SVE ACLE implementation that processed the data in chunks, constantly walking up to the nearest page boundary in the SVE loop and then restarting again. The performance wasn't any better (presumably due to the extra calculations of the end index) and I figured it would add even more complexity to expandFindMismatch, so I stuck with this approach for now. But I'm happy to be proved wrong and this is certainly something we could revisit.
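For what it's worth, the chunked alternative described above could be sketched like this (my names and structure, assuming a known minimum page size):

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>

// Sketch of the rejected chunked approach: instead of a scalar fallback
// when the range may cross a page, clamp each run of the vector loop to
// the next page boundary and restart from there. Computing this end
// index on every chunk is the extra work suspected of eating the gains.
uint64_t nextChunkEnd(uint64_t Base, uint64_t Idx, uint64_t End,
                      uint64_t MinPageSize) {
  uint64_t ToBoundary = MinPageSize - ((Base + Idx) % MinPageSize);
  return std::min(End, Idx + ToBoundary);
}
```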

* Bail out of transformation if optimising for size
* Simple code tidy-ups
* Add support for incrementing the index with wrap flags
Collaborator

@davemgreen left a comment

Thanks. I think it is worth trying to get this in. I already see it triggering in a number of places, it might be worth working on making it a little more generic in followup patches if we can, but there is already quite a bit going on.


// In AArch64LoopIdiomTransform::run we have already checked that the loop
// has a preheader so we can assume it's in a canonical form.
if (CurLoop->getNumBackEdges() != 1 || CurLoop->getNumBlocks() != 2)
Collaborator

I suspect getNumBackEdges == 1 is probably redundant with the other checks, but it sounds OK to keep to be careful.

Contributor Author

Yeah, this is typically what LoopIdiomRecognize.cpp does as well for many of the loop idioms it analyses.


Value *IdxA = GEPA->getOperand(GEPA->getNumIndices());
Value *IdxB = GEPB->getOperand(GEPB->getNumIndices());
if (IdxA != IdxB || !match(IdxA, m_ZExt(m_Specific(Index))))
Collaborator

It might be nice to generalize this to values that are not zext, but that is likely done in a separate patch. There is already quite a bit going on in this one.


// The index is incremented before the GEP/Load pair so we need to
// add 1 to the start value.
transformByteCompare(GEPA, GEPB, PN, MaxLen, Index, StartIdx, /*IncIdx=*/true,
Collaborator

Is IncIdx always true for later use?

Contributor Author

Yeah, the original plan was to support both forms of the loop:

  1. First loop block: increment the index, second block: compare bytes.
  2. First loop block: compare bytes, second block: increment the index.

IncIdx=true refers to 1), but in a follow-on patch I intend to add support for 2) as well. I did consider deleting the function argument in this first patch because I know it looks odd, but wasn't sure if there is much point given I plan to use this later.

CurLoop->getParentLoop()->addBasicBlockToLoop(CmpBB, *LI);

// Update the dominator tree with the new block.
DT->addNewBlock(CmpBB, Preheader);
Collaborator

Do CmpBB->EndBB / FoundBB edges need to be added? It's hard to tell with so many edges being added/removed! Is it worth adding DT verification to the checks below?

Contributor Author

I've tried to tighten up the DominatorTree updates and I've added a flag to one of the test file RUN lines to verify the tree.

DTU.applyUpdates({{DominatorTree::Insert, SVELoopMismatchBlock, EndBlock}});

// Generate code for scalar loop.
Builder.SetInsertPoint(LoopPreHeaderBlock);
Collaborator

Why is the scalar loop recreated as opposed to reused? Is it the same but rotated?

Contributor Author

Yeah, it's the same but we've already incremented the value by 1 before entering the loop. I guess that's what you mean by rotated? You're right in theory that we could have reused the existing loop, but given we wanted to increment the index before the vector loop I thought it was good to be consistent with the scalar loop as well.


// If no mismatch was found, we can jump to the end block. Create a
// new basic block for the compare instruction.
auto *CmpBB = BasicBlock::Create(Preheader->getContext(), "byte.compare",
Collaborator

Is CmpBB needed, or is this really part of the end of PreHeader, after the split?

Contributor Author

Given the way that expandFindMismatch funnels all paths from both the scalar and vector loops into a single mismatch.end block, we then need a final byte.compare block to check whether the result is the end or not. That decides which successor block to jump to - FoundBB or EndBB. I admit for the case when FoundBB == EndBB we don't need to do this, but I think it's simpler just to be consistent with the FoundBB != EndBB case.

Collaborator

I think my point was that the (old) preheader branches unconditionally to CmpBB, and CmpBB branches to EndBB/FoundBB. The preheader could branch to EndBB/FoundBB directly. The condition for the preheader is Builder.CreateCondBr(Builder.getTrue(), CmpBB, Header); though, so it might make sense to keep it as is and let simplify-cfg clean up the branches.

* Do more work to ensure DominatorTree is up to date.
* Add more checks to ensure only one use of the index PHI -
transformByteCompare doesn't currently handle this case as it
would require subtracting one from the result. It's not a
typical use case for loops in this form.
* General clean-up.
* Add the hasOptSize check back in as it was somehow lost in
an intermediate commit.
Collaborator

@davemgreen left a comment

Thanks for the updates. From what I can tell this LGTM, but it will need a rebase.

You might want to commit it with the option disabled, and then flip the switch in a followup to avoid the commit-revert cycles in case there are any issues.



* Fixed an issue after rebase with change in
registerPassBuilderCallbacks definition.
* Disabled the new pass by default - we can enable this in
a later patch.
@david-arm david-arm merged commit c714846 into llvm:main Jan 9, 2024
@dyung
Collaborator

dyung commented Jan 9, 2024

Hi @david-arm, the test you added, byte-compare-index.ll, seems to be failing on several bots - can you take a look?

For example on https://lab.llvm.org/buildbot/#/builders/139/builds/56740:

******************** TEST 'LLVM :: Transforms/LoopIdiom/AArch64/byte-compare-index.ll' FAILED ********************
Exit Code: 2
Command Output (stderr):
--
RUN: at line 2: /home/buildbot/buildbot-root/llvm-clang-x86_64-sie-ubuntu-fast/build/bin/opt -aarch64-lit -disable-aarch64-lit-all=false -aarch64-lit-verify -verify-dom-info -mtriple aarch64-unknown-linux-gnu -mattr=+sve -S < /home/buildbot/buildbot-root/llvm-clang-x86_64-sie-ubuntu-fast/llvm-project/llvm/test/Transforms/LoopIdiom/AArch64/byte-compare-index.ll | /home/buildbot/buildbot-root/llvm-clang-x86_64-sie-ubuntu-fast/build/bin/FileCheck /home/buildbot/buildbot-root/llvm-clang-x86_64-sie-ubuntu-fast/llvm-project/llvm/test/Transforms/LoopIdiom/AArch64/byte-compare-index.ll
+ /home/buildbot/buildbot-root/llvm-clang-x86_64-sie-ubuntu-fast/build/bin/FileCheck /home/buildbot/buildbot-root/llvm-clang-x86_64-sie-ubuntu-fast/llvm-project/llvm/test/Transforms/LoopIdiom/AArch64/byte-compare-index.ll
+ /home/buildbot/buildbot-root/llvm-clang-x86_64-sie-ubuntu-fast/build/bin/opt -aarch64-lit -disable-aarch64-lit-all=false -aarch64-lit-verify -verify-dom-info -mtriple aarch64-unknown-linux-gnu -mattr=+sve -S
opt: Unknown command line argument '-aarch64-lit'.  Try: '/home/buildbot/buildbot-root/llvm-clang-x86_64-sie-ubuntu-fast/build/bin/opt --help'
opt: Did you mean '--march'?
opt: Unknown command line argument '-disable-aarch64-lit-all=false'.  Try: '/home/buildbot/buildbot-root/llvm-clang-x86_64-sie-ubuntu-fast/build/bin/opt --help'
opt: Did you mean '--disable-sched-stalls=false'?
opt: Unknown command line argument '-aarch64-lit-verify'.  Try: '/home/buildbot/buildbot-root/llvm-clang-x86_64-sie-ubuntu-fast/build/bin/opt --help'
opt: Did you mean '--tail-dup-verify'?
FileCheck error: '<stdin>' is empty.
FileCheck command line:  /home/buildbot/buildbot-root/llvm-clang-x86_64-sie-ubuntu-fast/build/bin/FileCheck /home/buildbot/buildbot-root/llvm-clang-x86_64-sie-ubuntu-fast/llvm-project/llvm/test/Transforms/LoopIdiom/AArch64/byte-compare-index.ll
--

@david-arm
Contributor Author

Hi @dyung, sorry about this! It passed for me locally. It sounds like it needs a REQUIRES: aarch64-registered-target somewhere then.

I'll try to fix it asap.

@david-arm
Contributor Author

@dyung - fix pending here #77467


david-arm added a commit to david-arm/llvm-project that referenced this pull request Jan 9, 2024
Following on from

llvm#72273

which added the new AArch64 loop idiom transformation pass,
this patch enables the pass by default for AArch64.
david-arm added a commit that referenced this pull request Jan 10, 2024
Following on from

#72273

which added the new AArch64 loop idiom transformation pass, this patch
enables the pass by default for AArch64.
@david-arm david-arm deleted the aarch64_idiom branch January 10, 2024 10:07
justinfargnoli pushed a commit to justinfargnoli/llvm-project that referenced this pull request Jan 28, 2024

Original version of patch was posted on Phabricator:

 https://reviews.llvm.org/D158291

Patch co-authored by Kerry McLaughlin (@kmclaughlin-arm) and David
Sherwood (@david-arm)

See the original discussion on Discourse:

https://discourse.llvm.org/t/aarch64-target-specific-loop-idiom-recognition/72383
justinfargnoli pushed a commit to justinfargnoli/llvm-project that referenced this pull request Jan 28, 2024
Following on from

llvm#72273

which added the new AArch64 loop idiom transformation pass, this patch
enables the pass by default for AArch64.
@shaojingzhi
Contributor

Hi! I wonder whether you have conducted any tests to determine the potential performance increase from this pass on the SPEC2017 557.xz benchmark? I attempted to apply it to the xz benchmark, but only the single-copy run (--copies=1) demonstrated a significant increase (about 3%); there was no increase when I set --copies=128 or higher. Do you have any suggestions or test results that you could share?

@david-arm
Contributor Author

> Hi! I wonder whether you have conducted any tests to determine the potential performance increase from this pass on the SPEC2017 557.xz benchmark? I attempted to apply it to the xz benchmark, but only the single-copy run (--copies=1) demonstrated a significant increase (about 3%); there was no increase when I set --copies=128 or higher. Do you have any suggestions or test results that you could share?

The most significant gains with xz have already been achieved when #77480 and #77480 landed, which improved performance by 6-7% for neoverse-v1. This PR is an NFC refactoring patch, so it won't improve performance further. My follow-on patch (not yet posted) will trigger more cases in xz, but I don't expect any substantial performance gains there. The main goal of extending this pass further is to improve code coverage and testing, and hopefully there will be other applications besides xz that benefit too.
