[IR] Use block numbers in PredIteratorCache #101885

Open: wants to merge 1 commit into main
Conversation

Contributor

@aengelke aengelke commented Aug 4, 2024

Avoid using a DenseMap of BasicBlocks when a vector is sufficient. This only has a few users, so gains are low, but so is the effort.

There's no longer a separate map which only holds sizes. All users that used size() also use get() afterwards, so it's always beneficial to compute both.

c-t-t -0.08% in stage2-O3
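The core idea can be sketched outside LLVM with simplified stand-in types (a minimal sketch: `Block`, `PredCache`, and the `MaxBlockNumber` parameter below are hypothetical analogues, not the actual LLVM classes). Predecessor lists are cached in a vector indexed by a dense per-function block number, with an "uncached" sentinel per slot instead of a pointer-keyed map lookup; the real patch uses an `ArrayRef` with null data as that sentinel, modeled here with `std::optional`:

```cpp
#include <cassert>
#include <cstddef>
#include <optional>
#include <vector>

// Hypothetical stand-in for llvm::BasicBlock: a dense per-function number
// plus the list of predecessor blocks.
struct Block {
  unsigned Number;
  std::vector<Block *> Preds;
};

class PredCache {
  // One slot per block number; nullopt means "not cached yet" (the actual
  // patch marks this with an ArrayRef whose data pointer is null).
  std::vector<std::optional<std::vector<Block *>>> Storage;

public:
  const std::vector<Block *> &get(const Block &BB, unsigned MaxBlockNumber) {
    // Grow lazily to the function's block-number space on first out-of-range
    // query, mirroring Storage.resize(getMaxBlockNumber()) in the patch.
    if (BB.Number >= Storage.size())
      Storage.resize(MaxBlockNumber);
    auto &Slot = Storage[BB.Number];
    if (!Slot)
      Slot = BB.Preds; // fill on first query; later queries hit the cache
    return *Slot;
  }

  // size() is computed from the cached list, so there is no separate
  // count map -- callers of size() use get() afterwards anyway.
  std::size_t size(const Block &BB, unsigned MaxBlockNumber) {
    return get(BB, MaxBlockNumber).size();
  }

  void clear() { Storage.clear(); }
};
```

Compared to the `DenseMap` version, a cache hit is one bounds check plus one indexed load, at the cost of allocating one slot per block in the function even if only a few blocks are ever queried.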

@aengelke aengelke requested a review from nikic August 4, 2024 12:31
@llvmbot llvmbot added the llvm:ir label Aug 4, 2024
@llvmbot
Member

llvmbot commented Aug 4, 2024

@llvm/pr-subscribers-llvm-ir

Author: Alexis Engelke (aengelke)

Changes


Full diff: https://github.com/llvm/llvm-project/pull/101885.diff

1 file affected:

  • (modified) llvm/include/llvm/IR/PredIteratorCache.h (+25-35)
diff --git a/llvm/include/llvm/IR/PredIteratorCache.h b/llvm/include/llvm/IR/PredIteratorCache.h
index fc8cf20e9f759..c9603c870820b 100644
--- a/llvm/include/llvm/IR/PredIteratorCache.h
+++ b/llvm/include/llvm/IR/PredIteratorCache.h
@@ -25,52 +25,42 @@ namespace llvm {
 /// predecessor iterator queries.  This is useful for code that repeatedly
 /// wants the predecessor list for the same blocks.
 class PredIteratorCache {
-  /// BlockToPredsMap - Pointer to null-terminated list.
-  mutable DenseMap<BasicBlock *, BasicBlock **> BlockToPredsMap;
-  mutable DenseMap<BasicBlock *, unsigned> BlockToPredCountMap;
+  /// Storage, indexed by block number.
+  SmallVector<ArrayRef<BasicBlock *>> Storage;
+  /// Block number epoch to guard against renumberings.
+  unsigned BlockNumberEpoch;
 
   /// Memory - This is the space that holds cached preds.
   BumpPtrAllocator Memory;
 
-private:
-  /// GetPreds - Get a cached list for the null-terminated predecessor list of
-  /// the specified block.  This can be used in a loop like this:
-  ///   for (BasicBlock **PI = PredCache->GetPreds(BB); *PI; ++PI)
-  ///      use(*PI);
-  /// instead of:
-  /// for (pred_iterator PI = pred_begin(BB), E = pred_end(BB); PI != E; ++PI)
-  BasicBlock **GetPreds(BasicBlock *BB) {
-    BasicBlock **&Entry = BlockToPredsMap[BB];
-    if (Entry)
-      return Entry;
-
-    SmallVector<BasicBlock *, 32> PredCache(predecessors(BB));
-    PredCache.push_back(nullptr); // null terminator.
+public:
+  size_t size(BasicBlock *BB) { return get(BB).size(); }
+  ArrayRef<BasicBlock *> get(BasicBlock *BB) {
+#ifndef NDEBUG
+    // In debug builds, verify that no renumbering has occurred.
+    if (Storage.empty())
+      BlockNumberEpoch = BB->getParent()->getBlockNumberEpoch();
+    else
+      assert(BlockNumberEpoch == BB->getParent()->getBlockNumberEpoch() &&
+             "Blocks renumbered during lifetime of PredIteratorCache");
+#endif
 
-    BlockToPredCountMap[BB] = PredCache.size() - 1;
+    if (LLVM_LIKELY(BB->getNumber() < Storage.size()))
+      if (auto Res = Storage[BB->getNumber()]; Res.data())
+        return Res;
 
-    Entry = Memory.Allocate<BasicBlock *>(PredCache.size());
-    std::copy(PredCache.begin(), PredCache.end(), Entry);
-    return Entry;
-  }
+    if (BB->getNumber() >= Storage.size())
+      Storage.resize(BB->getParent()->getMaxBlockNumber());
 
-  unsigned GetNumPreds(BasicBlock *BB) const {
-    auto Result = BlockToPredCountMap.find(BB);
-    if (Result != BlockToPredCountMap.end())
-      return Result->second;
-    return BlockToPredCountMap[BB] = pred_size(BB);
-  }
-
-public:
-  size_t size(BasicBlock *BB) const { return GetNumPreds(BB); }
-  ArrayRef<BasicBlock *> get(BasicBlock *BB) {
-    return ArrayRef(GetPreds(BB), GetNumPreds(BB));
+    SmallVector<BasicBlock *, 32> PredCache(predecessors(BB));
+    BasicBlock **Data = Memory.Allocate<BasicBlock *>(PredCache.size());
+    std::copy(PredCache.begin(), PredCache.end(), Data);
+    return Storage[BB->getNumber()] = ArrayRef(Data, PredCache.size());
   }
 
   /// clear - Remove all information.
   void clear() {
-    BlockToPredsMap.clear();
-    BlockToPredCountMap.clear();
+    Storage.clear();
     Memory.Reset();
   }
 };

Contributor

@nikic nikic left a comment


Somewhat skeptical about this one. It means that PredIteratorCache should not be used in cases where only a small fraction of blocks will be queried.

The change to store ArrayRef instead of having two separate maps looks good to me though.

@aengelke
Contributor Author

aengelke commented Aug 4, 2024

where only a small fraction of blocks will be queried.

This boils down to the question: when is a vector faster than a DenseMap? This depends on:

  • number of inserted elements (I)
  • total element space (vector only) (N)
  • element size (16 bytes in this case)

I made a small benchmark (dmvsvec.txt -- clang++ -O3 -DNDEBUG -fno-exceptions -fno-rtti) that inserts I elements (with random numbers in 0..N-1) into a map/vector (vector preallocated to size N) and then probes all I elements. Not a rigorous benchmark -- single machine only, just to get an intuition of where the boundaries are. Key results:

  • For I=1, the vector is better than the map for N<70 (2%, so the map is better if less than 2% of the elements are inserted+probed)
  • For I=5, the break even is at N~80 (6%)
  • For I=10, the break even is at N~110 (9%)
  • For I=20, the break even is at N~160 (12%)
  • For I=40, the break even is at N~270 (15%)
  • For I=150, the break even is at N~2000 (8%) -- with many insertions, once the DenseMap has to grow to larger sizes, the vector becomes more beneficial again.

For PredIteratorCache, there are likely multiple probes, so the vector is preferable in even more cases. (E.g., I=5, and 5 probes instead of one => N~100).

I have no statistics about the PIC users, but I think that assuming they probe >5% of the basic blocks of a function is somewhat reasonable?
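The insert+probe pattern described above can be sketched as follows (a sketch under assumptions: `std::unordered_map` stands in for `DenseMap`, since the actual dmvsvec.txt benchmark is not shown, and the timing loops are elided -- only the access pattern and a consistency check remain):

```cpp
#include <cassert>
#include <cstdlib>
#include <unordered_map>
#include <vector>

// ~16 bytes per element, matching the element size discussed in the PR.
struct Entry {
  void *Ptr;
  unsigned Size;
};

// Insert I elements with random keys in 0..N-1 into both containers, then
// probe every inserted key. Returns true iff the map and the preallocated
// vector agree on all probes. (A real benchmark would time the two phases
// per container separately.)
bool insertAndProbe(unsigned N, unsigned I) {
  std::unordered_map<unsigned, Entry> Map;       // stand-in for DenseMap
  std::vector<Entry> Vec(N, Entry{nullptr, 0});  // preallocated to size N

  std::vector<unsigned> Keys;
  for (unsigned i = 0; i < I; ++i) {
    unsigned K = std::rand() % N;  // random key in 0..N-1
    Keys.push_back(K);
    Entry E{&Vec, K};
    Map[K] = E;  // map insert: hash + probe + possible rehash on growth
    Vec[K] = E;  // vector insert: a single indexed store
  }

  // Probe phase: both containers must return the same entries.
  for (unsigned K : Keys)
    if (Vec[K].Ptr == nullptr || Map.at(K).Size != Vec[K].Size)
      return false;
  return true;
}
```

The trade-off being measured: the vector pays O(N) up-front initialization but each access is a direct index, while the map pays per-access hashing and occasional growth but only touches O(I) memory.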

@nikic
Contributor

nikic commented Aug 4, 2024

I have no statistics about the PIC users, but I think that they probe >5% of the basic blocks of a function is a somewhat reasonable assumption?

Not really. For example, LCSSA formation uses PredIteratorCache for exit blocks, so it probably has only one entry when forming LCSSA for a single loop? It could be pushed to a higher level and be used for e.g. full recursive LCSSA formation, but that's how it is right now. The use in scalar promotion is for all loop blocks using a pointer, which may also be only a tiny fraction of all blocks.

The data does suggest that it does work out fine on average, but I'm concerned about degenerate cases.

@aengelke
Contributor Author

aengelke commented Aug 4, 2024

(Meh, didn't intend to push my experiment to this branch... just wanted to get performance results (perf/...) for a two-level scheme with a BitVector to reduce memset overhead.)

Edit: removed commit, memset is cheaper than BitVector lookup (c-t-t).

nikic added a commit to nikic/llvm-project that referenced this pull request Aug 5, 2024
Use a single map storing ArrayRef instead of two separate maps
for the base pointer and size. This saves one map lookup per
cache access.

This is an alternative to llvm#101885.
nikic added a commit that referenced this pull request Aug 5, 2024
Use a single map storing ArrayRef instead of two separate maps for the
base pointer and size. This saves one map lookup per cache access.

This is an alternative to
#101885 with a similar
compile-time improvement.
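That single-map alternative can be sketched with simplified standalone types (a sketch under assumptions: `std::unordered_map` and a hand-rolled `BlockArrayRef` stand in for LLVM's `DenseMap` and `ArrayRef`; all names below are hypothetical):

```cpp
#include <cassert>
#include <cstddef>
#include <deque>
#include <unordered_map>
#include <vector>

// Hypothetical stand-in for llvm::BasicBlock.
struct Block {
  std::vector<Block *> Preds;
};

// Minimal ArrayRef-like non-owning view: pointer + length.
struct BlockArrayRef {
  Block *const *Data = nullptr;
  std::size_t Size = 0;
};

class PredCacheSingleMap {
  // One map from block to (pointer, size) view -- replacing the two maps
  // for base pointer and count, so each query does one lookup, not two.
  std::unordered_map<const Block *, BlockArrayRef> Cache;
  // Owns the copied predecessor lists; a deque never relocates existing
  // elements, so the Data pointers in Cache stay valid as it grows.
  std::deque<std::vector<Block *>> OwnedStorage;

public:
  BlockArrayRef get(const Block &BB) {
    auto [It, Inserted] = Cache.try_emplace(&BB);
    if (Inserted) {
      OwnedStorage.push_back(BB.Preds);  // copy preds into owned storage
      It->second = {OwnedStorage.back().data(), OwnedStorage.back().size()};
    }
    return It->second;
  }

  std::size_t size(const Block &BB) { return get(BB).Size; }
};
```

This keeps the map's O(queried blocks) memory behavior -- avoiding the degenerate case raised above where only a few of many blocks are queried -- while still getting the single-lookup improvement from storing the view as one value.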