Skip to content

TableGen: Optimize super-register class computation #134865

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 2 commits into from
Apr 10, 2025

Conversation

nhaehnle
Copy link
Collaborator

@nhaehnle nhaehnle commented Apr 8, 2025

Inferring super-register classes naively requires checking every
register class against every other register class and sub-register index.
Each of those checks is itself a non-trivial operation on register sets.

Culling as many (RC, RC, SubIdx) triples as possible is important for
the running time of TableGen for architectures with complex sub-register
relations.

Use transitivity to cull many (RC, RC, SubIdx) triples. This
unfortunately requires us to complete the transitive closure of
super-register classes explicitly, but it still cuts down the running
time on AMDGPU substantially -- in some upcoming work in the
backend by more than half (in very rough measurements).

This changes the names of some of the inferred register classes, since
the order in which they are inferred changes. The names of the inferred
register classes become shorter, which reduces the size of the generated
files.

Replacing some uses of SmallPtrSet by DenseSet shaves off a few more
percent; there are hundreds of register classes in AMDGPU.

Tweaking the topological signature check to skip reigsters without
super-registers further helps skip register classes that have "pseudo"
registers in them whose sub- and super-register structure is trivial.

nhaehnle added 2 commits April 8, 2025 16:38
Inferring register classes for getMatchingSuperRegClass via
CodeGenRegBank::inferMatchingSuperRegClass is very expensive since it
potentially iterates over all triple of (super class, sub class, sub
class idx). Filtering out triples early is crucial.

The filter based on topological signatures generally does a good job,
but it fails in the presence of "pseudo registers" like FP_REG in AMDGPU:
These registers don't participate in the usual subregister / superregister
relationships and end up with a topological signature of 0 because of
that. However, they appear in register classes whose "normal" registers
all have a non-0 signature. We then fail to filter out those register
classes.

Since the TopoSigs are currently only used for inferring
getMatchingSuperRegClass, we can explicitly skip registers that don't
have super-registers.

On some future work in the AMDGPU backend, this simple change reduces
the running time of `llvm-tblgen -gen-register-info` by ~20% (in an
extremely rough measurement).
Inferring super-register classes naively requires checking every
register class against every other register class and sub-register index.
Each of those checks is itself a non-trivial operation on register sets.

Culling as many (RC, RC, SubIdx) triples as possible is important for
the running time of TableGen for architectures with complex sub-register
relations.

Use transitivity to cull many (RC, RC, SubIdx) triples. This
unfortunately requires us to complete the transitive closure of
super-register classes explicitly, but it still cuts down the running
time on AMDGPU substantially.

Replacing some uses of SmallPtrSet by DenseSet shaves off a few more
percent; there are hundreds of register classes in AMDGPU.

This changes the names of some of the inferred register classes, since
the order in which they are inferred changes. The names of the inferred
register classes become shorter, which reduces the size of the generated
files.
@nhaehnle nhaehnle requested review from jayfoad, arsenm and Sisyph April 8, 2025 14:51
@llvmbot
Copy link
Member

llvmbot commented Apr 8, 2025

@llvm/pr-subscribers-llvm-regalloc

@llvm/pr-subscribers-tablegen

Author: Nicolai Hähnle (nhaehnle)

Changes

Inferring super-register classes naively requires checking every
register class against every other register class and sub-register index.
Each of those checks is itself a non-trivial operation on register sets.

Culling as many (RC, RC, SubIdx) triples as possible is important for
the running time of TableGen for architectures with complex sub-register
relations.

Use transitivity to cull many (RC, RC, SubIdx) triples. This
unfortunately requires us to complete the transitive closure of
super-register classes explicitly, but it still cuts down the running
time on AMDGPU substantially -- in some upcoming work in the
backend by more than half (in very rough measurements).

This changes the names of some of the inferred register classes, since
the order in which they are inferred changes. The names of the inferred
register classes become shorter, which reduces the size of the generated
files.

Replacing some uses of SmallPtrSet by DenseSet shaves off a few more
percent; there are hundreds of register classes in AMDGPU.

Tweaking the topological signature check to skip reigsters without
super-registers further helps skip register classes that have "pseudo"
registers in them whose sub- and super-register structure is trivial.


Full diff: https://github.com/llvm/llvm-project/pull/134865.diff

3 Files Affected:

  • (modified) llvm/utils/TableGen/Common/CodeGenRegisters.cpp (+182-20)
  • (modified) llvm/utils/TableGen/Common/CodeGenRegisters.h (+41-11)
  • (modified) llvm/utils/TableGen/RegisterBankEmitter.cpp (+2-2)
diff --git a/llvm/utils/TableGen/Common/CodeGenRegisters.cpp b/llvm/utils/TableGen/Common/CodeGenRegisters.cpp
index 7105ced26be1c..3a6e828a99f2d 100644
--- a/llvm/utils/TableGen/Common/CodeGenRegisters.cpp
+++ b/llvm/utils/TableGen/Common/CodeGenRegisters.cpp
@@ -16,6 +16,8 @@
 #include "llvm/ADT/BitVector.h"
 #include "llvm/ADT/DenseMap.h"
 #include "llvm/ADT/IntEqClasses.h"
+#include "llvm/ADT/PointerUnion.h"
+#include "llvm/ADT/PostOrderIterator.h"
 #include "llvm/ADT/STLExtras.h"
 #include "llvm/ADT/SetVector.h"
 #include "llvm/ADT/SmallPtrSet.h"
@@ -765,7 +767,8 @@ static void sortAndUniqueRegisters(CodeGenRegister::Vec &M) {
 CodeGenRegisterClass::CodeGenRegisterClass(CodeGenRegBank &RegBank,
                                            const Record *R)
     : TheDef(R), Name(std::string(R->getName())),
-      TopoSigs(RegBank.getNumTopoSigs()), EnumValue(-1), TSFlags(0) {
+      RegsWithSuperRegsTopoSigs(RegBank.getNumTopoSigs()), EnumValue(-1),
+      TSFlags(0) {
   GeneratePressureSet = R->getValueAsBit("GeneratePressureSet");
   std::vector<const Record *> TypeList = R->getValueAsListOfDefs("RegTypes");
   if (TypeList.empty())
@@ -791,7 +794,8 @@ CodeGenRegisterClass::CodeGenRegisterClass(CodeGenRegBank &RegBank,
     const CodeGenRegister *Reg = RegBank.getReg((*Elements)[i]);
     Members.push_back(Reg);
     Artificial &= Reg->Artificial;
-    TopoSigs.set(Reg->getTopoSig());
+    if (!Reg->getSuperRegs().empty())
+      RegsWithSuperRegsTopoSigs.set(Reg->getTopoSig());
   }
   sortAndUniqueRegisters(Members);
 
@@ -849,13 +853,14 @@ CodeGenRegisterClass::CodeGenRegisterClass(CodeGenRegBank &RegBank,
 CodeGenRegisterClass::CodeGenRegisterClass(CodeGenRegBank &RegBank,
                                            StringRef Name, Key Props)
     : Members(*Props.Members), TheDef(nullptr), Name(std::string(Name)),
-      TopoSigs(RegBank.getNumTopoSigs()), EnumValue(-1), RSI(Props.RSI),
-      CopyCost(0), Allocatable(true), AllocationPriority(0),
+      RegsWithSuperRegsTopoSigs(RegBank.getNumTopoSigs()), EnumValue(-1),
+      RSI(Props.RSI), CopyCost(0), Allocatable(true), AllocationPriority(0),
       GlobalPriority(false), TSFlags(0) {
   Artificial = true;
   GeneratePressureSet = false;
   for (const auto R : Members) {
-    TopoSigs.set(R->getTopoSig());
+    if (!R->getSuperRegs().empty())
+      RegsWithSuperRegsTopoSigs.set(R->getTopoSig());
     Artificial &= R->Artificial;
   }
 }
@@ -1173,6 +1178,28 @@ void CodeGenRegisterClass::buildRegUnitSet(
                    std::back_inserter(RegUnits));
 }
 
+// Combine our super classes of the given sub-register index with all of their
+// super classes in turn.
+void CodeGenRegisterClass::extendSuperRegClasses(CodeGenSubRegIndex *SubIdx) {
+  auto It = SuperRegClasses.find(SubIdx);
+  if (It == SuperRegClasses.end())
+    return;
+
+  SmallVector<CodeGenRegisterClass *> MidRCs;
+  MidRCs.insert(MidRCs.end(), It->second.begin(), It->second.end());
+
+  for (CodeGenRegisterClass *MidRC : MidRCs) {
+    for (auto &Pair : MidRC->SuperRegClasses) {
+      CodeGenSubRegIndex *ComposedSubIdx = Pair.first->compose(SubIdx);
+      if (!ComposedSubIdx)
+        continue;
+
+      for (CodeGenRegisterClass *SuperRC : Pair.second)
+        addSuperRegClass(ComposedSubIdx, SuperRC);
+    }
+  }
+}
+
 //===----------------------------------------------------------------------===//
 //                           CodeGenRegisterCategory
 //===----------------------------------------------------------------------===//
@@ -1290,6 +1317,8 @@ CodeGenRegBank::CodeGenRegBank(const RecordKeeper &Records,
     }
   }
 
+  computeSubRegIndicesRPOT();
+
   // Native register units are associated with a leaf register. They've all been
   // discovered now.
   NumNativeRegUnits = RegUnits.size();
@@ -1364,7 +1393,7 @@ void CodeGenRegBank::addToMaps(CodeGenRegisterClass *RC) {
 }
 
 // Create a synthetic sub-class if it is missing.
-CodeGenRegisterClass *
+std::pair<CodeGenRegisterClass *, bool>
 CodeGenRegBank::getOrCreateSubClass(const CodeGenRegisterClass *RC,
                                     const CodeGenRegister::Vec *Members,
                                     StringRef Name) {
@@ -1372,12 +1401,12 @@ CodeGenRegBank::getOrCreateSubClass(const CodeGenRegisterClass *RC,
   CodeGenRegisterClass::Key K(Members, RC->RSI);
   RCKeyMap::const_iterator FoundI = Key2RC.find(K);
   if (FoundI != Key2RC.end())
-    return FoundI->second;
+    return {FoundI->second, false};
 
   // Sub-class doesn't exist, create a new one.
   RegClasses.emplace_back(*this, Name, K);
   addToMaps(&RegClasses.back());
-  return &RegClasses.back();
+  return {&RegClasses.back(), true};
 }
 
 CodeGenRegisterClass *CodeGenRegBank::getRegClass(const Record *Def) const {
@@ -1694,6 +1723,81 @@ void CodeGenRegBank::computeSubRegLaneMasks() {
 
 namespace {
 
+// A directed graph on sub-register indices with a virtual source node that
+// has an arc to all other nodes, and an arc from A to B if sub-register index
+// B can be obtained by composing A with some other sub-register index.
+struct SubRegIndexCompositionGraph {
+  std::deque<CodeGenSubRegIndex> &SubRegIndices;
+  CodeGenSubRegIndex::CompMap EntryNode;
+
+  SubRegIndexCompositionGraph(std::deque<CodeGenSubRegIndex> &SubRegIndices)
+      : SubRegIndices(SubRegIndices) {
+    for (CodeGenSubRegIndex &Idx : SubRegIndices) {
+      EntryNode.try_emplace(&Idx, &Idx);
+    }
+  }
+};
+
+} // namespace
+
+template <> struct llvm::GraphTraits<SubRegIndexCompositionGraph> {
+  using NodeRef =
+      PointerUnion<CodeGenSubRegIndex *, const CodeGenSubRegIndex::CompMap *>;
+
+  // Using a reverse iterator causes sub-register indices to appear in their
+  // more natural order in RPOT.
+  using CompMapIt = CodeGenSubRegIndex::CompMap::const_reverse_iterator;
+  struct ChildIteratorType
+      : public iterator_adaptor_base<
+            ChildIteratorType, CompMapIt,
+            typename std::iterator_traits<CompMapIt>::iterator_category,
+            NodeRef> {
+    ChildIteratorType(CompMapIt I)
+        : ChildIteratorType::iterator_adaptor_base(I) {}
+
+    NodeRef operator*() const { return wrapped()->second; }
+  };
+
+  static NodeRef getEntryNode(const SubRegIndexCompositionGraph &G) {
+    return &G.EntryNode;
+  }
+
+  static const CodeGenSubRegIndex::CompMap *children(NodeRef N) {
+    if (auto *Idx = dyn_cast<CodeGenSubRegIndex *>(N))
+      return &Idx->getComposites();
+    return cast<const CodeGenSubRegIndex::CompMap *>(N);
+  }
+
+  static ChildIteratorType child_begin(NodeRef N) {
+    return ChildIteratorType(children(N)->rbegin());
+  }
+  static ChildIteratorType child_end(NodeRef N) {
+    return ChildIteratorType(children(N)->rend());
+  }
+
+  static auto nodes_begin(SubRegIndexCompositionGraph *G) {
+    return G->SubRegIndices.begin();
+  }
+  static auto nodes_end(SubRegIndexCompositionGraph *G) {
+    return G->SubRegIndices.end();
+  }
+
+  static unsigned size(SubRegIndexCompositionGraph *G) {
+    return G->SubRegIndices.size();
+  }
+};
+
+void CodeGenRegBank::computeSubRegIndicesRPOT() {
+  SubRegIndexCompositionGraph G(SubRegIndices);
+  ReversePostOrderTraversal<SubRegIndexCompositionGraph> RPOT(G);
+  for (const auto N : RPOT) {
+    if (auto *Idx = dyn_cast<CodeGenSubRegIndex *>(N))
+      SubRegIndicesRPOT.push_back(Idx);
+  }
+}
+
+namespace {
+
 // UberRegSet is a helper class for computeRegUnitWeights. Each UberRegSet is
 // the transitive closure of the union of overlapping register
 // classes. Together, the UberRegSets form a partition of the registers. If we
@@ -2323,8 +2427,10 @@ void CodeGenRegBank::inferSubClassWithSubReg(CodeGenRegisterClass *RC) {
     if (SubIdx.Artificial)
       continue;
     // This is a real subset.  See if we have a matching class.
-    CodeGenRegisterClass *SubRC = getOrCreateSubClass(
-        RC, &I->second, RC->getName() + "_with_" + I->first->getName());
+    CodeGenRegisterClass *SubRC =
+        getOrCreateSubClass(RC, &I->second,
+                            RC->getName() + "_with_" + I->first->getName())
+            .first;
     RC->setSubClassWithSubReg(&SubIdx, SubRC);
   }
 }
@@ -2339,16 +2445,22 @@ void CodeGenRegBank::inferSubClassWithSubReg(CodeGenRegisterClass *RC) {
 void CodeGenRegBank::inferMatchingSuperRegClass(
     CodeGenRegisterClass *RC,
     std::list<CodeGenRegisterClass>::iterator FirstSubRegRC) {
+  DenseSet<const CodeGenSubRegIndex *> ImpliedSubRegIndices;
   std::vector<std::pair<const CodeGenRegister *, const CodeGenRegister *>>
       SubToSuperRegs;
   BitVector TopoSigs(getNumTopoSigs());
 
-  // Iterate in SubRegIndex numerical order to visit synthetic indices last.
-  for (auto &SubIdx : SubRegIndices) {
+  // Iterate subregister indices in topological order to visit larger indices
+  // first. This allows us to skip the smaller indices in many cases because
+  // their inferred super-register classes are implied.
+  for (auto *SubIdx : SubRegIndicesRPOT) {
     // Skip indexes that aren't fully supported by RC's registers. This was
     // computed by inferSubClassWithSubReg() above which should have been
     // called first.
-    if (RC->getSubClassWithSubReg(&SubIdx) != RC)
+    if (RC->getSubClassWithSubReg(SubIdx) != RC)
+      continue;
+
+    if (ImpliedSubRegIndices.count(SubIdx))
       continue;
 
     // Build list of (Sub, Super) pairs for this SubIdx, sorted by Sub. Note
@@ -2356,7 +2468,7 @@ void CodeGenRegBank::inferMatchingSuperRegClass(
     SubToSuperRegs.clear();
     TopoSigs.reset();
     for (const auto Super : RC->getMembers()) {
-      const CodeGenRegister *Sub = Super->getSubRegs().find(&SubIdx)->second;
+      const CodeGenRegister *Sub = Super->getSubRegs().find(SubIdx)->second;
       assert(Sub && "Missing sub-register");
       SubToSuperRegs.emplace_back(Sub, Super);
       TopoSigs.set(Sub->getTopoSig());
@@ -2374,7 +2486,7 @@ void CodeGenRegBank::inferMatchingSuperRegClass(
       if (SubRC.Artificial)
         continue;
       // Topological shortcut: SubRC members have the wrong shape.
-      if (!TopoSigs.anyCommon(SubRC.getTopoSigs()))
+      if (!TopoSigs.anyCommon(SubRC.getRegsWithSuperRegsTopoSigs()))
         continue;
       // Compute the subset of RC that maps into SubRC with a single linear scan
       // through SubToSuperRegs and the members of SubRC.
@@ -2395,15 +2507,54 @@ void CodeGenRegBank::inferMatchingSuperRegClass(
       // RC injects completely into SubRC.
       sortAndUniqueRegisters(SubSetVec);
       if (SubSetVec.size() == RC->getMembers().size()) {
-        SubRC.addSuperRegClass(&SubIdx, RC);
+        SubRC.addSuperRegClass(SubIdx, RC);
+
+        // We can skip checking subregister indices that can be composed from
+        // the current SubIdx.
+        //
+        // Proof sketch: Let SubRC' be another register class and SubSubIdx
+        // a subregister index that can be composed from SubIdx.
+        //
+        // Calling this function with SubRC in place of RC ensures the existence
+        // of a subclass X of SubRC with the registers that have subregisters in
+        // SubRC'.
+        //
+        // The set of registers in RC with SubSubIdx in SubRC' is equal to the
+        // set of registers in RC with SubIdx in X (because every register in
+        // RC has a corresponding subregister in SubRC), and so checking the
+        // pair (SubSubIdx, SubRC') is redundant with checking (SubIdx, X).
+        for (const auto &SubSubIdx : SubIdx->getComposites())
+          ImpliedSubRegIndices.insert(SubSubIdx.second);
+
         continue;
       }
 
       // Only a subset of RC maps into SubRC. Make sure it is represented by a
       // class.
-      getOrCreateSubClass(RC, &SubSetVec,
-                          RC->getName() + "_with_" + SubIdx.getName() + "_in_" +
-                              SubRC.getName());
+      //
+      // The name of the inferred register class follows the template
+      // "<RC>_with_<SubIdx>_in_<SubRC>".
+      //
+      // When SubRC is already an inferred class, prefer a name of the form
+      // "<RC>_with_<CompositeSubIdx>_in_<SubSubRC>" over a chain of the form
+      // "<RC>_with_<SubIdx>_in_<OtherRc>_with_<SubSubIdx>_in_<SubSubRC>".
+      CodeGenSubRegIndex *CompositeSubIdx = SubIdx;
+      CodeGenRegisterClass *CompositeSubRC = &SubRC;
+      if (CodeGenSubRegIndex *SubSubIdx = SubRC.getInferredFromSubRegIdx()) {
+        auto It = SubIdx->getComposites().find(SubSubIdx);
+        if (It != SubIdx->getComposites().end()) {
+          CompositeSubIdx = It->second;
+          CompositeSubRC = SubRC.getInferredFromRC();
+        }
+      }
+
+      auto [SubSetRC, Inserted] = getOrCreateSubClass(
+          RC, &SubSetVec,
+          RC->getName() + "_with_" + CompositeSubIdx->getName() + "_in_" +
+              CompositeSubRC->getName());
+
+      if (Inserted)
+        SubSetRC->setInferredFrom(CompositeSubIdx, CompositeSubRC);
     }
   }
 }
@@ -2438,7 +2589,7 @@ void CodeGenRegBank::computeInferredRegisterClasses() {
     inferMatchingSuperRegClass(RC);
 
     // New register classes are created while this loop is running, and we need
-    // to visit all of them.  I  particular, inferMatchingSuperRegClass needs
+    // to visit all of them.  In particular, inferMatchingSuperRegClass needs
     // to match old super-register classes with sub-register classes created
     // after inferMatchingSuperRegClass was called.  At this point,
     // inferMatchingSuperRegClass has checked SuperRC = [0..rci] with SubRC =
@@ -2451,6 +2602,17 @@ void CodeGenRegBank::computeInferredRegisterClasses() {
       FirstNewRC = NextNewRC;
     }
   }
+
+  // Compute the transitive closure for super-register classes.
+  //
+  // By iterating over sub-register indices in topological order, we only ever
+  // add super-register classes for sub-register indices that have not already
+  // been visited. That allows computing the transitive closure in a single
+  // pass.
+  for (CodeGenSubRegIndex *SubIdx : SubRegIndicesRPOT) {
+    for (CodeGenRegisterClass &SubRC : RegClasses)
+      SubRC.extendSuperRegClasses(SubIdx);
+  }
 }
 
 /// getRegisterClassForRegister - Find the register class that contains the
diff --git a/llvm/utils/TableGen/Common/CodeGenRegisters.h b/llvm/utils/TableGen/Common/CodeGenRegisters.h
index 5e2d1977545c1..f9a7904709830 100644
--- a/llvm/utils/TableGen/Common/CodeGenRegisters.h
+++ b/llvm/utils/TableGen/Common/CodeGenRegisters.h
@@ -21,7 +21,6 @@
 #include "llvm/ADT/DenseMap.h"
 #include "llvm/ADT/STLExtras.h"
 #include "llvm/ADT/SetVector.h"
-#include "llvm/ADT/SmallPtrSet.h"
 #include "llvm/ADT/SmallVector.h"
 #include "llvm/ADT/SparseBitVector.h"
 #include "llvm/ADT/StringMap.h"
@@ -337,12 +336,18 @@ class CodeGenRegisterClass {
   //
   //   R:SubRegIndex in this RC for all R in SuperRC.
   //
-  DenseMap<const CodeGenSubRegIndex *, SmallPtrSet<CodeGenRegisterClass *, 8>>
+  DenseMap<CodeGenSubRegIndex *, DenseSet<CodeGenRegisterClass *>>
       SuperRegClasses;
 
-  // Bit vector of TopoSigs for the registers in this class. This will be
-  // very sparse on regular architectures.
-  BitVector TopoSigs;
+  // Bit vector of TopoSigs for the registers with super registers in this
+  // class. This will be very sparse on regular architectures.
+  BitVector RegsWithSuperRegsTopoSigs;
+
+  // If the register class was inferred for getMatchingSuperRegClass, this
+  // holds the subregister index and subregister class for which the register
+  // class was created.
+  CodeGenSubRegIndex *InferredFromSubRegIdx = nullptr;
+  CodeGenRegisterClass *InferredFromRC = nullptr;
 
 public:
   unsigned EnumValue;
@@ -438,6 +443,8 @@ class CodeGenRegisterClass {
     SuperRegClasses[SubIdx].insert(SuperRC);
   }
 
+  void extendSuperRegClasses(CodeGenSubRegIndex *SubIdx);
+
   // getSubClasses - Returns a constant BitVector of subclasses indexed by
   // EnumValue.
   // The SubClasses vector includes an entry for this class.
@@ -463,8 +470,11 @@ class CodeGenRegisterClass {
   // getOrder(0).
   const CodeGenRegister::Vec &getMembers() const { return Members; }
 
-  // Get a bit vector of TopoSigs present in this register class.
-  const BitVector &getTopoSigs() const { return TopoSigs; }
+  // Get a bit vector of TopoSigs of registers with super registers in this
+  // register class.
+  const BitVector &getRegsWithSuperRegsTopoSigs() const {
+    return RegsWithSuperRegsTopoSigs;
+  }
 
   // Get a weight of this register class.
   unsigned getWeight(const CodeGenRegBank &) const;
@@ -505,6 +515,20 @@ class CodeGenRegisterClass {
       return TheDef->getValueAsInt("BaseClassOrder");
     return {};
   }
+
+  void setInferredFrom(CodeGenSubRegIndex *Idx, CodeGenRegisterClass *RC) {
+    assert(Idx && RC);
+    assert(!InferredFromSubRegIdx);
+
+    InferredFromSubRegIdx = Idx;
+    InferredFromRC = RC;
+  }
+
+  CodeGenSubRegIndex *getInferredFromSubRegIdx() const {
+    return InferredFromSubRegIdx;
+  }
+
+  CodeGenRegisterClass *getInferredFromRC() const { return InferredFromRC; }
 };
 
 // Register categories are used when we need to deterine the category a
@@ -587,6 +611,9 @@ class CodeGenRegBank {
   std::deque<CodeGenSubRegIndex> SubRegIndices;
   DenseMap<const Record *, CodeGenSubRegIndex *> Def2SubRegIdx;
 
+  // Subregister indices sorted topologically by composition.
+  std::vector<CodeGenSubRegIndex *> SubRegIndicesRPOT;
+
   CodeGenSubRegIndex *createSubRegIndex(StringRef Name, StringRef NameSpace);
 
   typedef std::map<SmallVector<CodeGenSubRegIndex *, 8>, CodeGenSubRegIndex *>
@@ -638,10 +665,10 @@ class CodeGenRegBank {
   // Add RC to *2RC maps.
   void addToMaps(CodeGenRegisterClass *);
 
-  // Create a synthetic sub-class if it is missing.
-  CodeGenRegisterClass *getOrCreateSubClass(const CodeGenRegisterClass *RC,
-                                            const CodeGenRegister::Vec *Membs,
-                                            StringRef Name);
+  // Create a synthetic sub-class if it is missing. Returns (RC, inserted).
+  std::pair<CodeGenRegisterClass *, bool>
+  getOrCreateSubClass(const CodeGenRegisterClass *RC,
+                      const CodeGenRegister::Vec *Membs, StringRef Name);
 
   // Infer missing register classes.
   void computeInferredRegisterClasses();
@@ -671,6 +698,9 @@ class CodeGenRegBank {
   // Compute a lane mask for each sub-register index.
   void computeSubRegLaneMasks();
 
+  // Compute RPOT of subregister indices by composition.
+  void computeSubRegIndicesRPOT();
+
   /// Computes a lane mask for each register unit enumerated by a physical
   /// register.
   void computeRegUnitLaneMasks();
diff --git a/llvm/utils/TableGen/RegisterBankEmitter.cpp b/llvm/utils/TableGen/RegisterBankEmitter.cpp
index a2fcf55e85132..e931000bb9c71 100644
--- a/llvm/utils/TableGen/RegisterBankEmitter.cpp
+++ b/llvm/utils/TableGen/RegisterBankEmitter.cpp
@@ -179,7 +179,7 @@ static void visitRegisterBankClasses(
     const CodeGenRegBank &RegisterClassHierarchy,
     const CodeGenRegisterClass *RC, const Twine &Kind,
     std::function<void(const CodeGenRegisterClass *, StringRef)> VisitFn,
-    SmallPtrSetImpl<const CodeGenRegisterClass *> &VisitedRCs) {
+    DenseSet<const CodeGenRegisterClass *> &VisitedRCs) {
 
   // Make sure we only visit each class once to avoid infinite loops.
   if (!VisitedRCs.insert(RC).second)
@@ -390,7 +390,7 @@ void RegisterBankEmitter::run(raw_ostream &OS) {
   Timer.startTimer("Analyze records");
   std::vector<RegisterBank> Banks;
   for (const auto &V : Records.getAllDerivedDefinitions("RegisterBank")) {
-    SmallPtrSet<const CodeGenRegisterClass *, 8> VisitedRCs;
+    DenseSet<const CodeGenRegisterClass *> VisitedRCs;
     RegisterBank Bank(*V, CGH.getNumModeIds());
 
     for (const CodeGenRegisterClass *RC :

@nhaehnle nhaehnle merged commit 9c31155 into llvm:main Apr 10, 2025
14 checks passed
@nhaehnle nhaehnle deleted the pub-tblgen branch April 10, 2025 19:00
@llvm-ci
Copy link
Collaborator

llvm-ci commented Apr 11, 2025

LLVM Buildbot has detected a new failure on builder bolt-x86_64-ubuntu-nfc running on bolt-worker while building llvm at step 9 "test-build-bolt-check-large-bolt".

Full details are available at: https://lab.llvm.org/buildbot/#/builders/92/builds/16960

Here is the relevant piece of the build log for the reference
Step 9 (test-build-bolt-check-large-bolt) failure: test (failure)
******************** TEST 'bolt-tests :: X86/clang-nolbr.test' FAILED ********************
Exit Code: 254

Command Output (stderr):
--
mkdir -p /home/worker/bolt-worker2/bolt-x86_64-ubuntu-nfc/bolt-tests/test/X86/Output # RUN: at line 5
+ mkdir -p /home/worker/bolt-worker2/bolt-x86_64-ubuntu-nfc/bolt-tests/test/X86/Output
test -f /home/worker/bolt-worker2/bolt-x86_64-ubuntu-nfc/bolt-tests/test/X86/Output/clang || unzstd /home/worker/bolt-worker2/bolt-x86_64-ubuntu-nfc/bolt-tests/test/X86/Inputs/clang.zst -o /home/worker/bolt-worker2/bolt-x86_64-ubuntu-nfc/bolt-tests/test/X86/Output/clang # RUN: at line 6
+ test -f /home/worker/bolt-worker2/bolt-x86_64-ubuntu-nfc/bolt-tests/test/X86/Output/clang
perf record -e cycles:u -o /home/worker/bolt-worker2/bolt-x86_64-ubuntu-nfc/build/tools/bolttests/X86/Output/clang-nolbr.test.tmp.data --    /home/worker/bolt-worker2/bolt-x86_64-ubuntu-nfc/bolt-tests/test/X86/Output/clang /home/worker/bolt-worker2/bolt-x86_64-ubuntu-nfc/bolt-tests/test/X86/Inputs/bf.cpp -O2 -std=c++11 -c -o /home/worker/bolt-worker2/bolt-x86_64-ubuntu-nfc/build/tools/bolttests/X86/Output/clang-nolbr.test.tmp.out # RUN: at line 9
+ perf record -e cycles:u -o /home/worker/bolt-worker2/bolt-x86_64-ubuntu-nfc/build/tools/bolttests/X86/Output/clang-nolbr.test.tmp.data -- /home/worker/bolt-worker2/bolt-x86_64-ubuntu-nfc/bolt-tests/test/X86/Output/clang /home/worker/bolt-worker2/bolt-x86_64-ubuntu-nfc/bolt-tests/test/X86/Inputs/bf.cpp -O2 -std=c++11 -c -o /home/worker/bolt-worker2/bolt-x86_64-ubuntu-nfc/build/tools/bolttests/X86/Output/clang-nolbr.test.tmp.out
Lowering default frequency rate from 4000 to 2000.
Please consider tweaking /proc/sys/kernel/perf_event_max_sample_rate.
[ perf record: Woken up 2 times to write data ]
[ perf record: Captured and wrote 0.433 MB /home/worker/bolt-worker2/bolt-x86_64-ubuntu-nfc/build/tools/bolttests/X86/Output/clang-nolbr.test.tmp.data (11117 samples) ]
/home/worker/bolt-worker2/bolt-x86_64-ubuntu-nfc/build/bin/perf2bolt /home/worker/bolt-worker2/bolt-x86_64-ubuntu-nfc/bolt-tests/test/X86/Output/clang -p /home/worker/bolt-worker2/bolt-x86_64-ubuntu-nfc/build/tools/bolttests/X86/Output/clang-nolbr.test.tmp.data -o /home/worker/bolt-worker2/bolt-x86_64-ubuntu-nfc/build/tools/bolttests/X86/Output/clang-nolbr.test.tmp.fdata -nl    |& /home/worker/bolt-worker2/bolt-x86_64-ubuntu-nfc/build/bin/FileCheck /home/worker/bolt-worker2/bolt-x86_64-ubuntu-nfc/bolt-tests/test/X86/clang-nolbr.test -check-prefix=CHECK-P2B # RUN: at line 13
+ /home/worker/bolt-worker2/bolt-x86_64-ubuntu-nfc/build/bin/perf2bolt /home/worker/bolt-worker2/bolt-x86_64-ubuntu-nfc/bolt-tests/test/X86/Output/clang -p /home/worker/bolt-worker2/bolt-x86_64-ubuntu-nfc/build/tools/bolttests/X86/Output/clang-nolbr.test.tmp.data -o /home/worker/bolt-worker2/bolt-x86_64-ubuntu-nfc/build/tools/bolttests/X86/Output/clang-nolbr.test.tmp.fdata -nl
+ /home/worker/bolt-worker2/bolt-x86_64-ubuntu-nfc/build/bin/FileCheck /home/worker/bolt-worker2/bolt-x86_64-ubuntu-nfc/bolt-tests/test/X86/clang-nolbr.test -check-prefix=CHECK-P2B
/home/worker/bolt-worker2/bolt-x86_64-ubuntu-nfc/build/bin/llvm-bolt /home/worker/bolt-worker2/bolt-x86_64-ubuntu-nfc/bolt-tests/test/X86/Output/clang -o /home/worker/bolt-worker2/bolt-x86_64-ubuntu-nfc/build/tools/bolttests/X86/Output/clang-nolbr.test.tmp -data /home/worker/bolt-worker2/bolt-x86_64-ubuntu-nfc/build/tools/bolttests/X86/Output/clang-nolbr.test.tmp.fdata     -relocs -reorder-blocks=ext-tsp -split-functions=3 -split-all-cold     -split-eh -icf=1 -reorder-functions=hfsort+ -use-gnu-stack     -jump-tables=move -frame-opt=hot -peepholes=all -dyno-stats    |& /home/worker/bolt-worker2/bolt-x86_64-ubuntu-nfc/build/bin/FileCheck /home/worker/bolt-worker2/bolt-x86_64-ubuntu-nfc/bolt-tests/test/X86/clang-nolbr.test -check-prefix=CHECK-BOLT # RUN: at line 17
+ /home/worker/bolt-worker2/bolt-x86_64-ubuntu-nfc/build/bin/llvm-bolt /home/worker/bolt-worker2/bolt-x86_64-ubuntu-nfc/bolt-tests/test/X86/Output/clang -o /home/worker/bolt-worker2/bolt-x86_64-ubuntu-nfc/build/tools/bolttests/X86/Output/clang-nolbr.test.tmp -data /home/worker/bolt-worker2/bolt-x86_64-ubuntu-nfc/build/tools/bolttests/X86/Output/clang-nolbr.test.tmp.fdata -relocs -reorder-blocks=ext-tsp -split-functions=3 -split-all-cold -split-eh -icf=1 -reorder-functions=hfsort+ -use-gnu-stack -jump-tables=move -frame-opt=hot -peepholes=all -dyno-stats
+ /home/worker/bolt-worker2/bolt-x86_64-ubuntu-nfc/build/bin/FileCheck /home/worker/bolt-worker2/bolt-x86_64-ubuntu-nfc/bolt-tests/test/X86/clang-nolbr.test -check-prefix=CHECK-BOLT
/home/worker/bolt-worker2/bolt-x86_64-ubuntu-nfc/build/tools/bolttests/X86/Output/clang-nolbr.test.tmp /home/worker/bolt-worker2/bolt-x86_64-ubuntu-nfc/bolt-tests/test/X86/Inputs/bf.cpp -O2 -std=c++11 -c -o /home/worker/bolt-worker2/bolt-x86_64-ubuntu-nfc/build/tools/bolttests/X86/Output/clang-nolbr.test.tmp.out # RUN: at line 24
+ /home/worker/bolt-worker2/bolt-x86_64-ubuntu-nfc/build/tools/bolttests/X86/Output/clang-nolbr.test.tmp /home/worker/bolt-worker2/bolt-x86_64-ubuntu-nfc/bolt-tests/test/X86/Inputs/bf.cpp -O2 -std=c++11 -c -o /home/worker/bolt-worker2/bolt-x86_64-ubuntu-nfc/build/tools/bolttests/X86/Output/clang-nolbr.test.tmp.out
/home/worker/bolt-worker2/bolt-x86_64-ubuntu-nfc/build/tools/bolttests/X86/Output/clang-nolbr.test.tmp(_ZN4llvm3sys15PrintStackTraceERNS_11raw_ostreamE+0x1d)[0xb91357]
/home/worker/bolt-worker2/bolt-x86_64-ubuntu-nfc/build/tools/bolttests/X86/Output/clang-nolbr.test.tmp(_ZN4llvm3sys17RunSignalHandlersEv+0x2f)[0xb90173]
/home/worker/bolt-worker2/bolt-x86_64-ubuntu-nfc/build/tools/bolttests/X86/Output/clang-nolbr.test.tmp[0xb903f1]
/lib/x86_64-linux-gnu/libc.so.6(+0x42520)[0x7f7c61c42520]
/home/worker/bolt-worker2/bolt-x86_64-ubuntu-nfc/build/tools/bolttests/X86/Output/clang-nolbr.test.tmp[0x37482df]
/home/worker/bolt-worker2/bolt-x86_64-ubuntu-nfc/build/tools/bolttests/X86/Output/clang-nolbr.test.tmp[0x373f667]
/home/worker/bolt-worker2/bolt-x86_64-ubuntu-nfc/build/tools/bolttests/X86/Output/clang-nolbr.test.tmp[0x199f49e]
/home/worker/bolt-worker2/bolt-x86_64-ubuntu-nfc/build/tools/bolttests/X86/Output/clang-nolbr.test.tmp[0x373db1f]
/home/worker/bolt-worker2/bolt-x86_64-ubuntu-nfc/build/tools/bolttests/X86/Output/clang-nolbr.test.tmp(_ZN4llvm13FPPassManager13runOnFunctionERNS_8FunctionE+0x365)[0x341c16f]
/home/worker/bolt-worker2/bolt-x86_64-ubuntu-nfc/build/tools/bolttests/X86/Output/clang-nolbr.test.tmp(_ZN4llvm13FPPassManager11runOnModuleERNS_6ModuleE+0x36)[0x341bdfe]
/home/worker/bolt-worker2/bolt-x86_64-ubuntu-nfc/build/tools/bolttests/X86/Output/clang-nolbr.test.tmp(_ZN4llvm6legacy15PassManagerImpl3runERNS_6ModuleE+0x185)[0x2219375]
/home/worker/bolt-worker2/bolt-x86_64-ubuntu-nfc/build/tools/bolttests/X86/Output/clang-nolbr.test.tmp(_ZN5clang17EmitBackendOutputERNS_17DiagnosticsEngineERKNS_19HeaderSearchOptionsERKNS_14CodeGenOptionsERKNS_13TargetOptionsERKNS_11LangOptionsERKN4llvm10DataLayoutEPNSE_6ModuleENS_13BackendActionESt10unique_ptrINSE_17raw_pwrite_streamESt14default_deleteISM_EE+0x825)[0x22a3455]
/home/worker/bolt-worker2/bolt-x86_64-ubuntu-nfc/build/tools/bolttests/X86/Output/clang-nolbr.test.tmp[0x2385f1e]
/home/worker/bolt-worker2/bolt-x86_64-ubuntu-nfc/build/tools/bolttests/X86/Output/clang-nolbr.test.tmp(_ZN5clang8ParseASTERNS_4SemaEbb+0x140)[0x1e933b0]
/home/worker/bolt-worker2/bolt-x86_64-ubuntu-nfc/build/tools/bolttests/X86/Output/clang-nolbr.test.tmp(_ZN5clang13CodeGenAction13ExecuteActionEv+0x36)[0x2385986]
/home/worker/bolt-worker2/bolt-x86_64-ubuntu-nfc/build/tools/bolttests/X86/Output/clang-nolbr.test.tmp(_ZN5clang14FrontendAction7ExecuteEv+0x24)[0x2337034]
/home/worker/bolt-worker2/bolt-x86_64-ubuntu-nfc/build/tools/bolttests/X86/Output/clang-nolbr.test.tmp(_ZN5clang16CompilerInstance13ExecuteActionERNS_14FrontendActionE+0x120)[0x231c0e0]
/home/worker/bolt-worker2/bolt-x86_64-ubuntu-nfc/build/tools/bolttests/X86/Output/clang-nolbr.test.tmp(_ZN5clang25ExecuteCompilerInvocationEPNS_16CompilerInstanceE+0x1a7)[0x23433a7]
/home/worker/bolt-worker2/bolt-x86_64-ubuntu-nfc/build/tools/bolttests/X86/Output/clang-nolbr.test.tmp(_Z8cc1_mainN4llvm8ArrayRefIPKcEES2_Pv+0x415)[0x2173c15]
/home/worker/bolt-worker2/bolt-x86_64-ubuntu-nfc/build/tools/bolttests/X86/Output/clang-nolbr.test.tmp(main+0x4da)[0x1768eea]
/lib/x86_64-linux-gnu/libc.so.6(+0x29d90)[0x7f7c61c29d90]
/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0x80)[0x7f7c61c29e40]
/home/worker/bolt-worker2/bolt-x86_64-ubuntu-nfc/build/tools/bolttests/X86/Output/clang-nolbr.test.tmp[0x2172d29]
Stack dump:
0.	Program arguments: /home/worker/bolt-worker2/bolt-x86_64-ubuntu-nfc/build/tools/bolttests/X86/Output/clang-nolbr.test.tmp -cc1 -triple x86_64-unknown-linux-gnu -emit-obj -disable-free -disable-llvm-verifier -discard-value-names -main-file-name bf.cpp -mrelocation-model static -mthread-model posix -fmath-errno -masm-verbose -mconstructor-aliases -munwind-tables -fuse-init-array -target-cpu x86-64 -momit-leaf-frame-pointer -dwarf-column-info -debugger-tuning=gdb -coverage-notes-file /home/worker/bolt-worker2/bolt-x86_64-ubuntu-nfc/build/tools/bolttests/X86/Output/clang-nolbr.test.tmp.gcno -resource-dir /home/worker/bolt-worker2/bolt-x86_64-ubuntu-nfc/build/tools/bolttests/X86/lib/clang/6.0.0 -internal-isystem /usr/lib/gcc/x86_64-linux-gnu/11/../../../../include/c++/11 -internal-isystem /usr/lib/gcc/x86_64-linux-gnu/11/../../../../include/x86_64-linux-gnu/c++/11 -internal-isystem /usr/lib/gcc/x86_64-linux-gnu/11/../../../../include/x86_64-linux-gnu/c++/11 -internal-isystem /usr/lib/gcc/x86_64-linux-gnu/11/../../../../include/c++/11/backward -internal-isystem /usr/local/include -internal-isystem /home/worker/bolt-worker2/bolt-x86_64-ubuntu-nfc/build/tools/bolttests/X86/lib/clang/6.0.0/include -internal-externc-isystem /usr/include/x86_64-linux-gnu -internal-externc-isystem /include -internal-externc-isystem /usr/include -O2 -std=c++11 -fdeprecated-macro -fdebug-compilation-dir /home/worker/bolt-worker2/bolt-x86_64-ubuntu-nfc/build/tools/bolttests/X86 -ferror-limit 19 -fmessage-length 0 -fobjc-runtime=gcc -fcxx-exceptions -fexceptions -fdiagnostics-show-option -vectorize-loops -vectorize-slp -o /home/worker/bolt-worker2/bolt-x86_64-ubuntu-nfc/build/tools/bolttests/X86/Output/clang-nolbr.test.tmp.out -x c++ /home/worker/bolt-worker2/bolt-x86_64-ubuntu-nfc/bolt-tests/test/X86/Inputs/bf.cpp 
1.	<eof> parser at end of file
2.	Code generation
...

@bjope
Copy link
Collaborator

bjope commented Apr 15, 2025

@nhaehnle , I'm trying to understand the impact of this for our downstream out-of-tree target. Some of our benchmarks has shown regressions, and one thing I've noticed is that the register pressure sets differ a lot. There is really no comments in the commit message about this really changing anything (except for register class names). And there aren't any test cases that are impacted.

(No idea yet if we get codegen diffs due to something special we have in our downstream fork.)

Just want to double check if this just is supposed to save compilation time, without any changes in codegen. Or am I really supposed to see impact in llc codegen with this patch?

@bjope
Copy link
Collaborator

bjope commented Apr 15, 2025

@nhaehnle , I'm trying to understand the impact of this for our downstream out-of-tree target. Some of our benchmarks has shown regressions, and one thing I've noticed is that the register pressure sets differ a lot. There is really no comments in the commit message about this really changing anything (except for register class names). And there aren't any test cases that are impacted.

(No idea yet if we get codegen diffs due to something special we have in our downstream fork.)

Just want to double check if this just is supposed to save compilation time, without any changes in codegen. Or am I really supposed to see impact in llc codegen with this patch?

I did check for some in-tree target and then the GenRegisterInfo.inc files isn't impacted. So it looks like we have some downstream added property on the register classes that isn't inferred the same way after this patch. Hmm. We have probably
been "lucky" with iteration order or something in the past. I'll keep digging on my end as I suspect this is a downstream issue.

var-const pushed a commit to ldionne/llvm-project that referenced this pull request Apr 17, 2025
Inferring super-register classes naively requires checking every
register class against every other register class and sub-register
index.
Each of those checks is itself a non-trivial operation on register sets.

Culling as many (RC, RC, SubIdx) triples as possible is important for
the running time of TableGen for architectures with complex sub-register
relations.

Use transitivity to cull many (RC, RC, SubIdx) triples. This
unfortunately requires us to complete the transitive closure of
super-register classes explicitly, but it still cuts down the running
time on AMDGPU substantially -- in some upcoming work in the
backend by more than half (in very rough measurements).

This changes the names of some of the inferred register classes, since
the order in which they are inferred changes. The names of the inferred
register classes become shorter, which reduces the size of the generated
files.

Replacing some uses of SmallPtrSet by DenseSet shaves off a few more
percent; there are hundreds of register classes in AMDGPU.

Tweaking the topological signature check to skip reigsters without
super-registers further helps skip register classes that have "pseudo"
registers in them whose sub- and super-register structure is trivial.
@nhaehnle
Copy link
Collaborator Author

Just to confirm, this was really only supposed to improve compile time. So yeah, if you depended on iteration order for something additional downstream, you may have gotten unlucky now.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants