[RFC][MC] Cache MCRegAliasIterator #93510


Merged: 4 commits, Jun 14, 2024

Conversation

Pierre-vh (Contributor)

AMDGPU has a lot of registers, almost 9000. Many of those registers have aliases. For instance, SGPR0 has a ton of aliases due to the presence of register tuples. It's even worse if you query the aliases of a register tuple itself. A large register tuple can have hundreds of aliases because it may include 16 registers, and each of those registers has its own tuples as well.

The current implementation of MCRegAliasIterator handles this poorly. In some extreme cases it can iterate (IIRC) 7000 more times than necessary, yielding duplicates over and over again and using a lot of expensive iterators.

This patch implements a cache system for MCRegAliasIterator. It does the expensive part only once, then saves the result, so subsequent iterations over that register's aliases are just a map lookup.

Furthermore, the cached data is uniqued and sorted. This speeds up code both by making the iterator itself faster and by minimizing the number of loop iterations its users perform.

I think this is beneficial for AMDGPU (some quick profiling showed a few seconds gained on a roughly 3-minute test build), but it may introduce some overhead on smaller targets where this iterator was already cheap. To avoid that, we might consider disabling the cache on targets with a small number of registers. We could restrict this cache to AMDGPU in practice by only enabling it if NumReg is above a certain threshold, like 1000.

I still need to do memory usage measurements, but I think this cache should take at most 10MB on AMDGPU if we were to query the aliases of all registers (which isn't common, I think). I don't think this is a concern, but if it is, we can always find ways to alleviate it, for instance by only caching after a certain number of hits or by clearing the cache at regular intervals in the backend.
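For illustration, the compute-once, sort + unique caching pattern described above can be sketched outside of LLVM. This is a minimal sketch, not the actual MCRegisterInfo code: `AliasCache`, `getAliasesOf`, and `expensiveAliasesOf` are placeholder names, and the alias relation is a toy stand-in.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <unordered_map>
#include <vector>

using PhysReg = uint16_t;

// Stand-in for the expensive RegUnit-based walk: it may yield the same
// alias several times, in no particular order.
std::vector<PhysReg> expensiveAliasesOf(PhysReg R) {
  return {PhysReg(R + 1), PhysReg(R - 1), PhysReg(R + 1), PhysReg(R - 1)};
}

class AliasCache {
  std::unordered_map<PhysReg, std::vector<PhysReg>> Cache;

public:
  const std::vector<PhysReg> &getAliasesOf(PhysReg R) {
    auto It = Cache.find(R);
    if (It != Cache.end())
      return It->second; // Fast path: just a map lookup.

    // Slow path, taken once per register: do the expensive walk,
    // then sort + unique so each alias appears exactly once.
    auto &Aliases = Cache[R];
    Aliases = expensiveAliasesOf(R);
    std::sort(Aliases.begin(), Aliases.end());
    Aliases.erase(std::unique(Aliases.begin(), Aliases.end()), Aliases.end());
    Aliases.shrink_to_fit();
    return Aliases;
  }
};
```

With this shape, callers iterating a register's aliases touch each alias exactly once, which is where the second speedup (fewer loop iterations in users of the iterator) comes from.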

@llvmbot llvmbot added backend:ARM mc Machine (object) code labels May 28, 2024
@llvmbot (Member) commented May 28, 2024

@llvm/pr-subscribers-llvm-regalloc
@llvm/pr-subscribers-backend-arm

@llvm/pr-subscribers-mc

Author: Pierre van Houtryve (Pierre-vh)

Changes



Full diff: https://github.com/llvm/llvm-project/pull/93510.diff

3 Files Affected:

  • (modified) llvm/include/llvm/MC/MCRegisterInfo.h (+90-24)
  • (modified) llvm/lib/MC/MCRegisterInfo.cpp (+20)
  • (modified) llvm/test/CodeGen/ARM/constant-island-movwt.mir (+7-7)
diff --git a/llvm/include/llvm/MC/MCRegisterInfo.h b/llvm/include/llvm/MC/MCRegisterInfo.h
index af5be9186108a..79df17d33e420 100644
--- a/llvm/include/llvm/MC/MCRegisterInfo.h
+++ b/llvm/include/llvm/MC/MCRegisterInfo.h
@@ -187,6 +187,9 @@ class MCRegisterInfo {
   DenseMap<MCRegister, int> L2SEHRegs;        // LLVM to SEH regs mapping
   DenseMap<MCRegister, int> L2CVRegs;         // LLVM to CV regs mapping
 
+  mutable DenseMap<MCPhysReg, std::vector<MCPhysReg>> RegAliasesCache;
+  ArrayRef<MCPhysReg> getCachedAliasesOf(MCPhysReg R) const;
+
   /// Iterator class that can traverse the differentially encoded values in
   /// DiffLists. Don't use this class directly, use one of the adaptors below.
   class DiffListIterator
@@ -263,6 +266,7 @@ class MCRegisterInfo {
   friend class MCRegUnitIterator;
   friend class MCRegUnitMaskIterator;
   friend class MCRegUnitRootIterator;
+  friend class MCRegAliasIterator;
 
   /// Initialize MCRegisterInfo, called by TableGen
   /// auto-generated routines. *DO NOT USE*.
@@ -726,60 +730,122 @@ class MCRegUnitRootIterator {
 /// MCRegAliasIterator enumerates all registers aliasing Reg.  If IncludeSelf is
 /// set, Reg itself is included in the list.  This iterator does not guarantee
 /// any ordering or that entries are unique.
+///
+/// This iterator can work in two modes: cached and uncached.
+///
+/// In Uncached mode, this uses RegUnit/RegUnitRoot/SuperReg iterators to
+/// find all aliases. This is very expensive if the target (such as AMDGPU)
+/// has a lot of register and register aliases. It can also cause us
+/// to iterate the same register many times over.
+///
+/// In cached mode, the iterator requests MCRegisterInfo for a cache.
+/// MCRegisterInfo then returns a cached list of sorted, uniqued
+/// aliases for that register. This allows the iterator to be faster
+/// past the first request, but also to iterate much less in some
+/// cases, giving a slight speedup to any code that's using it.
 class MCRegAliasIterator {
 private:
   MCRegister Reg;
   const MCRegisterInfo *MCRI;
-  bool IncludeSelf;
+  bool IncludeSelf : 1;
+  bool IsCached : 1;
+
+  /// TODO: Should we dynamically allocate this? If we don't make caching
+  /// specific for each target, we probably should in order to keep the size of
+  /// this object under control. RegAliasIterator is currently 80 bytes.
+  struct UncachedData {
+    MCRegUnitIterator RI;
+    MCRegUnitRootIterator RRI;
+    MCSuperRegIterator SI;
+  };
 
-  MCRegUnitIterator RI;
-  MCRegUnitRootIterator RRI;
-  MCSuperRegIterator SI;
+  struct CachedData {
+    ArrayRef<MCPhysReg> Aliases;
+    // Index into Aliases. The actual index is (Idx - IncludeSelf) because, when
+    // IncludeSelf is used, we use zero to represent self.
+    unsigned Idx;
+  };
+
+  union {
+    UncachedData Iters;
+    CachedData Cache;
+  };
 
 public:
   MCRegAliasIterator(MCRegister Reg, const MCRegisterInfo *MCRI,
-                     bool IncludeSelf)
-    : Reg(Reg), MCRI(MCRI), IncludeSelf(IncludeSelf) {
+                     bool IncludeSelf, bool UseCache = true)
+      : Reg(Reg), MCRI(MCRI), IncludeSelf(IncludeSelf) {
+
+    IsCached = UseCache;
+    if (IsCached) {
+      Cache.Aliases = MCRI->getCachedAliasesOf(Reg);
+      Cache.Idx = 0;
+      return;
+    }
+
     // Initialize the iterators.
-    for (RI = MCRegUnitIterator(Reg, MCRI); RI.isValid(); ++RI) {
-      for (RRI = MCRegUnitRootIterator(*RI, MCRI); RRI.isValid(); ++RRI) {
-        for (SI = MCSuperRegIterator(*RRI, MCRI, true); SI.isValid(); ++SI) {
-          if (!(!IncludeSelf && Reg == *SI))
+    for (Iters.RI = MCRegUnitIterator(Reg, MCRI); Iters.RI.isValid();
+         ++Iters.RI) {
+      for (Iters.RRI = MCRegUnitRootIterator(*Iters.RI, MCRI);
+           Iters.RRI.isValid(); ++Iters.RRI) {
+        for (Iters.SI = MCSuperRegIterator(*Iters.RRI, MCRI, true);
+             Iters.SI.isValid(); ++Iters.SI) {
+          if (!(!IncludeSelf && Reg == *Iters.SI))
             return;
         }
       }
     }
   }
 
-  bool isValid() const { return RI.isValid(); }
+  bool isValid() const {
+    return IsCached ? (Cache.Idx < (Cache.Aliases.size() + IncludeSelf))
+                    : Iters.RI.isValid();
+  }
 
   MCRegister operator*() const {
-    assert(SI.isValid() && "Cannot dereference an invalid iterator.");
-    return *SI;
+    if (IsCached) {
+      if (IncludeSelf && (Cache.Idx == 0))
+        return Reg;
+      return Cache.Aliases[Cache.Idx - IncludeSelf];
+    }
+
+    assert(Iters.SI.isValid() && "Cannot dereference an invalid iterator.");
+    return *Iters.SI;
   }
 
   void advance() {
+    if (IsCached) {
+      ++Cache.Idx;
+      return;
+    }
+
     // Assuming SI is valid.
-    ++SI;
-    if (SI.isValid()) return;
+    ++Iters.SI;
+    if (Iters.SI.isValid())
+      return;
 
-    ++RRI;
-    if (RRI.isValid()) {
-      SI = MCSuperRegIterator(*RRI, MCRI, true);
+    ++Iters.RRI;
+    if (Iters.RRI.isValid()) {
+      Iters.SI = MCSuperRegIterator(*Iters.RRI, MCRI, true);
       return;
     }
 
-    ++RI;
-    if (RI.isValid()) {
-      RRI = MCRegUnitRootIterator(*RI, MCRI);
-      SI = MCSuperRegIterator(*RRI, MCRI, true);
+    ++Iters.RI;
+    if (Iters.RI.isValid()) {
+      Iters.RRI = MCRegUnitRootIterator(*Iters.RI, MCRI);
+      Iters.SI = MCSuperRegIterator(*Iters.RRI, MCRI, true);
     }
   }
 
   MCRegAliasIterator &operator++() {
     assert(isValid() && "Cannot move off the end of the list.");
-    do advance();
-    while (!IncludeSelf && isValid() && *SI == Reg);
+    if (IsCached)
+      advance();
+    else {
+      do
+        advance();
+      while (!IncludeSelf && isValid() && *Iters.SI == Reg);
+    }
     return *this;
   }
 };
diff --git a/llvm/lib/MC/MCRegisterInfo.cpp b/llvm/lib/MC/MCRegisterInfo.cpp
index 334655616d8db..4ce9eb40547be 100644
--- a/llvm/lib/MC/MCRegisterInfo.cpp
+++ b/llvm/lib/MC/MCRegisterInfo.cpp
@@ -20,6 +20,26 @@
 
 using namespace llvm;
 
+ArrayRef<MCPhysReg> MCRegisterInfo::getCachedAliasesOf(MCPhysReg R) const {
+  if (auto It = RegAliasesCache.find(R); It != RegAliasesCache.end())
+    return It->second;
+
+  // TODO: Should we have a DenseSet instead & then convert it
+  // to vector? Or even a BitVector that's then converted to a normal
+  // MCPhysReg vector?
+  auto &Aliases = RegAliasesCache[R];
+  for (MCRegAliasIterator It(R, this, /*IncludeSelf=*/false,
+                             /*UseCache=*/false);
+       It.isValid(); ++It) {
+    Aliases.push_back(*It);
+  }
+
+  llvm::sort(Aliases);
+  Aliases.erase(unique(Aliases), Aliases.end());
+  Aliases.shrink_to_fit();
+  return Aliases;
+}
+
 MCRegister
 MCRegisterInfo::getMatchingSuperReg(MCRegister Reg, unsigned SubIdx,
                                     const MCRegisterClass *RC) const {
diff --git a/llvm/test/CodeGen/ARM/constant-island-movwt.mir b/llvm/test/CodeGen/ARM/constant-island-movwt.mir
index 7d21a4e4875c3..16d2c478cf7ee 100644
--- a/llvm/test/CodeGen/ARM/constant-island-movwt.mir
+++ b/llvm/test/CodeGen/ARM/constant-island-movwt.mir
@@ -898,13 +898,13 @@ body:             |
 # CHECK-NEXT:    CONSTPOOL_ENTRY 1, %const.0, 4
 # CHECK-NEXT: {{^  $}}
 # CHECK-NEXT:   bb.2.entry (align 2):
-# CHECK-NEXT:   liveins: $d13, $s27, $r10, $r9, $r8, $s26, $d12, $s25, $s24,
-# CHECK-SAME:            $d15, $s30, $s31, $d14, $s28, $s29, $lr, $r0, $d21,
-# CHECK-SAME:            $r3, $q10, $d20, $d17, $r2, $d25, $q11, $d22, $d23,
-# CHECK-SAME:            $r1, $q8, $d16, $s3, $q14, $d28, $d29, $d19, $s17,
-# CHECK-SAME:            $d8, $s16, $r6, $r7, $r4, $q12, $q9, $d18, $s0, $q15,
-# CHECK-SAME:            $d30, $d31, $r12, $s1, $d0, $d24, $s2, $d1, $q0, $s6,
-# CHECK-SAME:            $d3, $d2, $s4, $q1, $s7, $s5, $d9, $s18, $s19, $q4
+# CHECK-NEXT:   liveins: $d13, $s26, $r10, $r9, $r8, $s27, $d12, $s24, $s25,
+# CHECK-SAME:             $d15, $s30, $s31, $d14, $s28, $s29, $lr, $d21, $r0,
+# CHECK-SAME:             $q10, $r7, $d20, $d17, $r2, $d25, $q11, $d22, $d23,
+# CHECK-SAME:             $s1, $q8, $d16, $s3, $q14, $d28, $d29, $d19, $s16,
+# CHECK-SAME:             $r4, $d8, $r6, $r3, $q12, $s17, $q9, $d18, $s0, $d31,
+# CHECK-SAME:             $q15, $d30, $r1, $d0, $r12, $d24, $s2, $d1, $q0, $s7,
+# CHECK-SAME:             $s4, $d2, $s6, $q1, $s5, $d3, $d9, $s18, $s19, $q4
 # CHECK-NEXT: {{^  $}}
 # CHECK-NEXT:     $r5 = t2MOVi16 target-flags(arm-lo16) @.str.84, 14 /* CC::al */, $noreg
 # CHECK-NEXT:     $r5 = t2MOVTi16 $r5, target-flags(arm-hi16) @.str.84, 14 /* CC::al */, $noreg

@arsenm (Contributor) left a comment:

The real fix should be to switch all uses of MCRegAliasIterator to use MCRegUnitIterator instead. No liveness infrastructure should operate in terms of aliases

@Pierre-vh (Contributor, Author)

The real fix should be to switch all uses of MCRegAliasIterator to use MCRegUnitIterator instead. No liveness infrastructure should operate in terms of aliases

I'm not too familiar with reg units but as far as I understand them, yes, that'd probably make more sense.

However, I'm wondering how realistic it is to expect that it'll happen anytime soon. I can fix one or two cases myself, I think, but we have about 80 uses in llvm/lib.

How do you feel about caching? Do you think it's a good addition nonetheless, or should we avoid it in favor of (hopefully, someday) fixing all the unnecessary RegAliasIterator uses?

/// MCRegisterInfo then returns a cached list of sorted, uniqued
/// aliases for that register. This allows the iterator to be faster
/// past the first request, but also to iterate much less in some
/// cases, giving a slight speedup to any code that's using it.
class MCRegAliasIterator {
Pierre-vh (Contributor, Author):

I'm wondering if I should move the old implementation to a MCRegAliasIteratorUncached class, and then make MCRegAliasIterator the default cached implementation.

It makes things a bit cleaner (and makes the caching system easier to remove someday), but makes it very difficult to add a cache "killswitch" if we feel like one is necessary (like if we don't want caching on all targets).

@@ -187,6 +187,9 @@ class MCRegisterInfo {
DenseMap<MCRegister, int> L2SEHRegs; // LLVM to SEH regs mapping
DenseMap<MCRegister, int> L2CVRegs; // LLVM to CV regs mapping

mutable DenseMap<MCPhysReg, std::vector<MCPhysReg>> RegAliasesCache;
Pierre-vh (Contributor, Author):

This could also just be a std::vector<std::unique_ptr<std::vector<MCPhysReg>>> where register IDs are the indexes into the vector.

It'd use more space upfront (for AMDGPU, about 9000*8 bytes) but makes accesses a lot faster, and would be more memory efficient if we end up caching tons of aliases.
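A minimal sketch of that alternative layout, with hypothetical names (`IndexedAliasCache`, `getAliasesOf`) and a toy alias computation standing in for the real MCRegisterInfo logic:

```cpp
#include <cassert>
#include <cstdint>
#include <memory>
#include <vector>

using PhysReg = uint16_t;

// One slot per register ID: pays NumRegs * sizeof(pointer) up front in
// exchange for O(1) indexed access with no hashing. Unqueried registers
// cost only a null pointer each.
class IndexedAliasCache {
  std::vector<std::unique_ptr<std::vector<PhysReg>>> Slots;

public:
  explicit IndexedAliasCache(unsigned NumRegs) : Slots(NumRegs) {}

  const std::vector<PhysReg> &getAliasesOf(PhysReg R) {
    std::unique_ptr<std::vector<PhysReg>> &Slot = Slots[R];
    if (!Slot) // First query for R: compute and cache (toy computation).
      Slot = std::make_unique<std::vector<PhysReg>>(
          std::initializer_list<PhysReg>{PhysReg(R + 1), PhysReg(R + 2)});
    return *Slot;
  }
};
```

Indexing by register ID makes the hit path a single pointer dereference; the trade-off is the upfront allocation of one pointer per register, whether or not it is ever queried.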

Contributor:

What is the total size of this map (worst case) for AMDGPU?

Pierre-vh (Contributor, Author):

Realistically I don't know if a program could use all registers available. I don't think so.
Theoretically, there are 9000 registers and I'm not sure how much space DenseMap needs, I think it's a lot more than what std::vector would use

Contributor:

Theoretical minimum size of worst case would be 8972 × 8972 × 2 bytes -> ~154MiB?
The upfront cost of a 70KiB vector version without the risk of multiple megabyte map seems preferable.

Contributor:

If I counted correctly, there will be 1,864,820 registers in all vectors in the map if every register is asked for its aliases. The number doesn't look horribly large, so I guess caching is a reasonable trade-off.

@jayfoad jayfoad requested a review from perlfu May 28, 2024 11:04
@jayfoad (Contributor) commented May 28, 2024

I think @perlfu did some work on speeding up MCRegAliasIterator.

I've tried to tackle the same problem by removing uses of MCRegAliasIterator whenever I can.

@arsenm (Contributor) commented May 28, 2024

The real fix should be to switch all uses of MCRegAliasIterator to use MCRegUnitIterator instead. No liveness infrastructure should operate in terms of aliases

I'm not too familiar with reg units but as far as I understand them, yes, that'd probably make more sense.

RegUnits are the fundamental unit of liveness tracking in llvm. All of the generated register enums have to go through the awkward mapping to regunits, things would be better if the register representation looked more like regunit sets.

However, I'm wondering how realistic it is to expect that it'll happen anytime soon. I can fix one or two cases myself, I think, but we have about 80 uses in llvm/lib.

It's usually not too difficult to convert these uses. I've been suggesting this as an intern project for a while. There's some additional infrastructure changes that would make this easier, like moving block live ins and reserved registers to directly use regunits

How do you feel about caching? Do you think it's a good addition nonetheless, or should we avoid it in favor of (hopefully, someday) fixing all the unnecessary RegAliasIterator uses?

Maybe it's fine for a stopgap

@arsenm arsenm requested review from MatzeB and qcolombet May 28, 2024 11:06
@Pierre-vh (Contributor, Author)

Maybe it's fine for a stopgap

It's what I think too, but as the saying goes, there's nothing as permanent as a temporary solution. I'm not comfortable landing this unless everyone is sure it'd be beneficial even if we end up removing most uses of the RegAlias iterator at some point.

I personally think it is beneficial anyway because:

  • If we remove RegAlias iterator as a whole at some point, then this just goes away.
  • If we can't remove it and there are still some uses left in the hot path, caching is always a good thing.

@s-barannikov (Contributor) commented May 28, 2024

+1 for switching to regunit iterators

There's some additional infrastructure changes that would make this easier, like moving block live ins and reserved registers to directly use regunits

This is what I'd like to see, too. It is not obvious how to deal with e.g. X86 EAX high sub-register though.

@qcolombet (Collaborator)

Maybe it's fine for a stopgap

It's what I think too, but as the saying goes, there's nothing as permanent as a temporary solution. I'm not comfortable landing this unless everyone is sure it'd be beneficial even if we end up removing most uses of the RegAlias iterator at some point.

I personally think it is beneficial anyway because:

  • If we remove RegAlias iterator as a whole at some point, then this just goes away.
  • If we can't remove it and there are still some uses left in the hot path, caching is always a good thing.

Agree on everything that was discussed.

Just one comment on the caching (disclaimer: I haven't looked at the implementation): how stable is it from run to run?

I want to make sure we do not create non-determinism.

@Pierre-vh (Contributor, Author)

Maybe it's fine for a stopgap

It's what I think too, but as the saying goes, there's nothing as permanent as a temporary solution. I'm not comfortable landing this unless everyone is sure it'd be beneficial even if we end up removing most uses of the RegAlias iterator at some point.
I personally think it is beneficial anyway because:

  • If we remove RegAlias iterator as a whole at some point, then this just goes away.
  • If we can't remove it and there are still some uses left in the hot path, caching is always a good thing.

Agree on everything that was discussed.

Just one comment on the caching (disclaimer: I haven't looked at the implementation): how stable is it from run to run?

I want to make sure we do not create non-determinism.

The current implementation should be perfectly stable: sort + unique is used, and register IDs are tablegen-erated, so they're consistent.
If we make caching more complex (say, only caching after N hits, above certain thresholds, or with frequent cache clears), then it might introduce some non-determinism, because the cached and uncached versions return registers in a different order (and the uncached version returns them multiple times).

Agree on everything that was discussed.

To clarify, do you agree that caching is valuable anyway (so this diff could land), but we should still make efforts to move away from using register aliases for liveness?

@Pierre-vh (Contributor, Author)

ping

@arsenm (Contributor) commented May 31, 2024

Probably should test this on compile-time-tracker, touching this kind of stuff has been bad for X86 before

@Pierre-vh Pierre-vh requested a review from arsenm June 3, 2024 13:46
@Pierre-vh (Contributor, Author)

gentle ping

@Pierre-vh Pierre-vh requested a review from nikic June 12, 2024 05:38
@@ -187,6 +187,9 @@ class MCRegisterInfo {
DenseMap<MCRegister, int> L2SEHRegs; // LLVM to SEH regs mapping
DenseMap<MCRegister, int> L2CVRegs; // LLVM to CV regs mapping

mutable std::vector<std::vector<MCPhysReg>> RegAliasesCache;
Pierre-vh (Contributor, Author):

note: I wanted to use a vector of unique_ptr but it makes it too awkward to use. I can't even use reserve in that case, so I didn't do that indirection in the end.

The vector is resized exactly once upon init, and each inner vector is 3 pointers, so for AMDGPU we have 9000 × 24 bytes, about 216KB, for this map, which is worth it IMO.

Contributor:

You can reduce the size by using SmallVector<, 0> instead of std::vector. (Using SmallVector is generally always preferred in LLVM, even if you don't use the inline storage.)

Contributor:

You could use std::optional in place of unique_ptr.
Sentinel looks fine, too.

Pierre-vh (Contributor, Author):

@nikic SmallVector is 32 bytes; it has one more pointer for some reason.
Though it means I can use SmallVector<MCPhysReg, 4> and store small alias sets inline, so I think it's worth the size increase.

Pierre-vh (Contributor, Author):

Ah, but SmallVector doesn't have a shrink_to_fit, so it may end up using quite a bit more memory. I don't think it's worth it then.

Contributor:

@nikic SmallVector is 32 bytes, it has one more pointer for some reason.

That doesn't sound right. SmallVector should be 16 bytes large if the element type is 4 bytes... uh, I just saw that MCPhysReg is only 2 bytes, which increases SmallVector size to 24 bytes. And of course there's no way to disable that heuristic. So yeah, SmallVector won't make things smaller here (but it also shouldn't be 32 bytes either...)

Though it means I can use SmallVector<MCPhysReg, 4> and store small alias sets inline, so I think it's worth the size increase.

But this also means that you allocate those 4 inline elements for all of the registers that are not actually queried. It's probably not worthwhile in this context.

Contributor:

SmallVector should be 16 bytes large if the element type is 4 bytes... uh, I just saw that MCPhysReg is only 2 bytes, which increases SmallVector size to 24 bytes.

Why does it get bigger when the element type gets smaller? Is it worth fixing that?

Contributor:

This is because SmallVector<char> sometimes needs to hold more than 4GB of data. Use of 64-bit sizes should probably be limited to just that case and not also extend to SmallVector<uint16_t>...

@Pierre-vh Pierre-vh requested a review from s-barannikov June 14, 2024 06:41
@Pierre-vh Pierre-vh merged commit ab0d01a into llvm:main Jun 14, 2024
5 of 6 checks passed
@Pierre-vh Pierre-vh deleted the cached-regalias branch June 14, 2024 09:20