[RFC][MC] Cache MCRegAliasIterator #93510

Conversation
AMDGPU has a lot of registers, almost 9000. Many of those registers have aliases. For instance, SGPR0 has a ton of aliases due to the presence of register tuples. It's even worse if you query the aliases of a register tuple itself. A large register tuple can have hundreds of aliases because it may include 16 registers, and each of those registers has its own tuples as well.

The current implementation of MCRegAliasIterator handles this poorly. In some extreme cases it can iterate (IIRC) 7000 more times than necessary, yielding duplicates over and over again and using a lot of expensive iterators.

This patch implements a cache for MCRegAliasIterator. It does the expensive part only once and then saves the result, so subsequent iterations over that register's aliases are just a map lookup. Furthermore, the cached data is uniqued (and sorted). This speeds up code both by making the iterator itself faster and by minimizing the number of loop iterations that users of the iterator perform.

I think this is beneficial for AMDGPU (some quick profiling showed a few seconds gained on a roughly 3-minute test build), but it may introduce some overhead on smaller targets where this iterator was already cheap. To avoid that, we might consider disabling the cache on targets with a small number of registers. We could pretty much restrict this cache to AMDGPU only by enabling it only if NumReg is above a certain threshold, like 1000.

I still need to do memory usage measurements, but I think this cache should take at most 10MB on AMDGPU if we were to query the aliases of all registers (which isn't common, I think). I don't think this is a concern, but if it is, we can always find ways to alleviate it, for instance by only caching after a certain number of hits or by clearing the cache at regular intervals in the backend.
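As a rough illustration of the caching scheme described above (compute the alias set once, then store it sorted and uniqued so later queries are a lookup), here is a self-contained toy sketch. The register data, names, and the plain `std::map` are invented for the example and are not LLVM's actual implementation:

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <map>
#include <vector>

using PhysReg = uint16_t;

// Toy stand-in for the expensive uncached walk. In LLVM this would be the
// nested RegUnit/RegUnitRoot/SuperReg iteration, which can yield the same
// register many times; here we just return a duplicate-heavy list.
std::vector<PhysReg> computeAliasesSlow(PhysReg R) {
  if (R == 1)
    return {3, 2, 3, 2, 2}; // hypothetical aliases of register 1
  if (R == 2)
    return {1};
  return {};
}

class AliasCache {
  std::map<PhysReg, std::vector<PhysReg>> Cache;

public:
  const std::vector<PhysReg> &getAliasesOf(PhysReg R) {
    auto It = Cache.find(R);
    if (It != Cache.end())
      return It->second; // cache hit: no recomputation

    auto &Aliases = Cache[R];
    Aliases = computeAliasesSlow(R); // expensive part runs exactly once
    // Key property of the patch: the cached list is sorted and uniqued,
    // so loops over it do fewer iterations than the raw walk.
    std::sort(Aliases.begin(), Aliases.end());
    Aliases.erase(std::unique(Aliases.begin(), Aliases.end()),
                  Aliases.end());
    Aliases.shrink_to_fit();
    return Aliases;
  }
};
```

The real patch stores the cache inside MCRegisterInfo (`getCachedAliasesOf`) and hands the iterator an ArrayRef into it; the container choice is debated later in this review thread.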
@llvm/pr-subscribers-llvm-regalloc @llvm/pr-subscribers-mc Author: Pierre van Houtryve (Pierre-vh) Changes: (same as the description above.)
Full diff: https://github.com/llvm/llvm-project/pull/93510.diff 3 Files Affected:
diff --git a/llvm/include/llvm/MC/MCRegisterInfo.h b/llvm/include/llvm/MC/MCRegisterInfo.h
index af5be9186108a..79df17d33e420 100644
--- a/llvm/include/llvm/MC/MCRegisterInfo.h
+++ b/llvm/include/llvm/MC/MCRegisterInfo.h
@@ -187,6 +187,9 @@ class MCRegisterInfo {
DenseMap<MCRegister, int> L2SEHRegs; // LLVM to SEH regs mapping
DenseMap<MCRegister, int> L2CVRegs; // LLVM to CV regs mapping
+ mutable DenseMap<MCPhysReg, std::vector<MCPhysReg>> RegAliasesCache;
+ ArrayRef<MCPhysReg> getCachedAliasesOf(MCPhysReg R) const;
+
/// Iterator class that can traverse the differentially encoded values in
/// DiffLists. Don't use this class directly, use one of the adaptors below.
class DiffListIterator
@@ -263,6 +266,7 @@ class MCRegisterInfo {
friend class MCRegUnitIterator;
friend class MCRegUnitMaskIterator;
friend class MCRegUnitRootIterator;
+ friend class MCRegAliasIterator;
/// Initialize MCRegisterInfo, called by TableGen
/// auto-generated routines. *DO NOT USE*.
@@ -726,60 +730,122 @@ class MCRegUnitRootIterator {
/// MCRegAliasIterator enumerates all registers aliasing Reg. If IncludeSelf is
/// set, Reg itself is included in the list. This iterator does not guarantee
/// any ordering or that entries are unique.
+///
+/// This iterator can work in two modes: cached and uncached.
+///
+/// In uncached mode, this uses RegUnit/RegUnitRoot/SuperReg iterators to
+/// find all aliases. This is very expensive if the target (such as AMDGPU)
+/// has a lot of registers and register aliases. It can also cause us
+/// to iterate the same register many times over.
+///
+/// In cached mode, the iterator requests MCRegisterInfo for a cache.
+/// MCRegisterInfo then returns a cached list of sorted, uniqued
+/// aliases for that register. This allows the iterator to be faster
+/// past the first request, but also to iterate much less in some
+/// cases, giving a slight speedup to any code that's using it.
class MCRegAliasIterator {
private:
MCRegister Reg;
const MCRegisterInfo *MCRI;
- bool IncludeSelf;
+ bool IncludeSelf : 1;
+ bool IsCached : 1;
+
+ /// TODO: Should we dynamically allocate this? If we don't make caching
+ /// specific for each target, we probably should in order to keep the size of
+ /// this object under control. RegAliasIterator is currently 80 bytes.
+ struct UncachedData {
+ MCRegUnitIterator RI;
+ MCRegUnitRootIterator RRI;
+ MCSuperRegIterator SI;
+ };
- MCRegUnitIterator RI;
- MCRegUnitRootIterator RRI;
- MCSuperRegIterator SI;
+ struct CachedData {
+ ArrayRef<MCPhysReg> Aliases;
+ // Index into Aliases. The actual index is (Idx - IncludeSelf) because, when
+ // IncludeSelf is used, we use zero to represent self.
+ unsigned Idx;
+ };
+
+ union {
+ UncachedData Iters;
+ CachedData Cache;
+ };
public:
MCRegAliasIterator(MCRegister Reg, const MCRegisterInfo *MCRI,
- bool IncludeSelf)
- : Reg(Reg), MCRI(MCRI), IncludeSelf(IncludeSelf) {
+ bool IncludeSelf, bool UseCache = true)
+ : Reg(Reg), MCRI(MCRI), IncludeSelf(IncludeSelf) {
+
+ IsCached = UseCache;
+ if (IsCached) {
+ Cache.Aliases = MCRI->getCachedAliasesOf(Reg);
+ Cache.Idx = 0;
+ return;
+ }
+
// Initialize the iterators.
- for (RI = MCRegUnitIterator(Reg, MCRI); RI.isValid(); ++RI) {
- for (RRI = MCRegUnitRootIterator(*RI, MCRI); RRI.isValid(); ++RRI) {
- for (SI = MCSuperRegIterator(*RRI, MCRI, true); SI.isValid(); ++SI) {
- if (!(!IncludeSelf && Reg == *SI))
+ for (Iters.RI = MCRegUnitIterator(Reg, MCRI); Iters.RI.isValid();
+ ++Iters.RI) {
+ for (Iters.RRI = MCRegUnitRootIterator(*Iters.RI, MCRI);
+ Iters.RRI.isValid(); ++Iters.RRI) {
+ for (Iters.SI = MCSuperRegIterator(*Iters.RRI, MCRI, true);
+ Iters.SI.isValid(); ++Iters.SI) {
+ if (!(!IncludeSelf && Reg == *Iters.SI))
return;
}
}
}
}
- bool isValid() const { return RI.isValid(); }
+ bool isValid() const {
+ return IsCached ? (Cache.Idx < (Cache.Aliases.size() + IncludeSelf))
+ : Iters.RI.isValid();
+ }
MCRegister operator*() const {
- assert(SI.isValid() && "Cannot dereference an invalid iterator.");
- return *SI;
+ if (IsCached) {
+ if (IncludeSelf && (Cache.Idx == 0))
+ return Reg;
+ return Cache.Aliases[Cache.Idx - IncludeSelf];
+ }
+
+ assert(Iters.SI.isValid() && "Cannot dereference an invalid iterator.");
+ return *Iters.SI;
}
void advance() {
+ if (IsCached) {
+ ++Cache.Idx;
+ return;
+ }
+
// Assuming SI is valid.
- ++SI;
- if (SI.isValid()) return;
+ ++Iters.SI;
+ if (Iters.SI.isValid())
+ return;
- ++RRI;
- if (RRI.isValid()) {
- SI = MCSuperRegIterator(*RRI, MCRI, true);
+ ++Iters.RRI;
+ if (Iters.RRI.isValid()) {
+ Iters.SI = MCSuperRegIterator(*Iters.RRI, MCRI, true);
return;
}
- ++RI;
- if (RI.isValid()) {
- RRI = MCRegUnitRootIterator(*RI, MCRI);
- SI = MCSuperRegIterator(*RRI, MCRI, true);
+ ++Iters.RI;
+ if (Iters.RI.isValid()) {
+ Iters.RRI = MCRegUnitRootIterator(*Iters.RI, MCRI);
+ Iters.SI = MCSuperRegIterator(*Iters.RRI, MCRI, true);
}
}
MCRegAliasIterator &operator++() {
assert(isValid() && "Cannot move off the end of the list.");
- do advance();
- while (!IncludeSelf && isValid() && *SI == Reg);
+ if (IsCached)
+ advance();
+ else {
+ do
+ advance();
+ while (!IncludeSelf && isValid() && *Iters.SI == Reg);
+ }
return *this;
}
};
diff --git a/llvm/lib/MC/MCRegisterInfo.cpp b/llvm/lib/MC/MCRegisterInfo.cpp
index 334655616d8db..4ce9eb40547be 100644
--- a/llvm/lib/MC/MCRegisterInfo.cpp
+++ b/llvm/lib/MC/MCRegisterInfo.cpp
@@ -20,6 +20,26 @@
using namespace llvm;
+ArrayRef<MCPhysReg> MCRegisterInfo::getCachedAliasesOf(MCPhysReg R) const {
+ if (auto It = RegAliasesCache.find(R); It != RegAliasesCache.end())
+ return It->second;
+
+ // TODO: Should we have a DenseSet instead & then convert it
+ // to vector? Or even a BitVector that's then converted to a normal
+ // MCPhysReg vector?
+ auto &Aliases = RegAliasesCache[R];
+ for (MCRegAliasIterator It(R, this, /*IncludeSelf=*/false,
+ /*UseCache=*/false);
+ It.isValid(); ++It) {
+ Aliases.push_back(*It);
+ }
+
+ llvm::sort(Aliases);
+ Aliases.erase(unique(Aliases), Aliases.end());
+ Aliases.shrink_to_fit();
+ return Aliases;
+}
+
MCRegister
MCRegisterInfo::getMatchingSuperReg(MCRegister Reg, unsigned SubIdx,
const MCRegisterClass *RC) const {
diff --git a/llvm/test/CodeGen/ARM/constant-island-movwt.mir b/llvm/test/CodeGen/ARM/constant-island-movwt.mir
index 7d21a4e4875c3..16d2c478cf7ee 100644
--- a/llvm/test/CodeGen/ARM/constant-island-movwt.mir
+++ b/llvm/test/CodeGen/ARM/constant-island-movwt.mir
@@ -898,13 +898,13 @@ body: |
# CHECK-NEXT: CONSTPOOL_ENTRY 1, %const.0, 4
# CHECK-NEXT: {{^ $}}
# CHECK-NEXT: bb.2.entry (align 2):
-# CHECK-NEXT: liveins: $d13, $s27, $r10, $r9, $r8, $s26, $d12, $s25, $s24,
-# CHECK-SAME: $d15, $s30, $s31, $d14, $s28, $s29, $lr, $r0, $d21,
-# CHECK-SAME: $r3, $q10, $d20, $d17, $r2, $d25, $q11, $d22, $d23,
-# CHECK-SAME: $r1, $q8, $d16, $s3, $q14, $d28, $d29, $d19, $s17,
-# CHECK-SAME: $d8, $s16, $r6, $r7, $r4, $q12, $q9, $d18, $s0, $q15,
-# CHECK-SAME: $d30, $d31, $r12, $s1, $d0, $d24, $s2, $d1, $q0, $s6,
-# CHECK-SAME: $d3, $d2, $s4, $q1, $s7, $s5, $d9, $s18, $s19, $q4
+# CHECK-NEXT: liveins: $d13, $s26, $r10, $r9, $r8, $s27, $d12, $s24, $s25,
+# CHECK-SAME: $d15, $s30, $s31, $d14, $s28, $s29, $lr, $d21, $r0,
+# CHECK-SAME: $q10, $r7, $d20, $d17, $r2, $d25, $q11, $d22, $d23,
+# CHECK-SAME: $s1, $q8, $d16, $s3, $q14, $d28, $d29, $d19, $s16,
+# CHECK-SAME: $r4, $d8, $r6, $r3, $q12, $s17, $q9, $d18, $s0, $d31,
+# CHECK-SAME: $q15, $d30, $r1, $d0, $r12, $d24, $s2, $d1, $q0, $s7,
+# CHECK-SAME: $s4, $d2, $s6, $q1, $s5, $d3, $d9, $s18, $s19, $q4
# CHECK-NEXT: {{^ $}}
# CHECK-NEXT: $r5 = t2MOVi16 target-flags(arm-lo16) @.str.84, 14 /* CC::al */, $noreg
# CHECK-NEXT: $r5 = t2MOVTi16 $r5, target-flags(arm-hi16) @.str.84, 14 /* CC::al */, $noreg
The real fix should be to switch all uses of MCRegAliasIterator to use MCRegUnitIterator instead. No liveness infrastructure should operate in terms of aliases.

I'm not too familiar with reg units, but as far as I understand them, yes, that'd probably make more sense. However, I'm wondering how realistic it is to expect that it'll happen anytime soon. I can fix one or two cases myself, I think, but we have about 80 uses in llvm/lib. How do you feel about caching? Do you think it's a good addition nonetheless, or should we avoid it in favor of (hopefully, someday) fixing all the unnecessary RegAliasIterator uses?
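For readers unfamiliar with register units: the reviewer's point is that two registers overlap exactly when their register-unit sets intersect, so liveness code tracking units never needs to enumerate aliases at all. A toy sketch with invented data (not LLVM's API):

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <vector>

using RegUnit = uint16_t;

// Each register is defined by a sorted list of the register units it covers;
// a tuple register simply covers the units of all its members.
struct Reg {
  std::vector<RegUnit> Units; // kept sorted
};

bool regsOverlap(const Reg &A, const Reg &B) {
  // A few binary searches over small sorted unit lists replace a walk
  // over every alias of A.
  return std::any_of(A.Units.begin(), A.Units.end(), [&](RegUnit U) {
    return std::binary_search(B.Units.begin(), B.Units.end(), U);
  });
}
```

In this model, a 16-register tuple and one of its members share a unit and therefore overlap, without either side ever materializing the hundreds of aliases the tuple has.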
/// MCRegisterInfo then returns a cached list of sorted, uniqued
/// aliases for that register. This allows the iterator to be faster
/// past the first request, but also to iterate much less in some
/// cases, giving a slight speedup to any code that's using it.
class MCRegAliasIterator {
I'm wondering if I should move the old implementation to a MCRegAliasIteratorUncached class, and then make MCRegAliasIterator the default cached implementation.

It makes things a bit cleaner (and makes the caching system easier to remove someday), but makes it very difficult to add a cache "killswitch" if we feel like one is necessary (like if we don't want caching on all targets).
@@ -187,6 +187,9 @@ class MCRegisterInfo {
DenseMap<MCRegister, int> L2SEHRegs; // LLVM to SEH regs mapping
DenseMap<MCRegister, int> L2CVRegs; // LLVM to CV regs mapping

mutable DenseMap<MCPhysReg, std::vector<MCPhysReg>> RegAliasesCache;
This could also just be a std::vector<std::unique_ptr<std::vector<MCPhysReg>>> where register IDs are the indexes into the vector. It'd use more space upfront (for AMDGPU, about 9000*8 bytes), but makes accesses a lot faster, and would be more memory efficient if we end up caching tons of aliases.
What is the total size of this map (worst case) for AMDGPU?
Realistically I don't know if a program could use all registers available. I don't think so.
Theoretically, there are 9000 registers and I'm not sure how much space DenseMap needs, I think it's a lot more than what std::vector would use
Theoretical minimum size of the worst case would be 8972 * 8972 * 2 bytes -> ~154MiB?

The upfront cost of a 70KiB vector version without the risk of a multiple-megabyte map seems preferable.
If I counted correctly, there will be 1'864'820 registers in all vectors in the map if every register is asked for its aliases. The number doesn't look horribly large, so I guess caching is a reasonable trade-off.

I think @perlfu did some work on speeding up MCRegAliasIterator. I've tried to tackle the same problem by removing uses of MCRegAliasIterator whenever I can.
RegUnits are the fundamental unit of liveness tracking in llvm. All of the generated register enums have to go through the awkward mapping to regunits, things would be better if the register representation looked more like regunit sets.
It's usually not too difficult to convert these uses. I've been suggesting this as an intern project for a while. There's some additional infrastructure changes that would make this easier, like moving block live ins and reserved registers to directly use regunits
Maybe it's fine for a stopgap.

It's what I think too, but as the saying goes, there's nothing as permanent as a temporary solution. I'm not comfortable landing this unless everyone is sure it'd be beneficial even if we end up removing most uses of the RegAlias iterator at some point. I personally think it is beneficial anyway because:
+1 for switching to regunit iterators
This is what I'd like to see, too. It is not obvious how to deal with e.g. the X86 EAX high sub-register though.
Agree on everything that was discussed. Just one comment on the caching (disclaimer: I haven't looked at the implementation): how stable is it from run to run? I want to make sure we do not create non-determinism.
The current implementation should be perfectly stable.

To clarify, do you agree that caching is valuable anyway (so this diff could land), but we should still make efforts to move away from using register aliases for liveness?
ping

Probably should test this on compile-time-tracker; touching this kind of stuff has been bad for X86 before.

gentle ping
@@ -187,6 +187,9 @@ class MCRegisterInfo {
DenseMap<MCRegister, int> L2SEHRegs; // LLVM to SEH regs mapping
DenseMap<MCRegister, int> L2CVRegs; // LLVM to CV regs mapping

mutable std::vector<std::vector<MCPhysReg>> RegAliasesCache;
note: I wanted to use a vector of unique_ptr, but it makes it too awkward to use. I can't even use reserve in that case, so I didn't do that indirection in the end.

The vector is resized exactly once upon init, and each inner vector is 3 pointers, so for AMDGPU we have 9000*24 bytes, about 216KB for this cache, which is worth it IMO.
You can reduce the size by using SmallVector<, 0> instead of std::vector. (Using SmallVector is generally always preferred in LLVM, even if you don't use the inline storage.)
You could use std::optional in place of unique_ptr.
Sentinel looks fine, too.
@nikic SmallVector is 32 bytes; it has one more pointer for some reason. Though it means I can use SmallVector<MCPhysReg, 4> and store small alias sets inline, so I think it's worth the size increase.
Ah, but SmallVector doesn't have a shrink_to_fit, so it may end up using quite a bit more memory. I don't think it's worth it then.
> @nikic SmallVector is 32 bytes; it has one more pointer for some reason.

That doesn't sound right. SmallVector should be 16 bytes large if the element type is 4 bytes... uh, I just saw that MCPhysReg is only 2 bytes, which increases SmallVector's size to 24 bytes. And of course there's no way to disable that heuristic. So yeah, SmallVector won't make things smaller here (but it also shouldn't be 32 bytes either...).

> Though it means I can use SmallVector<MCPhysReg, 4> and store small alias sets inline, so I think it's worth the size increase.

But this also means that you allocate those 4 inline elements for all of the registers that are not actually queried. It's probably not worthwhile in this context.
> SmallVector should be 16 bytes large if the element type is 4 bytes... uh, I just saw that MCPhysReg is only 2 bytes, which increases SmallVector's size to 24 bytes.

Why does it get bigger when the element type gets smaller? Is it worth fixing that?

This is because SmallVector<char> sometimes needs to hold more than 4GB of data. Use of 64-bit sizes should probably be limited to just that case and not also extend to SmallVector<uint16_t>...
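The size figures in this exchange can be sanity-checked directly. Note that container layout is an implementation detail of the standard library, not a guarantee of the C++ standard; the checks below hold for mainstream 64-bit implementations (libstdc++, libc++), where std::vector stores three pointers (begin/end/capacity):

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// std::vector<MCPhysReg-like type> is three pointers, i.e. 24 bytes on a
// 64-bit target, regardless of the element type.
static_assert(sizeof(std::vector<uint16_t>) == 3 * sizeof(void *),
              "three-pointer std::vector layout (implementation detail)");

// The "9000*24 bytes, about 216KB" upfront cost quoted above for the
// vector-of-vectors cache.
constexpr std::size_t UpfrontBytes = 9000 * sizeof(std::vector<uint16_t>);
static_assert(UpfrontBytes == 216000, "matches the ~216KB figure");
```

llvm::SmallVector additionally carries inline-capacity machinery and, as noted in the comment above, uses 64-bit size/capacity fields for 2-byte elements, which is why it measured larger here.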