Skip to content

[AMPGPU] Emit s_singleuse_vdst instructions when a register is used multiple times in the same instruction. #89601

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged

Conversation

ScottEgerton
Copy link
Member

Previously, multiple uses of a register within the same instruction were
being counted as multiple uses. This has been corrected to
only count as a single use as per the specification allowing for
more optimisation candidates.

…ultiple times in the same instruction.

Previously, multiple uses of a register within the same instruction were
being counted as multiple uses. This has been corrected to
only count as a single use as per the specification allowing for
more optimisation candidates.
@llvmbot
Copy link
Member

llvmbot commented Apr 22, 2024

@llvm/pr-subscribers-backend-amdgpu

Author: Scott Egerton (ScottEgerton)

Changes

Previously, multiple uses of a register within the same instruction were
being counted as multiple uses. This has been corrected to
only count as a single use as per the specification allowing for
more optimisation candidates.


Full diff: https://github.com/llvm/llvm-project/pull/89601.diff

2 Files Affected:

  • (modified) llvm/lib/Target/AMDGPU/AMDGPUInsertSingleUseVDST.cpp (+23-19)
  • (modified) llvm/test/CodeGen/AMDGPU/insert-singleuse-vdst.mir (+26-3)
diff --git a/llvm/lib/Target/AMDGPU/AMDGPUInsertSingleUseVDST.cpp b/llvm/lib/Target/AMDGPU/AMDGPUInsertSingleUseVDST.cpp
index f6a63502ec2cda..dbf41da3ceac63 100644
--- a/llvm/lib/Target/AMDGPU/AMDGPUInsertSingleUseVDST.cpp
+++ b/llvm/lib/Target/AMDGPU/AMDGPUInsertSingleUseVDST.cpp
@@ -19,6 +19,7 @@
 #include "MCTargetDesc/AMDGPUMCTargetDesc.h"
 #include "SIInstrInfo.h"
 #include "llvm/ADT/DenseMap.h"
+#include "llvm/ADT/DenseSet.h"
 #include "llvm/ADT/STLExtras.h"
 #include "llvm/ADT/StringRef.h"
 #include "llvm/CodeGen/MachineBasicBlock.h"
@@ -83,26 +84,13 @@ class AMDGPUInsertSingleUseVDST : public MachineFunctionPass {
         // instruction to be marked as a single use producer.
         bool AllProducerOperandsAreSingleUse = true;
 
-        for (const auto &Operand : MI.operands()) {
-          if (!Operand.isReg())
-            continue;
-          const auto Reg = Operand.getReg();
-
-          // Count the number of times each register is read.
-          if (Operand.readsReg())
-            for (const MCRegUnit &Unit : TRI->regunits(Reg))
-              RegisterUseCount[Unit]++;
+        // Gather a list of Registers used before updating use counts to avoid
+        // double counting registers that appear multiple times in a single
+        // MachineInstr.
+        DenseSet<MCRegUnit> RegistersUsed;
 
-          // Do not attempt to optimise across exec mask changes.
-          if (MI.modifiesRegister(AMDGPU::EXEC, TRI)) {
-            for (auto &UsedReg : RegisterUseCount)
-              UsedReg.second = 2;
-          }
-
-          // If we are at the point where the register first became live,
-          // check if the operands are single use.
-          if (!MI.modifiesRegister(Reg, TRI))
-            continue;
+        for (const auto &Operand : MI.all_defs()) {
+          const auto Reg = Operand.getReg();
 
           const auto RegUnits = TRI->regunits(Reg);
           if (any_of(RegUnits, [&RegisterUseCount](const MCRegUnit &Unit) {
@@ -114,6 +102,22 @@ class AMDGPUInsertSingleUseVDST : public MachineFunctionPass {
           for (const MCRegUnit &Unit : RegUnits)
             RegisterUseCount.erase(Unit);
         }
+
+        for (const auto &Operand : MI.all_uses()) {
+          const auto Reg = Operand.getReg();
+
+          // Count the number of times each register is read.
+          for (const MCRegUnit &Unit : TRI->regunits(Reg))
+            RegistersUsed.insert(Unit);
+        }
+        for (const MCRegUnit &Unit : RegistersUsed)
+          RegisterUseCount[Unit]++;
+
+        // Do not attempt to optimise across exec mask changes.
+        if (MI.modifiesRegister(AMDGPU::EXEC, TRI)) {
+          for (auto &UsedReg : RegisterUseCount)
+            UsedReg.second = 2;
+        }
         if (AllProducerOperandsAreSingleUse && SIInstrInfo::isVALU(MI)) {
           // TODO: Replace with candidate logging for instruction grouping
           // later.
diff --git a/llvm/test/CodeGen/AMDGPU/insert-singleuse-vdst.mir b/llvm/test/CodeGen/AMDGPU/insert-singleuse-vdst.mir
index 135a101822bfa6..513734388eb65b 100644
--- a/llvm/test/CodeGen/AMDGPU/insert-singleuse-vdst.mir
+++ b/llvm/test/CodeGen/AMDGPU/insert-singleuse-vdst.mir
@@ -136,6 +136,7 @@ body: |
   ; CHECK-NEXT:   successors: %bb.1(0x80000000)
   ; CHECK-NEXT:   liveins: $vgpr0
   ; CHECK-NEXT: {{  $}}
+  ; CHECK-NEXT:   S_SINGLEUSE_VDST 1
   ; CHECK-NEXT:   $vgpr1 = V_MOV_B32_e32 $vgpr0, implicit $exec
   ; CHECK-NEXT:   $vgpr2 = V_ADD_U32_e32 $vgpr1, $vgpr1, implicit $exec
   ; CHECK-NEXT: {{  $}}
@@ -278,6 +279,31 @@ body: |
     liveins: $vgpr2, $vgpr3
 ...
 
+# Second use is an instruction that reads and writes v1.
+---
+name: multiple_uses_4
+tracksRegLiveness: true
+body: |
+  ; CHECK-LABEL: name: multiple_uses_4
+  ; CHECK: bb.0:
+  ; CHECK-NEXT:   successors: %bb.1(0x80000000)
+  ; CHECK-NEXT:   liveins: $vgpr0
+  ; CHECK-NEXT: {{  $}}
+  ; CHECK-NEXT:   $vgpr1 = V_ADD_U32_e32 $vgpr0, $vgpr0, implicit $exec
+  ; CHECK-NEXT:   $vgpr2 = V_ADD_U32_e32 $vgpr0, $vgpr1, implicit $exec
+  ; CHECK-NEXT:   $vgpr1 = V_ADD_U32_e32 $vgpr0, $vgpr1, implicit $exec
+  ; CHECK-NEXT: {{  $}}
+  ; CHECK-NEXT: bb.1:
+  ; CHECK-NEXT:   liveins: $vgpr0, $vgpr1, $vgpr2
+  bb.0:
+    liveins: $vgpr0
+    $vgpr1 = V_ADD_U32_e32 $vgpr0, $vgpr0, implicit $exec
+    $vgpr2 = V_ADD_U32_e32 $vgpr0, $vgpr1, implicit $exec
+    $vgpr1 = V_ADD_U32_e32 $vgpr0, $vgpr1, implicit $exec
+  bb.1:
+    liveins: $vgpr0, $vgpr1, $vgpr2
+...
+
 # Results are live-in to another basic block.
 ---
 name: basic_block_1
@@ -398,7 +424,6 @@ body: |
   ; CHECK-NEXT:   successors: %bb.1(0x80000000)
   ; CHECK-NEXT:   liveins: $sgpr0_sgpr1
   ; CHECK-NEXT: {{  $}}
-  ; CHECK-NEXT:   S_SINGLEUSE_VDST 1
   ; CHECK-NEXT:   $vgpr0 = V_MOV_B32_e32 0, implicit $exec
   ; CHECK-NEXT:   $exec = COPY $sgpr0_sgpr1
   ; CHECK-NEXT:   $vgpr0 = V_MOV_B32_e32 $vgpr0, implicit $exec
@@ -424,7 +449,6 @@ body: |
   ; CHECK-NEXT:   successors: %bb.1(0x80000000)
   ; CHECK-NEXT:   liveins: $sgpr0
   ; CHECK-NEXT: {{  $}}
-  ; CHECK-NEXT:   S_SINGLEUSE_VDST 1
   ; CHECK-NEXT:   $vgpr0 = V_MOV_B32_e32 0, implicit $exec
   ; CHECK-NEXT:   $exec_lo = COPY $sgpr0
   ; CHECK-NEXT:   $vgpr0 = V_MOV_B32_e32 $vgpr0, implicit $exec
@@ -450,7 +474,6 @@ body: |
   ; CHECK-NEXT:   successors: %bb.1(0x80000000)
   ; CHECK-NEXT:   liveins: $sgpr0
   ; CHECK-NEXT: {{  $}}
-  ; CHECK-NEXT:   S_SINGLEUSE_VDST 1
   ; CHECK-NEXT:   $vgpr0 = V_MOV_B32_e32 0, implicit $exec
   ; CHECK-NEXT:   $exec_hi = COPY $sgpr0
   ; CHECK-NEXT:   $vgpr0 = V_MOV_B32_e32 $vgpr0, implicit $exec

@@ -114,6 +102,22 @@ class AMDGPUInsertSingleUseVDST : public MachineFunctionPass {
for (const MCRegUnit &Unit : RegUnits)
RegisterUseCount.erase(Unit);
}

for (const auto &Operand : MI.all_uses()) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What happens with regmask operands? What happens with call uses?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

From the point of view of this pass, I'm not sure these need special handling. The pass will treat them as regular instructions, just with a large amount of registers used.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess you should disable all optimization across a call, like you do across exec mask changes, but that can be a separate patch.

if (Operand.readsReg())
for (const MCRegUnit &Unit : TRI->regunits(Reg))
RegisterUseCount[Unit]++;
// Gather a list of Registers used before updating use counts to avoid
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Anything tracking per-instruction liveness should be split into a helper function. I feel like there's already something to do something similar for you somewhere, but I can't seem to find it. Is this just building a map from regunit to a 0-1-or-2 counter?

Copy link
Member Author

@ScottEgerton ScottEgerton Apr 24, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the feedback.

Is this just building a map from regunit to a 0-1-or-2 counter?

Essentially, yes. We are counting the uses but the only values we care about are 0, 1 or >1.

This is a simple integer value and should not be passed by reference.
@jayfoad
Copy link
Contributor

jayfoad commented Apr 29, 2024

The test changes LGTM. I think this is ready to go if Matt is happy with DenseSet or a replacement (maybe just a SmallVector with a manual is_contained check before appending to it?).

This is purely for performance reasons.
Copy link
Contributor

@jayfoad jayfoad left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants