Skip to content

[AMPGPU] Emit s_singleuse_vdst instructions when a register is used multiple times in the same instruction. #89601

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
47 changes: 26 additions & 21 deletions llvm/lib/Target/AMDGPU/AMDGPUInsertSingleUseVDST.cpp
Original file line number Diff line number Diff line change
Expand Up @@ -83,37 +83,42 @@ class AMDGPUInsertSingleUseVDST : public MachineFunctionPass {
// instruction to be marked as a single use producer.
bool AllProducerOperandsAreSingleUse = true;

for (const auto &Operand : MI.operands()) {
if (!Operand.isReg())
continue;
const auto Reg = Operand.getReg();

// Count the number of times each register is read.
if (Operand.readsReg())
for (const MCRegUnit &Unit : TRI->regunits(Reg))
RegisterUseCount[Unit]++;

// Do not attempt to optimise across exec mask changes.
if (MI.modifiesRegister(AMDGPU::EXEC, TRI)) {
for (auto &UsedReg : RegisterUseCount)
UsedReg.second = 2;
}
// Gather a list of Registers used before updating use counts to avoid
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Anything tracking per-instruction liveness should be split into a helper function. I feel like there's already something to do something similar for you somewhere, but I can't seem to find it. Is this just building a map from regunit to a 0-1-or-2 counter?

Copy link
Member Author

@ScottEgerton ScottEgerton Apr 24, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the feedback.

Is this just building a map from regunit to a 0-1-or-2 counter?

Essentially, yes. We are counting the uses but the only values we care about are 0, 1 or >1.

// double counting registers that appear multiple times in a single
// MachineInstr.
SmallVector<MCRegUnit> RegistersUsed;

// If we are at the point where the register first became live,
// check if the operands are single use.
if (!MI.modifiesRegister(Reg, TRI))
continue;
for (const auto &Operand : MI.all_defs()) {
const auto Reg = Operand.getReg();

const auto RegUnits = TRI->regunits(Reg);
if (any_of(RegUnits, [&RegisterUseCount](const MCRegUnit &Unit) {
if (any_of(RegUnits, [&RegisterUseCount](const MCRegUnit Unit) {
return RegisterUseCount[Unit] > 1;
}))
AllProducerOperandsAreSingleUse = false;

// Reset uses count when a register is no longer live.
for (const MCRegUnit &Unit : RegUnits)
for (const MCRegUnit Unit : RegUnits)
RegisterUseCount.erase(Unit);
}

for (const auto &Operand : MI.all_uses()) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What happens with regmask operands? What happens with call uses?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

From the point of view of this pass, I'm not sure these need special handling. The pass will treat them as regular instructions, just with a large amount of registers used.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess you should disable all optimization across a call, like you do across exec mask changes, but that can be a separate patch.

const auto Reg = Operand.getReg();

// Count the number of times each register is read.
for (const MCRegUnit Unit : TRI->regunits(Reg)) {
if (!is_contained(RegistersUsed, Unit))
RegistersUsed.push_back(Unit);
}
}
for (const MCRegUnit Unit : RegistersUsed)
RegisterUseCount[Unit]++;

// Do not attempt to optimise across exec mask changes.
if (MI.modifiesRegister(AMDGPU::EXEC, TRI)) {
for (auto &UsedReg : RegisterUseCount)
UsedReg.second = 2;
}
if (AllProducerOperandsAreSingleUse && SIInstrInfo::isVALU(MI)) {
// TODO: Replace with candidate logging for instruction grouping
// later.
Expand Down
29 changes: 26 additions & 3 deletions llvm/test/CodeGen/AMDGPU/insert-singleuse-vdst.mir
Original file line number Diff line number Diff line change
Expand Up @@ -136,6 +136,7 @@ body: |
; CHECK-NEXT: successors: %bb.1(0x80000000)
; CHECK-NEXT: liveins: $vgpr0
; CHECK-NEXT: {{ $}}
; CHECK-NEXT: S_SINGLEUSE_VDST 1
; CHECK-NEXT: $vgpr1 = V_MOV_B32_e32 $vgpr0, implicit $exec
; CHECK-NEXT: $vgpr2 = V_ADD_U32_e32 $vgpr1, $vgpr1, implicit $exec
; CHECK-NEXT: {{ $}}
Expand Down Expand Up @@ -278,6 +279,31 @@ body: |
liveins: $vgpr2, $vgpr3
...

# Second use is an instruction that reads and writes v1.
---
name: multiple_uses_4
tracksRegLiveness: true
body: |
; CHECK-LABEL: name: multiple_uses_4
; CHECK: bb.0:
; CHECK-NEXT: successors: %bb.1(0x80000000)
; CHECK-NEXT: liveins: $vgpr0
; CHECK-NEXT: {{ $}}
; CHECK-NEXT: $vgpr1 = V_ADD_U32_e32 $vgpr0, $vgpr0, implicit $exec
; CHECK-NEXT: $vgpr2 = V_ADD_U32_e32 $vgpr0, $vgpr1, implicit $exec
; CHECK-NEXT: $vgpr1 = V_ADD_U32_e32 $vgpr0, $vgpr1, implicit $exec
; CHECK-NEXT: {{ $}}
; CHECK-NEXT: bb.1:
; CHECK-NEXT: liveins: $vgpr0, $vgpr1, $vgpr2
bb.0:
liveins: $vgpr0
$vgpr1 = V_ADD_U32_e32 $vgpr0, $vgpr0, implicit $exec
$vgpr2 = V_ADD_U32_e32 $vgpr0, $vgpr1, implicit $exec
$vgpr1 = V_ADD_U32_e32 $vgpr0, $vgpr1, implicit $exec
bb.1:
liveins: $vgpr0, $vgpr1, $vgpr2
...

# Results are live-in to another basic block.
---
name: basic_block_1
Expand Down Expand Up @@ -398,7 +424,6 @@ body: |
; CHECK-NEXT: successors: %bb.1(0x80000000)
; CHECK-NEXT: liveins: $sgpr0_sgpr1
; CHECK-NEXT: {{ $}}
; CHECK-NEXT: S_SINGLEUSE_VDST 1
; CHECK-NEXT: $vgpr0 = V_MOV_B32_e32 0, implicit $exec
; CHECK-NEXT: $exec = COPY $sgpr0_sgpr1
; CHECK-NEXT: $vgpr0 = V_MOV_B32_e32 $vgpr0, implicit $exec
Expand All @@ -424,7 +449,6 @@ body: |
; CHECK-NEXT: successors: %bb.1(0x80000000)
; CHECK-NEXT: liveins: $sgpr0
; CHECK-NEXT: {{ $}}
; CHECK-NEXT: S_SINGLEUSE_VDST 1
; CHECK-NEXT: $vgpr0 = V_MOV_B32_e32 0, implicit $exec
; CHECK-NEXT: $exec_lo = COPY $sgpr0
; CHECK-NEXT: $vgpr0 = V_MOV_B32_e32 $vgpr0, implicit $exec
Expand All @@ -450,7 +474,6 @@ body: |
; CHECK-NEXT: successors: %bb.1(0x80000000)
; CHECK-NEXT: liveins: $sgpr0
; CHECK-NEXT: {{ $}}
; CHECK-NEXT: S_SINGLEUSE_VDST 1
; CHECK-NEXT: $vgpr0 = V_MOV_B32_e32 0, implicit $exec
; CHECK-NEXT: $exec_hi = COPY $sgpr0
; CHECK-NEXT: $vgpr0 = V_MOV_B32_e32 $vgpr0, implicit $exec
Expand Down