Skip to content

[AMDGPU][True16][CodeGen] update wwm reg sorting check condition #135053

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 3 commits into from
Apr 27, 2025
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
2 changes: 1 addition & 1 deletion llvm/lib/Target/AMDGPU/SIFrameLowering.cpp
Original file line number Diff line number Diff line change
Expand Up @@ -1650,7 +1650,7 @@ void SIFrameLowering::determineCalleeSaves(MachineFunction &MF,
// are of 32-bit size. SIPreAllocateWWMRegs pass can add tuples into WWM
// reserved registers.
const TargetRegisterClass *RC = TRI->getPhysRegBaseClass(Reg);
if (TRI->getRegSizeInBits(*RC) > 32)
if (TRI->getRegSizeInBits(*RC) != 32)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't know why this code needs to filter anything.

I don't understand this comment:

// SIPreAllocateWWMRegs pass can add tuples into WWM reserved registers.

It can, but why? This should only track the 32-bit element registers

Copy link
Contributor Author

@broxigarchen broxigarchen Apr 14, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi Matt. I am not quite familiar with this. I think maybe @cdevadas can answer this question.

If we want to change something here I think we probably should do a seperate patch. Thanks!

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

SIPreallocateWWMRegs pass handpicks VGPRs from the lower end. They might allocate VGPR tuples as well. However, the wwm-regalloc pass gets registers from the tail-end (we reserve them from the higher end to ensure the per-lane VGPR tuple allocation gets sufficient free contiguous registers from the initial scratch range.). This code inserted during PEI is trying to shift them down to the lowest range. Remember, the shifting is only required for those allocated during wwm-regalloc pass. Since we have the unified set WWMReservedRegs that holds all sort of wwm-regs, this loop here is trying to identify only 32-bit registers that are allocated during wwm-regalloc pass (the VGPRs used for SGPR spilling, at the moment). We may also pick the 32-bit regs custom allocated during the SIPreallcoate pass. But that's ok. This filter avoids any VGPR tuple allocated for wwm-operands during the custom allocate pass. The shift-down logic currently Inserted here considers only 32-bit regclasses.
The 16-bit registers weren't enabled earlier. So, it makes sense to change the condition to exactly match 32-bit classes.

continue;
SortedWWMVGPRs.push_back(Reg);
}
Expand Down
28 changes: 28 additions & 0 deletions llvm/test/CodeGen/AMDGPU/wwm-reg-shift-down-gfx11plus.mir
Original file line number Diff line number Diff line change
@@ -0,0 +1,28 @@
# NOTE: Assertions have been autogenerated by utils/update_mir_test_checks.py
# RUN: llc -mtriple=amdgcn-- -mcpu=gfx1100 -mattr=+real-true16 -run-pass=prologepilog %s -o - | FileCheck -check-prefix=GCN %s

---
name: wwm_reg_skip_sort_16bit
tracksRegLiveness: true
machineFunctionInfo:
isEntryFunction: true
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There won't be any prolog epilog CSR spills inserted for entry functions. This test isn't relevant for this patch, especially the shifting won't happen for the entry functions.

Copy link
Contributor Author

@broxigarchen broxigarchen Apr 26, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi Christ. Then this sounds like a bug. The shifting is happening before the entryfunction check

Should this entryFunction check being moved to early stage at line 1646? or even earlier to 1619?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah my bad. The shift-back is needed to ensure the wwm-regs are finally in the lowest range. This is needed for both entry functions and device functions. But their CSR spills/restores will be ignored for entry functions.

body: |
bb.0:
; GCN-LABEL: name: wwm_reg_skip_sort_16bit
; GCN: renamable $sgpr0 = ENTER_STRICT_WWM -1, implicit-def $exec, implicit-def $scc, implicit $exec
; GCN-NEXT: $vgpr0 = IMPLICIT_DEF
; GCN-NEXT: renamable $sgpr1 = V_READLANE_B32 $vgpr0, 31
; GCN-NEXT: renamable $sgpr2 = S_MOV_B32 0
; GCN-NEXT: undef $vgpr0_lo16 = V_CNDMASK_B16_t16_e64 0, 0, 0, killed $sgpr1, killed $sgpr2, 0, implicit $exec
; GCN-NEXT: $exec_lo = EXIT_STRICT_WWM killed renamable $sgpr0
; GCN-NEXT: early-clobber renamable $vgpr1 = V_MOV_B32_e32 $vgpr0, implicit $exec
; GCN-NEXT: S_ENDPGM 0, implicit killed renamable $vgpr1
renamable $sgpr0 = ENTER_STRICT_WWM -1, implicit-def $exec, implicit-def $scc, implicit $exec
$vgpr0 = IMPLICIT_DEF
renamable $sgpr1 = V_READLANE_B32 $vgpr0, 31
renamable $sgpr2 = S_MOV_B32 0
undef $vgpr0_lo16 = V_CNDMASK_B16_t16_e64 0, 0, 0, killed $sgpr1, killed $sgpr2, 0, implicit $exec
$exec_lo = EXIT_STRICT_WWM killed renamable $sgpr0
early-clobber renamable $vgpr1 = V_MOV_B32_e32 $vgpr0, implicit $exec
S_ENDPGM 0, implicit killed renamable $vgpr1
...
Loading