Skip to content

[AMDGPU][True16][CodeGen] update waitcnt for true16 #128927

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 3 commits into from
Mar 11, 2025
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
31 changes: 18 additions & 13 deletions llvm/lib/Target/AMDGPU/SIInsertWaitcnts.cpp
Original file line number Diff line number Diff line change
Expand Up @@ -130,10 +130,10 @@ enum WaitEventType {
// We reserve a fixed number of VGPR slots in the scoring tables for
// special tokens like SCMEM_LDS (needed for buffer load to LDS).
enum RegisterMapping {
SQ_MAX_PGM_VGPRS = 512, // Maximum programmable VGPRs across all targets.
AGPR_OFFSET = 256, // Maximum programmable ArchVGPRs across all targets.
SQ_MAX_PGM_SGPRS = 128, // Maximum programmable SGPRs across all targets.
NUM_EXTRA_VGPRS = 9, // Reserved slots for DS.
SQ_MAX_PGM_VGPRS = 1024, // Maximum programmable VGPRs across all targets.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there any consequence to widening this on targets without true16?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The tracking is not differentiated by the presence or absence of True16 instructions on the subtarget. So it would 2x the size of unsigned VgprScores, which also causes 2x the iterations in the for loop in determineWait and setScoreByInterval on all subtargets.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we avoid that?

Copy link
Contributor Author

@broxigarchen broxigarchen Mar 7, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If I understand correctly, this is mainly about compile speed right?

It seems determineWait and setScoreByInterval are doing for loops and checking for boundary. I assumed we mostly just update waitcnt of a few registers at one time right? If that's true, we could benefits through maintaing a sorted list of the score so that we can turn the for loop into a binary search. The benefits in search should overcome the loss in updating value.

Otherwise we might go with Ivan's suggestions to only increase this number for target with true16 enabled. Please let me know your comments on this. Thanks!

AGPR_OFFSET = 512, // Maximum programmable ArchVGPRs across all targets.
SQ_MAX_PGM_SGPRS = 128, // Maximum programmable SGPRs across all targets.
NUM_EXTRA_VGPRS = 9, // Reserved slots for DS.
// Artificial register slots to track LDS writes into specific LDS locations
// if a location is known. When slots are exhausted or location is
// unknown use the first slot. The first slot is also always updated in
Expand Down Expand Up @@ -748,27 +748,32 @@ RegInterval WaitcntBrackets::getRegInterval(const MachineInstr *MI,

RegInterval Result;

unsigned Reg = TRI->getEncodingValue(AMDGPU::getMCReg(Op.getReg(), *ST)) &
AMDGPU::HWEncoding::REG_IDX_MASK;
MCRegister MCReg = AMDGPU::getMCReg(Op.getReg(), *ST);
unsigned RegIdx = TRI->getHWRegIndex(MCReg);
assert(isUInt<8>(RegIdx));

const TargetRegisterClass *RC = TRI->getPhysRegBaseClass(Op.getReg());
unsigned Size = TRI->getRegSizeInBits(*RC);

// AGPRs/VGPRs are tracked every 16 bits, SGPRs by 32 bits
if (TRI->isVectorRegister(*MRI, Op.getReg())) {
assert(Reg <= SQ_MAX_PGM_VGPRS);
unsigned Reg = RegIdx << 1 | (AMDGPU::isHi16Reg(MCReg, *TRI) ? 1 : 0);
assert(Reg < AGPR_OFFSET);
Result.first = Reg;
if (TRI->isAGPR(*MRI, Op.getReg()))
Result.first += AGPR_OFFSET;
assert(Result.first >= 0 && Result.first < SQ_MAX_PGM_VGPRS);
} else if (TRI->isSGPRReg(*MRI, Op.getReg()) && Reg < SQ_MAX_PGM_SGPRS) {
assert(Size % 16 == 0);
Result.second = Result.first + (Size / 16);
} else if (TRI->isSGPRReg(*MRI, Op.getReg()) && RegIdx < SQ_MAX_PGM_SGPRS) {
// SGPRs including VCC, TTMPs and EXEC but excluding read-only scalar
// sources like SRC_PRIVATE_BASE.
Result.first = Reg + NUM_ALL_VGPRS;
Result.first = RegIdx + NUM_ALL_VGPRS;
Result.second = Result.first + divideCeil(Size, 32);
} else {
return {-1, -1};
}

const TargetRegisterClass *RC = TRI->getPhysRegBaseClass(Op.getReg());
unsigned Size = TRI->getRegSizeInBits(*RC);
Result.second = Result.first + ((Size + 16) / 32);

return Result;
}

Expand Down
2 changes: 1 addition & 1 deletion llvm/test/CodeGen/AMDGPU/spillv16.ll
Original file line number Diff line number Diff line change
Expand Up @@ -61,8 +61,8 @@ define void @spill_i16_alu_two_vals() {
; GCN-TRUE16-NEXT: scratch_load_d16_b16 v0, off, s32 offset:4 glc dlc
; GCN-TRUE16-NEXT: s_waitcnt vmcnt(0)
; GCN-TRUE16-NEXT: scratch_load_d16_hi_b16 v0, off, s32 offset:6 ; 2-byte Folded Reload
; GCN-TRUE16-NEXT: s_waitcnt vmcnt(0)
; GCN-TRUE16-NEXT: v_add_nc_u16 v0.l, 0x7b, v0.l
; GCN-TRUE16-NEXT: s_waitcnt vmcnt(0)
; GCN-TRUE16-NEXT: scratch_store_d16_hi_b16 off, v0, s32 dlc
; GCN-TRUE16-NEXT: s_waitcnt_vscnt null, 0x0
; GCN-TRUE16-NEXT: scratch_store_b16 off, v0, s32 offset:4 dlc
Expand Down
Loading