Skip to content

[AMDGPU][SIInsertWaitcnts] Do not add s_waitcnt when the counters are known to be 0 already #72830

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 1 commit into from
Dec 15, 2023
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
6 changes: 5 additions & 1 deletion llvm/lib/Target/AMDGPU/MCA/AMDGPUCustomBehaviour.cpp
Original file line number Diff line number Diff line change
Expand Up @@ -25,10 +25,12 @@ void AMDGPUInstrPostProcess::postProcessInstruction(
std::unique_ptr<Instruction> &Inst, const MCInst &MCI) {
switch (MCI.getOpcode()) {
case AMDGPU::S_WAITCNT:
case AMDGPU::S_WAITCNT_soft:
case AMDGPU::S_WAITCNT_EXPCNT:
case AMDGPU::S_WAITCNT_LGKMCNT:
case AMDGPU::S_WAITCNT_VMCNT:
case AMDGPU::S_WAITCNT_VSCNT:
case AMDGPU::S_WAITCNT_VSCNT_soft:
case AMDGPU::S_WAITCNT_EXPCNT_gfx10:
case AMDGPU::S_WAITCNT_LGKMCNT_gfx10:
case AMDGPU::S_WAITCNT_VMCNT_gfx10:
Expand Down Expand Up @@ -77,10 +79,12 @@ unsigned AMDGPUCustomBehaviour::checkCustomHazard(ArrayRef<InstRef> IssuedInst,
default:
return 0;
case AMDGPU::S_WAITCNT: // This instruction
case AMDGPU::S_WAITCNT_soft:
case AMDGPU::S_WAITCNT_EXPCNT:
case AMDGPU::S_WAITCNT_LGKMCNT:
case AMDGPU::S_WAITCNT_VMCNT:
case AMDGPU::S_WAITCNT_VSCNT: // to this instruction are all pseudo.
case AMDGPU::S_WAITCNT_VSCNT:
case AMDGPU::S_WAITCNT_VSCNT_soft: // to this instruction are all pseudo.
case AMDGPU::S_WAITCNT_EXPCNT_gfx10:
case AMDGPU::S_WAITCNT_LGKMCNT_gfx10:
case AMDGPU::S_WAITCNT_VMCNT_gfx10:
Expand Down
135 changes: 67 additions & 68 deletions llvm/lib/Target/AMDGPU/SIInsertWaitcnts.cpp
Original file line number Diff line number Diff line change
Expand Up @@ -369,7 +369,6 @@ class SIInsertWaitcnts : public MachineFunctionPass {
const MachineRegisterInfo *MRI = nullptr;
AMDGPU::IsaVersion IV;

DenseSet<MachineInstr *> TrackedWaitcntSet;
DenseMap<const Value *, MachineBasicBlock *> SLoadAddresses;
DenseMap<MachineBasicBlock *, bool> PreheadersToFlush;
MachineLoopInfo *MLI;
Expand Down Expand Up @@ -493,6 +492,9 @@ class SIInsertWaitcnts : public MachineFunctionPass {
MachineInstr &OldWaitcntInstr,
AMDGPU::Waitcnt &Wait,
MachineBasicBlock::instr_iterator It) const;

// Transform a soft waitcnt into a normal one.
bool promoteSoftWaitCnt(MachineInstr *Waitcnt) const;
};

} // end anonymous namespace
Expand Down Expand Up @@ -872,6 +874,15 @@ static bool updateOperandIfDifferent(MachineInstr &MI, uint16_t OpName,
return true;
}

bool SIInsertWaitcnts::promoteSoftWaitCnt(MachineInstr *Waitcnt) const {
unsigned Opcode = Waitcnt->getOpcode();
if (!SIInstrInfo::isSoftWaitcnt(Opcode))
return false;

Waitcnt->setDesc(TII->get(SIInstrInfo::getNonSoftWaitcntOpcode(Opcode)));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do we need this here as well as in pseudoToMCOpcode?

Note that SIInsertWaitcnts is iterative, so it may process the same block several times. If that happens then by the time the it runs the second time on a block, all remaining soft waitcnts will have been converted to hard ones by the first pass. Is that desirable for some reason?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do we need this here as well as in pseudoToMCOpcode?

So soft waitcnts can be lowered into MCInsts
When I picked up this patch, IIRC, I tried to remove the one in pseudoToMCOpcode (and make soft waitcnts illegal to lower to MCInst) but it didn't work. I can try again if you want

Note that SIInsertWaitcnts is iterative, so it may process the same block several times. If that happens then by the time the it runs the second time on a block, all remaining soft waitcnts will have been converted to hard ones by the first pass. Is that desirable for some reason?

Not sure, I'll think about it

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note that SIInsertWaitcnts is iterative, so it may process the same block several times. If that happens then by the time the it runs the second time on a block, all remaining soft waitcnts will have been converted to hard ones by the first pass. Is that desirable for some reason?

So my assumption is that it's a normal process. The first iteration will either eliminate the unneeded waitcnts, or make them non-soft. Further iterations on the block should act like they did before when there were no soft waitcnts.

I've renamed this function to also make it clearer because I think its current name was confusing. It's only called when we update a waitcnt (so when we know the waitcnt is needed) in order to "promote" the soft waitcnt into a normal one.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess it's OK, but it still seems odd to me that we promote soft waitcnts both here and in pseudoToMCOpcode.

SIInsertWaitcnts is required for correctness (it is not just an optimization) so why is the pseudoToMCOpcode part required?

Alternatively, could we stop promoting soft waitcnts here? Or would that somehow change the behaviour of the second (or subsequent) visit to a basic block?

return true;
}

/// Combine consecutive waitcnt instructions that precede \p It and follow
/// \p OldWaitcntInstr and apply any extra wait from waitcnt that were added
/// by previous passes. Currently this pass conservatively assumes that these
Expand All @@ -888,86 +899,77 @@ bool SIInsertWaitcnts::applyPreexistingWaitcnt(
if (II.isMetaInstruction())
continue;

if (II.getOpcode() == AMDGPU::S_WAITCNT) {
// Conservatively update required wait if this waitcnt was added in an
// earlier pass. In this case it will not exist in the tracked waitcnt
// set.
if (!TrackedWaitcntSet.count(&II)) {
unsigned IEnc = II.getOperand(0).getImm();
AMDGPU::Waitcnt OldWait = AMDGPU::decodeWaitcnt(IV, IEnc);
Wait = Wait.combined(OldWait);
}
unsigned Opcode = II.getOpcode();
bool IsSoft = SIInstrInfo::isSoftWaitcnt(Opcode);

if (SIInstrInfo::isWaitcnt(Opcode)) {
// Update required wait count. If this is a soft waitcnt (= it was added
// by an earlier pass), it may be entirely removed.
unsigned IEnc = II.getOperand(0).getImm();
AMDGPU::Waitcnt OldWait = AMDGPU::decodeWaitcnt(IV, IEnc);
if (IsSoft)
ScoreBrackets.simplifyWaitcnt(OldWait);
Wait = Wait.combined(OldWait);

// Merge consecutive waitcnt of the same type by erasing multiples.
if (!WaitcntInstr) {
WaitcntInstr = &II;
} else {
if (WaitcntInstr || (!Wait.hasWaitExceptVsCnt() && IsSoft)) {
II.eraseFromParent();
Modified = true;
}
} else
WaitcntInstr = &II;

} else {
assert(II.getOpcode() == AMDGPU::S_WAITCNT_VSCNT);
assert(SIInstrInfo::isWaitcntVsCnt(Opcode));
assert(II.getOperand(0).getReg() == AMDGPU::SGPR_NULL);
if (!TrackedWaitcntSet.count(&II)) {
unsigned OldVSCnt =
TII->getNamedOperand(II, AMDGPU::OpName::simm16)->getImm();
Wait.VsCnt = std::min(Wait.VsCnt, OldVSCnt);
}

if (!WaitcntVsCntInstr) {
WaitcntVsCntInstr = &II;
} else {
unsigned OldVSCnt =
TII->getNamedOperand(II, AMDGPU::OpName::simm16)->getImm();
if (IsSoft)
ScoreBrackets.simplifyWaitcnt(InstCounterType::VS_CNT, OldVSCnt);
Wait.VsCnt = std::min(Wait.VsCnt, OldVSCnt);

if (WaitcntVsCntInstr || (!Wait.hasWaitVsCnt() && IsSoft)) {
II.eraseFromParent();
Modified = true;
}
} else
WaitcntVsCntInstr = &II;
}
}

// Updated encoding of merged waitcnt with the required wait.
if (WaitcntInstr) {
if (Wait.hasWaitExceptVsCnt()) {
Modified |=
updateOperandIfDifferent(*WaitcntInstr, AMDGPU::OpName::simm16,
AMDGPU::encodeWaitcnt(IV, Wait));
ScoreBrackets.applyWaitcnt(Wait);
Wait.VmCnt = ~0u;
Wait.LgkmCnt = ~0u;
Wait.ExpCnt = ~0u;

LLVM_DEBUG(It == OldWaitcntInstr.getParent()->end()
? dbgs() << "applyPreexistingWaitcnt\n"
<< "New Instr at block end: " << *WaitcntInstr
<< '\n'
: dbgs() << "applyPreexistingWaitcnt\n"
<< "Old Instr: " << *It
<< "New Instr: " << *WaitcntInstr << '\n');
Modified |= updateOperandIfDifferent(*WaitcntInstr, AMDGPU::OpName::simm16,
AMDGPU::encodeWaitcnt(IV, Wait));
Modified |= promoteSoftWaitCnt(WaitcntInstr);

} else {
WaitcntInstr->eraseFromParent();
Modified = true;
}
ScoreBrackets.applyWaitcnt(Wait);
Wait.VmCnt = ~0u;
Wait.LgkmCnt = ~0u;
Wait.ExpCnt = ~0u;

LLVM_DEBUG(It == OldWaitcntInstr.getParent()->end()
? dbgs()
<< "applyPreexistingWaitcnt\n"
<< "New Instr at block end: " << *WaitcntInstr << '\n'
: dbgs() << "applyPreexistingWaitcnt\n"
<< "Old Instr: " << *It
<< "New Instr: " << *WaitcntInstr << '\n');
}

if (WaitcntVsCntInstr) {
if (Wait.hasWaitVsCnt()) {
assert(ST->hasVscnt());
Modified |= updateOperandIfDifferent(*WaitcntVsCntInstr,
AMDGPU::OpName::simm16, Wait.VsCnt);
ScoreBrackets.applyWaitcnt(Wait);
Wait.VsCnt = ~0u;

LLVM_DEBUG(It == OldWaitcntInstr.getParent()->end()
? dbgs() << "applyPreexistingWaitcnt\n"
<< "New Instr at block end: "
<< *WaitcntVsCntInstr << '\n'
: dbgs() << "applyPreexistingWaitcnt\n"
<< "Old Instr: " << *It
<< "New Instr: " << *WaitcntVsCntInstr << '\n');
} else {
WaitcntVsCntInstr->eraseFromParent();
Modified = true;
}
Modified |= updateOperandIfDifferent(*WaitcntVsCntInstr,
AMDGPU::OpName::simm16, Wait.VsCnt);
Modified |= promoteSoftWaitCnt(WaitcntVsCntInstr);
ScoreBrackets.applyWaitcnt(Wait);
Wait.VsCnt = ~0u;

LLVM_DEBUG(It == OldWaitcntInstr.getParent()->end()
? dbgs() << "applyPreexistingWaitcnt\n"
<< "New Instr at block end: " << *WaitcntVsCntInstr
<< '\n'
: dbgs() << "applyPreexistingWaitcnt\n"
<< "Old Instr: " << *It
<< "New Instr: " << *WaitcntVsCntInstr << '\n');
}

return Modified;
Expand Down Expand Up @@ -1319,7 +1321,6 @@ bool SIInsertWaitcnts::generateWaitcnt(AMDGPU::Waitcnt Wait,
unsigned Enc = AMDGPU::encodeWaitcnt(IV, Wait);
auto SWaitInst =
BuildMI(Block, It, DL, TII->get(AMDGPU::S_WAITCNT)).addImm(Enc);
TrackedWaitcntSet.insert(SWaitInst);
Modified = true;

LLVM_DEBUG(dbgs() << "generateWaitcnt\n";
Expand All @@ -1333,7 +1334,6 @@ bool SIInsertWaitcnts::generateWaitcnt(AMDGPU::Waitcnt Wait,
auto SWaitInst = BuildMI(Block, It, DL, TII->get(AMDGPU::S_WAITCNT_VSCNT))
.addReg(AMDGPU::SGPR_NULL, RegState::Undef)
.addImm(Wait.VsCnt);
TrackedWaitcntSet.insert(SWaitInst);
Modified = true;

LLVM_DEBUG(dbgs() << "generateWaitcnt\n";
Expand Down Expand Up @@ -1581,9 +1581,9 @@ bool WaitcntBrackets::merge(const WaitcntBrackets &Other) {
}

static bool isWaitInstr(MachineInstr &Inst) {
return Inst.getOpcode() == AMDGPU::S_WAITCNT ||
(Inst.getOpcode() == AMDGPU::S_WAITCNT_VSCNT &&
Inst.getOperand(0).isReg() &&
auto Opcode = Inst.getOpcode();
return SIInstrInfo::isWaitcnt(Opcode) ||
(SIInstrInfo::isWaitcntVsCnt(Opcode) && Inst.getOperand(0).isReg() &&
Inst.getOperand(0).getReg() == AMDGPU::SGPR_NULL);
}

Expand Down Expand Up @@ -1852,7 +1852,6 @@ bool SIInsertWaitcnts::runOnMachineFunction(MachineFunction &MF) {
TRI->getEncodingValue(AMDGPU::SGPR0) & AMDGPU::HWEncoding::REG_IDX_MASK;
Encoding.SGPRL = Encoding.SGPR0 + NumSGPRsMax - 1;

TrackedWaitcntSet.clear();
BlockInfos.clear();
bool Modified = false;

Expand Down
3 changes: 3 additions & 0 deletions llvm/lib/Target/AMDGPU/SIInstrInfo.cpp
Original file line number Diff line number Diff line change
Expand Up @@ -8747,6 +8747,9 @@ bool SIInstrInfo::isAsmOnlyOpcode(int MCOp) const {
}

int SIInstrInfo::pseudoToMCOpcode(int Opcode) const {
if (SIInstrInfo::isSoftWaitcnt(Opcode))
Opcode = SIInstrInfo::getNonSoftWaitcntOpcode(Opcode);

unsigned Gen = subtargetEncodingFamily(ST);

if ((get(Opcode).TSFlags & SIInstrFlags::renamedInGFX9) != 0 &&
Expand Down
26 changes: 26 additions & 0 deletions llvm/lib/Target/AMDGPU/SIInstrInfo.h
Original file line number Diff line number Diff line change
Expand Up @@ -895,6 +895,32 @@ class SIInstrInfo final : public AMDGPUGenInstrInfo {
return get(Opcode).TSFlags & SIInstrFlags::TiedSourceNotRead;
}

static unsigned getNonSoftWaitcntOpcode(unsigned Opcode) {
if (isWaitcnt(Opcode))
return AMDGPU::S_WAITCNT;

if (isWaitcntVsCnt(Opcode))
return AMDGPU::S_WAITCNT_VSCNT;

llvm_unreachable("Expected opcode S_WAITCNT/S_WAITCNT_VSCNT");
}

static bool isWaitcnt(unsigned Opcode) {
return Opcode == AMDGPU::S_WAITCNT || Opcode == AMDGPU::S_WAITCNT_soft;
}

static bool isWaitcntVsCnt(unsigned Opcode) {
return Opcode == AMDGPU::S_WAITCNT_VSCNT ||
Opcode == AMDGPU::S_WAITCNT_VSCNT_soft;
}

// "Soft" waitcnt instructions can be relaxed/optimized out by
// SIInsertWaitcnts.
static bool isSoftWaitcnt(unsigned Opcode) {
return Opcode == AMDGPU::S_WAITCNT_soft ||
Opcode == AMDGPU::S_WAITCNT_VSCNT_soft;
}

bool isVGPRCopy(const MachineInstr &MI) const {
assert(isCopyInstr(MI));
Register Dest = MI.getOperand(0).getReg();
Expand Down
12 changes: 7 additions & 5 deletions llvm/lib/Target/AMDGPU/SIMemoryLegalizer.cpp
Original file line number Diff line number Diff line change
Expand Up @@ -1055,7 +1055,8 @@ bool SIGfx6CacheControl::insertWait(MachineBasicBlock::iterator &MI,
VMCnt ? 0 : getVmcntBitMask(IV),
getExpcntBitMask(IV),
LGKMCnt ? 0 : getLgkmcntBitMask(IV));
BuildMI(MBB, MI, DL, TII->get(AMDGPU::S_WAITCNT)).addImm(WaitCntImmediate);
BuildMI(MBB, MI, DL, TII->get(AMDGPU::S_WAITCNT_soft))
.addImm(WaitCntImmediate);
Changed = true;
}

Expand Down Expand Up @@ -1963,14 +1964,15 @@ bool SIGfx10CacheControl::insertWait(MachineBasicBlock::iterator &MI,
VMCnt ? 0 : getVmcntBitMask(IV),
getExpcntBitMask(IV),
LGKMCnt ? 0 : getLgkmcntBitMask(IV));
BuildMI(MBB, MI, DL, TII->get(AMDGPU::S_WAITCNT)).addImm(WaitCntImmediate);
BuildMI(MBB, MI, DL, TII->get(AMDGPU::S_WAITCNT_soft))
.addImm(WaitCntImmediate);
Changed = true;
}

if (VSCnt) {
BuildMI(MBB, MI, DL, TII->get(AMDGPU::S_WAITCNT_VSCNT))
.addReg(AMDGPU::SGPR_NULL, RegState::Undef)
.addImm(0);
BuildMI(MBB, MI, DL, TII->get(AMDGPU::S_WAITCNT_VSCNT_soft))
.addReg(AMDGPU::SGPR_NULL, RegState::Undef)
.addImm(0);
Changed = true;
}

Expand Down
11 changes: 11 additions & 0 deletions llvm/lib/Target/AMDGPU/SOPInstructions.td
Original file line number Diff line number Diff line change
Expand Up @@ -1590,6 +1590,17 @@ def S_WAKEUP : SOPP_Pseudo <"s_wakeup", (ins) > {

def S_WAITCNT : SOPP_Pseudo <"s_waitcnt" , (ins SWaitCnt:$simm16), "$simm16",
[(int_amdgcn_s_waitcnt timm:$simm16)]>;

// "_soft" waitcnts are waitcnts that are either relaxed into their non-soft
// counterpart, or completely removed.
//
// These are inserted by SIMemoryLegalizer to resolve memory dependencies
// and later optimized by SIInsertWaitcnts
// For example, a S_WAITCNT_soft 0 can be completely removed in a function
// that doesn't access memory.
def S_WAITCNT_soft : SOPP_Pseudo <"s_soft_waitcnt" , (ins SWaitCnt:$simm16), "$simm16">;
def S_WAITCNT_VSCNT_soft : SOPK_WAITCNT<"s_soft_waitcnt_vscnt">;

def S_SETHALT : SOPP_Pseudo <"s_sethalt" , (ins i32imm:$simm16), "$simm16",
[(int_amdgcn_s_sethalt timm:$simm16)]>;
def S_SETKILL : SOPP_Pseudo <"s_setkill" , (ins i16imm:$simm16), "$simm16">;
Expand Down
Loading