
[AMDGPU][Scheduler] Refactor VGPR rematerialization during scheduling #118722


Closed
wants to merge 13 commits
4 changes: 4 additions & 0 deletions llvm/include/llvm/CodeGen/MachineRegisterInfo.h
@@ -23,6 +23,7 @@
#include "llvm/ADT/iterator_range.h"
#include "llvm/CodeGen/MachineBasicBlock.h"
#include "llvm/CodeGen/MachineFunction.h"
#include "llvm/CodeGen/MachineInstr.h"
#include "llvm/CodeGen/MachineInstrBundle.h"
#include "llvm/CodeGen/MachineOperand.h"
#include "llvm/CodeGen/RegisterBank.h"
@@ -592,6 +593,9 @@ class MachineRegisterInfo {
/// multiple uses.
bool hasOneNonDBGUser(Register RegNo) const;

/// If the register has a single non-debug user instruction, returns it;
/// otherwise returns nullptr.
MachineInstr *getOneNonDBGUser(Register RegNo) const;

/// hasAtMostUses - Return true if the given register has at most \p MaxUsers
/// non-debug user instructions.
5 changes: 5 additions & 0 deletions llvm/lib/CodeGen/MachineRegisterInfo.cpp
@@ -431,6 +431,11 @@ bool MachineRegisterInfo::hasOneNonDBGUser(Register RegNo) const {
return hasSingleElement(use_nodbg_instructions(RegNo));
}

MachineInstr *MachineRegisterInfo::getOneNonDBGUser(Register RegNo) const {
auto RegNoDbgUsers = use_nodbg_instructions(RegNo);
return hasSingleElement(RegNoDbgUsers) ? &*RegNoDbgUsers.begin() : nullptr;
}

bool MachineRegisterInfo::hasAtMostUserInstrs(Register Reg,
unsigned MaxUsers) const {
return hasNItemsOrLess(use_instr_nodbg_begin(Reg), use_instr_nodbg_end(),
449 changes: 244 additions & 205 deletions llvm/lib/Target/AMDGPU/GCNSchedStrategy.cpp

Large diffs are not rendered by default.

62 changes: 41 additions & 21 deletions llvm/lib/Target/AMDGPU/GCNSchedStrategy.h
@@ -14,7 +14,8 @@
#define LLVM_LIB_TARGET_AMDGPU_GCNSCHEDSTRATEGY_H

#include "GCNRegPressure.h"
#include "llvm/ADT/MapVector.h"
#include "llvm/ADT/DenseMap.h"
#include "llvm/CodeGen/MachineInstr.h"
#include "llvm/CodeGen/MachineScheduler.h"

namespace llvm {
@@ -419,30 +420,49 @@ class ClusteredLowOccStage : public GCNSchedStage {
: GCNSchedStage(StageID, DAG) {}
};

/// Attempts to increase function occupancy with respect to VGPR usage by one by
/// sinking trivially rematerializable instructions to their use. When the stage
/// estimates increasing occupancy is possible, as few instructions as possible
/// are rematerialized to reduce potential negative effects on function latency.
///
/// TODO: We should extend this to work on SGPRs and AGPRs as well.
class PreRARematStage : public GCNSchedStage {
private:
// Each region at MinOccupancy will have their own list of trivially
// rematerializable instructions we can remat to reduce RP. The list maps an
// instruction to the position we should remat before, usually the MI using
// the rematerializable instruction.
MapVector<unsigned, MapVector<MachineInstr *, MachineInstr *>>
RematerializableInsts;

// Map a trivially rematerializable def to a list of regions at MinOccupancy
// that has the defined reg as a live-in.
DenseMap<MachineInstr *, SmallVector<unsigned, 4>> RematDefToLiveInRegions;

// Collect all trivially rematerializable VGPR instructions with a single def
// and single use outside the defining block into RematerializableInsts.
void collectRematerializableInstructions();

/// A trivially rematerializable VGPR-defining instruction along with
/// pre-computed information to help update the scheduler's status when we
/// rematerialize it.
struct RematInstruction {
/// Trivially rematerializable instruction.
MachineInstr *RematMI;
/// Single use of the rematerializable instruction's defined register,
/// located in a different block.
MachineInstr *UseMI;
Comment on lines +444 to +446

Contributor:

I thought the scheduling was per block only?

Contributor Author:

AFAIU it is per region, and in the case of a rematerialization both the region where the instruction was rematerialized from and the one it is rematerialized to will be rescheduled after this scheduling stage takes place.

In the future we could improve rematerialization to work even if the def and use are in the same block but not the same region, which from what I understand can happen and make sense.
/// Set of regions in which the rematerializable instruction's defined
/// register is a live-in.
SmallDenseSet<unsigned, 4> LiveInRegions;
/// Region containing the rematerializable instruction.
unsigned DefRegion;

RematInstruction(MachineInstr *RematMI, unsigned DefRegion,
MachineInstr *UseMI)
: RematMI(RematMI), UseMI(UseMI), DefRegion(DefRegion) {}
};

/// Determines whether we can increase function occupancy by 1 through
/// rematerialization. If we can, returns true and fills \p RematInstructions
/// with a list of rematerializable instructions whose sinking would result in
/// increased occupancy; returns false otherwise.
bool
canIncreaseOccupancy(SmallVectorImpl<RematInstruction> &RematInstructions);

/// Whether the MI is trivially rematerializable and does not have any virtual
/// register use.
bool isTriviallyReMaterializable(const MachineInstr &MI);

// TODO: Should also attempt to reduce RP of SGPRs and AGPRs
// Attempt to reduce RP of VGPR by sinking trivially rematerializable
// instructions. Returns true if we were able to sink instruction(s).
bool sinkTriviallyRematInsts(const GCNSubtarget &ST,
const TargetInstrInfo *TII);
/// Sinks all instructions in \p RematInstructions to increase function
/// occupancy. Modified regions are tagged for rescheduling.
void sinkTriviallyRematInsts(ArrayRef<RematInstruction> RematInstructions,
const GCNSubtarget &ST, const SIInstrInfo *TII);

public:
bool initGCNSchedStage() override;
4 changes: 4 additions & 0 deletions llvm/lib/Target/AMDGPU/GCNSubtarget.cpp
@@ -367,6 +367,10 @@ unsigned GCNSubtarget::getOccupancyWithNumVGPRs(unsigned NumVGPRs) const {
return AMDGPU::IsaInfo::getNumWavesPerEUWithNumVGPRs(this, NumVGPRs);
}

unsigned GCNSubtarget::getNumVGPRsToIncreaseOccupancy(unsigned NumVGPRs) const {
return AMDGPU::IsaInfo::getVGPRReductionToIncreaseWavesPerEU(this, NumVGPRs);
}

unsigned
GCNSubtarget::getBaseReservedNumSGPRs(const bool HasFlatScratch) const {
if (getGeneration() >= AMDGPUSubtarget::GFX10)
5 changes: 5 additions & 0 deletions llvm/lib/Target/AMDGPU/GCNSubtarget.h
@@ -1368,6 +1368,11 @@ class GCNSubtarget final : public AMDGPUGenSubtargetInfo,
/// VGPRs
unsigned getOccupancyWithNumVGPRs(unsigned VGPRs) const;

/// Returns the necessary reduction in number of VGPRs from using \p VGPRs
/// VGPRs to increase occupancy by 1. Returns 0 when using \p VGPRs VGPRs
/// already results in maximum occupancy.
unsigned getNumVGPRsToIncreaseOccupancy(unsigned VGPRs) const;

/// Return occupancy for the given function. Used LDS and a number of
/// registers if provided.
/// Note, occupancy can be affected by the scratch allocation as well, but
13 changes: 13 additions & 0 deletions llvm/lib/Target/AMDGPU/Utils/AMDGPUBaseInfo.cpp
@@ -1185,6 +1185,19 @@ unsigned getNumWavesPerEUWithNumVGPRs(unsigned NumVGPRs, unsigned Granule,
return std::min(std::max(TotalNumVGPRs / RoundedRegs, 1u), MaxWaves);
}

unsigned getVGPRReductionToIncreaseWavesPerEU(const MCSubtargetInfo *STI,
unsigned NumVGPRs) {
unsigned Granule = getVGPRAllocGranule(STI);
unsigned MaxWaves = getMaxWavesPerEU(STI);
unsigned TotalNumVGPRs = getTotalNumVGPRs(STI);

unsigned NumWaves =
getNumWavesPerEUWithNumVGPRs(NumVGPRs, Granule, MaxWaves, TotalNumVGPRs);
if (NumWaves == MaxWaves)
return 0;
return NumVGPRs - alignDown(TotalNumVGPRs / (NumWaves + 1), Granule);
}

unsigned getOccupancyWithNumSGPRs(unsigned SGPRs, unsigned MaxWaves,
AMDGPUSubtarget::Generation Gen) {
if (Gen >= AMDGPUSubtarget::GFX10)
8 changes: 8 additions & 0 deletions llvm/lib/Target/AMDGPU/Utils/AMDGPUBaseInfo.h
@@ -324,6 +324,14 @@ unsigned getMaxNumVGPRs(const MCSubtargetInfo *STI, unsigned WavesPerEU);
unsigned getNumWavesPerEUWithNumVGPRs(const MCSubtargetInfo *STI,
unsigned NumVGPRs);

/// Returns the necessary reduction in number of VGPRs from using \p VGPRs VGPRs
/// to increase the achievable number of waves per EU for this subtarget by 1.
/// Returns 0 when using \p VGPRs VGPRs already results in the maximum number of
/// waves per EU.
unsigned getVGPRReductionToIncreaseWavesPerEU(const MCSubtargetInfo *STI,
                                              unsigned NumVGPRs);

/// \returns Number of waves reachable for a given \p NumVGPRs usage, \p Granule
/// size, \p MaxWaves possible, and \p TotalNumVGPRs available.
unsigned getNumWavesPerEUWithNumVGPRs(unsigned NumVGPRs, unsigned Granule,