Commit 72c3c30

[AMDGPU] Allocate scratch space for dVGPRs for CWSR (#130055)
The CWSR trap handler needs to save and restore the VGPRs. When dynamic VGPRs are in use, the fixed function hardware will only allocate enough space for one VGPR block. The rest will have to be stored in scratch, at offset 0.

This patch allocates the necessary space by:
- generating a prologue that checks at runtime if we're on a compute queue (since CWSR only works on compute queues); for this we will have to check the ME_ID bits of the ID_HW_ID2 register - if that is non-zero, we can assume we're on a compute queue and initialize the SP and FP with enough room for the dynamic VGPRs
- forcing all compute entry functions to use a FP so they can access their locals/spills correctly (this isn't ideal but it's the quickest to implement)

Note that at the moment we allocate enough space for the theoretical maximum number of VGPRs that can be allocated dynamically (for blocks of 16 registers, this will be 128, of which we subtract the first 16, which are already allocated by the fixed function hardware). Future patches may decide to allocate less if they can prove the shader never allocates that many blocks.

Also note that this should not affect any reported stack sizes (e.g. PAL backend_stack_size etc).
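
The size arithmetic described in the commit message can be sketched as below. This is a standalone illustration, not the code from the patch: `alignUp` and `reservedScratchBytes` are hypothetical names, `alignUp` stands in for `llvm::alignTo`, and the stack alignment values passed in the usage note are assumptions.

```cpp
#include <cassert>
#include <cstdint>

// Round Value up to the next multiple of Align (Align must be non-zero).
// Hypothetical stand-in for llvm::alignTo so the sketch has no LLVM dependency.
constexpr uint64_t alignUp(uint64_t Value, uint64_t Align) {
  return (Value + Align - 1) / Align * Align;
}

// Bytes of scratch to reserve for the VGPRs beyond the first block, which the
// fixed function hardware already has room for.
constexpr uint64_t reservedScratchBytes(unsigned MaxVGPRs, unsigned BlockSize,
                                        uint64_t StackAlign) {
  constexpr unsigned BytesPerVGPR = 4; // the `* 4` factor used in the patch
  return alignUp(uint64_t(MaxVGPRs - BlockSize) * BytesPerVGPR, StackAlign);
}
```

With the numbers quoted above (128 VGPRs, blocks of 16), `reservedScratchBytes(128, 16, 16)` gives (128 - 16) * 4 = 448 bytes, which is already a multiple of 16. A larger assumed alignment such as 256 would round that up to 512.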
1 parent 861efd4 commit 72c3c30

17 files changed, +474 -44 lines changed

llvm/docs/AMDGPUUsage.rst

Lines changed: 36 additions & 29 deletions
@@ -6027,8 +6027,13 @@ Frame Pointer
 
 If the kernel needs a frame pointer for the reasons defined in
 ``SIFrameLowering`` then SGPR33 is used and is always set to ``0`` in the
-kernel prolog. If a frame pointer is not required then all uses of the frame
-pointer are replaced with immediate ``0`` offsets.
+kernel prolog. On GFX12+, when dynamic VGPRs are enabled, the prologue will
+check if the kernel is running on a compute queue, and if so it will reserve
+some scratch space for any dynamic VGPRs that might need to be saved by the
+CWSR trap handler. In this case, the frame pointer will be initialized to
+a suitably aligned offset above this reserved area. If a frame pointer is not
+required then all uses of the frame pointer are replaced with immediate ``0``
+offsets.
 
 .. _amdgpu-amdhsa-kernel-prolog-flat-scratch:
 
@@ -17140,33 +17145,35 @@ within a map that has been added by the same *vendor-name*.
   .. table:: AMDPAL Code Object Hardware Stage Metadata Map
      :name: amdgpu-amdpal-code-object-hardware-stage-metadata-map-table
 
-     ========================== ============== ========= ===============================================================
-     String Key                 Value Type     Required? Description
-     ========================== ============== ========= ===============================================================
-     ".entry_point"             string                   The ELF symbol pointing to this pipeline's stage entry point.
-     ".scratch_memory_size"     integer                  Scratch memory size in bytes.
-     ".lds_size"                integer                  Local Data Share size in bytes.
-     ".perf_data_buffer_size"   integer                  Performance data buffer size in bytes.
-     ".vgpr_count"              integer                  Number of VGPRs used.
-     ".agpr_count"              integer                  Number of AGPRs used.
-     ".sgpr_count"              integer                  Number of SGPRs used.
-     ".vgpr_limit"              integer                  If non-zero, indicates the shader was compiled with a
-                                                         directive to instruct the compiler to limit the VGPR usage to
-                                                         be less than or equal to the specified value (only set if
-                                                         different from HW default).
-     ".sgpr_limit"              integer                  SGPR count upper limit (only set if different from HW
-                                                         default).
-     ".threadgroup_dimensions"  sequence of              Thread-group X/Y/Z dimensions (Compute only).
-                                3 integers
-     ".wavefront_size"          integer                  Wavefront size (only set if different from HW default).
-     ".uses_uavs"               boolean                  The shader reads or writes UAVs.
-     ".uses_rovs"               boolean                  The shader reads or writes ROVs.
-     ".writes_uavs"             boolean                  The shader writes to one or more UAVs.
-     ".writes_depth"            boolean                  The shader writes out a depth value.
-     ".uses_append_consume"     boolean                  The shader uses append and/or consume operations, either
-                                                         memory or GDS.
-     ".uses_prim_id"            boolean                  The shader uses PrimID.
-     ========================== ============== ========= ===============================================================
+     =========================== ============== ========= ===============================================================
+     String Key                  Value Type     Required? Description
+     =========================== ============== ========= ===============================================================
+     ".entry_point"              string                   The ELF symbol pointing to this pipeline's stage entry point.
+     ".scratch_memory_size"      integer                  Scratch memory size in bytes.
+     ".lds_size"                 integer                  Local Data Share size in bytes.
+     ".perf_data_buffer_size"    integer                  Performance data buffer size in bytes.
+     ".vgpr_count"               integer                  Number of VGPRs used.
+     ".agpr_count"               integer                  Number of AGPRs used.
+     ".sgpr_count"               integer                  Number of SGPRs used.
+     ".dynamic_vgpr_saved_count" integer        No        Number of dynamic VGPRs that can be stored in scratch by the
+                                                          CWSR trap handler. Only used on GFX12+.
+     ".vgpr_limit"               integer                  If non-zero, indicates the shader was compiled with a
+                                                          directive to instruct the compiler to limit the VGPR usage to
+                                                          be less than or equal to the specified value (only set if
+                                                          different from HW default).
+     ".sgpr_limit"               integer                  SGPR count upper limit (only set if different from HW
+                                                          default).
+     ".threadgroup_dimensions"   sequence of              Thread-group X/Y/Z dimensions (Compute only).
+                                 3 integers
+     ".wavefront_size"           integer                  Wavefront size (only set if different from HW default).
+     ".uses_uavs"                boolean                  The shader reads or writes UAVs.
+     ".uses_rovs"                boolean                  The shader reads or writes ROVs.
+     ".writes_uavs"              boolean                  The shader writes to one or more UAVs.
+     ".writes_depth"             boolean                  The shader writes out a depth value.
+     ".uses_append_consume"      boolean                  The shader uses append and/or consume operations, either
+                                                          memory or GDS.
+     ".uses_prim_id"             boolean                  The shader uses PrimID.
+     =========================== ============== ========= ===============================================================
 
   ..
 
llvm/lib/Target/AMDGPU/AMDGPUAsmPrinter.cpp

Lines changed: 8 additions & 1 deletion
@@ -1440,8 +1440,15 @@ void AMDGPUAsmPrinter::EmitPALMetadata(const MachineFunction &MF,
   MD->setEntryPoint(CC, MF.getFunction().getName());
   MD->setNumUsedVgprs(CC, CurrentProgramInfo.NumVGPRsForWavesPerEU, Ctx);
 
-  // Only set AGPRs for supported devices
+  // For targets that support dynamic VGPRs, set the number of saved dynamic
+  // VGPRs (if any) in the PAL metadata.
   const GCNSubtarget &STM = MF.getSubtarget<GCNSubtarget>();
+  if (STM.isDynamicVGPREnabled() &&
+      MFI->getScratchReservedForDynamicVGPRs() > 0)
+    MD->setHwStage(CC, ".dynamic_vgpr_saved_count",
+                   MFI->getScratchReservedForDynamicVGPRs() / 4);
+
+  // Only set AGPRs for supported devices
   if (STM.hasMAIInsts()) {
     MD->setNumUsedAgprs(CC, CurrentProgramInfo.NumAccVGPR);
   }
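
The hunk above converts the reserved byte count into a register count and only records it when something was actually reserved. A minimal standalone sketch of that rule follows. `HwStageMap` and `recordDynamicVGPRSavedCount` are hypothetical names for illustration and are not part of the PAL metadata API.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <string>

// Hypothetical stand-in for one hardware-stage metadata map.
using HwStageMap = std::map<std::string, uint64_t>;

// Record the saved-VGPR count only if dynamic VGPRs are enabled and scratch
// was reserved for them. The count is the reserved size divided by 4 bytes.
void recordDynamicVGPRSavedCount(HwStageMap &Stage, bool DynamicVGPREnabled,
                                 uint64_t ReservedBytes) {
  if (DynamicVGPREnabled && ReservedBytes > 0)
    Stage[".dynamic_vgpr_saved_count"] = ReservedBytes / 4;
}
```

Under these assumptions, 448 reserved bytes would be reported as 112 saved VGPRs, and the key stays absent when the feature is disabled or nothing was reserved, which matches the "Required? No" entry in the documentation table.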

llvm/lib/Target/AMDGPU/SIDefines.h

Lines changed: 1 addition & 0 deletions
@@ -552,6 +552,7 @@ enum Id { // HwRegCode, (6) [5:0]
 
 enum Offset : unsigned { // Offset, (5) [10:6]
   OFFSET_MEM_VIOL = 8,
+  OFFSET_ME_ID = 8, // in HW_ID2
 };
 
 enum ModeRegisterMasks : uint32_t {
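
To illustrate what an offset of 8 with a width of 2 selects, here is a small sketch of the hwreg operand packing and the resulting field extraction. The id and offset bit positions come from the enum comments above. The placement of width-minus-one in bits [15:11] is an assumption of this sketch, and `encodeHwreg` and `extractField` are hypothetical helpers, not the `HwregEncoding` class used by the patch.

```cpp
#include <cassert>
#include <cstdint>

// Pack a hwreg operand: id in [5:0], offset in [10:6], and (assumed)
// width-minus-one in [15:11].
constexpr uint16_t encodeHwreg(unsigned Id, unsigned Offset, unsigned Width) {
  return uint16_t((Id & 0x3f) | ((Offset & 0x1f) << 6) |
                  (((Width - 1) & 0x1f) << 11));
}

// Model of the value read back: Width bits of RegValue starting at Offset.
// Offset is expected to be below 32.
constexpr uint32_t extractField(uint32_t RegValue, unsigned Offset,
                                unsigned Width) {
  const uint32_t Mask = Width >= 32 ? ~0u : ((1u << Width) - 1u);
  return (RegValue >> Offset) & Mask;
}
```

For example, `extractField(0x200, 8, 2)` yields 2, so a register value with bit 9 set would read back as ME_ID 2. The register id passed to `encodeHwreg` is left as a parameter because the numeric value of `ID_HW_ID2` does not appear in this diff.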

llvm/lib/Target/AMDGPU/SIFrameLowering.cpp

Lines changed: 62 additions & 9 deletions
@@ -691,17 +691,62 @@ void SIFrameLowering::emitEntryFunctionPrologue(MachineFunction &MF,
   }
   assert(ScratchWaveOffsetReg || !PreloadedScratchWaveOffsetReg);
 
-  if (hasFP(MF)) {
+  unsigned Offset = FrameInfo.getStackSize() * getScratchScaleFactor(ST);
+  if (!mayReserveScratchForCWSR(MF)) {
+    if (hasFP(MF)) {
+      Register FPReg = MFI->getFrameOffsetReg();
+      assert(FPReg != AMDGPU::FP_REG);
+      BuildMI(MBB, I, DL, TII->get(AMDGPU::S_MOV_B32), FPReg).addImm(0);
+    }
+
+    if (requiresStackPointerReference(MF)) {
+      Register SPReg = MFI->getStackPtrOffsetReg();
+      assert(SPReg != AMDGPU::SP_REG);
+      BuildMI(MBB, I, DL, TII->get(AMDGPU::S_MOV_B32), SPReg).addImm(Offset);
+    }
+  } else {
+    // We need to check if we're on a compute queue - if we are, then the CWSR
+    // trap handler may need to store some VGPRs on the stack. The first VGPR
+    // block is saved separately, so we only need to allocate space for any
+    // additional VGPR blocks used. For now, we will make sure there's enough
+    // room for the theoretical maximum number of VGPRs that can be allocated.
+    // FIXME: Figure out if the shader uses fewer VGPRs in practice.
+    assert(hasFP(MF));
     Register FPReg = MFI->getFrameOffsetReg();
     assert(FPReg != AMDGPU::FP_REG);
-    BuildMI(MBB, I, DL, TII->get(AMDGPU::S_MOV_B32), FPReg).addImm(0);
-  }
-
-  if (requiresStackPointerReference(MF)) {
-    Register SPReg = MFI->getStackPtrOffsetReg();
-    assert(SPReg != AMDGPU::SP_REG);
-    BuildMI(MBB, I, DL, TII->get(AMDGPU::S_MOV_B32), SPReg)
-        .addImm(FrameInfo.getStackSize() * getScratchScaleFactor(ST));
+    unsigned VGPRSize =
+        llvm::alignTo((ST.getAddressableNumVGPRs() -
+                       AMDGPU::IsaInfo::getVGPRAllocGranule(&ST)) *
+                          4,
+                      FrameInfo.getMaxAlign());
+    MFI->setScratchReservedForDynamicVGPRs(VGPRSize);
+
+    BuildMI(MBB, I, DL, TII->get(AMDGPU::S_GETREG_B32), FPReg)
+        .addImm(AMDGPU::Hwreg::HwregEncoding::encode(
+            AMDGPU::Hwreg::ID_HW_ID2, AMDGPU::Hwreg::OFFSET_ME_ID, 2));
+    // The MicroEngine ID is 0 for the graphics queue, and 1 or 2 for compute
+    // (3 is unused, so we ignore it). Unfortunately, S_GETREG doesn't set
+    // SCC, so we need to check for 0 manually.
+    BuildMI(MBB, I, DL, TII->get(AMDGPU::S_CMP_LG_U32)).addImm(0).addReg(FPReg);
+    BuildMI(MBB, I, DL, TII->get(AMDGPU::S_CMOVK_I32), FPReg).addImm(VGPRSize);
+    if (requiresStackPointerReference(MF)) {
+      Register SPReg = MFI->getStackPtrOffsetReg();
+      assert(SPReg != AMDGPU::SP_REG);
+
+      // If at least one of the constants can be inlined, then we can use
+      // s_cselect. Otherwise, use a mov and cmovk.
+      if (AMDGPU::isInlinableLiteral32(Offset, ST.hasInv2PiInlineImm()) ||
+          AMDGPU::isInlinableLiteral32(Offset + VGPRSize,
+                                       ST.hasInv2PiInlineImm())) {
+        BuildMI(MBB, I, DL, TII->get(AMDGPU::S_CSELECT_B32), SPReg)
+            .addImm(Offset + VGPRSize)
+            .addImm(Offset);
+      } else {
+        BuildMI(MBB, I, DL, TII->get(AMDGPU::S_MOV_B32), SPReg).addImm(Offset);
+        BuildMI(MBB, I, DL, TII->get(AMDGPU::S_CMOVK_I32), SPReg)
+            .addImm(Offset + VGPRSize);
+      }
+    }
   }
 
   bool NeedsFlatScratchInit =
@@ -1831,9 +1876,17 @@ bool SIFrameLowering::hasFPImpl(const MachineFunction &MF) const {
   return frameTriviallyRequiresSP(MFI) || MFI.isFrameAddressTaken() ||
          MF.getSubtarget<GCNSubtarget>().getRegisterInfo()->hasStackRealignment(
              MF) ||
+         mayReserveScratchForCWSR(MF) ||
          MF.getTarget().Options.DisableFramePointerElim(MF);
 }
 
+bool SIFrameLowering::mayReserveScratchForCWSR(
+    const MachineFunction &MF) const {
+  return MF.getSubtarget<GCNSubtarget>().isDynamicVGPREnabled() &&
+         AMDGPU::isEntryFunctionCC(MF.getFunction().getCallingConv()) &&
+         AMDGPU::isCompute(MF.getFunction().getCallingConv());
+}
+
 // This is essentially a reduced version of hasFP for entry functions. Since the
 // stack pointer is known 0 on entry to kernels, we never really need an FP
 // register. We may need to initialize the stack pointer depending on the frame
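
The runtime effect of the emitted CWSR prologue can be modelled on the host as follows. This is a sketch under stated assumptions rather than the generated code: `FrameRegs` and `initFrame` are hypothetical names, the SP is always computed here even though the patch only initializes it when `requiresStackPointerReference(MF)` holds, and both SP lowerings (s_cselect, or mov plus cmovk) are collapsed into one conditional.

```cpp
#include <cassert>
#include <cstdint>

struct FrameRegs {
  uint32_t FP;
  uint32_t SP;
};

// HwId2 is the raw HW_ID2 register value, Offset the scaled stack size, and
// VGPRSize the scratch reserved for dynamic VGPRs.
FrameRegs initFrame(uint32_t HwId2, uint32_t Offset, uint32_t VGPRSize) {
  uint32_t FP = (HwId2 >> 8) & 0x3; // s_getreg_b32: ME_ID, 2 bits at offset 8
  const bool SCC = (FP != 0);       // s_cmp_lg_u32 0, FP
  if (SCC)
    FP = VGPRSize;                  // s_cmovk_i32 FP, VGPRSize
  // s_cselect_b32 SP, Offset + VGPRSize, Offset (or s_mov_b32 + s_cmovk_i32)
  const uint32_t SP = SCC ? Offset + VGPRSize : Offset;
  return {FP, SP};
}
```

On a graphics queue (ME_ID 0) the FP keeps the zero that was just read, so no separate move is needed. On a compute queue both registers are shifted up by `VGPRSize`, leaving offset 0 free for the trap handler.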

llvm/lib/Target/AMDGPU/SIFrameLowering.h

Lines changed: 4 additions & 0 deletions
@@ -86,6 +86,10 @@ class SIFrameLowering final : public AMDGPUFrameLowering {
 
 public:
   bool requiresStackPointerReference(const MachineFunction &MF) const;
+
+  // Returns true if the function may need to reserve space on the stack for the
+  // CWSR trap handler.
+  bool mayReserveScratchForCWSR(const MachineFunction &MF) const;
 };
 
 } // end namespace llvm

llvm/lib/Target/AMDGPU/SIMachineFunctionInfo.cpp

Lines changed: 2 additions & 1 deletion
@@ -715,7 +715,8 @@ yaml::SIMachineFunctionInfo::SIMachineFunctionInfo(
       ArgInfo(convertArgumentInfo(MFI.getArgInfo(), TRI)),
       PSInputAddr(MFI.getPSInputAddr()), PSInputEnable(MFI.getPSInputEnable()),
       MaxMemoryClusterDWords(MFI.getMaxMemoryClusterDWords()),
-      Mode(MFI.getMode()), HasInitWholeWave(MFI.hasInitWholeWave()) {
+      Mode(MFI.getMode()), HasInitWholeWave(MFI.hasInitWholeWave()),
+      ScratchReservedForDynamicVGPRs(MFI.getScratchReservedForDynamicVGPRs()) {
   for (Register Reg : MFI.getSGPRSpillPhysVGPRs())
     SpillPhysVGPRS.push_back(regToString(Reg, TRI));
 

llvm/lib/Target/AMDGPU/SIMachineFunctionInfo.h

Lines changed: 17 additions & 0 deletions
@@ -299,6 +299,8 @@ struct SIMachineFunctionInfo final : public yaml::MachineFunctionInfo {
 
   bool HasInitWholeWave = false;
 
+  unsigned ScratchReservedForDynamicVGPRs = 0;
+
   SIMachineFunctionInfo() = default;
   SIMachineFunctionInfo(const llvm::SIMachineFunctionInfo &,
                         const TargetRegisterInfo &TRI,
@@ -350,6 +352,8 @@ template <> struct MappingTraits<SIMachineFunctionInfo> {
     YamlIO.mapOptional("longBranchReservedReg", MFI.LongBranchReservedReg,
                        StringValue());
     YamlIO.mapOptional("hasInitWholeWave", MFI.HasInitWholeWave, false);
+    YamlIO.mapOptional("scratchReservedForDynamicVGPRs",
+                       MFI.ScratchReservedForDynamicVGPRs, 0);
   }
 };
 
@@ -455,6 +459,10 @@ class SIMachineFunctionInfo final : public AMDGPUMachineFunction,
   unsigned NumSpilledSGPRs = 0;
   unsigned NumSpilledVGPRs = 0;
 
+  // The size in bytes of the scratch space reserved for the CWSR trap handler
+  // to spill some of the dynamic VGPRs.
+  unsigned ScratchReservedForDynamicVGPRs = 0;
+
   // Tracks information about user SGPRs that will be setup by hardware which
   // will apply to all wavefronts of the grid.
   GCNUserSGPRUsageInfo UserSGPRInfo;
@@ -780,6 +788,15 @@ class SIMachineFunctionInfo final : public AMDGPUMachineFunction,
     BytesInStackArgArea = Bytes;
   }
 
+  // This is only used if we need to save any dynamic VGPRs in scratch.
+  unsigned getScratchReservedForDynamicVGPRs() const {
+    return ScratchReservedForDynamicVGPRs;
+  }
+
+  void setScratchReservedForDynamicVGPRs(unsigned SizeInBytes) {
+    ScratchReservedForDynamicVGPRs = SizeInBytes;
+  }
+
   // Add user SGPRs.
   Register addPrivateSegmentBuffer(const SIRegisterInfo &TRI);
   Register addDispatchPtr(const SIRegisterInfo &TRI);

llvm/lib/Target/AMDGPU/SIRegisterInfo.cpp

Lines changed: 1 addition & 0 deletions
@@ -511,6 +511,7 @@ SIRegisterInfo::getLargestLegalSuperClass(const TargetRegisterClass *RC,
 Register SIRegisterInfo::getFrameRegister(const MachineFunction &MF) const {
   const SIFrameLowering *TFI = ST.getFrameLowering();
   const SIMachineFunctionInfo *FuncInfo = MF.getInfo<SIMachineFunctionInfo>();
+
   // During ISel lowering we always reserve the stack pointer in entry and chain
   // functions, but never actually want to reference it when accessing our own
   // frame. If we need a frame pointer we use it, but otherwise we can just use
