-
Notifications
You must be signed in to change notification settings - Fork 14.3k
[AMDGPU] Include unused preload kernarg in KD total SGPR count #104743
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[AMDGPU] Include unused preload kernarg in KD total SGPR count #104743
Conversation
@llvm/pr-subscribers-mc @llvm/pr-subscribers-backend-amdgpu Author: Austin Kerbow (kerbowa) ChangesUnlike with implicitly preloaded data UserSGPRs firmware is unable to handle cases where SGPRs for kernel arguments contain prelaoded data but not are not explicitly referenced in the kernel. We need to include these preloaded SGPRs in the GRANULATED_WAVEFRONT_SGPR_COUNT calculation to not clobber SGPRs in adjacent waves. Full diff: https://github.com/llvm/llvm-project/pull/104743.diff 2 Files Affected:
diff --git a/llvm/lib/Target/AMDGPU/AMDGPUAsmPrinter.cpp b/llvm/lib/Target/AMDGPU/AMDGPUAsmPrinter.cpp
index b90d245b7bd394..cfa5216c8c54b1 100644
--- a/llvm/lib/Target/AMDGPU/AMDGPUAsmPrinter.cpp
+++ b/llvm/lib/Target/AMDGPU/AMDGPUAsmPrinter.cpp
@@ -970,8 +970,15 @@ void AMDGPUAsmPrinter::getSIProgramInfo(SIProgramInfo &ProgInfo,
return SubGPR;
};
- ProgInfo.SGPRBlocks = GetNumGPRBlocks(ProgInfo.NumSGPRsForWavesPerEU,
- IsaInfo::getSGPREncodingGranule(&STM));
+ // Consider cases where the total number of UserSGPRs plus extra SGPRs is
+ // greater than the number of explicitly referenced SGPRs.
+ const MCExpr *MaxUserSGPRs = MCBinaryExpr::createAdd(
+ CreateExpr(MFI->getNumUserSGPRs()), ExtraSGPRs, Ctx);
+
+ ProgInfo.SGPRBlocks =
+ GetNumGPRBlocks(AMDGPUMCExpr::createMax(
+ {ProgInfo.NumSGPRsForWavesPerEU, MaxUserSGPRs}, Ctx),
+ IsaInfo::getSGPREncodingGranule(&STM));
ProgInfo.VGPRBlocks = GetNumGPRBlocks(ProgInfo.NumVGPRsForWavesPerEU,
IsaInfo::getVGPREncodingGranule(&STM));
diff --git a/llvm/test/CodeGen/AMDGPU/amdhsa-kernarg-preload-num-sgprs.ll b/llvm/test/CodeGen/AMDGPU/amdhsa-kernarg-preload-num-sgprs.ll
new file mode 100644
index 00000000000000..34bef81171e812
--- /dev/null
+++ b/llvm/test/CodeGen/AMDGPU/amdhsa-kernarg-preload-num-sgprs.ll
@@ -0,0 +1,14 @@
+; RUN: llc -mtriple=amdgcn-amd-amdhsa -mcpu=gfx940 -filetype=obj < %s > %t
+; RUN: llvm-objdump -s -j .rodata %t | FileCheck --check-prefix=OBJDUMP %s
+
+; OBJDUMP: Contents of section .rodata:
+; OBJDUMP-NEXT: 0000 00000000 00000000 10010000 00000000
+; OBJDUMP-NEXT: 0010 00000000 00000000 00000000 00000000
+; OBJDUMP-NEXT: 0020 00000000 00000000 00000000 00000000
+; OBJDUMP-NEXT: 0030 4000af00 94130000 1a000400 00000000
+; OBJDUMP-NOT: 0030 0000af00 94130000 1a000400 00000000
+
+; Include preloaded SGPRs that are not explicitly used in the kernel in
+; GRANULATED_WAVEFRONT_SGPR_COUNT.
+
+define amdgpu_kernel void @amdhsa_kernarg_preload_num_sgprs(i128 inreg) { ret void }
|
@@ -0,0 +1,14 @@ | |||
; RUN: llc -mtriple=amdgcn-amd-amdhsa -mcpu=gfx940 -filetype=obj < %s > %t | |||
; RUN: llvm-objdump -s -j .rodata %t | FileCheck --check-prefix=OBJDUMP %s |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Don't need this temporary file
; OBJDUMP-NEXT: 0030 4000af00 94130000 1a000400 00000000 | ||
; OBJDUMP-NOT: 0030 0000af00 94130000 1a000400 00000000 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We need a human readable asm output for reference, and should check the different kind of SGPR usage numbers
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We need a human readable asm output for reference, and should check the different kind of SGPR usage numbers
The annoying part about this bug is we don't directly output a directive for this field anywhere, it's totally derivative. To find the problem I was modifying the KD in the binary directly.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So we cannot go from .ll
-> .s
-> .o
? (i.e., will AMDGPUAsmParser
's calculation of SGPRBlocks still be correct?)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we should be changing the reported used register count, not just manipulating the encoding fields
; Include preloaded SGPRs that are not explicitly used in the kernel in | ||
; GRANULATED_WAVEFRONT_SGPR_COUNT. | ||
|
||
define amdgpu_kernel void @amdhsa_kernarg_preload_num_sgprs(i128 inreg) { ret void } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Test cases with more interaction with ordinary user SGPRs?
CreateExpr(MFI->getNumUserSGPRs()), ExtraSGPRs, Ctx); | ||
|
||
ProgInfo.SGPRBlocks = | ||
GetNumGPRBlocks(AMDGPUMCExpr::createMax( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think fixing this up here is too late. We should have bumped up the SGPR count in the MFI tracked value to begin with. We have a similar round up for the unused inreg shader arguments, and this is essentially the same thing
af950b6
to
3f5b993
Compare
✅ With the latest revision this PR passed the C/C++ code formatter. |
const MCExpr *MaxUserSGPRs = MCBinaryExpr::createAdd( | ||
CreateExpr(MFI->getNumUserSGPRs()), ExtraSGPRs, Ctx); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would expect this to be already added to MFI->getNumUserSGPRs. I.e. handle this during the calling convention lowering, not code emission
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ExtraSGPRs here doesn't refer to anything from my change. It is the extra SGPRs the HW reserves for architected flat scratch, XNACK, and debugger. I think the name MaxUserSGPRs is maybe a misnomer and confusing though. It's really calculating UserSGPRs including preloads+extra reserved by the HW.
These "extra" SGPRs are not UserSGPRs in the way the HW uses or sets them up, or in the way they are encoded in the KD. So they should not be included in our backends' tracking of them. Preload kernarg SGPRs are already included in MFI->getNumUserSGPRs.
The reason we add the ExtraSGPRs is because they are also added to ProgInfo.NumSGPR above so we need to consider them in the Max expr on the next line here. ExtraSGPRs do need to be included in the KD tally for total SGPR granules.
3f5b993
to
2e85e2a
Compare
Unlike with implicitly preloaded data UserSGPRs firmware is unable to handle cases where SGPRs for kernel arguments contain prelaoded data but not are not explicitly referenced in the kernel. We need to include these preloaded SGPRs in the GRANULATED_WAVEFRONT_SGPR_COUNT calculation to not clobber SGPRs in adjacent waves.
2e85e2a
to
1bd17ee
Compare
This updates LLVM to pull in two fixes we need for AMD: * llvm/llvm-project#110553 * llvm/llvm-project#104743 Fixed `LLVM::CallOp` and `LLVM::CallIntrinsicOp` builder API after * llvm/llvm-project#108933
This reverts ad9afc8 since the issue was fixed by llvm/llvm-project#104743
This reverts ad9afc8 since the issue was fixed by llvm/llvm-project#104743
This updates LLVM to pull in two fixes we need for AMD: * llvm/llvm-project#110553 * llvm/llvm-project#104743 Fixed `LLVM::CallOp` and `LLVM::CallIntrinsicOp` builder API after * llvm/llvm-project#108933
This reverts ad9afc8 since the issue was fixed by llvm/llvm-project#104743
This updates LLVM to pull in two fixes we need for AMD: * llvm/llvm-project#110553 * llvm/llvm-project#104743 Fixed `LLVM::CallOp` and `LLVM::CallIntrinsicOp` builder API after * llvm/llvm-project#108933
This reverts ad9afc8 since the issue was fixed by llvm/llvm-project#104743
Unlike with implicitly preloaded data UserSGPRs firmware is unable to handle cases where SGPRs for kernel arguments contain preloaded data but not are not explicitly referenced in the kernel. We need to include these preloaded SGPRs in the GRANULATED_WAVEFRONT_SGPR_COUNT calculation to not clobber SGPRs in adjacent waves.