Skip to content

[AMDGPU] Include unused preload kernarg in KD total SGPR count #104743

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 1 commit into from
Sep 23, 2024

Conversation

kerbowa
Copy link
Member

@kerbowa kerbowa commented Aug 19, 2024

Unlike with implicitly preloaded data UserSGPRs firmware is unable to handle cases where SGPRs for kernel arguments contain preloaded data but not are not explicitly referenced in the kernel. We need to include these preloaded SGPRs in the GRANULATED_WAVEFRONT_SGPR_COUNT calculation to not clobber SGPRs in adjacent waves.

@llvmbot
Copy link
Member

llvmbot commented Aug 19, 2024

@llvm/pr-subscribers-mc

@llvm/pr-subscribers-backend-amdgpu

Author: Austin Kerbow (kerbowa)

Changes

Unlike with implicitly preloaded data UserSGPRs firmware is unable to handle cases where SGPRs for kernel arguments contain prelaoded data but not are not explicitly referenced in the kernel. We need to include these preloaded SGPRs in the GRANULATED_WAVEFRONT_SGPR_COUNT calculation to not clobber SGPRs in adjacent waves.


Full diff: https://github.com/llvm/llvm-project/pull/104743.diff

2 Files Affected:

  • (modified) llvm/lib/Target/AMDGPU/AMDGPUAsmPrinter.cpp (+9-2)
  • (added) llvm/test/CodeGen/AMDGPU/amdhsa-kernarg-preload-num-sgprs.ll (+14)
diff --git a/llvm/lib/Target/AMDGPU/AMDGPUAsmPrinter.cpp b/llvm/lib/Target/AMDGPU/AMDGPUAsmPrinter.cpp
index b90d245b7bd394..cfa5216c8c54b1 100644
--- a/llvm/lib/Target/AMDGPU/AMDGPUAsmPrinter.cpp
+++ b/llvm/lib/Target/AMDGPU/AMDGPUAsmPrinter.cpp
@@ -970,8 +970,15 @@ void AMDGPUAsmPrinter::getSIProgramInfo(SIProgramInfo &ProgInfo,
     return SubGPR;
   };
 
-  ProgInfo.SGPRBlocks = GetNumGPRBlocks(ProgInfo.NumSGPRsForWavesPerEU,
-                                        IsaInfo::getSGPREncodingGranule(&STM));
+  // Consider cases where the total number of UserSGPRs plus extra SGPRs is
+  // greater than the number of explicitly referenced SGPRs.
+  const MCExpr *MaxUserSGPRs = MCBinaryExpr::createAdd(
+      CreateExpr(MFI->getNumUserSGPRs()), ExtraSGPRs, Ctx);
+
+  ProgInfo.SGPRBlocks =
+      GetNumGPRBlocks(AMDGPUMCExpr::createMax(
+                          {ProgInfo.NumSGPRsForWavesPerEU, MaxUserSGPRs}, Ctx),
+                      IsaInfo::getSGPREncodingGranule(&STM));
   ProgInfo.VGPRBlocks = GetNumGPRBlocks(ProgInfo.NumVGPRsForWavesPerEU,
                                         IsaInfo::getVGPREncodingGranule(&STM));
 
diff --git a/llvm/test/CodeGen/AMDGPU/amdhsa-kernarg-preload-num-sgprs.ll b/llvm/test/CodeGen/AMDGPU/amdhsa-kernarg-preload-num-sgprs.ll
new file mode 100644
index 00000000000000..34bef81171e812
--- /dev/null
+++ b/llvm/test/CodeGen/AMDGPU/amdhsa-kernarg-preload-num-sgprs.ll
@@ -0,0 +1,14 @@
+; RUN: llc -mtriple=amdgcn-amd-amdhsa -mcpu=gfx940 -filetype=obj < %s > %t
+; RUN: llvm-objdump -s -j .rodata %t | FileCheck --check-prefix=OBJDUMP %s
+
+; OBJDUMP: Contents of section .rodata:
+; OBJDUMP-NEXT: 0000 00000000 00000000 10010000 00000000
+; OBJDUMP-NEXT: 0010 00000000 00000000 00000000 00000000
+; OBJDUMP-NEXT: 0020 00000000 00000000 00000000 00000000
+; OBJDUMP-NEXT: 0030 4000af00 94130000 1a000400 00000000
+; OBJDUMP-NOT: 0030 0000af00 94130000 1a000400 00000000
+
+; Include preloaded SGPRs that are not explicitly used in the kernel in
+; GRANULATED_WAVEFRONT_SGPR_COUNT.
+
+define amdgpu_kernel void @amdhsa_kernarg_preload_num_sgprs(i128 inreg) { ret void }

@@ -0,0 +1,14 @@
; RUN: llc -mtriple=amdgcn-amd-amdhsa -mcpu=gfx940 -filetype=obj < %s > %t
; RUN: llvm-objdump -s -j .rodata %t | FileCheck --check-prefix=OBJDUMP %s
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Don't need this temporary file

Comment on lines 8 to 9
; OBJDUMP-NEXT: 0030 4000af00 94130000 1a000400 00000000
; OBJDUMP-NOT: 0030 0000af00 94130000 1a000400 00000000
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We need a human readable asm output for reference, and should check the different kind of SGPR usage numbers

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We need a human readable asm output for reference, and should check the different kind of SGPR usage numbers

The annoying part about this bug is we don't directly output a directive for this field anywhere, it's totally derivative. To find the problem I was modifying the KD in the binary directly.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So we cannot go from .ll -> .s -> .o? (i.e., will AMDGPUAsmParser's calculation of SGPRBlocks still be correct?)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should be changing the reported used register count, not just manipulating the encoding fields

; Include preloaded SGPRs that are not explicitly used in the kernel in
; GRANULATED_WAVEFRONT_SGPR_COUNT.

define amdgpu_kernel void @amdhsa_kernarg_preload_num_sgprs(i128 inreg) { ret void }
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Test cases with more interaction with ordinary user SGPRs?

CreateExpr(MFI->getNumUserSGPRs()), ExtraSGPRs, Ctx);

ProgInfo.SGPRBlocks =
GetNumGPRBlocks(AMDGPUMCExpr::createMax(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think fixing this up here is too late. We should have bumped up the SGPR count in the MFI tracked value to begin with. We have a similar round up for the unused inreg shader arguments, and this is essentially the same thing

@kerbowa kerbowa force-pushed the include-unused-preload-kernarg-in-KD branch from af950b6 to 3f5b993 Compare September 16, 2024 00:10
@llvmbot llvmbot added the mc Machine (object) code label Sep 16, 2024
Copy link

github-actions bot commented Sep 16, 2024

✅ With the latest revision this PR passed the C/C++ code formatter.

Comment on lines 907 to 909
const MCExpr *MaxUserSGPRs = MCBinaryExpr::createAdd(
CreateExpr(MFI->getNumUserSGPRs()), ExtraSGPRs, Ctx);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would expect this to be already added to MFI->getNumUserSGPRs. I.e. handle this during the calling convention lowering, not code emission

Copy link
Member Author

@kerbowa kerbowa Sep 18, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ExtraSGPRs here doesn't refer to anything from my change. It is the extra SGPRs the HW reserves for architected flat scratch, XNACK, and debugger. I think the name MaxUserSGPRs is maybe a misnomer and confusing though. It's really calculating UserSGPRs including preloads+extra reserved by the HW.

These "extra" SGPRs are not UserSGPRs in the way the HW uses or sets them up, or in the way they are encoded in the KD. So they should not be included in our backends' tracking of them. Preload kernarg SGPRs are already included in MFI->getNumUserSGPRs.

The reason we add the ExtraSGPRs is because they are also added to ProgInfo.NumSGPR above so we need to consider them in the Max expr on the next line here. ExtraSGPRs do need to be included in the KD tally for total SGPR granules.

@kerbowa kerbowa force-pushed the include-unused-preload-kernarg-in-KD branch from 3f5b993 to 2e85e2a Compare September 23, 2024 17:05
Unlike with implicitly preloaded data UserSGPRs firmware is unable to
handle cases where SGPRs for kernel arguments contain prelaoded data but
not are not explicitly referenced in the kernel. We need to include
these preloaded SGPRs in the GRANULATED_WAVEFRONT_SGPR_COUNT calculation
to not clobber SGPRs in adjacent waves.
@kerbowa kerbowa force-pushed the include-unused-preload-kernarg-in-KD branch from 2e85e2a to 1bd17ee Compare September 23, 2024 20:46
@kerbowa kerbowa merged commit 954ab83 into llvm:main Sep 23, 2024
5 of 6 checks passed
@kerbowa kerbowa deleted the include-unused-preload-kernarg-in-KD branch September 24, 2024 15:32
antiagainst added a commit to triton-lang/triton that referenced this pull request Oct 4, 2024
This updates LLVM to pull in two fixes we need for AMD:

* llvm/llvm-project#110553
* llvm/llvm-project#104743

Fixed `LLVM::CallOp` and `LLVM::CallIntrinsicOp` builder API after
* llvm/llvm-project#108933
zhanglx13 added a commit to triton-lang/triton that referenced this pull request Oct 5, 2024
sfzhu93 pushed a commit to sfzhu93/triton that referenced this pull request Oct 11, 2024
Luosuu pushed a commit to Luosuu/triton that referenced this pull request Nov 13, 2024
This updates LLVM to pull in two fixes we need for AMD:

* llvm/llvm-project#110553
* llvm/llvm-project#104743

Fixed `LLVM::CallOp` and `LLVM::CallIntrinsicOp` builder API after
* llvm/llvm-project#108933
Luosuu pushed a commit to Luosuu/triton that referenced this pull request Nov 13, 2024
bertmaher pushed a commit to bertmaher/triton that referenced this pull request Dec 10, 2024
This updates LLVM to pull in two fixes we need for AMD:

* llvm/llvm-project#110553
* llvm/llvm-project#104743

Fixed `LLVM::CallOp` and `LLVM::CallIntrinsicOp` builder API after
* llvm/llvm-project#108933
bertmaher pushed a commit to bertmaher/triton that referenced this pull request Dec 10, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
backend:AMDGPU mc Machine (object) code
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants