[AMDGPU] Extend promotion of alloca to vectors #127973
perlfu commented on Feb 20, 2025
- Add multi-dimensional array support
- Make maximum vector size tunable
- Make ratio of VGPRs used for vector promotion tunable
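As a rough illustration of the first bullet, a multi-dimensional array such as a 4x4 array of floats can be viewed as a single flat vector of 16 lanes, with each multi-dimensional index mapped to one lane. The sketch below assumes row-major flattening; `flattenIndex` and `totalElements` are hypothetical helper names and are not taken from the patch, whose actual implementation may differ.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper (not from the patch): map a multi-dimensional array
// index onto a lane of the flattened vector, assuming row-major order.
//   dims    - extent of each array dimension, outermost first
//   indices - one index per dimension, in the same order
std::size_t flattenIndex(const std::vector<std::size_t> &dims,
                         const std::vector<std::size_t> &indices) {
  assert(dims.size() == indices.size() && "one index per dimension");
  std::size_t lane = 0;
  for (std::size_t d = 0; d < dims.size(); ++d) {
    assert(indices[d] < dims[d] && "index out of bounds");
    // Scale what has been accumulated so far by this dimension's extent,
    // then add the index within this dimension.
    lane = lane * dims[d] + indices[d];
  }
  return lane;
}

// Hypothetical helper: total number of vector lanes needed for the array.
std::size_t totalElements(const std::vector<std::size_t> &dims) {
  std::size_t n = 1;
  for (std::size_t extent : dims)
    n *= extent;
  return n;
}
```

Under these assumptions, element `[2][3]` of a 4x4 array lands in lane `2 * 4 + 3` of a 16-lane vector.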
@llvm/pr-subscribers-backend-amdgpu Author: Carl Ritson (perlfu) Changes
Patch is 93.46 KiB, truncated to 20.00 KiB below, full version: https://github.com/llvm/llvm-project/pull/127973.diff 9 Files Affected:
diff --git a/llvm/docs/AMDGPUUsage.rst b/llvm/docs/AMDGPUUsage.rst
index d580be1eb8cfc..734434641b4bd 100644
--- a/llvm/docs/AMDGPUUsage.rst
+++ b/llvm/docs/AMDGPUUsage.rst
@@ -1546,180 +1546,184 @@ The AMDGPU backend supports the following LLVM IR attributes.
.. table:: AMDGPU LLVM IR Attributes
:name: amdgpu-llvm-ir-attributes-table
- ============================================ ==========================================================
- LLVM Attribute Description
- ============================================ ==========================================================
- "amdgpu-flat-work-group-size"="min,max" Specify the minimum and maximum flat work group sizes that
- will be specified when the kernel is dispatched. Generated
- by the ``amdgpu_flat_work_group_size`` CLANG attribute [CLANG-ATTR]_.
- The IR implied default value is 1,1024. Clang may emit this attribute
- with more restrictive bounds depending on language defaults.
- If the actual block or workgroup size exceeds the limit at any point during
- the execution, the behavior is undefined. For example, even if there is
- only one active thread but the thread local id exceeds the limit, the
- behavior is undefined.
-
- "amdgpu-implicitarg-num-bytes"="n" Number of kernel argument bytes to add to the kernel
- argument block size for the implicit arguments. This
- varies by OS and language (for OpenCL see
- :ref:`opencl-kernel-implicit-arguments-appended-for-amdhsa-os-table`).
- "amdgpu-num-sgpr"="n" Specifies the number of SGPRs to use. Generated by
- the ``amdgpu_num_sgpr`` CLANG attribute [CLANG-ATTR]_.
- "amdgpu-num-vgpr"="n" Specifies the number of VGPRs to use. Generated by the
- ``amdgpu_num_vgpr`` CLANG attribute [CLANG-ATTR]_.
- "amdgpu-waves-per-eu"="m,n" Specify the minimum and maximum number of waves per
- execution unit. Generated by the ``amdgpu_waves_per_eu``
- CLANG attribute [CLANG-ATTR]_. This is an optimization hint,
- and the backend may not be able to satisfy the request. If
- the specified range is incompatible with the function's
- "amdgpu-flat-work-group-size" value, the implied occupancy
- bounds by the workgroup size takes precedence.
-
- "amdgpu-ieee" true/false. GFX6-GFX11 Only
- Specify whether the function expects the IEEE field of the
- mode register to be set on entry. Overrides the default for
- the calling convention.
- "amdgpu-dx10-clamp" true/false. GFX6-GFX11 Only
- Specify whether the function expects the DX10_CLAMP field of
- the mode register to be set on entry. Overrides the default
- for the calling convention.
-
- "amdgpu-no-workitem-id-x" Indicates the function does not depend on the value of the
- llvm.amdgcn.workitem.id.x intrinsic. If a function is marked with this
- attribute, or reached through a call site marked with this attribute, and
- that intrinsic is called, the behavior of the program is undefined. (Whole-program
- undefined behavior is used here because, for example, the absence of a required workitem
- ID in the preloaded register set can mean that all other preloaded registers
- are earlier than the compilation assumed they would be.) The backend can
- generally infer this during code generation, so typically there is no
- benefit to frontends marking functions with this.
-
- "amdgpu-no-workitem-id-y" The same as amdgpu-no-workitem-id-x, except for the
- llvm.amdgcn.workitem.id.y intrinsic.
-
- "amdgpu-no-workitem-id-z" The same as amdgpu-no-workitem-id-x, except for the
- llvm.amdgcn.workitem.id.z intrinsic.
-
- "amdgpu-no-workgroup-id-x" The same as amdgpu-no-workitem-id-x, except for the
- llvm.amdgcn.workgroup.id.x intrinsic.
-
- "amdgpu-no-workgroup-id-y" The same as amdgpu-no-workitem-id-x, except for the
- llvm.amdgcn.workgroup.id.y intrinsic.
-
- "amdgpu-no-workgroup-id-z" The same as amdgpu-no-workitem-id-x, except for the
- llvm.amdgcn.workgroup.id.z intrinsic.
-
- "amdgpu-no-dispatch-ptr" The same as amdgpu-no-workitem-id-x, except for the
- llvm.amdgcn.dispatch.ptr intrinsic.
-
- "amdgpu-no-implicitarg-ptr" The same as amdgpu-no-workitem-id-x, except for the
- llvm.amdgcn.implicitarg.ptr intrinsic.
-
- "amdgpu-no-dispatch-id" The same as amdgpu-no-workitem-id-x, except for the
- llvm.amdgcn.dispatch.id intrinsic.
-
- "amdgpu-no-queue-ptr" Similar to amdgpu-no-workitem-id-x, except for the
- llvm.amdgcn.queue.ptr intrinsic. Note that unlike the other ABI hint
- attributes, the queue pointer may be required in situations where the
- intrinsic call does not directly appear in the program. Some subtargets
- require the queue pointer for to handle some addrspacecasts, as well
- as the llvm.amdgcn.is.shared, llvm.amdgcn.is.private, llvm.trap, and
- llvm.debug intrinsics.
-
- "amdgpu-no-hostcall-ptr" Similar to amdgpu-no-implicitarg-ptr, except specific to the implicit
- kernel argument that holds the pointer to the hostcall buffer. If this
- attribute is absent, then the amdgpu-no-implicitarg-ptr is also removed.
-
- "amdgpu-no-heap-ptr" Similar to amdgpu-no-implicitarg-ptr, except specific to the implicit
- kernel argument that holds the pointer to an initialized memory buffer
- that conforms to the requirements of the malloc/free device library V1
- version implementation. If this attribute is absent, then the
- amdgpu-no-implicitarg-ptr is also removed.
-
- "amdgpu-no-multigrid-sync-arg" Similar to amdgpu-no-implicitarg-ptr, except specific to the implicit
- kernel argument that holds the multigrid synchronization pointer. If this
- attribute is absent, then the amdgpu-no-implicitarg-ptr is also removed.
-
- "amdgpu-no-default-queue" Similar to amdgpu-no-implicitarg-ptr, except specific to the implicit
- kernel argument that holds the default queue pointer. If this
- attribute is absent, then the amdgpu-no-implicitarg-ptr is also removed.
-
- "amdgpu-no-completion-action" Similar to amdgpu-no-implicitarg-ptr, except specific to the implicit
- kernel argument that holds the completion action pointer. If this
- attribute is absent, then the amdgpu-no-implicitarg-ptr is also removed.
-
- "amdgpu-lds-size"="min[,max]" Min is the minimum number of bytes that will be allocated in the Local
- Data Store at address zero. Variables are allocated within this frame
- using absolute symbol metadata, primarily by the AMDGPULowerModuleLDS
- pass. Optional max is the maximum number of bytes that will be allocated.
- Note that min==max indicates that no further variables can be added to
- the frame. This is an internal detail of how LDS variables are lowered,
- language front ends should not set this attribute.
-
- "amdgpu-gds-size" Bytes expected to be allocated at the start of GDS memory at entry.
-
- "amdgpu-git-ptr-high" The hard-wired high half of the address of the global information table
- for AMDPAL OS type. 0xffffffff represents no hard-wired high half, since
- current hardware only allows a 16 bit value.
-
- "amdgpu-32bit-address-high-bits" Assumed high 32-bits for 32-bit address spaces which are really truncated
- 64-bit addresses (i.e., addrspace(6))
-
- "amdgpu-color-export" Indicates shader exports color information if set to 1.
- Defaults to 1 for :ref:`amdgpu_ps <amdgpu-cc>`, and 0 for other calling
- conventions. Determines the necessity and type of null exports when a shader
- terminates early by killing lanes.
-
- "amdgpu-depth-export" Indicates shader exports depth information if set to 1. Determines the
- necessity and type of null exports when a shader terminates early by killing
- lanes. A depth-only shader will export to depth channel when no null export
- target is available (GFX11+).
-
- "InitialPSInputAddr" Set the initial value of the `spi_ps_input_addr` register for
- :ref:`amdgpu_ps <amdgpu-cc>` shaders. Any bits enabled by this value will
- be enabled in the final register value.
-
- "amdgpu-wave-priority-threshold" VALU instruction count threshold for adjusting wave priority. If exceeded,
- temporarily raise the wave priority at the start of the shader function
- until its last VMEM instructions to allow younger waves to issue their VMEM
- instructions as well.
+ =============================================== ==========================================================
+ LLVM Attribute Description
+ =============================================== ==========================================================
+ "amdgpu-flat-work-group-size"="min,max" Specify the minimum and maximum flat work group sizes that
+ will be specified when the kernel is dispatched. Generated
+ by the ``amdgpu_flat_work_group_size`` CLANG attribute [CLANG-ATTR]_.
+ The IR implied default value is 1,1024. Clang may emit this attribute
+ with more restrictive bounds depending on language defaults.
+ If the actual block or workgroup size exceeds the limit at any point during
+ the execution, the behavior is undefined. For example, even if there is
+ only one active thread but the thread local id exceeds the limit, the
+ behavior is undefined.
+
+ "amdgpu-implicitarg-num-bytes"="n" Number of kernel argument bytes to add to the kernel
+ argument block size for the implicit arguments. This
+ varies by OS and language (for OpenCL see
+ :ref:`opencl-kernel-implicit-arguments-appended-for-amdhsa-os-table`).
+ "amdgpu-num-sgpr"="n" Specifies the number of SGPRs to use. Generated by
+ the ``amdgpu_num_sgpr`` CLANG attribute [CLANG-ATTR]_.
+ "amdgpu-num-vgpr"="n" Specifies the number of VGPRs to use. Generated by the
+ ``amdgpu_num_vgpr`` CLANG attribute [CLANG-ATTR]_.
+ "amdgpu-waves-per-eu"="m,n" Specify the minimum and maximum number of waves per
+ execution unit. Generated by the ``amdgpu_waves_per_eu``
+ CLANG attribute [CLANG-ATTR]_. This is an optimization hint,
+ and the backend may not be able to satisfy the request. If
+ the specified range is incompatible with the function's
+ "amdgpu-flat-work-group-size" value, the implied occupancy
+ bounds by the workgroup size takes precedence.
+
+ "amdgpu-ieee" true/false. GFX6-GFX11 Only
+ Specify whether the function expects the IEEE field of the
+ mode register to be set on entry. Overrides the default for
+ the calling convention.
+ "amdgpu-dx10-clamp" true/false. GFX6-GFX11 Only
+ Specify whether the function expects the DX10_CLAMP field of
+ the mode register to be set on entry. Overrides the default
+ for the calling convention.
+
+ "amdgpu-no-workitem-id-x" Indicates the function does not depend on the value of the
+ llvm.amdgcn.workitem.id.x intrinsic. If a function is marked with this
+ attribute, or reached through a call site marked with this attribute, and
+ that intrinsic is called, the behavior of the program is undefined. (Whole-program
+ undefined behavior is used here because, for example, the absence of a required workitem
+ ID in the preloaded register set can mean that all other preloaded registers
+ are earlier than the compilation assumed they would be.) The backend can
+ generally infer this during code generation, so typically there is no
+ benefit to frontends marking functions with this.
+
+ "amdgpu-no-workitem-id-y" The same as amdgpu-no-workitem-id-x, except for the
+ llvm.amdgcn.workitem.id.y intrinsic.
+
+ "amdgpu-no-workitem-id-z" The same as amdgpu-no-workitem-id-x, except for the
+ llvm.amdgcn.workitem.id.z intrinsic.
+
+ "amdgpu-no-workgroup-id-x" The same as amdgpu-no-workitem-id-x, except for the
+ llvm.amdgcn.workgroup.id.x intrinsic.
+
+ "amdgpu-no-workgroup-id-y" The same as amdgpu-no-workitem-id-x, except for the
+ llvm.amdgcn.workgroup.id.y intrinsic.
+
+ "amdgpu-no-workgroup-id-z" The same as amdgpu-no-workitem-id-x, except for the
+ llvm.amdgcn.workgroup.id.z intrinsic.
+
+ "amdgpu-no-dispatch-ptr" The same as amdgpu-no-workitem-id-x, except for the
+ llvm.amdgcn.dispatch.ptr intrinsic.
+
+ "amdgpu-no-implicitarg-ptr" The same as amdgpu-no-workitem-id-x, except for the
+ llvm.amdgcn.implicitarg.ptr intrinsic.
+
+ "amdgpu-no-dispatch-id" The same as amdgpu-no-workitem-id-x, except for the
+ llvm.amdgcn.dispatch.id intrinsic.
+
+ "amdgpu-no-queue-ptr" Similar to amdgpu-no-workitem-id-x, except for the
+ llvm.amdgcn.queue.ptr intrinsic. Note that unlike the other ABI hint
+ attributes, the queue pointer may be required in situations where the
+ intrinsic call does not directly appear in the program. Some subtargets
+ require the queue pointer for to handle some addrspacecasts, as well
+ as the llvm.amdgcn.is.shared, llvm.amdgcn.is.private, llvm.trap, and
+ llvm.debug intrinsics.
+
+ "amdgpu-no-hostcall-ptr" Similar to amdgpu-no-implicitarg-ptr, except specific to the implicit
+ kernel argument that holds the pointer to the hostcall buffer. If this
+ attribute is absent, then the amdgpu-no-implicitarg-ptr is also removed.
+
+ "amdgpu-no-heap-ptr" Similar to amdgpu-no-implicitarg-ptr, except specific to the implicit
+ kernel...
[truncated]
✅ With the latest revision this PR passed the undef deprecator.
static cl::opt<unsigned> PromoteAllocaToVectorMaxElements(
    "amdgpu-promote-alloca-to-vector-max-elements",
    cl::desc("Maximum vector size (in elements) to use when promoting alloca"),
    cl::init(16));
These should be turned into pass parameters instead of opts.
Elements seems like a strange way to express this. Ideally we would pack vectors of sub-32-bit elements into accesses of 32-bit vectors.
Do we expect end users to use these options?
Elements seems like a strange way to express this.
Element count is how the limit is currently defined in the code. I agree that expressing it in terms of 32-bit words (registers) would make more sense. I'll change to this model, but it does mean this patch will not preserve the existing limit, so promotion will change in some edge cases.
Do we expect end users to use these options?
Graphics front end will use these for shader tuning, which is why they are accessible via function attributes as well.
Okay. The reason I was asking is that, if we expected any use by end users, we would need to expose them as an option instead of a pass option. Based on your description, we don't expect end users to set them, and the compiler front end can tune them via function attributes.
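The arrangement discussed in this thread, a pass-level default that a front end can override per function through an attribute, can be sketched as below. This is only an illustration: `AttrMap` stands in for a function's string attributes, `limitFor` is a hypothetical helper, and the key `"example-max-regs"` is a placeholder rather than an attribute name confirmed by this PR.

```cpp
#include <cassert>
#include <map>
#include <string>

// Stand-in for a function's string attributes. Real LLVM code would query
// llvm::Function rather than a std::map; this keeps the sketch self-contained.
using AttrMap = std::map<std::string, std::string>;

// Hypothetical helper: return the value of attribute `key` when it is present
// and consists only of decimal digits, otherwise fall back to the pass default.
// Overflow of very long digit strings is not handled in this sketch.
unsigned limitFor(const AttrMap &attrs, const std::string &key,
                  unsigned passDefault) {
  auto it = attrs.find(key);
  if (it == attrs.end() || it->second.empty())
    return passDefault;
  unsigned value = 0;
  for (char c : it->second) {
    if (c < '0' || c > '9')
      return passDefault; // malformed attribute: ignore it
    value = value * 10 + static_cast<unsigned>(c - '0');
  }
  return value;
}
```

A front end tuning one shader would attach the attribute to that function only, and every other function would keep the pass default.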
* Add multi dimensional array support
* Make maximum vector size tunable
* Make ratio of VGPRs used for vector promotion tunable

- Move options into pass attributes
- Change vector size limit from max elements to max 32b registers
- Add tests for i16, float and ptr in multi-dimensional arrays
2afc3ac to 77d30c3
Note: I have changed the max vector size to be based on the number of 32b registers. With the current value of 16, this means some allocas which were previously promoted are no longer promoted.
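To see why the change of unit affects which allocas are promoted, compare a limit counted in elements with one counted in 32-bit registers. The sketch below assumes elements are packed tightly and the register count is rounded up; the helper names are hypothetical, and the patch may compute the size differently.

```cpp
#include <cassert>
#include <cstdint>

// Number of 32-bit registers needed to hold a vector, rounded up.
// Assumption: elements are tightly packed; the patch may round differently.
constexpr std::uint64_t regsFor(std::uint64_t numElements,
                                std::uint64_t elementBits) {
  return (numElements * elementBits + 31) / 32;
}

// Old-style check: the limit counts vector elements.
constexpr bool fitsElementLimit(std::uint64_t numElements,
                                std::uint64_t maxElements) {
  return numElements <= maxElements;
}

// New-style check: the limit counts 32-bit registers.
constexpr bool fitsRegisterLimit(std::uint64_t numElements,
                                 std::uint64_t elementBits,
                                 std::uint64_t maxRegs) {
  return regsFor(numElements, elementBits) <= maxRegs;
}
```

With both limits set to 16, a 16-element vector of 64-bit values satisfies the element-based check but needs 32 registers, so it fails the register-based one. A 16-element vector of 16-bit values needs only 8 registers under this model.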
Ping
LLVM Buildbot has detected a new failure on builder Full details are available at: https://lab.llvm.org/buildbot/#/builders/10/builds/1133 Here is the relevant piece of the build log for the reference
LLVM Buildbot has detected a new failure on builder Full details are available at: https://lab.llvm.org/buildbot/#/builders/73/builds/14431 Here is the relevant piece of the build log for the reference
Fix type error when GEP uses i64 offset introduced in #127973.
Fix type error when GEP uses i64 index introduced in llvm#127973.
Fix type error when GEP uses i64 index introduced in #127973.