-
Notifications
You must be signed in to change notification settings - Fork 14.3k
[libc] Update the AMDGPU implementation to use code object 5 #72580
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
@llvm/pr-subscribers-backend-amdgpu @llvm/pr-subscribers-libc Author: Joseph Huber (jhuber6) ChangesSummary: Fixes: #72517 Full diff: https://github.com/llvm/llvm-project/pull/72580.diff 2 Files Affected:
diff --git a/libc/cmake/modules/prepare_libc_gpu_build.cmake b/libc/cmake/modules/prepare_libc_gpu_build.cmake
index 0b6067f69775c45..377bc4eeae8fc9b 100644
--- a/libc/cmake/modules/prepare_libc_gpu_build.cmake
+++ b/libc/cmake/modules/prepare_libc_gpu_build.cmake
@@ -120,8 +120,5 @@ if(LIBC_GPU_TARGET_ARCHITECTURE_IS_AMDGPU)
# The AMDGPU environment uses different code objects to encode the ABI for
# kernel calls and intrinsic functions. We want to specify this manually to
# conform to whatever the test suite was built to handle.
- # FIXME: The test suite currently hangs when compiled targeting version five.
- # This occurrs during traversal of the callback array in the startup code. We
- # deliberately use version four until this can be addressed.
- set(LIBC_GPU_CODE_OBJECT_VERSION 4)
+ set(LIBC_GPU_CODE_OBJECT_VERSION 5)
endif()
diff --git a/libc/utils/gpu/loader/amdgpu/Loader.cpp b/libc/utils/gpu/loader/amdgpu/Loader.cpp
index b1b3aa6ce028ca5..b9a5a5aad002d29 100644
--- a/libc/utils/gpu/loader/amdgpu/Loader.cpp
+++ b/libc/utils/gpu/loader/amdgpu/Loader.cpp
@@ -34,6 +34,19 @@
#include <tuple>
#include <utility>
+// The implicit arguments of COV5 AMDGPU kernels.
+struct implicit_args_t {
+ uint32_t grid_size_x;
+ uint32_t grid_size_y;
+ uint32_t grid_size_z;
+ uint16_t workgroup_size_x;
+ uint16_t workgroup_size_y;
+ uint16_t workgroup_size_z;
+ uint8_t Unused0[46];
+ uint16_t gird_dims;
+ uint8_t Unused1[190];
+};
+
/// Print the error code and exit if \p code indicates an error.
static void handle_error(hsa_status_t code) {
if (code == HSA_STATUS_SUCCESS || code == HSA_STATUS_INFO_BREAK)
@@ -185,11 +198,13 @@ hsa_status_t launch_kernel(hsa_agent_t dev_agent, hsa_executable_t executable,
uint32_t args_size;
uint32_t group_size;
uint32_t private_size;
+ bool dynamic_stack;
std::pair<hsa_executable_symbol_info_t, void *> symbol_infos[] = {
{HSA_EXECUTABLE_SYMBOL_INFO_KERNEL_OBJECT, &kernel},
{HSA_EXECUTABLE_SYMBOL_INFO_KERNEL_KERNARG_SEGMENT_SIZE, &args_size},
{HSA_EXECUTABLE_SYMBOL_INFO_KERNEL_GROUP_SEGMENT_SIZE, &group_size},
+ {HSA_EXECUTABLE_SYMBOL_INFO_KERNEL_DYNAMIC_CALLSTACK, &dynamic_stack},
{HSA_EXECUTABLE_SYMBOL_INFO_KERNEL_PRIVATE_SEGMENT_SIZE, &private_size}};
for (auto &[info, value] : symbol_infos)
@@ -209,6 +224,19 @@ hsa_status_t launch_kernel(hsa_agent_t dev_agent, hsa_executable_t executable,
std::memset(args, 0, args_size);
std::memcpy(args, &kernel_args, sizeof(args_t));
+ // Initialize the necessarry implicit arguments to the proper values.
+ bool dims = 1 + (params.num_blocks_y * params.num_threads_y != 1) +
+ (params.num_blocks_z * params.num_threads_z != 1);
+ implicit_args_t *implicit_args = reinterpret_cast<implicit_args_t *>(
+ reinterpret_cast<uint8_t *>(args) + sizeof(args_t));
+ implicit_args->gird_dims = dims;
+ implicit_args->grid_size_x = params.num_threads_x;
+ implicit_args->grid_size_y = params.num_threads_y;
+ implicit_args->grid_size_z = params.num_threads_z;
+ implicit_args->workgroup_size_x = params.num_blocks_x;
+ implicit_args->workgroup_size_y = params.num_blocks_y;
+ implicit_args->workgroup_size_z = params.num_blocks_z;
+
// Obtain a packet from the queue.
uint64_t packet_id = hsa_queue_add_write_index_relaxed(queue, 1);
while (packet_id - hsa_queue_load_read_index_scacquire(queue) >= queue->size)
@@ -222,9 +250,7 @@ hsa_status_t launch_kernel(hsa_agent_t dev_agent, hsa_executable_t executable,
// Set up the packet for exeuction on the device. We currently only launch
// with one thread on the device, forcing the rest of the wavefront to be
// masked off.
- uint16_t setup = (1 + (params.num_blocks_y * params.num_threads_y != 1) +
- (params.num_blocks_z * params.num_threads_z != 1))
- << HSA_KERNEL_DISPATCH_PACKET_SETUP_DIMENSIONS;
+ uint16_t setup = (dims) << HSA_KERNEL_DISPATCH_PACKET_SETUP_DIMENSIONS;
packet->workgroup_size_x = params.num_threads_x;
packet->workgroup_size_y = params.num_threads_y;
packet->workgroup_size_z = params.num_threads_z;
@@ -232,7 +258,7 @@ hsa_status_t launch_kernel(hsa_agent_t dev_agent, hsa_executable_t executable,
packet->grid_size_x = params.num_blocks_x * params.num_threads_x;
packet->grid_size_y = params.num_blocks_y * params.num_threads_y;
packet->grid_size_z = params.num_blocks_z * params.num_threads_z;
- packet->private_segment_size = private_size;
+ packet->private_segment_size = !dynamic_stack ? private_size : 16834;
packet->group_segment_size = group_size;
packet->kernel_object = kernel;
packet->kernarg_address = args;
|
e529f9f
to
1a05220
Compare
packet->workgroup_size_x = params.num_threads_x; | ||
packet->workgroup_size_y = params.num_threads_y; | ||
packet->workgroup_size_z = params.num_threads_z; | ||
packet->reserved0 = 0; | ||
packet->grid_size_x = params.num_blocks_x * params.num_threads_x; | ||
packet->grid_size_y = params.num_blocks_y * params.num_threads_y; | ||
packet->grid_size_z = params.num_blocks_z * params.num_threads_z; | ||
packet->private_segment_size = private_size; | ||
packet->private_segment_size = !dynamic_stack ? private_size : 3 * 1024; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
packet->private_segment_size = !dynamic_stack ? private_size : 3 * 1024; | |
packet->private_segment_size = !dynamic_stack ? private_size : 16 * 1024; |
Same as #72606
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So, this is a sufficiently small amount to run all the libc
tests and mimics the size used for the NVIDIA tests. It's easier on resources when running a bunch of these in parallel to keep the stack usage small.
@@ -34,6 +34,19 @@ | |||
#include <tuple> | |||
#include <utility> | |||
|
|||
// The implicit arguments of COV5 AMDGPU kernels. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there a way to share this structure between here and openmp/libomptarget/plugins-nextgen/amdgpu/utils/UtilitiesRTL.h
?
Otherwise, someone will have to keep updating this whenever there are changes in the AMDGPU plugin (like #65325 )
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah that's a future endeavor. Probably something that could be solved by having the offloading directory for a lot of these HSA wrappers. For now it's easier to keep them separate.
Summary: This patch includes the necessary changes to make the `libc` tests running on AMD GPUs run using the newer code object version. The 'code object version' is AMD's internal ABI for making kernel calls. The move from 4 to 5 changed how we handle arguments for builtins such as obtaining the grid size or setting up the size of the private stack. Fixes: llvm#72517
1a05220
to
3e0bbbf
Compare
packet->workgroup_size_x = params.num_threads_x; | ||
packet->workgroup_size_y = params.num_threads_y; | ||
packet->workgroup_size_z = params.num_threads_z; | ||
packet->reserved0 = 0; | ||
packet->grid_size_x = params.num_blocks_x * params.num_threads_x; | ||
packet->grid_size_y = params.num_blocks_y * params.num_threads_y; | ||
packet->grid_size_z = params.num_blocks_z * params.num_threads_z; | ||
packet->private_segment_size = private_size; | ||
packet->private_segment_size = | ||
dynamic_stack ? 16 * 1024 /* 16 KB */ : private_size; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
maybe write `enum {stack_size_default = 16*1024} or similar instead of the comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should probably get a fixme to add some knob for this
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM!
…lvm#72580)" This reverts commit 8341a40.
Summary:
This patch includes the necessary changes to make the
libc
testsrunning on AMD GPUs run using the newer code object version. The 'code
object version' is AMD's internal ABI for making kernel calls. The move
from 4 to 5 changed how we handle arguments for builtins such as
obtaining the grid size or setting up the size of the private stack.
Fixes: #72517