Skip to content

[libc] Update the AMDGPU implementation to use code object 5 #72580

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 1 commit into from
Nov 21, 2023

Conversation

jhuber6
Copy link
Contributor

@jhuber6 jhuber6 commented Nov 16, 2023

Summary:
This patch includes the necessary changes to make the libc tests
running on AMD GPUs run using the newer code object version. The 'code
object version' is AMD's internal ABI for making kernel calls. The move
from 4 to 5 changed how we handle arguments for builtins such as
obtaining the grid size or setting up the size of the private stack.

Fixes: #72517

@llvmbot
Copy link
Member

llvmbot commented Nov 16, 2023

@llvm/pr-subscribers-backend-amdgpu

@llvm/pr-subscribers-libc

Author: Joseph Huber (jhuber6)

Changes

Summary:
This patch includes the necessary changes to make the libc tests
running on AMD GPUs run using the newer code object version. The 'code
object version' is AMD's internal ABI for making kernel calls. The move
from 4 to 5 changed how we handle arguments for builtins such as
obtaining the grid size or setting up the size of the private stack.

Fixes: #72517


Full diff: https://github.com/llvm/llvm-project/pull/72580.diff

2 Files Affected:

  • (modified) libc/cmake/modules/prepare_libc_gpu_build.cmake (+1-4)
  • (modified) libc/utils/gpu/loader/amdgpu/Loader.cpp (+30-4)
diff --git a/libc/cmake/modules/prepare_libc_gpu_build.cmake b/libc/cmake/modules/prepare_libc_gpu_build.cmake
index 0b6067f69775c45..377bc4eeae8fc9b 100644
--- a/libc/cmake/modules/prepare_libc_gpu_build.cmake
+++ b/libc/cmake/modules/prepare_libc_gpu_build.cmake
@@ -120,8 +120,5 @@ if(LIBC_GPU_TARGET_ARCHITECTURE_IS_AMDGPU)
   # The AMDGPU environment uses different code objects to encode the ABI for
   # kernel calls and intrinsic functions. We want to specify this manually to
   # conform to whatever the test suite was built to handle.
-  # FIXME: The test suite currently hangs when compiled targeting version five.
-  # This occurrs during traversal of the callback array in the startup code. We
-  # deliberately use version four until this can be addressed.
-  set(LIBC_GPU_CODE_OBJECT_VERSION 4)
+  set(LIBC_GPU_CODE_OBJECT_VERSION 5)
 endif()
diff --git a/libc/utils/gpu/loader/amdgpu/Loader.cpp b/libc/utils/gpu/loader/amdgpu/Loader.cpp
index b1b3aa6ce028ca5..b9a5a5aad002d29 100644
--- a/libc/utils/gpu/loader/amdgpu/Loader.cpp
+++ b/libc/utils/gpu/loader/amdgpu/Loader.cpp
@@ -34,6 +34,19 @@
 #include <tuple>
 #include <utility>
 
+// The implicit arguments of COV5 AMDGPU kernels.
+struct implicit_args_t {
+  uint32_t grid_size_x;
+  uint32_t grid_size_y;
+  uint32_t grid_size_z;
+  uint16_t workgroup_size_x;
+  uint16_t workgroup_size_y;
+  uint16_t workgroup_size_z;
+  uint8_t Unused0[46];
+  uint16_t gird_dims;
+  uint8_t Unused1[190];
+};
+
 /// Print the error code and exit if \p code indicates an error.
 static void handle_error(hsa_status_t code) {
   if (code == HSA_STATUS_SUCCESS || code == HSA_STATUS_INFO_BREAK)
@@ -185,11 +198,13 @@ hsa_status_t launch_kernel(hsa_agent_t dev_agent, hsa_executable_t executable,
   uint32_t args_size;
   uint32_t group_size;
   uint32_t private_size;
+  bool dynamic_stack;
 
   std::pair<hsa_executable_symbol_info_t, void *> symbol_infos[] = {
       {HSA_EXECUTABLE_SYMBOL_INFO_KERNEL_OBJECT, &kernel},
       {HSA_EXECUTABLE_SYMBOL_INFO_KERNEL_KERNARG_SEGMENT_SIZE, &args_size},
       {HSA_EXECUTABLE_SYMBOL_INFO_KERNEL_GROUP_SEGMENT_SIZE, &group_size},
+      {HSA_EXECUTABLE_SYMBOL_INFO_KERNEL_DYNAMIC_CALLSTACK, &dynamic_stack},
       {HSA_EXECUTABLE_SYMBOL_INFO_KERNEL_PRIVATE_SEGMENT_SIZE, &private_size}};
 
   for (auto &[info, value] : symbol_infos)
@@ -209,6 +224,19 @@ hsa_status_t launch_kernel(hsa_agent_t dev_agent, hsa_executable_t executable,
   std::memset(args, 0, args_size);
   std::memcpy(args, &kernel_args, sizeof(args_t));
 
+  // Initialize the necessarry implicit arguments to the proper values.
+  bool dims = 1 + (params.num_blocks_y * params.num_threads_y != 1) +
+              (params.num_blocks_z * params.num_threads_z != 1);
+  implicit_args_t *implicit_args = reinterpret_cast<implicit_args_t *>(
+      reinterpret_cast<uint8_t *>(args) + sizeof(args_t));
+  implicit_args->gird_dims = dims;
+  implicit_args->grid_size_x = params.num_threads_x;
+  implicit_args->grid_size_y = params.num_threads_y;
+  implicit_args->grid_size_z = params.num_threads_z;
+  implicit_args->workgroup_size_x = params.num_blocks_x;
+  implicit_args->workgroup_size_y = params.num_blocks_y;
+  implicit_args->workgroup_size_z = params.num_blocks_z;
+
   // Obtain a packet from the queue.
   uint64_t packet_id = hsa_queue_add_write_index_relaxed(queue, 1);
   while (packet_id - hsa_queue_load_read_index_scacquire(queue) >= queue->size)
@@ -222,9 +250,7 @@ hsa_status_t launch_kernel(hsa_agent_t dev_agent, hsa_executable_t executable,
   // Set up the packet for exeuction on the device. We currently only launch
   // with one thread on the device, forcing the rest of the wavefront to be
   // masked off.
-  uint16_t setup = (1 + (params.num_blocks_y * params.num_threads_y != 1) +
-                    (params.num_blocks_z * params.num_threads_z != 1))
-                   << HSA_KERNEL_DISPATCH_PACKET_SETUP_DIMENSIONS;
+  uint16_t setup = (dims) << HSA_KERNEL_DISPATCH_PACKET_SETUP_DIMENSIONS;
   packet->workgroup_size_x = params.num_threads_x;
   packet->workgroup_size_y = params.num_threads_y;
   packet->workgroup_size_z = params.num_threads_z;
@@ -232,7 +258,7 @@ hsa_status_t launch_kernel(hsa_agent_t dev_agent, hsa_executable_t executable,
   packet->grid_size_x = params.num_blocks_x * params.num_threads_x;
   packet->grid_size_y = params.num_blocks_y * params.num_threads_y;
   packet->grid_size_z = params.num_blocks_z * params.num_threads_z;
-  packet->private_segment_size = private_size;
+  packet->private_segment_size = !dynamic_stack ? private_size : 16834;
   packet->group_segment_size = group_size;
   packet->kernel_object = kernel;
   packet->kernarg_address = args;

packet->workgroup_size_x = params.num_threads_x;
packet->workgroup_size_y = params.num_threads_y;
packet->workgroup_size_z = params.num_threads_z;
packet->reserved0 = 0;
packet->grid_size_x = params.num_blocks_x * params.num_threads_x;
packet->grid_size_y = params.num_blocks_y * params.num_threads_y;
packet->grid_size_z = params.num_blocks_z * params.num_threads_z;
packet->private_segment_size = private_size;
packet->private_segment_size = !dynamic_stack ? private_size : 3 * 1024;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
packet->private_segment_size = !dynamic_stack ? private_size : 3 * 1024;
packet->private_segment_size = !dynamic_stack ? private_size : 16 * 1024;

Same as #72606

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So, this is a sufficiently small amount to run all the libc tests and mimics the size used for the NVIDIA tests. It's easier on resources when running a bunch of these in parallel to keep the stack usage small.

@@ -34,6 +34,19 @@
#include <tuple>
#include <utility>

// The implicit arguments of COV5 AMDGPU kernels.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there a way to share this structure between here and openmp/libomptarget/plugins-nextgen/amdgpu/utils/UtilitiesRTL.h?

Otherwise, someone will have to keep updating this whenever there are changes in the AMDGPU plugin (like #65325 )

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah that's a future endeavor. Probably something that could be solved by having the offloading directory for a lot of these HSA wrappers. For now it's easier to keep them separate.

Summary:
This patch includes the necessary changes to make the `libc` tests
running on AMD GPUs run using the newer code object version. The 'code
object version' is AMD's internal ABI for making kernel calls. The move
from 4 to 5 changed how we handle arguments for builtins such as
obtaining the grid size or setting up the size of the private stack.

Fixes: llvm#72517
packet->workgroup_size_x = params.num_threads_x;
packet->workgroup_size_y = params.num_threads_y;
packet->workgroup_size_z = params.num_threads_z;
packet->reserved0 = 0;
packet->grid_size_x = params.num_blocks_x * params.num_threads_x;
packet->grid_size_y = params.num_blocks_y * params.num_threads_y;
packet->grid_size_z = params.num_blocks_z * params.num_threads_z;
packet->private_segment_size = private_size;
packet->private_segment_size =
dynamic_stack ? 16 * 1024 /* 16 KB */ : private_size;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe write `enum {stack_size_default = 16*1024} or similar instead of the comment

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should probably get a fixme to add some knob for this

Copy link
Contributor

@saiislam saiislam left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM!

@jhuber6 jhuber6 merged commit 8341a40 into llvm:main Nov 21, 2023
JonChesterfield added a commit to JonChesterfield/llvm-project that referenced this pull request Dec 9, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[AMDGPU] Kernel hangs when compiled with code-object version 5 due to insufficient stack
5 participants