[libc] Update the AMDGPU implementation to use code object 5 #72580

jhuber6 · 2023-11-16T22:25:54Z

Summary:
This patch includes the necessary changes to make the libc tests
running on AMD GPUs run using the newer code object version. The 'code
object version' is AMD's internal ABI for making kernel calls. The move
from 4 to 5 changed how we handle arguments for builtins such as
obtaining the grid size or setting up the size of the private stack.

Fixes: #72517

llvmbot · 2023-11-16T22:26:20Z

@llvm/pr-subscribers-backend-amdgpu

@llvm/pr-subscribers-libc

Author: Joseph Huber (jhuber6)

Changes

Summary:
This patch includes the necessary changes to make the libc tests
running on AMD GPUs run using the newer code object version. The 'code
object version' is AMD's internal ABI for making kernel calls. The move
from 4 to 5 changed how we handle arguments for builtins such as
obtaining the grid size or setting up the size of the private stack.

Fixes: #72517

Full diff: https://github.com/llvm/llvm-project/pull/72580.diff

2 Files Affected:

(modified) libc/cmake/modules/prepare_libc_gpu_build.cmake (+1-4)
(modified) libc/utils/gpu/loader/amdgpu/Loader.cpp (+30-4)

diff --git a/libc/cmake/modules/prepare_libc_gpu_build.cmake b/libc/cmake/modules/prepare_libc_gpu_build.cmake
index 0b6067f69775c45..377bc4eeae8fc9b 100644
--- a/libc/cmake/modules/prepare_libc_gpu_build.cmake
+++ b/libc/cmake/modules/prepare_libc_gpu_build.cmake
@@ -120,8 +120,5 @@ if(LIBC_GPU_TARGET_ARCHITECTURE_IS_AMDGPU)
   # The AMDGPU environment uses different code objects to encode the ABI for
   # kernel calls and intrinsic functions. We want to specify this manually to
   # conform to whatever the test suite was built to handle.
-  # FIXME: The test suite currently hangs when compiled targeting version five.
-  # This occurrs during traversal of the callback array in the startup code. We
-  # deliberately use version four until this can be addressed.
-  set(LIBC_GPU_CODE_OBJECT_VERSION 4)
+  set(LIBC_GPU_CODE_OBJECT_VERSION 5)
 endif()
diff --git a/libc/utils/gpu/loader/amdgpu/Loader.cpp b/libc/utils/gpu/loader/amdgpu/Loader.cpp
index b1b3aa6ce028ca5..b9a5a5aad002d29 100644
--- a/libc/utils/gpu/loader/amdgpu/Loader.cpp
+++ b/libc/utils/gpu/loader/amdgpu/Loader.cpp
@@ -34,6 +34,19 @@
 #include <tuple>
 #include <utility>
 
+// The implicit arguments of COV5 AMDGPU kernels.
+struct implicit_args_t {
+  uint32_t grid_size_x;
+  uint32_t grid_size_y;
+  uint32_t grid_size_z;
+  uint16_t workgroup_size_x;
+  uint16_t workgroup_size_y;
+  uint16_t workgroup_size_z;
+  uint8_t Unused0[46];
+  uint16_t gird_dims;
+  uint8_t Unused1[190];
+};
+
 /// Print the error code and exit if \p code indicates an error.
 static void handle_error(hsa_status_t code) {
   if (code == HSA_STATUS_SUCCESS || code == HSA_STATUS_INFO_BREAK)
@@ -185,11 +198,13 @@ hsa_status_t launch_kernel(hsa_agent_t dev_agent, hsa_executable_t executable,
   uint32_t args_size;
   uint32_t group_size;
   uint32_t private_size;
+  bool dynamic_stack;
 
   std::pair<hsa_executable_symbol_info_t, void *> symbol_infos[] = {
       {HSA_EXECUTABLE_SYMBOL_INFO_KERNEL_OBJECT, &kernel},
       {HSA_EXECUTABLE_SYMBOL_INFO_KERNEL_KERNARG_SEGMENT_SIZE, &args_size},
       {HSA_EXECUTABLE_SYMBOL_INFO_KERNEL_GROUP_SEGMENT_SIZE, &group_size},
+      {HSA_EXECUTABLE_SYMBOL_INFO_KERNEL_DYNAMIC_CALLSTACK, &dynamic_stack},
       {HSA_EXECUTABLE_SYMBOL_INFO_KERNEL_PRIVATE_SEGMENT_SIZE, &private_size}};
 
   for (auto &[info, value] : symbol_infos)
@@ -209,6 +224,19 @@ hsa_status_t launch_kernel(hsa_agent_t dev_agent, hsa_executable_t executable,
   std::memset(args, 0, args_size);
   std::memcpy(args, &kernel_args, sizeof(args_t));
 
+  // Initialize the necessarry implicit arguments to the proper values.
+  bool dims = 1 + (params.num_blocks_y * params.num_threads_y != 1) +
+              (params.num_blocks_z * params.num_threads_z != 1);
+  implicit_args_t *implicit_args = reinterpret_cast<implicit_args_t *>(
+      reinterpret_cast<uint8_t *>(args) + sizeof(args_t));
+  implicit_args->gird_dims = dims;
+  implicit_args->grid_size_x = params.num_threads_x;
+  implicit_args->grid_size_y = params.num_threads_y;
+  implicit_args->grid_size_z = params.num_threads_z;
+  implicit_args->workgroup_size_x = params.num_blocks_x;
+  implicit_args->workgroup_size_y = params.num_blocks_y;
+  implicit_args->workgroup_size_z = params.num_blocks_z;
+
   // Obtain a packet from the queue.
   uint64_t packet_id = hsa_queue_add_write_index_relaxed(queue, 1);
   while (packet_id - hsa_queue_load_read_index_scacquire(queue) >= queue->size)
@@ -222,9 +250,7 @@ hsa_status_t launch_kernel(hsa_agent_t dev_agent, hsa_executable_t executable,
   // Set up the packet for exeuction on the device. We currently only launch
   // with one thread on the device, forcing the rest of the wavefront to be
   // masked off.
-  uint16_t setup = (1 + (params.num_blocks_y * params.num_threads_y != 1) +
-                    (params.num_blocks_z * params.num_threads_z != 1))
-                   << HSA_KERNEL_DISPATCH_PACKET_SETUP_DIMENSIONS;
+  uint16_t setup = (dims) << HSA_KERNEL_DISPATCH_PACKET_SETUP_DIMENSIONS;
   packet->workgroup_size_x = params.num_threads_x;
   packet->workgroup_size_y = params.num_threads_y;
   packet->workgroup_size_z = params.num_threads_z;
@@ -232,7 +258,7 @@ hsa_status_t launch_kernel(hsa_agent_t dev_agent, hsa_executable_t executable,
   packet->grid_size_x = params.num_blocks_x * params.num_threads_x;
   packet->grid_size_y = params.num_blocks_y * params.num_threads_y;
   packet->grid_size_z = params.num_blocks_z * params.num_threads_z;
-  packet->private_segment_size = private_size;
+  packet->private_segment_size = !dynamic_stack ? private_size : 16834;
   packet->group_segment_size = group_size;
   packet->kernel_object = kernel;
   packet->kernarg_address = args;

libc/utils/gpu/loader/amdgpu/Loader.cpp

saiislam · 2023-11-20T17:42:23Z

libc/utils/gpu/loader/amdgpu/Loader.cpp

  packet->workgroup_size_x = params.num_threads_x;
  packet->workgroup_size_y = params.num_threads_y;
  packet->workgroup_size_z = params.num_threads_z;
  packet->reserved0 = 0;
  packet->grid_size_x = params.num_blocks_x * params.num_threads_x;
  packet->grid_size_y = params.num_blocks_y * params.num_threads_y;
  packet->grid_size_z = params.num_blocks_z * params.num_threads_z;
-  packet->private_segment_size = private_size;
+  packet->private_segment_size = !dynamic_stack ? private_size : 3 * 1024;


Suggested change

packet->private_segment_size = !dynamic_stack ? private_size : 3 * 1024;

packet->private_segment_size = !dynamic_stack ? private_size : 16 * 1024;

Same as #72606

So, this is a sufficiently small amount to run all the libc tests and mimics the size used for the NVIDIA tests. It's easier on resources when running a bunch of these in parallel to keep the stack usage small.

saiislam · 2023-11-20T17:45:00Z

libc/utils/gpu/loader/amdgpu/Loader.cpp

@@ -34,6 +34,19 @@
 #include <tuple>
 #include <utility>

+// The implicit arguments of COV5 AMDGPU kernels.


Is there a way to share this structure between here and openmp/libomptarget/plugins-nextgen/amdgpu/utils/UtilitiesRTL.h?

Otherwise, someone will have to keep updating this whenever there are changes in the AMDGPU plugin (like #65325 )

Yeah that's a future endeavor. Probably something that could be solved by having the offloading directory for a lot of these HSA wrappers. For now it's easier to keep them separate.

Summary: This patch includes the necessary changes to make the `libc` tests running on AMD GPUs run using the newer code object version. The 'code object version' is AMD's internal ABI for making kernel calls. The move from 4 to 5 changed how we handle arguments for builtins such as obtaining the grid size or setting up the size of the private stack. Fixes: llvm#72517

JonChesterfield · 2023-11-21T13:10:59Z

libc/utils/gpu/loader/amdgpu/Loader.cpp

  packet->workgroup_size_x = params.num_threads_x;
  packet->workgroup_size_y = params.num_threads_y;
  packet->workgroup_size_z = params.num_threads_z;
  packet->reserved0 = 0;
  packet->grid_size_x = params.num_blocks_x * params.num_threads_x;
  packet->grid_size_y = params.num_blocks_y * params.num_threads_y;
  packet->grid_size_z = params.num_blocks_z * params.num_threads_z;
-  packet->private_segment_size = private_size;
+  packet->private_segment_size =
+      dynamic_stack ? 16 * 1024 /* 16 KB */ : private_size;


maybe write `enum {stack_size_default = 16*1024} or similar instead of the comment

Should probably get a fixme to add some knob for this

saiislam

LGTM!

…lvm#72580)" This reverts commit 8341a40.

jhuber6 requested review from arsenm, bcahoon, JonChesterfield, saiislam and yxsamliu November 16, 2023 22:25

llvmbot added backend:AMDGPU libc labels Nov 16, 2023

arsenm reviewed Nov 17, 2023

View reviewed changes

libc/utils/gpu/loader/amdgpu/Loader.cpp Outdated Show resolved Hide resolved

libc/utils/gpu/loader/amdgpu/Loader.cpp Outdated Show resolved Hide resolved

libc/utils/gpu/loader/amdgpu/Loader.cpp Outdated Show resolved Hide resolved

jhuber6 force-pushed the MoveLibCToCOV5 branch from e529f9f to 1a05220 Compare November 17, 2023 02:25

saiislam reviewed Nov 20, 2023

View reviewed changes

jhuber6 force-pushed the MoveLibCToCOV5 branch from 1a05220 to 3e0bbbf Compare November 21, 2023 12:47

JonChesterfield reviewed Nov 21, 2023

View reviewed changes

JonChesterfield approved these changes Nov 21, 2023

View reviewed changes

saiislam approved these changes Nov 21, 2023

View reviewed changes

jhuber6 merged commit 8341a40 into llvm:main Nov 21, 2023

JonChesterfield added a commit to JonChesterfield/llvm-project that referenced this pull request Dec 9, 2023

Revert "[libc] Update the AMDGPU implementation to use code object 5 (l…

5498e69

…lvm#72580)" This reverts commit 8341a40.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

[libc] Update the AMDGPU implementation to use code object 5 #72580

[libc] Update the AMDGPU implementation to use code object 5 #72580

Uh oh!

jhuber6 commented Nov 16, 2023

Uh oh!

llvmbot commented Nov 16, 2023 •

edited

Loading

Uh oh!

Uh oh!

Uh oh!

Uh oh!

saiislam Nov 20, 2023

Uh oh!

jhuber6 Nov 20, 2023

Uh oh!

saiislam Nov 20, 2023

Uh oh!

jhuber6 Nov 20, 2023

Uh oh!

JonChesterfield Nov 21, 2023

Uh oh!

arsenm Nov 28, 2023

Uh oh!

saiislam left a comment

Uh oh!

Uh oh!

	packet->private_segment_size = !dynamic_stack ? private_size : 3 * 1024;
	packet->private_segment_size = !dynamic_stack ? private_size : 16 * 1024;

[libc] Update the AMDGPU implementation to use code object 5 #72580

[libc] Update the AMDGPU implementation to use code object 5 #72580

Uh oh!

Conversation

jhuber6 commented Nov 16, 2023

Uh oh!

llvmbot commented Nov 16, 2023 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

saiislam Nov 20, 2023

Choose a reason for hiding this comment

Uh oh!

jhuber6 Nov 20, 2023

Choose a reason for hiding this comment

Uh oh!

saiislam Nov 20, 2023

Choose a reason for hiding this comment

Uh oh!

jhuber6 Nov 20, 2023

Choose a reason for hiding this comment

Uh oh!

JonChesterfield Nov 21, 2023

Choose a reason for hiding this comment

Uh oh!

arsenm Nov 28, 2023

Choose a reason for hiding this comment

Uh oh!

saiislam left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

llvmbot commented Nov 16, 2023 •

edited

Loading