WIP: [Offload] Add testing for Offload program and kernel related entry points #127803


Closed
wants to merge 22 commits

Conversation

callumfare (Contributor)

No description provided.


github-actions bot commented Feb 19, 2025

⚠️ C/C++ code formatter, clang-format found issues in your code. ⚠️

You can test this locally with the following command:
git-clang-format --diff 71fd5288d28169cc4a6ae0bcf6c19a8130368936 7afee0f7d484a2ee93353cf734dc1373b7829bff --extensions c,inc,hpp,h,cpp -- offload/unittests/OffloadAPI/device_code/bar.c offload/unittests/OffloadAPI/device_code/foo.c offload/unittests/OffloadAPI/enqueue/olEnqueueKernelLaunch.cpp offload/unittests/OffloadAPI/enqueue/olEnqueueMemcpy.cpp offload/unittests/OffloadAPI/kernel/olCreateKernel.cpp offload/unittests/OffloadAPI/kernel/olReleaseKernel.cpp offload/unittests/OffloadAPI/kernel/olRetainKernel.cpp offload/unittests/OffloadAPI/memory/olMemAlloc.cpp offload/unittests/OffloadAPI/memory/olMemFree.cpp offload/unittests/OffloadAPI/program/olCreateProgram.cpp offload/unittests/OffloadAPI/program/olReleaseProgram.cpp offload/unittests/OffloadAPI/program/olRetainProgram.cpp offload/unittests/OffloadAPI/queue/olCreateQueue.cpp offload/unittests/OffloadAPI/queue/olReleaseQueue.cpp offload/unittests/OffloadAPI/queue/olRetainQueue.cpp offload/unittests/OffloadAPI/queue/olWaitQueue.cpp offload/liboffload/include/generated/OffloadAPI.h offload/liboffload/include/generated/OffloadEntryPoints.inc offload/liboffload/include/generated/OffloadFuncs.inc offload/liboffload/include/generated/OffloadImplFuncDecls.inc offload/liboffload/include/generated/OffloadPrint.hpp offload/liboffload/src/OffloadImpl.cpp offload/plugins-nextgen/host/src/rtl.cpp offload/tools/offload-tblgen/APIGen.cpp offload/tools/offload-tblgen/EntryPointGen.cpp offload/tools/offload-tblgen/PrintGen.cpp offload/tools/offload-tblgen/RecordTypes.hpp offload/unittests/OffloadAPI/common/Environment.cpp offload/unittests/OffloadAPI/common/Environment.hpp offload/unittests/OffloadAPI/common/Fixtures.hpp offload/unittests/OffloadAPI/platform/olPlatformInfo.hpp
View the diff from clang-format here.
diff --git a/offload/unittests/OffloadAPI/device_code/bar.c b/offload/unittests/OffloadAPI/device_code/bar.c
index f415339bc81..786aa2f5d61 100644
--- a/offload/unittests/OffloadAPI/device_code/bar.c
+++ b/offload/unittests/OffloadAPI/device_code/bar.c
@@ -1,5 +1,5 @@
 #include <gpuintrin.h>
 
 __gpu_kernel void foo(int *out) {
-    out[__gpu_thread_id(0)] = __gpu_thread_id(0) + 1;
+  out[__gpu_thread_id(0)] = __gpu_thread_id(0) + 1;
 }
diff --git a/offload/unittests/OffloadAPI/device_code/foo.c b/offload/unittests/OffloadAPI/device_code/foo.c
index e9f091f36bc..5bc893961d4 100644
--- a/offload/unittests/OffloadAPI/device_code/foo.c
+++ b/offload/unittests/OffloadAPI/device_code/foo.c
@@ -1,5 +1,5 @@
 #include <gpuintrin.h>
 
 __gpu_kernel void foo(int *out) {
-    out[__gpu_thread_id(0)] = __gpu_thread_id(0);
+  out[__gpu_thread_id(0)] = __gpu_thread_id(0);
 }
diff --git a/offload/unittests/OffloadAPI/kernel/olCreateKernel.cpp b/offload/unittests/OffloadAPI/kernel/olCreateKernel.cpp
index 1aaa87c8ae9..ff7fd3bc077 100644
--- a/offload/unittests/OffloadAPI/kernel/olCreateKernel.cpp
+++ b/offload/unittests/OffloadAPI/kernel/olCreateKernel.cpp
@@ -13,8 +13,9 @@
 using olCreateKernelTest = offloadProgramTest;
 
 TEST_F(olCreateKernelTest, Success) {
-//   std::shared_ptr<std::vector<char>> DeviceBin2;
-//   ASSERT_TRUE(TestEnvironment::loadDeviceBinary("foo", Platform, DeviceBin2));
+  //   std::shared_ptr<std::vector<char>> DeviceBin2;
+  //   ASSERT_TRUE(TestEnvironment::loadDeviceBinary("foo", Platform,
+  //   DeviceBin2));
 
   ol_kernel_handle_t Kernel = nullptr;
   ASSERT_SUCCESS(olCreateKernel(Program, "foo", &Kernel));
diff --git a/offload/unittests/OffloadAPI/kernel/olRetainKernel.cpp b/offload/unittests/OffloadAPI/kernel/olRetainKernel.cpp
index 8da12f8446d..20799072eb2 100644
--- a/offload/unittests/OffloadAPI/kernel/olRetainKernel.cpp
+++ b/offload/unittests/OffloadAPI/kernel/olRetainKernel.cpp
@@ -12,9 +12,7 @@
 
 using olRetainKernelTest = offloadKernelTest;
 
-TEST_F(olRetainKernelTest, Success) {
-  ASSERT_SUCCESS(olRetainKernel(Kernel));
-}
+TEST_F(olRetainKernelTest, Success) { ASSERT_SUCCESS(olRetainKernel(Kernel)); }
 
 TEST_F(olRetainKernelTest, InvalidNullHandle) {
   ASSERT_ERROR(OL_ERRC_INVALID_NULL_HANDLE, olRetainKernel(nullptr));

@jdoerfert jdoerfert requested review from kevinsala and jhuber6 and removed request for kevinsala February 19, 2025 15:42
@jhuber6 (Contributor) left a comment:

Some comments, mostly concerns about the proposed API.

@@ -1327,6 +1327,34 @@ class CUDAGlobalHandlerTy final : public GenericGlobalHandlerTy {
DeviceGlobal.setPtr(reinterpret_cast<void *>(CUPtr));
return Plugin::success();
}

Error getGlobalMetadataFromImage(GenericDeviceTy &Device,
Contributor:

This fromImage variant is supposed to be common, as it only looks through the ELF. You're looking for getGlobalMetadataFromDevice.

@@ -104,3 +104,15 @@ def : Function {
Return<"OL_ERRC_INVALID_DEVICE">
];
}

def : Function {
let name = "olGetHostDevice";
Contributor:

The way it works in HSA is that you iterate through all of the devices, and one of them has the special 'type' of host. So this should use the same interface as the GPU devices but have a different 'platform' as you call it.

Contributor Author:

I'm a little wary of having a device discovered the regular way that a user can't actually enqueue work on (hopefully it will be usable that way, but as you've suggested in other comments the host plugin needs a bit of work).

But having the user check the device type to find the host device isn't too onerous so I can make this change.
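The HSA-style discovery described above could be sketched roughly as follows. Everything here is hypothetical: the ol_device_type_t enum, OL_DEVICE_TYPE_HOST, and the lookup helper are illustrative names for the pattern under discussion, not liboffload's actual API.

```cpp
#include <cassert>
#include <vector>

// Hypothetical device-type tag; liboffload's real interface may differ.
enum ol_device_type_t { OL_DEVICE_TYPE_HOST, OL_DEVICE_TYPE_GPU };

struct ol_device {
  ol_device_type_t Type;
};

// Sketch of the suggested pattern: iterate all discovered devices and pick
// out the one whose type is 'host', instead of a dedicated olGetHostDevice
// entry point.
const ol_device *findHostDevice(const std::vector<ol_device> &Devices) {
  for (const ol_device &D : Devices)
    if (D.Type == OL_DEVICE_TYPE_HOST)
      return &D;
  return nullptr; // no host device enumerated
}
```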

//===----------------------------------------------------------------------===//

def : Function {
let name = "olEnqueueMemcpy";
Contributor:

Do we need to be specific about enqueue? The way things work in the plugins at least is that everything takes a queue pointer, and if it's null we do it synchronously. We could also just make olMemcpy and olMemcpyAsync if we want to omit the argument since we do only give out handles, not pointers.

Contributor Author:

Off the top of my head I can't think of a reason why we couldn't make the queue handles optional.

In UR we have optional handles, and we don't hide the fact that they're pointers, so they can be set to null.

But if we want to avoid that, then I can make the change to olMemcpy and olMemcpyAsync.

Contributor:

I prefer olMemcpy if we're not making the distinction. Forcing the user to create a queue is fine since this is supposed to be lower level.
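The olMemcpy shape discussed above could look roughly like the sketch below. Everything here is hypothetical: the handle and error types, the signature, and the host-only stub body are illustrative, modeled on this thread, not liboffload's actual API. The real implementation would dispatch the copy through the plugin for the queue's device.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>

// Hypothetical handle and error types, for illustration only.
struct ol_queue_impl {};
using ol_queue_handle_t = ol_queue_impl *;
enum ol_errc_t { OL_SUCCESS, OL_ERRC_INVALID_NULL_HANDLE };

// Sketch of the proposed shape: no Enqueue prefix, and a required queue
// handle (per the comment above, forcing the user to create a queue is
// acceptable at this level). This stub only models a host-to-host copy.
ol_errc_t olMemcpy(ol_queue_handle_t Queue, void *Dst, const void *Src,
                   size_t Size) {
  if (!Queue)
    return OL_ERRC_INVALID_NULL_HANDLE;
  std::memcpy(Dst, Src, Size); // real impl would go through the plugin
  return OL_SUCCESS;
}
```

An olMemcpyAsync variant would take the same parameters but return before the copy completes, signaling completion through the queue.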

"At least one device must be a non-host device"
];
let params = [
Param<"ol_queue_handle_t", "Queue", "handle of the queue", PARAM_IN>,
Contributor:

Can't decide if I like the stream at the beginning or the end, but whatever we do it should be a consistent convention.

Contributor Author:

The queue is the first param in every function except olCreateQueue, where it's the last because it's an output parameter. Generally I'd prefer to keep all output pointers as the final argument in every function, but that's just personal preference, so it could be changed.

}

def : Function {
let name = "olEnqueueMemcpyHtoD";
Contributor:

This is redundant if we go with the API I outlined, it's just a memcpy between two 'devices' where the host is one of them.

Contributor Author:

Yeah, since there seemed to be no objections to the new memcpy API you proposed, I can remove these redundant functions.

}

def : Function {
let name = "olFinishQueue";
Contributor:

Finish is a weird name, it should be more about waiting or synchronizing since it's not like the queue gets deallocated once it's done, does it?

// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
//
//===----------------------------------------------------------------------===//

Contributor:

With <gpuintrin.h> this looks like

#include <gpuintrin.h>

__gpu_kernel void kernel(uint32_t *out) {
  *out = __gpu_thread_id(0);
}

Contributor:

Another thing to keep in mind: we should be able to detect whether the user is building with libc support, meaning we can have unit tests that run entirely on the GPU, or do printing or assertions on the GPU if needed. We'll likely want that when we start fixing up the device runtime, but it's not necessarily relevant here.

# TODO: Build for host CPU
endmacro()


Contributor:

Refer to my libc code for a lot of similar handling: https://github.com/llvm/llvm-project/blob/main/libc/cmake/modules/prepare_libc_gpu_build.cmake#L20. I need to hack in CMAKE_REQUIRED_FLAGS to get the standard helpers to work; we can likely just push/pop those once we're out of this function, or do the check_source_compiles manually. I found that checking whether -march= or -mcpu= succeeded was the easiest way to check.

add_custom_command(OUTPUT ${BIN_PATH}
COMMAND
${CMAKE_C_COMPILER} --target=nvptx64-nvidia-cuda -march=native
--cuda-path=/usr/local/cuda
Contributor:

There is a dedicated CMake variable for this, it should be forwarded to the runtime build, but this should only be specified if present, otherwise let the autodetection do its job.

add_custom_command(OUTPUT ${BIN_PATH}
COMMAND
${CMAKE_C_COMPILER} --target=amdgcn-amd-amdhsa -nogpulib
${SRC_PATH} -o ${BIN_PATH}
Contributor:

Needs -mcpu for AMDGPU.

@callumfare (Contributor Author):

@jhuber6 I've addressed some of your comments on here on the original PR (#122106). After fixing an issue with the __tgt_device_image lifetime I've realised I no longer need the plugin changes, so I'd like to roll these two PRs into one, since having reviews split across both is confusing. I'll close this PR soon and move the changes across.

@jhuber6 (Contributor) commented Feb 21, 2025:

> @jhuber6 I've addressed some of your comments on here on the original PR (#122106). After fixing an issue with the __tgt_device_image lifetime I've realised I no longer need the plugin changes, so I'd like to roll these two PRs into one, since having reviews split across both is confusing. I'll close this PR soon and move the changes across.

That really needs to be fixed regardless, registering an image with the runtime should do a copy of all the data.

@callumfare callumfare force-pushed the offload_new_api_plugin_changes branch from c43356c to 9e56ad3 on February 21, 2025 17:42
std::ifstream SourceFile;
SourceFile.open(SourcePath, std::ios::binary | std::ios::in | std::ios::ate);

if (!SourceFile.is_open()) {
Contributor:

llvm::MemoryBuffer::getFileOrSTDIN is much easier to use, and I believe it's backed by mmap().
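The manual pattern being replaced looks roughly like the sketch below; the reviewer's point is that llvm::MemoryBuffer::getFileOrSTDIN collapses the whole open/size/seek/read dance into one call (and can use mmap under the hood). The helper names here are illustrative, not the PR's actual code.

```cpp
#include <cassert>
#include <cstdio>
#include <fstream>
#include <string>
#include <vector>

// Manual binary load in the style of the snippet above: open at the end to
// learn the size, seek back to the start, then read the whole file.
// llvm::MemoryBuffer::getFileOrSTDIN would replace all of this.
bool loadBinary(const std::string &Path, std::vector<char> &Out) {
  std::ifstream File(Path, std::ios::binary | std::ios::ate);
  if (!File.is_open())
    return false;
  std::streamsize Size = File.tellg();
  File.seekg(0, std::ios::beg);
  Out.resize(static_cast<size_t>(Size));
  return static_cast<bool>(File.read(Out.data(), Size));
}

// Round-trip helper so the sketch is self-checking: write a few bytes to a
// temp file, load them back through loadBinary, and compare.
bool roundTrip() {
  const std::string Path = "ol_test_bin.tmp";
  const std::vector<char> Bytes = {'\x7f', 'E', 'L', 'F'};
  std::ofstream(Path, std::ios::binary).write(Bytes.data(), Bytes.size());
  std::vector<char> Loaded;
  bool Ok = loadBinary(Path, Loaded) && Loaded == Bytes;
  std::remove(Path.c_str()); // clean up the temp file
  return Ok;
}
```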

@callumfare callumfare closed this Mar 4, 2025