WIP: [Offload] Add testing for Offload program and kernel related entry points #127803


Closed
wants to merge 22 commits

Conversation

callumfare (Contributor)

No description provided.


github-actions bot commented Feb 19, 2025

⚠️ C/C++ code formatter, clang-format found issues in your code. ⚠️

You can test this locally with the following command:
git-clang-format --diff 71fd5288d28169cc4a6ae0bcf6c19a8130368936 7afee0f7d484a2ee93353cf734dc1373b7829bff --extensions c,inc,hpp,h,cpp -- offload/unittests/OffloadAPI/device_code/bar.c offload/unittests/OffloadAPI/device_code/foo.c offload/unittests/OffloadAPI/enqueue/olEnqueueKernelLaunch.cpp offload/unittests/OffloadAPI/enqueue/olEnqueueMemcpy.cpp offload/unittests/OffloadAPI/kernel/olCreateKernel.cpp offload/unittests/OffloadAPI/kernel/olReleaseKernel.cpp offload/unittests/OffloadAPI/kernel/olRetainKernel.cpp offload/unittests/OffloadAPI/memory/olMemAlloc.cpp offload/unittests/OffloadAPI/memory/olMemFree.cpp offload/unittests/OffloadAPI/program/olCreateProgram.cpp offload/unittests/OffloadAPI/program/olReleaseProgram.cpp offload/unittests/OffloadAPI/program/olRetainProgram.cpp offload/unittests/OffloadAPI/queue/olCreateQueue.cpp offload/unittests/OffloadAPI/queue/olReleaseQueue.cpp offload/unittests/OffloadAPI/queue/olRetainQueue.cpp offload/unittests/OffloadAPI/queue/olWaitQueue.cpp offload/liboffload/include/generated/OffloadAPI.h offload/liboffload/include/generated/OffloadEntryPoints.inc offload/liboffload/include/generated/OffloadFuncs.inc offload/liboffload/include/generated/OffloadImplFuncDecls.inc offload/liboffload/include/generated/OffloadPrint.hpp offload/liboffload/src/OffloadImpl.cpp offload/plugins-nextgen/host/src/rtl.cpp offload/tools/offload-tblgen/APIGen.cpp offload/tools/offload-tblgen/EntryPointGen.cpp offload/tools/offload-tblgen/PrintGen.cpp offload/tools/offload-tblgen/RecordTypes.hpp offload/unittests/OffloadAPI/common/Environment.cpp offload/unittests/OffloadAPI/common/Environment.hpp offload/unittests/OffloadAPI/common/Fixtures.hpp offload/unittests/OffloadAPI/platform/olPlatformInfo.hpp
View the diff from clang-format here.
diff --git a/offload/unittests/OffloadAPI/device_code/bar.c b/offload/unittests/OffloadAPI/device_code/bar.c
index f415339bc81..786aa2f5d61 100644
--- a/offload/unittests/OffloadAPI/device_code/bar.c
+++ b/offload/unittests/OffloadAPI/device_code/bar.c
@@ -1,5 +1,5 @@
 #include <gpuintrin.h>
 
 __gpu_kernel void foo(int *out) {
-    out[__gpu_thread_id(0)] = __gpu_thread_id(0) + 1;
+  out[__gpu_thread_id(0)] = __gpu_thread_id(0) + 1;
 }
diff --git a/offload/unittests/OffloadAPI/device_code/foo.c b/offload/unittests/OffloadAPI/device_code/foo.c
index e9f091f36bc..5bc893961d4 100644
--- a/offload/unittests/OffloadAPI/device_code/foo.c
+++ b/offload/unittests/OffloadAPI/device_code/foo.c
@@ -1,5 +1,5 @@
 #include <gpuintrin.h>
 
 __gpu_kernel void foo(int *out) {
-    out[__gpu_thread_id(0)] = __gpu_thread_id(0);
+  out[__gpu_thread_id(0)] = __gpu_thread_id(0);
 }
diff --git a/offload/unittests/OffloadAPI/kernel/olCreateKernel.cpp b/offload/unittests/OffloadAPI/kernel/olCreateKernel.cpp
index 1aaa87c8ae9..ff7fd3bc077 100644
--- a/offload/unittests/OffloadAPI/kernel/olCreateKernel.cpp
+++ b/offload/unittests/OffloadAPI/kernel/olCreateKernel.cpp
@@ -13,8 +13,9 @@
 using olCreateKernelTest = offloadProgramTest;
 
 TEST_F(olCreateKernelTest, Success) {
-//   std::shared_ptr<std::vector<char>> DeviceBin2;
-//   ASSERT_TRUE(TestEnvironment::loadDeviceBinary("foo", Platform, DeviceBin2));
+  //   std::shared_ptr<std::vector<char>> DeviceBin2;
+  //   ASSERT_TRUE(TestEnvironment::loadDeviceBinary("foo", Platform,
+  //   DeviceBin2));
 
   ol_kernel_handle_t Kernel = nullptr;
   ASSERT_SUCCESS(olCreateKernel(Program, "foo", &Kernel));
diff --git a/offload/unittests/OffloadAPI/kernel/olRetainKernel.cpp b/offload/unittests/OffloadAPI/kernel/olRetainKernel.cpp
index 8da12f8446d..20799072eb2 100644
--- a/offload/unittests/OffloadAPI/kernel/olRetainKernel.cpp
+++ b/offload/unittests/OffloadAPI/kernel/olRetainKernel.cpp
@@ -12,9 +12,7 @@
 
 using olRetainKernelTest = offloadKernelTest;
 
-TEST_F(olRetainKernelTest, Success) {
-  ASSERT_SUCCESS(olRetainKernel(Kernel));
-}
+TEST_F(olRetainKernelTest, Success) { ASSERT_SUCCESS(olRetainKernel(Kernel)); }
 
 TEST_F(olRetainKernelTest, InvalidNullHandle) {
   ASSERT_ERROR(OL_ERRC_INVALID_NULL_HANDLE, olRetainKernel(nullptr));

@jdoerfert jdoerfert requested review from kevinsala and jhuber6 and removed request for kevinsala February 19, 2025 15:42
@jhuber6 (Contributor) left a comment:

Some comments, mostly concerns about the proposed API.

@@ -1327,6 +1327,34 @@ class CUDAGlobalHandlerTy final : public GenericGlobalHandlerTy {
DeviceGlobal.setPtr(reinterpret_cast<void *>(CUPtr));
return Plugin::success();
}

Error getGlobalMetadataFromImage(GenericDeviceTy &Device,
Contributor:

This fromImage variant is supposed to be common, as it only looks through the ELF. You're looking for getGlobalMetadataFromDevice.

@@ -104,3 +104,15 @@ def : Function {
Return<"OL_ERRC_INVALID_DEVICE">
];
}

def : Function {
let name = "olGetHostDevice";
Contributor:

The way it works in HSA is that you iterate through all of the devices, and one of them has the special 'type' of host. So this should use the same interface as the GPU devices but have a different 'platform' as you call it.

Contributor Author:

I'm a little wary of having a device discovered the regular way that a user can't actually enqueue work on (hopefully it will be usable that way, but as you've suggested in other comments the host plugin needs a bit of work).

But having the user check the device type to find the host device isn't too onerous so I can make this change.
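The HSA-style discovery described above could be sketched roughly as follows. Everything here is hypothetical: the ol_device_type_t enum, OL_DEVICE_TYPE_HOST, and the lookup helper are illustrative names for the pattern under discussion, not liboffload's actual API.

```cpp
#include <cassert>
#include <vector>

// Hypothetical device-type tag; liboffload's real interface may differ.
enum ol_device_type_t { OL_DEVICE_TYPE_HOST, OL_DEVICE_TYPE_GPU };

struct ol_device {
  ol_device_type_t Type;
};

// Sketch of the suggested pattern: iterate all discovered devices and pick
// out the one whose type is 'host', instead of a dedicated olGetHostDevice
// entry point.
const ol_device *findHostDevice(const std::vector<ol_device> &Devices) {
  for (const ol_device &D : Devices)
    if (D.Type == OL_DEVICE_TYPE_HOST)
      return &D;
  return nullptr; // no host device enumerated
}
```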

//===----------------------------------------------------------------------===//

def : Function {
let name = "olEnqueueMemcpy";
Contributor:

Do we need to be specific about enqueue? The way things work in the plugins at least is that everything takes a queue pointer, and if it's null we do it synchronously. We could also just make olMemcpy and olMemcpyAsync if we want to omit the argument since we do only give out handles, not pointers.

Contributor Author:

Off the top of my head I can't think of a reason why we couldn't make the queue handles optional.

In UR we have optional handles, and we don't hide the fact that they're pointers, so they can be set to null.

But if we want to avoid that, then I can make the change to olMemcpy and olMemcpyAsync.

Contributor:

I prefer olMemcpy if we're not making the distinction. Forcing the user to create a queue is fine since this is supposed to be lower level.
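The olMemcpy shape discussed above could look roughly like the sketch below. Everything here is hypothetical: the handle and error types, the signature, and the host-only stub body are illustrative, modeled on this thread, not liboffload's actual API. The real implementation would dispatch the copy through the plugin for the queue's device.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>

// Hypothetical handle and error types, for illustration only.
struct ol_queue_impl {};
using ol_queue_handle_t = ol_queue_impl *;
enum ol_errc_t { OL_SUCCESS, OL_ERRC_INVALID_NULL_HANDLE };

// Sketch of the proposed shape: no Enqueue prefix, and a required queue
// handle (per the comment above, forcing the user to create a queue is
// acceptable at this level). This stub only models a host-to-host copy.
ol_errc_t olMemcpy(ol_queue_handle_t Queue, void *Dst, const void *Src,
                   size_t Size) {
  if (!Queue)
    return OL_ERRC_INVALID_NULL_HANDLE;
  std::memcpy(Dst, Src, Size); // real impl would go through the plugin
  return OL_SUCCESS;
}
```

An olMemcpyAsync variant would take the same parameters but return before the copy completes, signaling completion through the queue.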

"At least one device must be a non-host device"
];
let params = [
Param<"ol_queue_handle_t", "Queue", "handle of the queue", PARAM_IN>,
Contributor:

Can't decide if I like the stream at the beginning or the end, but whatever we do it should be a consistent convention.

Contributor Author:

The queue is the first param in every function except olCreateQueue, where it's the last because it's an output parameter. Generally I'd prefer to keep all output pointers as the final argument in every function, but that's just personal preference, so it could be changed.

}

def : Function {
let name = "olEnqueueMemcpyHtoD";
Contributor:

This is redundant if we go with the API I outlined, it's just a memcpy between two 'devices' where the host is one of them.

Contributor Author:

Yeah, since there seemed to be no objections to the new memcpy API you proposed, I can remove these redundant functions.

}

def : Function {
let name = "olFinishQueue";
Contributor:

Finish is a weird name, it should be more about waiting or synchronizing since it's not like the queue gets deallocated once it's done, does it?

// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
//
//===----------------------------------------------------------------------===//

Contributor:

With <gpuintrin.h> this looks like

#include <gpuintrin.h>

__gpu_kernel void kernel(uint32_t *out) {
  *out = __gpu_thread_id(0);
}

Contributor:

Another thing to keep in mind: we should be able to detect whether the user is building with libc support, meaning we can have unit tests that run entirely on the GPU, or do printing or assertions on the GPU if needed. We'll likely want that when we start fixing up the device runtime, but it's not necessarily relevant here.

# TODO: Build for host CPU
endmacro()


Contributor:

Refer to my libc code for a lot of similar handling: https://github.com/llvm/llvm-project/blob/main/libc/cmake/modules/prepare_libc_gpu_build.cmake#L20. I need to hack in CMAKE_REQUIRED_FLAGS to get the standard helpers to work; we can likely just push/pop those once we're out of this function, or do the check_source_compiles manually. I found that checking whether -march= or -mcpu= succeeded was the easiest way to check.

add_custom_command(OUTPUT ${BIN_PATH}
COMMAND
${CMAKE_C_COMPILER} --target=nvptx64-nvidia-cuda -march=native
--cuda-path=/usr/local/cuda
Contributor:

There is a dedicated CMake variable for this, it should be forwarded to the runtime build, but this should only be specified if present, otherwise let the autodetection do its job.

add_custom_command(OUTPUT ${BIN_PATH}
COMMAND
${CMAKE_C_COMPILER} --target=amdgcn-amd-amdhsa -nogpulib
${SRC_PATH} -o ${BIN_PATH}
Contributor:

Needs -mcpu for AMDGPU.

@callumfare (Contributor Author):

@jhuber6 I've addressed some of your comments on here on the original PR (#122106). After fixing an issue with the __tgt_device_image lifetime I've realised I no longer need the plugin changes, so I'd like to roll these two PRs into one, since having reviews split across both is confusing. I'll close this PR soon and move the changes across.

@jhuber6 (Contributor) commented Feb 21, 2025:

> @jhuber6 I've addressed some of your comments on here on the original PR (#122106). After fixing an issue with the __tgt_device_image lifetime I've realised I no longer need the plugin changes, so I'd like to roll these two PRs into one, since having reviews split across both is confusing. I'll close this PR soon and move the changes across.

That really needs to be fixed regardless, registering an image with the runtime should do a copy of all the data.

@callumfare callumfare force-pushed the offload_new_api_plugin_changes branch from c43356c to 9e56ad3 on February 21, 2025 17:42
std::ifstream SourceFile;
SourceFile.open(SourcePath, std::ios::binary | std::ios::in | std::ios::ate);

if (!SourceFile.is_open()) {
Contributor:

llvm::MemoryBuffer::getFileOrSTDIN is much easier to use, and I believe it's backed by mmap().
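The manual pattern being replaced looks roughly like the sketch below; the reviewer's point is that llvm::MemoryBuffer::getFileOrSTDIN collapses the whole open/size/seek/read dance into one call (and can use mmap under the hood). The helper names here are illustrative, not the PR's actual code.

```cpp
#include <cassert>
#include <cstdio>
#include <fstream>
#include <string>
#include <vector>

// Manual binary load in the style of the snippet above: open at the end to
// learn the size, seek back to the start, then read the whole file.
// llvm::MemoryBuffer::getFileOrSTDIN would replace all of this.
bool loadBinary(const std::string &Path, std::vector<char> &Out) {
  std::ifstream File(Path, std::ios::binary | std::ios::ate);
  if (!File.is_open())
    return false;
  std::streamsize Size = File.tellg();
  File.seekg(0, std::ios::beg);
  Out.resize(static_cast<size_t>(Size));
  return static_cast<bool>(File.read(Out.data(), Size));
}

// Round-trip helper so the sketch is self-checking: write a few bytes to a
// temp file, load them back through loadBinary, and compare.
bool roundTrip() {
  const std::string Path = "ol_test_bin.tmp";
  const std::vector<char> Bytes = {'\x7f', 'E', 'L', 'F'};
  std::ofstream(Path, std::ios::binary).write(Bytes.data(), Bytes.size());
  std::vector<char> Loaded;
  bool Ok = loadBinary(Path, Loaded) && Loaded == Bytes;
  std::remove(Path.c_str()); // clean up the temp file
  return Ok;
}
```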

@callumfare callumfare closed this Mar 4, 2025