WIP: [Offload] Add testing for Offload program and kernel related entry points #127803
Conversation
The offload unit tests will no longer work on the host, and kernel execution will no longer work on CUDA.
You can test this locally with the following command:

git-clang-format --diff 71fd5288d28169cc4a6ae0bcf6c19a8130368936 7afee0f7d484a2ee93353cf734dc1373b7829bff --extensions c,inc,hpp,h,cpp -- offload/unittests/OffloadAPI/device_code/bar.c offload/unittests/OffloadAPI/device_code/foo.c offload/unittests/OffloadAPI/enqueue/olEnqueueKernelLaunch.cpp offload/unittests/OffloadAPI/enqueue/olEnqueueMemcpy.cpp offload/unittests/OffloadAPI/kernel/olCreateKernel.cpp offload/unittests/OffloadAPI/kernel/olReleaseKernel.cpp offload/unittests/OffloadAPI/kernel/olRetainKernel.cpp offload/unittests/OffloadAPI/memory/olMemAlloc.cpp offload/unittests/OffloadAPI/memory/olMemFree.cpp offload/unittests/OffloadAPI/program/olCreateProgram.cpp offload/unittests/OffloadAPI/program/olReleaseProgram.cpp offload/unittests/OffloadAPI/program/olRetainProgram.cpp offload/unittests/OffloadAPI/queue/olCreateQueue.cpp offload/unittests/OffloadAPI/queue/olReleaseQueue.cpp offload/unittests/OffloadAPI/queue/olRetainQueue.cpp offload/unittests/OffloadAPI/queue/olWaitQueue.cpp offload/liboffload/include/generated/OffloadAPI.h offload/liboffload/include/generated/OffloadEntryPoints.inc offload/liboffload/include/generated/OffloadFuncs.inc offload/liboffload/include/generated/OffloadImplFuncDecls.inc offload/liboffload/include/generated/OffloadPrint.hpp offload/liboffload/src/OffloadImpl.cpp offload/plugins-nextgen/host/src/rtl.cpp offload/tools/offload-tblgen/APIGen.cpp offload/tools/offload-tblgen/EntryPointGen.cpp offload/tools/offload-tblgen/PrintGen.cpp offload/tools/offload-tblgen/RecordTypes.hpp offload/unittests/OffloadAPI/common/Environment.cpp offload/unittests/OffloadAPI/common/Environment.hpp offload/unittests/OffloadAPI/common/Fixtures.hpp offload/unittests/OffloadAPI/platform/olPlatformInfo.hpp

View the diff from clang-format here:

diff --git a/offload/unittests/OffloadAPI/device_code/bar.c b/offload/unittests/OffloadAPI/device_code/bar.c
index f415339bc81..786aa2f5d61 100644
--- a/offload/unittests/OffloadAPI/device_code/bar.c
+++ b/offload/unittests/OffloadAPI/device_code/bar.c
@@ -1,5 +1,5 @@
#include <gpuintrin.h>
__gpu_kernel void foo(int *out) {
- out[__gpu_thread_id(0)] = __gpu_thread_id(0) + 1;
+ out[__gpu_thread_id(0)] = __gpu_thread_id(0) + 1;
}
diff --git a/offload/unittests/OffloadAPI/device_code/foo.c b/offload/unittests/OffloadAPI/device_code/foo.c
index e9f091f36bc..5bc893961d4 100644
--- a/offload/unittests/OffloadAPI/device_code/foo.c
+++ b/offload/unittests/OffloadAPI/device_code/foo.c
@@ -1,5 +1,5 @@
#include <gpuintrin.h>
__gpu_kernel void foo(int *out) {
- out[__gpu_thread_id(0)] = __gpu_thread_id(0);
+ out[__gpu_thread_id(0)] = __gpu_thread_id(0);
}
diff --git a/offload/unittests/OffloadAPI/kernel/olCreateKernel.cpp b/offload/unittests/OffloadAPI/kernel/olCreateKernel.cpp
index 1aaa87c8ae9..ff7fd3bc077 100644
--- a/offload/unittests/OffloadAPI/kernel/olCreateKernel.cpp
+++ b/offload/unittests/OffloadAPI/kernel/olCreateKernel.cpp
@@ -13,8 +13,9 @@
using olCreateKernelTest = offloadProgramTest;
TEST_F(olCreateKernelTest, Success) {
-// std::shared_ptr<std::vector<char>> DeviceBin2;
-// ASSERT_TRUE(TestEnvironment::loadDeviceBinary("foo", Platform, DeviceBin2));
+ // std::shared_ptr<std::vector<char>> DeviceBin2;
+ // ASSERT_TRUE(TestEnvironment::loadDeviceBinary("foo", Platform,
+ // DeviceBin2));
ol_kernel_handle_t Kernel = nullptr;
ASSERT_SUCCESS(olCreateKernel(Program, "foo", &Kernel));
diff --git a/offload/unittests/OffloadAPI/kernel/olRetainKernel.cpp b/offload/unittests/OffloadAPI/kernel/olRetainKernel.cpp
index 8da12f8446d..20799072eb2 100644
--- a/offload/unittests/OffloadAPI/kernel/olRetainKernel.cpp
+++ b/offload/unittests/OffloadAPI/kernel/olRetainKernel.cpp
@@ -12,9 +12,7 @@
using olRetainKernelTest = offloadKernelTest;
-TEST_F(olRetainKernelTest, Success) {
- ASSERT_SUCCESS(olRetainKernel(Kernel));
-}
+TEST_F(olRetainKernelTest, Success) { ASSERT_SUCCESS(olRetainKernel(Kernel)); }
TEST_F(olRetainKernelTest, InvalidNullHandle) {
ASSERT_ERROR(OL_ERRC_INVALID_NULL_HANDLE, olRetainKernel(nullptr));
Some comments, mostly concerns about the proposed API.
@@ -1327,6 +1327,34 @@ class CUDAGlobalHandlerTy final : public GenericGlobalHandlerTy {
    DeviceGlobal.setPtr(reinterpret_cast<void *>(CUPtr));
    return Plugin::success();
  }

  Error getGlobalMetadataFromImage(GenericDeviceTy &Device,
This `fromImage` variant is supposed to be common, as it only looks through the ELF. You're looking for `getGlobalMetadataFromDevice`.
@@ -104,3 +104,15 @@ def : Function {
    Return<"OL_ERRC_INVALID_DEVICE">
  ];
}

def : Function {
  let name = "olGetHostDevice";
The way it works in HSA is that you iterate through all of the devices, and one of them has the special 'type' of host. So this should use the same interface as the GPU devices but have a different 'platform' as you call it.
I'm a little wary of having a device, discovered in the regular way, that a user can't actually enqueue work on (hopefully it will be usable that way, but as you've suggested in other comments the host plugin needs a bit of work).
But having the user check the device type to find the host device isn't too onerous, so I can make this change.
//===----------------------------------------------------------------------===//

def : Function {
  let name = "olEnqueueMemcpy";
Do we need to be specific about `enqueue`? The way things work in the plugins at least is that everything takes a queue pointer, and if it's null we do it synchronously. We could also just make `olMemcpy` and `olMemcpyAsync` if we want to omit the argument since we do only give out handles, not pointers.
Off the top of my head I can't think of a reason why we couldn't make the queue handles optional.
In UR we have optional handles, and don't hide the fact that they're pointers so they can be set to null.
But if we want to avoid that then I can make the change to `olMemcpy` and `olMemcpyAsync`.
I prefer `olMemcpy` if we're not making the distinction. Forcing the user to create a queue is fine since this is supposed to be lower level.
    "At least one device must be a non-host device"
  ];
  let params = [
    Param<"ol_queue_handle_t", "Queue", "handle of the queue", PARAM_IN>,
Can't decide if I like the stream at the beginning or the end, but whatever we do it should be a consistent convention.
The queue is the first param in every function except `olCreateQueue`, where it's the last because it's an output parameter. Generally I'd prefer to keep all output pointers as the final argument in every function, but that's just personal preference so could be changed.
offload/liboffload/API/Enqueue.td (Outdated)
}

def : Function {
  let name = "olEnqueueMemcpyHtoD";
This is redundant if we go with the API I outlined, it's just a memcpy between two 'devices' where the host is one of them.
Yeah, since there seemed to be no objections to the new memcpy API you proposed, I can remove these redundant functions.
offload/liboffload/API/Queue.td (Outdated)
}

def : Function {
  let name = "olFinishQueue";
Finish is a weird name; it should be more about waiting or synchronizing, since it's not like the queue gets deallocated once it's done, does it?
// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
//
//===----------------------------------------------------------------------===//
With `<gpuintrin.h>` this looks like:

#include <gpuintrin.h>
__gpu_kernel void kernel(uint32_t *out) {
  *out = __gpu_thread_id(0);
}
Another thing to keep in mind: we should be able to detect if the user is building with the `libc` support, meaning we can have unit tests that run entirely on the GPU, or do printing or assertions on the GPU if needed. Likely we'll want that when we start fixing up the Device runtime, but not necessarily relevant here.
# TODO: Build for host CPU
endmacro()
Refer to my `libc` code for a lot of similar handling: https://github.com/llvm/llvm-project/blob/main/libc/cmake/modules/prepare_libc_gpu_build.cmake#L20. I need to hack in `CMAKE_REQUIRED_FLAGS` to get the standard helpers to work; likely we can just push/pop those once we're out of this function, or do the `check_source_compiles` manually. I just found that checking whether `-march=` or `-mcpu` succeeded was the easiest way to check.
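The push/pop-and-probe approach described above can be sketched as a CMake fragment. This is a hedged sketch, not the in-tree code: the target triple, `-mcpu` value, and the `OFFLOAD_CAN_TARGET_AMDGPU` variable name are all placeholders chosen for illustration.

```cmake
# Sketch of the flag probe described above: treat "the compiler accepts
# these target flags" as "this toolchain can build GPU device code".
include(CheckCCompilerFlag)

# check_c_compiler_flag runs a test compile, so the target flags must be
# smuggled in through CMAKE_REQUIRED_FLAGS; save and restore it around
# the probe so later checks are unaffected.
set(_saved_required_flags ${CMAKE_REQUIRED_FLAGS})
set(CMAKE_REQUIRED_FLAGS "--target=amdgcn-amd-amdhsa -nogpulib")
check_c_compiler_flag("-mcpu=gfx90a" OFFLOAD_CAN_TARGET_AMDGPU)
set(CMAKE_REQUIRED_FLAGS ${_saved_required_flags})

if(OFFLOAD_CAN_TARGET_AMDGPU)
  message(STATUS "AMDGPU device code tests enabled")
endif()
```

Doing the equivalent `check_source_compiles` call by hand, as suggested, would avoid the save/restore dance at the cost of a little boilerplate.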
add_custom_command(OUTPUT ${BIN_PATH}
  COMMAND
    ${CMAKE_C_COMPILER} --target=nvptx64-nvidia-cuda -march=native
    --cuda-path=/usr/local/cuda
There is a dedicated CMake variable for this; it should be forwarded to the runtime build, but it should only be specified if present, otherwise let the autodetection do its job.
add_custom_command(OUTPUT ${BIN_PATH}
  COMMAND
    ${CMAKE_C_COMPILER} --target=amdgcn-amd-amdhsa -nogpulib
    ${SRC_PATH} -o ${BIN_PATH}
Needs `-mcpu` for AMDGPU.
@jhuber6 I've addressed some of your comments here on the original PR (#122106). After fixing an issue with the `__tgt_device_image` lifetime I've realised I no longer need the plugin changes, so I'd like to roll these two PRs into one, since having reviews split across both is confusing. I'll close this PR soon and move the changes across.
That really needs to be fixed regardless; registering an image with the runtime should do a copy of all the data.
Force-pushed from c43356c to 9e56ad3.
std::ifstream SourceFile;
SourceFile.open(SourcePath, std::ios::binary | std::ios::in | std::ios::ate);

if (!SourceFile.is_open()) {
`llvm::MemoryBuffer::getFileOrSTDIN` is much easier to use and is backed by mmap(), I believe.