[OpenMP] Allow GPUs to be targeted directly via -fopenmp. #122149

Closed
jhuber6 wants to merge 3 commits

Conversation

@jhuber6 (Contributor) commented Jan 8, 2025

Summary:
Currently we prevent the following from working. However, it is
completely reasonable to be able to target the files individually.

$ clang --target=amdgcn-amd-amdhsa -fopenmp

This patch lifts this restriction, allowing individual files to be
compiled as standalone OpenMP without the extra offloading overhead. The
main motivation behind this is to update the build of the OpenMP
DeviceRTL. Currently, we do --offload-device-only -S -emit-llvm which
is just a hackier version of -fopenmp -flto -c.

This patch allows the following to work.

$ clang omp.c -fopenmp --target=amdgcn-amd-amdhsa -flto -c
$ clang offload.c -fopenmp -fopenmp-targets=amdgcn-amd-amdhsa -Xoffload-linker omp.o
$ ./a.out
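
As a rough sketch of the intended usage (the file names match the commands above, but the contents are hypothetical examples rather than files from this patch):

```c
// omp.c (hypothetical contents) -- compiled directly for the GPU triple; the
// parallel region is lowered against the GPU OpenMP runtime (DeviceRTL).
void work(int *p) {
#pragma omp parallel
  { p[0] = 1; }
}
```

```c
// offload.c (hypothetical contents) -- a regular offloading TU; omp.o is
// resolved at device link time via -Xoffload-linker omp.o. A host fallback
// definition of work() is omitted in this sketch.
void work(int *p);

int main(void) {
  int x = 0;
#pragma omp target map(tofrom : x)
  work(&x);
  return x != 1;
}
```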

@llvmbot added the clang, clang:codegen, and clang:openmp labels on Jan 8, 2025
@llvmbot (Member) commented Jan 8, 2025

@llvm/pr-subscribers-clang-codegen

@llvm/pr-subscribers-clang

Author: Joseph Huber (jhuber6)

Changes

Summary:
Currently we prevent the following from working. However, it is
completely reasonable to be able to target the files individually.

$ clang --target=amdgcn-amd-amdhsa -fopenmp

This patch lifts this restriction, allowing individual files to be
compiled as standalone OpenMP without the extra offloading overhead. The
main motivation behind this is to update the build of the OpenMP
DeviceRTL. Currently, we do --offload-device-only -S -emit-llvm which
is just a hackier version of -fopenmp -flto -c.

This patch allows the following to work.

$ clang omp.c -fopenmp --target=amdgcn-amd-amdhsa -flto -c
$ clang offload.c -fopenmp -fopenmp-targets=amdgcn-amd-amdhsa -Xoffload-linker omp.o
$ ./a.out

Full diff: https://github.com/llvm/llvm-project/pull/122149.diff

6 Files Affected:

  • (modified) clang/lib/CodeGen/CGOpenMPRuntimeGPU.cpp (-3)
  • (modified) clang/lib/CodeGen/CGStmtOpenMP.cpp (+2-1)
  • (modified) clang/lib/CodeGen/CodeGenModule.cpp (-2)
  • (modified) clang/lib/Frontend/CompilerInvocation.cpp (-13)
  • (added) clang/test/OpenMP/gpu_target.cpp (+220)
  • (modified) clang/test/OpenMP/target_messages.cpp (-1)
diff --git a/clang/lib/CodeGen/CGOpenMPRuntimeGPU.cpp b/clang/lib/CodeGen/CGOpenMPRuntimeGPU.cpp
index 756f0482b8ea72..1ad4b4b0e8a7fc 100644
--- a/clang/lib/CodeGen/CGOpenMPRuntimeGPU.cpp
+++ b/clang/lib/CodeGen/CGOpenMPRuntimeGPU.cpp
@@ -870,9 +870,6 @@ CGOpenMPRuntimeGPU::CGOpenMPRuntimeGPU(CodeGenModule &CGM)
       hasRequiresUnifiedSharedMemory(), /*HasRequiresDynamicAllocators*/ false);
   OMPBuilder.setConfig(Config);
 
-  if (!CGM.getLangOpts().OpenMPIsTargetDevice)
-    llvm_unreachable("OpenMP can only handle device code.");
-
   if (CGM.getLangOpts().OpenMPCUDAMode)
     CurrentDataSharingMode = CGOpenMPRuntimeGPU::DS_CUDA;
 
diff --git a/clang/lib/CodeGen/CGStmtOpenMP.cpp b/clang/lib/CodeGen/CGStmtOpenMP.cpp
index 6cb37b20b7aeee..950ed173aecf3a 100644
--- a/clang/lib/CodeGen/CGStmtOpenMP.cpp
+++ b/clang/lib/CodeGen/CGStmtOpenMP.cpp
@@ -6801,7 +6801,8 @@ static void emitCommonOMPTargetDirective(CodeGenFunction &CGF,
   CodeGenModule &CGM = CGF.CGM;
 
   // On device emit this construct as inlined code.
-  if (CGM.getLangOpts().OpenMPIsTargetDevice) {
+  if (CGM.getLangOpts().OpenMPIsTargetDevice ||
+      CGM.getOpenMPRuntime().isGPU()) {
     OMPLexicalScope Scope(CGF, S, OMPD_target);
     CGM.getOpenMPRuntime().emitInlinedDirective(
         CGF, OMPD_target, [&S](CodeGenFunction &CGF, PrePostActionTy &) {
diff --git a/clang/lib/CodeGen/CodeGenModule.cpp b/clang/lib/CodeGen/CodeGenModule.cpp
index 5f15f0f48c54e4..26abd9a60632ae 100644
--- a/clang/lib/CodeGen/CodeGenModule.cpp
+++ b/clang/lib/CodeGen/CodeGenModule.cpp
@@ -483,8 +483,6 @@ void CodeGenModule::createOpenMPRuntime() {
   case llvm::Triple::nvptx:
   case llvm::Triple::nvptx64:
   case llvm::Triple::amdgcn:
-    assert(getLangOpts().OpenMPIsTargetDevice &&
-           "OpenMP AMDGPU/NVPTX is only prepared to deal with device code.");
     OpenMPRuntime.reset(new CGOpenMPRuntimeGPU(*this));
     break;
   default:
diff --git a/clang/lib/Frontend/CompilerInvocation.cpp b/clang/lib/Frontend/CompilerInvocation.cpp
index d711df02ce9503..d2df51593ff62b 100644
--- a/clang/lib/Frontend/CompilerInvocation.cpp
+++ b/clang/lib/Frontend/CompilerInvocation.cpp
@@ -4210,19 +4210,6 @@ bool CompilerInvocation::ParseLangArgs(LangOptions &Opts, ArgList &Args,
             Args, OPT_fopenmp_version_EQ,
             (IsSimdSpecified || IsTargetSpecified) ? 51 : Opts.OpenMP, Diags))
       Opts.OpenMP = Version;
-    // Provide diagnostic when a given target is not expected to be an OpenMP
-    // device or host.
-    if (!Opts.OpenMPIsTargetDevice) {
-      switch (T.getArch()) {
-      default:
-        break;
-      // Add unsupported host targets here:
-      case llvm::Triple::nvptx:
-      case llvm::Triple::nvptx64:
-        Diags.Report(diag::err_drv_omp_host_target_not_supported) << T.str();
-        break;
-      }
-    }
   }
 
   // Set the flag to prevent the implementation from emitting device exception
diff --git a/clang/test/OpenMP/gpu_target.cpp b/clang/test/OpenMP/gpu_target.cpp
new file mode 100644
index 00000000000000..3d5a47d7050436
--- /dev/null
+++ b/clang/test/OpenMP/gpu_target.cpp
@@ -0,0 +1,220 @@
+// NOTE: Assertions have been autogenerated by utils/update_cc_test_checks.py UTC_ARGS: --check-globals all --include-generated-funcs --replace-value-regex "__omp_offloading_[0-9a-z]+_[0-9a-z]+" "reduction_size[.].+[.]" "pl_cond[.].+[.|,]" --version 5
+// expected-no-diagnostics
+
+// RUN: %clang_cc1 -verify -fopenmp -fopenmp-version=45 -x c++ -triple amdgcn-amd-amdhsa -emit-llvm %s -o - | FileCheck %s --check-prefix=AMDGCN
+// RUN: %clang_cc1 -verify -fopenmp -fopenmp-version=45 -x c++ -triple nvptx64-nvidia-cuda -emit-llvm %s -o - | FileCheck %s --check-prefix=NVPTX
+
+typedef enum omp_allocator_handle_t {
+  omp_null_allocator = 0,
+  omp_default_mem_alloc = 1,
+  omp_large_cap_mem_alloc = 2,
+  omp_const_mem_alloc = 3,
+  omp_high_bw_mem_alloc = 4,
+  omp_low_lat_mem_alloc = 5,
+  omp_cgroup_mem_alloc = 6,
+  omp_pteam_mem_alloc = 7,
+  omp_thread_mem_alloc = 8,
+  KMP_ALLOCATOR_MAX_HANDLE = ~(0LU)
+} omp_allocator_handle_t;
+
+int d = 0;
+#pragma omp allocate(d) allocator(omp_default_mem_alloc)
+
+int g = 0;
+#pragma omp allocate(g) allocator(omp_cgroup_mem_alloc)
+
+extern const int c = 0;
+#pragma omp allocate(c) allocator(omp_const_mem_alloc)
+
+
+int foo() {
+  int t = 0;
+#pragma omp allocate(t) allocator(omp_thread_mem_alloc)
+  return t;
+}
+
+void bar() {
+#pragma omp target
+  ;
+#pragma omp parallel
+  ;
+}
+
+void baz(int *p) {
+#pragma omp atomic
+  *p += 1;
+}
+
+int qux() {
+#if defined(__NVPTX__)
+  return 1;
+#elif defined(__AMDGPU__)
+  return 2;
+#endif
+}
+//.
+// AMDGCN: @c = addrspace(4) constant i32 0, align 4
+// AMDGCN: @[[GLOB0:[0-9]+]] = private unnamed_addr constant [23 x i8] c"
+// AMDGCN: @[[GLOB1:[0-9]+]] = private unnamed_addr addrspace(1) constant %struct.ident_t { i32 0, i32 2, i32 0, i32 22, ptr @[[GLOB0]] }, align 8
+// AMDGCN: @d = global i32 0, align 4
+// AMDGCN: @g = global i32 0, align 4
+// AMDGCN: @__oclc_ABI_version = weak_odr hidden local_unnamed_addr addrspace(4) constant i32 500
+//.
+// NVPTX: @d = global i32 0, align 4
+// NVPTX: @g = global i32 0, align 4
+// NVPTX: @c = addrspace(4) constant i32 0, align 4
+// NVPTX: @[[GLOB0:[0-9]+]] = private unnamed_addr constant [23 x i8] c"
+// NVPTX: @[[GLOB1:[0-9]+]] = private unnamed_addr constant %struct.ident_t { i32 0, i32 2, i32 0, i32 22, ptr @[[GLOB0]] }, align 8
+//.
+// AMDGCN-LABEL: define dso_local noundef i32 @_Z3foov(
+// AMDGCN-SAME: ) #[[ATTR0:[0-9]+]] {
+// AMDGCN-NEXT:  [[ENTRY:.*:]]
+// AMDGCN-NEXT:    [[RETVAL:%.*]] = alloca i32, align 4, addrspace(5)
+// AMDGCN-NEXT:    [[T:%.*]] = alloca i32, align 4, addrspace(5)
+// AMDGCN-NEXT:    [[RETVAL_ASCAST:%.*]] = addrspacecast ptr addrspace(5) [[RETVAL]] to ptr
+// AMDGCN-NEXT:    [[T_ASCAST:%.*]] = addrspacecast ptr addrspace(5) [[T]] to ptr
+// AMDGCN-NEXT:    store i32 0, ptr [[T_ASCAST]], align 4
+// AMDGCN-NEXT:    [[TMP0:%.*]] = load i32, ptr [[T_ASCAST]], align 4
+// AMDGCN-NEXT:    ret i32 [[TMP0]]
+//
+//
+// AMDGCN-LABEL: define dso_local void @_Z3barv(
+// AMDGCN-SAME: ) #[[ATTR0]] {
+// AMDGCN-NEXT:  [[ENTRY:.*:]]
+// AMDGCN-NEXT:    [[CAPTURED_VARS_ADDRS:%.*]] = alloca [0 x ptr], align 8, addrspace(5)
+// AMDGCN-NEXT:    [[TMP0:%.*]] = call i32 @__kmpc_global_thread_num(ptr addrspacecast (ptr addrspace(1) @[[GLOB1]] to ptr))
+// AMDGCN-NEXT:    [[CAPTURED_VARS_ADDRS_ASCAST:%.*]] = addrspacecast ptr addrspace(5) [[CAPTURED_VARS_ADDRS]] to ptr
+// AMDGCN-NEXT:    call void @__kmpc_parallel_51(ptr addrspacecast (ptr addrspace(1) @[[GLOB1]] to ptr), i32 [[TMP0]], i32 1, i32 -1, i32 -1, ptr @_Z3barv_omp_outlined, ptr @_Z3barv_omp_outlined_wrapper, ptr [[CAPTURED_VARS_ADDRS_ASCAST]], i64 0)
+// AMDGCN-NEXT:    ret void
+//
+//
+// AMDGCN-LABEL: define internal void @_Z3barv_omp_outlined(
+// AMDGCN-SAME: ptr noalias noundef [[DOTGLOBAL_TID_:%.*]], ptr noalias noundef [[DOTBOUND_TID_:%.*]]) #[[ATTR1:[0-9]+]] {
+// AMDGCN-NEXT:  [[ENTRY:.*:]]
+// AMDGCN-NEXT:    [[DOTGLOBAL_TID__ADDR:%.*]] = alloca ptr, align 8, addrspace(5)
+// AMDGCN-NEXT:    [[DOTBOUND_TID__ADDR:%.*]] = alloca ptr, align 8, addrspace(5)
+// AMDGCN-NEXT:    [[DOTGLOBAL_TID__ADDR_ASCAST:%.*]] = addrspacecast ptr addrspace(5) [[DOTGLOBAL_TID__ADDR]] to ptr
+// AMDGCN-NEXT:    [[DOTBOUND_TID__ADDR_ASCAST:%.*]] = addrspacecast ptr addrspace(5) [[DOTBOUND_TID__ADDR]] to ptr
+// AMDGCN-NEXT:    store ptr [[DOTGLOBAL_TID_]], ptr [[DOTGLOBAL_TID__ADDR_ASCAST]], align 8
+// AMDGCN-NEXT:    store ptr [[DOTBOUND_TID_]], ptr [[DOTBOUND_TID__ADDR_ASCAST]], align 8
+// AMDGCN-NEXT:    ret void
+//
+//
+// AMDGCN-LABEL: define internal void @_Z3barv_omp_outlined_wrapper(
+// AMDGCN-SAME: i16 noundef zeroext [[TMP0:%.*]], i32 noundef [[TMP1:%.*]]) #[[ATTR2:[0-9]+]] {
+// AMDGCN-NEXT:  [[ENTRY:.*:]]
+// AMDGCN-NEXT:    [[DOTADDR:%.*]] = alloca i16, align 2, addrspace(5)
+// AMDGCN-NEXT:    [[DOTADDR1:%.*]] = alloca i32, align 4, addrspace(5)
+// AMDGCN-NEXT:    [[DOTZERO_ADDR:%.*]] = alloca i32, align 4, addrspace(5)
+// AMDGCN-NEXT:    [[GLOBAL_ARGS:%.*]] = alloca ptr, align 8, addrspace(5)
+// AMDGCN-NEXT:    [[DOTADDR_ASCAST:%.*]] = addrspacecast ptr addrspace(5) [[DOTADDR]] to ptr
+// AMDGCN-NEXT:    [[DOTADDR1_ASCAST:%.*]] = addrspacecast ptr addrspace(5) [[DOTADDR1]] to ptr
+// AMDGCN-NEXT:    [[DOTZERO_ADDR_ASCAST:%.*]] = addrspacecast ptr addrspace(5) [[DOTZERO_ADDR]] to ptr
+// AMDGCN-NEXT:    [[GLOBAL_ARGS_ASCAST:%.*]] = addrspacecast ptr addrspace(5) [[GLOBAL_ARGS]] to ptr
+// AMDGCN-NEXT:    store i16 [[TMP0]], ptr [[DOTADDR_ASCAST]], align 2
+// AMDGCN-NEXT:    store i32 [[TMP1]], ptr [[DOTADDR1_ASCAST]], align 4
+// AMDGCN-NEXT:    store i32 0, ptr [[DOTZERO_ADDR_ASCAST]], align 4
+// AMDGCN-NEXT:    call void @__kmpc_get_shared_variables(ptr [[GLOBAL_ARGS_ASCAST]])
+// AMDGCN-NEXT:    call void @_Z3barv_omp_outlined(ptr [[DOTADDR1_ASCAST]], ptr [[DOTZERO_ADDR_ASCAST]]) #[[ATTR3:[0-9]+]]
+// AMDGCN-NEXT:    ret void
+//
+//
+// AMDGCN-LABEL: define dso_local void @_Z3bazPi(
+// AMDGCN-SAME: ptr noundef [[P:%.*]]) #[[ATTR0]] {
+// AMDGCN-NEXT:  [[ENTRY:.*:]]
+// AMDGCN-NEXT:    [[P_ADDR:%.*]] = alloca ptr, align 8, addrspace(5)
+// AMDGCN-NEXT:    [[P_ADDR_ASCAST:%.*]] = addrspacecast ptr addrspace(5) [[P_ADDR]] to ptr
+// AMDGCN-NEXT:    store ptr [[P]], ptr [[P_ADDR_ASCAST]], align 8
+// AMDGCN-NEXT:    [[TMP0:%.*]] = load ptr, ptr [[P_ADDR_ASCAST]], align 8
+// AMDGCN-NEXT:    [[TMP1:%.*]] = atomicrmw add ptr [[TMP0]], i32 1 monotonic, align 4
+// AMDGCN-NEXT:    ret void
+//
+//
+// AMDGCN-LABEL: define dso_local noundef i32 @_Z3quxv(
+// AMDGCN-SAME: ) #[[ATTR0]] {
+// AMDGCN-NEXT:  [[ENTRY:.*:]]
+// AMDGCN-NEXT:    [[RETVAL:%.*]] = alloca i32, align 4, addrspace(5)
+// AMDGCN-NEXT:    [[RETVAL_ASCAST:%.*]] = addrspacecast ptr addrspace(5) [[RETVAL]] to ptr
+// AMDGCN-NEXT:    ret i32 2
+//
+//
+// NVPTX-LABEL: define dso_local noundef i32 @_Z3foov(
+// NVPTX-SAME: ) #[[ATTR0:[0-9]+]] {
+// NVPTX-NEXT:  [[ENTRY:.*:]]
+// NVPTX-NEXT:    [[T:%.*]] = alloca i32, align 4
+// NVPTX-NEXT:    store i32 0, ptr [[T]], align 4
+// NVPTX-NEXT:    [[TMP0:%.*]] = load i32, ptr [[T]], align 4
+// NVPTX-NEXT:    ret i32 [[TMP0]]
+//
+//
+// NVPTX-LABEL: define dso_local void @_Z3barv(
+// NVPTX-SAME: ) #[[ATTR0]] {
+// NVPTX-NEXT:  [[ENTRY:.*:]]
+// NVPTX-NEXT:    [[CAPTURED_VARS_ADDRS:%.*]] = alloca [0 x ptr], align 8
+// NVPTX-NEXT:    [[TMP0:%.*]] = call i32 @__kmpc_global_thread_num(ptr @[[GLOB1]])
+// NVPTX-NEXT:    call void @__kmpc_parallel_51(ptr @[[GLOB1]], i32 [[TMP0]], i32 1, i32 -1, i32 -1, ptr @_Z3barv_omp_outlined, ptr @_Z3barv_omp_outlined_wrapper, ptr [[CAPTURED_VARS_ADDRS]], i64 0)
+// NVPTX-NEXT:    ret void
+//
+//
+// NVPTX-LABEL: define internal void @_Z3barv_omp_outlined(
+// NVPTX-SAME: ptr noalias noundef [[DOTGLOBAL_TID_:%.*]], ptr noalias noundef [[DOTBOUND_TID_:%.*]]) #[[ATTR1:[0-9]+]] {
+// NVPTX-NEXT:  [[ENTRY:.*:]]
+// NVPTX-NEXT:    [[DOTGLOBAL_TID__ADDR:%.*]] = alloca ptr, align 8
+// NVPTX-NEXT:    [[DOTBOUND_TID__ADDR:%.*]] = alloca ptr, align 8
+// NVPTX-NEXT:    store ptr [[DOTGLOBAL_TID_]], ptr [[DOTGLOBAL_TID__ADDR]], align 8
+// NVPTX-NEXT:    store ptr [[DOTBOUND_TID_]], ptr [[DOTBOUND_TID__ADDR]], align 8
+// NVPTX-NEXT:    ret void
+//
+//
+// NVPTX-LABEL: define internal void @_Z3barv_omp_outlined_wrapper(
+// NVPTX-SAME: i16 noundef zeroext [[TMP0:%.*]], i32 noundef [[TMP1:%.*]]) #[[ATTR2:[0-9]+]] {
+// NVPTX-NEXT:  [[ENTRY:.*:]]
+// NVPTX-NEXT:    [[DOTADDR:%.*]] = alloca i16, align 2
+// NVPTX-NEXT:    [[DOTADDR1:%.*]] = alloca i32, align 4
+// NVPTX-NEXT:    [[DOTZERO_ADDR:%.*]] = alloca i32, align 4
+// NVPTX-NEXT:    [[GLOBAL_ARGS:%.*]] = alloca ptr, align 8
+// NVPTX-NEXT:    store i16 [[TMP0]], ptr [[DOTADDR]], align 2
+// NVPTX-NEXT:    store i32 [[TMP1]], ptr [[DOTADDR1]], align 4
+// NVPTX-NEXT:    store i32 0, ptr [[DOTZERO_ADDR]], align 4
+// NVPTX-NEXT:    call void @__kmpc_get_shared_variables(ptr [[GLOBAL_ARGS]])
+// NVPTX-NEXT:    call void @_Z3barv_omp_outlined(ptr [[DOTADDR1]], ptr [[DOTZERO_ADDR]]) #[[ATTR3:[0-9]+]]
+// NVPTX-NEXT:    ret void
+//
+//
+// NVPTX-LABEL: define dso_local void @_Z3bazPi(
+// NVPTX-SAME: ptr noundef [[P:%.*]]) #[[ATTR0]] {
+// NVPTX-NEXT:  [[ENTRY:.*:]]
+// NVPTX-NEXT:    [[P_ADDR:%.*]] = alloca ptr, align 8
+// NVPTX-NEXT:    store ptr [[P]], ptr [[P_ADDR]], align 8
+// NVPTX-NEXT:    [[TMP0:%.*]] = load ptr, ptr [[P_ADDR]], align 8
+// NVPTX-NEXT:    [[TMP1:%.*]] = atomicrmw add ptr [[TMP0]], i32 1 monotonic, align 4
+// NVPTX-NEXT:    ret void
+//
+//
+// NVPTX-LABEL: define dso_local noundef i32 @_Z3quxv(
+// NVPTX-SAME: ) #[[ATTR0]] {
+// NVPTX-NEXT:  [[ENTRY:.*:]]
+// NVPTX-NEXT:    ret i32 1
+//
+//.
+// AMDGCN: attributes #[[ATTR0]] = { convergent mustprogress noinline nounwind optnone "no-trapping-math"="true" "stack-protector-buffer-size"="8" }
+// AMDGCN: attributes #[[ATTR1]] = { convergent noinline norecurse nounwind optnone "no-trapping-math"="true" "stack-protector-buffer-size"="8" }
+// AMDGCN: attributes #[[ATTR2]] = { convergent noinline norecurse nounwind "no-trapping-math"="true" "stack-protector-buffer-size"="8" }
+// AMDGCN: attributes #[[ATTR3]] = { nounwind }
+// AMDGCN: attributes #[[ATTR4:[0-9]+]] = { alwaysinline }
+//.
+// NVPTX: attributes #[[ATTR0]] = { convergent mustprogress noinline nounwind optnone "no-trapping-math"="true" "stack-protector-buffer-size"="8" "target-features"="+ptx32" }
+// NVPTX: attributes #[[ATTR1]] = { convergent noinline norecurse nounwind optnone "no-trapping-math"="true" "stack-protector-buffer-size"="8" "target-features"="+ptx32" }
+// NVPTX: attributes #[[ATTR2]] = { convergent noinline norecurse nounwind "no-trapping-math"="true" "stack-protector-buffer-size"="8" "target-features"="+ptx32" }
+// NVPTX: attributes #[[ATTR3]] = { nounwind }
+// NVPTX: attributes #[[ATTR4:[0-9]+]] = { alwaysinline }
+//.
+// AMDGCN: [[META0:![0-9]+]] = !{i32 1, !"amdhsa_code_object_version", i32 500}
+// AMDGCN: [[META1:![0-9]+]] = !{i32 1, !"wchar_size", i32 4}
+// AMDGCN: [[META2:![0-9]+]] = !{i32 7, !"openmp", i32 45}
+// AMDGCN: [[META3:![0-9]+]] = !{!"{{.*}}clang version {{.*}}"}
+//.
+// NVPTX: [[META0:![0-9]+]] = !{i32 1, !"wchar_size", i32 4}
+// NVPTX: [[META1:![0-9]+]] = !{i32 7, !"openmp", i32 45}
+// NVPTX: [[META2:![0-9]+]] = !{!"{{.*}}clang version {{.*}}"}
+//.
diff --git a/clang/test/OpenMP/target_messages.cpp b/clang/test/OpenMP/target_messages.cpp
index 62ab5817ba28da..7e68adddc949e0 100644
--- a/clang/test/OpenMP/target_messages.cpp
+++ b/clang/test/OpenMP/target_messages.cpp
@@ -8,7 +8,6 @@
 // CHECK: error: OpenMP target is invalid: 'aaa-bbb-ccc-ddd'
 // RUN: not %clang_cc1 -fopenmp -std=c++11 -triple nvptx64-nvidia-cuda -o - %s 2>&1 | FileCheck --check-prefix CHECK-UNSUPPORTED-HOST-TARGET %s
 // RUN: not %clang_cc1 -fopenmp -std=c++11 -triple nvptx-nvidia-cuda -o - %s 2>&1 | FileCheck --check-prefix CHECK-UNSUPPORTED-HOST-TARGET %s
-// CHECK-UNSUPPORTED-HOST-TARGET: error: target '{{nvptx64-nvidia-cuda|nvptx-nvidia-cuda}}' is not a supported OpenMP host target
 // RUN: not %clang_cc1 -fopenmp -std=c++11 -fopenmp-targets=hexagon-linux-gnu -o - %s 2>&1 | FileCheck --check-prefix CHECK-UNSUPPORTED-DEVICE-TARGET %s
 // CHECK-UNSUPPORTED-DEVICE-TARGET: OpenMP target is invalid: 'hexagon-linux-gnu'
 

@alexey-bataev (Member) commented:

Maybe just turn on OpenMPIsTargetDevice if gpu target + -fopenmp is specified?

@jhuber6 (Contributor, Author) commented Jan 8, 2025

Maybe just turn on OpenMPIsTargetDevice if gpu target + -fopenmp is specified?

I'll give it a try.

@shiltian (Contributor) commented Jan 8, 2025

What code generation path would be used in this case? The GPU code generation or regular host OpenMP?

@jhuber6 (Contributor, Author) commented Jan 8, 2025

What code generation path would be used in this case? The GPU code generation or regular host OpenMP?

The GPU path. I'm treating that as the code generation path that creates correct runtime code for the GPU, i.e. you can link the result with your OpenMP offloading program and it will work.

@alexey-bataev (Member) commented:

What code generation path would be used in this case? The GPU code generation or regular host OpenMP?

GPU device code.

@jhuber6 (Contributor, Author) commented Jan 8, 2025

Maybe just turn on OpenMPIsTargetDevice if gpu target + -fopenmp is specified?

Doesn't work; it causes all definitions to be stripped because they are not declared on the device, which is not what we want.

@shiltian (Contributor) commented Jan 8, 2025

I don't think it should be the GPU code generation path, since no explicit target region is used.
Perhaps I'm missing something here. Do you expect regular OpenMP constructs such as parallel regions to be emitted in the same way as offloading code?

@jhuber6
Copy link
Contributor Author

jhuber6 commented Jan 8, 2025

I don't think it should be the GPU code generation path, since no explicit target region is used.

It needs to be; otherwise the code generation for things like #pragma omp parallel will be wrong. The way I see it, the DeviceRTL is libomp.a for the GPU target, so we need to emit runtime calls for that runtime.

Summary:
Currently we prevent the following from working. However, it is
completely reasonable to be able to target the files individually.
```
$ clang --target=amdgcn-amd-amdhsa -fopenmp
```

This patch lifts this restriction, allowing individual files to be
compiled as standalone OpenMP without the extra offloading overhead. The
main motivation behind this is to update the build of the OpenMP
DeviceRTL. Currently, we do `--offload-device-only -S -emit-llvm` which
is just a hackier version of `-fopenmp -flto -c`.

This patch allows the following to work.
```
$ clang omp.c -fopenmp --target=amdgcn-amd-amdhsa -flto -c
$ clang offload.c -fopenmp -fopenmp-targets=amdgcn-amd-amdhsa -Xoffload-linker omp.o
$ ./a.out
```
@jhuber6 (Contributor, Author) commented Jan 8, 2025

I don't think it should be the GPU code generation path, since no explicit target region is used. Perhaps I'm missing something here. Do you expect regular OpenMP constructs such as parallel regions to be emitted in the same way as offloading code?

Yes, the example in the description is how I can see this being used. It basically lets us emit GPU code without the build-system complications introduced by offloading languages (i.e. we no longer need variants or declare target).
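
To illustrate (hypothetical code, not part of this patch): today a GPU-only helper that lives in an offloading TU has to be bracketed explicitly, whereas with this change the same code can sit in a plain file that is simply compiled for the GPU triple.

```c
// Status quo (sketch, assuming OpenMP 5.1 "begin declare target" syntax):
// GPU-only code inside an offloading compilation has to be marked explicitly.
#pragma omp begin declare target device_type(nohost)
void gpu_helper(int *p) { *p = 42; }
#pragma omp end declare target

// With this patch (sketch): the same helper lives in a plain TU with no
// declare target or variant machinery and is built directly for the GPU, e.g.
//   clang helper.c -fopenmp --target=amdgcn-amd-amdhsa -flto -c
```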

@shiltian (Contributor) commented Jan 8, 2025

I think that is a misuse of OpenMP semantics. We can't expect regular OpenMP code to work the same way as OpenMP offloading code when targeting a GPU while the code is not wrapped in a target region or declare target. I understand that having variants and declare target is not convenient, but it conforms to the OpenMP standard. I don't think this change conforms to the standard.

@jhuber6 (Contributor, Author) commented Jan 8, 2025

I think that is a misuse of OpenMP semantics. We can't expect regular OpenMP code to work the same way as OpenMP offloading code when targeting a GPU while the code is not wrapped in a target region or declare target. I understand that having variants and declare target is not convenient, but it conforms to the OpenMP standard. I don't think this change conforms to the standard.

It should maintain the normal semantics you'd get with -fopenmp except it codegens certain things differently. Alternatively I could just remove OpenMP entirely from the DeviceRTL so I might just do that instead.

@shiltian (Contributor) commented Jan 8, 2025

It should maintain the normal semantics you'd get with -fopenmp except it codegens certain things differently.

That is the key difference.

Alternatively I could just remove OpenMP entirely from the DeviceRTL so I might just do that instead.

+1

@jdoerfert (Member) left a comment

We can't expect regular OpenMP code to work the same way as OpenMP offloading code when targeting a GPU while the code is not wrapped in a target region or declare target

The way I see this is:
If the target is a GPU, and OpenMP is enabled (-fopenmp), what we should do is codegen it targeting the GPU OpenMP runtime. This is what this patch does and I don't think it violates anything. What we define as the internal OpenMP runtime could reasonably be different on every target, or even depending on the day of the week ;). No runtime would also be an option if we codegen all the semantics, though we really don't want that.

That said, this makes sense to me. LG.

P.S. I'd suggest discussing the device runtime in the actual follow-up.

@shiltian (Contributor) commented Jan 8, 2025

I see it in a different way. #pragma omp target parallel (let's just assume this is valid code) is different from #pragma omp parallel, no matter what the target is. However, this patch says that, when targeting a GPU, #pragma omp parallel is #pragma omp target parallel. We have to talk about internal details here because we are implementing the standard.

Apparently the compiler skips non-target OpenMP regions when we compile for a GPU in the regular offloading compilation pipeline.

@jhuber6 (Contributor, Author) commented Jan 8, 2025

I guess I'll see how much I favor this approach depending on how much more difficult it is to build the DeviceRTL without OpenMP. I think the only thing we'd miss is the #pragma omp assumes(...) business, which might have another way to be emitted in LLVM/Clang?

@kparzysz (Contributor) commented Jan 8, 2025

To me this looks like compilation for a host, except the GPU is the host. The only functions that could be called from such a CU would be the top-level ones, not any of the auto-generated ones.

Additionally, the host wouldn't support offload, so we'd need to do something about TARGET (or any other construct that cannot be inside of TARGET). We should probably ignore those with a diagnostic.

@jdoerfert (Member) commented:

I see it in a different way. #pragma omp target parallel (let's just assume this is valid code) is different from #pragma omp parallel, no matter what the target is. However, this patch says that, when targeting a GPU, #pragma omp parallel is #pragma omp target parallel. We have to talk about internal details here because we are implementing the standard.

This is not the case. This patch does not make omp parallel the same as omp target parallel.
This patch does, however, implement omp parallel with _parallel_51 if the target is a GPU, and otherwise with fork_threads. From the standard's perspective, what we provide is the semantics of omp parallel; the implementation just happens to differ between targets. target parallel means "try to offload, then parallel", which for the host still does that. If your triple is now a GPU, we would not implement it as a parallel, at least not with this patch. We should either emit an error, or emit a task containing a parallel, effectively emitting only the fallback.
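
For concreteness, a rough sketch of that difference (the input file is hypothetical; the host-side entry point named here is the usual __kmpc_fork_call, and the GPU-side one is the __kmpc_parallel_51 call visible in the new test):

```c
// same.c (hypothetical input)
void f(void) {
#pragma omp parallel
  { /* ... */ }
}
// Host triple (e.g. x86_64-unknown-linux): the region is outlined and launched
// through the host runtime's __kmpc_fork_call (the "fork_threads" path above).
// GPU triple (with this patch): the outlined region and its wrapper are
// launched through the DeviceRTL's __kmpc_parallel_51, as the AMDGCN/NVPTX
// CHECK lines in clang/test/OpenMP/gpu_target.cpp show.
```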

@shiltian (Contributor) commented Jan 8, 2025

To me this looks like compilation for a host, except the GPU is the host. The only functions that could be called from such a CU would be the top-level ones, not any of the auto-generated ones.

Additionally, the host wouldn't support offload, so we'd need to do something about TARGET (or any other construct that cannot be inside of TARGET). We should probably ignore those with a diagnostic.

I get it, but that doesn't look like the case. If you look at the test case, the target region in bar is simply ignored. To me this looks like treating the entire TU as if it were wrapped in a giant target region, rather than compiling for a host.

This is not the case. This patch does not make omp parallel the same as omp target parallel. This patch does, however, implement omp parallel with _parallel_51 if the target is a GPU, and otherwise with fork_threads. From the standard's perspective, what we provide is the semantics of omp parallel; the implementation just happens to differ between targets. target parallel means "try to offload, then parallel", which for the host still does that. If your triple is now a GPU, we would not implement it as a parallel, at least not with this patch. We should either emit an error, or emit a task containing a parallel, effectively emitting only the fallback.

The patch does make omp parallel the same as omp target parallel from code generation's perspective. That is exactly the part I'm not 100% sure is right, because even the same construct might behave differently when it is nested in a target region. I can't give a concrete example here, and I also might be wrong. If this is not an issue (from the standard's perspective), then I'm fine with the change, as long as we have defined behavior for dealing with target regions, instead of simply ignoring them, which does seem like a bug to me.

@jdoerfert (Member) commented:

I get it, but that doesn't look like the case. If you look at the test case, the target region in bar is simply ignored. To me this looks like treating the entire TU as if it were wrapped in a giant target region, rather than compiling for a host.

That is a good point. I think we can fix it in a follow-up, though.
For bar, not doing anything is technically legal, since the target region is empty.
However, if we were to do

#pragma omp target 
{
  *G = 1;
  omp_set_thread_limit(3);
}

we should see the store, and the ICV change inside the target task.
As I mentioned, we could, for now, reasonably error out for omp target if the triple is a GPU.
Otherwise, we need to codegen it as if it is a task.

All that said, there are two cases to consider wrt. the standard:

  1. The initial device is the CPU and the code compiled here is just part of a GPU library, or
  2. the initial device is the GPU and the code compiled here is just part of the "host code".

For 1), omp target, w/o ancestor, is disallowed, IIRC.
For 2), it should work as if it is a task, basically we do not implement "offloading" from this host, which is totally fine.

SizeEmitter) {
SmallString<256> Buffer;
llvm::raw_svector_ostream Out(Buffer);
Out << "Cannot emit a '#pragma omp target' on the GPU";
Review comment (Contributor):
Should this be a Sema or CodeGen error?

@shiltian (Contributor) left a comment

Do we have a clear idea of whether a construct can behave differently when it is nested in a target region?

@jhuber6 (Contributor, Author) commented Jan 9, 2025

Do we have a clear idea of whether a construct can behave differently when it is nested in a target region?

I'm unsure exactly; the target regions are just outlined, so it shouldn't affect anything on a codegen level.

@kparzysz (Contributor) commented Jan 9, 2025

All that said, there are two cases to consider wrt. the standard:

  1. The initial device is the CPU and the code compiled here is just part of a GPU library, or
  2. the initial device is the GPU and the code compiled here is just part of the "host code".

For 1), omp target, w/o ancestor, is disallowed, IIRC. For 2), it should work as if it is a task, basically we do not implement "offloading" from this host, which is totally fine.

Option (2) summarizes my point of view.

The use case that I have in mind for this is

cc -fopenmp -target=cpu -offload-target=gpu -c foo.c
cc -fopenmp -target=gpu -c bar.c
link device-part-of-foo.o bar.o -o works-on-gpu

@jhuber6 (Contributor, Author) commented Jan 9, 2025

All that said, there are two cases to consider wrt. the standard:

  1. The initial device is the CPU and the code compiled here is just part of a GPU library, or
  2. the initial device is the GPU and the code compiled here is just part of the "host code".

For 1), omp target, w/o ancestor, is disallowed, IIRC. For 2), it should work as if it is a task, basically we do not implement "offloading" from this host, which is totally fine.

Option (2) summarizes my point of view.

The use case that I have in mind for this is

cc -fopenmp -target=cpu -offload-target=gpu -c foo.c
cc -fopenmp -target=gpu -c bar.c
link device-part-of-foo.o bar.o -o works-on-gpu

That's what I had in mind, just another way to generate the GPU code that OpenMP wants.

@shiltian (Contributor) commented Jan 9, 2025

the target regions are just outlined, so it shouldn't affect anything on a codegen level.

No, they are not. The standard defines the execution behavior, and codegen has to conform to it. The current GPU CodeGen in this discussion assumes it is generating code for constructs inside a target region (because almost all other OpenMP constructs are skipped), which is the most essential question here. I have no problem with treating a GPU target as a host, but then it has to conform to host execution, i.e. running outside of a target region. If the standard says all constructs have to behave the same when they are wrapped in declare target (and its friends), I'm totally fine with it.

// host
#pragma omp some_construct
{ /* some code */ }

// "offload"
#pragma omp declare target
#pragma omp same_construct_as_above
{ /* same code as above */ }

This patch basically wraps the entire TU in a giant implicit declare target when it compiles (or, specifically, codegens) for a GPU, as shown in the second code block above. I guess this might be a question for the language committee.

@Meinersbur (Member) commented:

It should maintain the normal semantics you'd get with -fopenmp except it codegens certain things differently.

We cannot do things differently. We also have host-offloading targets; --target=x86_64-unknown-linux -fopenmp would be ambiguous.

There are differences between target and non-target code, even if it is only for #pragma omp metadirective when(construct={target}: ....

@jhuber6 (Contributor, Author) commented Jan 21, 2025

This was mostly a workaround for wanting to do #123673, but I just took the extra time to fix the bugs and do the C++ approach. I can close this.

@jhuber6 closed this Jan 21, 2025