
[OpenMP][FIX] Allocate per launch memory for GPU team reductions #70752


Merged
1 commit merged into llvm:main on Nov 1, 2023

Conversation

@jdoerfert (Member) commented on Oct 31, 2023

We used to perform team reduction on global memory allocated in the
runtime and by clang. This was racy as multiple instances of a kernel,
or different kernels with team reductions, would use the same locations.
Since we now have the kernel launch environment, we can allocate dynamic
memory per-launch, allowing us to move all the state into a non-racy
place.

Fixes: #70249
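
For context, the sketch below illustrates the shape of the fix rather than the actual implementation: instead of every launch funneling its partial team-reduction results through one fixed device-global buffer, the offload plugin allocates a scratch buffer per kernel launch and hands it to the kernel through the launch environment (surfaced in the diff as the new implicit dyn_ptr argument). All identifiers in the sketch (KernelLaunchEnvironmentTy, ReductionBuffer, allocDeviceMemory, setUpLaunchEnvironment) are hypothetical simplifications, not the real libomptarget interfaces.

// Minimal, hypothetical host-side sketch of per-launch reduction scratch
// memory. Names are illustrative only; the real logic lives in the
// libomptarget plugin interface touched by this patch.
#include <cstdint>
#include <cstdlib>

// Stand-in for a device allocation; a real plugin would allocate device memory.
static void *allocDeviceMemory(uint64_t Size) { return std::malloc(Size); }

struct KernelLaunchEnvironmentTy {
  // Scratch space for team reductions, owned by this launch. Previously the
  // equivalent storage was a single fixed-size global shared by all launches,
  // so concurrent launches (or different kernels) could race on it.
  void *ReductionBuffer = nullptr;
};

// Called once per kernel launch. ReductionBufferSize stands in for the size
// the compiler computed for this kernel's team-reduction record (the
// BufferSize that codegen now forwards via createTargetDeinit in the diff).
KernelLaunchEnvironmentTy setUpLaunchEnvironment(uint64_t ReductionBufferSize) {
  KernelLaunchEnvironmentTy Env;
  if (ReductionBufferSize)
    Env.ReductionBuffer = allocDeviceMemory(ReductionBufferSize);
  return Env; // conceptually what the kernel receives through dyn_ptr
}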

@jdoerfert added the openmp, clang:openmp, and openmp:libomptarget labels on Oct 31, 2023
@jdoerfert requested review from shiltian and jhuber6 on Oct 31, 2023 at 00:15
@llvmbot added the clang, backend:AMDGPU, clang:frontend, clang:codegen, flang:openmp, and llvm:transforms labels on Oct 31, 2023
@llvmbot (Member) commented on Oct 31, 2023

@llvm/pr-subscribers-mlir-llvm
@llvm/pr-subscribers-mlir
@llvm/pr-subscribers-clang
@llvm/pr-subscribers-llvm-transforms
@llvm/pr-subscribers-backend-amdgpu
@llvm/pr-subscribers-flang-openmp
@llvm/pr-subscribers-clang-codegen

@llvm/pr-subscribers-openmp

Author: Johannes Doerfert (jdoerfert)

Changes

First commit is part of #70401


Patch is 4.72 MiB, truncated to 20.00 KiB below, full version: https://github.com/llvm/llvm-project/pull/70752.diff

186 Files Affected:

  • (modified) clang/lib/CodeGen/CGOpenMPRuntimeGPU.cpp (+28-47)
  • (modified) clang/lib/CodeGen/CGOpenMPRuntimeGPU.h (-2)
  • (modified) clang/lib/Sema/SemaOpenMP.cpp (+18-8)
  • (modified) clang/test/OpenMP/amdgcn_target_codegen.cpp (+10-4)
  • (modified) clang/test/OpenMP/amdgcn_target_device_vla.cpp (+20-8)
  • (modified) clang/test/OpenMP/amdgcn_target_init_temp_alloca.cpp (+2)
  • (modified) clang/test/OpenMP/amdgpu_target_with_aligned_attribute.c (+5-2)
  • (modified) clang/test/OpenMP/assumes_include_nvptx.cpp (+2-2)
  • (modified) clang/test/OpenMP/bug60602.cpp (+7-7)
  • (modified) clang/test/OpenMP/declare_target_codegen.cpp (+6-6)
  • (modified) clang/test/OpenMP/declare_target_codegen_globalization.cpp (+4-2)
  • (modified) clang/test/OpenMP/declare_target_link_codegen.cpp (+1-1)
  • (modified) clang/test/OpenMP/declare_variant_mixed_codegen.c (+1-1)
  • (modified) clang/test/OpenMP/distribute_codegen.cpp (+62-42)
  • (modified) clang/test/OpenMP/distribute_firstprivate_codegen.cpp (+36-36)
  • (modified) clang/test/OpenMP/distribute_lastprivate_codegen.cpp (+36-36)
  • (modified) clang/test/OpenMP/distribute_parallel_for_codegen.cpp (+118-118)
  • (modified) clang/test/OpenMP/distribute_parallel_for_firstprivate_codegen.cpp (+50-50)
  • (modified) clang/test/OpenMP/distribute_parallel_for_if_codegen.cpp (+31-31)
  • (modified) clang/test/OpenMP/distribute_parallel_for_lastprivate_codegen.cpp (+50-50)
  • (modified) clang/test/OpenMP/distribute_parallel_for_num_threads_codegen.cpp (+152-152)
  • (modified) clang/test/OpenMP/distribute_parallel_for_private_codegen.cpp (+50-50)
  • (modified) clang/test/OpenMP/distribute_parallel_for_proc_bind_codegen.cpp (+11-11)
  • (modified) clang/test/OpenMP/distribute_parallel_for_simd_codegen.cpp (+118-118)
  • (modified) clang/test/OpenMP/distribute_parallel_for_simd_firstprivate_codegen.cpp (+50-50)
  • (modified) clang/test/OpenMP/distribute_parallel_for_simd_if_codegen.cpp (+128-128)
  • (modified) clang/test/OpenMP/distribute_parallel_for_simd_lastprivate_codegen.cpp (+50-50)
  • (modified) clang/test/OpenMP/distribute_parallel_for_simd_num_threads_codegen.cpp (+152-152)
  • (modified) clang/test/OpenMP/distribute_parallel_for_simd_private_codegen.cpp (+50-50)
  • (modified) clang/test/OpenMP/distribute_parallel_for_simd_proc_bind_codegen.cpp (+11-11)
  • (modified) clang/test/OpenMP/distribute_private_codegen.cpp (+40-40)
  • (modified) clang/test/OpenMP/distribute_simd_codegen.cpp (+60-20)
  • (modified) clang/test/OpenMP/distribute_simd_firstprivate_codegen.cpp (+36-36)
  • (modified) clang/test/OpenMP/distribute_simd_lastprivate_codegen.cpp (+36-36)
  • (modified) clang/test/OpenMP/distribute_simd_private_codegen.cpp (+40-40)
  • (modified) clang/test/OpenMP/distribute_simd_reduction_codegen.cpp (+14-14)
  • (modified) clang/test/OpenMP/nvptx_SPMD_codegen.cpp (+2679-2301)
  • (modified) clang/test/OpenMP/nvptx_data_sharing.cpp (+4-2)
  • (modified) clang/test/OpenMP/nvptx_declare_target_var_ctor_dtor_codegen.cpp (+1-1)
  • (modified) clang/test/OpenMP/nvptx_distribute_parallel_generic_mode_codegen.cpp (+8-4)
  • (modified) clang/test/OpenMP/nvptx_lambda_capturing.cpp (+47-27)
  • (modified) clang/test/OpenMP/nvptx_multi_target_parallel_codegen.cpp (+16-8)
  • (modified) clang/test/OpenMP/nvptx_nested_parallel_codegen.cpp (+8-4)
  • (modified) clang/test/OpenMP/nvptx_parallel_codegen.cpp (+24-12)
  • (modified) clang/test/OpenMP/nvptx_parallel_for_codegen.cpp (+4-2)
  • (modified) clang/test/OpenMP/nvptx_target_codegen.cpp (+64-32)
  • (modified) clang/test/OpenMP/nvptx_target_firstprivate_codegen.cpp (+12-6)
  • (modified) clang/test/OpenMP/nvptx_target_parallel_codegen.cpp (+16-8)
  • (modified) clang/test/OpenMP/nvptx_target_parallel_num_threads_codegen.cpp (+16-8)
  • (modified) clang/test/OpenMP/nvptx_target_parallel_proc_bind_codegen.cpp (+72-36)
  • (modified) clang/test/OpenMP/nvptx_target_parallel_reduction_codegen.cpp (+36-18)
  • (modified) clang/test/OpenMP/nvptx_target_parallel_reduction_codegen_tbaa_PR46146.cpp (+272-268)
  • (modified) clang/test/OpenMP/nvptx_target_printf_codegen.c (+24-12)
  • (modified) clang/test/OpenMP/nvptx_target_simd_codegen.cpp (+318-270)
  • (modified) clang/test/OpenMP/nvptx_target_teams_codegen.cpp (+24-12)
  • (modified) clang/test/OpenMP/nvptx_target_teams_distribute_codegen.cpp (+8-4)
  • (modified) clang/test/OpenMP/nvptx_target_teams_distribute_parallel_for_codegen.cpp (+72-36)
  • (modified) clang/test/OpenMP/nvptx_target_teams_distribute_parallel_for_generic_mode_codegen.cpp (+8-4)
  • (modified) clang/test/OpenMP/nvptx_target_teams_distribute_parallel_for_simd_codegen.cpp (+364-348)
  • (modified) clang/test/OpenMP/nvptx_target_teams_distribute_simd_codegen.cpp (+390-342)
  • (modified) clang/test/OpenMP/nvptx_target_teams_generic_loop_codegen.cpp (+60-30)
  • (modified) clang/test/OpenMP/nvptx_target_teams_generic_loop_generic_mode_codegen.cpp (+8-4)
  • (modified) clang/test/OpenMP/nvptx_target_teams_ompx_bare_codegen.cpp (+3-1)
  • (modified) clang/test/OpenMP/nvptx_teams_codegen.cpp (+32-16)
  • (modified) clang/test/OpenMP/nvptx_teams_reduction_codegen.cpp (+156-138)
  • (modified) clang/test/OpenMP/ompx_attributes_codegen.cpp (+3-3)
  • (modified) clang/test/OpenMP/openmp_offload_codegen.cpp (+1-1)
  • (modified) clang/test/OpenMP/reduction_implicit_map.cpp (+35-33)
  • (modified) clang/test/OpenMP/remarks_parallel_in_multiple_target_state_machines.c (+2-1)
  • (modified) clang/test/OpenMP/remarks_parallel_in_target_state_machine.c (+2-1)
  • (modified) clang/test/OpenMP/target_codegen_global_capture.cpp (+30-30)
  • (modified) clang/test/OpenMP/target_firstprivate_codegen.cpp (+72-24)
  • (modified) clang/test/OpenMP/target_map_codegen_03.cpp (+6-6)
  • (modified) clang/test/OpenMP/target_map_member_expr_codegen.cpp (+2-2)
  • (modified) clang/test/OpenMP/target_ompx_dyn_cgroup_mem_codegen.cpp (+36-12)
  • (modified) clang/test/OpenMP/target_parallel_codegen.cpp (+42-14)
  • (modified) clang/test/OpenMP/target_parallel_debug_codegen.cpp (+441-420)
  • (modified) clang/test/OpenMP/target_parallel_for_codegen.cpp (+42-14)
  • (modified) clang/test/OpenMP/target_parallel_for_debug_codegen.cpp (+610-589)
  • (modified) clang/test/OpenMP/target_parallel_for_simd_codegen.cpp (+84-28)
  • (modified) clang/test/OpenMP/target_parallel_for_simd_tl_codegen.cpp (+79-3)
  • (modified) clang/test/OpenMP/target_parallel_for_tl_codegen.cpp (+72-3)
  • (modified) clang/test/OpenMP/target_parallel_generic_loop_codegen-1.cpp (+44-44)
  • (modified) clang/test/OpenMP/target_parallel_generic_loop_codegen-2.cpp (+24-16)
  • (modified) clang/test/OpenMP/target_parallel_generic_loop_codegen-3.cpp (+610-589)
  • (modified) clang/test/OpenMP/target_parallel_generic_loop_codegen.cpp (+5-2)
  • (modified) clang/test/OpenMP/target_parallel_generic_loop_depend_codegen.cpp (+4-6)
  • (modified) clang/test/OpenMP/target_parallel_generic_loop_tl_codegen.cpp (+72-3)
  • (modified) clang/test/OpenMP/target_parallel_generic_loop_uses_allocators_codegen.cpp (+2-2)
  • (modified) clang/test/OpenMP/target_parallel_if_codegen.cpp (+96-72)
  • (modified) clang/test/OpenMP/target_parallel_num_threads_codegen.cpp (+78-54)
  • (modified) clang/test/OpenMP/target_parallel_tl_codegen.cpp (+22-3)
  • (modified) clang/test/OpenMP/target_private_codegen.cpp (+14-7)
  • (modified) clang/test/OpenMP/target_reduction_codegen.cpp (+12-6)
  • (modified) clang/test/OpenMP/target_simd_tl_codegen.cpp (+35-3)
  • (modified) clang/test/OpenMP/target_task_affinity_codegen.cpp (+6-2)
  • (modified) clang/test/OpenMP/target_teams_codegen.cpp (+66-22)
  • (modified) clang/test/OpenMP/target_teams_distribute_codegen.cpp (+42-14)
  • (modified) clang/test/OpenMP/target_teams_distribute_collapse_codegen.cpp (+18-18)
  • (modified) clang/test/OpenMP/target_teams_distribute_dist_schedule_codegen.cpp (+42-42)
  • (modified) clang/test/OpenMP/target_teams_distribute_firstprivate_codegen.cpp (+7-7)
  • (modified) clang/test/OpenMP/target_teams_distribute_lastprivate_codegen.cpp (+36-36)
  • (modified) clang/test/OpenMP/target_teams_distribute_parallel_for_codegen.cpp (+16-8)
  • (modified) clang/test/OpenMP/target_teams_distribute_parallel_for_collapse_codegen.cpp (+24-24)
  • (modified) clang/test/OpenMP/target_teams_distribute_parallel_for_dist_schedule_codegen.cpp (+60-60)
  • (modified) clang/test/OpenMP/target_teams_distribute_parallel_for_firstprivate_codegen.cpp (+138-128)
  • (modified) clang/test/OpenMP/target_teams_distribute_parallel_for_if_codegen.cpp (+34-34)
  • (modified) clang/test/OpenMP/target_teams_distribute_parallel_for_lastprivate_codegen.cpp (+50-50)
  • (modified) clang/test/OpenMP/target_teams_distribute_parallel_for_order_codegen.cpp (+4-4)
  • (modified) clang/test/OpenMP/target_teams_distribute_parallel_for_private_codegen.cpp (+94-84)
  • (modified) clang/test/OpenMP/target_teams_distribute_parallel_for_proc_bind_codegen.cpp (+11-11)
  • (modified) clang/test/OpenMP/target_teams_distribute_parallel_for_reduction_codegen.cpp (+29-29)
  • (modified) clang/test/OpenMP/target_teams_distribute_parallel_for_schedule_codegen.cpp (+192-192)
  • (modified) clang/test/OpenMP/target_teams_distribute_parallel_for_simd_codegen.cpp (+24-12)
  • (modified) clang/test/OpenMP/target_teams_distribute_parallel_for_simd_collapse_codegen.cpp (+24-24)
  • (modified) clang/test/OpenMP/target_teams_distribute_parallel_for_simd_dist_schedule_codegen.cpp (+60-60)
  • (modified) clang/test/OpenMP/target_teams_distribute_parallel_for_simd_firstprivate_codegen.cpp (+138-128)
  • (modified) clang/test/OpenMP/target_teams_distribute_parallel_for_simd_lastprivate_codegen.cpp (+50-50)
  • (modified) clang/test/OpenMP/target_teams_distribute_parallel_for_simd_private_codegen.cpp (+94-84)
  • (modified) clang/test/OpenMP/target_teams_distribute_parallel_for_simd_proc_bind_codegen.cpp (+11-11)
  • (modified) clang/test/OpenMP/target_teams_distribute_parallel_for_simd_reduction_codegen.cpp (+29-29)
  • (modified) clang/test/OpenMP/target_teams_distribute_parallel_for_simd_schedule_codegen.cpp (+192-192)
  • (modified) clang/test/OpenMP/target_teams_distribute_private_codegen.cpp (+7-7)
  • (modified) clang/test/OpenMP/target_teams_distribute_reduction_codegen.cpp (+145-145)
  • (modified) clang/test/OpenMP/target_teams_distribute_simd_codegen.cpp (+84-28)
  • (modified) clang/test/OpenMP/target_teams_distribute_simd_collapse_codegen.cpp (+18-18)
  • (modified) clang/test/OpenMP/target_teams_distribute_simd_dist_schedule_codegen.cpp (+42-42)
  • (modified) clang/test/OpenMP/target_teams_distribute_simd_firstprivate_codegen.cpp (+7-7)
  • (modified) clang/test/OpenMP/target_teams_distribute_simd_lastprivate_codegen.cpp (+36-36)
  • (modified) clang/test/OpenMP/target_teams_distribute_simd_private_codegen.cpp (+7-7)
  • (modified) clang/test/OpenMP/target_teams_distribute_simd_reduction_codegen.cpp (+19-19)
  • (modified) clang/test/OpenMP/target_teams_generic_loop_codegen-1.cpp (+16-8)
  • (modified) clang/test/OpenMP/target_teams_generic_loop_codegen.cpp (+15-12)
  • (modified) clang/test/OpenMP/target_teams_generic_loop_collapse_codegen.cpp (+24-24)
  • (modified) clang/test/OpenMP/target_teams_generic_loop_depend_codegen.cpp (+4-6)
  • (modified) clang/test/OpenMP/target_teams_generic_loop_if_codegen.cpp (+34-34)
  • (modified) clang/test/OpenMP/target_teams_generic_loop_order_codegen.cpp (+4-4)
  • (modified) clang/test/OpenMP/target_teams_generic_loop_private_codegen.cpp (+94-84)
  • (modified) clang/test/OpenMP/target_teams_generic_loop_reduction_codegen.cpp (+29-29)
  • (modified) clang/test/OpenMP/target_teams_generic_loop_uses_allocators_codegen.cpp (+3-3)
  • (modified) clang/test/OpenMP/target_teams_map_codegen.cpp (+130-94)
  • (modified) clang/test/OpenMP/target_teams_num_teams_codegen.cpp (+78-54)
  • (modified) clang/test/OpenMP/target_teams_thread_limit_codegen.cpp (+44-20)
  • (modified) clang/test/OpenMP/teams_codegen.cpp (+72-56)
  • (modified) llvm/include/llvm/Frontend/OpenMP/OMPIRBuilder.h (+4-1)
  • (modified) llvm/include/llvm/Frontend/OpenMP/OMPKinds.def (+6-2)
  • (modified) llvm/lib/Frontend/OpenMP/OMPIRBuilder.cpp (+30-2)
  • (modified) llvm/test/Transforms/OpenMP/add_attributes.ll (+4-4)
  • (modified) llvm/test/Transforms/OpenMP/always_inline_device.ll (+4-4)
  • (modified) llvm/test/Transforms/OpenMP/custom_state_machines.ll (+85-85)
  • (modified) llvm/test/Transforms/OpenMP/custom_state_machines_pre_lto.ll (+148-148)
  • (modified) llvm/test/Transforms/OpenMP/custom_state_machines_remarks.ll (+5-5)
  • (modified) llvm/test/Transforms/OpenMP/deduplication_target.ll (+5-5)
  • (modified) llvm/test/Transforms/OpenMP/get_hardware_num_threads_in_block_fold.ll (+13-13)
  • (modified) llvm/test/Transforms/OpenMP/get_hardware_num_threads_in_block_fold_optnone.ll (+7-7)
  • (modified) llvm/test/Transforms/OpenMP/global_constructor.ll (+5-5)
  • (modified) llvm/test/Transforms/OpenMP/globalization_remarks.ll (+2-2)
  • (modified) llvm/test/Transforms/OpenMP/gpu_state_machine_function_ptr_replacement.ll (+2-2)
  • (modified) llvm/test/Transforms/OpenMP/indirect_call_kernel_info_crash.ll (+3-3)
  • (modified) llvm/test/Transforms/OpenMP/is_spmd_exec_mode_fold.ll (+9-9)
  • (modified) llvm/test/Transforms/OpenMP/nested_parallelism.ll (+7-7)
  • (modified) llvm/test/Transforms/OpenMP/parallel_level_fold.ll (+7-7)
  • (modified) llvm/test/Transforms/OpenMP/remove_globalization.ll (+9-9)
  • (modified) llvm/test/Transforms/OpenMP/replace_globalization.ll (+14-14)
  • (modified) llvm/test/Transforms/OpenMP/single_threaded_execution.ll (+3-3)
  • (modified) llvm/test/Transforms/OpenMP/spmdization.ll (+49-49)
  • (modified) llvm/test/Transforms/OpenMP/spmdization_assumes.ll (+5-5)
  • (modified) llvm/test/Transforms/OpenMP/spmdization_constant_prop.ll (+3-3)
  • (modified) llvm/test/Transforms/OpenMP/spmdization_guarding.ll (+9-9)
  • (modified) llvm/test/Transforms/OpenMP/spmdization_guarding_two_reaching_kernels.ll (+15-15)
  • (modified) llvm/test/Transforms/OpenMP/spmdization_indirect.ll (+15-15)
  • (modified) llvm/test/Transforms/OpenMP/spmdization_kernel_env_dep.ll (+7-6)
  • (modified) llvm/test/Transforms/OpenMP/spmdization_no_guarding_two_reaching_kernels.ll (+15-15)
  • (modified) llvm/test/Transforms/OpenMP/spmdization_remarks.ll (+5-5)
  • (modified) llvm/test/Transforms/OpenMP/value-simplify-openmp-opt.ll (+7-7)
  • (modified) llvm/unittests/Frontend/OpenMPIRBuilderTest.cpp (+12-5)
  • (modified) openmp/libomptarget/DeviceRTL/include/Interface.h (+5-1)
  • (modified) openmp/libomptarget/DeviceRTL/include/State.h (+8-2)
  • (modified) openmp/libomptarget/DeviceRTL/src/Kernel.cpp (+10-6)
  • (modified) openmp/libomptarget/DeviceRTL/src/Reduction.cpp (+7-3)
  • (modified) openmp/libomptarget/DeviceRTL/src/State.cpp (+11-1)
  • (modified) openmp/libomptarget/include/Environment.h (+7)
  • (modified) openmp/libomptarget/plugins-nextgen/common/PluginInterface/PluginInterface.cpp (+61-11)
  • (modified) openmp/libomptarget/plugins-nextgen/common/PluginInterface/PluginInterface.h (+21-2)
  • (modified) openmp/libomptarget/test/offloading/malloc_parallel.c (+2-2)
  • (added) openmp/libomptarget/test/offloading/parallel_target_teams_reduction.cpp (+36)
diff --git a/clang/lib/CodeGen/CGOpenMPRuntimeGPU.cpp b/clang/lib/CodeGen/CGOpenMPRuntimeGPU.cpp
index 9d00ebae702802a..de028b0209c171a 100644
--- a/clang/lib/CodeGen/CGOpenMPRuntimeGPU.cpp
+++ b/clang/lib/CodeGen/CGOpenMPRuntimeGPU.cpp
@@ -803,8 +803,30 @@ void CGOpenMPRuntimeGPU::emitKernelDeinit(CodeGenFunction &CGF,
   if (!IsSPMD)
     emitGenericVarsEpilog(CGF);
 
+  // This is temporary until we remove the fixed sized buffer.
+  ASTContext &C = CGM.getContext();
+  RecordDecl *StaticRD = C.buildImplicitRecord(
+      "_openmp_teams_reduction_type_$_", RecordDecl::TagKind::TTK_Union);
+  StaticRD->startDefinition();
+  for (const RecordDecl *TeamReductionRec : TeamsReductions) {
+    QualType RecTy = C.getRecordType(TeamReductionRec);
+    auto *Field = FieldDecl::Create(
+        C, StaticRD, SourceLocation(), SourceLocation(), nullptr, RecTy,
+        C.getTrivialTypeSourceInfo(RecTy, SourceLocation()),
+        /*BW=*/nullptr, /*Mutable=*/false,
+        /*InitStyle=*/ICIS_NoInit);
+    Field->setAccess(AS_public);
+    StaticRD->addDecl(Field);
+  }
+  StaticRD->completeDefinition();
+  QualType StaticTy = C.getRecordType(StaticRD);
+  llvm::Type *LLVMReductionsBufferTy =
+      CGM.getTypes().ConvertTypeForMem(StaticTy);
+  const auto &DL = CGM.getModule().getDataLayout();
+  uint64_t BufferSize =
+      DL.getTypeAllocSize(LLVMReductionsBufferTy).getFixedValue();
   CGBuilderTy &Bld = CGF.Builder;
-  OMPBuilder.createTargetDeinit(Bld);
+  OMPBuilder.createTargetDeinit(Bld, BufferSize);
 }
 
 void CGOpenMPRuntimeGPU::emitSPMDKernel(const OMPExecutableDirective &D,
@@ -2998,15 +3020,10 @@ void CGOpenMPRuntimeGPU::emitReduction(
         CGM.getContext(), PrivatesReductions, std::nullopt, VarFieldMap,
         C.getLangOpts().OpenMPCUDAReductionBufNum);
     TeamsReductions.push_back(TeamReductionRec);
-    if (!KernelTeamsReductionPtr) {
-      KernelTeamsReductionPtr = new llvm::GlobalVariable(
-          CGM.getModule(), CGM.VoidPtrTy, /*isConstant=*/true,
-          llvm::GlobalValue::InternalLinkage, nullptr,
-          "_openmp_teams_reductions_buffer_$_$ptr");
-    }
-    llvm::Value *GlobalBufferPtr = CGF.EmitLoadOfScalar(
-        Address(KernelTeamsReductionPtr, CGF.VoidPtrTy, CGM.getPointerAlign()),
-        /*Volatile=*/false, C.getPointerType(C.VoidPtrTy), Loc);
+    auto *KernelTeamsReductionPtr = CGF.EmitRuntimeCall(
+        OMPBuilder.getOrCreateRuntimeFunction(
+            CGM.getModule(), OMPRTL___kmpc_reduction_get_fixed_buffer),
+        {}, "_openmp_teams_reductions_buffer_$_$ptr");
     llvm::Value *GlobalToBufferCpyFn = ::emitListToGlobalCopyFunction(
         CGM, Privates, ReductionArrayTy, Loc, TeamReductionRec, VarFieldMap);
     llvm::Value *GlobalToBufferRedFn = ::emitListToGlobalReduceFunction(
@@ -3021,7 +3038,7 @@ void CGOpenMPRuntimeGPU::emitReduction(
     llvm::Value *Args[] = {
         RTLoc,
         ThreadId,
-        GlobalBufferPtr,
+        KernelTeamsReductionPtr,
         CGF.Builder.getInt32(C.getLangOpts().OpenMPCUDAReductionBufNum),
         RL,
         ShuffleAndReduceFn,
@@ -3654,42 +3671,6 @@ void CGOpenMPRuntimeGPU::processRequiresDirective(
   CGOpenMPRuntime::processRequiresDirective(D);
 }
 
-void CGOpenMPRuntimeGPU::clear() {
-
-  if (!TeamsReductions.empty()) {
-    ASTContext &C = CGM.getContext();
-    RecordDecl *StaticRD = C.buildImplicitRecord(
-        "_openmp_teams_reduction_type_$_", RecordDecl::TagKind::TTK_Union);
-    StaticRD->startDefinition();
-    for (const RecordDecl *TeamReductionRec : TeamsReductions) {
-      QualType RecTy = C.getRecordType(TeamReductionRec);
-      auto *Field = FieldDecl::Create(
-          C, StaticRD, SourceLocation(), SourceLocation(), nullptr, RecTy,
-          C.getTrivialTypeSourceInfo(RecTy, SourceLocation()),
-          /*BW=*/nullptr, /*Mutable=*/false,
-          /*InitStyle=*/ICIS_NoInit);
-      Field->setAccess(AS_public);
-      StaticRD->addDecl(Field);
-    }
-    StaticRD->completeDefinition();
-    QualType StaticTy = C.getRecordType(StaticRD);
-    llvm::Type *LLVMReductionsBufferTy =
-        CGM.getTypes().ConvertTypeForMem(StaticTy);
-    // FIXME: nvlink does not handle weak linkage correctly (object with the
-    // different size are reported as erroneous).
-    // Restore CommonLinkage as soon as nvlink is fixed.
-    auto *GV = new llvm::GlobalVariable(
-        CGM.getModule(), LLVMReductionsBufferTy,
-        /*isConstant=*/false, llvm::GlobalValue::InternalLinkage,
-        llvm::Constant::getNullValue(LLVMReductionsBufferTy),
-        "_openmp_teams_reductions_buffer_$_");
-    KernelTeamsReductionPtr->setInitializer(
-        llvm::ConstantExpr::getPointerBitCastOrAddrSpaceCast(GV,
-                                                             CGM.VoidPtrTy));
-  }
-  CGOpenMPRuntime::clear();
-}
-
 llvm::Value *CGOpenMPRuntimeGPU::getGPUNumThreads(CodeGenFunction &CGF) {
   CGBuilderTy &Bld = CGF.Builder;
   llvm::Module *M = &CGF.CGM.getModule();
diff --git a/clang/lib/CodeGen/CGOpenMPRuntimeGPU.h b/clang/lib/CodeGen/CGOpenMPRuntimeGPU.h
index 46e1361f2f895ba..141436f26230dde 100644
--- a/clang/lib/CodeGen/CGOpenMPRuntimeGPU.h
+++ b/clang/lib/CodeGen/CGOpenMPRuntimeGPU.h
@@ -130,7 +130,6 @@ class CGOpenMPRuntimeGPU : public CGOpenMPRuntime {
 
 public:
   explicit CGOpenMPRuntimeGPU(CodeGenModule &CGM);
-  void clear() override;
 
   bool isGPU() const override { return true; };
 
@@ -386,7 +385,6 @@ class CGOpenMPRuntimeGPU : public CGOpenMPRuntime {
   /// Maps the function to the list of the globalized variables with their
   /// addresses.
   llvm::SmallDenseMap<llvm::Function *, FunctionData> FunctionGlobalizedDecls;
-  llvm::GlobalVariable *KernelTeamsReductionPtr = nullptr;
   /// List of the records with the list of fields for the reductions across the
   /// teams. Used to build the intermediate buffer for the fast teams
   /// reductions.
diff --git a/clang/lib/Sema/SemaOpenMP.cpp b/clang/lib/Sema/SemaOpenMP.cpp
index 75f9e152dca9297..145f4dc4670081d 100644
--- a/clang/lib/Sema/SemaOpenMP.cpp
+++ b/clang/lib/Sema/SemaOpenMP.cpp
@@ -4249,12 +4249,15 @@ void Sema::ActOnOpenMPRegionStart(OpenMPDirectiveKind DKind, Scope *CurScope) {
     getCurCapturedRegion()->TheCapturedDecl->addAttr(
         AlwaysInlineAttr::CreateImplicit(
             Context, {}, AlwaysInlineAttr::Keyword_forceinline));
-    Sema::CapturedParamNameType ParamsTarget[] = {
-        std::make_pair(StringRef(), QualType()) // __context with shared vars
-    };
+    SmallVector<Sema::CapturedParamNameType, 2> ParamsTarget;
+    if (getLangOpts().OpenMPIsTargetDevice)
+      ParamsTarget.push_back(std::make_pair(StringRef("dyn_ptr"), VoidPtrTy));
+    ParamsTarget.push_back(
+        std::make_pair(StringRef(), QualType())); // __context with shared vars;
     // Start a captured region for 'target' with no implicit parameters.
     ActOnCapturedRegionStart(DSAStack->getConstructLoc(), CurScope, CR_OpenMP,
-                             ParamsTarget, /*OpenMPCaptureLevel=*/1);
+                             ParamsTarget,
+                             /*OpenMPCaptureLevel=*/1);
     Sema::CapturedParamNameType ParamsTeamsOrParallel[] = {
         std::make_pair(".global_tid.", KmpInt32PtrTy),
         std::make_pair(".bound_tid.", KmpInt32PtrTy),
@@ -4293,8 +4296,13 @@ void Sema::ActOnOpenMPRegionStart(OpenMPDirectiveKind DKind, Scope *CurScope) {
     getCurCapturedRegion()->TheCapturedDecl->addAttr(
         AlwaysInlineAttr::CreateImplicit(
             Context, {}, AlwaysInlineAttr::Keyword_forceinline));
+    SmallVector<Sema::CapturedParamNameType, 2> ParamsTarget;
+    if (getLangOpts().OpenMPIsTargetDevice)
+      ParamsTarget.push_back(std::make_pair(StringRef("dyn_ptr"), VoidPtrTy));
+    ParamsTarget.push_back(
+        std::make_pair(StringRef(), QualType())); // __context with shared vars;
     ActOnCapturedRegionStart(DSAStack->getConstructLoc(), CurScope, CR_OpenMP,
-                             std::make_pair(StringRef(), QualType()),
+                             ParamsTarget,
                              /*OpenMPCaptureLevel=*/1);
     break;
   }
@@ -4499,9 +4507,11 @@ void Sema::ActOnOpenMPRegionStart(OpenMPDirectiveKind DKind, Scope *CurScope) {
     getCurCapturedRegion()->TheCapturedDecl->addAttr(
         AlwaysInlineAttr::CreateImplicit(
             Context, {}, AlwaysInlineAttr::Keyword_forceinline));
-    Sema::CapturedParamNameType ParamsTarget[] = {
-        std::make_pair(StringRef(), QualType()) // __context with shared vars
-    };
+    SmallVector<Sema::CapturedParamNameType, 2> ParamsTarget;
+    if (getLangOpts().OpenMPIsTargetDevice)
+      ParamsTarget.push_back(std::make_pair(StringRef("dyn_ptr"), VoidPtrTy));
+    ParamsTarget.push_back(
+        std::make_pair(StringRef(), QualType())); // __context with shared vars;
     // Start a captured region for 'target' with no implicit parameters.
     ActOnCapturedRegionStart(DSAStack->getConstructLoc(), CurScope, CR_OpenMP,
                              ParamsTarget, /*OpenMPCaptureLevel=*/1);
diff --git a/clang/test/OpenMP/amdgcn_target_codegen.cpp b/clang/test/OpenMP/amdgcn_target_codegen.cpp
index 90d2ebdf26bd645..3ea2d107f072adb 100644
--- a/clang/test/OpenMP/amdgcn_target_codegen.cpp
+++ b/clang/test/OpenMP/amdgcn_target_codegen.cpp
@@ -29,15 +29,18 @@ int test_amdgcn_target_tid_threads_simd() {
 
 #endif
 // CHECK-LABEL: define {{[^@]+}}@{{__omp_offloading_[0-9a-z]+_[0-9a-z]+}}__Z30test_amdgcn_target_tid_threadsv_l14
-// CHECK-SAME: (ptr noundef nonnull align 4 dereferenceable(4000) [[ARR:%.*]]) #[[ATTR0:[0-9]+]] {
+// CHECK-SAME: (ptr noalias noundef [[DYN_PTR:%.*]], ptr noundef nonnull align 4 dereferenceable(4000) [[ARR:%.*]]) #[[ATTR0:[0-9]+]] {
 // CHECK-NEXT:  entry:
+// CHECK-NEXT:    [[DYN_PTR_ADDR:%.*]] = alloca ptr, align 8, addrspace(5)
 // CHECK-NEXT:    [[ARR_ADDR:%.*]] = alloca ptr, align 8, addrspace(5)
 // CHECK-NEXT:    [[I:%.*]] = alloca i32, align 4, addrspace(5)
+// CHECK-NEXT:    [[DYN_PTR_ADDR_ASCAST:%.*]] = addrspacecast ptr addrspace(5) [[DYN_PTR_ADDR]] to ptr
 // CHECK-NEXT:    [[ARR_ADDR_ASCAST:%.*]] = addrspacecast ptr addrspace(5) [[ARR_ADDR]] to ptr
 // CHECK-NEXT:    [[I_ASCAST:%.*]] = addrspacecast ptr addrspace(5) [[I]] to ptr
+// CHECK-NEXT:    store ptr [[DYN_PTR]], ptr [[DYN_PTR_ADDR_ASCAST]], align 8
 // CHECK-NEXT:    store ptr [[ARR]], ptr [[ARR_ADDR_ASCAST]], align 8
 // CHECK-NEXT:    [[TMP0:%.*]] = load ptr, ptr [[ARR_ADDR_ASCAST]], align 8
-// CHECK-NEXT:    [[TMP1:%.*]] = call i32 @__kmpc_target_init(ptr addrspacecast (ptr addrspace(1) @{{__omp_offloading_[0-9a-z]+_[0-9a-z]+}}__Z30test_amdgcn_target_tid_threadsv_l14_kernel_environment to ptr))
+// CHECK-NEXT:    [[TMP1:%.*]] = call i32 @__kmpc_target_init(ptr addrspacecast (ptr addrspace(1) @{{__omp_offloading_[0-9a-z]+_[0-9a-z]+}}__Z30test_amdgcn_target_tid_threadsv_l14_kernel_environment to ptr), ptr [[DYN_PTR]])
 // CHECK-NEXT:    [[EXEC_USER_CODE:%.*]] = icmp eq i32 [[TMP1]], -1
 // CHECK-NEXT:    br i1 [[EXEC_USER_CODE]], label [[USER_CODE_ENTRY:%.*]], label [[WORKER_EXIT:%.*]]
 // CHECK:       user_code.entry:
@@ -66,19 +69,22 @@ int test_amdgcn_target_tid_threads_simd() {
 //
 //
 // CHECK-LABEL: define {{[^@]+}}@{{__omp_offloading_[0-9a-z]+_[0-9a-z]+}}__Z35test_amdgcn_target_tid_threads_simdv_l23
-// CHECK-SAME: (ptr noundef nonnull align 4 dereferenceable(4000) [[ARR:%.*]]) #[[ATTR1:[0-9]+]] {
+// CHECK-SAME: (ptr noalias noundef [[DYN_PTR:%.*]], ptr noundef nonnull align 4 dereferenceable(4000) [[ARR:%.*]]) #[[ATTR1:[0-9]+]] {
 // CHECK-NEXT:  entry:
+// CHECK-NEXT:    [[DYN_PTR_ADDR:%.*]] = alloca ptr, align 8, addrspace(5)
 // CHECK-NEXT:    [[ARR_ADDR:%.*]] = alloca ptr, align 8, addrspace(5)
 // CHECK-NEXT:    [[TMP:%.*]] = alloca i32, align 4, addrspace(5)
 // CHECK-NEXT:    [[DOTOMP_IV:%.*]] = alloca i32, align 4, addrspace(5)
 // CHECK-NEXT:    [[I:%.*]] = alloca i32, align 4, addrspace(5)
+// CHECK-NEXT:    [[DYN_PTR_ADDR_ASCAST:%.*]] = addrspacecast ptr addrspace(5) [[DYN_PTR_ADDR]] to ptr
 // CHECK-NEXT:    [[ARR_ADDR_ASCAST:%.*]] = addrspacecast ptr addrspace(5) [[ARR_ADDR]] to ptr
 // CHECK-NEXT:    [[TMP_ASCAST:%.*]] = addrspacecast ptr addrspace(5) [[TMP]] to ptr
 // CHECK-NEXT:    [[DOTOMP_IV_ASCAST:%.*]] = addrspacecast ptr addrspace(5) [[DOTOMP_IV]] to ptr
 // CHECK-NEXT:    [[I_ASCAST:%.*]] = addrspacecast ptr addrspace(5) [[I]] to ptr
+// CHECK-NEXT:    store ptr [[DYN_PTR]], ptr [[DYN_PTR_ADDR_ASCAST]], align 8
 // CHECK-NEXT:    store ptr [[ARR]], ptr [[ARR_ADDR_ASCAST]], align 8
 // CHECK-NEXT:    [[TMP0:%.*]] = load ptr, ptr [[ARR_ADDR_ASCAST]], align 8
-// CHECK-NEXT:    [[TMP1:%.*]] = call i32 @__kmpc_target_init(ptr addrspacecast (ptr addrspace(1) @{{__omp_offloading_[0-9a-z]+_[0-9a-z]+}}__Z35test_amdgcn_target_tid_threads_simdv_l23_kernel_environment to ptr))
+// CHECK-NEXT:    [[TMP1:%.*]] = call i32 @__kmpc_target_init(ptr addrspacecast (ptr addrspace(1) @{{__omp_offloading_[0-9a-z]+_[0-9a-z]+}}__Z35test_amdgcn_target_tid_threads_simdv_l23_kernel_environment to ptr), ptr [[DYN_PTR]])
 // CHECK-NEXT:    [[EXEC_USER_CODE:%.*]] = icmp eq i32 [[TMP1]], -1
 // CHECK-NEXT:    br i1 [[EXEC_USER_CODE]], label [[USER_CODE_ENTRY:%.*]], label [[WORKER_EXIT:%.*]]
 // CHECK:       user_code.entry:
diff --git a/clang/test/OpenMP/amdgcn_target_device_vla.cpp b/clang/test/OpenMP/amdgcn_target_device_vla.cpp
index b2b630b546713dd..de150a0fcb4afd2 100644
--- a/clang/test/OpenMP/amdgcn_target_device_vla.cpp
+++ b/clang/test/OpenMP/amdgcn_target_device_vla.cpp
@@ -97,21 +97,24 @@ int main() {
 
 #endif
 // CHECK-LABEL: define {{[^@]+}}@{{__omp_offloading_[0-9a-z]+_[0-9a-z]+}}__Z4foo1v_l12
-// CHECK-SAME: (ptr noundef nonnull align 4 dereferenceable(4) [[SUM:%.*]]) #[[ATTR0:[0-9]+]] {
+// CHECK-SAME: (ptr noalias noundef [[DYN_PTR:%.*]], ptr noundef nonnull align 4 dereferenceable(4) [[SUM:%.*]]) #[[ATTR0:[0-9]+]] {
 // CHECK-NEXT:  entry:
+// CHECK-NEXT:    [[DYN_PTR_ADDR:%.*]] = alloca ptr, align 8, addrspace(5)
 // CHECK-NEXT:    [[SUM_ADDR:%.*]] = alloca ptr, align 8, addrspace(5)
 // CHECK-NEXT:    [[N:%.*]] = alloca i32, align 4, addrspace(5)
 // CHECK-NEXT:    [[__VLA_EXPR0:%.*]] = alloca i64, align 8, addrspace(5)
 // CHECK-NEXT:    [[I:%.*]] = alloca i32, align 4, addrspace(5)
 // CHECK-NEXT:    [[I1:%.*]] = alloca i32, align 4, addrspace(5)
+// CHECK-NEXT:    [[DYN_PTR_ADDR_ASCAST:%.*]] = addrspacecast ptr addrspace(5) [[DYN_PTR_ADDR]] to ptr
 // CHECK-NEXT:    [[SUM_ADDR_ASCAST:%.*]] = addrspacecast ptr addrspace(5) [[SUM_ADDR]] to ptr
 // CHECK-NEXT:    [[N_ASCAST:%.*]] = addrspacecast ptr addrspace(5) [[N]] to ptr
 // CHECK-NEXT:    [[__VLA_EXPR0_ASCAST:%.*]] = addrspacecast ptr addrspace(5) [[__VLA_EXPR0]] to ptr
 // CHECK-NEXT:    [[I_ASCAST:%.*]] = addrspacecast ptr addrspace(5) [[I]] to ptr
 // CHECK-NEXT:    [[I1_ASCAST:%.*]] = addrspacecast ptr addrspace(5) [[I1]] to ptr
+// CHECK-NEXT:    store ptr [[DYN_PTR]], ptr [[DYN_PTR_ADDR_ASCAST]], align 8
 // CHECK-NEXT:    store ptr [[SUM]], ptr [[SUM_ADDR_ASCAST]], align 8
 // CHECK-NEXT:    [[TMP0:%.*]] = load ptr, ptr [[SUM_ADDR_ASCAST]], align 8
-// CHECK-NEXT:    [[TMP1:%.*]] = call i32 @__kmpc_target_init(ptr addrspacecast (ptr addrspace(1) @{{__omp_offloading_[0-9a-z]+_[0-9a-z]+}}__Z4foo1v_l12_kernel_environment to ptr))
+// CHECK-NEXT:    [[TMP1:%.*]] = call i32 @__kmpc_target_init(ptr addrspacecast (ptr addrspace(1) @{{__omp_offloading_[0-9a-z]+_[0-9a-z]+}}__Z4foo1v_l12_kernel_environment to ptr), ptr [[DYN_PTR]])
 // CHECK-NEXT:    [[EXEC_USER_CODE:%.*]] = icmp eq i32 [[TMP1]], -1
 // CHECK-NEXT:    br i1 [[EXEC_USER_CODE]], label [[USER_CODE_ENTRY:%.*]], label [[WORKER_EXIT:%.*]]
 // CHECK:       user_code.entry:
@@ -174,26 +177,29 @@ int main() {
 //
 //
 // CHECK-LABEL: define {{[^@]+}}@{{__omp_offloading_[0-9a-z]+_[0-9a-z]+}}__Z4foo2v_l30
-// CHECK-SAME: (i64 noundef [[M:%.*]], i64 noundef [[VLA:%.*]], ptr noundef nonnull align 4 dereferenceable(4) [[RESULT:%.*]]) #[[ATTR0]] {
+// CHECK-SAME: (ptr noalias noundef [[DYN_PTR:%.*]], i64 noundef [[M:%.*]], i64 noundef [[VLA:%.*]], ptr noundef nonnull align 4 dereferenceable(4) [[RESULT:%.*]]) #[[ATTR0]] {
 // CHECK-NEXT:  entry:
+// CHECK-NEXT:    [[DYN_PTR_ADDR:%.*]] = alloca ptr, align 8, addrspace(5)
 // CHECK-NEXT:    [[M_ADDR:%.*]] = alloca i64, align 8, addrspace(5)
 // CHECK-NEXT:    [[VLA_ADDR:%.*]] = alloca i64, align 8, addrspace(5)
 // CHECK-NEXT:    [[RESULT_ADDR:%.*]] = alloca ptr, align 8, addrspace(5)
 // CHECK-NEXT:    [[M_CASTED:%.*]] = alloca i64, align 8, addrspace(5)
 // CHECK-NEXT:    [[DOTZERO_ADDR:%.*]] = alloca i32, align 4, addrspace(5)
 // CHECK-NEXT:    [[DOTTHREADID_TEMP_:%.*]] = alloca i32, align 4, addrspace(5)
+// CHECK-NEXT:    [[DYN_PTR_ADDR_ASCAST:%.*]] = addrspacecast ptr addrspace(5) [[DYN_PTR_ADDR]] to ptr
 // CHECK-NEXT:    [[M_ADDR_ASCAST:%.*]] = addrspacecast ptr addrspace(5) [[M_ADDR]] to ptr
 // CHECK-NEXT:    [[VLA_ADDR_ASCAST:%.*]] = addrspacecast ptr addrspace(5) [[VLA_ADDR]] to ptr
 // CHECK-NEXT:    [[RESULT_ADDR_ASCAST:%.*]] = addrspacecast ptr addrspace(5) [[RESULT_ADDR]] to ptr
 // CHECK-NEXT:    [[M_CASTED_ASCAST:%.*]] = addrspacecast ptr addrspace(5) [[M_CASTED]] to ptr
 // CHECK-NEXT:    [[DOTZERO_ADDR_ASCAST:%.*]] = addrspacecast ptr addrspace(5) [[DOTZERO_ADDR]] to ptr
 // CHECK-NEXT:    [[DOTTHREADID_TEMP__ASCAST:%.*]] = addrspacecast ptr addrspace(5) [[DOTTHREADID_TEMP_]] to ptr
+// CHECK-NEXT:    store ptr [[DYN_PTR]], ptr [[DYN_PTR_ADDR_ASCAST]], align 8
 // CHECK-NEXT:    store i64 [[M]], ptr [[M_ADDR_ASCAST]], align 8
 // CHECK-NEXT:    store i64 [[VLA]], ptr [[VLA_ADDR_ASCAST]], align 8
 // CHECK-NEXT:    store ptr [[RESULT]], ptr [[RESULT_ADDR_ASCAST]], align 8
 // CHECK-NEXT:    [[TMP0:%.*]] = load i64, ptr [[VLA_ADDR_ASCAST]], align 8
 // CHECK-NEXT:    [[TMP1:%.*]] = load ptr, ptr [[RESULT_ADDR_ASCAST]], align 8
-// CHECK-NEXT:    [[TMP2:%.*]] = call i32 @__kmpc_target_init(ptr addrspacecast (ptr addrspace(1) @{{__omp_offloading_[0-9a-z]+_[0-9a-z]+}}__Z4foo2v_l30_kernel_environment to ptr))
+// CHECK-NEXT:    [[TMP2:%.*]] = call i32 @__kmpc_target_init(ptr addrspacecast (ptr addrspace(1) @{{__omp_offloading_[0-9a-z]+_[0-9a-z]+}}__Z4foo2v_l30_kernel_environment to ptr), ptr [[DYN_PTR]])
 // CHECK-NEXT:    [[EXEC_USER_CODE:%.*]] = icmp eq i32 [[TMP2]], -1
 // CHECK-NEXT:    br i1 [[EXEC_USER_CODE]], label [[USER_CODE_ENTRY:%.*]], label [[WORKER_EXIT:%.*]]
 // CHECK:       user_code.entry:
@@ -540,26 +546,29 @@ int main() {
 //
 //
 // CHECK-LABEL: define {{[^@]+}}@{{__omp_offloading_[0-9a-z]+_[0-9a-z]+}}__Z4foo3v_l52
-// CHECK-SAME: (i64 noundef [[M:%.*]], i64 noundef [[VLA:%.*]], ptr noundef nonnull align 4 dereferenceable(4) [[RESULT:%.*]]) #[[ATTR0]] {
+// CHECK-SAME: (ptr noalias noundef [[DYN_PTR:%.*]], i64 noundef [[M:%.*]], i64 noundef [[VLA:%.*]], ptr noundef nonnull align 4 dereferenceable(4) [[RESULT:%.*]]) #[[ATTR0]] {
 // CHECK-NEXT:  entry:
+// CHECK-NEXT:    [[DYN_PTR_ADDR:%.*]] = alloca ptr, align 8, addrspace(5)
 // CHECK-NEXT:    [[M_ADDR:%.*]] = alloca i64, align 8, addrspace(5)
 // CHECK-NEXT:    [[VLA_ADDR:%.*]] = alloca i64, align 8, addrspace(5)
 // CHECK-NEXT:    [[RESULT_ADDR:%.*]] = alloca ptr, align 8, addrspace(5)
 // CHECK-NEXT:    [[M_CASTED:%.*]] = alloca i64, align 8, addrspace(5)
 // CHECK-NEXT:    [[DOTZERO_ADDR:%.*]] = alloca i32, align 4, addrspace(5)
 // CHECK-NEXT:    [[DOTTHREADID_TEMP_:%.*]] = alloca i32, align 4, addrspace(5)
+// CHECK-NEXT:    [[DYN_PTR_ADDR_ASCAST:%.*]] = addrspacecast ptr addrspace(5) [[DYN_PTR_ADDR]] to ptr
 // CHECK-NEXT:    [[M_ADDR_ASCAST:%.*]] = addrspacecast ptr addrspace(5) [[M_ADDR]] to ptr
 // CHECK-NEXT:    [[VLA_ADDR_ASCAST:%.*]] = addrspacecast ptr addrspace(5) [[VLA_ADDR]] to ptr
 // CHECK-NEXT:    [[RESULT_ADDR_ASCAST:%.*]] = addrspacecast ptr addrspace(5) [[RESULT_ADDR]] to ptr
 // CHECK-NEXT:    [[M_CASTED_ASCAST:%.*]] = addrspacecast ptr addrspace(5) [[M_CASTED]] to ptr
 // CHECK-NEXT:    [[DOTZERO_ADDR_ASCAST:%.*]] = addrspacecast ptr addrspace(5) [[DOTZERO_ADDR]] to ptr
 // CHECK-NEXT:    [[DOTTHREADID_TEMP__ASCAST:%.*]] = addrspacecast ptr addrspace(5) [[DOTTHREADID_TEMP_]] to ptr
+// CHECK-NEXT:    store ptr [[DYN_PTR]], ptr [[DYN_PTR_ADDR_ASCAST]], align 8
 // CHECK-NEXT:    store i64 [[M]], ptr [[M_ADDR_ASCAST]], align 8
 // CHECK-NEXT:    store i64 [[VLA]], ptr [[VLA_ADDR_ASCAST]], align 8
 // CHECK-NEXT:    store ptr [[RESULT]], ptr [[RESULT_ADDR_ASCAST]], align 8
 // CHECK-NEXT:    [[TMP0:%.*]] = load...
[truncated]
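
The truncated remainder of the patch carries the DeviceRTL, Environment.h, and plugin-interface changes listed above. The overall flow shown by the visible hunks: Clang computes the size of the union of team-reduction records and forwards it through createTargetDeinit (first hunk), the plugin allocates per-launch memory for it (the PluginInterface changes), and the generated device code obtains the pointer by calling __kmpc_reduction_get_fixed_buffer() instead of loading a static global. The sketch below is a rough, hypothetical rendering of the device-runtime side under the assumption that the runtime keeps the launch-environment pointer in per-kernel state; it is not the actual DeviceRTL code.

// Hypothetical sketch of the device-runtime side; simplified, not the real
// openmp/libomptarget/DeviceRTL sources.
struct KernelLaunchEnvironmentTy {
  void *ReductionBuffer; // allocated by the plugin for this specific launch
};

// Assumed to be set once at kernel initialization from the implicit launch
// environment argument the kernel now receives.
static KernelLaunchEnvironmentTy *LaunchEnvironment = nullptr;

extern "C" void *__kmpc_reduction_get_fixed_buffer() {
  // Each launch sees its own allocation, so concurrent launches of the same
  // (or different) team-reduction kernels no longer share reduction storage.
  return LaunchEnvironment->ReductionBuffer;
}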

@jdoerfert force-pushed the non_racy_team_reductions branch 4 times, most recently from 5568f0a to 4d864e5, on October 31, 2023 at 22:04
@jdoerfert force-pushed the non_racy_team_reductions branch from 4d864e5 to 04aafdc on November 1, 2023 at 03:17
@jdoerfert changed the title from "[OpenMP] Non racy team reductions" to "[OpenMP][FIX] Allocate per launch memory for GPU team reductions" on Nov 1, 2023
@shiltian (Contributor) left a comment:

LG with some nits

@jdoerfert force-pushed the non_racy_team_reductions branch from 04aafdc to 1859bd4 on November 1, 2023 at 18:11
@jdoerfert merged commit f9a89e6 into llvm:main on Nov 1, 2023
searlmc1 pushed a commit to ROCm/llvm-project that referenced this pull request Nov 2, 2023
Revert items 2 and 3 and work separately with 1 to get time to integrate them into ASO
   [OpenMP] Introduce the KernelLaunchEnvironment as implicit
   [OpenMP][FIX] Allocate per launch memory for GPU team reductions (llvm#70752)
   [OpenMP][FIX] Do not add implicit argument to device Ctors and Dtors

Change-Id: I987405a1541ed3102ca78430496f611e565db9a0
searlmc1 pushed a commit to ROCm/llvm-project that referenced this pull request Nov 3, 2023
…m#70752)

We used to perform team reduction on global memory allocated in the
runtime and by clang. This was racy as multiple instances of a kernel,
or different kernels with team reductions, would use the same locations.
Since we now have the kernel launch environment, we can allocate dynamic
memory per-launch, allowing us to move all the state into a non-racy
place.

Fixes: llvm#70249

Change-Id: Id8a5932a1cde8cfcbb0e17655ef3f390f6f4d050
Labels
backend:AMDGPU, clang:codegen, clang:frontend, clang:openmp, clang, flang:openmp, llvm:transforms, mlir:llvm, mlir, openmp:libomptarget, openmp

Development
Successfully merging this pull request may close these issues:
  • [OpenMP] incorrect concurrent target reduction

3 participants