Skip to content

[Libomptarget] Fix GPU Dtors referencing possibly deallocated image #77828

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 1 commit into from
Jan 11, 2024

Conversation

jhuber6
Copy link
Contributor

@jhuber6 jhuber6 commented Jan 11, 2024

Summary:
The constructors and destructors look up a symbol in the ELF quickly to
determine if they need to be run on the GPU. This allows us to avoid the
very slow actions required to do the slower lookup using the vendor API.

One problem occurs with how we handle the lifetime of these images.
Right now there is no invariant to specify the lifetime of the
underlying binary image that is loaded. In the typical case, this comes
from the binary itself in the .llvm.offloading section, meaning that
the lifetime of the binary should match the executable itself. This
would work fine, if it weren't for the fact that the plugin is loaded
via dlopen and can have a teardown order out of sync with the main
executable.

This was likely what was occuring when this failed on some systems but
not others. A potential solution would be to simply copy images into
memory so the runtime does not rely on external references. Another
would be to manually zero these out after initialization as to prevent
this mistake from happening accidentally. The former has the benefit of
making some checks easier, and allowing for constant initialization be
done on the ELF itself (normally we can't do this because writing to a
constant section, e.g. .llvm.offloading is a segfault.). The downside
would be the extra time required to copy the image in bulk (Although we
are likely doing this in the vendor runtimes as well).

This patch went with a quick solution to simply set a boolean value at
initialization time if we need to call destructors.

Fixes: #77798

Summary:
The constructors and destructors look up a symbol in the ELF quickly to
determine if they need to be run on the GPU. This allows us to avoid the
very slow actions required to do the slower lookup using the vendor API.

One problem occurs with how we handle the lifetime of these images.
Right now there is no invariant to specify the lifetime of the
underlying binary image that is loaded. In the typical case, this comes
from the binary itself in the `.llvm.offloading` section, meaning that
the lifetime of the binary should match the executable itself. This
would work fine, if it weren't for the fact that the plugin is loaded
via `dlopen` and can have a teardown order out of sync with the main
executable.

This was likely what was occuring when this failed on some systems but
not others. A potential solution would be to simply copy images into
memory so the runtime does not rely on external references. Another
would be to manually zero these out after initialization as to prevent
this mistake from happening accidentally. The former has the benefit of
making some checks easier, and allowing for constant initialization be
done on the ELF itself (normally we can't do this because writing to a
constant section, e.g. .llvm.offloading is a segfault.). The downside
would be the extra time required to copy the image in bulk (Although we
are likely doing this in the vendor runtimes as well).

This patch went with a quick solution to simply set a boolean value at
initialization time if we need to call destructors.

Fixes: llvm#77798
@llvmbot
Copy link
Member

llvmbot commented Jan 11, 2024

@llvm/pr-subscribers-backend-amdgpu

Author: Joseph Huber (jhuber6)

Changes

Summary:
The constructors and destructors look up a symbol in the ELF quickly to
determine if they need to be run on the GPU. This allows us to avoid the
very slow actions required to do the slower lookup using the vendor API.

One problem occurs with how we handle the lifetime of these images.
Right now there is no invariant to specify the lifetime of the
underlying binary image that is loaded. In the typical case, this comes
from the binary itself in the .llvm.offloading section, meaning that
the lifetime of the binary should match the executable itself. This
would work fine, if it weren't for the fact that the plugin is loaded
via dlopen and can have a teardown order out of sync with the main
executable.

This was likely what was occuring when this failed on some systems but
not others. A potential solution would be to simply copy images into
memory so the runtime does not rely on external references. Another
would be to manually zero these out after initialization as to prevent
this mistake from happening accidentally. The former has the benefit of
making some checks easier, and allowing for constant initialization be
done on the ELF itself (normally we can't do this because writing to a
constant section, e.g. .llvm.offloading is a segfault.). The downside
would be the extra time required to copy the image in bulk (Although we
are likely doing this in the vendor runtimes as well).

This patch went with a quick solution to simply set a boolean value at
initialization time if we need to call destructors.

Fixes: #77798


Full diff: https://github.com/llvm/llvm-project/pull/77828.diff

3 Files Affected:

  • (modified) openmp/libomptarget/plugins-nextgen/amdgpu/src/rtl.cpp (+13-5)
  • (modified) openmp/libomptarget/plugins-nextgen/common/include/PluginInterface.h (+12-1)
  • (modified) openmp/libomptarget/plugins-nextgen/cuda/src/rtl.cpp (+10-2)
diff --git a/openmp/libomptarget/plugins-nextgen/amdgpu/src/rtl.cpp b/openmp/libomptarget/plugins-nextgen/amdgpu/src/rtl.cpp
index 8424d0f5df0822..e7a38a93c9acbe 100644
--- a/openmp/libomptarget/plugins-nextgen/amdgpu/src/rtl.cpp
+++ b/openmp/libomptarget/plugins-nextgen/amdgpu/src/rtl.cpp
@@ -1976,12 +1976,18 @@ struct AMDGPUDeviceTy : public GenericDeviceTy, AMDGenericDeviceTy {
 
   virtual Error callGlobalConstructors(GenericPluginTy &Plugin,
                                        DeviceImageTy &Image) override {
-    return callGlobalCtorDtorCommon(Plugin, Image, "amdgcn.device.init");
+    GenericGlobalHandlerTy &Handler = Plugin.getGlobalHandler();
+    if (Handler.isSymbolInImage(*this, Image, "amdgcn.device.fini"))
+      Image.setPendingGlobalDtors();
+
+    return callGlobalCtorDtorCommon(Plugin, Image, /*IsCtor=*/true);
   }
 
   virtual Error callGlobalDestructors(GenericPluginTy &Plugin,
                                       DeviceImageTy &Image) override {
-    return callGlobalCtorDtorCommon(Plugin, Image, "amdgcn.device.fini");
+    if (Image.hasPendingGlobalDtors())
+      return callGlobalCtorDtorCommon(Plugin, Image, /*IsCtor=*/false);
+    return Plugin::success();
   }
 
   const uint64_t getStreamBusyWaitMicroseconds() const {
@@ -2701,15 +2707,17 @@ struct AMDGPUDeviceTy : public GenericDeviceTy, AMDGenericDeviceTy {
   /// Common method to invoke a single threaded constructor or destructor
   /// kernel by name.
   Error callGlobalCtorDtorCommon(GenericPluginTy &Plugin, DeviceImageTy &Image,
-                                 const char *Name) {
+                                 bool IsCtor) {
+    const char *KernelName =
+        IsCtor ? "amdgcn.device.init" : "amdgcn.device.fini";
     // Perform a quick check for the named kernel in the image. The kernel
     // should be created by the 'amdgpu-lower-ctor-dtor' pass.
     GenericGlobalHandlerTy &Handler = Plugin.getGlobalHandler();
-    if (!Handler.isSymbolInImage(*this, Image, Name))
+    if (IsCtor && !Handler.isSymbolInImage(*this, Image, KernelName))
       return Plugin::success();
 
     // Allocate and construct the AMDGPU kernel.
-    AMDGPUKernelTy AMDGPUKernel(Name);
+    AMDGPUKernelTy AMDGPUKernel(KernelName);
     if (auto Err = AMDGPUKernel.init(*this, Image))
       return Err;
 
diff --git a/openmp/libomptarget/plugins-nextgen/common/include/PluginInterface.h b/openmp/libomptarget/plugins-nextgen/common/include/PluginInterface.h
index b85dc146d86d2f..f7f723610dd056 100644
--- a/openmp/libomptarget/plugins-nextgen/common/include/PluginInterface.h
+++ b/openmp/libomptarget/plugins-nextgen/common/include/PluginInterface.h
@@ -218,12 +218,19 @@ class DeviceImageTy {
   const __tgt_device_image *TgtImage;
   const __tgt_device_image *TgtImageBitcode;
 
+  /// If this image has any global destructors that much be called.
+  /// FIXME: This is only required because we currently have no invariants
+  ///        towards the lifetime of the underlying image. We should either copy
+  ///        the image into memory locally or erase the pointers after init.
+  bool PendingGlobalDtors;
+
   /// Table of offload entries.
   OffloadEntryTableTy OffloadEntryTable;
 
 public:
   DeviceImageTy(int32_t Id, const __tgt_device_image *Image)
-      : ImageId(Id), TgtImage(Image), TgtImageBitcode(nullptr) {
+      : ImageId(Id), TgtImage(Image), TgtImageBitcode(nullptr),
+        PendingGlobalDtors(false) {
     assert(TgtImage && "Invalid target image");
   }
 
@@ -255,6 +262,10 @@ class DeviceImageTy {
                            "Image");
   }
 
+  /// Accessors to the boolean value
+  bool setPendingGlobalDtors() { return PendingGlobalDtors = true; }
+  bool hasPendingGlobalDtors() const { return PendingGlobalDtors; }
+
   /// Get a reference to the offload entry table for the image.
   OffloadEntryTableTy &getOffloadEntryTable() { return OffloadEntryTable; }
 };
diff --git a/openmp/libomptarget/plugins-nextgen/cuda/src/rtl.cpp b/openmp/libomptarget/plugins-nextgen/cuda/src/rtl.cpp
index 7ccd0b4fef2840..ce6b39898ae95a 100644
--- a/openmp/libomptarget/plugins-nextgen/cuda/src/rtl.cpp
+++ b/openmp/libomptarget/plugins-nextgen/cuda/src/rtl.cpp
@@ -392,12 +392,20 @@ struct CUDADeviceTy : public GenericDeviceTy {
 
   virtual Error callGlobalConstructors(GenericPluginTy &Plugin,
                                        DeviceImageTy &Image) override {
+    // Check for the presense of global destructors at initialization time. This
+    // is required when the image may be deallocated before destructors are run.
+    GenericGlobalHandlerTy &Handler = Plugin.getGlobalHandler();
+    if (Handler.isSymbolInImage(*this, Image, "nvptx$device$fini"))
+      Image.setPendingGlobalDtors();
+
     return callGlobalCtorDtorCommon(Plugin, Image, /*IsCtor=*/true);
   }
 
   virtual Error callGlobalDestructors(GenericPluginTy &Plugin,
                                       DeviceImageTy &Image) override {
-    return callGlobalCtorDtorCommon(Plugin, Image, /*IsCtor=*/false);
+    if (Image.hasPendingGlobalDtors())
+      return callGlobalCtorDtorCommon(Plugin, Image, /*IsCtor=*/false);
+    return Plugin::success();
   }
 
   Expected<std::unique_ptr<MemoryBuffer>>
@@ -1145,7 +1153,7 @@ struct CUDADeviceTy : public GenericDeviceTy {
     // Perform a quick check for the named kernel in the image. The kernel
     // should be created by the 'nvptx-lower-ctor-dtor' pass.
     GenericGlobalHandlerTy &Handler = Plugin.getGlobalHandler();
-    if (!Handler.isSymbolInImage(*this, Image, KernelName))
+    if (IsCtor && !Handler.isSymbolInImage(*this, Image, KernelName))
       return Plugin::success();
 
     // The Nvidia backend cannot handle creating the ctor / dtor array

Copy link
Contributor

@shiltian shiltian left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LG

@jhuber6 jhuber6 merged commit 37c1a5e into llvm:main Jan 11, 2024
justinfargnoli pushed a commit to justinfargnoli/llvm-project that referenced this pull request Jan 28, 2024
…lvm#77828)

Summary:
The constructors and destructors look up a symbol in the ELF quickly to
determine if they need to be run on the GPU. This allows us to avoid the
very slow actions required to do the slower lookup using the vendor API.

One problem occurs with how we handle the lifetime of these images.
Right now there is no invariant to specify the lifetime of the
underlying binary image that is loaded. In the typical case, this comes
from the binary itself in the `.llvm.offloading` section, meaning that
the lifetime of the binary should match the executable itself. This
would work fine, if it weren't for the fact that the plugin is loaded
via `dlopen` and can have a teardown order out of sync with the main
executable.

This was likely what was occuring when this failed on some systems but
not others. A potential solution would be to simply copy images into
memory so the runtime does not rely on external references. Another
would be to manually zero these out after initialization as to prevent
this mistake from happening accidentally. The former has the benefit of
making some checks easier, and allowing for constant initialization be
done on the ELF itself (normally we can't do this because writing to a
constant section, e.g. .llvm.offloading is a segfault.). The downside
would be the extra time required to copy the image in bulk (Although we
are likely doing this in the vendor runtimes as well).

This patch went with a quick solution to simply set a boolean value at
initialization time if we need to call destructors.

Fixes: llvm#77798
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[OpenMP] offload to amdgpu hit "Input is not an ELF file" error
3 participants