Skip to content

[OpenMP][NFC] Remove DelayedBinDesc #74360

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 1 commit into from
Dec 6, 2023
Merged

Conversation

jdoerfert
Copy link
Member

@jdoerfert jdoerfert commented Dec 4, 2023

Remove DelayedBinDesc as it is not necessary since bc4e0c0.
See #74360 (comment) for details.

@llvmbot
Copy link
Member

llvmbot commented Dec 4, 2023

@llvm/pr-subscribers-openmp

Author: Johannes Doerfert (jdoerfert)

Changes

The DelayedBinDesc was removed as it is not necessary. It was either
never used or racy. The only time we could have called it was after we
created PM in the init() constructor. However, we immediately went
ahead and called registerDelayedLibraries which disabled
DelayedBinDesc for the future. This was like that since
https://reviews.llvm.org/D142249 landed.


Full diff: https://github.com/llvm/llvm-project/pull/74360.diff

3 Files Affected:

  • (modified) openmp/libomptarget/include/PluginManager.h (-17)
  • (modified) openmp/libomptarget/src/interface.cpp (-3)
  • (modified) openmp/libomptarget/src/rtl.cpp (-1)
diff --git a/openmp/libomptarget/include/PluginManager.h b/openmp/libomptarget/include/PluginManager.h
index 94ecce01ca74c..2473e39de34bc 100644
--- a/openmp/libomptarget/include/PluginManager.h
+++ b/openmp/libomptarget/include/PluginManager.h
@@ -106,23 +106,6 @@ struct PluginManager {
   HostPtrToTableMapTy HostPtrToTableMap;
   std::mutex TblMapMtx; ///< For HostPtrToTableMap
 
-  // Work around for plugins that call dlopen on shared libraries that call
-  // tgt_register_lib during their initialisation. Stash the pointers in a
-  // vector until the plugins are all initialised and then register them.
-  bool delayRegisterLib(__tgt_bin_desc *Desc) {
-    if (RTLsLoaded)
-      return false;
-    DelayedBinDesc.push_back(Desc);
-    return true;
-  }
-
-  void registerDelayedLibraries() {
-    // Only called by libomptarget constructor
-    RTLsLoaded = true;
-    for (auto *Desc : DelayedBinDesc)
-      __tgt_register_lib(Desc);
-    DelayedBinDesc.clear();
-  }
 
   int getNumDevices() {
     std::lock_guard<decltype(RTLsMtx)> Lock(RTLsMtx);
diff --git a/openmp/libomptarget/src/interface.cpp b/openmp/libomptarget/src/interface.cpp
index 62cf2262deb62..b27410ed12eaf 100644
--- a/openmp/libomptarget/src/interface.cpp
+++ b/openmp/libomptarget/src/interface.cpp
@@ -46,9 +46,6 @@ EXTERN void __tgt_register_requires(int64_t Flags) {
 /// adds a target shared library to the target execution image
 EXTERN void __tgt_register_lib(__tgt_bin_desc *Desc) {
   TIMESCOPE();
-  if (PM->delayRegisterLib(Desc))
-    return;
-
   PM->registerLib(Desc);
 }
 
diff --git a/openmp/libomptarget/src/rtl.cpp b/openmp/libomptarget/src/rtl.cpp
index 5eb1c553df491..27db7034d8956 100644
--- a/openmp/libomptarget/src/rtl.cpp
+++ b/openmp/libomptarget/src/rtl.cpp
@@ -51,7 +51,6 @@ __attribute__((constructor(101))) void init() {
   PM->init();
 
   Profiler::get();
-  PM->registerDelayedLibraries();
 }
 
 __attribute__((destructor(101))) void deinit() {

@jhuber6
Copy link
Contributor

jhuber6 commented Dec 4, 2023

This was the original issue #60119. I think HSA might've solved this as well in future versions.

Copy link

github-actions bot commented Dec 4, 2023

✅ With the latest revision this PR passed the C/C++ code formatter.

@jdoerfert
Copy link
Member Author

This was the original issue #60119. I think HSA might've solved this as well in future versions.

This code was from the very beginning either unused or racy. I don't see a point in keeping it. If the issue exists, we need to store delayed lib initialization in a different place.

@jhuber6
Copy link
Contributor

jhuber6 commented Dec 4, 2023

This was the original issue #60119. I think HSA might've solved this as well in future versions.

This code was from the very beginning either unused or racy. I don't see a point in keeping it. If the issue exists, we need to store delayed lib initialization in a different place.

I'll wait until @JonChesterfield chimes in, since if I recall this patch did indeed resolve the issue caused by infinitely recursive constructor calls.

@jdoerfert
Copy link
Member Author

This was the original issue #60119. I think HSA might've solved this as well in future versions.

This code was from the very beginning either unused or racy. I don't see a point in keeping it. If the issue exists, we need to store delayed lib initialization in a different place.

I'll wait until @JonChesterfield chimes in, since if I recall this patch did indeed resolve the issue caused by infinitely recursive constructor calls.

The part I am deleting did never resolve an issue w/o race or UB. The critical part in the original patch was to move the loadRTLs call earlier.
I included the reproducer of the original issue (#60119) which was not in our tests.
It passes just fine on my machine w/o this code.

@jhuber6
Copy link
Contributor

jhuber6 commented Dec 4, 2023

This was the original issue #60119. I think HSA might've solved this as well in future versions.

This code was from the very beginning either unused or racy. I don't see a point in keeping it. If the issue exists, we need to store delayed lib initialization in a different place.

I'll wait until @JonChesterfield chimes in, since if I recall this patch did indeed resolve the issue caused by infinitely recursive constructor calls.

The part I am deleting did never resolve an issue w/o race or UB. The critical part in the original patch was to move the loadRTLs call earlier. I included the reproducer of the original issue (#60119) which was not in our tests. It passes just fine on my machine w/o this code.

IIRC the bug was triggered by HSA's handling of that test case rather than the test itself. HSA was doing some evil binary trawling to search for whether or not it had to stand up debugging utilities or something if memory serves. This ended up causing the binary to re-load its constructors.

@jdoerfert
Copy link
Member Author

This was the original issue #60119. I think HSA might've solved this as well in future versions.

This code was from the very beginning either unused or racy. I don't see a point in keeping it. If the issue exists, we need to store delayed lib initialization in a different place.

I'll wait until @JonChesterfield chimes in, since if I recall this patch did indeed resolve the issue caused by infinitely recursive constructor calls.

The part I am deleting did never resolve an issue w/o race or UB. The critical part in the original patch was to move the loadRTLs call earlier. I included the reproducer of the original issue (#60119) which was not in our tests. It passes just fine on my machine w/o this code.

IIRC the bug was triggered by HSA's handling of that test case rather than the test itself. HSA was doing some evil binary trawling to search for whether or not it had to stand up debugging utilities or something if memory serves. This ended up causing the binary to re-load its constructors.

That might be true, but 1) the code in question was, and is, still UB or dead, and 2) the test should have been included.

@JonChesterfield
Copy link
Collaborator

I think this code is here because dlopen'ed cuda or hsa otherwise falls over. That's not the default build configuration. Have you tested this with dynamic cuda/hsa?

What's the race condition / UB in question? Destruction time is run by one thread I thought.

@jdoerfert
Copy link
Member Author

What's the race condition / UB in question?

The only time we could have called it was after we
created PM in the init() constructor. However, we immediately went
ahead and called registerDelayedLibraries which disabled
DelayedBinDesc for the future. There is no "time" at which
RTLsLoaded is false, hence DelayedBinDesc can be used,
after PM is initialized. So, whenever we hit the delayRegisterLib,
PM is nullptr, or DelayedBinDesc is set to false.

@JonChesterfield
Copy link
Collaborator

It wasn't dead when we wrote it. It stopped HSA segfaulting when running any openmp programs.

The code might be dead now. I expect git log will find a "NFC" patch that changed the control flow so that this no guard no longer works, with no associated tests.

Naturally we don't test with dlopen'ed HSA on CI so when it stopped working we didn't notice.

As an argument against deleting dynamic_cuda and dynamic_hsa, instead of fixing them, those are the expected mechanisms by which apt install llvm might give users a working toolchain, since the distribution build machines generally won't have rocm or cuda installed to satisfy the link dependency.

@jdoerfert
Copy link
Member Author

jdoerfert commented Dec 5, 2023

It wasn't dead when we wrote it. It stopped HSA segfaulting when running any openmp programs

Again, it was always dead, UB, or racy. My money is on dead. This is the code in your patch:

  PM = new PluginManager(UseEventsForAtomicTransfers);

  ProfileTraceFile = getenv("LIBOMPTARGET_PROFILE");
  // TODO: add a configuration option for time granularity
  if (ProfileTraceFile)
    timeTraceProfilerInitialize(500 /* us */, "libomptarget");

  PM->RTLs.loadRTLs();
  PM->registerDelayedLibraries();

There is no point between PM = new and PM-> registerDelayedLibraries at which someone could call __tgt_register_lib and thereby use the DelayedBinDesc in a well defined manner because they cannot synchronize with the thread that runs this code. Your patch removed the crash because it moved PM->RTLs.loadRTLs();, everything else was, and is, dead and not needed.

@JonChesterfield
Copy link
Collaborator

JonChesterfield commented Dec 5, 2023

I'm pretty sure libomptarget is racy in general. Initialising the same libomptarget from multiple threads is unlikely to work. Fortunately it is usually initialised by the constructor called from crt.

It looks like this was a workaround for HSA doing dubious things. At the time you were insistent that we had to work around HSA, and keep the workaround around so things would work with older versions of HSA once it was fixed. I wanted to hard error on that HSA version, where users continue using older ones until HSA stopped crawling the address space looking for profiler thing.

Have you reproduced the crash this code avoided - I think it would be a specific HSA version via dlopen?

I am nervous about (the current, in libomptarget) extensive NFC patches given the limited automated test coverage. This looks like a functional change without any automated test coverage. What manual testing supports the change?

@JonChesterfield
Copy link
Collaborator

This doesn't look dead to me. Before there's a bunch of calls to tgt_register_lib, after there's none. Why do you think this code is dead?

@jdoerfert
Copy link
Member Author

As promised, I spend some time on this.

TL;DR;

  • The code that I delete is neither dead nor racy. It's not racy because there is only one thread involved, so that thread circles back after PM = new... and adds a image into the DeleayedBinDesc container. This is also the reason it is not strictly speaking dead either.
  • That said, the code (today) simply adds the image into the DelayedBinDesc and then pops them out, so let's see what happens and why we don't need this anymore, and, while we are at it, also why it was needed before.

Setup:

HSA is found at build time so we do not use the dlwrapper.
On the call that was said to be "the requirement" but I don't believe so.
The dlopen in the issue (#60119) is not the dlwrapper but just a dlopen from HSA.
It also seems I can reproduce this just fine in the dlwrapper and non-dlwrapper setting.


Let's start with the fix:

/d/s/d/r/s/llvm-project ❯❯❯ gcs --name-only
commit 2257e3d2e55a52d70db10e9f4ed1669ab79ace3f
Author: Jon Chesterfield <[email protected]>
Date:   Sat Jan 21 12:01:13 2023 +0000

    [openmp] Workaround for HSA in issue 60119

Verifying it breaks (so we go one commit back):

/d/s/d/r/s/llvm-project ❯❯❯ git reset --hard HEAD~
/d/s/d/r/s/llvm-project ❯❯❯ cat bug60119.c
int main() {}
/d/s/d/r/s/llvm-project ❯❯❯ touch empty.c
/d/s/d/r/s/llvm-project ❯❯❯ clang -fopenmp --offload-arch=gfx90a -fPIC -shared empty.c -o liba.so
/d/s/d/r/s/llvm-project ❯❯❯ clang -fopenmp --offload-arch=gfx90a -fPIC -shared empty.c -o libb.so
/d/s/d/r/s/llvm-project ❯❯❯ clang -fopenmp --offload-arch=gfx90a -L . -la -lb bug60119.c -o bug60119
/d/s/d/r/s/llvm-project ❯❯❯ LD_LIBRARY_PATH=.:$LD_LIBRARY_PATH  ./bug60119

Yes, it hangs.

Observation 1:

It goes away and works fine with rocm 5.5.0 and later while rocm 5.4.3 and before breaks.


Setup con't:

The system has an old libomptarget.rtl.cuda.so.15 which fails to load:

omptarget --> Successfully loaded library 'libomptarget.rtl.cuda.so'!
omptarget --> Invalid plugin as necessary interface function (init_plugin) was not found.

Observation 2:

Without this commit (on top of trunk):

/d/s/d/r/s/llvm-project ❯❯❯ gcs --name-only
  commit 1f283a60a4bb896fa2d37ce00a3018924be82b9f (HEAD -> offload_prep7, llvm/main)
  Author: Matt Arsenault <[email protected]>
  Date:   Fri Nov 10 10:31:40 2023 +0900

      Reapply "RegisterCoalescer: Add implicit-def of super register when coalescing SUBREG_TO_REG"

It just hangs:

#1  0x000015554c0e77a7 in __gthread_once (__func=<optimized out>, __once=0x15554c29d940 <InitializeMemorySSAWrapperPassPassFlag>)
    at .../include/g++/x86_64-redhat-linux/bits/gthr-default.h:700
#2  std::call_once<void* (&)(llvm::PassRegistry&), std::reference_wrapper<llvm::PassRegistry> > (
    __f=@0x15554c0e6280: {void *(llvm::PassRegistry &)} 0x15554c0e6280 <initializeMemorySSAWrapperPassPassOnce(llvm::PassRegistry&)>, __once=...) at .../gcc/12.2.0/snos/include/g++/mutex:859
#3  llvm::call_once<void* (&)(llvm::PassRegistry&), std::reference_wrapper<llvm::PassRegistry> > (
    F=@0x15554c0e6280: {void *(llvm::PassRegistry &)} 0x15554c0e6280 <initializeMemorySSAWrapperPassPassOnce(llvm::PassRegistry&)>, flag=...)
    at .../llvm-project/llvm/include/llvm/Support/Threading.h:89
...
#15 llvm::omp::target::JITEngine::JITEngine(llvm::Triple::ArchType) () from .../build/llvm/lib/libomptarget.rtl.amdgpu.so

However, this seems unrelated.

Observation 3:

Replacing libomptarget.rtl.cuda.so.15 with an empty file libomptarget.rtl.cuda.so, it works fine. So,

touch libomptarget.rtl.cuda.so
LD_LIBRARY_PATH=.:$LD_LIBRARY_PATH ./bug60119

works fine after:

omptarget --> Attempting to load library 'libomptarget.rtl.cuda.so'...
omptarget --> Unable to load library 'libomptarget.rtl.cuda.so': ./libomptarget.rtl.cuda.so: file too short!

The reason is that libomptarget.rtl.cuda.so.15 will initialize a different LLVM and that doesn't work.

==> Don't mix LLVM versions in one library (with exposed symbols).

FWIW, with #74538 this will not happen anymore since we will stop picking up random plugins by default.

Observation 4:

Is the code dead?
No. It is executed by the initial thread.
The thread adds libraries to the delayed container, then registers them.
There is no second thread involved, it's just the one thread (as far as I can tell).


Observation 5:

With this commit (on top of head):

/d/s/d/r/s/llvm-project ❯❯❯ gcs --name-only
  commit 369775d1461f1f7402a0fc9a312bb3b701410dd6 (HEAD -> offload_prep7)
  Author: Johannes Doerfert <[email protected]>
  Date:   Mon Dec 4 11:16:25 2023 -0800

       [OpenMP][NFC] Remove `DelayedBinDesc`

and the empty libomptarget.rtl.cuda.so from above it works fine.

I tested all these rocm versions:

4.2.0            4.5.2            5.1.0            5.2.1            5.3.3            5.4.2            5.5.1            5.7.1
4.3.1            5.0.0            5.1.1            5.2.3            5.4.0            5.4.3            5.6.0            5.7.0            
4.5.0            5.0.2            5.2.0            5.3.0            5.4.1            5.5.0            5.6.1

Analysis

So what happened? Let's actually look at the original bug.
I will explain below what changed in rocm 5.5.0 to make this work (in basically all cases) but let's stay with rocm 5.3.0.

In the old code (before I started to rewrite libomptarget), we created the RTLInfoTy (what is now PluginAdapterTy) as part of LoadRTLs().

for (const char *Name : RTLNames) {
      AllRTLs.emplace_back();

      RTLInfoTy &RTL = AllRTLs.back();

      const std::string BaseRTLName(Name);
      if (NextGenPlugins) {
        if (attemptLoadRTL(BaseRTLName + ".nextgen.so", RTL))
          continue;

So, we added an uninitialized RTLInfoTy into AllRTLS when we started to actually load the RTL (PluginAdaptorTy) and hence the plugin.

You can probably imagine this is going to be a problem.

What happens next is the HSA "bug" that was mentioned.
libomptarget.rtl.amdgpu.so loads HSA.
HSA then calls rocr::os::GetLoadedLibs() which already sounds bad (see https://github.com/RadeonOpenCompute/ROCR-Runtime/blame/adae6c61e10d371f7cbc3d0e94ae2c070cab18a4/src/core/util/lnx/os_linux.cpp#L206)
This function seems to be gone in rocm 5.5.x.
Here, HSA will load all runtimes linked to the application, including our liba and libb.
Once we get there, we call __tgt_register_lib which will iterate over AllRTLs and unconditionally call RTLInfoTy::is_valid_binary (which is necessary for plugins to provide).
While that should not be an issue but RTL above is not initialized yet (for the most part).
So we call a nullptr and crash.

Remember, the original issue was a hang not a crash.
The move of loadRTLs fixed that hang just fine.
And I really mean fixed, since the expected behavior should be that a register call is valid.
The problem after the move is that we put an uninitialized plugin (adaptor) into the list of all plugin adaptors.

So let's track back when we stopped adding half initialized plugin adaptors into the container.
First, I went back before bc4e0c0 and applied this patch.
We get a hang again (which rocm < 5.5.0).
This time it's in the libomptarget AMDGPU plugin itself since it was never done initializing the plugin.
The global constructor of the Plugin in Plugin::get() is running, loads HSA which caused it to call back into the plugin and there it waits for the constructor to be done -> hang.
It doesn't hang/crash like before because we setup is_valid_binary (and all the other runtime call pointers) before we call RTL.init_plugin.
Said differently, RTL in AllRTLs is not mostly uninitialized anymore, all function pointers are initialized.

So, let's go back before b80b5f1 and apply this patch.
That's the last commit when we only loaded RTL.init_plugin from the plugin shared library and call it before we setup all the other pointers, including is_valid_binary.
Voila, we get the crash we'd expect as we call is_valid_binary which is still a nullptr.


Summary

  1. The original fix solved two problems: a) The move of loadRTLS prevented the reported hang. Recursion into libomptarget did not stop at register_image anymore but would have segfaulted later without the delay buffer on the outermost level. The buffer was needed because the runtime was put into an inconsistent state and it was accessed during that time via the public interface that expects consistent state.
  2. In the call it was mentioned dlwrapper is needed, that is not the case. It was also mentioned that multiple threads need to concurrently initialize libomptarget, which is also not necessary (or possible, as far as I can tell).
  3. Since rocm 5.5.0 we don't need any of this anymore. The rocm bug has been fixed, they stopped loading all runtimes unconditionally.
  4. Since b80b5f1 we properly initialize all function pointers in the (now) PluginAdaptorTy before we start using those pointers. That prevents the crash you'd see without the delay buffer. We still will hang.
  5. Since bc4e0c0 we will not add PluginAdaptors into the container of all plugin adaptors before the plugin is not ready. The error is thereby gone. When and old HSA loads other libraries they can call register_image but that will simply not register the image with the plugin we are currently initializing. That seems like reasonable behavior, thought it is good to keep in mind if we ever want a kernel library (@jhuber6 @mjklemm). We can still have a standalone kernel library though or load it late after all plugins are setup (which seems reasonable).

Conclusion

I don't see a reason this code cannot be deleted, regardless of the rocm version used.
The reproducer is included and it works just fine with rocm 5.3.0.

@JonChesterfield
Copy link
Collaborator

The multiple threads comment was in response to you claiming this code was racy. I said it wasn't, as you can't get here from multiple threads. Nothing to do with the original bug fix.

So the original patch hacked around using these objects in a partially initialised state and also sidestepped HSA 5.3ish doing dubious things. The partially initialised state is now gone, so we don't need that.

There is a missing piece which is I believe this original problem only occurred using dynamic_hsa, though I can believe other things fell over when linking it normally. Given you've already got this set up, could you toggle to using the dynamically loaded hsa and find out whether 5.3 segv's or not?

I'd rather we not keep a workaround in tree for 5.3, given I didn't want to introduce one in the first place, but it seems worth achieving the closure of knowing whether dynamic hsa which does the dubious address space walking would segv after this patch.

@jdoerfert
Copy link
Member Author

It worked fine when I loaded rocm only after I build LLVM. I almost all of the above on both machines with the same results. The second machine allocation is gone though.

@JonChesterfield
Copy link
Collaborator

Good to hear. Thanks!

  The `DelayedBinDesc` was removed as it is not necessary. It was either
  never used or racy. The only time we could have called it was after we
  created `PM` in the `init()` constructor. However, we immediately went
  ahead and called `registerDelayedLibraries` which disabled
  `DelayedBinDesc` for the future. This was like that since
  https://reviews.llvm.org/D142249 landed.

  Also include the reproducer of
  llvm#60119
  which was missing in D142249.
@jdoerfert jdoerfert merged commit d552ce2 into llvm:main Dec 6, 2023
jdoerfert added a commit that referenced this pull request Dec 7, 2023
jdoerfert added a commit that referenced this pull request Dec 7, 2023
Reverts #74360

As I wrote in the analysis of #74360:

Since
bc4e0c0
we will not add PluginAdaptors into the container of all plugin adaptors
before the plugin is not ready. The error is thereby gone. When and old
HSA loads other libraries they can call register_image but that will
simply not register the image with the plugin we are currently
initializing. That seems like reasonable behavior, thought it is good to
keep in mind if we ever want a kernel library (@jhuber6 @mjklemm). We
can still have a standalone kernel library though or load it late after
all plugins are setup (which seems reasonable).

I did not expect one our tests actually doing exactly what this will not
allow anymore, at least when you use rocm <5.5.0. Need to figure out if
we want this behavior (for rocm <5.5.0).
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
openmp:libomptarget OpenMP offload runtime openmp
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants