
Make WriteIndexesThinBackend multi threaded #109847


Merged

Conversation

NuriAmari
Contributor

@NuriAmari NuriAmari commented Sep 24, 2024

We've noticed that for large builds, executing the thin-link can take on the order of tens of minutes. We are only using a single thread to write the sharded indices and import files for each input bitcode file. While we need to ensure the index file produced lists modules in a deterministic order, that doesn't prevent us from executing the rest of the work in parallel.

In this change we use a thread pool to execute as much of the backend's work as possible in parallel. In local testing on a machine with 80 cores, this change makes a thin-link for ~100,000 input files run in ~2 minutes. Without this change it takes upwards of 10 minutes.

@teresajohnson
Contributor

Cool! I'm interested in reviewing this change when it is ready.

@NuriAmari
Contributor Author

An LLDB test timed out, but I'm relatively certain I haven't caused that.

@NuriAmari NuriAmari marked this pull request as ready for review September 24, 2024 21:34
@llvmbot llvmbot added the LTO Link time optimization (regular/full LTO or ThinLTO) label Sep 24, 2024
@llvmbot
Member

llvmbot commented Sep 24, 2024

@llvm/pr-subscribers-llvm-transforms
@llvm/pr-subscribers-lld-macho
@llvm/pr-subscribers-llvm-support
@llvm/pr-subscribers-lld

@llvm/pr-subscribers-lto

Author: Nuri Amari (NuriAmari)



Full diff: https://github.com/llvm/llvm-project/pull/109847.diff

1 file affected:

  • (modified) llvm/lib/LTO/LTO.cpp (+45-12)
diff --git a/llvm/lib/LTO/LTO.cpp b/llvm/lib/LTO/LTO.cpp
index a88124dacfaefd..78084c7aedcd91 100644
--- a/llvm/lib/LTO/LTO.cpp
+++ b/llvm/lib/LTO/LTO.cpp
@@ -1395,11 +1395,12 @@ class lto::ThinBackendProc {
       MapVector<StringRef, BitcodeModule> &ModuleMap) = 0;
   virtual Error wait() = 0;
   virtual unsigned getThreadCount() = 0;
+  virtual bool isSensitiveToInputOrder() { return false; }
 
   // Write sharded indices and (optionally) imports to disk
   Error emitFiles(const FunctionImporter::ImportMapTy &ImportList,
                   llvm::StringRef ModulePath,
-                  const std::string &NewModulePath) {
+                  const std::string &NewModulePath) const {
     ModuleToSummariesForIndexTy ModuleToSummariesForIndex;
     GVSummaryPtrSet DeclarationSummaries;
 
@@ -1613,6 +1614,10 @@ namespace {
 class WriteIndexesThinBackend : public ThinBackendProc {
   std::string OldPrefix, NewPrefix, NativeObjectPrefix;
   raw_fd_ostream *LinkedObjectsFile;
+  DefaultThreadPool BackendThreadPool;
+  std::optional<Error> Err;
+  std::mutex ErrMu;
+  std::mutex OnWriteMu;
 
 public:
   WriteIndexesThinBackend(
@@ -1634,8 +1639,6 @@ class WriteIndexesThinBackend : public ThinBackendProc {
       const std::map<GlobalValue::GUID, GlobalValue::LinkageTypes> &ResolvedODR,
       MapVector<StringRef, BitcodeModule> &ModuleMap) override {
     StringRef ModulePath = BM.getModuleIdentifier();
-    std::string NewModulePath =
-        getThinLTOOutputFile(ModulePath, OldPrefix, NewPrefix);
 
     if (LinkedObjectsFile) {
       std::string ObjectPrefix =
@@ -1645,19 +1648,48 @@ class WriteIndexesThinBackend : public ThinBackendProc {
       *LinkedObjectsFile << LinkedObjectsFilePath << '\n';
     }
 
-    if (auto E = emitFiles(ImportList, ModulePath, NewModulePath))
-      return E;
+    BackendThreadPool.async(
+        [this](const StringRef ModulePath,
+               const FunctionImporter::ImportMapTy &ImportList,
+               const std::string &OldPrefix, const std::string &NewPrefix) {
+          std::string NewModulePath =
+              getThinLTOOutputFile(ModulePath, OldPrefix, NewPrefix);
+          auto E = emitFiles(ImportList, ModulePath, NewModulePath);
+          if (E) {
+            std::unique_lock<std::mutex> L(ErrMu);
+            if (Err)
+              Err = joinErrors(std::move(*Err), std::move(E));
+            else
+              Err = std::move(E);
+            return;
+          }
+          if (OnWrite) {
+            // Serialize calls to the on write callback in case it is not thread
+            // safe
+            std::unique_lock<std::mutex> L(OnWriteMu);
+            OnWrite(std::string(ModulePath));
+          }
+        },
+        ModulePath, ImportList, OldPrefix, NewPrefix);
+    return Error::success();
+  }
 
-    if (OnWrite)
-      OnWrite(std::string(ModulePath));
+  Error wait() override {
+    BackendThreadPool.wait();
+    if (Err)
+      return std::move(*Err);
     return Error::success();
   }
 
-  Error wait() override { return Error::success(); }
+  unsigned getThreadCount() override {
+    return BackendThreadPool.getMaxConcurrency();
+  }
 
-  // WriteIndexesThinBackend should always return 1 to prevent module
-  // re-ordering and avoid non-determinism in the final link.
-  unsigned getThreadCount() override { return 1; }
+  bool isSensitiveToInputOrder() override {
+    // The order which modules are written to LinkedObjectsFile should be
+    // deterministic and match the order they are passed on the command line.
+    return true;
+  }
 };
 } // end anonymous namespace
 
@@ -1856,7 +1888,8 @@ Error LTO::runThinLTO(AddStreamFn AddStream, FileCache Cache,
                               ThinLTO.ModuleMap);
   };
 
-  if (BackendProc->getThreadCount() == 1) {
+  if (BackendProc->getThreadCount() == 1 ||
+      BackendProc->isSensitiveToInputOrder()) {
     // Process the modules in the order they were provided on the command-line.
     // It is important for this codepath to be used for WriteIndexesThinBackend,
     // to ensure the emitted LinkedObjectsFile lists ThinLTO objects in the same

Contributor

@teresajohnson teresajohnson left a comment


Thanks for this patch. I have a few suggestions below. Have you measured the compile time if the max concurrency is 1 or a small value - wondering if there is overhead?

@NuriAmari
Contributor Author

Have you measured the compile time if the max concurrency is 1 or a small value - wondering if there is overhead?

I haven't, but I'll give it a try.

@NuriAmari
Contributor Author

These measurements were taken with a RelWithDebInfo build of the toolchain, but I think it should serve to illustrate we shouldn't be too worried about overhead.

Without this patch all together I measured:

real    10m38.836s
user    9m57.354s
sys     0m47.409s

With this patch but with the thread pool hard coded to use max concurrency = 1:

real    11m11.859s
user    10m29.944s
sys     0m49.081s

With this patch but with the thread pool hard coded to use max concurrency = 2:

real    6m24.810s
user    10m1.542s
sys     0m48.052s

I imagine the overhead would be reduced with a release build, and we still see benefit from 2 threads onwards.

Contributor

@teresajohnson teresajohnson left a comment


I've been meaning to test this in our build system, will try to get to that tomorrow. That's a noticeable enough overhead that I want to be sure some parallelism will kick in and behave properly.

Contributor

@teresajohnson teresajohnson left a comment


Doing some testing with a large thin link and the results look really good! Couple comments below, mostly need to make sure the parallelism is adjustable as it is for in process thinlto.

Nuri Amari added 4 commits October 6, 2024 16:32

github-actions bot commented Oct 7, 2024

✅ With the latest revision this PR passed the C/C++ code formatter.

Contributor

@teresajohnson teresajohnson left a comment


lgtm

@NuriAmari
Contributor Author

Thanks for the review @teresajohnson !

@NuriAmari NuriAmari merged commit 2edd897 into llvm:main Oct 7, 2024
9 checks passed
NuriAmari pushed a commit to NuriAmari/llvm-project that referenced this pull request Oct 7, 2024
llvm#109847 inadvertently
introduced compile errors to the gold plugin. This PR fixes the issue.
ilovepi added a commit that referenced this pull request Oct 7, 2024
This seems to have been overlooked in #109847, probably because most
bots don't build w/ gold enabled.