Make WriteIndexesThinBackend multi threaded #109847
Conversation
Cool! I'm interested in reviewing this change when it is ready.
Some LLDB tests timed out, but I'm relatively certain I haven't caused that.
@llvm/pr-subscribers-llvm-transforms @llvm/pr-subscribers-lto

Author: Nuri Amari (NuriAmari)

Changes

We've noticed that for large builds, executing the thin-link can take on the order of tens of minutes. We are only using a single thread to write the sharded indices and import files for each input bitcode file. While we need to ensure the index file produced lists modules in a deterministic order, that doesn't prevent us from executing the rest of the work in parallel. In this change we use a thread pool to execute as much of the backend's work as possible in parallel. In local testing on a machine with 80 cores, this change makes a thin-link for ~100,000 input files run in ~2 minutes. Without this change it takes upwards of 10 minutes.

Full diff: https://github.com/llvm/llvm-project/pull/109847.diff

1 file affected: llvm/lib/LTO/LTO.cpp (modified)
diff --git a/llvm/lib/LTO/LTO.cpp b/llvm/lib/LTO/LTO.cpp
index a88124dacfaefd..78084c7aedcd91 100644
--- a/llvm/lib/LTO/LTO.cpp
+++ b/llvm/lib/LTO/LTO.cpp
@@ -1395,11 +1395,12 @@ class lto::ThinBackendProc {
MapVector<StringRef, BitcodeModule> &ModuleMap) = 0;
virtual Error wait() = 0;
virtual unsigned getThreadCount() = 0;
+ virtual bool isSensitiveToInputOrder() { return false; }
// Write sharded indices and (optionally) imports to disk
Error emitFiles(const FunctionImporter::ImportMapTy &ImportList,
llvm::StringRef ModulePath,
- const std::string &NewModulePath) {
+ const std::string &NewModulePath) const {
ModuleToSummariesForIndexTy ModuleToSummariesForIndex;
GVSummaryPtrSet DeclarationSummaries;
@@ -1613,6 +1614,10 @@ namespace {
class WriteIndexesThinBackend : public ThinBackendProc {
std::string OldPrefix, NewPrefix, NativeObjectPrefix;
raw_fd_ostream *LinkedObjectsFile;
+ DefaultThreadPool BackendThreadPool;
+ std::optional<Error> Err;
+ std::mutex ErrMu;
+ std::mutex OnWriteMu;
public:
WriteIndexesThinBackend(
@@ -1634,8 +1639,6 @@ class WriteIndexesThinBackend : public ThinBackendProc {
const std::map<GlobalValue::GUID, GlobalValue::LinkageTypes> &ResolvedODR,
MapVector<StringRef, BitcodeModule> &ModuleMap) override {
StringRef ModulePath = BM.getModuleIdentifier();
- std::string NewModulePath =
- getThinLTOOutputFile(ModulePath, OldPrefix, NewPrefix);
if (LinkedObjectsFile) {
std::string ObjectPrefix =
@@ -1645,19 +1648,48 @@ class WriteIndexesThinBackend : public ThinBackendProc {
*LinkedObjectsFile << LinkedObjectsFilePath << '\n';
}
- if (auto E = emitFiles(ImportList, ModulePath, NewModulePath))
- return E;
+ BackendThreadPool.async(
+ [this](const StringRef ModulePath,
+ const FunctionImporter::ImportMapTy &ImportList,
+ const std::string &OldPrefix, const std::string &NewPrefix) {
+ std::string NewModulePath =
+ getThinLTOOutputFile(ModulePath, OldPrefix, NewPrefix);
+ auto E = emitFiles(ImportList, ModulePath, NewModulePath);
+ if (E) {
+ std::unique_lock<std::mutex> L(ErrMu);
+ if (Err)
+ Err = joinErrors(std::move(*Err), std::move(E));
+ else
+ Err = std::move(E);
+ return;
+ }
+ if (OnWrite) {
+ // Serialize calls to the on write callback in case it is not thread
+ // safe
+ std::unique_lock<std::mutex> L(OnWriteMu);
+ OnWrite(std::string(ModulePath));
+ }
+ },
+ ModulePath, ImportList, OldPrefix, NewPrefix);
+ return Error::success();
+ }
- if (OnWrite)
- OnWrite(std::string(ModulePath));
+ Error wait() override {
+ BackendThreadPool.wait();
+ if (Err)
+ return std::move(*Err);
return Error::success();
}
- Error wait() override { return Error::success(); }
+ unsigned getThreadCount() override {
+ return BackendThreadPool.getMaxConcurrency();
+ }
- // WriteIndexesThinBackend should always return 1 to prevent module
- // re-ordering and avoid non-determinism in the final link.
- unsigned getThreadCount() override { return 1; }
+ bool isSensitiveToInputOrder() override {
+ // The order which modules are written to LinkedObjectsFile should be
+ // deterministic and match the order they are passed on the command line.
+ return true;
+ }
};
} // end anonymous namespace
@@ -1856,7 +1888,8 @@ Error LTO::runThinLTO(AddStreamFn AddStream, FileCache Cache,
ThinLTO.ModuleMap);
};
- if (BackendProc->getThreadCount() == 1) {
+ if (BackendProc->getThreadCount() == 1 ||
+ BackendProc->isSensitiveToInputOrder()) {
// Process the modules in the order they were provided on the command-line.
// It is important for this codepath to be used for WriteIndexesThinBackend,
// to ensure the emitted LinkedObjectsFile lists ThinLTO objects in the same
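For readers unfamiliar with the pattern the diff leans on, here is a minimal, self-contained sketch of LLVM's DefaultThreadPool combined with joinErrors under a mutex. The file names and the doPerModuleWork function are placeholders for illustration; this is not the backend code itself.

```cpp
// Sketch only: the DefaultThreadPool + joinErrors pattern used by the patch,
// with placeholder work instead of emitFiles().
#include "llvm/ADT/StringRef.h"
#include "llvm/Support/Error.h"
#include "llvm/Support/ThreadPool.h"
#include "llvm/Support/raw_ostream.h"
#include <mutex>
#include <optional>

using namespace llvm;

// Placeholder for the per-module work (emitFiles() in the real backend).
static Error doPerModuleWork(StringRef ModulePath) {
  (void)ModulePath;
  return Error::success();
}

int main() {
  DefaultThreadPool Pool;   // default strategy, as in the patch
  std::optional<Error> Err; // combined failure from any worker, if any
  std::mutex ErrMu;         // serializes access to Err across workers

  for (StringRef Path : {"a.bc", "b.bc", "c.bc"}) {
    Pool.async(
        [&Err, &ErrMu](StringRef ModulePath) {
          if (Error E = doPerModuleWork(ModulePath)) {
            std::unique_lock<std::mutex> L(ErrMu);
            if (Err)
              Err = joinErrors(std::move(*Err), std::move(E));
            else
              Err = std::move(E);
          }
        },
        Path);
  }

  Pool.wait(); // the backend does the same in wait()
  if (Err) {
    logAllUnhandledErrors(std::move(*Err), errs(), "thin-link sketch: ");
    return 1;
  }
  return 0;
}
```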
Thanks for this patch. I have a few suggestions below. Have you measured the compile time when the max concurrency is 1 or a small value? I'm wondering if there is overhead.
I haven't, but I'll give it a try.
These measurements were taken with a non-release build.

Without this patch, altogether I measured:

With this patch, but with the thread pool hard-coded to max concurrency = 1:

With this patch, but with the thread pool hard-coded to max concurrency = 2:

I imagine the overhead would be reduced with a release build, and we still see a benefit from 2 threads onwards.
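For reference, the "hard-coded max concurrency" runs above presumably amount to constructing the pool with an explicit ThreadPoolStrategy instead of the default. A minimal sketch of what that temporary edit could look like (variable names are illustrative):

```cpp
#include "llvm/Support/ThreadPool.h"
#include "llvm/Support/Threading.h"
#include "llvm/Support/raw_ostream.h"

using namespace llvm;

void concurrencySketch() {
  // What the patch declares: a pool with the default strategy.
  DefaultThreadPool DefaultPool;

  // "Max concurrency = 1" measurement: pin the pool to a single worker.
  DefaultThreadPool SingleWorker(hardware_concurrency(1));

  // "Max concurrency = 2" measurement.
  DefaultThreadPool TwoWorkers(hardware_concurrency(2));

  // getMaxConcurrency() is what the backend reports from getThreadCount().
  outs() << SingleWorker.getMaxConcurrency() << " "
         << TwoWorkers.getMaxConcurrency() << " "
         << DefaultPool.getMaxConcurrency() << "\n";
}
```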
I've been meaning to test this in our build system; I'll try to get to that tomorrow. That's a noticeable enough overhead that I want to be sure some parallelism will kick in and behave properly.
I did some testing with a large thin link and the results look really good! A couple of comments below; mostly we need to make sure the parallelism is adjustable, as it is for in-process ThinLTO.
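One way to make the parallelism adjustable, mirroring how createInProcessThinBackend already takes an llvm::ThreadPoolStrategy, would be to plumb a strategy into the backend's pool. A hypothetical sketch; the class, its Parallelism parameter, and the plumbing are assumptions for illustration, not part of this PR:

```cpp
#include "llvm/Support/ThreadPool.h"
#include "llvm/Support/Threading.h"

namespace {

// Hypothetical: a write-indexes-style backend whose pool size comes from a
// caller-provided ThreadPoolStrategy, as the in-process backend's does.
class SizedWriteIndexesBackendSketch {
  llvm::DefaultThreadPool BackendThreadPool;

public:
  explicit SizedWriteIndexesBackendSketch(llvm::ThreadPoolStrategy Parallelism)
      : BackendThreadPool(Parallelism) {}

  unsigned getThreadCount() { return BackendThreadPool.getMaxConcurrency(); }
};

} // end anonymous namespace

// Usage: cap the backend at, e.g., 8 worker threads.
void exampleUsage() {
  SizedWriteIndexesBackendSketch Backend(
      llvm::heavyweight_hardware_concurrency(8));
  (void)Backend.getThreadCount();
}
```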
✅ With the latest revision this PR passed the C/C++ code formatter.
lgtm
Thanks for the review, @teresajohnson!
llvm#109847 inadvertently introduced compile errors to the gold plugin. This PR fixes the issue.
This seems to have been overlooked in #109847, probably because most bots don't build w/ gold enabled.