[ThinLTO] Asynchronous caching #8236
Conversation
Just noting that there are still some parts that need more polish. I'm looking for high-level comments and discussion on whether this level of additional complexity makes sense.
Prior to this patch, the implementation of `RemoteModuleCacheEntry` assumed `tryLoadingBuffer()` always finishes before `write()` gets called. This assumption was leveraged to perform determinism checks in the `write()` function. In the future, these two functions will run concurrently, and the assumption will no longer hold. This patch extracts the determinism check into a new function, `areLoadedAndWrittenResultsIdentical()`, that is guaranteed to be called only from a context where both of those functions have already returned.

This patch also takes care to avoid concurrent writes into a single file from those two functions. This is done by manipulating three files:

* The ".downloaded.tmp" file, where we download the cached buffer.
* The ".computed.tmp" file, where we stream out the computed buffer.
* The file at `OutputPath`, where we atomically move either of the above. This file also acts as the backing buffer for the "UseBufferAPI" mode.

Note that there's no synchronization between the atomic moves, so it's possible for `tryLoadingBuffer()` and `getMappedBuffer()` to return the buffer written out by `write()`. In practice, we expect these buffers to be identical (and we verify that with the determinism checks), so this should be fine.
This is supposed to be an NFC refactoring that makes the following commit more obvious.
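For illustration, the two-temp-file scheme described above could look roughly like this. This is a hedged sketch, not the patch's actual code; `OutputPath` matches the description above, while the helper names are hypothetical:

```cpp
// Sketch only: each task streams into its own temporary file and then
// atomically renames it over OutputPath, so readers never observe a
// partially written file.
#include "llvm/ADT/StringRef.h"
#include "llvm/Support/FileSystem.h"

// Hypothetical helper: finalize the buffer downloaded from the cache.
std::error_code finalizeDownloadedBuffer(llvm::StringRef OutputPath,
                                         llvm::StringRef DownloadedTmp) {
  // rename() is atomic on POSIX, so a concurrent reader of OutputPath
  // sees either the old contents or the new ones, never a mix.
  return llvm::sys::fs::rename(DownloadedTmp, OutputPath);
}

// Hypothetical helper: finalize the locally computed buffer.
std::error_code finalizeComputedBuffer(llvm::StringRef OutputPath,
                                       llvm::StringRef ComputedTmp) {
  // The same pattern; whichever rename lands last "wins", which is fine
  // because the buffers are expected to be bit-identical (verified by
  // the determinism checks).
  return llvm::sys::fs::rename(ComputedTmp, OutputPath);
}
```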
This patch makes the ThinLTO steps asynchronous, adds async APIs to `ModuleCacheEntry`, and actually implements `tryLoadingBuffer()` asynchronously. For now, `write()` remains synchronous.
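Not the actual patch, but a minimal sketch of what a callback-based async lookup API could look like (all names beyond `ModuleCacheEntry`, `tryLoadingBuffer`, and `write` are hypothetical, and the real interface may differ):

```cpp
// Hypothetical shape of an asynchronous cache-lookup API.
#include <functional>
#include <memory>
#include "llvm/Support/Error.h"
#include "llvm/Support/MemoryBuffer.h"

class ModuleCacheEntry {
public:
  virtual ~ModuleCacheEntry() = default;

  // Kicks off the lookup and returns immediately; Callback fires on an
  // unspecified thread once the cached buffer is available (or the
  // lookup fails / misses).
  virtual void tryLoadingBufferAsync(
      std::function<void(llvm::Expected<std::unique_ptr<llvm::MemoryBuffer>>)>
          Callback) = 0;

  // Writing the computed buffer stays synchronous for now.
  virtual void write(const llvm::MemoryBuffer &Buffer) = 0;
};
```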
It assumed the cache is always written before the object file.
Force-pushed from e68c156 to 8fc1687 (compare)
Both @cachemeifyoucan and @akyrtzi wanted to avoid diverging even more than we currently do from the upstream ThinLTO implementation. The newest commit makes the changes to …
Some small nitpicks. The diff is still quite large but seems more manageable.
```cpp
});
}

// Parallel optimizer + codegen
```
Is this block the same as the one below?
No, this block is pretty different from the existing logic. Only the chunk of code between lines 1960 and 1986 is identical, and I think it's still better to duplicate that in order to make the downstream diff more self-contained.
```cpp
@@ -1364,6 +1526,48 @@ ThinLTOCodeGenerator::writeGeneratedObject(StringRef OutputPath,
  return std::string(OutputPath);
}

template <class T>
```
Nit: Move to a local header in lib/LTO
It's only used in this .cpp file, so I'd prefer to keep it here to reduce the spread of our downstream diff.
[ThinLTO] Asynchronous caching (cherry picked from commit 03af59c)
This patch makes ThinLTO caching asynchronous, solving the problem of I/O-bound cache operations blocking the thread pool for CPU-bound codegen.
We spawn asynchronous cache-lookup queries for all modules at the start and then proactively start performing codegen locally. Each module thus has these two tasks racing. In the normal mode, we pick the winner to be written out as the final object file. With determinism checks on, we wait for both tasks to finish and compare their results.
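As an illustration of the racing scheme, here is a plain standard-library sketch (not the patch's actual code; all names are made up):

```cpp
// Sketch of racing a cache lookup against local codegen: the first task
// to finish publishes its result. With determinism checks we'd instead
// inspect both results and compare them.
#include <atomic>
#include <functional>
#include <future>
#include <optional>
#include <string>
#include <utility>

struct ModuleJob {
  std::atomic<bool> Resolved{false};
  std::promise<std::string> Winner; // holds the final object-file buffer

  void publish(std::string Buffer) {
    // Only the first finisher gets to set the promise.
    if (!Resolved.exchange(true))
      Winner.set_value(std::move(Buffer));
  }
};

std::string
raceLookupAndCodegen(ModuleJob &Job,
                     std::function<std::optional<std::string>()> CacheLookup,
                     std::function<std::string()> Codegen) {
  auto Result = Job.Winner.get_future();
  auto Lookup = std::async(std::launch::async, [&] {
    if (auto Hit = CacheLookup()) // a miss simply never publishes
      Job.publish(std::move(*Hit));
  });
  auto Local = std::async(std::launch::async, [&] { Job.publish(Codegen()); });
  // In the real patch the loser keeps running (or gets cancelled) in the
  // background; std::async's blocking destructors make this sketch join
  // both tasks before returning, which is what determinism checks need
  // anyway.
  return Result.get();
}
```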
Our current implementation of the remote caching service doesn't handle concurrent read and write requests well, so we defer cache writes until all reads have finished.
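The "writes wait for reads" barrier could be as simple as a counter plus a condition variable. Again a sketch with illustrative names, not code from the patch:

```cpp
// Sketch of a read barrier: cache writes block until every outstanding
// cache read has completed.
#include <condition_variable>
#include <mutex>

class CacheReadBarrier {
  std::mutex M;
  std::condition_variable AllReadsDone;
  unsigned PendingReads = 0;

public:
  void readStarted() {
    std::lock_guard<std::mutex> L(M);
    ++PendingReads;
  }
  void readFinished() {
    std::lock_guard<std::mutex> L(M);
    if (--PendingReads == 0)
      AllReadsDone.notify_all();
  }
  // Called before issuing any cache write.
  void waitForReads() {
    std::unique_lock<std::mutex> L(M);
    AllReadsDone.wait(L, [&] { return PendingReads == 0; });
  }
};
```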
We have a cancellation mechanism in place so that even with a particularly slow remote cache, we're not much slower than with no cache at all.
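The cancellation bound could amount to a timed wait on the lookup; the 5-second constant below mirrors the hard-coded timeout mentioned in this description, but the function itself is hypothetical:

```cpp
// Sketch: give a slow remote lookup a bounded wait; on timeout, treat
// it as a cache miss and let the locally computed object win.
#include <chrono>
#include <future>
#include <optional>
#include <string>

std::optional<std::string>
lookupWithTimeout(std::future<std::optional<std::string>> &Lookup,
                  std::chrono::seconds Timeout = std::chrono::seconds(5)) {
  if (Lookup.wait_for(Timeout) == std::future_status::ready)
    return Lookup.get();
  return std::nullopt; // Too slow: fall back to local codegen's result.
}
```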
Below are my wall-time measurements of the parallel part of `ThinLTOCodeGenerator::run()`, optimizing 3019 modules using our remote caching service.

Before this patch, using a particularly slow cache would result in linear, unbounded growth of the wall time, while the worst case after this patch is essentially an empty local cache plus some timeout constant (currently hard-coded to 5s).