-
Notifications
You must be signed in to change notification settings - Fork 787
[SYCL] Implement per-aspect device code split #7302
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[SYCL] Implement per-aspect device code split #7302
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is an initial version to get early feedback, there are still some bits missing:
- no tests for ESIMD + aspects
- no tests for doubleGFR + aspects
- aspects
1, 2
treated like they are not the same as aspects2, 1
, which is a bug - global variables are not split into separate modules if they are of optional types
Tagging @asudarsa, @sarnex, @kbobrovs - I could use your feedback folks to see if there are major comments/objections/suggestions about the approach in general
return Ret; | ||
} | ||
|
||
SmallSet<int, 4> Aspects; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It looks like this container doesn't maintain a stable ordering of items, which is a problem: there is a test in this PR, which adds aspects 1
and 2
in different order and they are treated as two distinct groups of aspects, which is not correct
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe SmallSetVector works?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually, my thumbs up is premature.
(Small)SetVector
iterator guarantees iteration in the insertion order. That's not what we need. While operator ==
will work with any insertion and iteration order, hash
needs sorted order, not insertion order - meeting 1
first then 2
will give the same property set as 2
then 1
. So the hash should not depend on the insertion order.
std::set will work for this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I thought about this and since we don't modify this container here like ever (we only construct it once), it should be enough to simply have a vector + sort it in constructor. Presumably, amount of aspects won't be any huge to cause any significant slowdowns.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
When processing kernel A
we may encounter property 1
first then property 2
, so they will be recorded as {1, 2}
in the Aspects for this kernel, and for kernel B
we may encounter properties in different order and record as { 2, 1 }
.
So this function
unsigned hash() const {
llvm::hash_code AspectsHash =
llvm::hash_combine_range(Aspects.begin(), Aspects.end());
return static_cast<unsigned>(llvm::hash_combine(AspectsHash));
}
will return different result for A and B. Am I missing something?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
will return different result for A and B
Yeah, that is the problem which existing SmallSet
. My idea is to use SmallVector
and have it sorted in KernelProperties
constructor. This way, hash
will be the same regardless of order in which aspects were discovered.
I've implemented that approach in bd2ae95
const bool SplitByScope = ScopedSplitter->totalSplits() > 1; | ||
Modified |= SplitByScope; | ||
Modified |= SplitByProperties; | ||
|
||
// TODO this nested splitting scheme will not scale well when other split |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Note: I do not plan to resolve this TODO
in this PR. My current idea is to move doublegfr
splitter to the new properties
splitter and thus reduce the amount of split "dimensions" we have.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No flags to that idea, eventually we want the large grf stuff to be specified using kernel properties, so having them be treated the same here makes sense
KernelProperties() = default; | ||
|
||
KernelProperties(const Function *F) { | ||
if (const MDNode *MDN = F->getMetadata("sycl_used_aspects")) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I thought @steffenlarsen made some changes to sycl-post-link splitting for his work, but I don't see it in this history for this file. Is this intended to implement the stuff he's doing too in the future besides sycl_used_aspects?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What @steffenlarsen implemented is support for sycl_ext_oneapi_kernel_properties
extension. At the moment, it simply provides an alternative interface for 4 core SYCL attributes (reqd_sub_group_size
, reqd_work_group_size
, work_group_size_hint
?, device_has
). Under the hood, we use the same LLVM IR representation for both attributes and properties (but the alignment might happen at different stages/in different places for different attributes/property pairs).
So, the idea is that this code will handle both in assumption that unification has happened prior it. When introducing new properties or attributes, I suppose that each will be designed case-by-case, but we should aim to have a similar handling flow for all of them.
For example, this new splitter I'm adding is also supposed to split modules by their reqd_sub_group_size
, for example, but I will add that in a separate PR not to overcomplicate this one and have some baseline/infrastructure ready first
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sounds good, thanks
No flags, I made some comments inline |
@@ -761,5 +761,138 @@ getDoubleGRFSplitter(ModuleDesc &&MD, bool EmitOnlyKernelsAsEntryPoints) { | |||
return std::make_unique<ModuleCopier>(std::move(MD), std::move(Groups)); | |||
} | |||
|
|||
namespace { | |||
struct KernelProperties { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: this struct works with aspects, so maybe the name could reflect that rather than 'properties'
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The idea is to expand it further to also handle things like reqd_sub_group_size
, large_gfr
, etc. So, the intent of is to be some abstract properties. Perhaps we can come up with a different name like "device requirements" or "optional features"
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"Features" seem nice
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Renamed to UsedOptionalFeatures
in 18ba12c
…aspect-device-code-split
…aspect-device-code-split
/summary:run |
struct UsedOptionalFeatures { | ||
SmallVector<int, 4> Aspects; | ||
// TODO: extend this further with reqd-sub-group-size, reqd-work-group-size, | ||
// double-grf and other properties |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think we need to worry about reqd-sub-group-size
and reqd-work-group-size
since they are required features through their Clang attributes - and by association the corresponding properties - and they are applied to entry points.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
They are applied, but if we have two kernels with different required sub-group sizes, they must be split into a separate modules or otherwise JIT compilation might fail, because not all features are required to be supported by all devices for all sub-group sizes. The same is applicable for required work-group size, because different devices might have different supported work-group sizes.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM.
One final NIT - it might make sense to add XFAIL test with function pointers.
if (!isEntryPoint(F, EmitOnlyKernelsAsEntryPoints) || | ||
!MD.isEntryPointCandidate(F)) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This "||" is confusing for an uninformed reader (like me). I would have expected to see "&&" instead, based on the methods called.
If the call is correct, can you add clarifying comment to the sources, please?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looking at isEntryPointCandidate
description it seems to me that the description is correct, i.e. we might have an entry point in a module, which actually belongs to a different device image and from that point of view we need to use ||
here to avoid inclusion of that entry point to more than one module.
However, I don't quite understand how it is possible to get a module that contains entry points which belong to another module. That "entry point candidate" mechanism was introduced in 9a55da5 as a replacement for "allowed entries" introduced by 8c7bb45. Perhaps @kbobrovs can share some background here?
/summary:run |
thanks for addressing comments, lgtm thanks |
All my stylistic comments have been addressed (except #7302 (comment) that would need more discussion, I guess), thanks! I'm not familiar with the subject area so will let others give the final approval. No objections from my side. |
@AlexeySachkov, please handle/explain fails. |
That doesn't work today btw. |
intel/llvm#7302 brough per-aspects device code split into the compiler, which means that there is no need to specify explicit per-kernel split to avoid issues with fp64 and speculative compilation.
) #### Intro This is a refactoring of how we perform device code split in `sycl-post-link`, which is intended to solve several existing issues with the current implementation: 1. increased peak RAM consumption by `sycl-post-link` 2. bad scaling with more and more split "dimensions" being added 3. increased tests maintenance cost due to non-deterministic order (between commits) of output files produced by `sycl-post-link` #### A bit more context about the issues above: (1) Increase peak RAM consumption is caused by the fact that we currently preserve **all** splits in-memory, even though we can process them on-by-one and discard them as soon as we stored them to a disk. This was implemented as a memory consumption optimization in #5021, but it got accidentally reverted in #7302 as an attempt to workaround (2). (2) is pretty much summarized in our source code: https://github.com/intel/llvm/blob/afebb2543ccecb89f83c84b68fba7616bbab89ac/llvm/tools/sycl-post-link/sycl-post-link.cpp#L806-L811 (3) is caused by a bad implementation decision made in #7302: because every split is now identified by a hash, every time you add a new split "dimension"/new feature to an account, it results in different hashes for existing tests. Just look how many unrelated tests had to be updated in #7512, #8056 and #8167 #### Now to the PR itself: It introduces a new infrastructure for categorizing/grouping kernel functions: instead of using hashes, we now build a string description for each kernel function and then group kernels with the same description string together. String description is built by a new entity: it accepts a set of rules, where each rule is a simple function which returns a string for passed `llvm::Function`. Results of all rules are concatenated together and rules are invoked in a stable order of their registration. There is a simple API for building those rules. It provides some predefined rules for the most popular use cases like turning a function attribute or a metadata into a string descriptor for the function. There is also a possibility to pass a custom callback there to implement more complicated logic. #### How does this PR help with issues above? (1) and (2) are fixed in conjunction: `sycl-post-link` was refactored to avoid storing more than one split module at a time and that is possible because the PR unifies per-scope and optional-kernel-features splitters into a single generic splitter. The new API for kernels categorization seems to be flexible enough to provide that infrastructure so merged splitters still look OK code-wise. (3) is caused by using string identifiers instead of hashes as well as by using a data structure which sorts identifiers. #### Any other benefits from this PR? About 50 lines of code less to support :) Extending device code split for more optional features would be even easier than it is now: instead of adding several changes to various places around `UsedOptionalFeatures` structure, it will be just adding a 1-3 lines of code. Please also note that `UsedOptionalFeatures` contains tons of inconsistencies in its implementation, which will all gone with this PR: in `operator==` we don't use hash and instead compare certain fields directly (and we do miss some of them); `generateModuleName` method skips some of optional features and ignores them. Cross-module `device_global` usages checks should now work at all split dimensions (except for ESIMD). #### Any potential downsides? With current `UsedOptionalFeatures` there is a possibility to embed various information (used aspects, `large-grf` flag, etc.) directly during device code split to avoid re-gathering that information later when we generate properties. With the suggested approach, it would be harder to do, because it doesn't seem to naturally fit to the proposed infrastructure: see changes I did around `large-grf` in this PR. However, we have never actually implemented this and re-querying some metadata from function doesn't seem like a bottleneck, so it should really be a very minor and only theoretical downside.
This is a piece of implementation of optional kernels features from SYCL 2020 spec, in accordance with our design doc.
Note: this PR doesn't fully implement the feature, but rather lays a foundation for subsequent incremental updates.
The main TODO items for future PRs:
largeGRF
to use new splitterreqd_sub_group_size
,reqd_work_group_size