
[SYCL] Implement per-aspect device code split #7302


Merged

Conversation

@AlexeySachkov (Contributor) commented Nov 7, 2022

This is a piece of the implementation of optional kernel features from the SYCL 2020 spec, in accordance with our design doc.

Note: this PR doesn't fully implement the feature, but rather lays a foundation for subsequent incremental updates.
The main TODO items for future PRs:

  • refactor handling of largeGRF to use the new splitter
  • add handling for other kernel properties like reqd_sub_group_size, reqd_work_group_size
  • add handling of global variables of optional types
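
For context, here is a minimal sketch (illustrative code, not taken from the PR) of the kind of situation this split addresses: a kernel that uses fp64 and a kernel that does not should land in separate device images, so that the latter can still be JIT-compiled for devices without the fp64 aspect.

```cpp
#include <sycl/sycl.hpp>

int main() {
  sycl::queue Q;

  // Uses double, so its device image requires aspect::fp64.
  Q.single_task<class UsesFp64>([] {
    volatile double D = 3.14;
    (void)D;
  });

  // Uses only float; with per-aspect device code split this kernel ends up in
  // a separate device image and can still be JIT-compiled on devices that do
  // not support fp64.
  Q.single_task<class NoFp64>([] {
    volatile float F = 3.14f;
    (void)F;
  });

  Q.wait();
  return 0;
}
```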

@AlexeySachkov (Contributor, Author) left a comment

This is an initial version to get early feedback; there are still some bits missing:

  • no tests for ESIMD + aspects
  • no tests for double-GRF + aspects
  • aspects {1, 2} are treated as distinct from aspects {2, 1}, which is a bug
  • global variables are not split into separate modules if they are of optional types

Tagging @asudarsa, @sarnex, @kbobrovs: I could use your feedback, folks, to see whether there are any major comments/objections/suggestions about the approach in general.

return Ret;
}

SmallSet<int, 4> Aspects;
Contributor Author

It looks like this container doesn't maintain a stable ordering of its items, which is a problem: there is a test in this PR which adds aspects 1 and 2 in a different order, and they are treated as two distinct groups of aspects, which is not correct.

Contributor

Maybe SmallSetVector works?

@kbobrovs (Contributor) Nov 8, 2022

Actually, my thumbs up is premature.
(Small)SetVector's iterator guarantees iteration in insertion order, which is not what we need. While operator== will work with any insertion and iteration order, the hash needs a sorted order rather than insertion order: encountering aspect 1 first and then 2 must give the same property set as encountering 2 and then 1, so the hash must not depend on the insertion order.

std::set will work for this.

Contributor Author

I thought about this, and since we never modify this container (we only construct it once), it should be enough to simply use a vector and sort it in the constructor. Presumably, the number of aspects will never be large enough to cause any significant slowdown.

Contributor

When processing kernel A we may encounter property 1 first and then property 2, so they will be recorded as {1, 2} in Aspects for this kernel, while for kernel B we may encounter the properties in a different order and record them as {2, 1}.
So this function

unsigned hash() const {
  llvm::hash_code AspectsHash =
          llvm::hash_combine_range(Aspects.begin(), Aspects.end());
  return static_cast<unsigned>(llvm::hash_combine(AspectsHash));
}

will return different results for A and B. Am I missing something?

@AlexeySachkov (Contributor, Author) Nov 11, 2022

> will return different results for A and B

Yeah, that is the problem with the existing SmallSet. My idea is to use a SmallVector and have it sorted in the KernelProperties constructor. This way, the hash will be the same regardless of the order in which aspects were discovered.

I've implemented that approach in bd2ae95
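
A minimal sketch of that approach (member and helper naming here is illustrative, and the exact shape of the sycl_used_aspects metadata — integer constant operands — is an assumption, so this is not necessarily what landed in bd2ae95):

```cpp
#include "llvm/ADT/Hashing.h"
#include "llvm/ADT/STLExtras.h"
#include "llvm/ADT/SmallVector.h"
#include "llvm/IR/Constants.h"
#include "llvm/IR/Function.h"
#include "llvm/IR/Metadata.h"

using namespace llvm;

struct KernelProperties {
  SmallVector<int, 4> Aspects;

  KernelProperties(const Function *F) {
    if (const MDNode *MDN = F->getMetadata("sycl_used_aspects")) {
      for (const MDOperand &Op : MDN->operands())
        if (auto *C = mdconst::dyn_extract<ConstantInt>(Op))
          Aspects.push_back(static_cast<int>(C->getZExtValue()));
      // Canonicalize once at construction time: with a sorted vector the hash
      // no longer depends on the order in which aspects were discovered.
      llvm::sort(Aspects);
    }
  }

  unsigned hash() const {
    return static_cast<unsigned>(
        hash_combine_range(Aspects.begin(), Aspects.end()));
  }
};
```

Sorting also makes a plain element-wise operator== order-independent for free.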

const bool SplitByScope = ScopedSplitter->totalSplits() > 1;
Modified |= SplitByScope;
Modified |= SplitByProperties;

// TODO this nested splitting scheme will not scale well when other split
Contributor Author

Note: I do not plan to resolve this TODO in this PR. My current idea is to move the double-GRF splitter into the new properties splitter and thus reduce the number of split "dimensions" we have.

Contributor

No flags on that idea; eventually we want the large-GRF stuff to be specified using kernel properties, so having it treated the same way here makes sense.

KernelProperties() = default;

KernelProperties(const Function *F) {
if (const MDNode *MDN = F->getMetadata("sycl_used_aspects")) {
Contributor

I thought @steffenlarsen made some changes to sycl-post-link splitting for his work, but I don't see them in the history of this file. Is this intended to also implement the stuff he's doing in the future, besides sycl_used_aspects?

Contributor Author

What @steffenlarsen implemented is support for the sycl_ext_oneapi_kernel_properties extension. At the moment, it simply provides an alternative interface for 4 core SYCL attributes (reqd_sub_group_size, reqd_work_group_size, work_group_size_hint?, device_has). Under the hood, we use the same LLVM IR representation for both attributes and properties (though the alignment might happen at different stages/in different places for different attribute/property pairs).

So, the idea is that this code will handle both, under the assumption that the unification has happened before it. When introducing new properties or attributes, I suppose each will be designed case-by-case, but we should aim to have a similar handling flow for all of them.

For example, this new splitter I'm adding is also supposed to split modules by their reqd_sub_group_size, but I will add that in a separate PR so as not to overcomplicate this one and to have some baseline/infrastructure ready first.

Contributor
Sounds good, thanks

@sarnex (Contributor) commented Nov 7, 2022

No flags, I made some comments inline

@@ -761,5 +761,138 @@ getDoubleGRFSplitter(ModuleDesc &&MD, bool EmitOnlyKernelsAsEntryPoints) {
return std::make_unique<ModuleCopier>(std::move(MD), std::move(Groups));
}

namespace {
struct KernelProperties {
Contributor

nit: this struct works with aspects, so maybe the name could reflect that rather than 'properties'

Contributor Author

The idea is to expand it further to also handle things like reqd_sub_group_size, large_grf, etc., so the intent is for it to represent some abstract properties. Perhaps we can come up with a different name, like "device requirements" or "optional features".

Contributor

"Features" seem nice

Contributor Author

Renamed to UsedOptionalFeatures in 18ba12c

@AlexeySachkov AlexeySachkov marked this pull request as ready for review November 17, 2022 13:38
@AlexeySachkov AlexeySachkov requested a review from a team as a code owner November 17, 2022 13:38
@AlexeySachkov AlexeySachkov requested a review from a team November 17, 2022 13:38
@AlexeySachkov (Contributor, Author)

/summary:run

struct UsedOptionalFeatures {
SmallVector<int, 4> Aspects;
// TODO: extend this further with reqd-sub-group-size, reqd-work-group-size,
// double-grf and other properties
Contributor

I don't think we need to worry about reqd-sub-group-size and reqd-work-group-size since they are required features through their Clang attributes - and by association the corresponding properties - and they are applied to entry points.

Contributor Author

They are applied, but if we have two kernels with different required sub-group sizes, they must be split into separate modules; otherwise JIT compilation might fail, because devices are not required to support every sub-group size. The same applies to the required work-group size, because different devices might support different work-group sizes.
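
To illustrate (a hypothetical example, not taken from the PR's tests): a device that only supports sub-group size 16 can still JIT-compile the first of the two kernels below, as long as the two land in separate device images.

```cpp
#include <sycl/sycl.hpp>

void submit(sycl::queue &Q) {
  // Requires sub-group size 16.
  Q.parallel_for<class SubGroup16>(
      sycl::nd_range<1>{64, 16},
      [=](sycl::nd_item<1>) [[sycl::reqd_sub_group_size(16)]] {});

  // Requires sub-group size 32; a device that does not support this size
  // would fail to JIT-compile a module that also contains this kernel.
  Q.parallel_for<class SubGroup32>(
      sycl::nd_range<1>{64, 32},
      [=](sycl::nd_item<1>) [[sycl::reqd_sub_group_size(32)]] {});
}
```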

@kbobrovs (Contributor) left a comment

LGTM.
One final nit: it might make sense to add an XFAIL test with function pointers.

Comment on lines +879 to +880
if (!isEntryPoint(F, EmitOnlyKernelsAsEntryPoints) ||
!MD.isEntryPointCandidate(F)) {
Contributor

This "||" is confusing for an uninformed reader (like me). I would have expected to see "&&" instead, based on the methods called.

If the call is correct, can you add clarifying comment to the sources, please?

Contributor Author

Looking at the isEntryPointCandidate description, it seems to me that the call is correct, i.e. we might have an entry point in a module which actually belongs to a different device image, and from that point of view we need to use || here to avoid including that entry point in more than one module.

However, I don't quite understand how it is possible to get a module that contains entry points which belong to another module. That "entry point candidate" mechanism was introduced in 9a55da5 as a replacement for the "allowed entries" introduced by 8c7bb45. Perhaps @kbobrovs can share some background here?
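
For reference, the requested clarifying comment could read something like the sketch below (hypothetical helper signatures standing in for the real ones, not the actual sycl-post-link sources):

```cpp
#include "llvm/IR/Function.h"

// Hypothetical signatures matching the calls in the snippet above.
bool isEntryPoint(const llvm::Function &F, bool EmitOnlyKernelsAsEntryPoints);
struct ModuleDesc {
  bool isEntryPointCandidate(const llvm::Function &F) const;
};

// Returns true if F must be skipped by this splitter: F is either not an
// entry point at all, or it is an entry point that belongs to a different
// device image. The "||" (rather than "&&") means a function is grouped here
// only when it is both an entry point and a candidate for this module, so
// every entry point is emitted into exactly one module.
bool shouldSkip(const llvm::Function &F, const ModuleDesc &MD,
                bool EmitOnlyKernelsAsEntryPoints) {
  return !isEntryPoint(F, EmitOnlyKernelsAsEntryPoints) ||
         !MD.isEntryPointCandidate(F);
}
```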

@AlexeySachkov (Contributor, Author)

/summary:run

@sarnex (Contributor) commented Nov 21, 2022

Thanks for addressing the comments, LGTM.

@aelovikov-intel (Contributor) commented Nov 21, 2022

All my stylistic comments have been addressed (except #7302 (comment), which would need more discussion, I guess), thanks!

I'm not familiar with the subject area so will let others give the final approval. No objections from my side.

@pvchupin (Contributor)

@AlexeySachkov, please handle/explain the failures.

@pvchupin (Contributor)

/summary:run

That doesn't work today btw.

@AlexeySachkov (Contributor, Author)

@pvchupin, from what I see, the only failure in all configurations is a typo in the warning message about SYCL_DEVICE_FILTER; it was fixed in #7454.

@pvchupin pvchupin merged commit 9a2c4fe into intel:sycl Nov 21, 2022
AlexeySachkov added a commit to AlexeySachkov/llvm-test-suite that referenced this pull request Dec 19, 2022
intel/llvm#7302 brought per-aspect device code split into the compiler, which means there is no need to specify an explicit per-kernel split to avoid issues with fp64 and speculative compilation.
@AlexeySachkov AlexeySachkov deleted the private/asachkov/per-aspect-device-code-split branch March 29, 2023 12:26
AlexeySachkov added a commit that referenced this pull request Apr 28, 2023

#### Intro

This is a refactoring of how we perform device code split in
`sycl-post-link`, which is intended to solve several existing issues
with the current implementation:
1. increased peak RAM consumption by `sycl-post-link`
2. bad scaling with more and more split "dimensions" being added
3. increased test maintenance cost due to the non-deterministic order (between commits) of output files produced by `sycl-post-link`

#### A bit more context about the issues above:

(1) Increased peak RAM consumption is caused by the fact that we currently preserve **all** splits in memory, even though we could process them one-by-one and discard each as soon as it has been stored to disk. This was implemented as a memory consumption optimization in #5021, but it got accidentally reverted in #7302 as an attempt to work around (2).

(2) is pretty much summarized in our source code:

https://github.com/intel/llvm/blob/afebb2543ccecb89f83c84b68fba7616bbab89ac/llvm/tools/sycl-post-link/sycl-post-link.cpp#L806-L811

(3) is caused by a bad implementation decision made in #7302: because every split is now identified by a hash, every time you take a new split "dimension"/new feature into account, it results in different hashes for existing tests. Just look at how many unrelated tests had to be updated in #7512, #8056 and #8167.

#### Now to the PR itself:

It introduces a new infrastructure for categorizing/grouping kernel
functions: instead of using hashes, we now build a string description
for each kernel function and then group kernels with the same
description string together.

The string description is built by a new entity: it accepts a set of rules, where each rule is a simple function which returns a string for a passed `llvm::Function`. The results of all rules are concatenated together, and the rules are invoked in the stable order of their registration.

There is a simple API for building those rules. It provides some predefined rules for the most popular use cases, like turning a function attribute or a metadata node into a string descriptor for the function. There is also a possibility to pass a custom callback to implement more complicated logic.
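
A rough sketch of what such a rule-based categorizer could look like (the names below are illustrative, not the actual sycl-post-link API):

```cpp
#include <functional>
#include <string>
#include <utility>
#include <vector>

#include "llvm/IR/Function.h"

// Builds a string description for a function from a set of registered rules;
// functions with equal descriptions are grouped into the same split module.
class FunctionCategorizer {
  using Rule = std::function<std::string(const llvm::Function &)>;
  std::vector<Rule> Rules;

public:
  // Rules are applied in the stable order of their registration.
  void registerRule(Rule R) { Rules.emplace_back(std::move(R)); }

  std::string describe(const llvm::Function &F) const {
    std::string Desc;
    for (const Rule &R : Rules) {
      Desc += R(F);
      Desc += "-";
    }
    return Desc;
  }
};
```

A predefined rule for, say, a metadata node would then just be a small lambda passed to registerRule.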

#### How does this PR help with issues above?

(1) and (2) are fixed in conjunction: `sycl-post-link` was refactored to avoid storing more than one split module at a time, and that is possible because the PR unifies the per-scope and optional-kernel-features splitters into a single generic splitter. The new API for kernel categorization seems to be flexible enough to provide that infrastructure, so the merged splitters still look OK code-wise.

(3) is fixed by using string identifiers instead of hashes, as well as by using a data structure which sorts identifiers.

#### Any other benefits from this PR?

About 50 fewer lines of code to support :)

Extending device code split for more optional features will be even easier than it is now: instead of making several changes in various places around the `UsedOptionalFeatures` structure, it will just be a matter of adding 1-3 lines of code. Please also note that `UsedOptionalFeatures` contains a number of inconsistencies in its implementation, which will all be gone with this PR: in `operator==` we don't use the hash and instead compare certain fields directly (and we miss some of them), and the `generateModuleName` method skips and ignores some of the optional features.

Cross-module `device_global` usage checks should now work across all split dimensions (except for ESIMD).

#### Any potential downsides?

With the current `UsedOptionalFeatures` there is a possibility to embed various information (used aspects, the `large-grf` flag, etc.) directly during device code split, to avoid re-gathering that information later when we generate properties. With the suggested approach, that would be harder to do, because it doesn't seem to fit naturally into the proposed infrastructure: see the changes I made around `large-grf` in this PR.

However, we have never actually implemented this, and re-querying some metadata from a function doesn't seem like a bottleneck, so it should really be a very minor and purely theoretical downside.