[SYCL][ESIMD] Support 64-bit offsets for stateless accessor gather/scatter #9462

Merged: 5 commits merged into intel:sycl on May 22, 2023

Conversation

@sarnex (Contributor) commented May 15, 2023

I manually ran the test on PVC.

After this change, I'll add support for more APIs per-PR.

@sarnex sarnex marked this pull request as ready for review May 15, 2023 17:41
@sarnex sarnex requested a review from a team as a code owner May 15, 2023 17:41
Review thread on this diff hunk:

    scatter(AccessorTy acc, simd<uint32_t, N> offsets, simd<T, N> vals,
            uint32_t glob_offset = 0, simd_mask<N> mask = 1) {
    scatter(AccessorTy acc,
    #ifdef __ESIMD_FORCE_STATELESS_MEM
Contributor commented:

The test below compiled for me without the -fsycl-esimd-force-stateless-mem switch. I believe the intention was to have 64-bit offsets for -fsycl-esimd-force-stateless-mem and 32-bit offsets for pure accessors. If so, why not make the offset type a template parameter and impose the conditions with a static_assert? That way you would also allow 32-bit offsets under the -fsycl-esimd-force-stateless-mem switch. It seems simd just silently converts the data, so you don't actually impose any conditions on the offset type.

Contributor commented:

Originally I thought that 32-bit offsets could be more efficient for the stateless variant, but a 32-bit offset vector still hits this conversion line: https://github.com/intel/llvm/blob/sycl/sycl/include/sycl/ext/intel/esimd/memory.hpp#L136 for gathers and https://github.com/intel/llvm/blob/sycl/sycl/include/sycl/ext/intel/esimd/memory.hpp#L214 for scatters.
If so, then having a simd<uint64_t, N> type for offsets in stateless mode seems reasonable to me.

@v-klochkov (Contributor) commented May 17, 2023:

Maybe it was a previous mistake to have the simd<Toffset, N> type for 'offsets' here: https://github.com/intel/llvm/blob/sycl/sycl/include/sycl/ext/intel/esimd/memory.hpp#L131

If the conversion to a vector of uint64_t happens anyway, then why not simplify that prototype and specify it there as:

template <typename Tx, int N>
__ESIMD_API simd<Tx, N> gather(const Tx *p, simd<uint64_t, N> offsets,
                               simd_mask<N> mask = 1) {

and then remove the line https://github.com/intel/llvm/blob/sycl/sycl/include/sycl/ext/intel/esimd/memory.hpp#L136

Contributor commented:

Gregory, by "It seems simd just silently converts data so you don't actually impose any conditions on offset type", do you mean that a 64-bit offset passed to the stateful gather is silently truncated to a 32-bit offset, which doesn't happen for the stateless gather?

The approach used here is consistent with the previous PR: #9160. AFAIR, we discussed this during a teams meeting and it did not meet objections. Do you think it is a mistake that 9160 does not report a static error on an attempt to pass a uint64_t offset to the stateful block load?

@sarnex (Author) commented:

Thanks for the review.

In our discussion we thought that, for the simd-vector-of-offsets case, using the stateful API with 64-bit offsets would be a compile-time error because of template conversion rules. It turns out that for this API the conversion does happen, so IMO we should make sure some error or warning is thrown. The glob_offset parameter does warn as expected, though.

Note that the stateful case goes through gather_impl, which does not hit the uint64 conversion linked above.

Let me add a commit that does this; I think it matches what Gregory is suggesting.

@sarnex (Author) commented:

> Do you think it is a mistake that 9160 does not report a static error on an attempt to pass a uint64_t offset to the stateful block load?

We throw a warning in that case, so IMO it's fine. The problem is that we don't throw any warning or error in this case. My latest commit adds an error.

@sarnex (Author) commented May 17, 2023:

If we look at the CI failure, we see that we actually rely on the conversion today:

https://github.com/intel/llvm/blob/sycl/sycl/test-e2e/ESIMD/accessor_gather_scatter.hpp#L37

The result of

    simd<uint32_t, VL> offsets(0, STRIDE);
    ...
    offsets * sizeof(T)

is

    simd<unsigned long, N>

so my change causes an error where this used to work.

Maybe we should just update the test, but users may be relying on this behavior, in which case we should not error?

Contributor commented:

> Maybe it was a previous mistake having simd<Toffset,N> type for 'offsets' here: https://github.com/intel/llvm/blob/sycl/sycl/include/sycl/ext/intel/esimd/memory.hpp#L131
>
> If conversion to vector of uint64_t happens anyway, then why not simplify that prototype and specify it there as:
>
>     template <typename Tx, int N>
>     __ESIMD_API simd<Tx, N> gather(const Tx *p, simd<uint64_t, N> offsets,
>                                    simd_mask<N> mask = 1) {
>
> and then remove the line https://github.com/intel/llvm/blob/sycl/sycl/include/sycl/ext/intel/esimd/memory.hpp#L136

The pleasant side effect of using a templated offset type is catching errors where users flip the values and offsets parameters and end up passing float offsets; the API would otherwise not be able to catch that. There were actual tests with mistakes like that, discovered only after introducing the templated offset parameter.

Contributor commented:

> Gregory, do you mean by this: "It seems simd just silently converts data so you don't actually impose any conditions on offset type" that a 64-bit offset passed to stateful gather is silently truncated to a 32-bit offset, which doesn't happen for stateless gather?
>
> The approach used here is consistent with previous PR: #9160. AFAIR, we discussed this during a teams meeting and it did not meet objections. Do you think it is a mistake in 9160 not reporting a static error on an attempt to pass a uint64_t offset to stateful block load?

As far as I remember, the reason it was done that way in the previous PR was to preserve a warning we were emitting when passing 64-bit offsets to the stateful version. Without getting into a discussion about the correctness of, or need for, that approach: we don't have such an issue here, i.e. no warning is emitted when passing 64-bit offsets, so technically we don't have to follow the same pattern.

@sarnex (Author) commented May 17, 2023:

> As far as I remember, the reason it was done that way in the previous PR was to preserve a warning we were emitting when passing 64-bit offsets to the stateful version.

Yes, that's right.

Here there are two items to look at:

1. The global offset

In the current version of the PR, we preserve the warning in stateful mode with a 64-bit offset because we use an #ifdef to make the parameter type 32-bit. Note that we already have code that passes a 64-bit global offset in stateful mode, in copy_to for accessors.

2. The per-element offset

In the current version of the PR, we do not error or warn about 32 vs 64 bit. This is because the conversion happens anyway (32->64 and 64->32) in both stateful and stateless mode today, and we have at least one test that does this today (in stateful mode), so my intuition is that users may be doing it too.

Review thread on this test hunk:

    q.submit([&](handler &cgh) {
      auto PA = bufa.get_access<access::mode::read_write>(cgh);
      cgh.single_task<class Test>([=]() SYCL_ESIMD_KERNEL {
        uint64_t offsetStart = (Size - VL) * sizeof(uint64_t);
Contributor commented:

I believe here, and in the two lines below, it should be sizeof(uint32_t), since you are manipulating 32-bit data, unless the intent is to skip some data.

@sarnex (Author) commented May 17, 2023:

Yeah, I did intend to skip some data. Right now the test verifies the beginning and end of the allocated uint64 memory; if the uint64 offset isn't applied, we write to the beginning instead, and the test fails because that location is supposed to have a different value.

We can't use 64-bit vectors, so I'm loading/storing 4 bytes while keeping the 64-bit offset to get the behavior above. In the verify part of the test, we cast the data to uint32_t before doing the mathematical operation the device was supposed to do.

If there's a simpler way to do this while still making sure the test fails when the uint64 offset doesn't work, please let me know.

@v-klochkov v-klochkov self-requested a review May 17, 2023 02:20
@v-klochkov (Contributor) commented:

@sarnex - something is wrong with Jenkins/Precommit. Can you please do a 'git merge' with head?
@fineg74 - please re-review the updated fix. Is your comment fully addressed?

sarnex added 5 commits May 19, 2023 12:40
…atter

After this change, I'll add support for more APIs
in a single PR.

Signed-off-by: Sarnie, Nick <[email protected]>
Signed-off-by: Sarnie, Nick <[email protected]>
Signed-off-by: Sarnie, Nick <[email protected]>
Signed-off-by: Sarnie, Nick <[email protected]>
@sarnex (Author) commented May 19, 2023

Done, thanks.

@fineg74 fineg74 self-requested a review May 19, 2023 21:33
@v-klochkov v-klochkov merged commit d04ebb0 into intel:sycl May 22, 2023
3 participants