[SYCL][ESIMD] Support 64-bit offsets for stateless accessor gather/scatter #9462
Conversation
scatter(AccessorTy acc, simd<uint32_t, N> offsets, simd<T, N> vals,
        uint32_t glob_offset = 0, simd_mask<N> mask = 1) {
scatter(AccessorTy acc,
#ifdef __ESIMD_FORCE_STATELESS_MEM
The test below compiled for me without the `-fsycl-esimd-force-stateless-mem` switch. I believe the intention was to have 64-bit offsets for `-fsycl-esimd-force-stateless-mem` and 32-bit for pure accessors. If so, why not have the offset type as a template parameter and impose the conditions using a `static_assert`? That way you would also allow 32-bit offsets with the `-fsycl-esimd-force-stateless-mem` switch. It seems `simd` just silently converts the data, so you don't actually impose any conditions on the offset type.
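The suggestion above can be sketched with a minimal host-side mock (the `simd` struct and `OffsetT` parameter name here are hypothetical stand-ins, not the real ESIMD implementation): take the offset element type as a template parameter and reject unsupported widths at compile time instead of letting a conversion hide them.

```cpp
#include <cstdint>
#include <type_traits>

// Minimal stand-in for esimd::simd, just to illustrate the pattern.
template <typename T, int N> struct simd { T data[N] = {}; };

// Sketch: templated offset type constrained by static_assert.
// In stateful (accessor) mode, 64-bit offsets become a compile-time
// error rather than a silent truncation.
template <typename T, typename OffsetT, int N>
void scatter(T *p, simd<OffsetT, N> offsets, simd<T, N> vals) {
  static_assert(std::is_integral_v<OffsetT>,
                "offsets must be an integral type");
#ifdef __ESIMD_FORCE_STATELESS_MEM
  // Stateless mode: both 32- and 64-bit offsets are acceptable.
  static_assert(sizeof(OffsetT) <= 8, "offset element too wide");
#else
  // Stateful mode: reject 64-bit offsets at compile time.
  static_assert(sizeof(OffsetT) <= 4,
                "stateful scatter requires 32-bit offsets");
#endif
  for (int i = 0; i < N; ++i)
    p[offsets.data[i] / sizeof(T)] = vals.data[i];
}
```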
Originally I thought that 32-bit offsets for the stateless variant could be more efficient, but a 32-bit offset vector still hits this line: https://github.com/intel/llvm/blob/sycl/sycl/include/sycl/ext/intel/esimd/memory.hpp#L136 for gathers and https://github.com/intel/llvm/blob/sycl/sycl/include/sycl/ext/intel/esimd/memory.hpp#L214 for scatters.
If so, then having `simd<uint64_t, N>` as the offset type in stateless mode seems reasonable to me.
Maybe it was a previous mistake to have `simd<Toffset, N>` as the type of `offsets` here: https://github.com/intel/llvm/blob/sycl/sycl/include/sycl/ext/intel/esimd/memory.hpp#L131
If the conversion to a vector of `uint64_t` happens anyway, then why not simplify that prototype and specify it there as:

template <typename Tx, int N>
__ESIMD_API simd<Tx, N> gather(const Tx *p, simd<uint64_t, N> offsets,
                               simd_mask<N> mask = 1) {

and then remove the line https://github.com/intel/llvm/blob/sycl/sycl/include/sycl/ext/intel/esimd/memory.hpp#L136
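For context, the current templated prototype accepts any integral offset type and converts internally, which is the silent-conversion behavior being discussed. A minimal host-side mock of that shape (hypothetical `simd` stand-in, mirroring the conversion at memory.hpp#L136, not the real ESIMD code):

```cpp
#include <cstdint>

// Tiny stand-in for esimd::simd with a converting constructor, which
// is what makes the silent 32->64-bit offset conversion possible.
template <typename T, int N> struct simd {
  T data[N] = {};
  simd() = default;
  template <typename U> simd(const simd<U, N> &other) {
    for (int i = 0; i < N; ++i)
      data[i] = static_cast<T>(other.data[i]);
  }
};

// Mock of the current templated prototype: Toffset is deduced from the
// caller's vector, then widened/narrowed to 64-bit internally.
template <typename Tx, typename Toffset, int N>
simd<Tx, N> gather(const Tx *p, simd<Toffset, N> offsets) {
  simd<std::uint64_t, N> offsets_i = offsets; // internal conversion
  simd<Tx, N> res;
  for (int i = 0; i < N; ++i)
    res.data[i] = *reinterpret_cast<const Tx *>(
        reinterpret_cast<const char *>(p) + offsets_i.data[i]);
  return res;
}
```

Note that fixing the parameter to `simd<uint64_t, N>` instead would also change how `N` is deduced for callers passing 32-bit offset vectors, since template argument deduction does not consider user-defined conversions.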
Gregory, do you mean by "It seems simd just silently converts data so you don't actually impose any conditions on offset type" that a 64-bit offset passed to stateful gather is silently truncated to a 32-bit offset, which doesn't happen for stateless gather?
The approach used here is consistent with the previous PR: #9160
AFAIR, we discussed this during the teams meeting and it did not meet objections. Do you think it is a mistake that 9160 does not report a static error on an attempt to pass a `uint64_t` offset to stateful block load?
thanks for the review.
In our discussion we thought that, for the simd-vector-of-offsets case, using the stateful API with 64-bit offsets would be a compile-time error because of template conversion rules, but it seems this API does convert, so IMO we should make sure some error/warning is thrown. The `glob_offset` parameter does warn as expected, though.
Note the stateful case goes to `gather_impl`, which does not hit the uint64 conversion linked above.
Let me add a commit that does this; I think it does what Gregory is suggesting.
> Do you think it is a mistake in 9160 not reporting a static error on an attempt to pass a uint64_t offset to stateful block load?

We throw a warning in that case, so IMO it's fine. The problem is that we don't throw any warning or error in this case. My latest commit adds an error.
If we look at the CI failure, we see we actually rely on the conversion today:
https://github.com/intel/llvm/blob/sycl/sycl/test-e2e/ESIMD/accessor_gather_scatter.hpp#L37
The result of

simd<uint32_t, VL> offsets(0, STRIDE);
...
offsets * sizeof(T)

is `simd<unsigned long, N>`, so my change causes an error where this used to work.
Maybe we should just update the test, but maybe users are relying on this behavior working and we should not error?
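The widening in `offsets * sizeof(T)` comes from ordinary C++ arithmetic conversions: `sizeof(T)` is a `size_t`, so multiplying a `uint32_t` element by it yields a result with `size_t`'s width, and the simd element type widens accordingly (per the `simd<unsigned long, N>` result observed above). A minimal host-side illustration of the scalar rule:

```cpp
#include <cstdint>
#include <cstddef>
#include <type_traits>

// uint32_t * size_t follows the usual arithmetic conversions: the
// result takes size_t's width (64-bit on LP64 targets, which is why
// `offsets * sizeof(T)` becomes a vector of unsigned long there).
using elem_t = decltype(std::uint32_t{1} * sizeof(int));

static_assert(std::is_same_v<elem_t, std::size_t>,
              "uint32_t * size_t widens to size_t");
```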
> Maybe it was a previous mistake to have simd<Toffset, N> as the type of 'offsets' here: https://github.com/intel/llvm/blob/sycl/sycl/include/sycl/ext/intel/esimd/memory.hpp#L131
> If the conversion to a vector of uint64_t happens anyway, then why not simplify that prototype and specify it there as:
> template <typename Tx, int N> __ESIMD_API simd<Tx, N> gather(const Tx *p, simd<uint64_t, N> offsets, simd_mask<N> mask = 1) {
> and then remove the line https://github.com/intel/llvm/blob/sycl/sycl/include/sycl/ext/intel/esimd/memory.hpp#L136

The pleasant side effect of using a templated offset type is catching errors where users flip the values and offsets parameters and end up passing float offsets, which the API would otherwise not be able to catch. There were actual tests with mistakes like that, discovered only after the templated offset parameter was introduced.
> Gregory, do you mean by this: "It seems simd just silently converts data so you don't actually impose any conditions on offset type" that a 64-bit offset passed to stateful gather is silently truncated to a 32-bit offset, which doesn't happen for stateless gather?
> The approach used here is consistent with the previous PR: #9160. AFAIR, we discussed this during the teams meeting and it did not meet objections. Do you think it is a mistake that 9160 does not report a static error on an attempt to pass a uint64_t offset to stateful block load?

As far as I remember, the reason it was done that way in the previous PR was to preserve a warning we were emitting when passing 64-bit offsets to the stateful version. Without getting into a discussion about the correctness of or need for that approach, we don't have that issue here: no warning is emitted when we pass 64-bit offsets, so technically we don't have to follow the same pattern.
> As far as I remember the reason why it was done in the previous PR was to preserve some warning we were emitting when passing 64-bit offsets to the stateful version.

Yes, that's right.
Here there are two items to look at:
- The global offset: in the current version of the PR, we preserve the warning in stateful mode with a 64-bit offset because we use an `ifdef` to make the parameter type 32-bit. Note that we already have code that passes a 64-bit global offset in stateful mode, in `copy_to` for accessors.
- The per-element offset: in the current version of the PR, we do not error or warn about 32 vs 64 bit. This is because the conversion happened anyway (32->64 and 64->32) in both stateful and stateless mode today, and we have at least one test that does this (in stateful mode), so my intuition is that users may be doing it too.
q.submit([&](handler &cgh) {
  auto PA = bufa.get_access<access::mode::read_write>(cgh);
  cgh.single_task<class Test>([=]() SYCL_ESIMD_KERNEL {
    uint64_t offsetStart = (Size - VL) * sizeof(uint64_t);
I believe here and two lines below it should be `sizeof(uint32_t)`, since you are manipulating 32-bit data, unless the intent is to skip some data.
Yeah, I did intend to skip some data. Right now the test verifies the beginning and end of the allocated `uint64_t` memory, and if the `uint64_t` offset isn't applied, we write to the beginning and the test fails because that location is supposed to hold a different value.
We can't use 64-bit vectors, so I'm loading/storing 4 bytes and keeping the 64-bit offset to get the behavior above. In the verify part of the test we cast the data to `uint32_t` before doing the mathematical operation the device was supposed to do.
If there's a simpler way to do this that still makes the test fail when the `uint64_t` offset doesn't work, please let me know.
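The failure mode the test guards against can be reproduced on the host: if a 64-bit byte offset is silently truncated to 32 bits, the access lands back near the start of the allocation, which is exactly the region the test's begin/end verification checks. A small sketch with hypothetical helper names and an illustrative offset value:

```cpp
#include <cstdint>

// Correct 64-bit path: the full offset is applied.
std::uint64_t apply_offset(std::uint64_t base, std::uint64_t off) {
  return base + off;
}

// What a 32-bit stateful path would silently do: the high bits of
// the offset are dropped before the add.
std::uint64_t apply_offset_truncated(std::uint64_t base,
                                     std::uint64_t off) {
  return base + static_cast<std::uint32_t>(off);
}
```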
…atter After this change, I'll add support for more APIs in a single PR. Signed-off-by: Sarnie, Nick <[email protected]>
Signed-off-by: Sarnie, Nick <[email protected]>
Signed-off-by: Sarnie, Nick <[email protected]>
Signed-off-by: Sarnie, Nick <[email protected]>
Signed-off-by: Sarnie, Nick <[email protected]>
done thx
I manually ran the test on PVC.
After this change, I'll add support for more APIs per-PR.