[SYCL][ESIMD] Add 8/16-bit type support to lsc_block_load/store #6757

fineg74 · 2022-09-10T04:31:06Z

No description provided.

# Conflicts: # sycl/include/sycl/ext/intel/experimental/esimd/memory.hpp

fineg74 · 2022-09-10T04:36:24Z

Test PR: intel/llvm-test-suite#1249

sycl/include/sycl/ext/intel/experimental/esimd/memory.hpp

v-klochkov · 2022-09-16T18:40:15Z

sycl/include/sycl/ext/intel/experimental/esimd/memory.hpp

+  } else if constexpr (_DS == lsc_data_size::u16) {
+    static_assert(NElts % 2 == 0,
+                  "Number of elements is not supported by Transposed store");
+    detail::check_lsc_vector_size<NElts / 2>();


If we choose to translate loads/stores of 8 and 16-bit elements to loads/stores of 32-bit elements, then that allows load/store of (MaxN * 2) 16-bit elements and (MaxN * 4) 8-bit elements.
Don't we want to use 64-bit loads/stores when it is possible? In those cases it would allow loading/storing 2x longer memory chunk of 8/16-bit elements, i.e. MaxN4 for 16-bit elements and MaxN8 for 8-bit elements.

Update: I just realized that the attempt to support bigger/longer vectors of 8/16-bit vectors using 64-bit loads/stores would require stricter requirement to address alignment.

@kbobrovs - FYI. Please confirm we choose to make users responsible to align addresses to 4-byte for 8-16-bit lsc_block_load/stores. and that we don't choose emulation via 64-bit loads/stores, that would require 8-bit alignment. What is your opinion on this?

Good point, @v-klochkov. We need to understand what are requirements for the hardware instruction first, and add them to the doxygen comments of the function as a requirement. Similarly to https://github.com/intel/llvm/blob/sycl/sycl/include/sycl/ext/intel/esimd/memory.hpp#L272.

Note that for block_load we also have alignment template parameter, depending on which one or the other intrinsic is generated. Here I think we always generate single intrinsic and it does not allow to convey alignment info, right? Otherwise we might need to add alignment parameter here as well.

the address passed to intrinsic (and HW instruction) must be 4-byte aligned for load/stores of 32-bit elements, and 8-byte aligned for 64-bit element vectors loads/stores.

@fineg74 can you please add a comment explaining the requirement to alignment of the address passed to the intrinsics.

sycl/include/sycl/ext/intel/experimental/esimd/memory.hpp

into lsc_block

kbobrovs · 2022-09-21T00:09:03Z

sycl/include/sycl/ext/intel/experimental/esimd/memory.hpp

-  return __esimd_lsc_load_stateless<T, L1H, L3H, _AddressScale, _ImmOffset, _DS,
-                                    _VS, _Transposed, N>(pred.data(),
-                                                         addrs.data());
+  constexpr int SmallIntFactor =


Please extend documentation for the lsc_block_load/store and make sure checks verify the constraints:

/// Accesses contiguous block of memory of `NElts * S` bytes starting from /// given address, where S is a byte size of an "element" defined by the \c DS /// template parameter. The maximum size of accessed block is 512 bytes for PVC /// and 256 bytes for ACM (DG2). /// When \с DS equals \с lsc_data_size::u64, the address must be 8-byte aligned, /// otherwise - 4-bytes aligned. Allowed values for the data size are /// \с lsc_data_size::u32 and \с lsc_data_size::u64. Allowed NElts values are /// 1, 2, 3, 4, 8, 16, 32, 64. /// Note that to access 512 bytes, DS must be \с lsc_data_size::u64 and \c NElts /// must be 64.

kbobrovs

Thanks for adding. Some improvements are still needed:

The text I suggested works only for block load/stores (_Transpose=1). Sorry if I was unclear. For gather/scatter constraints are different (e.g. u8u32 и u16u32 data sizes are also allowed)
Please do not duplicate text. I suggest to add it to one of the APIs and refer to it from others.
"access" works for both load and store. E.g. you can add "See comments in the block_load API for description and parameter constraints" to block_store

Also, please add '3' to allowed NElts. (We should have a test on this too) - I missed that initially.

fineg74 · 2022-09-21T19:07:27Z

Thanks for adding. Some improvements are still needed:

The text I suggested works only for block load/stores (_Transpose=1). Sorry if I was unclear. For gather/scatter constraints are different (e.g. u8u32 и u16u32 data sizes are also allowed)

Please do not duplicate text. I suggest to add it to one of the APIs and refer to it from others.
"access" works for both load and store. E.g. you can add "See comments in the block_load API for description and parameter constraints" to block_store

Also, please add '3' to allowed NElts. (We should have a test on this too) - I missed that initially.

The only functions that were changed in this PR are lsc_block_load/store. I didn't touch gather/scatter and didn't update documentation for these functions.

fineg74 added 7 commits September 7, 2022 08:20

initial commit

663a9d9

Add implementation forlsc_block_load for pointers

4d8d5b1

Merge remote-tracking branch 'origin/sycl' into lsc_block

0ca1301

Add support for lsc_block_load/store

41e711f

Merge remote-tracking branch 'origin/sycl' into lsc_block

e11e177

# Conflicts: # sycl/include/sycl/ext/intel/experimental/esimd/memory.hpp

Fix several build issues

aa78201

Fix a merge artefact

37f3087

fineg74 requested a review from a team as a code owner September 10, 2022 04:31

fineg74 mentioned this pull request Sep 10, 2022

[SYCL][ESIMD]Add tests for lsc_block_load supporting 8/16 bit data intel/llvm-test-suite#1249

Merged