Skip to content

Commit c6362a0

Browse files
authored
[ESIMD] Fix perf regression caused by assumed align in block_load(usm) (#11850)
The element-size address alignment is valid from correctness point of view, but using 1-byte and 2-byte alignment implicitly causes performance regression for block_load(const int8_t *, ...) and block_load(const int16_t *, ...) because GPU BE have to generate slower GATHER instead of more efficient BLOCK-LOAD. Without this fix block-load causes up to 44% performance slow-down on some apps that used block_load() with alignment assumptions used before block_load(usm, ..., compile_time_props) was implemented. The reasoning for the expected/assumed alignment from element-size to 4-bytes for byte- and word-vectors is such: The idea of block_load() call (opposing to gather() call) is to have efficient block-load, and thus the assumed alignment is such that allows to generate block-load. This is a bit more tricky for user but that is how block_load/store API always worked before: block-load had restrictions that needed to be honored. To be on safer side, user can always pass the guaranteed alignment. --------- Signed-off-by: Klochkov, Vyacheslav N <[email protected]>
1 parent 24ea162 commit c6362a0

File tree

2 files changed

+22
-14
lines changed

2 files changed

+22
-14
lines changed

sycl/include/sycl/ext/intel/esimd/memory.hpp

Lines changed: 13 additions & 14 deletions
Original file line numberDiff line numberDiff line change
@@ -695,9 +695,11 @@ block_store(Tx *addr, simd<Tx, N> vals, Flags) {
695695
/// the cache_hint::none value is assumed by default.
696696
///
697697
/// Alignment: If \p props does not specify the 'alignment' property, then
698-
/// the default assumed alignment is the minimally required element-size
699-
/// alignment. Note that additional/temporary restrictions may apply
700-
/// (see Restrictions below).
698+
/// the default assumed alignment is 4-bytes for 4-byte or smaller elements
699+
/// and 8-bytes for 8-byte elements. The address may be element-size aligned
700+
/// even for byte- and word-elements, but in such case the smaller alignment
701+
/// property must explicitly passed to this function. Extra restrictions
702+
/// may be in place - see Restrictions/R1 below.
701703
///
702704
/// Restrictions - cache hint imposed - temporary:
703705
/// If L1 or L2 cache hint is passed, then:
@@ -727,21 +729,16 @@ block_load(const T *ptr, PropertyListT props = {}) {
727729
"L3 cache hint is reserved. The old/experimental L3 LSC cache "
728730
"hint is cache_level::L2 now.");
729731

732+
constexpr size_t DefaultAlignment = (sizeof(T) <= 4) ? 4 : sizeof(T);
733+
constexpr size_t Alignment =
734+
detail::getPropertyValue<PropertyListT, alignment_key>(DefaultAlignment);
730735
if constexpr (L1Hint != cache_hint::none || L2Hint != cache_hint::none) {
731736
detail::check_cache_hint<detail::cache_action::load, L1Hint, L2Hint>();
732-
constexpr size_t DefaultAlignment = (sizeof(T) <= 4) ? 4 : sizeof(T);
733-
constexpr size_t Alignment =
734-
detail::getPropertyValue<PropertyListT, alignment_key>(
735-
DefaultAlignment);
736737

737738
simd_mask<1> Mask = 1;
738739
return detail::block_load_impl<T, N, L1Hint, L2Hint>(
739740
ptr, Mask, overaligned_tag<Alignment>{});
740741
} else {
741-
// If the alignment property is not passed, then assume the pointer
742-
// is element-aligned.
743-
constexpr size_t Alignment =
744-
detail::getPropertyValue<PropertyListT, alignment_key>(sizeof(T));
745742
return block_load<T, N>(ptr, overaligned_tag<Alignment>{});
746743
}
747744
}
@@ -763,9 +760,11 @@ block_load(const T *ptr, PropertyListT props = {}) {
763760
/// the cache_hint::none value is assumed by default.
764761
///
765762
/// Alignment: If \p props does not specify the 'alignment' property, then
766-
/// the default assumed alignment is the minimally required element-size
767-
/// alignment. Note that additional/temporary restrictions may apply
768-
/// (see Restrictions below).
763+
/// the default assumed alignment is 4-bytes for 4-byte or smaller elements
764+
/// and 8-bytes for 8-byte elements. The address may be element-size aligned
765+
/// even for byte- and word-elements, but in such case the smaller alignment
766+
/// property must explicitly passed to this function. Extra restrictions
767+
/// may be in place - see Restrictions/R1 below.
769768
///
770769
/// Restrictions - cache hint imposed - temporary:
771770
/// If L1 or L2 cache hint is passed, then:

sycl/test/esimd/memory_properties.cpp

Lines changed: 9 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -64,6 +64,7 @@ SYCL_ESIMD_FUNCTION SYCL_EXTERNAL void foo(AccType &acc,
6464
simd<float, N> pass_thru = 1;
6565
simd<int, N> pass_thrui = 1;
6666
const int *ptri = reinterpret_cast<const int *>(ptrf);
67+
const int8_t *ptrb = reinterpret_cast<const int8_t *>(ptrf);
6768

6869
// CHECK: call <4 x float> @llvm.genx.lsc.load.stateless.v4f32.v1i1.v1i64(<1 x i1> {{[^)]+}}, i8 0, i8 5, i8 2, i16 1, i32 0, i8 3, i8 4, i8 2, i8 0, <1 x i64> {{[^)]+}}, i32 0)
6970
auto d1 = block_load<float, N>(ptrf, props_a);
@@ -187,4 +188,12 @@ SYCL_ESIMD_FUNCTION SYCL_EXTERNAL void foo(AccType &acc,
187188
simd<double, 4> pass_thrud4 = 2.0;
188189
auto lacc_bl6 = block_load<double, 4>(local_acc, byte_offset32, mask,
189190
pass_thrud4, props_a);
191+
192+
// Check the default/assumed alignment when the alignment property is
193+
// not specified explicitly.
194+
// TODO: Extend this kind of tests:
195+
// {usm, acc, local_acc, slm} x {byte, word, dword, qword}.
196+
197+
// CHECK: load <16 x i8>, ptr addrspace(4) {{[^)]+}}, align 4
198+
auto align_check1 = block_load<int8_t, 16>(ptrb);
190199
}

0 commit comments

Comments
 (0)