Commit c8ae6c6

[ESIMD][NFC][DOC] Add load/store/prefetch_2d functions, L1/L2 hint combinations(#13218)
Signed-off-by: Klochkov, Vyacheslav N <[email protected]>
1 parent 76837a1 commit c8ae6c6

1 file changed

+182
-2
lines changed

sycl/doc/extensions/supported/sycl_ext_intel_esimd/sycl_ext_intel_esimd_functions.md

Lines changed: 182 additions & 2 deletions
@@ -6,11 +6,15 @@ See more general ESIMD documentation [here](./sycl_ext_intel_esimd.md).
66

77
## Table of contents
88
- [Compile-time properties](#compile-time-properties)
9+
- [Cache-hint properties and restrictions depending on the usage context](#cache-hint-properties)
910
- [Stateless/stateful memory mode](#statelessstateful-memory-mode)
1011
- [block_load(...) - fast load from a contiguous memory block](#block_load---fast-load-from-a-contiguous-memory-block)
1112
- [block_store(...) - fast store to a contiguous memory block](#block-store---fast-store-to-a-contiguous-memory-block)
1213
- [gather(...)](#gather---load-from-memory-locations-addressed-by-a-vector-of-offsets)
1314
- [scatter(...)](#scatter---store-to-memory-locations-addressed-by-a-vector-of-offsets)
15+
- [load_2d(...) - load 2D block](#load_2d---load-2d-block)
16+
- [prefetch_2d(...) - prefetch 2D block](#prefetch_2d---prefetch-2d-block)
17+
- [store_2d(...) - store 2D block](#store_2d---store-2d-block)
1418
- [atomic_update(...)](#atomic_update)
1519
- [prefetch(...)](#prefetch)
1620
- [fence(...) - set the memory read/write order](#fence---set-the-memory-readwrite-order)
@@ -62,8 +66,54 @@ auto vec_a = block_load<float, 16>(f32_ptr, properties{alignment<16>});
6266
properties props{cache_hint_L1<cache_hint::uncached>, alignment<4>, cache_hint_L2<cache_hint::cached>};
6367
auto vec_b = block_load<float, 16>(f32_ptr + 1, props);
6468
```
69+
### Cache-hint properties
70+
Cache-hint properties (if passed) currently restrict the target device: it must be an Intel® Arc Series (aka DG2) or Intel® Data Center GPU Max Series (aka PVC) GPU.
71+
The valid combinations of L1/L2 cache-hints depend on the usage context. There are 4 contexts:
72+
* load: `block_load()`, `load_2d()`, `gather()` functions;
73+
* prefetch: `prefetch()` and `prefetch_2d()` functions;
74+
* store: `block_store()`, `store_2d()`, `scatter()` functions;
75+
* atomic_update: `atomic_update()` functions.
76+
77+
#### Valid combinations of `L1` and `L2` cache-hints for `load` functions:
78+
| `L1` | `L2` |
79+
|-|-|
80+
| none | none |
81+
| uncached | uncached |
82+
| uncached | cached |
83+
| cached | uncached |
84+
| cached | cached |
85+
| streaming | uncached |
86+
| streaming | cached |
87+
| read_invalidate | cached |
88+
89+
#### Valid combinations of `L1` and `L2` cache-hints for `prefetch` functions:
90+
| `L1` | `L2` |
91+
|-|-|
92+
| uncached | cached |
93+
| cached | uncached |
94+
| cached | cached |
95+
| streaming | uncached |
96+
| streaming | cached |
97+
98+
#### Valid combinations of `L1` and `L2` cache-hints for `store` functions:
99+
| `L1` | `L2` |
100+
|-|-|
101+
| none | none |
102+
| uncached | uncached |
103+
| uncached | write_back |
104+
| write_through | uncached |
105+
| write_through | write_back |
106+
| streaming | uncached |
107+
| streaming | write_back |
108+
| write_back | write_back |
109+
110+
#### Valid combinations of `L1` and `L2` cache-hints for `atomic_update` functions:
111+
| `L1` | `L2` |
112+
|-|-|
113+
| none | none |
114+
| uncached | uncached |
115+
| uncached | write_back |
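
The four tables above can be encoded as a small compile-time lookup. The sketch below is plain C++ and does not use any ESIMD header; the `Hint` and `Context` enums, `HintPair`, and `is_valid_hint_pair()` are hypothetical names introduced here for illustration, and they only mirror the hint names that appear in the tables.

```cpp
#include <cstddef>

// Hypothetical stand-ins for the hint names used in the tables above.
// These are NOT the ESIMD `cache_hint` enumerators.
enum class Hint { none, uncached, cached, write_back, write_through,
                  streaming, read_invalidate };

// The four usage contexts listed above.
enum class Context { load, prefetch, store, atomic_update };

struct HintPair { Hint l1; Hint l2; };

// One array per table, rows in the same order as in the document.
constexpr HintPair load_pairs[] = {
    {Hint::none, Hint::none},           {Hint::uncached, Hint::uncached},
    {Hint::uncached, Hint::cached},     {Hint::cached, Hint::uncached},
    {Hint::cached, Hint::cached},       {Hint::streaming, Hint::uncached},
    {Hint::streaming, Hint::cached},    {Hint::read_invalidate, Hint::cached}};
constexpr HintPair prefetch_pairs[] = {
    {Hint::uncached, Hint::cached},     {Hint::cached, Hint::uncached},
    {Hint::cached, Hint::cached},       {Hint::streaming, Hint::uncached},
    {Hint::streaming, Hint::cached}};
constexpr HintPair store_pairs[] = {
    {Hint::none, Hint::none},           {Hint::uncached, Hint::uncached},
    {Hint::uncached, Hint::write_back}, {Hint::write_through, Hint::uncached},
    {Hint::write_through, Hint::write_back},
    {Hint::streaming, Hint::uncached},  {Hint::streaming, Hint::write_back},
    {Hint::write_back, Hint::write_back}};
constexpr HintPair atomic_pairs[] = {
    {Hint::none, Hint::none},           {Hint::uncached, Hint::uncached},
    {Hint::uncached, Hint::write_back}};

// Linear search of one table for the (L1, L2) pair.
template <std::size_t N>
constexpr bool contains(const HintPair (&table)[N], Hint l1, Hint l2) {
  for (std::size_t i = 0; i < N; ++i)
    if (table[i].l1 == l1 && table[i].l2 == l2)
      return true;
  return false;
}

// Returns true if the (L1, L2) pair is listed in the table for `ctx`.
constexpr bool is_valid_hint_pair(Context ctx, Hint l1, Hint l2) {
  switch (ctx) {
  case Context::load:          return contains(load_pairs, l1, l2);
  case Context::prefetch:      return contains(prefetch_pairs, l1, l2);
  case Context::store:         return contains(store_pairs, l1, l2);
  case Context::atomic_update: return contains(atomic_pairs, l1, l2);
  }
  return false;
}

// read_invalidate/cached is listed for loads only.
static_assert(is_valid_hint_pair(Context::load, Hint::read_invalidate, Hint::cached),
              "listed in the load table");
// The prefetch table has no uncached/uncached row.
static_assert(!is_valid_hint_pair(Context::prefetch, Hint::uncached, Hint::uncached),
              "not listed in the prefetch table");
```

Keeping one array per context keeps the code visually close to the tables, so a change in a table maps to a one-line change in the matching array.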
65116

66-
Cache-hint properties (if passed) currently adds a restriction on the target-device, it must be a Intel® Arc Series (aka DG2) or Intel® Data Center GPU Max Series (aka PVC).
67117

68118
## block_load(...) - fast load from a contiguous memory block
69119
```C++
@@ -114,6 +164,8 @@ The optional [compile-time properties](#compile-time-properties) list `props` ma
114164
### Restrictions/assumptions:
115165
`Alignment` - if not specified by the `props` param, then `assumed` alignment is used. If the actual memory reference has a smaller alignment than the `assumed`, then it must be explicitly passed in `props` argument.
116166
167+
`Cache-hint` properties if passed must follow the [rules](#valid-combinations-of-l1-and-l2-cache-hints-for-load-functions) for `load` functions.
168+
117169
| `Function` | `Assumed` alignment | `Minimally required` alignment |
118170
|-|-|-|
119171
| `(usm-bl-*)` | `max(4, sizeof(T))` | `sizeof(T)` if no cache-hints, otherwise it is `max(4, sizeof(T))` |
@@ -183,6 +235,8 @@ The optional [compile-time properties](#compile-time-properties) list `props` ma
183235
### Restrictions/assumptions:
184236
`Alignment` - if not specified by the `props` param, then `assumed` alignment is used. If the actual memory reference requires a smaller alignment than the `assumed`, then it must be explicitly passed in `props` argument.
185237

238+
`Cache-hint` properties if passed must follow the [rules](#valid-combinations-of-l1-and-l2-cache-hints-for-store-functions) for `store` functions.
239+
186240
| `Function` | Condition | `Assumed` alignment | `Minimally required` alignment |
187241
|-|-|-|-|
188242
| `(usm-bs-*)` | (no cache-hints) and (`pred` is not passed). | `16` | `sizeof(T)` |
@@ -354,6 +408,9 @@ simd<float, 8> vec8 = gather<float, 8, 2>(ptr, offsets);
354408
```
355409

356410
### Restrictions
411+
412+
`Cache-hint` properties if passed must follow the [rules](#valid-combinations-of-l1-and-l2-cache-hints-for-load-functions) for `load` functions.
413+
357414
| `Function` | `Condition` | Required Intel GPU |
358415
|-|-|-|
359416
| `(usm-ga-1,4,7)`,`(acc-ga-1,4,7)` | true (`pass_thru` arg is passed) | DG2 or PVC |
@@ -457,6 +514,10 @@ scatter<float, 8, 2>(ptr, offsets4);
457514
```
458515

459516
### Restrictions
517+
518+
`Cache-hint` properties if passed must follow the [rules](#valid-combinations-of-l1-and-l2-cache-hints-for-store-functions) for `store` functions.
519+
520+
460521
| `Function` | `Condition` | Required Intel GPU |
461522
|-|-|-|
462523
| `(usm-sc-*)`, `(acc-sc-*)` | !(cache-hints) and (`VS` == 1) and (`N` == 1,2,4,8,16,32) | Any Intel GPU |
@@ -465,6 +526,118 @@ scatter<float, 8, 2>(ptr, offsets4);
465526
| `(slm-sc-*)`, `(lacc-sc-*)` | !(cache-hints) and (`VS` == 1) and (`N` == 1,2,4,8,16,32) | Any Intel GPU |
466527
| `(slm-sc-*)`, `(lacc-sc-*)` | (cache-hints) or (`VS` > 1) or (`N` != 1,2,4,8,16,32) | DG2 or PVC |
467528

529+
## load_2d(...) - load 2D block
530+
```C++
531+
template <typename T, int BlockWidth, int BlockHeight = 1, int NBlocks = 1,
532+
bool Transposed = false, bool Transformed = false,
533+
int N = detail::get_lsc_block_2d_data_size<T, NBlocks, BlockHeight, BlockWidth, Transposed, Transformed>(),
534+
typename PropertyListT = empty_properties_t>
535+
simd<T, N> load_2d(const T *Ptr, unsigned SurfaceWidth, unsigned SurfaceHeight,
536+
unsigned SurfacePitch, int X, int Y, PropertyListT props = {});
537+
```
538+
### Description
539+
Loads and returns a vector `simd<T, N>` where `N` is `BlockWidth * BlockHeight * NBlocks`.
540+
`T` is element type.
541+
`BlockWidth` - the block width in number of elements.
542+
`BlockHeight` - the block height in number of elements.
543+
`NBlocks` - the number of blocks.
544+
`Transposed` - whether the loaded block is transposed.
545+
`Transformed` - apply VNNI transform or not.
546+
`N` - (automatically deduced) the size of the returned vector in elements.
547+
`Ptr` - the surface base address for this operation.
548+
`SurfaceWidth` - the surface width minus 1 in bytes.
549+
`SurfaceHeight` - the surface height minus 1 in rows.
550+
`SurfacePitch` - the surface pitch minus 1 in bytes.
551+
`X` - zero-based X-coordinate of the upper-left corner of the rectangle, in number of elements.
552+
`Y` - zero-based Y-coordinate of the upper-left corner of the rectangle, in rows.
553+
`props` - The optional compile-time properties. Only cache hint properties are used.
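
As a sketch of how the three surface arguments relate to a matrix in memory, the plain C++ helper below derives them for a row-major surface, using the "minus 1" encoding described above. `SurfaceArgs` and `make_surface_args()` are hypothetical names that are not part of ESIMD, and taking the pitch in elements (rather than bytes) is an assumption made only for this example.

```cpp
#include <cstddef>

// Hypothetical holder for the three surface arguments described above.
struct SurfaceArgs {
  unsigned width;  // surface width in bytes, minus 1
  unsigned height; // surface height in rows, minus 1
  unsigned pitch;  // surface pitch in bytes, minus 1
};

// Derives the arguments for a row-major surface of `rows` x `cols` elements
// of type T whose consecutive rows start `pitch_elems` elements apart.
// Assumes cols, rows and pitch_elems are all at least 1.
template <typename T>
constexpr SurfaceArgs make_surface_args(unsigned cols, unsigned rows,
                                        unsigned pitch_elems) {
  return SurfaceArgs{static_cast<unsigned>(cols * sizeof(T)) - 1u,
                     rows - 1u,
                     static_cast<unsigned>(pitch_elems * sizeof(T)) - 1u};
}

// Example: 32 rows of 64 floats, each row padded to 80 floats.
static_assert(make_surface_args<float>(64, 32, 80).width == 64 * sizeof(float) - 1,
              "width is in bytes, minus 1");
static_assert(make_surface_args<float>(64, 32, 80).height == 31,
              "height is in rows, minus 1");
static_assert(make_surface_args<float>(64, 32, 80).pitch == 80 * sizeof(float) - 1,
              "pitch is in bytes, minus 1");
```

The resulting values would then be passed as `SurfaceWidth`, `SurfaceHeight` and `SurfacePitch`, while `X` stays in elements and `Y` in rows.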
554+
555+
### Restrictions
556+
* This function is available only for Intel® Data Center GPU Max Series (aka PVC).
557+
* `Cache-hint` properties if passed must follow the [rules](#valid-combinations-of-l1-and-l2-cache-hints-for-load-functions) for `load` functions.
558+
* `Transformed` and `Transposed` cannot be set to true at the same time.
559+
* `BlockWidth` * `BlockHeight` * `NBlocks` * sizeof(`T`) must not exceed 2048.
560+
* If `Transposed` is `true` then:
  * sizeof(`T`) must be 4 or 8 bytes (`dwords` or `qwords`).
  * `NBlocks` must be 1.
  * `BlockHeight` must be 8 for `qwords` and in range [`1`..`32`] for `dwords`.
  * `BlockWidth` must be 1,2,4 for `qwords` and in range [`1`..`8`] for `dwords`.
* If `Transformed` is `true` then:
  * sizeof(`T`) must be 1 or 2 bytes (`bytes` or `words`).
  * `NBlocks` must be 1,2,4.
  * `BlockHeight` must be in range [4..32] for `bytes` and [2..32] for `words`.
  * `BlockWidth` must be in range [4..16] for `bytes` and [2..16] for `words`.
  * `BlockWidth` * `NBlocks` must not exceed 64 for `bytes` and 32 for `words`.
* If `Transposed` and `Transformed` are both set to `false` then:
  * `NBlocks` must be {1,2,4} for `bytes` and `words`, {1,2} for `dwords`, 1 for `qwords`.
  * `BlockHeight` must not exceed 32.
  * `BlockWidth` must be 4 or more for `bytes`, 2 or more for `words`, 1 or more for `dwords` and `qwords`.
  * `BlockWidth` * `NBlocks` must not exceed 64 for `bytes`, 32 for `words`, 16 for `dwords`, and 8 for `qwords`.
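
The restrictions for the case where `Transposed` and `Transformed` are both `false` can be checked before instantiating the call. The following is a minimal sketch in plain C++; `is_valid_plain_load_2d()` is a hypothetical helper, not an ESIMD function, and it additionally assumes that all three shape parameters must be at least 1, which the list above does not state explicitly.

```cpp
#include <cstddef>

// Checks the block shape against the limits listed above for a load with
// Transposed == false and Transformed == false.
// elem_size is sizeof(T): 1 (bytes), 2 (words), 4 (dwords) or 8 (qwords).
constexpr bool is_valid_plain_load_2d(std::size_t elem_size, int block_width,
                                      int block_height, int n_blocks) {
  // Assumption of this sketch: the shape parameters are positive.
  if (block_width < 1 || block_height < 1 || n_blocks < 1)
    return false;

  int min_width = 0;     // smallest allowed BlockWidth
  int max_row_elems = 0; // largest allowed BlockWidth * NBlocks
  int max_blocks = 0;    // largest allowed NBlocks
  switch (elem_size) {
  case 1: min_width = 4; max_row_elems = 64; max_blocks = 4; break;
  case 2: min_width = 2; max_row_elems = 32; max_blocks = 4; break;
  case 4: min_width = 1; max_row_elems = 16; max_blocks = 2; break;
  case 8: min_width = 1; max_row_elems = 8;  max_blocks = 1; break;
  default: return false; // other element sizes are not covered by the list
  }

  // NBlocks is restricted to the values 1, 2 and 4 (3 is never allowed).
  const bool blocks_ok =
      (n_blocks == 1 || n_blocks == 2 || n_blocks == 4) && n_blocks <= max_blocks;
  const std::size_t total_bytes = static_cast<std::size_t>(block_width) *
                                  static_cast<std::size_t>(block_height) *
                                  static_cast<std::size_t>(n_blocks) * elem_size;
  return blocks_ok && block_height <= 32 && block_width >= min_width &&
         block_width * n_blocks <= max_row_elems && total_bytes <= 2048;
}

// A 16x8 block of floats fits; two such blocks exceed the 16-dword row limit.
static_assert(is_valid_plain_load_2d(sizeof(float), 16, 8, 1), "16x8 floats");
static_assert(!is_valid_plain_load_2d(sizeof(float), 16, 8, 2), "row too wide");
```

A check like this could back a `static_assert` next to the call site, so that an unsupported shape is reported with a readable message.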
576+
577+
578+
## prefetch_2d(...) - prefetch 2D block
579+
```C++
580+
template <typename T, int BlockWidth, int BlockHeight = 1, int NBlocks = 1,
581+
int N = detail::get_lsc_block_2d_data_size<T, NBlocks, BlockHeight, BlockWidth, false /*Transposed*/, false /*Transformed*/>(),
582+
typename PropertyListT = empty_properties_t>
583+
void prefetch_2d(const T *Ptr, unsigned SurfaceWidth, unsigned SurfaceHeight,
584+
unsigned SurfacePitch, int X, int Y, PropertyListT props = {});
585+
```
586+
### Description
587+
Prefetches elements from a memory block of the size `BlockWidth * BlockHeight * NBlocks` to cache.
588+
`T` is element type.
589+
`BlockWidth` - the block width in number of elements.
590+
`BlockHeight` - the block height in number of elements.
591+
`NBlocks` - the number of blocks.
592+
`N` - (automatically deduced) the size of the prefetched data in elements.
593+
`Ptr` - the surface base address for this operation.
594+
`SurfaceWidth` - the surface width minus 1 in bytes.
595+
`SurfaceHeight` - the surface height minus 1 in rows.
596+
`SurfacePitch` - the surface pitch minus 1 in bytes.
597+
`X` - zero-based X-coordinate of the upper-left corner of the rectangle, in number of elements.
598+
`Y` - zero-based Y-coordinate of the upper-left corner of the rectangle, in rows.
599+
`props` - The compile-time properties, which must specify cache-hints.
600+
601+
### Restrictions
602+
* This function is available only for Intel® Data Center GPU Max Series (aka PVC).
603+
* `Cache-hint` properties must follow the [rules](#valid-combinations-of-l1-and-l2-cache-hints-for-prefetch-functions) for `prefetch` functions.
604+
* `BlockWidth` * `BlockHeight` * `NBlocks` * sizeof(`T`) must not exceed 2048.
605+
* `NBlocks` must be {1,2,4} for `bytes` and `words`, {1,2} for `dwords`, 1 for `qwords`.
606+
* `BlockHeight` must not exceed 32.
607+
* `BlockWidth` must be 4 or more for `bytes`, 2 or more for `words`, 1 or more for `dwords` and `qwords`.
608+
* `BlockWidth` * `NBlocks` must not exceed 64 for `bytes`, 32 for `words`, 16 for `dwords`, and 8 for `qwords`.
609+
610+
## store_2d(...) - store 2D block
611+
```C++
612+
template <typename T, int BlockWidth, int BlockHeight = 1,
613+
int N = detail::get_lsc_block_2d_data_size<T, 1u, BlockHeight, BlockWidth, false /*Transposed*/, false /*Transformed*/>(),
614+
typename PropertyListT = empty_properties_t>
615+
void store_2d(T *Ptr, unsigned SurfaceWidth, unsigned SurfaceHeight,
616+
unsigned SurfacePitch, int X, int Y, simd<T, N> Vals, PropertyListT props = {});
617+
618+
```
619+
### Description
620+
Stores the vector `Vals` of the type `simd<T, N>` to 2D memory block where `N` is `BlockWidth * BlockHeight`.
621+
`T` is element type of the values to be stored to memory.
622+
`BlockWidth` - the block width in number of elements.
623+
`BlockHeight` - the block height in number of elements.
624+
`N` - (automatically deduced) the size of the vector to be stored.
625+
`Ptr` - the surface base address for this operation.
626+
`SurfaceWidth` - the surface width minus 1 in bytes.
627+
`SurfaceHeight` - the surface height minus 1 in rows.
628+
`SurfacePitch` - the surface pitch minus 1 in bytes.
629+
`X` - zero-based X-coordinate of the upper-left corner of the rectangle, in number of elements.
630+
`Y` - zero-based Y-coordinate of the upper-left corner of the rectangle, in rows.
631+
`props` - The optional compile-time properties. Only cache hint properties are used.
632+
633+
### Restrictions
634+
* This function is available only for Intel® Data Center GPU Max Series (aka PVC).
635+
* `Cache-hint` properties if passed must follow the [rules](#valid-combinations-of-l1-and-l2-cache-hints-for-store-functions) for `store` functions.
636+
* `BlockWidth` * `BlockHeight` * sizeof(`T`) must not exceed 512.
637+
* `BlockHeight` must not exceed 8.
638+
* `BlockWidth` must be 4 or more for `bytes`, 2 or more for `words`, 1 or more for `dwords` and `qwords`.
639+
* `BlockWidth` must not exceed 64 for `bytes`, 32 for `words`, 16 for `dwords`, and 8 for `qwords`.
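
The `store_2d` shape limits are tighter than the load limits (512 bytes and 8 rows). One way to express them is a template predicate keyed on the element type, sketched below in plain C++. `store_2d_shape_ok()` is a hypothetical name, not part of ESIMD, and the lower bound of 1 on `BlockHeight` is an assumption of this sketch.

```cpp
#include <cstddef>
#include <cstdint>

// Returns true if a BlockWidth x BlockHeight block of T satisfies the
// store_2d shape limits listed above.
template <typename T, int BlockWidth, int BlockHeight>
constexpr bool store_2d_shape_ok() {
  static_assert(sizeof(T) == 1 || sizeof(T) == 2 || sizeof(T) == 4 ||
                    sizeof(T) == 8,
                "only byte, word, dword and qword elements are covered");
  // Width limits depend on the element size.
  constexpr int min_width = sizeof(T) == 1 ? 4 : sizeof(T) == 2 ? 2 : 1;
  constexpr int max_width =
      sizeof(T) == 1 ? 64 : sizeof(T) == 2 ? 32 : sizeof(T) == 4 ? 16 : 8;
  return BlockHeight >= 1 && BlockHeight <= 8 && // assumption: at least 1 row
         BlockWidth >= min_width && BlockWidth <= max_width &&
         static_cast<std::size_t>(BlockWidth) *
                 static_cast<std::size_t>(BlockHeight) * sizeof(T) <= 512;
}

// 16x8 floats is exactly 512 bytes; a ninth row breaks the height limit.
static_assert(store_2d_shape_ok<float, 16, 8>(), "16x8 floats");
static_assert(!store_2d_shape_ok<float, 16, 9>(), "too many rows");
```

Because the predicate takes the same template arguments as the store itself, it can sit in a `static_assert` directly above the call.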
640+
468641
## atomic_update(...)
469642
470643
### atomic_update() with 0 operands (inc, dec, load)
@@ -604,6 +777,8 @@ The template parameter `T` specifies the type of the elements used in the atomic
604777
The template parameter `N` is the number of elements being atomically updated.
605778

606779
### Restrictions
780+
`Cache-hint` properties if passed must follow the [rules](#valid-combinations-of-l1-and-l2-cache-hints-for-atomic_update-functions) for `atomic_update` functions.
781+
607782
| `Function` | `Condition` | Required Intel GPU |
608783
|-|-|-|
609784
| `(usm-au0-*)`, `(acc-au0-*)` | !(cache-hints) and (`N` == 1,2,4,8,16,32) and (sizeof(T) >= 4) | Any Intel GPU |
@@ -699,13 +874,18 @@ The `byte_offsets` is a vector of any integral type elements, limited in [statef
699874
700875
`(acc-pf-7,8,9,10)`: Prefetches a linear block of memory addressed by the accessor `acc` and the optional `byte-offset` parameter, which is 64-bit in [stateless](#statelessstateful-memory-mode) mode(default), and 32-bit in [stateful](#statelessstateful-memory-mode) mode.
701876
702-
703877
`(usm-pf-1,2,3,4,5,6)`, `(acc-pf-1,2,3,4,5,6)`: The optional parameter `mask` provides a `simd_mask`. If some element in `mask` is zero, then the corresponding memory location is not prefetched.
704878
`(usm-pf-7,8,9,10)`, `(acc-pf-7,8,9,10)`: The optional parameter `mask` provides 1-element
705879
`simd_mask`. If it is zero, then the whole prefetch operation is skipped.
706880
707881
`(usm-pf-*)`, `(acc-pf-*)`: The [compile-time properties](#compile-time-properties) list `props` must specify `cache-hints`.
708882
883+
### Restrictions
884+
885+
* This function is available only for Intel® Arc Series (aka DG2) or Intel® Data Center GPU Max Series (aka PVC).
886+
* `Cache-hint` properties must follow the [rules](#valid-combinations-of-l1-and-l2-cache-hints-for-prefetch-functions) for `prefetch` functions.
887+
888+
709889
710890
## fence(...) - set the memory read/write order
711891
```C++
