Skip to content

Commit bc9f0aa

Browse files
committed
Add experimental latency control API to FPGA extension docs
1 parent 6e5dd48 commit bc9f0aa

File tree

2 files changed

+262
-2
lines changed

2 files changed

+262
-2
lines changed

sycl/doc/extensions/DataFlowPipes/data_flow_pipes_rev4_proposed.asciidoc

Lines changed: 88 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -634,6 +634,92 @@ Automated mechanisms are possible to provide uniquification across calls, and co
634634
*RESOLUTION*: Resolved. Abstraction/libraries on top enable functionality like this. We will make public a library that enables arrays of pipes.
635635
--
636636

637+
== Experimental APIs
638+
639+
*NOTE*: The APIs described in this section are experimental. Future versions of this extension may change these APIs in ways that are incompatible with the versions described here.
640+
641+
In the experimental API version, read/write methods take template arguments, which can contain the latency control properties `latency_anchor_id` and/or `latency_constraint`.
642+
643+
* `sycl::ext::intel::experimental::latency_anchor_id<N>`, where `N` is an integer: An ID to associate with the current read/write function call, which can then be referenced by other `latency_constraint` properties elsewhere in the program to define relative latency constaints. ID must be unique within the application, and a diagnostic is required if that condition is not met.
644+
* `sycl::ext::intel::experimental::latency_constraint<A, B, C>`: A tuple of three values which cause the current read/write function call to act as an endpoint of a latency constraint relative to a specified `latency_anchor_id` defined by a different instruction.
645+
** `A` is an integer: The ID of the target anchor defined on a different instruction through a `latency_anchor_id` property.
646+
** `B` is an enum value: The type of control from the set {`type::exact`, `type::max`, `type::min`}.
647+
** `C` is an integer: The relative clock cycle difference between the target anchor and the current function call, that the constraint should infer subject to the type of the control (exact, max, min).
648+
649+
The template arguments above don't have to be specified if user doesn't want to apply latency controls. The template arguments can be passed in arbitrary order.
650+
651+
=== Implementation
652+
653+
[source,c++]
654+
----
655+
// Added in version 2 of this extension.
656+
namespace sycl::ext::intel::experimental {
657+
enum class type {
658+
none, // default
659+
exact,
660+
max,
661+
min
662+
};
663+
664+
template <int32_t _N> struct latency_anchor_id {
665+
static constexpr int32_t value = _N;
666+
static constexpr int32_t default_value = -1;
667+
};
668+
669+
template <int32_t _N1, type _N2, int32_t _N3> struct latency_constraint {
670+
static constexpr std::tuple<int32_t, type, int32_t> value = {_N1, _N2, _N3};
671+
static constexpr std::tuple<int32_t, type, int32_t> default_value = {
672+
0, type::none, 0};
673+
};
674+
675+
template <typename name,
676+
typename dataT,
677+
size_t min_capacity = 0>
678+
class pipe {
679+
// Blocking
680+
template <class... _Params>
681+
static dataT read();
682+
template <class... _Params>
683+
static void write( const dataT &data );
684+
685+
// Non-blocking
686+
template <class... _Params>
687+
static dataT read( bool &success_code );
688+
template <class... _Params>
689+
static void write( const dataT &data, bool &success_code );
690+
}
691+
} // namespace sycl::ext::intel::experimental
692+
----
693+
694+
=== Usage
695+
696+
[source,c++]
697+
----
698+
// Added in version 2 of this extension.
699+
#include <sycl/ext/intel/fpga_extensions.hpp>
700+
...
701+
using Pipe1 = ext::intel::experimental::pipe<class PipeClass1, int, 8>;
702+
using Pipe2 = ext::intel::experimental::pipe<class PipeClass2, int, 8>;
703+
using Pipe3 = ext::intel::experimental::pipe<class PipeClass2, int, 8>;
704+
705+
myQueue.submit([&](handler &cgh) {
706+
cgh.single_task<class foo>([=] {
707+
// The following Pipe1::read is anchor 0
708+
int value = Pipe1::read<ext::intel::experimental::latency_anchor_id<0>>();
709+
710+
// The following Pipe2::write is anchor 1
711+
// The following Pipe2::write occurs exactly 2 cycles after anchor 0
712+
Pipe2::write<ext::intel::experimental::latency_anchor_id<1>,
713+
ext::intel::experimental::latency_constraint<
714+
0, ext::intel::experimental::type::exact, 2>>(value);
715+
716+
// The following Pipe3::write occurs at least 2 cycles after anchor 1
717+
Pipe3::write<ext::intel::experimental::latency_constraint<
718+
1, ext::intel::experimental::type::min, 2>>(value);
719+
});
720+
});
721+
----
722+
637723
== Feature test macro
638724

639725
This extension provides a feature-test macro as described in the core SYCL
@@ -648,6 +734,7 @@ extension's APIs the implementation supports.
648734
|===
649735
|Value |Description
650736
|1 |Initial extension version. Base features are supported.
737+
|2 |Add experimental latency control API.
651738
|===
652739

653740
== Revision History
@@ -660,6 +747,7 @@ extension's APIs the implementation supports.
660747
|1|2019-09-12|Michael Kinsner|*Initial public working draft*
661748
|2|2019-11-13|Michael Kinsner|Incorporate feedback
662749
|3|2020-04-27|Michael Kinsner|Clarify that pipe operations behave as-if they are relaxed atomic operations. Make SYCL2020 the baseline
750+
|4|2021-12-02|Shuo Niu|Add experimental latency control API
663751
|========================================
664752

665753
//************************************************************************

sycl/doc/extensions/IntelFPGA/FPGALsu_rev2_proposed.md

Lines changed: 174 additions & 2 deletions
Original file line numberDiff line numberDiff line change
@@ -119,16 +119,188 @@ Queue.submit([&](cl::sycl::handler &cgh) {
119119
...
120120
```
121121

122+
## Experimental APIs
123+
124+
**NOTE**: The APIs described in this section are experimental. Future versions of
125+
this extension may change these APIs in ways that are incompatible with the
126+
versions described here.
127+
128+
In the experimental API version, member functions `load()` and `store()` take
129+
template arguments, which can contain the latency control properties
130+
`latency_anchor_id` and/or `latency_constraint`.
131+
132+
1. **`sycl::ext::intel::experimental::latency_anchor_id<N>`, where `N` is an integer**:
133+
represents ID of the current function call when it performs as an anchor. The ID
134+
must be unique within the application, with a diagnostic required if that
135+
condition is not met.
136+
2. **`sycl::ext::intel::experimental::latency_constraint<A, B, C>`** contains control
137+
parameters when the current function performs as a non-anchor, where:
138+
- **`A` is an integer**: The ID of the target anchor defined on a different
139+
instruction through a `latency_anchor_id` property.
140+
- **`B` is an enum value**: The type of control from the set
141+
{`type::exact`, `type::max`, `type::min`}.
142+
- **`C` is an integer**: The relative clock cycle difference between the
143+
target anchor and the current function call, that the constraint should
144+
infer subject to the type of the control (exact, max, min).
145+
146+
The template arguments above don't have to be specified if user doesn't want to
147+
apply latency controls. The template arguments can be passed in arbitrary order.
148+
149+
### Implementation
150+
```c++
151+
// Added in version 2 of this extension.
152+
namespace sycl::ext::intel::experimental {
153+
enum class type {
154+
none, // default
155+
exact,
156+
max,
157+
min
158+
};
159+
160+
template <int32_t _N> struct latency_anchor_id {
161+
static constexpr int32_t value = _N;
162+
static constexpr int32_t default_value = -1;
163+
};
164+
165+
template <int32_t _N1, type _N2, int32_t _N3> struct latency_constraint {
166+
static constexpr std::tuple<int32_t, type, int32_t> value = {_N1, _N2, _N3};
167+
static constexpr std::tuple<int32_t, type, int32_t> default_value = {
168+
0, type::none, 0};
169+
};
170+
171+
template <class... mem_access_params> class lsu final {
172+
public:
173+
lsu() = delete;
174+
175+
template <class... _Params, typename _T, access::address_space _space>
176+
static _T load(sycl::multi_ptr<_T, _space> Ptr) {
177+
check_space<_space>();
178+
check_load();
179+
#if defined(__SYCL_DEVICE_ONLY__) && __has_builtin(__builtin_intel_fpga_mem)
180+
static constexpr auto _anchor_id =
181+
__GetValue<int, latency_anchor_id, _Params...>::value;
182+
static constexpr auto _constraint =
183+
__GetValue3<int, type, int, latency_constraint, _Params...>::value;
184+
185+
static constexpr int _target_anchor = std::get<0>(_constraint);
186+
static constexpr type _control_type = std::get<1>(_constraint);
187+
static constexpr int _cycle = std::get<2>(_constraint);
188+
int _type;
189+
if (_control_type == type::none) {
190+
_type = 0;
191+
} else if (_control_type == type::exact) {
192+
_type = 1;
193+
} else if (_control_type == type::max) {
194+
_type = 2;
195+
} else { // _control_type == type::min
196+
_type = 3;
197+
}
198+
199+
return *__latency_control_mem_wrapper((_T *)Ptr, _anchor_id, _target_anchor,
200+
_type, _cycle);
201+
#else
202+
return *Ptr;
203+
#endif
204+
}
205+
206+
template <class... _Params, typename _T, access::address_space _space>
207+
static void store(sycl::multi_ptr<_T, _space> Ptr, _T Val) {
208+
check_space<_space>();
209+
check_store();
210+
#if defined(__SYCL_DEVICE_ONLY__) && __has_builtin(__builtin_intel_fpga_mem)
211+
static constexpr auto _anchor_id =
212+
__GetValue<int, latency_anchor_id, _Params...>::value;
213+
static constexpr auto _constraint =
214+
__GetValue3<int, type, int, latency_constraint, _Params...>::value;
215+
216+
static constexpr int _target_anchor = std::get<0>(_constraint);
217+
static constexpr type _control_type = std::get<1>(_constraint);
218+
static constexpr int _cycle = std::get<2>(_constraint);
219+
int _type;
220+
if (_control_type == type::none) {
221+
_type = 0;
222+
} else if (_control_type == type::exact) {
223+
_type = 1;
224+
} else if (_control_type == type::max) {
225+
_type = 2;
226+
} else { // _control_type == type::min
227+
_type = 3;
228+
}
229+
230+
*__latency_control_mem_wrapper((_T *)Ptr, _anchor_id, _target_anchor, _type,
231+
_cycle) = Val;
232+
#else
233+
*Ptr = Val;
234+
#endif
235+
}
236+
...
237+
private:
238+
#if defined(__SYCL_DEVICE_ONLY__) && __has_builtin(__builtin_intel_fpga_mem)
239+
template <typename _T>
240+
static _T *__latency_control_mem_wrapper(_T * Ptr, int32_t AnchorID,
241+
int32_t TargetAnchor, int32_t Type,
242+
int32_t Cycle) {
243+
return __builtin_intel_fpga_mem(Ptr,
244+
_burst_coalesce | _cache |
245+
_dont_statically_coalesce | _prefetch,
246+
_cache_val);
247+
}
248+
#endif
249+
...
250+
}
251+
} // namespace sycl::ext::intel::experimental
252+
```
253+
254+
### Usage
255+
```c++
256+
// Added in version 2 of this extension.
257+
#include <sycl/ext/intel/fpga_extensions.hpp>
258+
...
259+
sycl::buffer<int, 1> output_buffer(output_data, 1);
260+
sycl::buffer<int, 1> input_buffer(input_data, 1);
261+
Queue.submit([&](sycl::handler &cgh) {
262+
auto output_accessor =
263+
output_buffer.get_access<sycl::access::mode::write>(cgh);
264+
auto input_accessor = input_buffer.get_access<sycl::access::mode::read>(cgh);
265+
266+
cgh.single_task<class kernel>([=] {
267+
auto input_ptr = input_accessor.get_pointer();
268+
auto output_ptr = output_accessor.get_pointer();
269+
270+
// latency controls
271+
using ExpPrefetchingLSU = sycl::ext::intel::experimental::lsu<
272+
sycl::ext::intel::experimental::prefetch<true>,
273+
sycl::ext::intel::experimental::statically_coalesce<false>>;
274+
275+
using ExpBurstCoalescedLSU = sycl::ext::intel::experimental::lsu<
276+
sycl::ext::intel::experimental::burst_coalesce<false>,
277+
sycl::ext::intel::experimental::statically_coalesce<false>>;
278+
279+
// The following load is anchor 1
280+
int Z = ExpPrefetchingLSU::load<
281+
sycl::ext::intel::experimental::latency_anchor_id<1>>(input_ptr + 2);
282+
283+
// The following store occurs exactly 5 cycles after the anchor 1 read
284+
ExpBurstCoalescedLSU::store<
285+
sycl::ext::intel::experimental::latency_constraint<
286+
1, sycl::ext::intel::experimental::type::exact, 5>>(output_ptr + 2,
287+
Z);
288+
});
289+
});
290+
...
291+
```
292+
122293
## Feature Test Macro
123294

124295
This extension provides a feature-test macro as described in the core SYCL
125296
specification section 6.3.3 "Feature test macros". Therefore, an implementation
126297
supporting this extension must predefine the macro `SYCL_EXT_INTEL_FPGA_LSU`
127298
to one of the values defined in the table below. Applications can test for the
128299
existence of this macro to determine if the implementation supports this
129-
feature, or applications can test the macros value to determine which of the
130-
extensions APIs the implementation supports.
300+
feature, or applications can test the macro's value to determine which of the
301+
extension's APIs the implementation supports.
131302

132303
|Value |Description|
133304
|:---- |:---------:|
134305
|1 |Initial extension version. Base features are supported.|
306+
|2 |Add experimental latency control API.|

0 commit comments

Comments
 (0)