
Commit 23ecc0b

iclsrc authored and committed
Merge from 'sycl' to 'sycl-web'
2 parents 3e4395e + eb3df08 commit 23ecc0b


43 files changed: +758 / -231 lines
Lines changed: 93 additions & 0 deletions
@@ -0,0 +1,93 @@
# Intel "Explicit SIMD" SYCL extension

OneAPI provides the "Explicit SIMD" SYCL extension (or simply "ESIMD") for
lower-level Intel GPU programming. It provides APIs closely matching the Intel
GPU ISA while allowing explicitly vectorized device code to be written. This
gives the programmer more control over the generated code and makes it less
dependent on compiler optimizations. The
[specification](https://github.com/intel/llvm/blob/sycl/sycl/doc/extensions/ExplicitSIMD/dpcpp-explicit-simd.md),
[documented ESIMD API headers](https://github.com/intel/llvm/tree/sycl/sycl/include/CL/sycl/INTEL/esimd) and
[working code examples](https://github.com/intel/llvm/blob/sycl/sycl/test/esimd/on-device) are available on the Intel DPC++ project's GitHub.

**_NOTE:_** _This extension is under active development and many of its APIs are
subject to change. A number of restrictions are listed below._

ESIMD kernels and functions always require a subgroup size of one, which means
the compiler never vectorizes across work-items in a subgroup. Instead,
vectorization is expressed explicitly in the code by the programmer. Here is a
trivial example which adds the elements of two arrays and writes the results to
a third:

```cpp
float *A = static_cast<float *>(malloc_shared(Size * sizeof(float), dev, ctxt));
float *B = static_cast<float *>(malloc_shared(Size * sizeof(float), dev, ctxt));
float *C = static_cast<float *>(malloc_shared(Size * sizeof(float), dev, ctxt));

for (unsigned i = 0; i < Size; ++i) {
  A[i] = B[i] = i;
}

// We need that many work-items. Each processes VL elements of data.
cl::sycl::range<1> GlobalRange{Size / VL};
// Number of work-items in each workgroup.
cl::sycl::range<1> LocalRange{GroupSize};

cl::sycl::nd_range<1> Range(GlobalRange, LocalRange);

auto e = q.submit([&](handler &cgh) {
  cgh.parallel_for<class Test>(Range, [=](nd_item<1> ndi) SYCL_ESIMD_KERNEL {
    using namespace sycl::INTEL::gpu;

    int i = ndi.get_global_id(0);
    simd<float, VL> va = block_load<float, VL>(A + i * VL);
    simd<float, VL> vb = block_load<float, VL>(B + i * VL);
    simd<float, VL> vc = va + vb;
    block_store<float, VL>(C + i * VL, vc);
  });
});
```

In this example, the lambda function passed to `parallel_for` is marked with a
special attribute, `SYCL_ESIMD_KERNEL`. This tells the compiler that the kernel
is an ESIMD one and that ESIMD APIs can be used inside it. Here the `simd`
objects and the `block_load`/`block_store` intrinsics are used, which are
available only in the ESIMD extension.
A full runnable code sample can be found in the
[github repo](https://github.com/intel/llvm/blob/sycl/sycl/test/esimd/on-device/vadd_usm.cpp).
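The snippet above uses a few host-side names (`q`, `dev`, `ctxt`, `Size`, `VL`,
`GroupSize`) that are declared elsewhere in the sample. A minimal sketch of that
setup, with illustrative values not taken from the original test, could look like
this:

```cpp
// Hypothetical host-side setup for the snippet above; the values of Size, VL
// and GroupSize are illustrative and chosen so that Size / VL is a multiple of
// GroupSize.
#include <CL/sycl.hpp>

constexpr unsigned Size = 1024;   // total number of elements
constexpr unsigned VL = 16;       // elements processed by one work-item
constexpr unsigned GroupSize = 8; // work-items per work-group

int main() {
  cl::sycl::queue q;              // pick a default device
  cl::sycl::device dev = q.get_device();
  cl::sycl::context ctxt = q.get_context();
  // ... USM allocation and kernel submission as shown above ...
  return 0;
}
```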

#### Compiling and running ESIMD code

To compile code which uses the ESIMD extension, the special compiler switch
`-fsycl-explicit-simd` must be used:

> `$ clang++ -fsycl -fsycl-explicit-simd vadd_usm.cpp`

The resulting executable can only be run on Intel GPU hardware, such as
Intel HD Graphics 600 or later. To run it, a couple of additional environment
variables must be set - `SYCL_BE=PI_OPENCL` and
`SYCL_PROGRAM_COMPILE_OPTIONS=-vc-codegen`:

> `$ SYCL_BE=PI_OPENCL SYCL_PROGRAM_COMPILE_OPTIONS=-vc-codegen ./a.out`

#### Restrictions

Here is a list of the main restrictions imposed on using the ESIMD extension.
Note that some of them are not enforced by the compiler, which may lead to
undefined program behavior if they are violated.

##### Features not supported with the ESIMD extension:
- Windows target
- Ahead-of-time compilation
- The [C and C++ Standard libraries support](https://github.com/intel/llvm/blob/sycl/sycl/doc/extensions/C-CXX-StandardLibrary/C-CXX-StandardLibrary.rst)
- The [Device library extensions](https://github.com/intel/llvm/blob/sycl/sycl/doc/extensions/C-CXX-StandardLibrary/DeviceLibExtensions.rst)
- Host device (in some cases)

##### Unsupported standard SYCL APIs:
- Local accessors
- Most of the image APIs
- Specialization constants
- Memory access through a raw pointer returned by `sycl::accessor::get_pointer()`

##### Other restrictions:
- Only Intel GPU devices are supported
- Regular and ESIMD DPC++ kernels cannot co-exist in the same application in
  most cases
Lines changed: 72 additions & 0 deletions
@@ -0,0 +1,72 @@
# Specialization constants

A specialization constant is essentially a variable in a SYCL program which is
set by host code and used in device code, and which appears to be constant to
the online (JIT) compiler of the device code. Things like the optimal tile size
in a tiled matrix multiplication kernel may depend on the hardware and can be
expressed via a specialization constant for better code generation.

This version of oneAPI provides an experimental implementation of specialization
constants based on the
[proposal](https://github.com/codeplaysoftware/standards-proposals/blob/master/spec-constant/index.md)
from Codeplay.

**NOTE:** _In future versions it may be superseded by the [SYCL 2020
specification](https://www.khronos.org/registry/SYCL/specs/sycl-2020-provisional.pdf)._

A specialization constant is identified by a C++ type name, similarly to a
kernel; its value is set via the `program::set_spec_constant` API and is
"frozen" once the program is built. The following example shows how
different values of a specialization constant can be used within the same
kernel:

```cpp
for (int i = 0; i < n_sc_sets; i++) {
  cl::sycl::program program(q.get_context());
  const int *sc_set = &sc_vals[i][0];
  cl::sycl::ONEAPI::experimental::spec_constant<int32_t, SC0> sc0 =
      program.set_spec_constant<SC0>(sc_set[0]);
  cl::sycl::ONEAPI::experimental::spec_constant<int32_t, SC1> sc1 =
      program.set_spec_constant<SC1>(sc_set[1]);

  program.build_with_kernel_type<KernelAAA>();

  try {
    cl::sycl::buffer<int, 1> buf(vec.data(), vec.size());

    q.submit([&](cl::sycl::handler &cgh) {
      auto acc = buf.get_access<cl::sycl::access::mode::write>(cgh);
      cgh.single_task<KernelAAA>(
          program.get_kernel<KernelAAA>(),
          [=]() {
            acc[i] = sc0.get() + sc1.get();
          });
    });
  } catch (cl::sycl::exception &e) {
    std::cout << "*** Exception caught: " << e.what() << "\n";
    return 1;
  }
  ...
}
```

Here the values of the specialization constants `SC0` and `SC1` are changed on
every loop iteration. All that is needed is re-creating a `program` class
instance, setting the new values and rebuilding it via
`program::build_with_kernel_type`. The JIT compiler will effectively replace
`sc0.get()` and `sc1.get()` within the device code with the corresponding
constant values (`sc_vals[i][0]` and `sc_vals[i][1]`). A full runnable example
can be found on
[github](https://github.com/intel/llvm/blob/sycl/sycl/test/spec_const/spec_const_redefine.cpp).
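The example also relies on declarations that are not shown in the snippet - the
`SC0`/`SC1` identifier types, the kernel name `KernelAAA`, and the `sc_vals`
table. A minimal sketch of what they could look like (names and values are
illustrative, not taken from the actual test) is:

```cpp
// Hypothetical declarations backing the example above; the real test on
// github may define them differently.
#include <cstdint>
#include <vector>

class SC0; // specialization-constant IDs are ordinary C++ type names
class SC1;
class KernelAAA; // kernel name type

constexpr int n_sc_sets = 2;
// One pair of specialization-constant values per program build.
const int sc_vals[n_sc_sets][2] = {{1, 2}, {10, 20}};

std::vector<int> vec(n_sc_sets); // one result slot per iteration
```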

Specialization constants can also be used in programs compiled ahead of time; in
that case a specialization constant takes the default value for its type (as
specified by the
[C++ standard](https://en.cppreference.com/w/cpp/language/value_initialization)).

#### Limitations
- The implementation does not support the `template <unsigned NID> struct spec_constant_id`
  API design for interoperability with OpenCL - that is, for setting
  specialization constants in SYCL programs originating from external SPIR-V
  modules wrapped by OpenCL program objects. In SPIR-V/OpenCL, specialization
  constants are identified by an integer number, and the `spec_constant_id`
  class models that.
- Only primitive numeric types are supported.

sycl/include/CL/sycl/info/info_desc.hpp

Lines changed: 34 additions & 0 deletions
@@ -214,6 +214,19 @@ enum class kernel_sub_group : cl_kernel_sub_group_info {
   compile_sub_group_size = CL_KERNEL_COMPILE_SUB_GROUP_SIZE_INTEL
 };
 
+enum class kernel_device_specific : cl_kernel_work_group_info {
+  global_work_size = CL_KERNEL_GLOBAL_WORK_SIZE,
+  work_group_size = CL_KERNEL_WORK_GROUP_SIZE,
+  compile_work_group_size = CL_KERNEL_COMPILE_WORK_GROUP_SIZE,
+  preferred_work_group_size_multiple =
+      CL_KERNEL_PREFERRED_WORK_GROUP_SIZE_MULTIPLE,
+  private_mem_size = CL_KERNEL_PRIVATE_MEM_SIZE,
+  max_sub_group_size = CL_KERNEL_MAX_SUB_GROUP_SIZE_FOR_NDRANGE,
+  max_num_sub_groups = CL_KERNEL_MAX_NUM_SUB_GROUPS,
+  compile_num_sub_groups = CL_KERNEL_COMPILE_NUM_SUB_GROUPS,
+  compile_sub_group_size = CL_KERNEL_COMPILE_SUB_GROUP_SIZE_INTEL
+};
+
 // A.6 Program information desctiptors
 enum class program : cl_program_info {
   context = CL_PROGRAM_CONTEXT,
@@ -242,6 +255,8 @@ enum class event_profiling : cl_profiling_info {
 // Provide an alias to the return type for each of the info parameters
 template <typename T, T param> class param_traits {};
 
+template <typename T, T param> struct compatibility_param_traits {};
+
 #define PARAM_TRAITS_SPEC(param_type, param, ret_type) \
   template <> class param_traits<param_type, param_type::param> { \
   public: \
@@ -263,6 +278,7 @@ template <typename T, T param> class param_traits {};
 
 #include <CL/sycl/info/event_profiling_traits.def>
 
+#include <CL/sycl/info/kernel_device_specific_traits.def>
 #include <CL/sycl/info/kernel_sub_group_traits.def>
 #include <CL/sycl/info/kernel_traits.def>
 #include <CL/sycl/info/kernel_work_group_traits.def>
@@ -276,6 +292,24 @@ template <typename T, T param> class param_traits {};
 #undef PARAM_TRAITS_SPEC
 #undef PARAM_TRAITS_SPEC_WITH_INPUT
 
+#define PARAM_TRAITS_SPEC(param_type, param, ret_type) \
+  template <> \
+  struct compatibility_param_traits<param_type, param_type::param> { \
+    static constexpr auto value = kernel_device_specific::param; \
+  };
+
+#define PARAM_TRAITS_SPEC_WITH_INPUT(param_type, param, ret_type, in_type) \
+  template <> \
+  struct compatibility_param_traits<param_type, param_type::param> { \
+    static constexpr auto value = kernel_device_specific::param; \
+  };
+
+#include <CL/sycl/info/kernel_sub_group_traits.def>
+#include <CL/sycl/info/kernel_work_group_traits.def>
+
+#undef PARAM_TRAITS_SPEC
+#undef PARAM_TRAITS_SPEC_WITH_INPUT
+
 } // namespace info
 } // namespace sycl
 } // __SYCL_INLINE_NAMESPACE(cl)
Lines changed: 12 additions & 0 deletions
@@ -0,0 +1,12 @@
PARAM_TRAITS_SPEC(kernel_device_specific, compile_work_group_size,
                  cl::sycl::range<3>)
PARAM_TRAITS_SPEC(kernel_device_specific, global_work_size, cl::sycl::range<3>)
PARAM_TRAITS_SPEC(kernel_device_specific,
                  preferred_work_group_size_multiple, size_t)
PARAM_TRAITS_SPEC(kernel_device_specific, private_mem_size, cl_ulong)
PARAM_TRAITS_SPEC(kernel_device_specific, work_group_size, size_t)
PARAM_TRAITS_SPEC_WITH_INPUT(kernel_device_specific, max_sub_group_size,
                             uint32_t, cl::sycl::range<3>)
PARAM_TRAITS_SPEC(kernel_device_specific, max_num_sub_groups, uint32_t)
PARAM_TRAITS_SPEC(kernel_device_specific, compile_num_sub_groups, uint32_t)
PARAM_TRAITS_SPEC(kernel_device_specific, compile_sub_group_size, uint32_t)
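
The body of `PARAM_TRAITS_SPEC` is cut off in the `info_desc.hpp` hunk above.
Judging from the existing `param_traits` pattern, each entry in this `.def` file
presumably expands to a specialization exposing the descriptor's return type,
roughly like the following sketch (the `return_type` alias is an assumption, not
shown in the diff):

```cpp
// Hypothetical expansion of
//   PARAM_TRAITS_SPEC(kernel_device_specific, work_group_size, size_t)
// assuming the macro defines a return_type alias like the other info traits.
template <>
class param_traits<info::kernel_device_specific,
                   info::kernel_device_specific::work_group_size> {
public:
  using return_type = size_t;
};
```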

sycl/include/CL/sycl/kernel.hpp

Lines changed: 30 additions & 4 deletions
@@ -92,6 +92,27 @@ class __SYCL_EXPORT kernel {
   typename info::param_traits<info::kernel, param>::return_type
   get_info() const;
 
+  /// Query device-specific information from the kernel object using the
+  /// info::kernel_device_specific descriptor.
+  ///
+  /// \param Device is a valid SYCL device to query info for.
+  /// \return depends on information being queried.
+  template <info::kernel_device_specific param>
+  typename info::param_traits<info::kernel_device_specific, param>::return_type
+  get_info(const device &Device) const;
+
+  /// Query device-specific information from a kernel using the
+  /// info::kernel_device_specific descriptor for a specific device and value.
+  ///
+  /// \param Device is a valid SYCL device.
+  /// \param Value depends on information being queried.
+  /// \return depends on information being queried.
+  template <info::kernel_device_specific param>
+  typename info::param_traits<info::kernel_device_specific, param>::return_type
+  get_info(const device &Device,
+           typename info::param_traits<info::kernel_device_specific,
+                                       param>::input_type Value) const;
+
   /// Query work-group information from a kernel using the
   /// info::kernel_work_group descriptor for a specific device.
   ///
@@ -107,8 +128,11 @@ class __SYCL_EXPORT kernel {
   /// \param Device is a valid SYCL device.
   /// \return depends on information being queried.
   template <info::kernel_sub_group param>
+  // clang-format off
   typename info::param_traits<info::kernel_sub_group, param>::return_type
+  __SYCL_DEPRECATED("Use get_info with info::kernel_device_specific instead.")
   get_sub_group_info(const device &Device) const;
+  // clang-format on
 
   /// Query sub-group information from a kernel using the
   /// info::kernel_sub_group descriptor for a specific device and value.
@@ -117,11 +141,13 @@ class __SYCL_EXPORT kernel {
   /// \param Value depends on information being queried.
   /// \return depends on information being queried.
   template <info::kernel_sub_group param>
+  // clang-format off
   typename info::param_traits<info::kernel_sub_group, param>::return_type
-  get_sub_group_info(
-      const device &Device,
-      typename info::param_traits<info::kernel_sub_group, param>::input_type
-          Value) const;
+  __SYCL_DEPRECATED("Use get_info with info::kernel_device_specific instead.")
+  get_sub_group_info(const device &Device,
+                     typename info::param_traits<info::kernel_sub_group,
+                                                 param>::input_type Value) const;
+  // clang-format on
 
 private:
   /// Constructs a SYCL kernel object from a valid kernel_impl instance.
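
A minimal usage sketch for the new `get_info` overloads declared above; `Krn`
and `Dev` are placeholders for a kernel and the device it was built for, and the
query values are illustrative:

```cpp
#include <CL/sycl.hpp>

// Sketch only: assumes Krn was obtained from a program built for Dev.
void query_kernel_limits(const cl::sycl::kernel &Krn,
                         const cl::sycl::device &Dev) {
  size_t WGSize =
      Krn.get_info<cl::sycl::info::kernel_device_specific::work_group_size>(Dev);

  // Overload with an extra input value: per kernel_device_specific_traits.def,
  // max_sub_group_size takes a cl::sycl::range<3> and returns uint32_t.
  uint32_t MaxSGSize =
      Krn.get_info<cl::sycl::info::kernel_device_specific::max_sub_group_size>(
          Dev, cl::sycl::range<3>{64, 1, 1});

  (void)WGSize;
  (void)MaxSGSize;
}
```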

sycl/plugins/cuda/pi_cuda.cpp

Lines changed: 3 additions & 3 deletions
@@ -4133,7 +4133,7 @@ pi_result cuda_piextUSMHostAlloc(void **result_ptr, pi_context context,
   } catch (pi_result error) {
     result = error;
   }
-  assert(reinterpret_cast<std::uintptr_t>(*result_ptr) % alignment == 0);
+
   return result;
 }
 
@@ -4154,7 +4154,7 @@ pi_result cuda_piextUSMDeviceAlloc(void **result_ptr, pi_context context,
   } catch (pi_result error) {
     result = error;
   }
-  assert(reinterpret_cast<std::uintptr_t>(*result_ptr) % alignment == 0);
+
  return result;
 }
 
@@ -4176,7 +4176,7 @@ pi_result cuda_piextUSMSharedAlloc(void **result_ptr, pi_context context,
   } catch (pi_result error) {
     result = error;
   }
-  assert(reinterpret_cast<std::uintptr_t>(*result_ptr) % alignment == 0);
+
   return result;
 }
 
sycl/plugins/level_zero/pi_level_zero.cpp

Lines changed: 44 additions & 3 deletions
@@ -1441,7 +1441,16 @@ piextDeviceSelectBinary(pi_device Device, // TODO: does this need to be context?
                         pi_device_binary *Binaries, pi_uint32 NumBinaries,
                         pi_uint32 *SelectedBinaryInd) {
 
-  // TODO dummy implementation.
+  assert(Device);
+  assert(SelectedBinaryInd);
+  assert(NumBinaries == 0 || Binaries);
+
+  // TODO: this is a bare-bones implementation for choosing a device image
+  // that would be compatible with the targeted device. An AOT-compiled
+  // image is preferred over SPIR-V for known devices (i.e. Intel devices)
+  // The implementation makes no effort to differentiate between multiple images
+  // for the given device, and simply picks the first one compatible.
+  //
   // Real implementation will use the same mechanism OpenCL ICD dispatcher
   // uses. Something like:
   //   PI_VALIDATE_HANDLE_RETURN_HANDLE(ctx, PI_INVALID_CONTEXT);
@@ -1450,9 +1459,28 @@ piextDeviceSelectBinary(pi_device Device, // TODO: does this need to be context?
   // where context->dispatch is set to the dispatch table provided by PI
   // plugin for platform/device the ctx was created for.
 
+  // Look for GEN binary, which we known can only be handled by Level-Zero now.
+  const char *BinaryTarget = PI_DEVICE_BINARY_TARGET_SPIRV64_GEN;
+
+  // Find the appropriate device image, fallback to spirv if not found
   constexpr pi_uint32 InvalidInd = std::numeric_limits<pi_uint32>::max();
-  *SelectedBinaryInd = NumBinaries > 0 ? 0 : InvalidInd;
-  return PI_SUCCESS;
+  pi_uint32 Spirv = InvalidInd;
+
+  for (pi_uint32 i = 0; i < NumBinaries; ++i) {
+    if (strcmp(Binaries[i]->DeviceTargetSpec, BinaryTarget) == 0) {
+      *SelectedBinaryInd = i;
+      return PI_SUCCESS;
+    }
+    if (strcmp(Binaries[i]->DeviceTargetSpec,
+               PI_DEVICE_BINARY_TARGET_SPIRV64) == 0)
+      Spirv = i;
+  }
+  // Points to a spirv image, if such indeed was found
+  if ((*SelectedBinaryInd = Spirv) != InvalidInd)
+    return PI_SUCCESS;
+
+  // No image can be loaded for the given device
+  return PI_INVALID_BINARY;
 }
 
 pi_result piextDeviceGetNativeHandle(pi_device Device,
@@ -3100,7 +3128,20 @@ pi_result piEventGetProfilingInfo(pi_event Event, pi_profiling_info ParamName,
   case PI_PROFILING_INFO_COMMAND_END: {
     zeEventQueryKernelTimestamp(Event->ZeEvent, &tsResult);
 
+    uint64_t ContextStartTime = tsResult.context.kernelStart;
     uint64_t ContextEndTime = tsResult.context.kernelEnd;
+    //
+    // Handle a possible wrap-around (the underlying HW counter is < 64-bit).
+    // Note, it will not report correct time if there were multiple wrap
+    // arounds, and the longer term plan is to enlarge the capacity of the
+    // HW timestamps.
+    //
+    if (ContextEndTime <= ContextStartTime) {
+      pi_device Device = Event->Context->Devices[0];
+      const uint64_t TimestampMaxValue =
+          (1LL << Device->ZeDeviceProperties.kernelTimestampValidBits) - 1;
+      ContextEndTime += TimestampMaxValue - ContextStartTime;
+    }
     ContextEndTime *= ZeTimerResolution;
 
     return ReturnValue(uint64_t{ContextEndTime});
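
The wrap-around handling above amounts to modular counter arithmetic. A
standalone sketch of the idea (an illustration only, not the exact PI code; the
32-bit width and tick values are hypothetical):

```cpp
#include <cassert>
#include <cstdint>

// Elapsed ticks between two raw readings of an N-bit hardware counter,
// assuming at most one wrap-around between them.
static uint64_t ElapsedTicks(uint64_t Start, uint64_t End, uint32_t ValidBits) {
  const uint64_t MaxValue = (1ULL << ValidBits) - 1; // largest raw counter value
  return (End >= Start) ? (End - Start) : (MaxValue - Start) + End + 1;
}

int main() {
  // No wrap: the counter simply advanced.
  assert(ElapsedTicks(100, 300, 32) == 200);
  // Wrap: started near the top of a 32-bit counter, ended near zero.
  assert(ElapsedTicks(0xFFFFFF00ULL, 0x40, 32) == 0x140);
  return 0;
}
```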
