Skip to content

[SYCL][Level Zero] Implement sycl_ext_intel_cslice extension #7626

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 16 commits into from
Dec 12, 2022
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
1 change: 1 addition & 0 deletions sycl/doc/EnvironmentVariables.md
Original file line number Diff line number Diff line change
Expand Up @@ -249,6 +249,7 @@ variables in production code.</span>
| `SYCL_PI_LEVEL_ZERO_USE_COPY_ENGINE_FOR_FILL` | Integer | When set to a positive value enables use of a copy engine for memory fill operations. Default is 0. |
| `SYCL_PI_LEVEL_ZERO_SINGLE_ROOT_DEVICE_BUFFER_MIGRATION` | Integer | When set to "0" tells to use single root-device allocation for all devices in a context where all devices have same root. Otherwise performs regular buffer migration. Default is 1. |
| `SYCL_PI_LEVEL_ZERO_REUSE_DISCARDED_EVENTS` | Integer | When set to a positive value enables the mode when discarded Level Zero events are reset and reused in scope of the same in-order queue based on the dependency chain between commands. Default is 1. |
| `SYCL_PI_LEVEL_ZERO_EXPOSE_CSLICE_IN_AFFINITY_PARTITIONING` (Deprecated) | Integer | When set to non-zero value exposes compute slices as sub-sub-devices in `sycl::info::partition_property::partition_by_affinity_domain` partitioning scheme. Default is zero meaning that they are only exposed when partitioning by `sycl::info::partition_property::ext_intel_partition_by_cslice`. This option is introduced for compatibility reasons and is immediately deprecated. New code must not rely on this behavior. Also note that even if sub-sub-device was created using `partition_by_affinity_domain` it would still be reported as created via partitioning by compute slices. |

## Debugging variables for CUDA Plugin

Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -43,11 +43,7 @@ SYCL specification refer to that revision.

== Status

This is a proposed extension specification, intended to gather community
feedback. Interfaces defined in this specification may not be implemented yet
or may be in a preliminary state. The specification itself may also change in
incompatible ways before it is finalized. *Shipping software products should
not rely on APIs defined in this specification.*
This extension is implemented and fully supported by DPC++.


== Overview
Expand Down
11 changes: 9 additions & 2 deletions sycl/include/sycl/detail/pi.h
Original file line number Diff line number Diff line change
Expand Up @@ -60,9 +60,14 @@
// PI_EXT_ONEAPI_QUEUE_PRIORITY_HIGH queue properties.
// 11.18 Add new parameter name PI_EXT_ONEAPI_QUEUE_INFO_EMPTY to
// _pi_queue_info.
// 12.19 Add new PI_EXT_INTEL_DEVICE_PARTITION_BY_CSLICE piDevicePartition
// scheme. Sub-sub-devices (representing compute slice) creation via
// partitioning by affinity domain is disabled by default and can be temporarily
// restored via SYCL_PI_LEVEL_ZERO_EXPOSE_CSLICE_IN_AFFINITY_PARTITIONING
// environment variable.

#define _PI_H_VERSION_MAJOR 11
#define _PI_H_VERSION_MINOR 18
#define _PI_H_VERSION_MAJOR 12
#define _PI_H_VERSION_MINOR 19

#define _PI_STRING_HELPER(a) #a
#define _PI_CONCAT(a, b) _PI_STRING_HELPER(a.b)
Expand Down Expand Up @@ -630,6 +635,8 @@ static constexpr pi_device_partition_property
PI_DEVICE_PARTITION_BY_COUNTS_LIST_END = 0x0;
static constexpr pi_device_partition_property
PI_DEVICE_PARTITION_BY_AFFINITY_DOMAIN = 0x1088;
static constexpr pi_device_partition_property
PI_EXT_INTEL_DEVICE_PARTITION_BY_CSLICE = 0x1089;

// For compatibility with OpenCL define this not as enum.
using pi_device_affinity_domain = pi_bitfield;
Expand Down
13 changes: 13 additions & 0 deletions sycl/include/sycl/device.hpp
Original file line number Diff line number Diff line change
Expand Up @@ -174,6 +174,19 @@ class __SYCL_EXPORT device : public detail::OwnerLessBase<device> {
std::vector<device>
create_sub_devices(info::partition_affinity_domain AffinityDomain) const;

/// Partition device into sub devices
///
/// Available only when prop is
/// info::partition_property::ext_intel_partition_by_cslice. If this SYCL
/// device does not support
/// info::partition_property::ext_intel_partition_by_cslice a
/// feature_not_supported exception must be thrown.
///
/// \return a vector class of sub devices partitioned from this SYCL
/// device at a granularity of "cslice" (compute slice).
template <info::partition_property prop>
std::vector<device> create_sub_devices() const;

/// Queries this SYCL device for information requested by the template
/// parameter param
///
Expand Down
3 changes: 2 additions & 1 deletion sycl/include/sycl/info/info_desc.hpp
Original file line number Diff line number Diff line change
Expand Up @@ -55,7 +55,8 @@ enum class partition_property : pi_device_partition_property {
no_partition = 0,
partition_equally = PI_DEVICE_PARTITION_EQUALLY,
partition_by_counts = PI_DEVICE_PARTITION_BY_COUNTS,
partition_by_affinity_domain = PI_DEVICE_PARTITION_BY_AFFINITY_DOMAIN
partition_by_affinity_domain = PI_DEVICE_PARTITION_BY_AFFINITY_DOMAIN,
ext_intel_partition_by_cslice = PI_EXT_INTEL_DEVICE_PARTITION_BY_CSLICE
};

enum class partition_affinity_domain : pi_device_affinity_domain {
Expand Down
144 changes: 100 additions & 44 deletions sycl/plugins/level_zero/pi_level_zero.cpp
Original file line number Diff line number Diff line change
Expand Up @@ -170,6 +170,12 @@ static const int DeviceEventsSetting = [] {
return AllHostVisible;
}();

static const bool ExposeCSliceInAffinityPartitioning = [] {
const char *Flag =
std::getenv("SYCL_PI_LEVEL_ZERO_EXPOSE_CSLICE_IN_AFFINITY_PARTITIONING");
return Flag ? std::atoi(Flag) != 0 : false;
}();

// Helper function to implement zeHostSynchronize.
// The behavior is to avoid infinite wait during host sync under ZE_DEBUG.
// This allows for a much more responsive debugging of hangs.
Expand Down Expand Up @@ -2606,28 +2612,33 @@ pi_result _pi_platform::populateDeviceCacheIfNeeded() {
}
}

// Create PI sub-sub-devices with the sub-device for all the ordinals.
// Each {ordinal, index} points to a specific CCS which constructs
// a sub-sub-device at this point.
// FIXME: Level Zero creates multiple PiDevices for a single physical
// device when sub-device is partitioned into sub-sub-devices.
// Sub-sub-device is technically a command queue and we should not build
// program for each command queue. PiDevice is probably not the right
// abstraction for a Level Zero command queue.
for (uint32_t J = 0; J < Ordinals.size(); ++J) {
for (uint32_t K = 0; K < QueueGroupProperties[Ordinals[J]].numQueues;
++K) {
std::unique_ptr<_pi_device> PiSubSubDevice(
new _pi_device(ZeSubdevices[I], this, PiSubDevice.get()));
pi_result Result = PiSubSubDevice->initialize(Ordinals[J], K);
if (Result != PI_SUCCESS) {
return Result;
// If isn't PVC, then submissions to different CCS can be executed on
// the same EUs still, so we cannot treat them as sub-sub-devices.
if (PiSubDevice->isPVC() || ExposeCSliceInAffinityPartitioning) {
// Create PI sub-sub-devices with the sub-device for all the ordinals.
// Each {ordinal, index} points to a specific CCS which constructs
// a sub-sub-device at this point.
//
// FIXME: Level Zero creates multiple PiDevices for a single physical
// device when sub-device is partitioned into sub-sub-devices.
// Sub-sub-device is technically a command queue and we should not
// build program for each command queue. PiDevice is probably not the
// right abstraction for a Level Zero command queue.
for (uint32_t J = 0; J < Ordinals.size(); ++J) {
for (uint32_t K = 0;
K < QueueGroupProperties[Ordinals[J]].numQueues; ++K) {
std::unique_ptr<_pi_device> PiSubSubDevice(
new _pi_device(ZeSubdevices[I], this, PiSubDevice.get()));
pi_result Result = PiSubSubDevice->initialize(Ordinals[J], K);
if (Result != PI_SUCCESS) {
return Result;
}

// save pointers to sub-sub-devices for quick retrieval in the
// future.
PiSubDevice->SubDevices.push_back(PiSubSubDevice.get());
PiDevicesCache.push_back(std::move(PiSubSubDevice));
}

// save pointers to sub-sub-devices for quick retrieval in the
// future.
PiSubDevice->SubDevices.push_back(PiSubSubDevice.get());
PiDevicesCache.push_back(std::move(PiSubSubDevice));
}
}

Expand Down Expand Up @@ -2868,31 +2879,49 @@ pi_result piDeviceGetInfo(pi_device Device, pi_device_info ParamName,
if (ZeSubDeviceCount < 2) {
return ReturnValue(pi_device_partition_property{0});
}
// It is debatable if SYCL sub-device and partitioning APIs sufficient to
// expose Level Zero sub-devices? We start with support of
// "partition_by_affinity_domain" and "next_partitionable" but if that
// doesn't seem to be a good fit we could look at adding a more descriptive
// partitioning type.
struct {
pi_device_partition_property Arr[2];
} PartitionProperties = {{PI_DEVICE_PARTITION_BY_AFFINITY_DOMAIN, 0}};
return ReturnValue(PartitionProperties);
bool PartitionedByCSlice = Device->SubDevices[0]->isCCS();

auto ReturnHelper = [&](auto... Partitions) {
struct {
pi_device_partition_property Arr[sizeof...(Partitions) + 1];
} PartitionProperties = {{Partitions..., 0}};
return ReturnValue(PartitionProperties);
};

if (ExposeCSliceInAffinityPartitioning) {
if (PartitionedByCSlice)
return ReturnHelper(PI_EXT_INTEL_DEVICE_PARTITION_BY_CSLICE,
PI_DEVICE_PARTITION_BY_AFFINITY_DOMAIN);

else
return ReturnHelper(PI_DEVICE_PARTITION_BY_AFFINITY_DOMAIN);
} else {
return ReturnHelper(PartitionedByCSlice
? PI_EXT_INTEL_DEVICE_PARTITION_BY_CSLICE
: PI_DEVICE_PARTITION_BY_AFFINITY_DOMAIN);
}
}
case PI_DEVICE_INFO_PARTITION_AFFINITY_DOMAIN:
return ReturnValue(pi_device_affinity_domain{
PI_DEVICE_AFFINITY_DOMAIN_NUMA |
PI_DEVICE_AFFINITY_DOMAIN_NEXT_PARTITIONABLE});
case PI_DEVICE_INFO_PARTITION_TYPE: {
if (Device->isSubDevice()) {
// For root-device there is no partitioning to report.
if (!Device->isSubDevice())
return ReturnValue(pi_device_partition_property{0});

if (Device->isCCS()) {
struct {
pi_device_partition_property Arr[3];
} PartitionProperties = {{PI_DEVICE_PARTITION_BY_AFFINITY_DOMAIN,
PI_DEVICE_AFFINITY_DOMAIN_NEXT_PARTITIONABLE,
0}};
pi_device_partition_property Arr[2];
} PartitionProperties = {{PI_EXT_INTEL_DEVICE_PARTITION_BY_CSLICE, 0}};
return ReturnValue(PartitionProperties);
}
// For root-device there is no partitioning to report.
return ReturnValue(pi_device_partition_property{0});

struct {
pi_device_partition_property Arr[3];
} PartitionProperties = {{PI_DEVICE_PARTITION_BY_AFFINITY_DOMAIN,
PI_DEVICE_AFFINITY_DOMAIN_NEXT_PARTITIONABLE, 0}};
return ReturnValue(PartitionProperties);
}

// Everything under here is not supported yet
Expand Down Expand Up @@ -3264,15 +3293,19 @@ pi_result piDevicePartition(pi_device Device,
const pi_device_partition_property *Properties,
pi_uint32 NumDevices, pi_device *OutDevices,
pi_uint32 *OutNumDevices) {
PI_ASSERT(Device, PI_ERROR_INVALID_DEVICE);
// Other partitioning ways are not supported by Level Zero
if (Properties[0] != PI_DEVICE_PARTITION_BY_AFFINITY_DOMAIN ||
(Properties[1] != PI_DEVICE_AFFINITY_DOMAIN_NEXT_PARTITIONABLE &&
Properties[1] != PI_DEVICE_AFFINITY_DOMAIN_NUMA)) {
if (Properties[0] == PI_DEVICE_PARTITION_BY_AFFINITY_DOMAIN) {
if ((Properties[1] != PI_DEVICE_AFFINITY_DOMAIN_NEXT_PARTITIONABLE &&
Properties[1] != PI_DEVICE_AFFINITY_DOMAIN_NUMA))
return PI_ERROR_INVALID_VALUE;
} else if (Properties[0] == PI_EXT_INTEL_DEVICE_PARTITION_BY_CSLICE) {
if (Properties[1] != 0)
return PI_ERROR_INVALID_VALUE;
} else {
return PI_ERROR_INVALID_VALUE;
}

PI_ASSERT(Device, PI_ERROR_INVALID_DEVICE);

// Devices cache is normally created in piDevicesGet but still make
// sure that cache is populated.
//
Expand All @@ -3281,16 +3314,39 @@ pi_result piDevicePartition(pi_device Device,
return Res;
}

auto EffectiveNumDevices = [&]() -> decltype(Device->SubDevices.size()) {
if (Device->SubDevices.size() == 0)
return 0;

// Sub-Sub-Devices are partitioned by CSlices, not by affinity domain.
// However, if
// SYCL_PI_LEVEL_ZERO_EXPOSE_CSLICE_IN_AFFINITY_PARTITIONING overrides that
// still expose CSlices in partitioning by affinity domain for compatibility
// reasons.
if (Properties[0] == PI_DEVICE_PARTITION_BY_AFFINITY_DOMAIN &&
!ExposeCSliceInAffinityPartitioning) {
if (Device->isSubDevice())
return 0;
}
if (Properties[0] == PI_EXT_INTEL_DEVICE_PARTITION_BY_CSLICE) {
// Not a CSlice-based partitioning.
if (!Device->SubDevices[0]->isCCS())
return 0;
}

return Device->SubDevices.size();
}();

if (OutNumDevices) {
*OutNumDevices = Device->SubDevices.size();
*OutNumDevices = EffectiveNumDevices;
}

if (OutDevices) {
// TODO: Consider support for partitioning to <= total sub-devices.
// Currently supported partitioning (by affinity domain/numa) would always
// partition to all sub-devices.
//
PI_ASSERT(NumDevices == Device->SubDevices.size(), PI_ERROR_INVALID_VALUE);
PI_ASSERT(NumDevices == EffectiveNumDevices, PI_ERROR_INVALID_VALUE);

for (uint32_t I = 0; I < NumDevices; I++) {
OutDevices[I] = Device->SubDevices[I];
Expand Down
8 changes: 8 additions & 0 deletions sycl/plugins/level_zero/pi_level_zero.hpp
Original file line number Diff line number Diff line change
Expand Up @@ -379,6 +379,14 @@ struct _pi_device : _pi_object {

bool isSubDevice() { return RootDevice != nullptr; }

// Is this a Data Center GPU Max series (aka PVC).
bool isPVC() { return (ZeDeviceProperties->deviceId & 0xff0) == 0xbd0; }

// Does this device represent a single compute slice?
bool isCCS() const {
return QueueGroup[_pi_device::queue_group_info_t::Compute].ZeIndex >= 0;
}

// Cache of the immutable device properties.
ZeCache<ZeStruct<ze_device_properties_t>> ZeDeviceProperties;
ZeCache<ZeStruct<ze_device_compute_properties_t>> ZeDeviceComputeProperties;
Expand Down
22 changes: 22 additions & 0 deletions sycl/source/detail/device_impl.cpp
Original file line number Diff line number Diff line change
Expand Up @@ -250,6 +250,28 @@ std::vector<device> device_impl::create_sub_devices(
return create_sub_devices(Properties, SubDevicesCount);
}

std::vector<device> device_impl::create_sub_devices() const {
assert(!MIsHostDevice && "Partitioning is not supported on host.");

if (!is_partition_supported(
info::partition_property::ext_intel_partition_by_cslice)) {
throw sycl::feature_not_supported(
"Device does not support "
"sycl::info::partition_property::ext_intel_partition_by_cslice.",
PI_ERROR_INVALID_OPERATION);
}

const pi_device_partition_property Properties[2] = {
PI_EXT_INTEL_DEVICE_PARTITION_BY_CSLICE, 0};

pi_uint32 SubDevicesCount = 0;
const detail::plugin &Plugin = getPlugin();
Plugin.call<sycl::errc::invalid, PiApiKind::piDevicePartition>(
MDevice, Properties, 0, nullptr, &SubDevicesCount);

return create_sub_devices(Properties, SubDevicesCount);
}

pi_native_handle device_impl::getNative() const {
auto Plugin = getPlugin();
if (Plugin.getBackend() == backend::opencl)
Expand Down
10 changes: 10 additions & 0 deletions sycl/source/detail/device_impl.hpp
Original file line number Diff line number Diff line change
Expand Up @@ -172,6 +172,16 @@ class device_impl {
std::vector<device>
create_sub_devices(info::partition_affinity_domain AffinityDomain) const;

/// Partition device into sub devices
///
/// If this SYCL device does not support
/// info::partition_property::ext_intel_partition_by_cslice a
/// feature_not_supported exception must be thrown.
///
/// \return a vector class of sub devices partitioned from this SYCL
/// device at a granularity of "cslice" (compute slice).
std::vector<device> create_sub_devices() const;

/// Check if desired partition property supported by device
///
/// \param Prop is one of info::partition_property::(partition_equally,
Expand Down
1 change: 1 addition & 0 deletions sycl/source/detail/device_info.hpp
Original file line number Diff line number Diff line change
Expand Up @@ -358,6 +358,7 @@ static bool is_sycl_partition_property(info::partition_property PP) {
case info::partition_property::partition_equally:
case info::partition_property::partition_by_counts:
case info::partition_property::partition_by_affinity_domain:
case info::partition_property::ext_intel_partition_by_cslice:
return true;
}
return false;
Expand Down
8 changes: 8 additions & 0 deletions sycl/source/device.cpp
Original file line number Diff line number Diff line change
Expand Up @@ -117,6 +117,14 @@ template __SYCL_EXPORT std::vector<device> device::create_sub_devices<
info::partition_property::partition_by_affinity_domain>(
info::partition_affinity_domain AffinityDomain) const;

template <info::partition_property prop>
std::vector<device> device::create_sub_devices() const {
return impl->create_sub_devices();
}

template __SYCL_EXPORT std::vector<device> device::create_sub_devices<
info::partition_property::ext_intel_partition_by_cslice>() const;

bool device::has_extension(const std::string &extension_name) const {
return impl->has_extension(extension_name);
}
Expand Down
1 change: 1 addition & 0 deletions sycl/test/abi/sycl_symbols_linux.dump
Original file line number Diff line number Diff line change
Expand Up @@ -4175,6 +4175,7 @@ _ZNK4sycl3_V16device14is_acceleratorEv
_ZNK4sycl3_V16device18create_sub_devicesILNS0_4info18partition_propertyE4230EEESt6vectorIS1_SaIS1_EEm
_ZNK4sycl3_V16device18create_sub_devicesILNS0_4info18partition_propertyE4231EEESt6vectorIS1_SaIS1_EERKS5_ImSaImEE
_ZNK4sycl3_V16device18create_sub_devicesILNS0_4info18partition_propertyE4232EEESt6vectorIS1_SaIS1_EENS3_25partition_affinity_domainE
_ZNK4sycl3_V16device18create_sub_devicesILNS0_4info18partition_propertyE4233EEESt6vectorIS1_SaIS1_EEv
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please update Windows symbols as well.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Local windows build is broken for me. I'm working with @steffenlarsen on resolving this.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

_ZNK4sycl3_V16device3getEv
_ZNK4sycl3_V16device3hasENS0_6aspectE
_ZNK4sycl3_V16device6is_cpuEv
Expand Down
1 change: 1 addition & 0 deletions sycl/test/abi/sycl_symbols_windows.dump
Original file line number Diff line number Diff line change
Expand Up @@ -12,6 +12,7 @@
??$create_sub_devices@$0BAIG@@device@_V1@sycl@@QEBA?AV?$vector@Vdevice@_V1@sycl@@V?$allocator@Vdevice@_V1@sycl@@@std@@@std@@_K@Z
??$create_sub_devices@$0BAIH@@device@_V1@sycl@@QEBA?AV?$vector@Vdevice@_V1@sycl@@V?$allocator@Vdevice@_V1@sycl@@@std@@@std@@AEBV?$vector@_KV?$allocator@_K@std@@@4@@Z
??$create_sub_devices@$0BAII@@device@_V1@sycl@@QEBA?AV?$vector@Vdevice@_V1@sycl@@V?$allocator@Vdevice@_V1@sycl@@@std@@@std@@W4partition_affinity_domain@info@12@@Z
??$create_sub_devices@$0BAIJ@@device@_V1@sycl@@QEBA?AV?$vector@Vdevice@_V1@sycl@@V?$allocator@Vdevice@_V1@sycl@@@std@@@std@@XZ
??$getPlugin@$00@pi@detail@_V1@sycl@@YAAEBVplugin@123@XZ
??$getPlugin@$01@pi@detail@_V1@sycl@@YAAEBVplugin@123@XZ
??$getPlugin@$02@pi@detail@_V1@sycl@@YAAEBVplugin@123@XZ
Expand Down