
[SYCL] [ABI-Break] Partial implementation of sycl_ext_oneapi_cuda_cluster_group #14113


Merged — 89 commits, Jul 8, 2024
5242112
Introduce mappings to UR launch props.
JackAKirk May 27, 2024
02001a1
initial changes supporting thread block cluster launch
AD2605 May 29, 2024
939d01e
Merge remote-tracking branch 'jack-llvm/pi-launch-properties' into at…
AD2605 May 29, 2024
1b23bc4
fix compilation issues 1
AD2605 May 29, 2024
1a15a1c
fix compilation errors
AD2605 May 30, 2024
b26f303
fix compilation issues 1
AD2605 May 31, 2024
eceb616
delete executable added by mistake
AD2605 May 31, 2024
b718967
add unit test and temporary changes for testing
AD2605 Jun 1, 2024
c37f4be
restore handler.hpp
AD2605 Jun 1, 2024
80ed867
Merge remote-tracking branch 'intel-llvm/sycl' into atharva/thread_bl…
AD2605 Jun 1, 2024
9b75ebb
return PI_ERROR_UNSUPPORTED_FEATURE for non cuda backends
AD2605 Jun 1, 2024
79f9d1f
fix bug in test
AD2605 Jun 2, 2024
40c0db4
cleanup
AD2605 Jun 2, 2024
411aa70
pass properties in enqueue functions and add test for cluster launch
AD2605 Jun 2, 2024
519034d
restore enqueue funcitons to latest tip to check if it is buggy
AD2605 Jun 2, 2024
140b4d7
Revert "restore enqueue funcitons to latest tip to check if it is buggy"
AD2605 Jun 2, 2024
f95801a
remove test via parallel for, test only via enqueue functions and add…
AD2605 Jun 10, 2024
3ff69cf
remove cluster dim validity as it's already happening in UR
AD2605 Jun 10, 2024
9651d4f
Merge remote-tracking branch 'origin/sycl' into atharva/thread_block_…
AD2605 Jun 10, 2024
81c5456
Bump PI Minor version
AD2605 Jun 10, 2024
f2adb81
add requires cuda in cluster launch test
AD2605 Jun 10, 2024
4e82463
move cluster_size struct to cuda namespace
AD2605 Jun 10, 2024
3b5ee0a
do not use alias and clang format
AD2605 Jun 10, 2024
86a73cd
Fix unittest build failures.
JackAKirk Jun 10, 2024
537d7c5
Merge pull request #1 from JackAKirk/pi-test-enqueue-custom
AD2605 Jun 10, 2024
007ef24
disable fusion when kernel uses cluster launch
AD2605 Jun 10, 2024
364d1d8
fix warning in CI
AD2605 Jun 11, 2024
24d13cf
Merge branch 'sycl' into atharva/thread_block_cluster_launch
JackAKirk Jun 11, 2024
e2ecd3e
Add native-cpu impl.
JackAKirk Jun 11, 2024
1a16024
Merge pull request #2 from JackAKirk/pi-add-native-cpu-impl-cluster
AD2605 Jun 11, 2024
846f5f1
add piextEnqueueKernelLaunchCustom to *_symbol_check.dump and fix nam…
AD2605 Jun 13, 2024
5cf823f
added symbols for linux and windows and revert unintended change in l…
AD2605 Jun 13, 2024
875038a
Merge remote-tracking branch 'intel-llvm/sycl' into atharva/thread_bl…
AD2605 Jun 13, 2024
ad0adcd
restore vector_arith.hpp back to upstream sycl after git clang-format
AD2605 Jun 13, 2024
e5015b5
review comments 1
AD2605 Jun 17, 2024
edb3c9d
fix compilation errors
AD2605 Jun 17, 2024
6f39040
add missing headers in test
AD2605 Jun 17, 2024
e3fcd1d
correct usage of headers in tests
AD2605 Jun 17, 2024
f694315
add more tests
AD2605 Jun 17, 2024
94f6f77
amend tests acc. to the UR bugix
AD2605 Jun 17, 2024
5963b36
Merge remote-tracking branch 'intel-llvm/sycl' into atharva/thread_bl…
AD2605 Jun 17, 2024
aa21ff5
fix bug in test
AD2605 Jun 17, 2024
f3a7dfa
fix ABI tests
AD2605 Jun 18, 2024
db6ed43
Merge remote-tracking branch 'intel-llvm/sycl' into atharva/thread_bl…
AD2605 Jun 19, 2024
71e3336
Merge remote-tracking branch 'intel-llvm/sycl' into atharva/thread_bl…
AD2605 Jun 20, 2024
9863621
add test enqueueLaunchCustom_check_event_deps.cpp
AD2605 Jun 20, 2024
3395142
add new line in latest test
AD2605 Jun 20, 2024
1c35fdd
update test to include kernel after cluster launch to check kernel to…
AD2605 Jun 20, 2024
96c84ca
remove else block
AD2605 Jun 20, 2024
f0f9bfd
fix case when launching for esimd_simulator and fix formatting in clu…
AD2605 Jun 20, 2024
61e3474
Merge remote-tracking branch 'intel-llvm/sycl' into atharva/thread_bl…
AD2605 Jun 24, 2024
2f4ac06
remove properties being passed to parallel for when using a sycl::kernel
AD2605 Jun 25, 2024
4e1e14f
Add aspect ext_oneapi_cuda_cluster_group.
JackAKirk Jun 26, 2024
9eb69ad
Point ur to testing branch.
JackAKirk Jun 26, 2024
0380732
Merge pull request #3 from JackAKirk/cuda-cluster-launch-aspect
AD2605 Jun 27, 2024
0cf1681
Add cluster_group aspect to .td defs.
JackAKirk Jun 27, 2024
bd85b80
Merge pull request #4 from JackAKirk/cluster-fix-aspect-tests
AD2605 Jun 27, 2024
a1b80d5
Merge branch 'sycl' into atharva/thread_block_cluster_launch
JackAKirk Jun 27, 2024
d26e53f
add sycl graph test, ensuring failure if cluster launch is used with …
AD2605 Jun 27, 2024
e0aa8c8
throw error in sycl graphs if cluster launch is used
AD2605 Jun 27, 2024
6374f6e
Merge remote-tracking branch 'intel-llvm/sycl' into atharva/thread_bl…
AD2605 Jun 27, 2024
055bbc9
correct naming convention
AD2605 Jun 28, 2024
3f909c9
change signature of setKernelUsesClusterLaunch
AD2605 Jun 30, 2024
1032723
Merge remote-tracking branch 'intel-llvm/sycl' into atharva/thread_bl…
AD2605 Jul 1, 2024
56acac7
update unified-runtime sha to the latest
AD2605 Jul 1, 2024
722d29a
Merge branch 'sycl' into atharva/thread_block_cluster_launch
JackAKirk Jul 1, 2024
af08b2a
Fix graph cluster Exceptions test.
JackAKirk Jul 1, 2024
f27769e
Merge branch 'sycl' into atharva/thread_block_cluster_launch
JackAKirk Jul 1, 2024
9d7938a
Merge branch 'sycl' into atharva/thread_block_cluster_launch
JackAKirk Jul 2, 2024
353f759
Remove wait_and_throw in test.
JackAKirk Jul 2, 2024
aa5a64d
Update enum value.
JackAKirk Jul 2, 2024
8ffacc3
Aid cluster_size template deduction.
JackAKirk Jul 2, 2024
1f0ba28
bool param removed from new abi.
JackAKirk Jul 2, 2024
4e52601
Merge branch 'sycl' into atharva/thread_block_cluster_launch
JackAKirk Jul 2, 2024
39f8f5d
Impl requested test naming convention.
JackAKirk Jul 2, 2024
86db950
Fix format.
JackAKirk Jul 2, 2024
7c86278
Fix format.
JackAKirk Jul 2, 2024
43553ea
Merge branch 'sycl' into atharva/thread_block_cluster_launch
JackAKirk Jul 2, 2024
2cfe979
Merge branch 'sycl' into atharva/thread_block_cluster_launch
JackAKirk Jul 2, 2024
9ec11de
Merge branch 'sycl' into atharva/thread_block_cluster_launch
JackAKirk Jul 4, 2024
c6a8ef2
Fix CHECK-NEXT failure.
JackAKirk Jul 4, 2024
24eb8f2
Merge remote-tracking branch 'intel-llvm/sycl' into atharva/thread_bl…
AD2605 Jul 5, 2024
91bc7af
make size private and refactoor hasClusterDim
AD2605 Jul 5, 2024
f11bc5a
Fix build.
JackAKirk Jul 5, 2024
8498375
Merge branch 'sycl' into atharva/thread_block_cluster_launch
JackAKirk Jul 5, 2024
c24fc47
Fix format.
JackAKirk Jul 5, 2024
a17f229
Fix format.
JackAKirk Jul 5, 2024
8801a6a
Add queue wait to avoid race.
JackAKirk Jul 5, 2024
5a0b039
sm90 fix.
JackAKirk Jul 6, 2024
7 changes: 4 additions & 3 deletions llvm/include/llvm/SYCLLowerIR/DeviceConfigFile.td
@@ -82,6 +82,7 @@ def AspectExt_oneapi_limited_graph : Aspect<"ext_oneapi_limited_graph">;
def AspectExt_oneapi_private_alloca : Aspect<"ext_oneapi_private_alloca">;
def AspectExt_oneapi_queue_profiling_tag : Aspect<"ext_oneapi_queue_profiling_tag">;
def AspectExt_oneapi_virtual_mem : Aspect<"ext_oneapi_virtual_mem">;
def AspectExt_oneapi_cuda_cluster_group : Aspect<"ext_oneapi_cuda_cluster_group">;
// Deprecated aspects
def AspectInt64_base_atomics : Aspect<"int64_base_atomics">;
def AspectInt64_extended_atomics : Aspect<"int64_extended_atomics">;
@@ -139,7 +140,7 @@ def : TargetInfo<"__TestAspectList",
AspectExt_oneapi_ballot_group, AspectExt_oneapi_fixed_size_group, AspectExt_oneapi_opportunistic_group,
AspectExt_oneapi_tangle_group, AspectExt_intel_matrix, AspectExt_oneapi_is_composite, AspectExt_oneapi_is_component,
AspectExt_oneapi_graph, AspectExt_intel_fpga_task_sequence, AspectExt_oneapi_limited_graph,
AspectExt_oneapi_private_alloca, AspectExt_oneapi_queue_profiling_tag, AspectExt_oneapi_virtual_mem],
AspectExt_oneapi_private_alloca, AspectExt_oneapi_queue_profiling_tag, AspectExt_oneapi_virtual_mem, AspectExt_oneapi_cuda_cluster_group],
[]>;
// This definition serves the only purpose of testing whether the deprecated aspect list defined in here and in SYCL RT
// match.
@@ -205,9 +206,9 @@ def : CudaTargetInfo<"nvidia_gpu_sm_87", !listconcat(CudaMinAspects, CudaBindles
def : CudaTargetInfo<"nvidia_gpu_sm_89", !listconcat(CudaMinAspects, CudaBindlessImagesAspects,
[AspectFp16, AspectAtomic64, AspectExt_oneapi_cuda_async_barrier])>;
def : CudaTargetInfo<"nvidia_gpu_sm_90", !listconcat(CudaMinAspects, CudaBindlessImagesAspects,
[AspectFp16, AspectAtomic64, AspectExt_oneapi_cuda_async_barrier])>;
[AspectFp16, AspectAtomic64, AspectExt_oneapi_cuda_async_barrier, AspectExt_oneapi_cuda_cluster_group])>;
def : CudaTargetInfo<"nvidia_gpu_sm_90a", !listconcat(CudaMinAspects, CudaBindlessImagesAspects,
[AspectFp16, AspectAtomic64, AspectExt_oneapi_cuda_async_barrier])>;
[AspectFp16, AspectAtomic64, AspectExt_oneapi_cuda_async_barrier, AspectExt_oneapi_cuda_cluster_group])>;

//
// HIP / AMDGPU device aspects
1 change: 1 addition & 0 deletions sycl/doc/design/DeviceConfigFile.md
@@ -180,6 +180,7 @@ def AspectExt_intel_free_memory : Aspect<"ext_intel_free_memory">;
def AspectExt_intel_device_id : Aspect<"ext_intel_device_id">;
def AspectExt_intel_memory_clock_rate : Aspect<"ext_intel_memory_clock_rate">;
def AspectExt_intel_memory_bus_width : Aspect<"ext_intel_memory_bus_width">;
def AspectExt_oneapi_cuda_cluster_group : Aspect<"ext_oneapi_cuda_cluster_group">;
def AspectEmulated : Aspect<"emulated">;

def TargetTable : DynamicTable {
7 changes: 5 additions & 2 deletions sycl/include/sycl/detail/cg.hpp
@@ -178,6 +178,7 @@ class CGExecKernel : public CG {
std::vector<std::shared_ptr<const void>> MAuxiliaryResources;
sycl::detail::pi::PiKernelCacheConfig MKernelCacheConfig;
bool MKernelIsCooperative = false;
bool MKernelUsesClusterLaunch = false;

CGExecKernel(NDRDescT NDRDesc, std::shared_ptr<HostKernelBase> HKernel,
std::shared_ptr<detail::kernel_impl> SyclKernel,
@@ -188,15 +189,17 @@
std::vector<std::shared_ptr<const void>> AuxiliaryResources,
CGTYPE Type,
sycl::detail::pi::PiKernelCacheConfig KernelCacheConfig,
bool KernelIsCooperative, detail::code_location loc = {})
bool KernelIsCooperative, bool MKernelUsesClusterLaunch,
detail::code_location loc = {})
: CG(Type, std::move(CGData), std::move(loc)),
MNDRDesc(std::move(NDRDesc)), MHostKernel(std::move(HKernel)),
MSyclKernel(std::move(SyclKernel)),
MKernelBundle(std::move(KernelBundle)), MArgs(std::move(Args)),
MKernelName(std::move(KernelName)), MStreams(std::move(Streams)),
MAuxiliaryResources(std::move(AuxiliaryResources)),
MKernelCacheConfig(std::move(KernelCacheConfig)),
MKernelIsCooperative(KernelIsCooperative) {
MKernelIsCooperative(KernelIsCooperative),
MKernelUsesClusterLaunch(MKernelUsesClusterLaunch) {
assert(getType() == Kernel && "Wrong type of exec kernel CG.");
}

16 changes: 14 additions & 2 deletions sycl/include/sycl/detail/cg_types.hpp
@@ -50,8 +50,8 @@ class ArgDesc {
int MIndex;
};

// The structure represents NDRange - global, local sizes, global offset and
// number of dimensions.
// The structure represents NDRange - global, local sizes, global offset,
// number of dimensions, and the cluster dimensions if applicable.
class NDRDescT {
// The method initializes all sizes for dimensions greater than the passed one
// to the default values, so they will not affect execution.
@@ -128,13 +128,25 @@
Dims = Dims_;
}

template <int Dims_> void setClusterDimensions(sycl::range<Dims_> N) {
if (Dims_ != Dims) {
throw std::runtime_error(
"Dimensionality of cluster, global and local ranges must be same");
}

for (int I = 0; I < Dims_; ++I) {
ClusterDimensions[I] = N[I];
}
}

sycl::range<3> GlobalSize;
sycl::range<3> LocalSize;
sycl::id<3> GlobalOffset;
/// Number of workgroups, used to record the number of workgroups from the
/// simplest form of parallel_for_work_group. If set, all other fields must be
/// zero
sycl::range<3> NumWorkGroups;
sycl::range<3> ClusterDimensions{1, 1, 1};
Contributor:
This is an ABI break, as suggested by layout_handler.cpp and symbol_size_alignment.cpp. If we want to roll with that, this PR should be marked [ABI-break] and the corresponding label should be added.

Alternatively, you could change MKernelUsesClusterLaunch in handler_impl to be a std::optional<sycl::range<3>> and pass that along. We have had some issues with passing std::optional across the library boundary before, but as long as it stays inside the source files, it should not be a problem.

I am of two minds, because on one hand this seems like a fitting place for the new information and ABI-breaks are allowed. On the other hand, changing the layout of handler is exactly what handler_impl is here to prevent. Maybe a better solution is to make another ABI-break by moving NDRDescT out of handler and into handler_impl. @aelovikov-intel - Thoughts?

Contributor:
I think your logic is sensible; moving to handler_impl might be the best option. @aelovikov-intel, I'd also appreciate your input. Thanks.

We focused on getting a reasonable implementation of this feature up to collect feedback, particularly on the scheduling/handler details from Intel developers, and this was one of the main points we foresaw could be challenging or contentious.

I think it would be a good idea to focus on this point and get it right the first time, since it is an ABI break, and we really only have until the end of next week to solve this and get it merged: both I and @AD2605 are on holiday after that and won't be back until the ABI-break window is over.

Contributor:
I have made a patch for moving more detail parts into sources and moving some of the handler members into the handler_impl: #14460

If that is the way we want to go, I would be okay with merging this as-is and moving the new changes as part of the aforementioned patch.

Contributor:
I agree.

size_t Dims;
};

3 changes: 3 additions & 0 deletions sycl/include/sycl/detail/pi.def
@@ -230,4 +230,7 @@ _PI_API(piextVirtualMemGetInfo)
// Enqueue native command
_PI_API(piextEnqueueNativeCommand)

// Kernel Launch Properties
_PI_API(piextEnqueueKernelLaunchCustom)

#undef _PI_API
36 changes: 35 additions & 1 deletion sycl/include/sycl/detail/pi.h
@@ -197,9 +197,11 @@
// pi_virtual_access_flags bit flags.
// 15.55 Added piextEnqueueNativeCommand as well as associated types and enums
// 16.56 Replaced piextUSMEnqueueMemset with piextUSMEnqueueFill
// 16.57 Added mappings to UR launch properties extension
// (piextEnqueueKernelLaunchCustom)

#define _PI_H_VERSION_MAJOR 16
#define _PI_H_VERSION_MINOR 56
#define _PI_H_VERSION_MINOR 57

#define _PI_STRING_HELPER(a) #a
#define _PI_CONCAT(a, b) _PI_STRING_HELPER(a.b)
@@ -514,8 +516,12 @@ typedef enum {

// Virtual memory support
PI_EXT_ONEAPI_DEVICE_INFO_SUPPORTS_VIRTUAL_MEM = 0x2011E,

// Native enqueue
PI_EXT_ONEAPI_DEVICE_INFO_ENQUEUE_NATIVE_COMMAND_SUPPORT = 0x2011F,

// Return whether cluster launch is supported by device
PI_EXT_ONEAPI_DEVICE_INFO_CLUSTER_LAUNCH = 0x2021,
} _pi_device_info;

typedef enum {
@@ -1317,8 +1323,28 @@ typedef enum {
///< P2P link, otherwise such operations are not supported.
} _pi_peer_attr;

typedef enum {
PI_LAUNCH_PROPERTY_IGNORE = 0x0,
PI_LAUNCH_PROPERTY_COOPERATIVE = 0x1,
PI_LAUNCH_PROPERTY_CLUSTER_DIMENSION = 0x2,
} _pi_launch_property_id;

typedef union {
int cooperative;
int32_t cluster_dims[3];
} _pi_launch_property_value;

using pi_mem_info = _pi_mem_info;
using pi_peer_attr = _pi_peer_attr;
using pi_launch_property_id = _pi_launch_property_id;
using pi_launch_property_value = _pi_launch_property_value;

typedef struct {
pi_launch_property_id id;
pi_launch_property_value value;
} _pi_launch_property;

using pi_launch_property = _pi_launch_property;

//
// Following section contains SYCL RT Plugin Interface (PI) functions.
@@ -1933,6 +1959,14 @@ __SYCL_EXPORT pi_result piextEnqueueCooperativeKernelLaunch(
const size_t *local_work_size, pi_uint32 num_events_in_wait_list,
const pi_event *event_wait_list, pi_event *event);

__SYCL_EXPORT pi_result piextEnqueueKernelLaunchCustom(
pi_queue queue, pi_kernel kernel, pi_uint32 work_dim,
const size_t *global_work_size, const size_t *local_work_size,
pi_uint32 num_props_in_launch_prop_list,
const pi_launch_property *launch_prop_list,
pi_uint32 num_events_in_wait_list, const pi_event *event_wait_list,
pi_event *event);

__SYCL_EXPORT pi_result piEnqueueEventsWait(pi_queue command_queue,
pi_uint32 num_events_in_wait_list,
const pi_event *event_wait_list,
1 change: 1 addition & 0 deletions sycl/include/sycl/detail/pi.hpp
@@ -148,6 +148,7 @@ using PiImageOffset = ::pi_image_offset_struct;
using PiImageRegion = ::pi_image_region_struct;
using PiPhysicalMem = ::pi_physical_mem;
using PiVirtualAccessFlags = ::pi_virtual_access_flags;
using PiLaunchProperty = ::pi_launch_property;

__SYCL_EXPORT void contextSetExtendedDeleter(const sycl::context &constext,
pi_context_extended_deleter func,
10 changes: 10 additions & 0 deletions sycl/include/sycl/device_aspect_macros.hpp
@@ -381,6 +381,11 @@
#define __SYCL_ALL_DEVICES_HAVE_ext_oneapi_virtual_mem__ 0
#endif

#ifndef __SYCL_ALL_DEVICES_HAVE_ext_oneapi_cuda_cluster_group__
// __SYCL_ASPECT(ext_oneapi_cuda_cluster_group, 75)
#define __SYCL_ALL_DEVICES_HAVE_ext_oneapi_cuda_cluster_group__ 0
#endif

#ifndef __SYCL_ANY_DEVICE_HAS_host__
// __SYCL_ASPECT(host, 0)
#define __SYCL_ANY_DEVICE_HAS_host__ 0
@@ -750,3 +755,8 @@
// __SYCL_ASPECT(ext_oneapi_virtual_mem, 74)
#define __SYCL_ANY_DEVICE_HAS_ext_oneapi_virtual_mem__ 0
#endif

#ifndef __SYCL_ANY_DEVICE_HAS_ext_oneapi_cuda_cluster_group__
// __SYCL_ASPECT(ext_oneapi_cuda_cluster_group, 75)
#define __SYCL_ANY_DEVICE_HAS_ext_oneapi_cuda_cluster_group__ 0
#endif
@@ -0,0 +1,94 @@
//==--- cluster_group_prop.hpp --- SYCL extension for cuda cluster group ---==//
//
// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.
// See https://llvm.org/LICENSE.txt for license information.
// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
//
//===----------------------------------------------------------------------===//

#pragma once

#include <sycl/ext/oneapi/properties/properties.hpp>
#include <sycl/range.hpp>

namespace sycl {
inline namespace _V1 {
namespace ext::oneapi::experimental {

namespace cuda {
template <int Dim>
struct cluster_size
: ::sycl::ext::oneapi::experimental::detail::run_time_property_key<
::sycl::ext::oneapi::experimental::detail::ClusterLaunch> {
cluster_size(const range<Dim> &size) : size(size) {}
sycl::range<Dim> get_cluster_size() { return size; }

private:
range<Dim> size;
Contributor:
This member should maybe be private.

@JackAKirk (Contributor), Jul 3, 2024:

Maybe, but I'm not sure. I doubt this is very important either way, but I didn't design the cluster_size property, so I would prefer not to touch it at this stage, especially since the CI is brittle and we have finally gotten everything green with no conflicts. This is an experimental interface, and minor things like this can easily be changed later.
The important thing is that we get this feature merged in some form ASAP. Apparently it should be available in a nightly build ASAP, @Ruyk?

Soon I will be on holiday and will be unable to continue refactoring it as new ABI-break PRs and Unified Runtime changes are merged, so it will be at risk of missing the ABI-break window entirely.

Contributor:
@intel/dpcpp-tools-reviewers

This is 100% green now. Could you please review this ASAP?

Contributor (Author):
@AerialMantis I made the member size private in the latest commit.

};

template <int Dim> using cluster_size_key = cluster_size<Dim>;

} // namespace cuda

template <>
struct is_property_key<cuda::cluster_size_key<1>> : std::true_type {};
template <>
struct is_property_key<cuda::cluster_size_key<2>> : std::true_type {};
template <>
struct is_property_key<cuda::cluster_size_key<3>> : std::true_type {};

template <typename T>
struct is_property_key_of<cuda::cluster_size_key<1>, T> : std::true_type {};

template <typename T>
struct is_property_key_of<cuda::cluster_size_key<2>, T> : std::true_type {};

template <typename T>
struct is_property_key_of<cuda::cluster_size_key<3>, T> : std::true_type {};

template <>
struct is_property_value<cuda::cluster_size_key<1>>
: is_property_key<cuda::cluster_size_key<1>> {};
template <>
struct is_property_value<cuda::cluster_size_key<2>>
: is_property_key<cuda::cluster_size_key<2>> {};
template <>
struct is_property_value<cuda::cluster_size_key<3>>
: is_property_key<cuda::cluster_size_key<3>> {};

template <typename O>
struct is_property_value_of<cuda::cluster_size_key<1>, O>
: is_property_key_of<cuda::cluster_size_key<1>, O> {};

template <typename O>
struct is_property_value_of<cuda::cluster_size_key<2>, O>
: is_property_key_of<cuda::cluster_size_key<2>, O> {};

template <typename O>
struct is_property_value_of<cuda::cluster_size_key<3>, O>
: is_property_key_of<cuda::cluster_size_key<3>, O> {};

namespace detail {
template <typename PropertiesT> constexpr std::size_t getClusterDim() {
if constexpr (PropertiesT::template has_property<
sycl::ext::oneapi::experimental::cuda::cluster_size_key<
1>>()) {
return 1;
}
if constexpr (PropertiesT::template has_property<
sycl::ext::oneapi::experimental::cuda::cluster_size_key<
2>>()) {
return 2;
}
if constexpr (PropertiesT::template has_property<
sycl::ext::oneapi::experimental::cuda::cluster_size_key<
3>>()) {
return 3;
}
return 0;
}
} // namespace detail
} // namespace ext::oneapi::experimental
} // namespace _V1
} // namespace sycl
@@ -149,9 +149,9 @@ void parallel_for(handler &CGH,
ext::oneapi::experimental::detail::LaunchConfigAccess<range<Dimensions>,
Properties>
ConfigAccess(Config);
CGH.parallel_for<KernelName>(ConfigAccess.getRange(),
std::forward<ReductionsT>(Reductions)...,
KernelObj);
CGH.parallel_for<KernelName>(
ConfigAccess.getRange(), ConfigAccess.getProperties(),
std::forward<ReductionsT>(Reductions)..., KernelObj);
Contributor (Author), commenting on lines +152 to +154:

Hi @steffenlarsen, I was wondering if you could share your insight here.

For some context, this PR is a partial implementation of sycl_ext_oneapi_cuda_cluster_group. It introduces a new runtime launch property, cluster_size; when this property is set, a new UR entry point is used to launch the kernel in a different fashion. I use the functions added in the sycl_ext_oneapi_enqueue_functions extension to pass the properties and launch the kernel.

My question is this: ConfigAccess.getProperties() was not being passed previously, so as far as I understand, any runtime property the user defines would not be propagated further. However, if I do pass these properties, the enqueue-functions tests that use a kernel bundle fail to compile, complaining that sycl::kernel is not device copyable (link to failing tests for reference). I was wondering if I am missing something here and have an incomplete understanding of the implementation.

Thanks

Contributor:
Hi @AD2605! The issue you're seeing stems from the fact that the extension that adds properties to parallel_for and single_task does not add them to the overloads that take a sycl::kernel. I suspect the reason is that there was no use for it, given that all cases using properties on these needed them applied to the kernel object, which cannot be done when the kernel is passed as a sycl::kernel.

Eventually we will need it for passing properties with runtime information, like in the work_group_specific extension, which will soon be renamed to work_group_static.

I suggest you omit the changes to line 258 for now; if you need a runtime-value property passed down, the functionality can be added in a follow-up, with related extension changes. That assumes this extension doesn't need it immediately.

Contributor (Author):
I see, thanks a lot. Since we do not need kernel-bundle support for now, I can simply remove it in the meantime.

Thanks

}

template <typename KernelName = sycl::detail::auto_name, int Dimensions,
@@ -225,9 +225,9 @@ void nd_launch(handler &CGH,
ext::oneapi::experimental::detail::LaunchConfigAccess<nd_range<Dimensions>,
Properties>
ConfigAccess(Config);
CGH.parallel_for<KernelName>(ConfigAccess.getRange(),
std::forward<ReductionsT>(Reductions)...,
KernelObj);
CGH.parallel_for<KernelName>(
ConfigAccess.getRange(), ConfigAccess.getProperties(),
std::forward<ReductionsT>(Reductions)..., KernelObj);
}

template <typename KernelName = sycl::detail::auto_name, int Dimensions,
5 changes: 4 additions & 1 deletion sycl/include/sycl/ext/oneapi/experimental/graph.hpp
@@ -53,7 +53,8 @@ enum class UnsupportedGraphFeatures {
sycl_ext_oneapi_enqueue_barrier = 4,
sycl_ext_oneapi_memcpy2d = 5,
sycl_ext_oneapi_device_global = 6,
sycl_ext_oneapi_bindless_images = 7
sycl_ext_oneapi_bindless_images = 7,
sycl_ext_oneapi_experimental_cuda_cluster_launch = 8
};

inline const char *
@@ -76,6 +77,8 @@ UnsupportedFeatureToString(UnsupportedGraphFeatures Feature) {
return "sycl_ext_oneapi_device_global";
case UGF::sycl_ext_oneapi_bindless_images:
return "sycl_ext_oneapi_bindless_images";
case UGF::sycl_ext_oneapi_experimental_cuda_cluster_launch:
return "sycl_ext_oneapi_experimental_cuda_cluster_launch";
}

assert(false && "Unhandled graphs feature");
3 changes: 2 additions & 1 deletion sycl/include/sycl/ext/oneapi/properties/property.hpp
@@ -211,8 +211,9 @@ enum PropKind : uint32_t {
OutputDataPlacement = 70,
IncludeFiles = 71,
RegisteredKernelNames = 72,
ClusterLaunch = 73,
// PropKindSize must always be the last value.
PropKindSize = 73,
PropKindSize = 74,
};

struct property_key_base_tag {};