Skip to content

Commit 4b9eef3

Browse files
authored
[SYCL][XPTI] Added kernel info (#5887)
- kernel execution parameter (global size, local size, offset) - kernel argument parameters (number, size, pointer which can be associated with accessor/USM buffer)
1 parent 5231fe4 commit 4b9eef3

File tree

3 files changed

+142
-41
lines changed

3 files changed

+142
-41
lines changed

sycl/doc/design/SYCLInstrumentationUsingXPTI.md

Lines changed: 3 additions & 2 deletions
Original file line numberDiff line numberDiff line change
@@ -279,13 +279,14 @@ All trace point types in bold provide semantic information about the graph, node
279279
| `sym_source_file_name` | C-style string | Source file name |
280280
| `sym_line_no` | `int32_t` | File line number |
281281
| `sym_column_no` | `int32_t` | File column number |
282-
282+
| `enqueue_kernel_data` | `xpti::offload_kernel_arg_data_t` | Includes kernel execution parameters (global size, local size, offset) and number of kernel arguments |
283+
| `argN` | `xpti::offload_kernel_arg_data_t` | Description for the Nth kernel argument. It includes argument kind (sycl::detail::kernel_param_kind_t), pointer to the value, size and index in the argument list. |
283284

284285
## Buffer management stream `"sycl.experimental.buffer"` Notification Signatures
285286

286287
| Trace Point Type | Parameter Description | Metadata |
287288
| :------------------------: | :-------------------- | :------- |
288-
| `offload_alloc_construct` | <div style="text-align: left"><li>**trace_type**: `xpti::trace_point_type_t::offload_buffer_data_t` that marks offload buffer creation point</li> <li> **parent**: Event ID created for all functions in the `oneapi.experimental.buffer` layer.</li> <li> **event**: `nullptr` - since the stream of data just captures functions being called.</li> <li> **instance**: `nullptr` since no begin-end event alignment is needed. </li> <li> **user_data**: A pointer to `offload_buffer_data_t` object, that includes buffer object ID, host pointer used to create/initialize buffer, buffer element information (type name, size), number of buffer dimensions and buffer size for each dimension. </li></div> | None |
289+
| `offload_alloc_construct` | <div style="text-align: left"><li>**trace_type**: `xpti::trace_point_type_t::offload_buffer_data_t` that marks offload buffer creation point</li> <li> **parent**: Event ID created for all functions in the `oneapi.experimental.buffer` layer.</li> <li> **event**: `xpti::trace_event_data_t` - contains information about source location.</li> <li> **instance**: `nullptr` since no begin-end event alignment is needed. </li> <li> **user_data**: A pointer to `offload_buffer_data_t` object, that includes buffer object ID, host pointer used to create/initialize buffer, buffer element information (type name, size), number of buffer dimensions and buffer size for each dimension. </li></div> | None |
289290
| `offload_alloc_associate` | <div style="text-align: left"><li>**trace_type**: `xpti::trace_point_type_t::offload_buffer_association_data_t` that provides association between user level buffer object and platform specific memory object</li> <li> **parent**: Event ID created for all functions in the `oneapi.experimental.buffer` layer.</li> <li> **event**: `nullptr` - since the stream of data just captures functions being called.</li> <li> **instance**: `nullptr` since no begin-end event alignment is needed.</li> <li> **user_data**: A pointer to `offload_buffer_association_data_t` object, that includes user object ID and platform-specific representation for offload buffer. </li></div> | None |
290291
| `offload_alloc_destruct` | <div style="text-align: left"><li>**trace_type**: `xpti::trace_point_type_t::offload_buffer_data_t` that marks offload buffer destruction point</li> <li> **parent**: Event ID created for all functions in the `oneapi.experimental.buffer` layer.</li> <li> **event**: `nullptr` - since the stream of data just captures functions being called.</li> <li> **instance**: `nullptr` since no begin-end event alignment is needed. </li> <li> **user_data**: A pointer to `offload_buffer_data_t` object, that includes buffer object ID. </li></div> | None |
291292
| `offload_alloc_release` | <div style="text-align: left"><li>**trace_type**: `xpti::trace_point_type_t::offload_buffer_release_data_t` that provides information about release of platform specific memory object</li> <li> **parent**: `nullptr` - since the stream of data just captures functions being called.</li> <li> **event**: `nullptr` - since the stream of data just captures functions being called.</li> <li> **instance**: `nullptr` since no begin-end event alignment is needed.</li> <li> **user_data**: A pointer to `offload_buffer_association_data_t` object, that includes user object ID and platform-specific representation for offload buffer. </li></div> | None |

sycl/source/detail/scheduler/commands.cpp

Lines changed: 113 additions & 38 deletions
Original file line numberDiff line numberDiff line change
@@ -86,6 +86,41 @@ static std::string deviceToString(device Device) {
8686
return "UNKNOWN";
8787
}
8888

89+
static void applyFuncOnFilteredArgs(
90+
const ProgramManager::KernelArgMask &EliminatedArgMask,
91+
std::vector<ArgDesc> &Args,
92+
std::function<void(detail::ArgDesc &Arg, int NextTrueIndex)> Func) {
93+
if (EliminatedArgMask.empty()) {
94+
for (ArgDesc &Arg : Args) {
95+
Func(Arg, Arg.MIndex);
96+
}
97+
} else {
98+
// TODO this is not necessary as long as we can guarantee that the
99+
// arguments are already sorted (e. g. handle the sorting in handler
100+
// if necessary due to set_arg(...) usage).
101+
std::sort(Args.begin(), Args.end(), [](const ArgDesc &A, const ArgDesc &B) {
102+
return A.MIndex < B.MIndex;
103+
});
104+
int LastIndex = -1;
105+
size_t NextTrueIndex = 0;
106+
107+
for (ArgDesc &Arg : Args) {
108+
// Handle potential gaps in set arguments (e. g. if some of them are
109+
// set on the user side).
110+
for (int Idx = LastIndex + 1; Idx < Arg.MIndex; ++Idx)
111+
if (!EliminatedArgMask[Idx])
112+
++NextTrueIndex;
113+
LastIndex = Arg.MIndex;
114+
115+
if (EliminatedArgMask[Arg.MIndex])
116+
continue;
117+
118+
Func(Arg, NextTrueIndex);
119+
++NextTrueIndex;
120+
}
121+
}
122+
}
123+
89124
#ifdef XPTI_ENABLE_INSTRUMENTATION
90125
static size_t deviceToID(const device &Device) {
91126
if (Device.is_host())
@@ -1779,6 +1814,73 @@ void ExecCGCommand::emitInstrumentationData() {
17791814
xpti::addMetadata(CmdTraceEvent, "sym_column_no", MCommandGroup->MColumn);
17801815
}
17811816

1817+
if (MCommandGroup->getType() == detail::CG::Kernel) {
1818+
auto KernelCG =
1819+
reinterpret_cast<detail::CGExecKernel *>(MCommandGroup.get());
1820+
auto &NDRDesc = KernelCG->MNDRDesc;
1821+
std::vector<ArgDesc> Args;
1822+
1823+
auto FilterArgs = [&Args](detail::ArgDesc &Arg, int NextTrueIndex) {
1824+
Args.push_back({Arg.MType, Arg.MPtr, Arg.MSize, NextTrueIndex});
1825+
};
1826+
RT::PiProgram Program = nullptr;
1827+
RT::PiKernel Kernel = nullptr;
1828+
std::mutex *KernelMutex = nullptr;
1829+
1830+
std::shared_ptr<kernel_impl> SyclKernelImpl;
1831+
std::shared_ptr<device_image_impl> DeviceImageImpl;
1832+
auto KernelBundleImplPtr = KernelCG->getKernelBundle();
1833+
1834+
// Use kernel_bundle if available unless it is interop.
1835+
// Interop bundles can't be used in the first branch, because the kernels
1836+
// in interop kernel bundles (if any) do not have kernel_id
1837+
// and can therefore not be looked up, but since they are self-contained
1838+
// they can simply be launched directly.
1839+
if (KernelBundleImplPtr && !KernelBundleImplPtr->isInterop()) {
1840+
kernel_id KernelID =
1841+
detail::ProgramManager::getInstance().getSYCLKernelID(
1842+
KernelCG->MKernelName);
1843+
kernel SyclKernel =
1844+
KernelBundleImplPtr->get_kernel(KernelID, KernelBundleImplPtr);
1845+
Program = detail::getSyclObjImpl(SyclKernel)
1846+
->getDeviceImage()
1847+
->get_program_ref();
1848+
} else if (nullptr != KernelCG->MSyclKernel) {
1849+
auto SyclProg = detail::getSyclObjImpl(
1850+
KernelCG->MSyclKernel->get_info<info::kernel::program>());
1851+
Program = SyclProg->getHandleRef();
1852+
} else {
1853+
std::tie(Kernel, KernelMutex, Program) =
1854+
detail::ProgramManager::getInstance().getOrCreateKernel(
1855+
KernelCG->MOSModuleHandle, MQueue->getContextImplPtr(),
1856+
MQueue->getDeviceImplPtr(), KernelCG->MKernelName, nullptr);
1857+
}
1858+
1859+
ProgramManager::KernelArgMask EliminatedArgMask;
1860+
if (nullptr == KernelCG->MSyclKernel ||
1861+
!KernelCG->MSyclKernel->isCreatedFromSource()) {
1862+
EliminatedArgMask =
1863+
detail::ProgramManager::getInstance().getEliminatedKernelArgMask(
1864+
KernelCG->MOSModuleHandle, Program, KernelCG->MKernelName);
1865+
}
1866+
1867+
applyFuncOnFilteredArgs(EliminatedArgMask, KernelCG->MArgs, FilterArgs);
1868+
1869+
xpti::offload_kernel_enqueue_data_t KernelData{
1870+
{NDRDesc.GlobalSize[0], NDRDesc.GlobalSize[1], NDRDesc.GlobalSize[2]},
1871+
{NDRDesc.LocalSize[0], NDRDesc.LocalSize[1], NDRDesc.LocalSize[2]},
1872+
{NDRDesc.GlobalOffset[0], NDRDesc.GlobalOffset[1],
1873+
NDRDesc.GlobalOffset[2]},
1874+
Args.size()};
1875+
xpti::addMetadata(CmdTraceEvent, "enqueue_kernel_data", KernelData);
1876+
for (size_t i = 0; i < Args.size(); i++) {
1877+
std::string Prefix("arg");
1878+
xpti::offload_kernel_arg_data_t arg{(int)Args[i].MType, Args[i].MPtr,
1879+
Args[i].MSize, Args[i].MIndex};
1880+
xpti::addMetadata(CmdTraceEvent, Prefix + std::to_string(i), arg);
1881+
}
1882+
}
1883+
17821884
xptiNotifySubscribers(MStreamID, xpti::trace_node_create,
17831885
detail::GSYCLGraphEvent, CmdTraceEvent,
17841886
CGKernelInstanceNo,
@@ -1937,35 +2039,7 @@ static pi_result SetKernelParamsAndLaunch(
19372039
}
19382040
};
19392041

1940-
if (EliminatedArgMask.empty()) {
1941-
for (ArgDesc &Arg : Args) {
1942-
setFunc(Arg, Arg.MIndex);
1943-
}
1944-
} else {
1945-
// TODO this is not necessary as long as we can guarantee that the arguments
1946-
// are already sorted (e. g. handle the sorting in handler if necessary due
1947-
// to set_arg(...) usage).
1948-
std::sort(Args.begin(), Args.end(), [](const ArgDesc &A, const ArgDesc &B) {
1949-
return A.MIndex < B.MIndex;
1950-
});
1951-
int LastIndex = -1;
1952-
size_t NextTrueIndex = 0;
1953-
1954-
for (ArgDesc &Arg : Args) {
1955-
// Handle potential gaps in set arguments (e. g. if some of them are set
1956-
// on the user side).
1957-
for (int Idx = LastIndex + 1; Idx < Arg.MIndex; ++Idx)
1958-
if (!EliminatedArgMask[Idx])
1959-
++NextTrueIndex;
1960-
LastIndex = Arg.MIndex;
1961-
1962-
if (EliminatedArgMask[Arg.MIndex])
1963-
continue;
1964-
1965-
setFunc(Arg, NextTrueIndex);
1966-
++NextTrueIndex;
1967-
}
1968-
}
2042+
applyFuncOnFilteredArgs(EliminatedArgMask, Args, setFunc);
19692043

19702044
adjustNDRangePerKernel(NDRDesc, Kernel, *(Queue->getDeviceImplPtr()));
19712045

@@ -2380,15 +2454,16 @@ cl_int ExecCGCommand::enqueueImp() {
23802454
Plugin.call<PiApiKind::piEventsWait>(RawEvents.size(), &RawEvents[0]);
23812455
}
23822456
std::vector<interop_handler::ReqToMem> ReqMemObjs;
2383-
// Extract the Mem Objects for all Requirements, to ensure they are available if
2384-
// a user ask for them inside the interop task scope
2385-
const auto& HandlerReq = ExecInterop->MRequirements;
2386-
std::for_each(std::begin(HandlerReq), std::end(HandlerReq), [&](Requirement* Req) {
2387-
AllocaCommandBase *AllocaCmd = getAllocaForReq(Req);
2388-
auto MemArg = reinterpret_cast<pi_mem>(AllocaCmd->getMemAllocation());
2389-
interop_handler::ReqToMem ReqToMem = std::make_pair(Req, MemArg);
2390-
ReqMemObjs.emplace_back(ReqToMem);
2391-
});
2457+
// Extract the Mem Objects for all Requirements, to ensure they are
2458+
// available if a user ask for them inside the interop task scope
2459+
const auto &HandlerReq = ExecInterop->MRequirements;
2460+
std::for_each(
2461+
std::begin(HandlerReq), std::end(HandlerReq), [&](Requirement *Req) {
2462+
AllocaCommandBase *AllocaCmd = getAllocaForReq(Req);
2463+
auto MemArg = reinterpret_cast<pi_mem>(AllocaCmd->getMemAllocation());
2464+
interop_handler::ReqToMem ReqToMem = std::make_pair(Req, MemArg);
2465+
ReqMemObjs.emplace_back(ReqToMem);
2466+
});
23922467

23932468
std::sort(std::begin(ReqMemObjs), std::end(ReqMemObjs));
23942469
interop_handler InteropHandler(std::move(ReqMemObjs), MQueue);

xpti/include/xpti/xpti_data_types.h

Lines changed: 26 additions & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -394,7 +394,7 @@ enum class trace_point_type_t : uint16_t {
394394
offload_alloc_destruct = XPTI_TRACE_POINT_BEGIN(22),
395395
/// Used to notify about releasing internal handle for offload buffer
396396
offload_alloc_release = XPTI_TRACE_POINT_BEGIN(23),
397-
/// Used to notify about creation accessor for ofload buffer
397+
/// Used to notify about creation accessor for offload buffer
398398
offload_alloc_accessor = XPTI_TRACE_POINT_BEGIN(24),
399399
/// Indicates that the trace point is user defined and only the tool defined
400400
/// for a stream will be able to handle it
@@ -569,6 +569,31 @@ struct offload_buffer_association_data_t {
569569
/// A pointer to platform specific handler for the offload object
570570
uintptr_t mem_object_handle = 0;
571571
};
572+
573+
/// Describes enqueued kernel object
574+
struct offload_kernel_enqueue_data_t {
575+
/// Global size
576+
size_t global_size[3] = {0, 0, 0};
577+
/// Local size
578+
size_t local_size[3] = {0, 0, 0};
579+
/// Offset
580+
size_t offset[3] = {0, 0, 0};
581+
/// Number of kernel arguments
582+
size_t args_num = 0;
583+
};
584+
585+
/// Describes enqueued kernel argument
586+
struct offload_kernel_arg_data_t {
587+
/// Argument type as set in kernel_param_kind_t
588+
int type = -1;
589+
/// Pointer to the data
590+
void *pointer = nullptr;
591+
/// Size of the argument
592+
int size = 0;
593+
/// Index of the argument in the kernel
594+
int index = 0;
595+
};
596+
572597
/// Describes memory allocation
573598
struct mem_alloc_data_t {
574599
/// A platform-specific memory object handle. Some heterogeneous programming

0 commit comments

Comments
 (0)