[SYCL][XPTI] Added kernel info #5887

Merged: 8 commits, Apr 5, 2022
5 changes: 3 additions & 2 deletions sycl/doc/design/SYCLInstrumentationUsingXPTI.md
@@ -279,13 +279,14 @@ All trace point types in bold provide semantic information about the graph, node
| `sym_source_file_name` | C-style string | Source file name |
| `sym_line_no` | `int32_t` | File line number |
| `sym_column_no` | `int32_t` | File column number |
| `enqueue_kernel_data` | `xpti::offload_kernel_enqueue_data_t` | Includes kernel execution parameters (global size, local size, offset) and the number of kernel arguments |
| `argN` | `xpti::offload_kernel_arg_data_t` | Description of the Nth kernel argument: the argument kind (`sycl::detail::kernel_param_kind_t`), a pointer to the value, and its size and index in the argument list. |

## Buffer management stream `"sycl.experimental.buffer"` Notification Signatures

| Trace Point Type | Parameter Description | Metadata |
| :------------------------: | :-------------------- | :------- |
| `offload_alloc_construct` | <div style="text-align: left"><li>**trace_type**: `xpti::trace_point_type_t::offload_buffer_data_t` that marks offload buffer creation point</li> <li> **parent**: Event ID created for all functions in the `oneapi.experimental.buffer` layer.</li> <li> **event**: `nullptr` - since the stream of data just captures functions being called.</li> <li> **instance**: `nullptr` since no begin-end event alignment is needed. </li> <li> **user_data**: A pointer to `offload_buffer_data_t` object, that includes buffer object ID, host pointer used to create/initialize buffer, buffer element information (type name, size), number of buffer dimensions and buffer size for each dimension. </li></div> | None |
| `offload_alloc_construct` | <div style="text-align: left"><li>**trace_type**: `xpti::trace_point_type_t::offload_buffer_data_t` that marks the offload buffer creation point</li> <li> **parent**: Event ID created for all functions in the `oneapi.experimental.buffer` layer.</li> <li> **event**: `xpti::trace_event_data_t` - contains information about the source location.</li> <li> **instance**: `nullptr` since no begin-end event alignment is needed. </li> <li> **user_data**: A pointer to an `offload_buffer_data_t` object that includes the buffer object ID, the host pointer used to create/initialize the buffer, buffer element information (type name, size), the number of buffer dimensions, and the buffer size for each dimension. </li></div> | None |
| `offload_alloc_associate` | <div style="text-align: left"><li>**trace_type**: `xpti::trace_point_type_t::offload_buffer_association_data_t` that provides the association between the user-level buffer object and the platform-specific memory object</li> <li> **parent**: Event ID created for all functions in the `oneapi.experimental.buffer` layer.</li> <li> **event**: `nullptr` - since the stream of data just captures functions being called.</li> <li> **instance**: `nullptr` since no begin-end event alignment is needed.</li> <li> **user_data**: A pointer to an `offload_buffer_association_data_t` object that includes the user object ID and the platform-specific representation of the offload buffer. </li></div> | None |
| `offload_alloc_destruct` | <div style="text-align: left"><li>**trace_type**: `xpti::trace_point_type_t::offload_buffer_data_t` that marks the offload buffer destruction point</li> <li> **parent**: Event ID created for all functions in the `oneapi.experimental.buffer` layer.</li> <li> **event**: `nullptr` - since the stream of data just captures functions being called.</li> <li> **instance**: `nullptr` since no begin-end event alignment is needed. </li> <li> **user_data**: A pointer to an `offload_buffer_data_t` object that includes the buffer object ID. </li></div> | None |
| `offload_alloc_release` | <div style="text-align: left"><li>**trace_type**: `xpti::trace_point_type_t::offload_buffer_release_data_t` that provides information about the release of the platform-specific memory object</li> <li> **parent**: `nullptr` - since the stream of data just captures functions being called.</li> <li> **event**: `nullptr` - since the stream of data just captures functions being called.</li> <li> **instance**: `nullptr` since no begin-end event alignment is needed.</li> <li> **user_data**: A pointer to an `offload_buffer_association_data_t` object that includes the user object ID and the platform-specific representation of the offload buffer. </li></div> | None |
151 changes: 113 additions & 38 deletions sycl/source/detail/scheduler/commands.cpp
@@ -86,6 +86,41 @@ static std::string deviceToString(device Device) {
  return "UNKNOWN";
}

static void applyFuncOnFilteredArgs(
    const ProgramManager::KernelArgMask &EliminatedArgMask,
    std::vector<ArgDesc> &Args,
    std::function<void(detail::ArgDesc &Arg, int NextTrueIndex)> Func) {
  if (EliminatedArgMask.empty()) {
    for (ArgDesc &Arg : Args) {
      Func(Arg, Arg.MIndex);
    }
  } else {
    // TODO this is not necessary as long as we can guarantee that the
    // arguments are already sorted (e.g. handle the sorting in handler
    // if necessary due to set_arg(...) usage).
    std::sort(Args.begin(), Args.end(), [](const ArgDesc &A, const ArgDesc &B) {
      return A.MIndex < B.MIndex;
    });
    int LastIndex = -1;
    size_t NextTrueIndex = 0;

    for (ArgDesc &Arg : Args) {
      // Handle potential gaps in set arguments (e.g. if some of them are
      // set on the user side).
      for (int Idx = LastIndex + 1; Idx < Arg.MIndex; ++Idx)
        if (!EliminatedArgMask[Idx])
          ++NextTrueIndex;
      LastIndex = Arg.MIndex;

      if (EliminatedArgMask[Arg.MIndex])
        continue;

      Func(Arg, NextTrueIndex);
      ++NextTrueIndex;
    }
  }
}

#ifdef XPTI_ENABLE_INSTRUMENTATION
static size_t deviceToID(const device &Device) {
  if (Device.is_host())
@@ -1779,6 +1814,73 @@ void ExecCGCommand::emitInstrumentationData() {
    xpti::addMetadata(CmdTraceEvent, "sym_column_no", MCommandGroup->MColumn);
  }

  if (MCommandGroup->getType() == detail::CG::Kernel) {
    auto KernelCG =
        reinterpret_cast<detail::CGExecKernel *>(MCommandGroup.get());
    auto &NDRDesc = KernelCG->MNDRDesc;
    std::vector<ArgDesc> Args;

    auto FilterArgs = [&Args](detail::ArgDesc &Arg, int NextTrueIndex) {
      Args.push_back({Arg.MType, Arg.MPtr, Arg.MSize, NextTrueIndex});
    };
    RT::PiProgram Program = nullptr;
    RT::PiKernel Kernel = nullptr;
    std::mutex *KernelMutex = nullptr;

    std::shared_ptr<kernel_impl> SyclKernelImpl;
    std::shared_ptr<device_image_impl> DeviceImageImpl;
    auto KernelBundleImplPtr = KernelCG->getKernelBundle();

    // Use kernel_bundle if available, unless it is an interop bundle.
    // Interop bundles can't be used in the first branch because the kernels
    // in interop kernel bundles (if any) do not have a kernel_id and can
    // therefore not be looked up, but since they are self-contained they can
    // simply be launched directly.
    if (KernelBundleImplPtr && !KernelBundleImplPtr->isInterop()) {
      kernel_id KernelID =
          detail::ProgramManager::getInstance().getSYCLKernelID(
              KernelCG->MKernelName);
      kernel SyclKernel =
          KernelBundleImplPtr->get_kernel(KernelID, KernelBundleImplPtr);
      Program = detail::getSyclObjImpl(SyclKernel)
                    ->getDeviceImage()
                    ->get_program_ref();
    } else if (nullptr != KernelCG->MSyclKernel) {
      auto SyclProg = detail::getSyclObjImpl(
          KernelCG->MSyclKernel->get_info<info::kernel::program>());
      Program = SyclProg->getHandleRef();
    } else {
      std::tie(Kernel, KernelMutex, Program) =
          detail::ProgramManager::getInstance().getOrCreateKernel(
              KernelCG->MOSModuleHandle, MQueue->getContextImplPtr(),
              MQueue->getDeviceImplPtr(), KernelCG->MKernelName, nullptr);
    }

    ProgramManager::KernelArgMask EliminatedArgMask;
    if (nullptr == KernelCG->MSyclKernel ||
        !KernelCG->MSyclKernel->isCreatedFromSource()) {
      EliminatedArgMask =
          detail::ProgramManager::getInstance().getEliminatedKernelArgMask(
              KernelCG->MOSModuleHandle, Program, KernelCG->MKernelName);
    }

    applyFuncOnFilteredArgs(EliminatedArgMask, KernelCG->MArgs, FilterArgs);

    xpti::offload_kernel_enqueue_data_t KernelData{
        {NDRDesc.GlobalSize[0], NDRDesc.GlobalSize[1], NDRDesc.GlobalSize[2]},
        {NDRDesc.LocalSize[0], NDRDesc.LocalSize[1], NDRDesc.LocalSize[2]},
        {NDRDesc.GlobalOffset[0], NDRDesc.GlobalOffset[1],
         NDRDesc.GlobalOffset[2]},
        Args.size()};
    xpti::addMetadata(CmdTraceEvent, "enqueue_kernel_data", KernelData);
    for (size_t i = 0; i < Args.size(); i++) {
      std::string Prefix("arg");
      xpti::offload_kernel_arg_data_t arg{(int)Args[i].MType, Args[i].MPtr,
                                          Args[i].MSize, Args[i].MIndex};
      xpti::addMetadata(CmdTraceEvent, Prefix + std::to_string(i), arg);
    }
  }

  xptiNotifySubscribers(MStreamID, xpti::trace_node_create,
                        detail::GSYCLGraphEvent, CmdTraceEvent,
                        CGKernelInstanceNo,
@@ -1937,35 +2039,7 @@ static pi_result SetKernelParamsAndLaunch(
    }
  };

  if (EliminatedArgMask.empty()) {
    for (ArgDesc &Arg : Args) {
      setFunc(Arg, Arg.MIndex);
    }
  } else {
    // TODO this is not necessary as long as we can guarantee that the arguments
    // are already sorted (e.g. handle the sorting in handler if necessary due
    // to set_arg(...) usage).
    std::sort(Args.begin(), Args.end(), [](const ArgDesc &A, const ArgDesc &B) {
      return A.MIndex < B.MIndex;
    });
    int LastIndex = -1;
    size_t NextTrueIndex = 0;

    for (ArgDesc &Arg : Args) {
      // Handle potential gaps in set arguments (e.g. if some of them are set
      // on the user side).
      for (int Idx = LastIndex + 1; Idx < Arg.MIndex; ++Idx)
        if (!EliminatedArgMask[Idx])
          ++NextTrueIndex;
      LastIndex = Arg.MIndex;

      if (EliminatedArgMask[Arg.MIndex])
        continue;

      setFunc(Arg, NextTrueIndex);
      ++NextTrueIndex;
    }
  }
  applyFuncOnFilteredArgs(EliminatedArgMask, Args, setFunc);

  adjustNDRangePerKernel(NDRDesc, Kernel, *(Queue->getDeviceImplPtr()));

@@ -2380,15 +2454,16 @@ cl_int ExecCGCommand::enqueueImp() {
    Plugin.call<PiApiKind::piEventsWait>(RawEvents.size(), &RawEvents[0]);
  }
  std::vector<interop_handler::ReqToMem> ReqMemObjs;
  // Extract the Mem Objects for all Requirements, to ensure they are available if
  // a user ask for them inside the interop task scope
  const auto& HandlerReq = ExecInterop->MRequirements;
  std::for_each(std::begin(HandlerReq), std::end(HandlerReq), [&](Requirement* Req) {
    AllocaCommandBase *AllocaCmd = getAllocaForReq(Req);
    auto MemArg = reinterpret_cast<pi_mem>(AllocaCmd->getMemAllocation());
    interop_handler::ReqToMem ReqToMem = std::make_pair(Req, MemArg);
    ReqMemObjs.emplace_back(ReqToMem);
  });
  // Extract the Mem Objects for all Requirements, to ensure they are
  // available if a user asks for them inside the interop task scope
  const auto &HandlerReq = ExecInterop->MRequirements;
  std::for_each(
      std::begin(HandlerReq), std::end(HandlerReq), [&](Requirement *Req) {
        AllocaCommandBase *AllocaCmd = getAllocaForReq(Req);
        auto MemArg = reinterpret_cast<pi_mem>(AllocaCmd->getMemAllocation());
        interop_handler::ReqToMem ReqToMem = std::make_pair(Req, MemArg);
        ReqMemObjs.emplace_back(ReqToMem);
      });

  std::sort(std::begin(ReqMemObjs), std::end(ReqMemObjs));
  interop_handler InteropHandler(std::move(ReqMemObjs), MQueue);
27 changes: 26 additions & 1 deletion xpti/include/xpti/xpti_data_types.h
@@ -394,7 +394,7 @@ enum class trace_point_type_t : uint16_t {
  offload_alloc_destruct = XPTI_TRACE_POINT_BEGIN(22),
  /// Used to notify about releasing internal handle for offload buffer
  offload_alloc_release = XPTI_TRACE_POINT_BEGIN(23),
  /// Used to notify about creation accessor for ofload buffer
  /// Used to notify about creation of an accessor for an offload buffer
  offload_alloc_accessor = XPTI_TRACE_POINT_BEGIN(24),
  /// Indicates that the trace point is user defined and only the tool defined
  /// for a stream will be able to handle it
@@ -569,6 +569,31 @@ struct offload_buffer_association_data_t {
  /// A pointer to platform specific handler for the offload object
  uintptr_t mem_object_handle = 0;
};

/// Describes an enqueued kernel object
struct offload_kernel_enqueue_data_t {
  /// Global size
  size_t global_size[3] = {0, 0, 0};
  /// Local size
  size_t local_size[3] = {0, 0, 0};
  /// Offset
  size_t offset[3] = {0, 0, 0};
  /// Number of kernel arguments
  size_t args_num = 0;
};

/// Describes an enqueued kernel argument
struct offload_kernel_arg_data_t {
  /// Argument type as set in kernel_param_kind_t
  int type = -1;
  /// Pointer to the data
  void *pointer = nullptr;
  /// Size of the argument
  int size = 0;
  /// Index of the argument in the kernel
  int index = 0;
};

/// Describes memory allocation
struct mem_alloc_data_t {
  /// A platform-specific memory object handle. Some heterogeneous programming