[wip][sycl] Cuda tracing #5797

Closed
wants to merge 8 commits into from
4 changes: 2 additions & 2 deletions sycl/doc/EnvironmentVariables.md
@@ -142,13 +142,13 @@ variables in production code.</span>
| -------------------- | ------ | ----------- |
| `SYCL_PI_LEVEL_ZERO_SINGLE_THREAD_MODE` | Integer | A single-threaded app has an opportunity to enable this mode to avoid overhead from mutex locking in the Level Zero plugin. A value greater than 0 enables single thread mode. A value of 0 disables single thread mode. The default is 0. |
| `SYCL_PI_LEVEL_ZERO_MAX_COMMAND_LIST_CACHE` | Positive integer | Maximum number of oneAPI Level Zero command lists that can be allocated with no reuse before throwing an "out of resources" error. The default is 20000; the threshold may be increased based on resource availability and workload demand. |
| `SYCL_PI_LEVEL_ZERO_USM_ALLOCATOR` | [EnableBuffers][;[MaxPoolSize][;[host\|device\|shared:][MaxPoolableSize][,[Capacity][,SlabMinSize]]]...] | EnableBuffers enables pooling for SYCL buffers, default 0, set to 1 to enable. MaxPoolSize is the maximum size of the pool, default 0. MemType is host, device or shared. Other parameters are values specified as positive integers with optional K, M or G suffix. MaxPoolableSize is the maximum allocation size that may be pooled, default 0 for host and shared, 32KB for device. Capacity is the number of allocations in each size range freed by the program but retained in the pool for reallocation, default 0. Size ranges follow this pattern: 64, 96, 128, 192, and so on, i.e., powers of 2, with one range in between. SlabMinSize is the minimum allocation size, 64KB for host and device, 2MB for shared. Example: SYCL_PI_LEVEL_ZERO_USM_ALLOCATOR=1;32M;host:1M,4,64K;device:1M,4,64K;shared:0,0,2M|
| `SYCL_PI_LEVEL_ZERO_BATCH_SIZE` | Integer | Sets a preferred number of compute commands to batch into a command list before executing the command list. A value of 0 causes the batch size to be adjusted dynamically. A value greater than 0 specifies fixed size batching, with the batch size set to the specified value. The default is 0. |
| `SYCL_PI_LEVEL_ZERO_COPY_BATCH_SIZE` | Integer | Sets a preferred number of copy commands to batch into a command list before executing the command list. A value of 0 causes the batch size to be adjusted dynamically. A value greater than 0 specifies fixed size batching, with the batch size set to the specified value. The default is 0. |
| `SYCL_PI_LEVEL_ZERO_FILTER_EVENT_WAIT_LIST` | Integer | When set to 0, disables filtering of signaled events from wait lists when using the Level Zero backend. The default is 1. |
| `SYCL_PI_LEVEL_ZERO_USE_COPY_ENGINE` | Any(\*) | This environment variable enables users to control use of copy engines for copy operations. If the value is an integer, it will allow the use of copy engines, if available in the device, in Level Zero plugin to transfer SYCL buffer or image data between the host and/or device(s) and to fill SYCL buffer or image data in device or shared memory. The value of this environment variable can also be a pair of the form "lower_index:upper_index" where the indices point to copy engines in a list of all available copy engines. The default is 1. |
| `SYCL_PI_LEVEL_ZERO_USE_COPY_ENGINE_FOR_D2D_COPY` (experimental) | Integer | Allows the use of copy engine, if available in the device, in Level Zero plugin for device to device copy operations. The default is 0. This option is experimental and will be removed once heuristics are added to make a decision about use of copy engine for device to device copy operations. |
| `SYCL_PI_LEVEL_ZERO_DEVICE_SCOPE_EVENTS` | Any(\*) | Enable support of device-scope events whose state is not visible to the host. If enabled mode is SYCL_PI_LEVEL_ZERO_DEVICE_SCOPE_EVENTS=1 the Level Zero plugin would create all events having device-scope only and create proxy host-visible events for them when their status is needed (wait/query) on the host. If enabled mode is SYCL_PI_LEVEL_ZERO_DEVICE_SCOPE_EVENTS=2 the Level Zero plugin would create all events having device-scope and add proxy host-visible event at the end of each command-list submission. The default is 0, meaning all events are host-visible. |
| `SYCL_PI_LEVEL_ZERO_ENABLE_TRACING` | Any(\*) | Enables XPTI-based tracing of Level Zero API calls in the Level Zero plugin. Tracing is off unless this variable is set. |
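
As a rough illustration of the "Any(\*)" semantics of `SYCL_PI_LEVEL_ZERO_ENABLE_TRACING`, the sketch below treats a variable as enabled whenever it is present in the environment, whatever its value. This mirrors the `std::getenv(...) != nullptr` check that this PR adds to `piPluginInit`; the helper name `isEnvFlagSet` is hypothetical and not part of the plugin.

```cpp
#include <cstdlib>

// Hypothetical helper (not part of the plugin): returns true when the
// variable exists in the environment, regardless of its value.
bool isEnvFlagSet(const char *Name) { return std::getenv(Name) != nullptr; }
```

Under this reading, even `SYCL_PI_LEVEL_ZERO_ENABLE_TRACING=0` would turn tracing on, because only the presence of the variable is tested.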

## Debugging variables for CUDA Plugin

20 changes: 20 additions & 0 deletions sycl/doc/design/SYCLInstrumentationUsingXPTI.md
@@ -299,3 +299,23 @@ All trace point types in bold provide semantic information about the graph, node
| `mem_alloc_end` | <div style="text-align: left"><li>**trace_type**: `xpti::trace_point_type_t::mem_alloc_end` that marks the end of memory allocation process</li> <li> **parent**: Event ID created for all functions in the `oneapi.level_zero.experimental.mem_alloc` layer.</li> <li> **event**: `nullptr` - since the stream of data just captures functions being called.</li> <li> **instance**: Unique ID to allow the correlation of the `mem_alloc_begin` event with the `mem_alloc_end` event. This value is guaranteed to be the same value received by the trace event for the corresponding `mem_alloc_begin`.</li> <li> **user_data**: A pointer to `mem_alloc_data_t` object, that includes memory object ID (if any), allocated pointer, allocation size, and guard zone size (if any). </li></div> | None |
| `mem_release_begin` | <div style="text-align: left"><li>**trace_type**: `xpti::trace_point_type_t::mem_release_begin` that marks the beginning of the memory release process</li> <li> **parent**: Event ID created for all functions in the `oneapi.level_zero.experimental.mem_alloc` layer.</li> <li> **event**: `nullptr` - since the stream of data just captures functions being called.</li> <li> **instance**: Unique ID to allow the correlation of the `mem_release_begin` event with the `mem_release_end` event. </li> <li> **user_data**: A pointer to `mem_alloc_data_t` object, that includes memory object ID (if any) and released pointer. </li></div> | None |
| `mem_release_end` | <div style="text-align: left"><li>**trace_type**: `xpti::trace_point_type_t::mem_release_end` that marks the end of the memory release process</li> <li> **parent**: Event ID created for all functions in the `oneapi.level_zero.experimental.mem_alloc` layer.</li> <li> **event**: `nullptr` - since the stream of data just captures functions being called.</li> <li> **instance**: Unique ID to allow the correlation of the `mem_release_begin` event with the `mem_release_end` event. This value is guaranteed to be the same value received by the trace event for the corresponding `mem_release_begin`.</li> <li> **user_data**: A pointer to `mem_alloc_data_t` object, that includes memory object ID (if any) and released pointer. </li></div> | None |

## SYCL Stream `"sycl.experimental.level_zero.call"` Notification Signatures

This stream reports the Level Zero API calls made by a SYCL
application.

| Trace Point Type | Parameter Description | Metadata |
| :--------------: | :-------------------- | :------- |
| `function_begin` | <div style="text-align: left"><li>**trace_type**: `xpti::trace_point_type_t::function_begin` that marks the beginning of a function</li> <li> **parent**: Event ID created for all functions in the `sycl.pi` layer.</li> <li> **event**: `nullptr` - since the stream of data just captures functions being called.</li> <li> **instance**: Unique ID to allow the correlation of the `function_begin` event with the `function_end` event. </li> <li> **user_data**: Name of the function being called sent in as `const char *` </li></div> | None |
| `function_end` | <div style="text-align: left"><li>**trace_type**: `xpti::trace_point_type_t::function_end` that marks the end of a function</li> <li> **parent**: Event ID created for all functions in the `sycl.pi` layer.</li> <li> **event**: `nullptr` - since the stream of data just captures functions being called.</li> <li> **instance**: Unique ID to allow the correlation of the `function_begin` event with the `function_end` event. This value is guaranteed to be the same value received by the trace event for the corresponding `function_begin`. </li> <li> **user_data**: Name of the function being called sent in as `const char *` </li></div> | None |
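
The begin/end pairing described in the table can be sketched from a subscriber's point of view. The example below is a minimal, self-contained illustration and does not use the real XPTI headers: `TracePoint` and `CallCorrelator` are hypothetical names, and the callback arguments are reduced to the trace point type, the **instance** ID, and the function name carried in **user_data**.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>
#include <unordered_map>
#include <utility>

// Hypothetical stand-ins for the two trace point types of the call stream.
enum class TracePoint { FunctionBegin, FunctionEnd };

// Pairs function_begin and function_end notifications by instance ID.
class CallCorrelator {
public:
  // Returns true and fills *Completed with the function name once the end
  // notification matching an earlier begin arrives.
  bool onNotify(TracePoint Type, uint64_t Instance, const char *UserData,
                std::string *Completed) {
    if (Type == TracePoint::FunctionBegin) {
      InFlight[Instance] = UserData; // user_data carries the function name
      return false;
    }
    auto It = InFlight.find(Instance);
    if (It == InFlight.end())
      return false; // end without a recorded begin: ignored in this sketch
    *Completed = std::move(It->second);
    InFlight.erase(It);
    return true;
  }

  // Number of calls that have begun but not yet ended.
  std::size_t pending() const { return InFlight.size(); }

private:
  std::unordered_map<uint64_t, std::string> InFlight;
};
```

A real subscriber would call something like `onNotify` from its registered XPTI callback; keying on the instance ID rather than the function name keeps overlapping calls to the same function distinct.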

## SYCL Stream `"sycl.experimental.level_zero.debug"` Notification Signatures

This stream reports the Level Zero API calls made by a SYCL application,
together with their function arguments.

| Trace Point Type | Parameter Description | Metadata |
| :------------------------: | :-------------------- | :------- |
| `function_with_args_begin` | <div style="text-align: left"><li>**trace_type**: `xpti::trace_point_type_t::function_with_args_begin` that marks the beginning of a function</li> <li> **parent**: Event ID created for all functions in the `sycl.pi.debug` layer.</li> <li> **event**: `nullptr` - since the stream of data just captures functions being called.</li> <li> **instance**: Unique ID to allow the correlation of the `function_with_args_begin` event with the `function_with_args_end` event. </li> <li> **user_data**: A pointer to `function_with_args_t` object, that includes function ID, name, and arguments. </li></div> | None |
| `function_with_args_end` | <div style="text-align: left"><li>**trace_type**: `xpti::trace_point_type_t::function_with_args_end` that marks the end of a function</li> <li> **parent**: Event ID created for all functions in the `sycl.pi.debug` layer.</li> <li> **event**: `nullptr` - since the stream of data just captures functions being called.</li> <li> **instance**: Unique ID to allow the correlation of the `function_with_args_begin` event with the `function_with_args_end` event. This value is guaranteed to be the same value received by the trace event for the corresponding `function_with_args_begin`. </li> <li> **user_data**: A pointer to `function_with_args_t` object, that includes function ID, name, arguments, and return value. </li></div> | None |
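
To make the **user_data** payload of the debug stream more concrete, here is a small sketch of how a subscriber might summarize one notification. `FunctionWithArgs` is a hypothetical mirror of the fields listed in the table (function ID, name, arguments, return value); the actual layout of `xpti::function_with_args_t` may differ, and treating the return value as a 32-bit status code is an assumption made only for this example.

```cpp
#include <cstdint>
#include <sstream>
#include <string>

// Hypothetical mirror of the payload described above; not the real XPTI type.
struct FunctionWithArgs {
  uint32_t FunctionId;
  const char *FunctionName;
  void *ArgsData; // opaque, API-specific argument block (unused here)
  void *RetData;  // only read for the *_end notification
};

// Builds a one-line summary of a begin or end notification.
// Assumption: RetData, when present, points to a 32-bit status code.
std::string describe(const FunctionWithArgs &P, bool IsEnd) {
  std::ostringstream OS;
  OS << (IsEnd ? "end " : "begin ") << P.FunctionName << " (id "
     << P.FunctionId << ")";
  if (IsEnd && P.RetData)
    OS << " -> " << *static_cast<const int32_t *>(P.RetData);
  return OS.str();
}
```

The return value is read only on the end notification, since the table lists it only for `function_with_args_end`.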
17 changes: 17 additions & 0 deletions sycl/plugins/cuda/CMakeLists.txt
@@ -24,26 +24,43 @@ else()
)
endif()

if (SYCL_ENABLE_XPTI_TRACING)
  set(XPTI_PROXY_SRC "${CMAKE_SOURCE_DIR}/../xpti/src/xpti_proxy.cpp")
endif()

add_library(pi_cuda SHARED
  "${sycl_inc_dir}/CL/sycl/detail/pi.h"
  "${sycl_inc_dir}/CL/sycl/detail/pi.hpp"
  "pi_cuda.hpp"
  "pi_cuda.cpp"
  "tracing.cpp"
  ${XPTI_PROXY_SRC}
)

if (SYCL_ENABLE_XPTI_TRACING)
  target_compile_definitions(pi_cuda PRIVATE
    XPTI_ENABLE_INSTRUMENTATION
    XPTI_STATIC_LIBRARY
  )
  target_include_directories(pi_cuda PRIVATE "${CMAKE_SOURCE_DIR}/../xpti/include")
  target_link_libraries(pi_cuda PRIVATE ${CMAKE_DL_LIBS})
endif()

add_dependencies(sycl-toolchain pi_cuda)

set_target_properties(pi_cuda PROPERTIES LINKER_LANGUAGE CXX)

target_include_directories(pi_cuda
  PRIVATE
    ${sycl_inc_dir}
    "${CUDA_TOOLKIT_ROOT_DIR}/extras/CUPTI/include"
)

target_link_libraries(pi_cuda
  PRIVATE
    OpenCL-Headers
    cudadrv
    ${CUDA_cupti_LIBRARY}
)

if (MSVC)
5 changes: 5 additions & 0 deletions sycl/plugins/cuda/pi_cuda.cpp
@@ -25,6 +25,9 @@
#include <mutex>
#include <regex>

// Forward declarations
void enableCUDATracing();

namespace {
std::string getCudaVersionString() {
int driver_version = 0;
@@ -4957,6 +4960,8 @@ pi_result piPluginInit(pi_plugin *PluginInit) {
  std::memset(&(PluginInit->PiFunctionTable), 0,
              sizeof(PluginInit->PiFunctionTable));

  enableCUDATracing();

  // Forward calls to CUDA RT.
#define _PI_CL(pi_api, cuda_api)                                               \
  (PluginInit->PiFunctionTable).pi_api = (decltype(&::pi_api))(&cuda_api);
99 changes: 99 additions & 0 deletions sycl/plugins/cuda/tracing.cpp
@@ -0,0 +1,99 @@
//===-------------- tracing.cpp - CUDA Host API Tracing --------------------==//
//
// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.
// See https://llvm.org/LICENSE.txt for license information.
// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
//
//===----------------------------------------------------------------------===//

#ifdef XPTI_ENABLE_INSTRUMENTATION
#include <xpti/xpti_data_types.h>
#include <xpti/xpti_trace_framework.h>
#endif

#include <cuda.h>
#include <cupti.h>

#include <cstdint> // uint64_t is used even when XPTI is disabled
#include <exception>
#include <iostream>

constexpr auto CUDA_CALL_STREAM_NAME = "sycl.experimental.cuda.call";
constexpr auto CUDA_DEBUG_STREAM_NAME = "sycl.experimental.cuda.debug";

thread_local uint64_t CallCorrelationID = 0;
thread_local uint64_t DebugCorrelationID = 0;

#ifdef XPTI_ENABLE_INSTRUMENTATION
static xpti_td *GCallEvent = nullptr;
static xpti_td *GDebugEvent = nullptr;
#endif // XPTI_ENABLE_INSTRUMENTATION

constexpr auto GVerStr = "0.1";
constexpr int GMajVer = 0;
constexpr int GMinVer = 1;

#ifdef XPTI_ENABLE_INSTRUMENTATION
// CUPTI invokes this callback on entry to and exit from every enabled CUDA
// driver API call. Each invocation is forwarded to both XPTI streams.
static void cuptiCallback(void * /*userdata*/, CUpti_CallbackDomain,
                          CUpti_CallbackId CBID, const void *CBData) {
  if (xptiTraceEnabled()) {
    const auto *CBInfo = static_cast<const CUpti_CallbackData *>(CBData);

    // Generate fresh IDs on entry. The matching exit callback on the same
    // thread reuses them so subscribers can pair begin and end events.
    if (CBInfo->callbackSite == CUPTI_API_ENTER) {
      CallCorrelationID = xptiGetUniqueId();
      DebugCorrelationID = xptiGetUniqueId();
    }

    const char *FuncName = CBInfo->functionName;
    uint32_t FuncID = static_cast<uint32_t>(CBID);
    uint16_t TraceTypeArgs = CBInfo->callbackSite == CUPTI_API_ENTER
                                 ? xpti::trace_function_with_args_begin
                                 : xpti::trace_function_with_args_end;
    uint16_t TraceType = CBInfo->callbackSite == CUPTI_API_ENTER
                             ? xpti::trace_function_begin
                             : xpti::trace_function_end;

    // Both streams were registered in enableCUDATracing(); these calls only
    // look up their IDs.
    uint8_t CallStreamID = xptiRegisterStream(CUDA_CALL_STREAM_NAME);
    uint8_t DebugStreamID = xptiRegisterStream(CUDA_DEBUG_STREAM_NAME);

    // Call stream: function name only.
    xptiNotifySubscribers(CallStreamID, TraceType, GCallEvent, nullptr,
                          CallCorrelationID, FuncName);

    // Debug stream: function ID, name, arguments, return value and context.
    xpti::function_with_args_t Payload{
        FuncID, FuncName, const_cast<void *>(CBInfo->functionParams),
        CBInfo->functionReturnValue, CBInfo->context};
    xptiNotifySubscribers(DebugStreamID, TraceTypeArgs, GDebugEvent, nullptr,
                          DebugCorrelationID, &Payload);
  }
}
#endif

void enableCUDATracing() {
#ifdef XPTI_ENABLE_INSTRUMENTATION
  if (!xptiTraceEnabled())
    return;

  // Register and initialize both streams before any callback can fire.
  xptiRegisterStream(CUDA_CALL_STREAM_NAME);
  xptiInitialize(CUDA_CALL_STREAM_NAME, GMajVer, GMinVer, GVerStr);
  xptiRegisterStream(CUDA_DEBUG_STREAM_NAME);
  xptiInitialize(CUDA_DEBUG_STREAM_NAME, GMajVer, GMinVer, GVerStr);

  // Create the parent events passed with every notification. The instance
  // number written to Dummy is not used.
  uint64_t Dummy;
  xpti::payload_t CUDAPayload("CUDA Plugin Layer");
  GCallEvent =
      xptiMakeEvent("CUDA Plugin Layer", &CUDAPayload,
                    xpti::trace_algorithm_event, xpti_at::active, &Dummy);

  xpti::payload_t CUDADebugPayload("CUDA Plugin Debug Layer");
  GDebugEvent =
      xptiMakeEvent("CUDA Plugin Debug Layer", &CUDADebugPayload,
                    xpti::trace_algorithm_event, xpti_at::active, &Dummy);

  // Subscribe to all driver API callbacks; give up quietly if CUPTI refuses.
  CUpti_SubscriberHandle Subscriber;
  if (cuptiSubscribe(&Subscriber, cuptiCallback, nullptr) != CUPTI_SUCCESS)
    return;
  cuptiEnableDomain(1, Subscriber, CUPTI_CB_DOMAIN_DRIVER_API);
  // Exclude the error-string helpers from tracing.
  cuptiEnableCallback(0, Subscriber, CUPTI_CB_DOMAIN_DRIVER_API,
                      CUPTI_DRIVER_TRACE_CBID_cuGetErrorString);
  cuptiEnableCallback(0, Subscriber, CUPTI_CB_DOMAIN_DRIVER_API,
                      CUPTI_DRIVER_TRACE_CBID_cuGetErrorName);
#endif
}
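
The entry/exit pairing used by `cuptiCallback` can be seen in isolation in the sketch below, which needs neither CUPTI nor XPTI. Everything in it is a stand-in: `Site` replaces `CUPTI_API_ENTER`/`CUPTI_API_EXIT`, `nextUniqueId` plays the role of `xptiGetUniqueId()`, and the callback appends to a log instead of notifying subscribers. It also keeps a single ID, whereas the file above keeps one per stream.

```cpp
#include <atomic>
#include <cstdint>
#include <utility>
#include <vector>

// Stand-in for the CUPTI callback site (entry to or exit from an API call).
enum class Site { Enter, Exit };

// Stand-in for xptiGetUniqueId(): a process-wide increasing counter.
inline uint64_t nextUniqueId() {
  static std::atomic<uint64_t> Counter{0};
  return ++Counter;
}

// One ID per thread, refreshed on entry and reused on the matching exit.
thread_local uint64_t CorrelationID = 0;

// Records (site, id) pairs instead of notifying real subscribers.
inline void onApiCallback(Site S,
                          std::vector<std::pair<Site, uint64_t>> &Log) {
  if (S == Site::Enter)
    CorrelationID = nextUniqueId();
  Log.emplace_back(S, CorrelationID);
}
```

Because the ID lives in thread-local storage, an exit callback sees the value set by the most recent entry on the same thread, which is what lets a subscriber match the two notifications.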
15 changes: 15 additions & 0 deletions sycl/plugins/level_zero/CMakeLists.txt
@@ -102,14 +102,29 @@ target_include_directories(LevelZeroLoader-Headers

include_directories("${sycl_inc_dir}")

if (SYCL_ENABLE_XPTI_TRACING)
  set(XPTI_PROXY_SRC "${CMAKE_SOURCE_DIR}/../xpti/src/xpti_proxy.cpp")
endif()

add_library(pi_level_zero SHARED
  "${sycl_inc_dir}/CL/sycl/detail/pi.h"
  "${CMAKE_CURRENT_SOURCE_DIR}/pi_level_zero.cpp"
  "${CMAKE_CURRENT_SOURCE_DIR}/pi_level_zero.hpp"
  "${CMAKE_CURRENT_SOURCE_DIR}/usm_allocator.cpp"
  "${CMAKE_CURRENT_SOURCE_DIR}/usm_allocator.hpp"
  "${CMAKE_CURRENT_SOURCE_DIR}/tracing.cpp"
  ${XPTI_PROXY_SRC}
)

if (SYCL_ENABLE_XPTI_TRACING)
  target_compile_definitions(pi_level_zero PRIVATE
    XPTI_ENABLE_INSTRUMENTATION
    XPTI_STATIC_LIBRARY
  )
  target_include_directories(pi_level_zero PRIVATE "${CMAKE_SOURCE_DIR}/../xpti/include")
  target_link_libraries(pi_level_zero PRIVATE ${CMAKE_DL_LIBS})
endif()

if (MSVC)
# by defining __SYCL_BUILD_SYCL_DLL, we can use __declspec(dllexport)
# which are individually tagged for all pi* symbols in pi.h
6 changes: 6 additions & 0 deletions sycl/plugins/level_zero/pi_level_zero.cpp
@@ -36,6 +36,8 @@ static pi_result EventCreate(pi_context Context, pi_queue Queue,
bool HostVisible, pi_event *RetEvent);
}

void enableL0Tracing();

namespace {

// Controls Level Zero calls serialization to w/a Level Zero driver being not MT
@@ -7607,6 +7609,10 @@ pi_result piPluginInit(pi_plugin *PluginInit) {
  (PluginInit->PiFunctionTable).api = (decltype(&::api))(&api);
#include <CL/sycl/detail/pi.def>

  if (std::getenv("SYCL_PI_LEVEL_ZERO_ENABLE_TRACING") != nullptr) {
    enableL0Tracing();
  }

  return PI_SUCCESS;
}
