Skip to content

Commit b04f894

Browse files
mfrancepilloisEwanCsteffenlarsenreble
authored
[SYCL][Graph] Add support for enabling Command-Buffer profiling (#11324)
Support for graph execution profiling - Modifies the urEventGetProfilingInfo function to get the CommandBuffer start time from sync point timestamps. - Approximates the submit time if the difference between the estimated device clock (from which the submit time is taken) and the actual device clock is greater than the elapsed time between the command-buffer submission and its start (involving an event sequence issue, i.e. submit time can be after start time) - Adds a profiling e2e test. - Updates the design doc --------- Co-authored-by: Ewan Crawford <[email protected]> Co-authored-by: Steffen Larsen <[email protected]> Co-authored-by: Pablo Reble <[email protected]>
1 parent 36e123d commit b04f894

File tree

8 files changed

+234
-89
lines changed

8 files changed

+234
-89
lines changed

sycl/doc/design/CommandGraph.md

Lines changed: 16 additions & 5 deletions
Original file line numberDiff line numberDiff line change
@@ -241,11 +241,22 @@ created on UR command-buffer enqueue.
241241

242242
There is also a *WaitEvent* used by the `ur_exp_command_buffer_handle_t` class
243243
in the prefix to wait on any dependencies passed in the enqueue wait-list.
244-
This WaitEvent is reset at the end of the suffix, along with reset commands
245-
to reset the L0 events used to implement the UR sync-points back to the
246-
non-signaled state.
247-
248-
![L0 command-buffer diagram](images/L0_UR_command-buffer.svg)
244+
This WaitEvent is reset in the suffix.
245+
246+
A command-buffer is expected to be submitted multiple times. Consequently,
247+
we need to ensure that L0 events associated with graph commands have not
248+
been signaled by a previous execution. These events are therefore reset to the
249+
non-signaled state before running the actual graph associated commands. Note
250+
that this reset is performed in the prefix and not in the suffix to avoid
251+
additional synchronization w.r.t profiling data extraction.
252+
253+
If a command-buffer is about to be submitted to a queue with the profiling
254+
property enabled, an extra command that copies timestamps of L0 events
255+
associated with graph commands into a dedicated memory which is attached to the
256+
returned UR event. This memory stores the profiling information that
257+
corresponds to the current submission of the command-buffer.
258+
259+
![L0 command-buffer diagram](images/L0_UR_command-buffer-v3.jpg)
249260

250261
For a call to `urCommandBufferEnqueueExp` with an `event_list` *EL*,
251262
command-buffer *CB*, and return event *RE* our implementation has to submit two
Loading

sycl/doc/design/images/L0_UR_command-buffer.svg

Lines changed: 0 additions & 1 deletion
This file was deleted.

sycl/source/detail/event_impl.cpp

Lines changed: 21 additions & 5 deletions
Original file line numberDiff line numberDiff line change
@@ -278,17 +278,33 @@ void event_impl::checkProfilingPreconditions() const {
278278
"Profiling information is unavailable as the queue associated with "
279279
"the event does not have the 'enable_profiling' property.");
280280
}
281-
if (MEventFromSubmitedExecCommandBuffer) {
282-
throw sycl::exception(make_error_code(sycl::errc::invalid),
283-
"Profiling information is unavailable for events "
284-
"returned by a graph submission.");
285-
}
286281
}
287282

288283
template <>
289284
uint64_t
290285
event_impl::get_profiling_info<info::event_profiling::command_submit>() {
291286
checkProfilingPreconditions();
287+
// The delay between the submission and the actual start of a CommandBuffer
288+
// can be short. Consequently, the submission time, which is based on
289+
// an estimated clock and not on the real device clock, may be ahead of the
290+
// start time, which is based on the actual device clock.
291+
// MSubmitTime is set in a critical performance path.
292+
// Force reading the device clock when setting MSubmitTime may deteriorate
293+
// the performance.
294+
// Since submit time is an estimated time, we implement this little hack
295+
// that allows all profiled time to be meaningful.
296+
// (Note that the observed time deviation between the estimated clock and
297+
// the real device clock is typically less than 0.5ms. The approximation we
298+
// made by forcing the re-sync of submit time to start time is less than
299+
// 0.5ms. These timing values were obtained empirically using an integrated
300+
// Intel GPU).
301+
if (MEventFromSubmittedExecCommandBuffer && !MHostEvent && MEvent) {
302+
uint64_t StartTime =
303+
get_event_profiling_info<info::event_profiling::command_start>(
304+
this->getHandleRef(), this->getPlugin());
305+
if (StartTime < MSubmitTime)
306+
MSubmitTime = StartTime;
307+
}
292308
return MSubmitTime;
293309
}
294310

sycl/source/detail/event_impl.hpp

Lines changed: 6 additions & 6 deletions
Original file line numberDiff line numberDiff line change
@@ -282,12 +282,12 @@ class event_impl {
282282
return MGraph.lock();
283283
}
284284

285-
void setEventFromSubmitedExecCommandBuffer(bool value) {
286-
MEventFromSubmitedExecCommandBuffer = value;
285+
void setEventFromSubmittedExecCommandBuffer(bool value) {
286+
MEventFromSubmittedExecCommandBuffer = value;
287287
}
288288

289-
bool isEventFromSubmitedExecCommandBuffer() const {
290-
return MEventFromSubmitedExecCommandBuffer;
289+
bool isEventFromSubmittedExecCommandBuffer() const {
290+
return MEventFromSubmittedExecCommandBuffer;
291291
}
292292

293293
protected:
@@ -340,8 +340,8 @@ class event_impl {
340340
/// Store the command graph associated with this event, if any.
341341
/// This event is also be stored in the graph so a weak_ptr is used.
342342
std::weak_ptr<ext::oneapi::experimental::detail::graph_impl> MGraph;
343-
/// Indicates that the event results from a command graph submission
344-
bool MEventFromSubmitedExecCommandBuffer = false;
343+
/// Indicates that the event results from a command graph submission.
344+
bool MEventFromSubmittedExecCommandBuffer = false;
345345

346346
// If this event represents a submission to a
347347
// sycl::detail::pi::PiExtCommandBuffer the sync point for that submission is

sycl/source/detail/graph_impl.cpp

Lines changed: 1 addition & 2 deletions
Original file line numberDiff line numberDiff line change
@@ -757,7 +757,6 @@ exec_graph_impl::enqueue(const std::shared_ptr<sycl::detail::queue_impl> &Queue,
757757
auto NewEvent = std::make_shared<sycl::detail::event_impl>(Queue);
758758
NewEvent->setContextImpl(Queue->getContextImplPtr());
759759
NewEvent->setStateIncomplete();
760-
NewEvent->setEventFromSubmitedExecCommandBuffer(true);
761760
return NewEvent;
762761
});
763762

@@ -840,7 +839,7 @@ exec_graph_impl::enqueue(const std::shared_ptr<sycl::detail::queue_impl> &Queue,
840839
NewEvent = sycl::detail::Scheduler::getInstance().addCG(
841840
std::move(CommandGroup), Queue);
842841
}
843-
842+
NewEvent->setEventFromSubmittedExecCommandBuffer(true);
844843
} else if ((CurrentPartition->MSchedule.size() > 0) &&
845844
(CurrentPartition->MSchedule.front()->MCGType ==
846845
sycl::detail::CG::CGTYPE::CodeplayHostTask)) {
Lines changed: 190 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,190 @@
1+
// REQUIRES: level_zero || cuda, gpu
2+
// RUN: %{build} -o %t.out
3+
// RUN: %{run} %t.out 2>&1
4+
// RUN: %if ext_oneapi_level_zero %{env UR_L0_LEAKS_DEBUG=1 %{run} %t.out 2>&1 | FileCheck --implicit-check-not=LEAK %s %}
5+
6+
// This test checks the profiling of an event returned
7+
// from graph submission with event::get_profiling_info().
8+
// It first tests a graph made exclusively of memory operations,
9+
// then tests a graph made of kernels.
10+
// The second run is to check that there are no leaks reported with the embedded
11+
// UR_L0_LEAKS_DEBUG testing capability.
12+
13+
#include "graph_common.hpp"
14+
15+
#define GRAPH_TESTS_VERBOSE_PRINT 0
16+
17+
#if GRAPH_TESTS_VERBOSE_PRINT
18+
#include <chrono>
19+
#endif
20+
21+
bool verifyProfiling(event Event) {
22+
auto Submit =
23+
Event.get_profiling_info<sycl::info::event_profiling::command_submit>();
24+
auto Start =
25+
Event.get_profiling_info<sycl::info::event_profiling::command_start>();
26+
auto End =
27+
Event.get_profiling_info<sycl::info::event_profiling::command_end>();
28+
29+
#if GRAPH_TESTS_VERBOSE_PRINT
30+
std::cout << "Submit = " << Submit << std::endl;
31+
std::cout << "Start = " << Start << std::endl;
32+
std::cout << "End = " << End << " ( " << (End - Start) << " ) "
33+
<< " => full ( " << (End - Submit) << " ) " << std::endl;
34+
#endif
35+
36+
assert((Submit && Start && End) && "Profiling information failed.");
37+
assert(Submit <= Start);
38+
assert(Start < End);
39+
40+
bool Pass = sycl::info::event_command_status::complete ==
41+
Event.get_info<sycl::info::event::command_execution_status>();
42+
43+
return Pass;
44+
}
45+
46+
bool compareProfiling(event Event1, event Event2) {
47+
assert(Event1 != Event2);
48+
49+
auto SubmitEvent1 =
50+
Event1.get_profiling_info<sycl::info::event_profiling::command_submit>();
51+
auto StartEvent1 =
52+
Event1.get_profiling_info<sycl::info::event_profiling::command_start>();
53+
auto EndEvent1 =
54+
Event1.get_profiling_info<sycl::info::event_profiling::command_end>();
55+
assert((SubmitEvent1 && StartEvent1 && EndEvent1) &&
56+
"Profiling information failed.");
57+
58+
auto SubmitEvent2 =
59+
Event2.get_profiling_info<sycl::info::event_profiling::command_submit>();
60+
auto StartEvent2 =
61+
Event2.get_profiling_info<sycl::info::event_profiling::command_start>();
62+
auto EndEvent2 =
63+
Event2.get_profiling_info<sycl::info::event_profiling::command_end>();
64+
assert((SubmitEvent2 && StartEvent2 && EndEvent2) &&
65+
"Profiling information failed.");
66+
67+
assert(SubmitEvent1 != SubmitEvent2);
68+
assert(StartEvent1 != StartEvent2);
69+
assert(EndEvent1 != EndEvent2);
70+
71+
bool Pass1 = sycl::info::event_command_status::complete ==
72+
Event1.get_info<sycl::info::event::command_execution_status>();
73+
bool Pass2 = sycl::info::event_command_status::complete ==
74+
Event2.get_info<sycl::info::event::command_execution_status>();
75+
76+
return (Pass1 && Pass2);
77+
}
78+
79+
// The test checks that get_profiling_info waits for command asccociated with
80+
// event to complete execution.
81+
int main() {
82+
device Dev;
83+
queue Queue{Dev,
84+
{sycl::ext::intel::property::queue::no_immediate_command_list{},
85+
sycl::property::queue::enable_profiling()}};
86+
87+
const size_t Size = 100000;
88+
int Data[Size] = {0};
89+
for (size_t I = 0; I < Size; ++I) {
90+
Data[I] = I;
91+
}
92+
int Values[Size] = {0};
93+
94+
buffer<int, 1> BufferFrom(Data, range<1>(Size));
95+
buffer<int, 1> BufferTo(Values, range<1>(Size));
96+
97+
buffer<int, 1> BufferA(Data, range<1>(Size));
98+
buffer<int, 1> BufferB(Values, range<1>(Size));
99+
buffer<int, 1> BufferC(Values, range<1>(Size));
100+
101+
BufferFrom.set_write_back(false);
102+
BufferTo.set_write_back(false);
103+
BufferA.set_write_back(false);
104+
BufferB.set_write_back(false);
105+
BufferC.set_write_back(false);
106+
{ // buffer copy
107+
exp_ext::command_graph CopyGraph{
108+
Queue.get_context(),
109+
Queue.get_device(),
110+
{exp_ext::property::graph::assume_buffer_outlives_graph{}}};
111+
CopyGraph.begin_recording(Queue);
112+
113+
Queue.submit([&](sycl::handler &Cgh) {
114+
accessor<int, 1, access::mode::read, access::target::device> AccessorFrom(
115+
BufferFrom, Cgh, range<1>(Size));
116+
accessor<int, 1, access::mode::write, access::target::device> AccessorTo(
117+
BufferTo, Cgh, range<1>(Size));
118+
Cgh.copy(AccessorFrom, AccessorTo);
119+
});
120+
121+
CopyGraph.end_recording(Queue);
122+
123+
// kernel launch
124+
exp_ext::command_graph KernelGraph{
125+
Queue.get_context(),
126+
Queue.get_device(),
127+
{exp_ext::property::graph::assume_buffer_outlives_graph{}}};
128+
KernelGraph.begin_recording(Queue);
129+
130+
run_kernels(Queue, Size, BufferA, BufferB, BufferC);
131+
132+
KernelGraph.end_recording(Queue);
133+
134+
auto CopyGraphExec = CopyGraph.finalize();
135+
auto KernelGraphExec = KernelGraph.finalize();
136+
137+
event CopyEvent, KernelEvent1, KernelEvent2;
138+
// Run graphs
139+
#if GRAPH_TESTS_VERBOSE_PRINT
140+
auto StartCopyGraph = std::chrono::high_resolution_clock::now();
141+
#endif
142+
CopyEvent = Queue.submit(
143+
[&](handler &CGH) { CGH.ext_oneapi_graph(CopyGraphExec); });
144+
Queue.wait_and_throw();
145+
#if GRAPH_TESTS_VERBOSE_PRINT
146+
auto EndCopyGraph = std::chrono::high_resolution_clock::now();
147+
auto StartKernelSubmit1 = std::chrono::high_resolution_clock::now();
148+
#endif
149+
KernelEvent1 = Queue.submit(
150+
[&](handler &CGH) { CGH.ext_oneapi_graph(KernelGraphExec); });
151+
Queue.wait_and_throw();
152+
#if GRAPH_TESTS_VERBOSE_PRINT
153+
auto endKernelSubmit1 = std::chrono::high_resolution_clock::now();
154+
auto StartKernelSubmit2 = std::chrono::high_resolution_clock::now();
155+
#endif
156+
KernelEvent2 = Queue.submit(
157+
[&](handler &CGH) { CGH.ext_oneapi_graph(KernelGraphExec); });
158+
Queue.wait_and_throw();
159+
#if GRAPH_TESTS_VERBOSE_PRINT
160+
auto endKernelSubmit2 = std::chrono::high_resolution_clock::now();
161+
162+
double DelayCopy = std::chrono::duration_cast<std::chrono::nanoseconds>(
163+
EndCopyGraph - StartCopyGraph)
164+
.count();
165+
std::cout << "Copy Graph delay (in ns) : " << DelayCopy << std::endl;
166+
double DelayKernel1 = std::chrono::duration_cast<std::chrono::nanoseconds>(
167+
endKernelSubmit1 - StartKernelSubmit1)
168+
.count();
169+
std::cout << "Kernel 1st Execution delay (in ns) : " << DelayKernel1
170+
<< std::endl;
171+
double DelayKernel2 = std::chrono::duration_cast<std::chrono::nanoseconds>(
172+
endKernelSubmit2 - StartKernelSubmit2)
173+
.count();
174+
std::cout << "Kernel 2nd Execution delay (in ns) : " << DelayKernel2
175+
<< std::endl;
176+
#endif
177+
178+
// Checks profiling times
179+
assert(verifyProfiling(CopyEvent) && verifyProfiling(KernelEvent1) &&
180+
verifyProfiling(KernelEvent2) &&
181+
compareProfiling(KernelEvent1, KernelEvent2));
182+
}
183+
184+
host_accessor HostData(BufferTo);
185+
for (size_t I = 0; I < Size; ++I) {
186+
assert(HostData[I] == Values[I]);
187+
}
188+
189+
return 0;
190+
}

0 commit comments

Comments
 (0)