Commit 5b19f42
[OpenMP][AMDGPU] Single eager resource init + HSA queue utilization tracking
This patch lazily initializes queues/streams/events since their initialization
might come at a cost even if we do not use them. To further benefit from this,
AMDGPU/HSA queue management is moved into the AMDGPUStreamManager of an
AMDGPUDevice.

Streams may now use different HSA queues during their lifetime and identify
busy queues. When a stream is requested from the resource manager, it will
search for and try to assign an idle queue. During the search for an idle
queue the manager may initialize more queues, up to the set maximum
(default: 4). When no idle queue can be found, it resorts to round-robin
selection.

With contributions from Johannes Doerfert <[email protected]>

Depends on D156245

Reviewed By: kevinsala

Differential Revision: https://reviews.llvm.org/D154523
1 parent 5388149 commit 5b19f42

File tree

5 files changed: +147 −45 lines changed

openmp/docs/design/Runtimes.rst

Lines changed: 3 additions & 2 deletions
@@ -1193,15 +1193,16 @@ throughout the execution if needed. A stream is a queue of asynchronous
 operations (e.g., kernel launches and memory copies) that are executed
 sequentially. Parallelism is achieved by featuring multiple streams. The
 ``libomptarget`` leverages streams to exploit parallelism between plugin
-operations. The default value is ``32``.
+operations. The default value is ``1``, more streams are created as needed.
 
 LIBOMPTARGET_NUM_INITIAL_EVENTS
 """""""""""""""""""""""""""""""
 
 This environment variable sets the number of pre-created events in the
 plugin (if supported) at initialization. More events will be created
 dynamically throughout the execution if needed. An event is used to synchronize
-a stream with another efficiently. The default value is ``32``.
+a stream with another efficiently. The default value is ``1``, more events are
+created as needed.
 
 LIBOMPTARGET_LOCK_MAPPED_HOST_BUFFERS
 """""""""""""""""""""""""""""""""""""

openmp/libomptarget/include/Utilities.h

Lines changed: 6 additions & 0 deletions
@@ -83,6 +83,12 @@ template <typename Ty> class Envar {
     }
   }
 
+  Envar<Ty> &operator=(const Ty &V) {
+    Data = V;
+    Initialized = true;
+    return *this;
+  }
+
   /// Get the definitive value.
   const Ty &get() const {
     // Throw a runtime error in case this envar is not initialized.
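The new assignment operator lets the plugin overwrite an envar's value after parsing it, e.g. to clamp it against a device limit, which `AMDGPUDeviceTy::init()` does below for `OMPX_NumQueues`. A minimal sketch of that idea, using a simplified stand-in for the real `Envar<Ty>` template (class and function names here are illustrative, not the actual libomptarget API):

```cpp
#include <algorithm>
#include <cassert>
#include <cstdlib>
#include <string>

// Simplified stand-in for libomptarget's Envar<Ty>: reads an environment
// variable at construction and, with this patch, also accepts assignment.
template <typename Ty> class SimpleEnvar {
  Ty Data;
  bool Initialized;

public:
  SimpleEnvar(const char *Name, Ty Default) : Data(Default), Initialized(true) {
    if (const char *Val = std::getenv(Name))
      Data = static_cast<Ty>(std::stoll(Val));
  }

  // The capability added by the patch: overwrite the parsed value while
  // keeping the envar usable afterwards.
  SimpleEnvar &operator=(const Ty &V) {
    Data = V;
    Initialized = true;
    return *this;
  }

  const Ty &get() const {
    assert(Initialized && "Envar not initialized!");
    return Data;
  }
};

// Usage mirroring the clamping done in the AMDGPU plugin's device init:
// bound the requested queue count by the device maximum, write it back.
int clampedNumQueues(SimpleEnvar<int> &NumQueues, int MaxQueues) {
  NumQueues = std::max(1, std::min(NumQueues.get(), MaxQueues));
  return NumQueues.get();
}
```

Writing the clamped value back through `operator=` means every later `get()` observes the effective (clamped) configuration rather than the raw environment value.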

openmp/libomptarget/plugins-nextgen/amdgpu/src/rtl.cpp

Lines changed: 133 additions & 38 deletions
@@ -583,10 +583,12 @@ using AMDGPUSignalManagerTy = GenericDeviceResourceManagerTy<AMDGPUSignalRef>;
 /// Class holding an HSA queue to submit kernel and barrier packets.
 struct AMDGPUQueueTy {
   /// Create an empty queue.
-  AMDGPUQueueTy() : Queue(nullptr), Mutex() {}
+  AMDGPUQueueTy() : Queue(nullptr), Mutex(), NumUsers(0) {}
 
-  /// Initialize a new queue belonging to a specific agent.
+  /// Lazily initialize a new queue belonging to a specific agent.
   Error init(hsa_agent_t Agent, int32_t QueueSize) {
+    if (Queue)
+      return Plugin::success();
     hsa_status_t Status =
         hsa_queue_create(Agent, QueueSize, HSA_QUEUE_TYPE_MULTI, callbackError,
                          nullptr, UINT32_MAX, UINT32_MAX, &Queue);
@@ -595,10 +597,22 @@ struct AMDGPUQueueTy {
 
   /// Deinitialize the queue and destroy its resources.
   Error deinit() {
+    std::lock_guard<std::mutex> Lock(Mutex);
+    if (!Queue)
+      return Plugin::success();
     hsa_status_t Status = hsa_queue_destroy(Queue);
     return Plugin::check(Status, "Error in hsa_queue_destroy: %s");
   }
 
+  /// Returns if this queue is considered busy
+  bool isBusy() const { return NumUsers > 0; }
+
+  /// Decrement user count of the queue object
+  void removeUser() { --NumUsers; }
+
+  /// Increase user count of the queue object
+  void addUser() { ++NumUsers; }
+
   /// Push a kernel launch to the queue. The kernel launch requires an output
   /// signal and can define an optional input signal (nullptr if none).
   Error pushKernelLaunch(const AMDGPUKernelTy &Kernel, void *KernelArgs,
@@ -611,6 +625,7 @@ struct AMDGPUQueueTy {
     // the addition of other packets to the queue. The following piece of code
     // should be lightweight; do not block the thread, allocate memory, etc.
     std::lock_guard<std::mutex> Lock(Mutex);
+    assert(Queue && "Interacted with a non-initialized queue!");
 
     // Avoid defining the input dependency if already satisfied.
     if (InputSignal && !InputSignal->load())
@@ -659,6 +674,7 @@ struct AMDGPUQueueTy {
                     const AMDGPUSignalTy *InputSignal2) {
     // Lock the queue during the packet publishing process.
     std::lock_guard<std::mutex> Lock(Mutex);
+    assert(Queue && "Interacted with a non-initialized queue!");
 
     // Push the barrier with the lock acquired.
     return pushBarrierImpl(OutputSignal, InputSignal1, InputSignal2);
@@ -777,6 +793,9 @@ struct AMDGPUQueueTy {
   /// TODO: There are other more advanced approaches to avoid this mutex using
   /// atomic operations. We can further investigate it if this is a bottleneck.
   std::mutex Mutex;
+
+  /// Indicates that the queue is busy when > 0
+  int NumUsers;
 };
 
 /// Struct that implements a stream of asynchronous operations for AMDGPU
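The busy-tracking added above is just a per-queue user count: a queue counts as busy while at least one stream holds it. A stripped-down sketch of that idea (the real `AMDGPUQueueTy` also owns the HSA queue handle and a mutex; this toy class models only the counter, and its name is illustrative):

```cpp
// Sketch of the user-count scheme behind AMDGPUQueueTy's NumUsers field:
// addUser()/removeUser() bracket a stream's ownership of the queue, and
// isBusy() lets the stream manager prefer idle queues when assigning one.
struct CountedQueue {
  int NumUsers = 0;

  bool isBusy() const { return NumUsers > 0; }
  void addUser() { ++NumUsers; }
  void removeUser() { --NumUsers; }
};
```

Note the count is a plain `int` rather than an atomic; in the plugin, updates happen under the stream manager's own synchronization, so no extra atomicity is needed here.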
@@ -886,7 +905,7 @@ struct AMDGPUStreamTy {
   hsa_agent_t Agent;
 
   /// The queue that the stream uses to launch kernels.
-  AMDGPUQueueTy &Queue;
+  AMDGPUQueueTy *Queue;
 
   /// The manager of signals to reuse signals.
   AMDGPUSignalManagerTy &SignalManager;
@@ -978,6 +997,9 @@ struct AMDGPUStreamTy {
   /// signal of the current stream, and 2) the last signal of the other stream.
   /// Use a barrier packet with two input signals.
   Error waitOnStreamOperation(AMDGPUStreamTy &OtherStream, uint32_t Slot) {
+    if (Queue == nullptr)
+      return Plugin::error("Target queue was nullptr");
+
     /// The signal that we must wait from the other stream.
     AMDGPUSignalTy *OtherSignal = OtherStream.Slots[Slot].Signal;
 
@@ -999,7 +1021,7 @@ struct AMDGPUStreamTy {
       return Err;
 
     // Push a barrier into the queue with both input signals.
-    return Queue.pushBarrier(OutputSignal, InputSignal, OtherSignal);
+    return Queue->pushBarrier(OutputSignal, InputSignal, OtherSignal);
   }
 
   /// Callback for running a specific asynchronous operation. This callback is
@@ -1085,6 +1107,9 @@ struct AMDGPUStreamTy {
                          uint32_t NumThreads, uint64_t NumBlocks,
                          uint32_t GroupSize,
                          AMDGPUMemoryManagerTy &MemoryManager) {
+    if (Queue == nullptr)
+      return Plugin::error("Target queue was nullptr");
+
     // Retrieve an available signal for the operation's output.
     AMDGPUSignalTy *OutputSignal = nullptr;
     if (auto Err = SignalManager.getResource(OutputSignal))
@@ -1102,8 +1127,8 @@ struct AMDGPUStreamTy {
       return Err;
 
     // Push the kernel with the output signal and an input signal (optional)
-    return Queue.pushKernelLaunch(Kernel, KernelArgs, NumThreads, NumBlocks,
-                                  GroupSize, OutputSignal, InputSignal);
+    return Queue->pushKernelLaunch(Kernel, KernelArgs, NumThreads, NumBlocks,
+                                   GroupSize, OutputSignal, InputSignal);
   }
 
   /// Push an asynchronous memory copy between pinned memory buffers.
@@ -1331,6 +1356,8 @@ struct AMDGPUStreamTy {
 
   /// Make the stream wait on an event.
   Error waitEvent(const AMDGPUEventTy &Event);
+
+  friend struct AMDGPUStreamManagerTy;
 };
 
 /// Class representing an event on AMDGPU. The event basically stores some
@@ -1428,6 +1455,99 @@ Error AMDGPUStreamTy::waitEvent(const AMDGPUEventTy &Event) {
   return waitOnStreamOperation(RecordedStream, Event.RecordedSlot);
 }
 
+struct AMDGPUStreamManagerTy final
+    : GenericDeviceResourceManagerTy<AMDGPUResourceRef<AMDGPUStreamTy>> {
+  using ResourceRef = AMDGPUResourceRef<AMDGPUStreamTy>;
+  using ResourcePoolTy = GenericDeviceResourceManagerTy<ResourceRef>;
+
+  AMDGPUStreamManagerTy(GenericDeviceTy &Device, hsa_agent_t HSAAgent)
+      : GenericDeviceResourceManagerTy(Device), NextQueue(0), Agent(HSAAgent) {}
+
+  Error init(uint32_t InitialSize, int NumHSAQueues, int HSAQueueSize) {
+    Queues = std::vector<AMDGPUQueueTy>(NumHSAQueues);
+    QueueSize = HSAQueueSize;
+    MaxNumQueues = NumHSAQueues;
+    // Initialize one queue eagerly
+    if (auto Err = Queues.front().init(Agent, QueueSize))
+      return Err;
+
+    return GenericDeviceResourceManagerTy::init(InitialSize);
+  }
+
+  /// Deinitialize the resource pool and delete all resources. This function
+  /// must be called before the destructor.
+  Error deinit() override {
+    // De-init all queues
+    for (AMDGPUQueueTy &Queue : Queues) {
+      if (auto Err = Queue.deinit())
+        return Err;
+    }
+
+    return GenericDeviceResourceManagerTy::deinit();
+  }
+
+  /// Get a single stream from the pool or create new resources.
+  virtual Error getResource(AMDGPUStreamTy *&StreamHandle) override {
+    return getResourcesImpl(1, &StreamHandle, [this](AMDGPUStreamTy *&Handle) {
+      return assignNextQueue(Handle);
+    });
+  }
+
+  /// Return stream to the pool.
+  virtual Error returnResource(AMDGPUStreamTy *StreamHandle) override {
+    return returnResourceImpl(StreamHandle, [](AMDGPUStreamTy *Handle) {
+      Handle->Queue->removeUser();
+      return Plugin::success();
+    });
+  }
+
+private:
+  /// Search for and assign an prefereably idle queue to the given Stream. If
+  /// there is no queue without current users, resort to round robin selection.
+  inline Error assignNextQueue(AMDGPUStreamTy *Stream) {
+    uint32_t StartIndex = NextQueue % MaxNumQueues;
+    AMDGPUQueueTy *Q = nullptr;
+
+    for (int i = 0; i < MaxNumQueues; ++i) {
+      Q = &Queues[StartIndex++];
+      if (StartIndex == MaxNumQueues)
+        StartIndex = 0;
+
+      if (Q->isBusy())
+        continue;
+      else {
+        if (auto Err = Q->init(Agent, QueueSize))
+          return Err;
+
+        Q->addUser();
+        Stream->Queue = Q;
+        return Plugin::success();
+      }
+    }
+
+    // All queues busy: Round robin (StartIndex has the initial value again)
+    Queues[StartIndex].addUser();
+    Stream->Queue = &Queues[StartIndex];
+    ++NextQueue;
+    return Plugin::success();
+  }
+
+  /// The next queue index to use for round robin selection.
+  uint32_t NextQueue;
+
+  /// The queues which are assigned to requested streams.
+  std::vector<AMDGPUQueueTy> Queues;
+
+  /// The corresponding device as HSA agent.
+  hsa_agent_t Agent;
+
+  /// The maximum number of queues.
+  int MaxNumQueues;
+
+  /// The size of created queues.
+  int QueueSize;
+};
+
 /// Abstract class that holds the common members of the actual kernel devices
 /// and the host device. Both types should inherit from this class.
 struct AMDGenericDeviceTy {
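The selection policy in `assignNextQueue` above can be modeled in isolation: scan all queues starting at the round-robin cursor and take the first idle one (initializing it lazily on first use), and only if every queue is busy fall back to plain round robin. A self-contained sketch under those assumptions, with lazy HSA queue creation reduced to a flag (the `Toy*` names are illustrative, not the plugin's API):

```cpp
#include <vector>

// Simplified model of AMDGPUStreamManagerTy::assignNextQueue().
struct ToyQueue {
  bool Initialized = false; // Stands in for the lazily created HSA queue.
  int NumUsers = 0;
  bool isBusy() const { return NumUsers > 0; }
};

struct ToyQueuePool {
  std::vector<ToyQueue> Queues;
  unsigned NextQueue = 0; // Round-robin cursor.

  explicit ToyQueuePool(unsigned MaxNumQueues) : Queues(MaxNumQueues) {}

  // Returns the index of the queue assigned to a newly requested stream.
  unsigned assignNextQueue() {
    unsigned StartIndex = NextQueue % Queues.size();

    // First pass: prefer an idle queue, initializing it only when picked.
    for (unsigned I = 0; I < Queues.size(); ++I) {
      unsigned Index = StartIndex;
      ToyQueue &Q = Queues[Index];
      if (++StartIndex == Queues.size())
        StartIndex = 0;

      if (Q.isBusy())
        continue;

      Q.Initialized = true; // Lazy init happens on first assignment.
      ++Q.NumUsers;
      return Index;
    }

    // All queues busy: plain round robin (StartIndex wrapped back around).
    unsigned Index = StartIndex;
    ++Queues[Index].NumUsers;
    ++NextQueue;
    return Index;
  }
};
```

One difference from the real manager: there, queue 0 is initialized eagerly in `init()` so the common single-stream case never pays the creation cost on the hot path; here every queue initializes on first pick to keep the sketch small.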
@@ -1607,9 +1727,8 @@ struct AMDGPUDeviceTy : public GenericDeviceTy, AMDGenericDeviceTy {
         OMPX_InitialNumSignals("LIBOMPTARGET_AMDGPU_NUM_INITIAL_HSA_SIGNALS",
                                64),
         OMPX_StreamBusyWait("LIBOMPTARGET_AMDGPU_STREAM_BUSYWAIT", 2000000),
-        AMDGPUStreamManager(*this), AMDGPUEventManager(*this),
-        AMDGPUSignalManager(*this), Agent(Agent), HostDevice(HostDevice),
-        Queues() {}
+        AMDGPUStreamManager(*this, Agent), AMDGPUEventManager(*this),
+        AMDGPUSignalManager(*this), Agent(Agent), HostDevice(HostDevice) {}
 
   ~AMDGPUDeviceTy() {}
 
@@ -1676,17 +1795,12 @@ struct AMDGPUDeviceTy : public GenericDeviceTy, AMDGenericDeviceTy {
       return Err;
 
     // Compute the number of queues and their size.
-    const uint32_t NumQueues = std::min(OMPX_NumQueues.get(), MaxQueues);
-    const uint32_t QueueSize = std::min(OMPX_QueueSize.get(), MaxQueueSize);
-
-    // Construct and initialize each device queue.
-    Queues = std::vector<AMDGPUQueueTy>(NumQueues);
-    for (AMDGPUQueueTy &Queue : Queues)
-      if (auto Err = Queue.init(Agent, QueueSize))
-        return Err;
+    OMPX_NumQueues = std::max(1U, std::min(OMPX_NumQueues.get(), MaxQueues));
+    OMPX_QueueSize = std::min(OMPX_QueueSize.get(), MaxQueueSize);
 
     // Initialize stream pool.
-    if (auto Err = AMDGPUStreamManager.init(OMPX_InitialNumStreams))
+    if (auto Err = AMDGPUStreamManager.init(OMPX_InitialNumStreams,
+                                            OMPX_NumQueues, OMPX_QueueSize))
       return Err;
 
     // Initialize event pool.
@@ -1725,11 +1839,6 @@ struct AMDGPUDeviceTy : public GenericDeviceTy, AMDGenericDeviceTy {
       }
     }
 
-    for (AMDGPUQueueTy &Queue : Queues) {
-      if (auto Err = Queue.deinit())
-        return Err;
-    }
-
     // Invalidate agent reference.
     Agent = {0};
 
@@ -2416,19 +2525,8 @@ struct AMDGPUDeviceTy : public GenericDeviceTy, AMDGenericDeviceTy {
     });
   }
 
-  /// Get the next queue in a round-robin fashion.
-  AMDGPUQueueTy &getNextQueue() {
-    static std::atomic<uint32_t> NextQueue(0);
-
-    uint32_t Current = NextQueue.fetch_add(1, std::memory_order_relaxed);
-    return Queues[Current % Queues.size()];
-  }
-
 private:
-  using AMDGPUStreamRef = AMDGPUResourceRef<AMDGPUStreamTy>;
   using AMDGPUEventRef = AMDGPUResourceRef<AMDGPUEventTy>;
-
-  using AMDGPUStreamManagerTy = GenericDeviceResourceManagerTy<AMDGPUStreamRef>;
   using AMDGPUEventManagerTy = GenericDeviceResourceManagerTy<AMDGPUEventRef>;
 
   /// Envar for controlling the number of HSA queues per device. High number of
@@ -2484,9 +2582,6 @@ struct AMDGPUDeviceTy : public GenericDeviceTy, AMDGenericDeviceTy {
 
   /// Reference to the host device.
   AMDHostDeviceTy &HostDevice;
-
-  /// List of device packet queues.
-  std::vector<AMDGPUQueueTy> Queues;
 };
 
 Error AMDGPUDeviceImageTy::loadExecutable(const AMDGPUDeviceTy &Device) {
@@ -2558,7 +2653,7 @@ Error AMDGPUResourceRef<ResourceTy>::create(GenericDeviceTy &Device) {
 }
 
 AMDGPUStreamTy::AMDGPUStreamTy(AMDGPUDeviceTy &Device)
-    : Agent(Device.getAgent()), Queue(Device.getNextQueue()),
+    : Agent(Device.getAgent()), Queue(nullptr),
       SignalManager(Device.getSignalManager()), Device(Device),
       // Initialize the std::deque with some empty positions.
      Slots(32), NextSlot(0), SyncCycle(0), RPCServer(nullptr),

openmp/libomptarget/plugins-nextgen/common/PluginInterface/PluginInterface.cpp

Lines changed: 3 additions & 3 deletions
@@ -396,9 +396,9 @@ GenericDeviceTy::GenericDeviceTy(int32_t DeviceId, int32_t NumDevices,
       // device initialization. These cannot be consulted until the device is
       // initialized correctly. We intialize them in GenericDeviceTy::init().
       OMPX_TargetStackSize(), OMPX_TargetHeapSize(),
-      // By default, the initial number of streams and events are 32.
-      OMPX_InitialNumStreams("LIBOMPTARGET_NUM_INITIAL_STREAMS", 32),
-      OMPX_InitialNumEvents("LIBOMPTARGET_NUM_INITIAL_EVENTS", 32),
+      // By default, the initial number of streams and events is 1.
+      OMPX_InitialNumStreams("LIBOMPTARGET_NUM_INITIAL_STREAMS", 1),
+      OMPX_InitialNumEvents("LIBOMPTARGET_NUM_INITIAL_EVENTS", 1),
      DeviceId(DeviceId), GridValues(OMPGridValues),
      PeerAccesses(NumDevices, PeerAccessState::PENDING), PeerAccessesLock(),
      PinnedAllocs(*this), RPCServer(nullptr) {

openmp/libomptarget/plugins-nextgen/common/PluginInterface/PluginInterface.h

Lines changed: 2 additions & 2 deletions
@@ -1168,7 +1168,7 @@ template <typename ResourceRef> class GenericDeviceResourceManagerTy {
 
   /// Deinitialize the resource pool and delete all resources. This function
   /// must be called before the destructor.
-  Error deinit() {
+  virtual Error deinit() {
     if (NextAvailable)
       DP("Missing %d resources to be returned\n", NextAvailable);
 
@@ -1252,7 +1252,7 @@ template <typename ResourceRef> class GenericDeviceResourceManagerTy {
     return Plugin::success();
   }
 
-private:
+protected:
   /// The resources between \p OldSize and \p NewSize need to be created or
   /// destroyed. The mutex is locked when this function is called.
   Error resizeResourcePoolImpl(uint32_t OldSize, uint32_t NewSize) {
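Making `deinit()` virtual matters because the new `AMDGPUStreamManagerTy` overrides it to tear down its HSA queues before delegating to the base pool's teardown; without the `virtual`, a call through the generic manager type would skip the override. A minimal sketch of that pattern, using illustrative stand-in classes (and plain `int` return codes in place of the plugin's `Error` type):

```cpp
#include <vector>

// Base resource manager: deinit() is now virtual so derived cleanup runs
// even when the manager is used through the generic interface.
struct BaseManager {
  bool PoolDeinitialized = false;

  virtual ~BaseManager() = default;

  virtual int deinit() {
    PoolDeinitialized = true; // Stands in for releasing the pooled resources.
    return 0;                 // 0 == success.
  }
};

// Derived manager owning extra resources ("queues") beyond the base pool.
struct StreamManager : BaseManager {
  std::vector<int> Queues{1, 2, 3};

  int deinit() override {
    Queues.clear();               // De-init all queues first...
    return BaseManager::deinit(); // ...then tear down the base pool.
  }
};
```

With the base method non-virtual, `BaseManager *M = &SM; M->deinit();` would release only the pool and leak the derived manager's queues; the `virtual`/`override` pair is what routes the call to the full cleanup.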

0 commit comments