
Commit e6a3d2d

[libc][Docs] Update the GPU RPC documentation
Summary: This adds some more concrete information on the RPC interface. Hopefully this is intelligible and provides some useful examples.
1 parent 621bafd commit e6a3d2d

File tree

3 files changed: +289 -7 lines changed

libc/docs/gpu/rpc-diagram.svg

Lines changed: 1 addition & 0 deletions

libc/docs/gpu/rpc.rst

Lines changed: 285 additions & 4 deletions
@@ -11,10 +11,291 @@ Remote Procedure Calls
Remote Procedure Call Implementation
====================================

-Certain features from the standard C library, such as allocation or printing,
-require support from the operating system. We instead implement a remote
-procedure call (RPC) interface to allow submitting work from the GPU to a host
-server that forwards it to the host system.

Traditionally, the C library abstracts over several functions that interface
with the platform's operating system through system calls. The GPU, however,
does not provide an operating system that can be queried to handle
target-dependent operations. Instead, we rely on remote procedure calls to
interface with the host's operating system while executing on a GPU.

Creating remote procedure calls requires a communication channel that can be
shared between the CPU host and the GPU device. All modern GPU platforms
provide some form of device-accessible host memory, typically through pinning
or paging. Furthermore, this memory supports asynchronous access and atomic
operations, which allows us to perform mutual exclusion on a buffer shared
between the two processes.

This work treats the GPU as a client and the host as a server. The client
initiates communication while the server listens for it. In order to
communicate between the host and the device, we simply maintain a buffer of
memory and two mailboxes. One mailbox is write-only while the other is
read-only. This exposes three primitive operations: using the buffer, giving
away ownership, and waiting for ownership. Together these provide a half-duplex
transmission channel between the two sides. We decided to assign ownership of
the buffer to the client when the inbox and outbox bits are equal and to the
server when they are not.
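
As a minimal sketch of this ownership rule (the function names here are
illustrative, not part of the library's interface):

.. code-block:: c++

  #include <cstdint>

  // Illustrative statement of the ownership rule described above: the client
  // owns the buffer when the mailbox bits match; the server owns it when they
  // differ. Toggling one bit is therefore how a side gives ownership away.
  bool client_owns_buffer(uint32_t inbox, uint32_t outbox) {
    return inbox == outbox;
  }
  bool server_owns_buffer(uint32_t inbox, uint32_t outbox) {
    return inbox != outbox;
  }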

In order to make this transmission channel thread-safe, we abstract ownership
of the given mailbox pair and buffer around a port, which effectively acts as a
lock and an index into the allocated buffer. The server and client have
independent locks around the given port. In this scheme, the buffer can be used
to communicate intent and data generically with the server.

If this were simply a standard CPU system, this would be sufficient. However,
GPUs have many unique architectural challenges. First, GPU threads execute in
lock-step with each other in groups typically called warps or wavefronts. We
need to target the smallest unit of independent parallelism, so the RPC
interface needs to handle an entire group of threads at once. This is done by
increasing the size of the buffer and adding a thread mask argument so the
server knows which threads are active when it handles the communication.
Second, GPUs generally have no forward progress guarantees. In order to
guarantee we do not encounter deadlocks while executing, it is required that
the number of ports matches the maximum amount of hardware parallelism on the
device. It is also very important that the thread mask remains consistent while
interfacing with the port.
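
A rough, hypothetical sketch of what this implies for the packet layout (the
type and field names here are simplifications assumed for illustration, not the
real implementation):

.. code-block:: c++

  #include <cstdint>

  // Hypothetical, simplified packet layout: a header carrying the opcode and
  // the mask of active threads, followed by one fixed-size slot per lane so an
  // entire warp or wavefront can communicate through a single port.
  constexpr uint32_t LANE_SIZE = 32; // Warp/wavefront size; device dependent.

  struct Buffer {
    uint64_t data[8]; // One fixed-size slot; stand-in for the real type.
  };

  struct Packet {
    uint64_t opcode;           // Which operation the server should perform.
    uint64_t mask;             // Bitmask of threads active in this group.
    Buffer payload[LANE_SIZE]; // One slot per thread in the group.
  };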

.. image:: ./rpc-diagram.svg
  :width: 75%
  :align: center

The above diagram outlines the architecture of the RPC interface. For clarity,
the following lists explain the operations performed by the client and the
server, respectively, when initiating a communication.

First, a communication from the perspective of the client (sketched in code
after the list):

* The client searches for an available port and claims the lock.
* The client checks that the port is still available to the current device and
  continues if so.
* The client writes its data to the fixed-size packet and toggles its outbox.
* The client waits until its inbox matches its outbox.
* The client reads the data from the fixed-size packet.
* The client closes the port and continues executing.
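
A minimal, self-contained sketch of these steps in terms of the mailbox bits
(the ``PortState`` type and the use of ``std::atomic`` are illustrative
assumptions, not the library's actual representation):

.. code-block:: c++

  #include <atomic>
  #include <cstdint>

  // Illustrative model of one port's mailbox pair. In the real interface these
  // live in device-accessible host memory shared by the CPU and GPU.
  struct PortState {
    std::atomic<uint32_t> inbox{0};  // Written by the server, read by the client.
    std::atomic<uint32_t> outbox{0}; // Written by the client, read by the server.
    uint64_t buffer[8];              // The shared fixed-size packet.
  };

  // Client-side transaction: use the buffer while inbox == outbox, toggle the
  // outbox to give ownership away, then wait for the inbox to match again.
  void client_transaction(PortState &port, uint64_t request, uint64_t &response) {
    uint32_t out = port.outbox.load() ^ 1;
    port.buffer[0] = request;        // Use the buffer while we own it.
    port.outbox.store(out);          // Give ownership to the server.
    while (port.inbox.load() != out) // Wait until we own the buffer again.
      ;
    response = port.buffer[0];       // Read the server's reply.
  }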

Now, the same communication from the perspective of the server:

* The server searches for an available port with pending work and claims the
  lock.
* The server checks that the port is still available to the current device.
* The server reads the opcode to perform the expected operation, in this
  case a receive and then send.
* The server reads the data from the fixed-size packet.
* The server writes its data to the fixed-size packet and toggles its outbox.
* The server closes the port and continues searching for ports that need to be
  serviced.

This architecture currently requires that the host periodically checks the RPC
server's buffer for ports with pending work. Note that a port can be closed
without waiting for its submitted work to be completed. This allows us to model
asynchronous operations that do not need to wait until the server has completed
them. If an operation requires more data than the fixed-size buffer can hold,
we simply send multiple packets back and forth in a streaming fashion.
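
For illustration, a sketch of this streaming pattern in terms of the ``send``
interface shown in the client example below; the ``rpc::Buffer`` layout is
assumed, and the library's ``send_n`` interface implements this for you:

.. code-block:: c++

  // Illustrative sketch of streaming a payload larger than one packet: the
  // first packet carries the total size, then the bytes follow in fixed-size
  // chunks, handing buffer ownership back and forth once per packet.
  void stream_bytes(rpc::Client::Port &port, const uint8_t *data, uint64_t size) {
    port.send([&](rpc::Buffer *buffer) { buffer->data[0] = size; });
    for (uint64_t offset = 0; offset < size; offset += sizeof(rpc::Buffer::data)) {
      port.send([&](rpc::Buffer *buffer) {
        uint64_t len = size - offset < sizeof(rpc::Buffer::data)
                           ? size - offset
                           : sizeof(rpc::Buffer::data);
        __builtin_memcpy(buffer->data, data + offset, len);
      });
    }
  }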

Server Library
--------------

The RPC server's basic functionality is provided by the LLVM C library. A
static library called ``libllvmlibc_rpc_server.a`` includes handling for the
basic operations, such as printing or exiting. This has a small API that
handles setting up the unified buffer and an interface to check the opcodes.

Some operations are too divergent to provide generic implementations for, such
as allocating device-accessible memory. For these cases, we provide a callback
registration scheme to add a custom handler for any given opcode through the
port API. More information can be found in the installed header
``<install>/include/gpu-none-llvm/rpc_server.h``.
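
As an illustration only, registering a custom handler might look roughly like
the following; ``CUSTOM_OPCODE`` and the exact registration signature are
assumptions here, so consult the installed header for the real interface:

.. code-block:: c++

  #include <gpu-none-llvm/rpc_server.h>

  // Hypothetical sketch of registering a handler for a custom opcode. The
  // callback receives the port so it can use the same send/recv primitives as
  // the built-in handlers. Check rpc_server.h for the actual names.
  void handle_custom_opcode(rpc_port_t port, void *data) {
    // Service the pending work on the port here.
  }

  void register_handler(uint32_t device) {
    if (rpc_status_t err = rpc_register_callback(device, /*opcode=*/CUSTOM_OPCODE,
                                                 handle_custom_opcode,
                                                 /*data=*/nullptr))
      /* handle the error */;
  }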

Client Example
--------------

The Client API is not currently exported by the LLVM C library. This is
primarily because it is written in C++ and relies on internal data structures.
It uses a simple send and receive interface with a fixed-size packet. The
following example uses the RPC interface to call a function pointer on the
server.

This code first opens a port with the given opcode to facilitate the
communication. It then copies over the argument struct to the server using the
``send_n`` interface to stream arbitrary bytes. The next send operation provides
the server with the function pointer that will be executed. The final receive
operation is a no-op that simply forces the client to wait until the server is
done. It can be omitted if asynchronous execution is desired.

.. code-block:: c++

  void rpc_host_call(void *fn, void *data, size_t size) {
    rpc::Client::Port port = rpc::client.open<RPC_HOST_CALL>();
    port.send_n(data, size);
    port.send([=](rpc::Buffer *buffer) {
      buffer->data[0] = reinterpret_cast<uintptr_t>(fn);
    });
    port.recv([](rpc::Buffer *) {});
    port.close();
  }

Server Example
--------------

This example shows the server-side handling of the previous client example.
Each time the server is checked, it scans for ports with pending work, reads
the opcode, and performs the appropriate action. In this case, the action is to
call a function pointer provided by the client.

In this example, the server simply runs forever for brevity's sake. Because the
client is a GPU potentially handling several threads at once, the server needs
to loop over all the active threads on the GPU. We abstract this into the
``LANE_SIZE`` variable, which is simply the device's warp or wavefront size. The
identifier is simply the thread's index into the current warp or wavefront.

.. code-block:: c++

  for (;;) {
    auto port = server.try_open(index);
    if (!port)
      continue;

    switch (port->get_opcode()) {
    case RPC_HOST_CALL: {
      uint64_t sizes[LANE_SIZE];
      void *args[LANE_SIZE];
      port->recv_n(args, sizes, [&](uint64_t size) { return new char[size]; });
      port->recv([&](rpc::Buffer *buffer, uint32_t id) {
        reinterpret_cast<void (*)(void *)>(buffer->data[0])(args[id]);
      });
      port->send([&](rpc::Buffer *, uint32_t id) {
        delete[] reinterpret_cast<uint8_t *>(args[id]);
      });
      break;
    }
    default:
      port->recv([](rpc::Buffer *) {});
      break;
    }
  }

CUDA Server Example
-------------------

The following code shows an example of using the exported RPC interface along
with the C library to manually configure a working server using the CUDA
language. Other runtimes can use the presence of the ``__llvm_libc_rpc_client``
symbol in the GPU executable as an indicator for whether or not the server can
be checked.

.. code-block:: cuda

  #include <cstdio>
  #include <cstdlib>
  #include <cuda_runtime.h>

  #include <gpu-none-llvm/rpc_server.h>

  [[noreturn]] void handle_error(cudaError_t err) {
    fprintf(stderr, "CUDA error: %s\n", cudaGetErrorString(err));
    exit(EXIT_FAILURE);
  }

  [[noreturn]] void handle_error(rpc_status_t err) {
    fprintf(stderr, "RPC error: %d\n", err);
    exit(EXIT_FAILURE);
  }

  // The handle to the RPC client provided by the C library.
  extern "C" __device__ void *__llvm_libc_rpc_client;

  __global__ void get_client_ptr(void **ptr) { *ptr = __llvm_libc_rpc_client; }

  // Obtain the RPC client's handle from the device. The CUDA language cannot
  // look up the symbol directly like the driver API, so we launch a kernel to
  // read it.
  void *get_rpc_client() {
    void *rpc_client = nullptr;
    void **rpc_client_d = nullptr;

    if (cudaError_t err = cudaMalloc(&rpc_client_d, sizeof(void *)))
      handle_error(err);
    get_client_ptr<<<1, 1>>>(rpc_client_d);
    if (cudaError_t err = cudaDeviceSynchronize())
      handle_error(err);
    if (cudaError_t err = cudaMemcpy(&rpc_client, rpc_client_d, sizeof(void *),
                                     cudaMemcpyDeviceToHost))
      handle_error(err);
    return rpc_client;
  }

  // Routines to allocate mapped memory that both the host and the device can
  // access asynchronously to communicate with each other.
  void *alloc_host(size_t size, void *) {
    void *sharable_ptr;
    if (cudaError_t err = cudaMallocHost(&sharable_ptr, size))
      handle_error(err);
    return sharable_ptr;
  }

  void free_host(void *ptr, void *) {
    if (cudaError_t err = cudaFreeHost(ptr))
      handle_error(err);
  }

  // The device-side overload of the standard C function to call.
  extern "C" __device__ int puts(const char *);

  // Calls the C library function from the GPU C library.
  __global__ void hello() { puts("Hello world!"); }

  int main() {
    int device = 0;
    // Initialize the RPC server to run on a single device.
    if (rpc_status_t err = rpc_init(/*num_device=*/1))
      handle_error(err);

    // Initialize the RPC server to run on the given device.
    if (rpc_status_t err =
            rpc_server_init(device, RPC_MAXIMUM_PORT_COUNT,
                            /*warp_size=*/32, alloc_host, /*data=*/nullptr))
      handle_error(err);

    // Initialize the RPC client by copying the buffer to the device's handle.
    void *rpc_client = get_rpc_client();
    if (cudaError_t err =
            cudaMemcpy(rpc_client, rpc_get_client_buffer(device),
                       rpc_get_client_size(), cudaMemcpyHostToDevice))
      handle_error(err);

    cudaStream_t stream;
    if (cudaError_t err = cudaStreamCreate(&stream))
      handle_error(err);

    // Execute the kernel.
    hello<<<1, 1, 0, stream>>>();

    // While the kernel is executing, check the RPC server for work to do.
    while (cudaStreamQuery(stream) == cudaErrorNotReady)
      if (rpc_status_t err = rpc_handle_server(device))
        handle_error(err);

    // Shut down the server running on the given device.
    if (rpc_status_t err =
            rpc_server_shutdown(device, free_host, /*data=*/nullptr))
      handle_error(err);

    // Shut down the entire RPC server interface.
    if (rpc_status_t err = rpc_shutdown())
      handle_error(err);

    return EXIT_SUCCESS;
  }

The above code must be compiled in CUDA's relocatable device code mode and with
the advanced offloading driver to link in the library. Currently this can be
done with the following invocation. Using LTO avoids the overhead normally
associated with relocatable device code linking.

.. code-block:: sh

  $> clang++ -x cuda rpc.cpp --offload-arch=native -fgpu-rdc -lcudart -lcgpu \
       -I<install-path>/include -L<install-path>/lib -lllvmlibc_rpc_server \
       -O3 -foffload-lto -o hello
  $> ./hello
  Hello world!

Extensions
----------

libc/docs/gpu/testing.rst

Lines changed: 3 additions & 3 deletions
@@ -18,9 +18,9 @@ Testing Infrastructure
======================

The testing support in LLVM's libc implementation for GPUs is designed to mimic
-the standard unit tests as much as possible. We use the `remote procedure call
-<libc_gpu_rpc>`_ support to provide the necessary utilities like printing from
-the GPU. Execution is performed by emitting a ``_start`` kernel from the GPU
+the standard unit tests as much as possible. We use the :ref:`libc_gpu_rpc`
+support to provide the necessary utilities like printing from the GPU. Execution
+is performed by emitting a ``_start`` kernel from the GPU
that is then called by an external loader utility. This is an example of how
this can be done manually: