@@ -11,10 +11,291 @@ Remote Procedure Calls
Remote Procedure Call Implementation
====================================

- Certain features from the standard C library, such as allocation or printing,
- require support from the operating system. We instead implement a remote
- procedure call (RPC) interface to allow submitting work from the GPU to a host
- server that forwards it to the host system.
+ Traditionally, the C library abstracts over several functions that interface
+ with the platform's operating system through system calls. The GPU, however,
+ does not provide an operating system that can be queried to handle
+ target-dependent operations. Instead, we rely on remote procedure calls to
+ interface with the host's operating system while executing on a GPU.
+
+ Creating remote procedure calls requires a communication channel that can be
+ shared between the CPU host and the GPU device. All modern GPU platforms
+ provide a form of device-accessible host memory, typically through pinning or
+ paging. Furthermore, this memory supports asynchronous access and atomic
+ operations, which allow us to perform mutual exclusion on a shared buffer
+ between the two processes.
+
+ This work treats the GPU as a client and the host as a server. The client
+ initiates communications while the server listens for them. To communicate
+ between the host and the device, we simply maintain a buffer of memory and two
+ mailboxes. One mailbox is write-only while the other is read-only. This exposes
+ three primitive operations: using the buffer, giving away ownership, and
+ waiting for ownership. Together these provide a half-duplex transmission
+ channel between the two sides. We assign ownership of the buffer to the client
+ when the inbox and outbox bits are equal and to the server when they are not.
+
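+ The ownership rule can be illustrated with a small sketch. This is not the
+ library's actual implementation; the ``Mailboxes`` type and the helper names
+ below are invented purely to show the idea.
+
+ .. code-block:: c++
+
+   #include <cstdint>
+
+   // One mailbox is written by the other process and one is written by us.
+   struct Mailboxes {
+     uint32_t inbox;  // Read-only for us, written by the other side.
+     uint32_t outbox; // Write-only for us, read by the other side.
+   };
+
+   // The client owns the buffer when the two bits match.
+   bool client_owns_buffer(const Mailboxes &m) { return m.inbox == m.outbox; }
+
+   // Giving away ownership toggles the outbox so the bits no longer match.
+   void client_give_away_ownership(Mailboxes &m) { m.outbox = m.outbox ^ 1u; }
+
+   // Waiting for ownership spins until the inbox catches up. The real protocol
+   // uses atomic loads and a backoff instead of a plain busy loop.
+   void client_wait_for_ownership(const Mailboxes &m) {
+     while (m.inbox != m.outbox)
+       ;
+   }
+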
+ To make this transmission channel thread-safe, we wrap ownership of a given
+ mailbox pair and buffer in a port, which effectively acts as a lock and an
+ index into the allocated buffer. The server and the device hold independent
+ locks around a given port. In this scheme, the buffer can be used to
+ communicate intent and data generically with the server.
+
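+ As a rough sketch of what a port amounts to, the code below claims a slot with
+ an atomic test-and-set lock. The ``try_claim_port`` and ``release_port`` names
+ are invented for this illustration and do not correspond to the library's
+ internal interface.
+
+ .. code-block:: c++
+
+   #include <atomic>
+   #include <cstdint>
+
+   // A port is little more than a lock plus an index into the shared buffer.
+   // The client and the server each keep their own array of locks.
+   struct Port {
+     uint32_t index; // Which packet slot in the shared buffer this port uses.
+   };
+
+   // Try to claim port `i` by atomically setting its lock. Returns true on
+   // success; the caller then owns the corresponding slot until it releases it.
+   bool try_claim_port(std::atomic<uint32_t> *locks, uint32_t i) {
+     uint32_t expected = 0;
+     return locks[i].compare_exchange_strong(expected, 1,
+                                             std::memory_order_acquire);
+   }
+
+   void release_port(std::atomic<uint32_t> *locks, uint32_t i) {
+     locks[i].store(0, std::memory_order_release);
+   }
+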
+ If this were simply a standard CPU system, this would be sufficient. However,
+ GPUs present their own unique architectural challenges. First, GPU threads
+ execute in lock-step with each other in groups typically called warps or
+ wavefronts. We need to target the smallest unit of independent parallelism, so
+ the RPC interface needs to handle an entire group of threads at once. This is
+ done by increasing the size of the buffer and adding a thread mask argument so
+ the server knows which threads are active when it handles the communication.
+ Second, GPUs generally have no forward progress guarantees. To guarantee that
+ we do not encounter deadlocks while executing, the number of ports must match
+ the maximum amount of hardware parallelism on the device. It is also very
+ important that the thread mask remains consistent while interfacing with the
+ port.
+
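+ The effect of servicing a whole warp or wavefront at once is easiest to see in
+ the shape of a packet. The layout below is only a sketch that assumes a
+ 32-lane group and an arbitrary payload size; the real packing used by the
+ library differs in its details.
+
+ .. code-block:: c++
+
+   #include <cstdint>
+
+   constexpr uint32_t LANE_SIZE = 32; // Assumed warp/wavefront width.
+
+   // Fixed-size payload for a single lane.
+   struct Buffer {
+     uint64_t data[8];
+   };
+
+   // A packet sized for an entire group of threads. The mask records which
+   // lanes were active when the call was made so the server only touches those
+   // slots.
+   struct Packet {
+     uint64_t activemask;     // One bit per lane that joined the call.
+     uint16_t opcode;         // Which operation the server should perform.
+     Buffer lanes[LANE_SIZE]; // One fixed-size slot per lane in the group.
+   };
+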
+ .. image:: ./rpc-diagram.svg
+    :width: 75%
+    :align: center
+
+ The above diagram outlines the architecture of the RPC interface. For clarity,
+ the following lists explain the operations performed by the client and the
+ server, respectively, when initiating a communication.
+
+ First, a communication from the perspective of the client:
+
+ * The client searches for an available port and claims the lock.
+ * The client checks that the port is still available to the current device and
+   continues if so.
+ * The client writes its data to the fixed-size packet and toggles its outbox.
+ * The client waits until its inbox matches its outbox.
+ * The client reads the data from the fixed-size packet.
+ * The client closes the port and continues executing.
+
+ Now, the same communication from the perspective of the server:
+
+ * The server searches for an available port with pending work and claims the
+   lock.
+ * The server checks that the port is still available to the current device.
+ * The server reads the opcode to perform the expected operation, in this case a
+   receive and then a send.
+ * The server reads the data from the fixed-size packet.
+ * The server writes its data to the fixed-size packet and toggles its outbox.
+ * The server closes the port and continues searching for ports that need to be
+   serviced.
+
+ This architecture currently requires that the host periodically check the RPC
+ server's buffer for ports with pending work. Note that a port can be closed
+ without waiting for its submitted work to be completed. This allows us to model
+ asynchronous operations that do not need to wait until the server has completed
+ them. If an operation requires more data than fits in the fixed-size buffer, we
+ simply send multiple packets back and forth in a streaming fashion, as sketched
+ below.
+
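+ A minimal sketch of that streaming idea is shown below. It assumes a generic
+ ``send_packet`` routine that transmits one fixed-size packet and a made-up
+ packet size; it is illustrative only and not the library's ``send_n``
+ implementation.
+
+ .. code-block:: c++
+
+   #include <cstddef>
+   #include <cstdint>
+   #include <cstring>
+
+   constexpr size_t PACKET_BYTES = 64; // Assumed fixed packet payload size.
+
+   // Stream an arbitrarily sized payload through fixed-size packets. The first
+   // packet carries the total size so the receiver knows how much to expect.
+   template <typename SendFn>
+   void stream_bytes(const void *data, size_t size, SendFn &&send_packet) {
+     uint8_t packet[PACKET_BYTES] = {0};
+     std::memcpy(packet, &size, sizeof(size));
+     send_packet(packet);
+
+     const uint8_t *bytes = static_cast<const uint8_t *>(data);
+     for (size_t offset = 0; offset < size; offset += PACKET_BYTES) {
+       size_t chunk = size - offset < PACKET_BYTES ? size - offset : PACKET_BYTES;
+       std::memcpy(packet, bytes + offset, chunk);
+       send_packet(packet); // Each subsequent packet carries the next chunk.
+     }
+   }
+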
+ Server Library
+ --------------
+
+ The RPC server's basic functionality is provided by the LLVM C library. A
+ static library called ``libllvmlibc_rpc_server.a`` includes handling for the
+ basic operations, such as printing or exiting. This has a small API that
+ handles setting up the unified buffer and an interface to check the opcodes.
+
+ Some operations are too divergent to provide generic implementations for, such
+ as allocating device accessible memory. For these cases, we provide a callback
+ registration scheme to add a custom handler for any given opcode through the
+ port API. More information can be found in the installed header
+ ``<install>/include/gpu-none-llvm/rpc_server.h``.
+
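+ The following sketch shows the general shape of such a registration scheme: a
+ table mapping opcodes to handlers that the server consults before falling back
+ to its built-in behaviour. The class and function names here are invented for
+ illustration and are not the interface declared in ``rpc_server.h``.
+
+ .. code-block:: c++
+
+   #include <cstdint>
+   #include <functional>
+   #include <unordered_map>
+
+   // Hypothetical opcode-to-handler table. A handler receives an opaque port
+   // handle it can use to receive and send packets.
+   using OpcodeHandler = std::function<void(void *port)>;
+
+   class HandlerRegistry {
+     std::unordered_map<uint16_t, OpcodeHandler> handlers;
+
+   public:
+     // Associate a custom handler with an application-chosen opcode.
+     void register_callback(uint16_t opcode, OpcodeHandler handler) {
+       handlers[opcode] = std::move(handler);
+     }
+
+     // Called by the server loop when a port with this opcode has pending work.
+     // Returns false if no custom handler exists so the caller can fall back to
+     // the built-in handling.
+     bool dispatch(uint16_t opcode, void *port) const {
+       auto it = handlers.find(opcode);
+       if (it == handlers.end())
+         return false;
+       it->second(port);
+       return true;
+     }
+   };
+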
+ Client Example
+ --------------
+
+ The client API is not currently exported by the LLVM C library, primarily
+ because it is written in C++ and relies on internal data structures. It uses a
+ simple send and receive interface with a fixed-size packet. The following
+ example uses the RPC interface to call a function pointer on the server.
+
+ This code first opens a port with the given opcode to facilitate the
+ communication. It then copies the argument struct over to the server using the
+ ``send_n`` interface to stream arbitrary bytes. The next send operation
+ provides the server with the function pointer that will be executed. The final
+ receive operation is a no-op and simply forces the client to wait until the
+ server is done. It can be omitted if asynchronous execution is desired.
+
+ .. code-block:: c++
+
+   void rpc_host_call(void *fn, void *data, size_t size) {
+     rpc::Client::Port port = rpc::client.open<RPC_HOST_CALL>();
+     port.send_n(data, size);
+     port.send([=](rpc::Buffer *buffer) {
+       buffer->data[0] = reinterpret_cast<uintptr_t>(fn);
+     });
+     port.recv([](rpc::Buffer *) {});
+     port.close();
+   }
+
+ Server Example
+ --------------
+
+ This example shows the server-side handling of the previous client example.
+ When the server is checked, if there are any ports with pending work it will
+ check the opcode and perform the appropriate action. In this case, the action
+ is to call a function pointer provided by the client.
+
+ In this example, the server simply runs forever for brevity's sake. Because the
+ client is a GPU potentially handling several threads at once, the server needs
+ to loop over all the active threads on the GPU. We abstract this into the
+ ``lane_size`` variable, which is simply the device's warp or wavefront size.
+ The identifier is simply the thread's index into the current warp or wavefront.
+
+ .. code-block:: c++
+
+   for (;;) {
+     auto port = server.try_open(index);
+     if (!port)
+       continue;
+
+     switch (port->get_opcode()) {
+     case RPC_HOST_CALL: {
+       uint64_t sizes[LANE_SIZE];
+       void *args[LANE_SIZE];
+       port->recv_n(args, sizes, [&](uint64_t size) { return new char[size]; });
+       port->recv([&](rpc::Buffer *buffer, uint32_t id) {
+         reinterpret_cast<void (*)(void *)>(buffer->data[0])(args[id]);
+       });
+       port->send([&](rpc::Buffer *, uint32_t id) {
+         delete[] reinterpret_cast<uint8_t *>(args[id]);
+       });
+       break;
+     }
+     default:
+       port->recv([](rpc::Buffer *) {});
+       break;
+     }
+     port->close();
+   }
+
+ CUDA Server Example
+ -------------------
+
+ The following code shows an example of using the exported RPC interface along
+ with the C library to manually configure a working server using the CUDA
+ language. Other runtimes can use the presence of the ``__llvm_libc_rpc_client``
+ symbol in the GPU executable as an indicator for whether or not the server can
+ be checked.
+
+ .. code-block:: cuda
+
+   #include <cstdio>
+   #include <cstdlib>
+   #include <cuda_runtime.h>
+
+   #include <gpu-none-llvm/rpc_server.h>
+
+   [[noreturn]] void handle_error(cudaError_t err) {
+     fprintf(stderr, "CUDA error: %s\n", cudaGetErrorString(err));
+     exit(EXIT_FAILURE);
+   }
+
+   [[noreturn]] void handle_error(rpc_status_t err) {
+     fprintf(stderr, "RPC error: %d\n", err);
+     exit(EXIT_FAILURE);
+   }
+
+   // The handle to the RPC client provided by the C library.
+   extern "C" __device__ void *__llvm_libc_rpc_client;
+
+   __global__ void get_client_ptr(void **ptr) { *ptr = __llvm_libc_rpc_client; }
+
+   // Obtain the RPC client's handle from the device. The CUDA language cannot
+   // look up the symbol directly like the driver API, so we launch a kernel to
+   // read it.
+   void *get_rpc_client() {
+     void *rpc_client = nullptr;
+     void **rpc_client_d = nullptr;
+
+     if (cudaError_t err = cudaMalloc(&rpc_client_d, sizeof(void *)))
+       handle_error(err);
+     get_client_ptr<<<1, 1>>>(rpc_client_d);
+     if (cudaError_t err = cudaDeviceSynchronize())
+       handle_error(err);
+     if (cudaError_t err = cudaMemcpy(&rpc_client, rpc_client_d, sizeof(void *),
+                                      cudaMemcpyDeviceToHost))
+       handle_error(err);
+     return rpc_client;
+   }
+
+   // Routines to allocate mapped memory that both the host and the device can
+   // access asynchronously to communicate with each other.
+   void *alloc_host(size_t size, void *) {
+     void *sharable_ptr;
+     if (cudaError_t err = cudaMallocHost(&sharable_ptr, size))
+       handle_error(err);
+     return sharable_ptr;
+   }
+
+   void free_host(void *ptr, void *) {
+     if (cudaError_t err = cudaFreeHost(ptr))
+       handle_error(err);
+   }
+
+   // The device-side overload of the standard C function to call.
+   extern "C" __device__ int puts(const char *);
+
+   // Calls the C library function from the GPU C library.
+   __global__ void hello() { puts("Hello world!"); }
+
+   int main() {
+     int device = 0;
+     // Initialize the RPC server to run on a single device.
+     if (rpc_status_t err = rpc_init(/*num_device=*/1))
+       handle_error(err);
+
+     // Initialize the RPC server to run on the given device.
+     if (rpc_status_t err =
+             rpc_server_init(device, RPC_MAXIMUM_PORT_COUNT,
+                             /*warp_size=*/32, alloc_host, /*data=*/nullptr))
+       handle_error(err);
+
+     // Initialize the RPC client by copying the buffer to the device's handle.
+     void *rpc_client = get_rpc_client();
+     if (cudaError_t err =
+             cudaMemcpy(rpc_client, rpc_get_client_buffer(device),
+                        rpc_get_client_size(), cudaMemcpyHostToDevice))
+       handle_error(err);
+
+     cudaStream_t stream;
+     if (cudaError_t err = cudaStreamCreate(&stream))
+       handle_error(err);
+
+     // Execute the kernel.
+     hello<<<1, 1, 0, stream>>>();
+
+     // While the kernel is executing, check the RPC server for work to do.
+     while (cudaStreamQuery(stream) == cudaErrorNotReady)
+       if (rpc_status_t err = rpc_handle_server(device))
+         handle_error(err);
+
+     // Shut down the server running on the given device.
+     if (rpc_status_t err =
+             rpc_server_shutdown(device, free_host, /*data=*/nullptr))
+       handle_error(err);
+
+     // Shut down the entire RPC server interface.
+     if (rpc_status_t err = rpc_shutdown())
+       handle_error(err);
+
+     return EXIT_SUCCESS;
+   }
+
+ The above code must be compiled in CUDA's relocatable device code mode and with
+ the advanced offloading driver to link in the library. Currently this can be
+ done with the following invocation. Using LTO avoids the overhead normally
+ associated with relocatable device code linking.
+
+ .. code-block:: sh
+
+   $> clang++ -x cuda rpc.cpp --offload-arch=native -fgpu-rdc -lcudart -lcgpu \
+        -I<install-path>/include -L<install-path>/lib -lllvmlibc_rpc_server \
+        -O3 -foffload-lto -o hello
+   $> ./hello
+   Hello world!

Extensions
----------