<!--
SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
SPDX-License-Identifier: Apache-2.0

Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
-->

# Dynamo Connect

Dynamo Connect provides a Pythonic interface to the NIXL-based RDMA subsystem via a set of Python classes.
The primary goal of this library is to simplify the integration of NIXL-based RDMA into inference applications.

All operations using the Connect library begin with the [`Connector`](connector.md) class and the type of operation required.
There are four supported operation types:

 1. **Register local readable memory**:

    Register local memory buffer(s) with the RDMA subsystem so that a remote worker can read from them.

 2. **Register local writable memory**:

    Register local memory buffer(s) with the RDMA subsystem so that a remote worker can write to them.

 3. **Read from registered, remote memory**:

    Read remote memory buffer(s), registered by a remote worker to be readable, into local memory buffer(s).

 4. **Write to registered, remote memory**:

    Write local memory buffer(s) to remote memory buffer(s) registered by a remote worker to be writable.

Correctly pairing these operations enables high-throughput GPU Direct RDMA data transfers.
Given the list above, the valid pairings are 1 & 3 and 2 & 4:
one side registers a "(read|write)-able" buffer while the other performs the matching "(read|write)" operation.
Specifically, a read operation must be paired with a readable operation, and a write operation must be paired with a writable operation.
The sequence diagram below illustrates the exchange, and a code sketch follows it.

```mermaid
sequenceDiagram
    participant LocalWorker
    participant RemoteWorker
    participant NIXL

    LocalWorker ->> NIXL: Register memory (Descriptor)
    RemoteWorker ->> NIXL: Register memory (Descriptor)
    LocalWorker ->> LocalWorker: Create Readable/WritableOperation
    LocalWorker ->> RemoteWorker: Send RDMA metadata (via HTTP/TCP+NATS)
    RemoteWorker ->> NIXL: Begin Read/WriteOperation with metadata
    NIXL -->> RemoteWorker: Data transfer (RDMA)
    RemoteWorker -->> LocalWorker: Notify completion (unblock awaiter)
```
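
To make the pairing concrete, here is a minimal sketch of operations 1 and 3: a readable registration on one worker and a read on the other. This is illustrative rather than authoritative; the `Connector` and `Descriptor` classes come from the class list below, but the import path, the method names `create_readable`, `begin_read`, `metadata`, and `wait_for_completion`, and the `send_to_peer` side-channel helper are assumptions modeled on the `create_writable`/`begin_write` calls shown in the diagrams.

```python
# Minimal sketch of pairing operations 1 & 3. The import path, method
# names, and signatures below are assumptions, not the confirmed API.
import torch

from dynamo.connect import Connector, Descriptor  # assumed import path


async def send_to_peer(metadata) -> None:
    """Placeholder for the application's side channel (HTTP, NATS, ...)."""
    raise NotImplementedError


async def source_worker(connector: Connector) -> None:
    # Operation 1: register a local GPU buffer that a remote worker may read.
    tensor = torch.full((1024,), 42.0, device="cuda")
    readable = connector.create_readable(Descriptor(tensor))  # assumed name
    # Ship the operation's RDMA metadata to the peer out-of-band.
    await send_to_peer(readable.metadata())  # assumed accessor
    # Unblocks once the remote worker's read completes.
    await readable.wait_for_completion()  # assumed completion API


async def sink_worker(connector: Connector, metadata) -> None:
    # Operation 3: read the remote buffer into local memory via the metadata.
    local = torch.empty((1024,), device="cuda")
    read_op = connector.begin_read(metadata, Descriptor(local))  # assumed
    await read_op.wait_for_completion()  # `local` now holds the remote data
```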

## Examples

### Generic Example

In the diagram below, Local creates a [`WritableOperation`](writable_operation.md) intended to receive data from Remote.
Local then sends metadata about the requested RDMA operation to Remote.
Remote then uses the metadata to create a [`WriteOperation`](write_operation.md), which performs the GPU Direct RDMA memory transfer from Remote's GPU memory to Local's GPU memory.
A sketch of Local's side of this exchange follows the diagram.

```mermaid
---
title: Write Operation Between Two Workers
---
flowchart LR
    c1[Remote] --"3: .begin_write()"--- WriteOperation
    WriteOperation e1@=="4: GPU Direct RDMA"==> WritableOperation
    WritableOperation --"1: .create_writable()"--- c2[Local]
    c2 e2@--"2: RDMA Metadata via HTTP"--> c1
    e1@{ animate: true; }
    e2@{ animate: true; }
```
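
In code, Local's side of this flow might look like the sketch below, under the same caveats as the earlier sketch: `create_writable` appears in step 1 of the diagram, while the metadata accessor, the HTTP helper, and the completion await are illustrative assumptions. Remote's side (step 3, `begin_write`) is sketched in the multimodal code examples below.

```python
# Sketch of the Local side of the write flow above; names marked
# "assumed" or "hypothetical" are illustrative, not the confirmed API.
import torch

from dynamo.connect import Connector, Descriptor  # assumed import path


async def post_metadata(metadata) -> None:
    """Hypothetical stand-in for the application's HTTP side channel."""
    raise NotImplementedError


async def local_worker(connector: Connector) -> None:
    # Step 1: reserve GPU memory for the incoming data and register it
    # with the RDMA subsystem as writable by a remote worker.
    buffer = torch.empty((4096,), device="cuda")
    writable = connector.create_writable(Descriptor(buffer))
    # Step 2: send the operation's RDMA metadata to Remote over HTTP.
    await post_metadata(writable.metadata())  # assumed accessor
    # Step 4 arrives via GPU Direct RDMA; awaiting unblocks once Remote's
    # write has completed.
    await writable.wait_for_completion()  # assumed completion API
    # `buffer` now contains the data Remote wrote.
```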

### Multimodal Example

In the case of the [Dynamo Multimodal Disaggregated Example](../../examples/multimodal/README.md):

 1. The HTTP frontend accepts a text prompt and a URL to an image.

 2. The prompt and URL are then enqueued with the Processor before being dispatched to the first available Decode Worker.

 3. The Decode Worker then requests that a Prefill Worker provide key-value data for the LLM powering it.

 4. The Prefill Worker then requests that the image be processed and provided as embeddings by the Encode Worker.

 5. The Encode Worker acquires the image, processes it, performs inference on it using a specialized vision model, and provides the resulting embeddings to the Prefill Worker.

 6. The Prefill Worker receives the embeddings from the Encode Worker, generates a key-value cache (KV$) update for the Decode Worker's LLM, and writes the update directly to the GPU memory reserved for the data.

 7. Finally, the Decode Worker performs the requested inference.

```mermaid
---
title: Multimodal Disaggregated Workflow
---
flowchart LR
    p0[HTTP Frontend] i0@--"text prompt"-->p1[Processor]
    p0 i1@--"url"-->p1
    p1 i2@--"prompt"-->dw[Decode Worker]
    p1 i3@--"url"-->dw
    dw i4@--"prompt"-->pw[Prefill Worker]
    dw i5@--"url"-->pw
    pw i6@--"url"-->ew[Encode Worker]
    ew o0@=="image embeddings"==>pw
    pw o1@=="kv_cache updates"==>dw
    dw o2@--"inference results"-->p0

    i0@{ animate: true; }
    i1@{ animate: true; }
    i2@{ animate: true; }
    i3@{ animate: true; }
    i4@{ animate: true; }
    i5@{ animate: true; }
    i6@{ animate: true; }
    o0@{ animate: true; }
    o1@{ animate: true; }
    o2@{ animate: true; }
```

> [!Note]
> In this example, it is the data transfer between the Prefill Worker and the Encode Worker that utilizes the Dynamo Connect library.
> The KV cache transfer between the Decode Worker and the Prefill Worker utilizes the NIXL-based RDMA subsystem directly, without the Dynamo Connect library.

#### Code Examples

See [prefill_worker](../../examples/multimodal/components/prefill_worker.py#L199) or [decode_worker](../../examples/multimodal/components/decode_worker.py#L239)
for how they coordinate directly with the Encode Worker by creating a [`WritableOperation`](writable_operation.md),
sending the operation's metadata via Dynamo's round-robin dispatcher, and awaiting the operation's completion before making use of the transferred data.

See [encode_worker](../../examples/multimodal/components/encode_worker.py#L190)
for how the resulting embeddings are registered with the RDMA subsystem by creating a [`Descriptor`](descriptor.md),
how a [`WriteOperation`](write_operation.md) is created using the metadata provided by the requesting worker,
and how the worker awaits the transfer's completion before yielding a response.
A hedged sketch of that writer side appears below.

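The flow in this sketch, wrapping the embeddings in a `Descriptor`, creating a `WriteOperation` from the requester's metadata, and awaiting completion before responding, follows the linked source; the import path, method names, and exact signatures remain assumptions.

```python
# Sketch of the remote (writer) side, loosely following encode_worker;
# the import path and signatures are assumptions, not the confirmed API.
import torch

from dynamo.connect import Connector, Descriptor  # assumed import path


async def write_embeddings(
    connector: Connector,
    embeddings: torch.Tensor,
    request_metadata,  # RDMA metadata received from the requesting worker
) -> None:
    # Register the freshly computed embeddings with the RDMA subsystem.
    descriptor = Descriptor(embeddings)
    # Begin the write against the requester's WritableOperation using the
    # metadata it sent; `begin_write` is taken from the diagrams, and its
    # exact signature is assumed.
    write_op = connector.begin_write(descriptor, request_metadata)
    # Await the GPU Direct RDMA transfer before yielding a response, so the
    # requester never observes a partially written buffer.
    await write_op.wait_for_completion()  # assumed completion API
```
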
## Python Classes

 - [Connector](connector.md)
 - [Descriptor](descriptor.md)
 - [Device](device.md)
 - [ReadOperation](read_operation.md)
 - [ReadableOperation](readable_operation.md)
 - [SerializedRequest](serialized_request.md)
 - [WritableOperation](writable_operation.md)
 - [WriteOperation](write_operation.md)

## References

 - [NVIDIA Dynamo](https://developer.nvidia.com/dynamo) @ [GitHub](https://github.com/ai-dynamo/dynamo)
 - [NVIDIA Dynamo Connect](https://github.com/ai-dynamo/dynamo/tree/main/components/connect)
 - [NVIDIA Inference Transfer Library (NIXL)](https://developer.nvidia.com/blog/introducing-nvidia-dynamo-a-low-latency-distributed-inference-framework-for-scaling-reasoning-ai-models/#nvidia_inference_transfer_library_nixl_low-latency_hardware-agnostic_communication%C2%A0) @ [GitHub](https://github.com/ai-dynamo/nixl)
 - [Dynamo Multimodal Example](https://github.com/ai-dynamo/dynamo/tree/main/examples/multimodal)
 - [NVIDIA GPU Direct](https://developer.nvidia.com/gpudirect)