|
| 1 | +<!-- |
| 2 | +SPDX-FileCopyrightText: Copyright (c) 2024-2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved. |
| 3 | +SPDX-License-Identifier: Apache-2.0 |
| 4 | +
|
| 5 | +Licensed under the Apache License, Version 2.0 (the "License"); |
| 6 | +you may not use this file except in compliance with the License. |
| 7 | +You may obtain a copy of the License at |
| 8 | +
|
| 9 | +https://www.apache.org/licenses/LICENSE-2.0 |
| 10 | +
|
| 11 | +Unless required by applicable law or agreed to in writing, software |
| 12 | +distributed under the License is distributed on an "AS IS" BASIS, |
| 13 | +WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. |
| 14 | +See the License for the specific language governing permissions and |
| 15 | +limitations under the License. |
| 16 | +--> |
| 17 | + |
| 18 | +# Router Standalone |
| 19 | + |
| 20 | +A toy implementation of KvRouter that demonstrates standalone usage without dependency on the dynamo runtime, etcd control plane, or nats event plane. |
| 21 | + |
| 22 | +## Overview |
| 23 | + |
| 24 | +This example shows how to use KvRouter in a standalone fashion to intelligently route requests across multiple vLLM workers based on KV cache overlap and load metrics. The router maintains a view of each worker's cached blocks and routes new requests to the worker with the best combination of cache overlap and available capacity. |
| 25 | + |
| 26 | +> [!Tip] |
| 27 | +> The main focus should be put on `router.py` as it contains the bulk of the non-boilerplate code and core routing logic. |
| 28 | +
|
| 29 | +## How It Works |
| 30 | + |
| 31 | +### Core Architecture |
| 32 | + |
| 33 | +The router uses a **RadixTree** data structure (written in Rust) to efficiently track which blocks each worker has cached. When a new request arrives, the router: |
| 34 | + |
| 35 | +1. Uses `find_matches` to calculate overlap scores (number of matching blocks) between the request and each worker's cached blocks |
| 36 | +2. Combines this with current load metrics to select the optimal worker |
| 37 | +3. Routes the request to the chosen worker for processing |
| 38 | + |
| 39 | +### Event-Driven Updates |
| 40 | + |
| 41 | +The router receives two types of events from vLLM engines: |
| 42 | + |
| 43 | +1. **KV Events**: Emitted automatically by vLLM engines when blocks are cached/evicted |
| 44 | +2. **Load Metrics**: GPU usage percentage and waiting request count via custom callbacks |
| 45 | + |
| 46 | +These events keep the router's view of worker state up-to-date in real-time. |
| 47 | + |
| 48 | +### Alternative: Pure Predictive Routing |
| 49 | + |
| 50 | +While not implemented in this example, the router can also operate in a pure predictive mode, estimating the radix tree state and loads based solely on the requests it receives, without relying on backend events. This requires simulating / mocking the block managing (e.g. eviction) and the scheduling policies of the backend engine. This is not recommended as there is no real-time feedback from the engines, and the router state may drift out of sync with the engine states. Nevertheless, this is WIP and can be supported in the future via our mocker engines. |
| 51 | + |
| 52 | +## Components |
| 53 | + |
| 54 | +> [!Note] |
| 55 | +> This is a standalone toy implementation created for pedagogical purposes to demonstrate the core KvRouter concepts in isolation. |
| 56 | +> Our default dynamo router is already very efficient and uses NATS for event communication and etcd for endpoint registration. |
| 57 | +> This example intentionally avoids these production components to provide a simpler, self-contained demonstration of the routing logic and cache overlap mechanics. |
| 58 | +> |
| 59 | +> The toy communication pattern is as follows: |
| 60 | +> - **OpenAI Compatible Frontend** – FastAPI application serving OpenAI compatible HTTP API. |
| 61 | +> - **Router** – Standalone FastAPI endpoint for best worker selection, with core routines implemented in Rust exposed via Python bindings. |
| 62 | +> - **Workers** – Served in-process within the frontend application to reduce complexity and boilerplate, rather than as separate endpoints. |
| 63 | +
|
| 64 | +### `router.py` |
| 65 | +- **KvRouter**: Core routing logic using RadixTree |
| 66 | +- Subscribes to KV cache events and load metrics from workers |
| 67 | +- Implements `get_best_worker()` to select optimal routing destination |
| 68 | +- Runs background tasks to periodically update worker states |
| 69 | + |
| 70 | +### `worker.py` |
| 71 | +- **VllmWorkers**: Manages multiple vLLM worker processes |
| 72 | +- Each worker runs on a separate port with KV cache event emission enabled |
| 73 | +- Provides `direct()` method for sending requests to specific workers |
| 74 | +- Handles worker lifecycle and configuration |
| 75 | + |
| 76 | +### `api.py` |
| 77 | +- **RouterAPI**: Minimal FastAPI server providing OpenAI-compatible chat completions endpoint |
| 78 | +- Enables in-process communication between router and workers |
| 79 | +- Can be easily modified to use external communication (FastAPI clients, dynamo endpoints, etc.) |
| 80 | +- Integrates with vLLM's OpenAI serving components for request preprocessing and response formatting |
| 81 | + |
| 82 | +### `perf.sh` |
| 83 | +- Benchmarking script using `genai-perf` to test the router setup |
| 84 | +- Configured for streaming chat completions with synthetic workloads |
| 85 | +- Tests concurrent requests to evaluate routing performance |
| 86 | + |
| 87 | +## Usage |
| 88 | + |
| 89 | +1. **Install latest vLLM**: |
| 90 | + ```bash |
| 91 | + uv pip uninstall ai-dynamo-vllm |
| 92 | + uv pip install vllm==0.9.0 |
| 93 | + ``` |
| 94 | + *Note: This uninstalls the local vLLM patch (`ai-dynamo-vllm`) and replaces it with the latest standard vLLM package.* |
| 95 | + |
| 96 | +2. **Start the router API**: |
| 97 | + For example: |
| 98 | + ```bash |
| 99 | + python api.py \ |
| 100 | + --model deepseek-ai/DeepSeek-R1-Distill-Llama-8B \ |
| 101 | + --num-workers 4 \ |
| 102 | + --block-size 64 \ |
| 103 | + --base-kv-events-port 5557 \ |
| 104 | + --base-metrics-port 5657 \ |
| 105 | + --router-port 7000 \ |
| 106 | + --http-port 8000 |
| 107 | + ``` |
| 108 | + |
| 109 | +3. **Ping the endpoint (optional)**: |
| 110 | + ```bash |
| 111 | + ./ping.sh |
| 112 | + ``` |
| 113 | + |
| 114 | +4. **Run performance benchmark**: |
| 115 | + ```bash |
| 116 | + ./perf.sh |
| 117 | + ``` |
0 commit comments