Commit ee91fd7

Moved ASCII image to README.md, and added grafana image
1 parent 9a77f72 commit ee91fd7

4 files changed: +77 -59 lines changed
deploy/metrics/README.md

Lines changed: 68 additions & 13 deletions
@@ -7,27 +7,81 @@ This directory contains configuration for visualizing metrics from the metrics a
 - **Prometheus**: Collects and stores metrics from the service
 - **Grafana**: Provides visualization dashboards for the metrics
 
+## Topology
+
+Default Service Relationship Diagram:
+```
+┌─────────────┐   ┌─────────────┐   ┌─────────────┐
+│ nats-server │   │ etcd-server │   │dcgm-exporter│
+│    :4222    │   │    :2379    │   │    :9400    │
+│    :6222    │   │    :2380    │   │             │
+│    :8222    │   │             │   │             │
+└──────┬──────┘   └──────┬──────┘   └──────┬──────┘
+       │                 │                 │
+       │ :8222/varz      │ :2379/metrics   │ :9400/metrics
+       │                 │                 │
+       ▼                 │                 │
+┌─────────────┐          │                 │
+│nats-prom-exp│          │                 │
+│    :7777    │          │                 │
+│             │          │                 │
+│  /metrics   │          │                 │
+└──────┬──────┘          │                 │
+       │                 │                 │
+       │ :7777/metrics   │                 │
+       │                 │                 │
+       ▼                 ▼                 ▼
+┌─────────────────────────────────────────────────┐
+│                   prometheus                    │
+│                      :9090                      │
+│                                                 │
+│ scrapes: nats-prom-exp:7777/metrics             │
+│          etcd-server:2379/metrics               │
+│          dcgm-exporter:9400/metrics             │
+└──────────────────┬──────────────────────────────┘
+
+                   │ :9090/query API
+
+
+            ┌─────────────┐
+            │   grafana   │
+            │    :3001    │
+            │             │
+            └─────────────┘
+```
+
+Networks:
+- monitoring: nats-prom-exp, etcd-server, dcgm-exporter, prometheus, grafana
+- default: nats-server (accessible via host network)
+
 ## Getting Started
 
 1. Make sure Docker and Docker Compose are installed on your system
 
-2. Start the `components/metrics` application to begin monitoring for metric events from dynamo workers
-   and aggregating them on a prometheus metrics endpoint: `http://localhost:9091/metrics`.
+2. Start the visualization stack:
 
-3. Start worker(s) that publishes KV Cache metrics.
-   - For quick testing, `examples/rust/service_metrics/bin/server.rs` can populate dummy KV Cache metrics.
-   - For a real workflow with real data, see the KV Routing example in `examples/python_rs/llm/vllm`.
+   ```bash
+   docker compose --profile metrics up -d
+   ```
 
-4. Start the visualization stack:
+3. Web servers started. The ones that end in /metrics are in Prometheus format:
+   - Grafana: `http://localhost:3001` (default login: dynamo/dynamo)
+   - Prometheus Server: `http://localhost:9090`
+   - NATS Server: `http://localhost:8222` (monitoring endpoints: /varz, /healthz, etc.)
+   - NATS Prometheus Exporter: `http://localhost:7777/metrics`
+   - etcd Server: `http://localhost:2379/metrics`
+   - DCGM Exporter: `http://localhost:9401/metrics`
 
-   ```bash
-   docker compose --profile metrics up -d
-   ```
+4. Optionally, if you want to experiment further:
+   Start the `components/metrics` application to begin monitoring for metric events from dynamo workers
+   and aggregating them on a prometheus metrics endpoint: `http://localhost:9091/metrics`.
+
+   Then, uncomment the appropriate lines in prometheus.yml.
+
+5. Optionally, start worker(s) that publishes KV Cache metrics:
+   - For quick testing, `examples/rust/service_metrics/bin/server.rs` can populate dummy KV Cache metrics.
+   - For a real workflow with real data, see the KV Routing example in `examples/python_rs/llm/vllm`.
 
-5. Web servers started:
-   - Grafana: `http://localhost:3001` (default login: admin/admin) (started by docker compose)
-   - Prometheus Server: `http://localhost:9090` (started by docker compose)
-   - Prometheus Metrics Endpoint: `http://localhost:9091/metrics` (started by `components/metrics` application)
 
 ## Configuration
 
@@ -42,6 +96,7 @@ Note: You may need to adjust the target based on your host configuration and net
 Grafana is pre-configured with:
 - Prometheus datasource
 - Sample dashboard for visualizing service metrics
+![grafana image](./grafana1.png)
 
 ## Required Files
 
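
The endpoints listed under "Web servers started" in the README change can be spot-checked once the stack is running. The following is a minimal Python sketch and is not part of the commit: `ENDPOINTS`, `probe`, and `probe_all` are hypothetical names, the URLs are copied from the README text, and the sketch assumes each service answers a plain HTTP GET on its URL.

```python
from http.client import HTTPException
from typing import Callable, Dict
from urllib.error import URLError
from urllib.request import urlopen

# URLs copied from the README; the dictionary itself is illustrative.
ENDPOINTS: Dict[str, str] = {
    "grafana": "http://localhost:3001",
    "prometheus": "http://localhost:9090",
    "nats-server": "http://localhost:8222/varz",
    "nats-prom-exp": "http://localhost:7777/metrics",
    "etcd-server": "http://localhost:2379/metrics",
    "dcgm-exporter": "http://localhost:9401/metrics",
}


def probe(url: str, timeout: float = 2.0) -> bool:
    """Return True if an HTTP GET on `url` completes with a 2xx/3xx status."""
    try:
        with urlopen(url, timeout=timeout) as response:
            return 200 <= response.status < 400
    except (URLError, HTTPException, OSError, ValueError):
        # Connection refused, timeout, HTTP error status, or malformed URL.
        return False


def probe_all(
    endpoints: Dict[str, str],
    fetch: Callable[[str], bool] = probe,
) -> Dict[str, bool]:
    """Probe every endpoint; `fetch` is injectable so the logic can be tested offline."""
    return {name: fetch(url) for name, url in endpoints.items()}
```

Calling `probe_all(ENDPOINTS)` after `docker compose --profile metrics up -d` returns a name-to-bool mapping; a `False` entry only means the GET did not succeed from the host, not why.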

deploy/metrics/docker-compose.yml

Lines changed: 2 additions & 45 deletions
@@ -13,51 +13,6 @@
 # See the License for the specific language governing permissions and
 # limitations under the License.
 
-#
-# Service Relationship Diagram:
-#
-# ┌─────────────┐   ┌─────────────┐   ┌─────────────┐
-# │ nats-server │   │ etcd-server │   │dcgm-exporter│
-# │    :4222    │   │    :2379    │   │    :9400    │
-# │    :6222    │   │    :2380    │   │             │
-# │    :8222    │   │             │   │             │
-# └──────┬──────┘   └──────┬──────┘   └──────┬──────┘
-#        │                 │                 │
-#        │ :8222/varz      │ :2379/metrics   │ :9400/metrics
-#        │                 │                 │
-#        ▼                 │                 │
-# ┌─────────────┐          │                 │
-# │nats-prom-exp│          │                 │
-# │    :7777    │          │                 │
-# │             │          │                 │
-# │  /metrics   │          │                 │
-# └──────┬──────┘          │                 │
-#        │                 │                 │
-#        │ :7777/metrics   │                 │
-#        │                 │                 │
-#        ▼                 ▼                 ▼
-# ┌─────────────────────────────────────────────────┐
-# │                   prometheus                    │
-# │                      :9090                      │
-# │                                                 │
-# │ scrapes: nats-prom-exp:7777/metrics             │
-# │          etcd-server:2379/metrics               │
-# │          dcgm-exporter:9400/metrics             │
-# └──────────────────┬──────────────────────────────┘
-#
-#                    │ :9090/query API
-#
-#
-#             ┌─────────────┐
-#             │   grafana   │
-#             │    :3001    │
-#             │             │
-#             └─────────────┘
-#
-# Networks:
-# - monitoring: nats-prom-exp, etcd-server, dcgm-exporter, prometheus, grafana
-# - default: nats-server (accessible via host network)
-#
 networks:
   server:
     driver: bridge
@@ -106,6 +61,8 @@ services:
     image: nvidia/dcgm-exporter:4.2.3-4.1.3-ubi9
     ports:
       - 9401:9400
+    cap_add:
+      - SYS_ADMIN
     deploy:
       resources:
         reservations:

deploy/metrics/grafana1.png

241 KB

deploy/metrics/prometheus.yml

Lines changed: 7 additions & 1 deletion
@@ -31,10 +31,16 @@ scrape_configs:
   - job_name: 'dcgm-exporter'
     scrape_interval: 5s
     static_configs:
-      - targets: ['dcgm-exporter:9400'] # on the "monitoring" network
+      - targets: ['dcgm-exporter:9401'] # on the "monitoring" network
 
   # Uncomment to see its own Prometheus metrics
   # - job_name: 'prometheus'
   #   scrape_interval: 5s
   #   static_configs:
   #     - targets: ['prometheus:9090'] # on the "monitoring" network
+
+  # Uncomment to see the metrics-aggregation-service metrics
+  # - job_name: 'metrics-aggregation-service'
+  #   scrape_interval: 2s
+  #   static_configs:
+  #     - targets: ['host.docker.internal:9091'] # metrics aggregation service on host
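
The `9401:9400` entry in docker-compose.yml and the `dcgm-exporter` scrape target above refer to two different ports. In Docker Compose's short `HOST:CONTAINER` port syntax, the first number is published on the host and the second is the port inside the container; services on the same Compose network reach each other on the container port. The following is a minimal Python sketch of that distinction, not part of the commit: `PortMapping`, `parse_port_mapping`, and `scrape_targets` are hypothetical helpers, and the sketch assumes the exporter listens on container port 9400, as the mapping indicates.

```python
from typing import Dict, NamedTuple


class PortMapping(NamedTuple):
    host_port: int
    container_port: int


def parse_port_mapping(mapping: str) -> PortMapping:
    """Parse the short Compose form "HOST:CONTAINER" (no IP, range, or protocol suffix)."""
    host, container = mapping.split(":")
    return PortMapping(int(host), int(container))


def scrape_targets(service: str, mapping: str) -> Dict[str, str]:
    """Return the address of `service` as seen from the Compose network and from the host."""
    ports = parse_port_mapping(mapping)
    return {
        # Containers on the same network use the service name and the container port.
        "from_compose_network": f"{service}:{ports.container_port}",
        # Processes on the host use the published host port.
        "from_host": f"localhost:{ports.host_port}",
    }


print(scrape_targets("dcgm-exporter", "9401:9400"))
```

Under the stated assumption, a Prometheus container on the shared network would reach the exporter at `dcgm-exporter:9400`, while `localhost:9401` is the address from the host, which matches the URL given in the README.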
