-
Notifications
You must be signed in to change notification settings - Fork 419
feat(FT): Enable Prometheus and Grafana in the metrics group, running… #1488
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
… in the monitoring network.
WalkthroughThe metrics stack's Docker Compose configuration was expanded with new services, explicit network isolation, and detailed comments. Service images were pinned to specific versions, dependencies and ports were clarified, and a dedicated monitoring network was introduced. Prometheus and Grafana configurations were updated to reflect new service endpoints and scrape intervals. Changes
Sequence Diagram(s)sequenceDiagram
participant User
participant Grafana
participant Prometheus
participant nats-prometheus-exporter
participant dcgm-exporter
participant etcd-server
User->>Grafana: Access dashboards (port 3001)
Grafana->>Prometheus: Query metrics (http://prometheus:9090)
Prometheus->>nats-prometheus-exporter: Scrape metrics (port 7777)
Prometheus->>dcgm-exporter: Scrape metrics (port 9401)
Prometheus->>etcd-server: Scrape metrics (port 2379)
Poem
📜 Recent review detailsConfiguration used: .coderabbit.yaml 📒 Files selected for processing (2)
🚧 Files skipped from review as they are similar to previous changes (2)
⏰ Context from checks skipped due to timeout of 90000ms (1)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. 🪧 TipsChatThere are 3 ways to chat with CodeRabbit:
SupportNeed help? Create a ticket on our support page for assistance with any issues or questions. Note: Be mindful of the bot's finite context window. It's strongly recommended to break down tasks such as reading entire modules into smaller chunks. For a focused discussion, use review comments to chat about specific files and their changes, instead of using the PR comments. CodeRabbit Commands (Invoked using PR comments)
Other keywords and placeholders
Documentation and Community
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 2
🧹 Nitpick comments (3)
deploy/metrics/docker-compose.yml (3)
16-63
: Consider streamlining the in-file service diagram
The ASCII diagram is very helpful for onboarding but can become stale and lengthen the Compose file. You may want to move it to a separateREADME.md
or architecture doc and reference it here for clarity.
117-117
: Remove trailing whitespace
Line 117 (volumes:
) contains trailing spaces that trigger pre-commit hook failures. Remove the extra whitespace at end of line.🧰 Tools
🪛 YAMLlint (1.37.1)
[error] 117-117: trailing spaces
(trailing-spaces)
158-158
: Resolve the stale TODO on networking
TheTODO: Use more explicit networking setup when metrics is containerized
comment is outdated given the explicit network sections. Please remove or address this before merge.
📜 Review details
Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (3)
deploy/metrics/docker-compose.yml
(2 hunks)deploy/metrics/grafana-datasources.yml
(1 hunks)deploy/metrics/prometheus.yml
(1 hunks)
🧰 Additional context used
🪛 YAMLlint (1.37.1)
deploy/metrics/docker-compose.yml
[warning] 111-111: wrong indentation: expected 6 but found 5
(indentation)
[error] 117-117: trailing spaces
(trailing-spaces)
[warning] 130-130: wrong indentation: expected 6 but found 5
(indentation)
[warning] 132-132: wrong indentation: expected 6 but found 5
(indentation)
[warning] 160-160: wrong indentation: expected 6 but found 5
(indentation)
[warning] 162-162: wrong indentation: expected 6 but found 5
(indentation)
🪛 GitHub Actions: Pre Merge Validation of (ai-dynamo/dynamo/refs/pull/1488/merge) by keivenchang.
deploy/metrics/docker-compose.yml
[error] 114-114: Pre-commit hook 'trailing-whitespace' failed. Trailing whitespace was found and automatically fixed in this file.
⏰ Context from checks skipped due to timeout of 90000ms (1)
- GitHub Check: Build and Test - vllm
🔇 Additional comments (5)
deploy/metrics/grafana-datasources.yml (1)
22-22
: URL updated to Docker network hostname
Switching fromlocalhost
toprometheus
ensures Grafana reaches Prometheus over themonitoring
network.deploy/metrics/prometheus.yml (3)
17-18
: Reevaluate global scrape and evaluation intervals
Increasing both intervals to 10s reduces Prometheus load but may impact metric freshness. Confirm that a 10s resolution meets your observability SLOs.
21-24
: Validatenats-prometheus-exporter
scrape configuration
Scraping every 2s fromnats-prometheus-exporter:7777
aligns with the new monitoring network setup. Ensure the service name matches the Docker Compose hostname.
31-34
: Confirmdcgm-exporter
job setup
Thedcgm-exporter:9400
target and 5s interval correctly reflect the container's metrics port on the monitoring network.deploy/metrics/docker-compose.yml (1)
109-110
: Verify NVIDIA runtime support in Compose v3
Usingruntime: nvidia
is not supported in newer Compose specs (v3+). Consider replacing it withdevice_requests
to request GPU devices underdeploy.resources.reservations.devices
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 3
🧹 Nitpick comments (4)
deploy/metrics/docker-compose.yml (4)
16-61
: Align ASCII diagram with actual service names and network topologyThe diagram refers to
nats-prom-exp
but the service is namednats-prometheus-exporter
, and it describesnats-server
on a host network, whereas the Compose file uses the default bridge network. Please sync the ASCII art and network description with the real service names and the actual network mode.
85-95
: Add restart policy and healthcheck to exportersTo improve resilience, add
restart: unless-stopped
and a basichealthcheck
fornats-prometheus-exporter
. This ensures Prometheus scrapes only healthy endpoints and restarts crashed exporters.
115-137
: Enhance Prometheus startup reliability
depends_on
only dictates container start order—it doesn’t wait for service readiness. Addhealthcheck
blocks fordcgm-exporter
andnats-prometheus-exporter
, then usecondition: service_healthy
so Prometheus only starts once exporters are healthy.
166-168
: Define external network intentIf the
monitoring
network is shared across multiple Compose projects, consider declaring it asexternal: true
. This prevents accidental redeclarations and ensures a single bridge network is reused.
📜 Review details
Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (2)
deploy/metrics/docker-compose.yml
(2 hunks)deploy/metrics/prometheus.yml
(1 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
- deploy/metrics/prometheus.yml
⏰ Context from checks skipped due to timeout of 90000ms (1)
- GitHub Check: Build and Test - vllm
🔇 Additional comments (2)
deploy/metrics/docker-compose.yml (2)
62-63
: Good practice: images are pinned to specific versionsPinning images prevents unexpected breaking changes. Ensure you have a process in place to update these versions regularly and scan for vulnerabilities.
73-81
: Profile flag missing on etcd-server
etcd-server
is connected to themonitoring
network but lacksprofiles: [metrics]
. If it’s only meant for metrics scraping, add the profile; otherwise confirm it should run unconditionally.
image: grafana/grafana-enterprise:12.0.1 | ||
container_name: grafana | ||
volumes: | ||
- ./metrics/grafana.json:/etc/grafana/provisioning/dashboards/llm-worker-dashboard.json | ||
- ./metrics/grafana-datasources.yml:/etc/grafana/provisioning/datasources/datasources.yml | ||
- ./metrics/grafana-dashboard-providers.yml:/etc/grafana/provisioning/dashboards/dashboard-providers.yml | ||
- ./grafana.json:/etc/grafana/provisioning/dashboards/llm-worker-dashboard.json | ||
- ./grafana-datasources.yml:/etc/grafana/provisioning/datasources/datasources.yml | ||
- ./grafana-dashboard-providers.yml:/etc/grafana/provisioning/dashboards/dashboard-providers.yml | ||
environment: | ||
# Port 3000 is used by "dynamo serve", so use 3001 | ||
# Port 3000 is already used by "dynamo serve", so use 3001 | ||
- GF_SERVER_HTTP_PORT=3001 | ||
- GF_SECURITY_ADMIN_USER=admin | ||
- GF_SECURITY_ADMIN_PASSWORD=admin | ||
- GF_SECURITY_ADMIN_USER=dynamo | ||
- GF_SECURITY_ADMIN_PASSWORD=dynamo | ||
- GF_USERS_ALLOW_SIGN_UP=false | ||
- GF_INSTALL_PLUGINS=grafana-piechart-panel | ||
# Default min interval is 5s, but can be configured lower | ||
- GF_DASHBOARDS_MIN_REFRESH_INTERVAL=2s | ||
restart: unless-stopped | ||
# TODO: Use more explicit networking setup when metrics is containerized | ||
#ports: | ||
# - "3001:3001" | ||
#networks: | ||
# - monitoring | ||
network_mode: "host" | ||
ports: | ||
- "3001:3001" | ||
networks: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🛠️ Refactor suggestion
Secure Grafana credentials and port configuration
Changing Grafana’s HTTP port to 3001
is fine, but verify it doesn’t conflict with other host services. Avoid hardcoding admin credentials in the Compose file; use Docker secrets or external environment variables to manage them securely.
🤖 Prompt for AI Agents
In deploy/metrics/docker-compose.yml around lines 142 to 160, the Grafana
service hardcodes admin credentials and sets the HTTP port to 3001 without
verifying conflicts. To fix this, remove the hardcoded GF_SECURITY_ADMIN_USER
and GF_SECURITY_ADMIN_PASSWORD environment variables and instead reference them
via Docker secrets or external environment variables. Also, confirm that port
3001 is free on the host before assigning it to Grafana to avoid conflicts with
other services.
deploy/metrics/docker-compose.yml
Outdated
dcgm-exporter: | ||
image: nvidia/dcgm-exporter:4.2.3-4.1.3-ubi9 | ||
ports: | ||
- 9401:9400 | ||
deploy: | ||
resources: | ||
reservations: | ||
devices: | ||
- driver: nvidia | ||
count: all | ||
capabilities: [gpu] | ||
environment: | ||
- NVIDIA_VISIBLE_DEVICES=all # Make all GPUs visible to the container | ||
runtime: nvidia # Specify the NVIDIA runtime | ||
networks: | ||
- monitoring | ||
profiles: [metrics] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fix port mapping inconsistency for dcgm-exporter
The diagram and comments reference port 9400
, but the Compose file maps host port 9401
to container port 9400
. Align the documentation with this mapping or adjust the binding to use 9400:9400
for consistency.
🤖 Prompt for AI Agents
In deploy/metrics/docker-compose.yml around lines 96 to 112, the port mapping
for dcgm-exporter maps host port 9401 to container port 9400, which is
inconsistent with the diagram and comments referencing port 9400. Fix this by
changing the port mapping to 9400:9400 to align the host and container ports and
maintain consistency with the documentation.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 2
♻️ Duplicate comments (2)
deploy/metrics/docker-compose.yml (2)
105-109
: Align dcgm-exporter port mapping with documentation
The ASCII diagram shows port9400
, but the Compose file binds9401:9400
. Either update the diagram to reference9401
or revert to"9400:9400"
for consistency.
160-162
: Avoid hardcoding Grafana admin credentials
StoringGF_SECURITY_ADMIN_USER
/GF_SECURITY_ADMIN_PASSWORD
in cleartext poses a security risk. Use Docker secrets or inject via environment variables at deploy time.
🧹 Nitpick comments (6)
deploy/metrics/docker-compose.yml (6)
16-38
: Improve ASCII diagram consistency
The diagram uses abbreviations (e.g.,nats-prom-exp
) that don't match your service names exactly (nats-prometheus-exporter
). Align labels and port references with actual Compose definitions for clarity.
123-128
: Mount Prometheus config read-only
Make theprometheus.yml
mount read-only to prevent runtime changes:- - ./prometheus.yml:/etc/prometheus/prometheus.yml + - ./prometheus.yml:/etc/prometheus/prometheus.yml:ro
136-139
: Restrict Prometheus port exposure
Binding"9090:9090"
publicly opens the Prometheus UI. Consider127.0.0.1:9090:9090
or useexpose: ["9090"]
if host-wide access isn’t needed.
140-146
: Enhance startup ordering with healthchecks
depends_on
controls container start order but doesn’t wait for services to become healthy. Addhealthcheck
blocks fordcgm-exporter
,nats-prometheus-exporter
, andetcd-server
, then usecondition: service_healthy
.
154-156
: Mount Grafana provisioning files read-only
Prevent accidental edits by marking dashboard and datasource mounts as read-only:- - ./grafana-datasources.yml:/etc/grafana/provisioning/datasources/datasources.yml + - ./grafana-datasources.yml:/etc/grafana/provisioning/datasources/datasources.yml:ro
167-168
: Consider internal-only Grafana access
Binding3001:3001
publicly may not be necessary. Useexpose: ["3001"]
or bind to localhost (127.0.0.1:3001:3001
) to restrict external access.
📜 Review details
Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
deploy/metrics/docker-compose.yml
(2 hunks)
⏰ Context from checks skipped due to timeout of 90000ms (1)
- GitHub Check: Build and Test - vllm
🔇 Additional comments (3)
deploy/metrics/docker-compose.yml (3)
61-65
: Networks defined with clear segmentation
Declaring bothserver
andmonitoring
bridge networks provides proper isolation between core and metrics services.
81-86
: Consider isolating etcd metrics endpoint
Port2379
serves both etcd API and metrics but is publicly bound. If external etcd access isn’t required, bind it to localhost (127.0.0.1:2379:2379
) and useexpose
for container-only access.
169-170
: Grafana attached to monitoring network
Adding Grafana to themonitoring
network ensures it can reach Prometheus without exposing core services.
image: nats:2.11.4 | ||
command: [ "-js", "--trace", "-m", "8222" ] | ||
ports: | ||
- 4222:4222 | ||
- 6222:6222 | ||
- 8222:8222 | ||
- 8222:8222 # the endpoints include /varz, /healthz, ... |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Restrict NATS metrics port exposure
Binding port 8222
publicly (- 8222:8222
) exposes internal metrics on all host interfaces. Switch to expose: ["8222"]
or bind to localhost (127.0.0.1:8222:8222
) to limit access.
🤖 Prompt for AI Agents
In deploy/metrics/docker-compose.yml around lines 70 to 75, the NATS metrics
port 8222 is currently bound publicly, exposing internal metrics on all host
interfaces. To restrict access, replace the public port binding '- 8222:8222'
with either 'expose: ["8222"]' to limit exposure to linked services only, or
bind the port to localhost by changing it to '127.0.0.1:8222:8222' to restrict
access to the local machine.
nats-prometheus-exporter: | ||
image: natsio/prometheus-nats-exporter:0.17.3 | ||
command: ["-varz", "-connz", "-routez", "-subz", "-gatewayz", "-leafz", "-jsz=all", "http://nats-server:8222"] | ||
ports: | ||
- 7777:7777 | ||
networks: | ||
- monitoring |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🛠️ Refactor suggestion
Limit NATS exporter to the monitoring network
Exposing 7777:7777
makes the exporter publicly accessible. Replace ports
with expose: ["7777"]
to keep it internal to the monitoring
network.
🤖 Prompt for AI Agents
In deploy/metrics/docker-compose.yml around lines 94 to 100, the NATS exporter
service exposes port 7777 publicly using the ports directive. To restrict access
to the monitoring network only, replace the ports section with expose: ["7777"].
This change keeps the port accessible internally within the monitoring network
but not exposed externally.
services: | ||
nats-server: | ||
image: nats | ||
command: [ "-js", "--trace" ] | ||
image: nats:2.11.4 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
will want to confirm with @nv-anants , @saturley-hall if ok to pin here
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 1
♻️ Duplicate comments (2)
deploy/metrics/docker-compose.yml (2)
114-130
: 🛠️ Refactor suggestionAvoid hardcoding Grafana credentials
Storing admin credentials in-file is insecure. Use Docker secrets or environment variables to injectGF_SECURITY_ADMIN_USER
andGF_SECURITY_ADMIN_PASSWORD
.- environment: - - GF_SECURITY_ADMIN_USER=dynamo - - GF_SECURITY_ADMIN_PASSWORD=dynamo + environment: + - GF_SECURITY_ADMIN_USER=${GRAFANA_ADMIN_USER} + - GF_SECURITY_ADMIN_PASSWORD=${GRAFANA_ADMIN_PASSWORD}Define these in a
.env
file or via Docker secrets.
60-78
:⚠️ Potential issueAlign dcgm-exporter host port with documentation
The README and diagram reference port 9400, but the compose mapping uses9401:9400
. Either update the docs to note 9401, or change the binding to9400:9400
for consistency.dcgm-exporter: - ports: - - 9401:9400 + ports: + - 9400:9400
🧹 Nitpick comments (2)
deploy/metrics/README.md (1)
10-13
: Specify fenced code block language
The ASCII topology diagram is enclosed in a fenced code block without a language tag, causing markdownlint MD040. Adding a language (e.g.,text
) will improve readability and satisfy lint rules.🧰 Tools
🪛 markdownlint-cli2 (0.17.2)
13-13: Fenced code blocks should have a language specified
null(MD040, fenced-code-language)
deploy/metrics/docker-compose.yml (1)
80-103
: Consider persisting Prometheus TSDB data
Without a volume for/prometheus
, metrics are lost on container restarts. To retain history, mount a named volume:services: prometheus: volumes: - ./prometheus.yml:/etc/prometheus/prometheus.yml + - prometheus-data:/prometheus ... volumes: + prometheus-data:
📜 Review details
Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
⛔ Files ignored due to path filters (1)
deploy/metrics/grafana1.png
is excluded by!**/*.png
📒 Files selected for processing (3)
deploy/metrics/README.md
(2 hunks)deploy/metrics/docker-compose.yml
(2 hunks)deploy/metrics/prometheus.yml
(1 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
- deploy/metrics/prometheus.yml
🧰 Additional context used
🪛 markdownlint-cli2 (0.17.2)
deploy/metrics/README.md
13-13: Fenced code blocks should have a language specified
null
(MD040, fenced-code-language)
54-54: Unordered list indentation
Expected: 0; Actual: 2
(MD007, ul-indent)
55-55: Unordered list indentation
Expected: 0; Actual: 2
(MD007, ul-indent)
⏰ Context from checks skipped due to timeout of 90000ms (1)
- GitHub Check: Build and Test - vllm
🔇 Additional comments (4)
deploy/metrics/docker-compose.yml (4)
16-21
: Bridge networks configuration is correct
The explicitserver
andmonitoring
bridge networks isolate core and metrics services as intended.
25-33
: nats-server image pin and networking look good
Pinningnats:2.11.4
, enabling the monitoring port, and attaching to bothserver
andmonitoring
meets isolation and reachability requirements.
36-45
: etcd-server setup is solid
Usingbitnami/etcd:3.6.1
, allowing unauthenticated access for metrics, and attaching to both networks aligns with the overall architecture.
49-59
: nats-prometheus-exporter is configured correctly
The exporter is pinned to0.17.3
, covers all relevant flags, lives solely on themonitoring
network, and declaresdepends_on
properly.
… in the monitoring network.
Overview:
This PR improves the metrics stack by implementing proper Docker networking, adding service documentation, and fixing configuration paths.
Details:
monitoring
networkWhere should the reviewer start?
deploy/metrics/docker-compose.yml
- Review the service relationship diagram and network configurationdeploy/metrics/prometheus.yml
- Verify the updated scrape targets match the new network topologyRelated Issues: (use one of the action keywords Closes / Fixes / Resolves / Relates to)
Summary by CodeRabbit