
feat(FT): Enable Prometheus and Grafana in the metrics group, running… #1488


Open

keivenchang wants to merge 5 commits into main

Conversation


@keivenchang keivenchang commented Jun 12, 2025

… in the monitoring network.

Overview:

This PR improves the metrics stack by implementing proper Docker networking, adding service documentation, and fixing configuration paths.

Details:

  • Network Architecture: Moved relevant services to the monitoring network
  • Service Documentation: Added comprehensive service relationship diagram showing port mappings and data flow between components
  • Configuration Updates:
    • Updated prometheus.yml to use correct network-based service targets
    • Changed Grafana credentials from admin/admin to dynamo/dynamo so that users don't have to change password every single time
  • Network Isolation: Replaced host networking with proper Docker bridge networking while maintaining external port accessibility
  • Service Organization: Clear separation between core services (nats-server on default network) and metrics services (monitoring network)

Where should the reviewer start?

  • deploy/metrics/docker-compose.yml - Review the service relationship diagram and network configuration
  • deploy/metrics/prometheus.yml - Verify the updated scrape targets match the new network topology
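For orientation, a rough sketch of what those network-based scrape targets look like (illustrative only; job names, ports, and intervals follow the values discussed in this PR, and deploy/metrics/prometheus.yml remains the source of truth):

```yaml
# Sketch of the scrape topology described in this PR (not a verbatim copy of prometheus.yml).
global:
  scrape_interval: 10s        # raised global default
  evaluation_interval: 10s

scrape_configs:
  - job_name: nats-prometheus-exporter
    scrape_interval: 2s
    static_configs:
      - targets: ["nats-prometheus-exporter:7777"]  # Compose service name on the monitoring network

  - job_name: etcd-server
    static_configs:
      - targets: ["etcd-server:2379"]

  - job_name: dcgm-exporter
    scrape_interval: 5s
    static_configs:
      - targets: ["dcgm-exporter:9400"]             # container port; the host maps 9401 -> 9400
```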

Related Issues:

  • Relates to improving metrics infrastructure and Docker networking best practices

Summary by CodeRabbit

  • New Features
    • Introduced new monitoring services, including GPU and NATS exporters, and added Grafana for enhanced observability.
  • Improvements
    • Enhanced network isolation for monitoring services and updated service dependencies for better reliability.
    • Upgraded service versions for improved stability and security.
    • Updated Grafana default credentials and port for easier access.
    • Refined Prometheus scrape intervals and targets for more efficient metrics collection.
    • Expanded documentation with detailed topology diagrams and clearer setup instructions.
  • Bug Fixes
    • Corrected Grafana datasource configuration to use Docker network hostnames.


copy-pr-bot bot commented Jun 12, 2025

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.


coderabbitai bot commented Jun 12, 2025

Walkthrough

The metrics stack's Docker Compose configuration was expanded with new services, explicit network isolation, and detailed comments. Service images were pinned to specific versions, dependencies and ports were clarified, and a dedicated monitoring network was introduced. Prometheus and Grafana configurations were updated to reflect new service endpoints and scrape intervals.

Changes

  • deploy/metrics/docker-compose.yml: Added server and monitoring bridge networks; pinned images for nats-server, etcd-server, prometheus, and grafana; introduced nats-prometheus-exporter and dcgm-exporter services under the metrics profile; updated ports, commands, environment variables, network modes, and service dependencies.
  • deploy/metrics/grafana-datasources.yml: Changed the Grafana datasource URL from localhost to http://prometheus:9090; removed outdated comments.
  • deploy/metrics/prometheus.yml: Increased global scrape and evaluation intervals; replaced the existing scrape job with new jobs for nats-prometheus-exporter, etcd-server, and dcgm-exporter; added commented-out jobs for Prometheus self-scrape and the metrics-aggregation service.
  • deploy/metrics/README.md: Expanded the README with a detailed topology ASCII diagram, clarified startup order and URLs, updated Grafana login credentials, added instructions for optional metrics aggregation, and included a screenshot reference.

Sequence Diagram(s)

sequenceDiagram
    participant User
    participant Grafana
    participant Prometheus
    participant nats-prometheus-exporter
    participant dcgm-exporter
    participant etcd-server

    User->>Grafana: Access dashboards (port 3001)
    Grafana->>Prometheus: Query metrics (http://prometheus:9090)
    Prometheus->>nats-prometheus-exporter: Scrape metrics (port 7777)
    Prometheus->>dcgm-exporter: Scrape metrics (port 9401)
    Prometheus->>etcd-server: Scrape metrics (port 2379)

Poem

In the warren of metrics, new tunnels appear,
With exporters and bridges, the network is clear.
Grafana now listens on port three-oh-one,
Prometheus gathers, its scraping begun.
Each service is pinned, dependencies tight—
Our monitoring garden grows healthy and bright!
🐰📊✨


📜 Recent review details

Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between ee91fd7 and be8432d.

📒 Files selected for processing (2)
  • deploy/metrics/README.md (2 hunks)
  • deploy/metrics/docker-compose.yml (2 hunks)
🚧 Files skipped from review as they are similar to previous changes (2)
  • deploy/metrics/README.md
  • deploy/metrics/docker-compose.yml
⏰ Context from checks skipped due to timeout of 90000ms (1)
  • GitHub Check: Build and Test - vllm



@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 2

🧹 Nitpick comments (3)
deploy/metrics/docker-compose.yml (3)

16-63: Consider streamlining the in-file service diagram
The ASCII diagram is very helpful for onboarding but can become stale and lengthen the Compose file. You may want to move it to a separate README.md or architecture doc and reference it here for clarity.


117-117: Remove trailing whitespace
Line 117 (volumes:) contains trailing spaces that trigger pre-commit hook failures. Remove the extra whitespace at end of line.

🧰 Tools
🪛 YAMLlint (1.37.1)

[error] 117-117: trailing spaces

(trailing-spaces)


158-158: Resolve the stale TODO on networking
The `TODO: Use more explicit networking setup when metrics is containerized` comment is outdated now that explicit networks sections exist. Please remove or address it before merge.

📜 Review details

Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between af1f115 and bd45cbc.

📒 Files selected for processing (3)
  • deploy/metrics/docker-compose.yml (2 hunks)
  • deploy/metrics/grafana-datasources.yml (1 hunks)
  • deploy/metrics/prometheus.yml (1 hunks)
🧰 Additional context used
🪛 YAMLlint (1.37.1)
deploy/metrics/docker-compose.yml

[warning] 111-111: wrong indentation: expected 6 but found 5

(indentation)


[error] 117-117: trailing spaces

(trailing-spaces)


[warning] 130-130: wrong indentation: expected 6 but found 5

(indentation)


[warning] 132-132: wrong indentation: expected 6 but found 5

(indentation)


[warning] 160-160: wrong indentation: expected 6 but found 5

(indentation)


[warning] 162-162: wrong indentation: expected 6 but found 5

(indentation)

🪛 GitHub Actions: Pre Merge Validation of (ai-dynamo/dynamo/refs/pull/1488/merge) by keivenchang.
deploy/metrics/docker-compose.yml

[error] 114-114: Pre-commit hook 'trailing-whitespace' failed. Trailing whitespace was found and automatically fixed in this file.

⏰ Context from checks skipped due to timeout of 90000ms (1)
  • GitHub Check: Build and Test - vllm
🔇 Additional comments (5)
deploy/metrics/grafana-datasources.yml (1)

22-22: URL updated to Docker network hostname
Switching from localhost to prometheus ensures Grafana reaches Prometheus over the monitoring network.

deploy/metrics/prometheus.yml (3)

17-18: Reevaluate global scrape and evaluation intervals
Increasing both intervals to 10s reduces Prometheus load but may impact metric freshness. Confirm that a 10s resolution meets your observability SLOs.


21-24: Validate nats-prometheus-exporter scrape configuration
Scraping every 2s from nats-prometheus-exporter:7777 aligns with the new monitoring network setup. Ensure the service name matches the Docker Compose hostname.


31-34: Confirm dcgm-exporter job setup
The dcgm-exporter:9400 target and 5s interval correctly reflect the container's metrics port on the monitoring network.

deploy/metrics/docker-compose.yml (1)

109-110: Verify NVIDIA runtime support in Compose v3
Using runtime: nvidia is not supported in newer Compose file formats (v3+). Consider requesting GPU devices via deploy.resources.reservations.devices instead.
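A minimal sketch of the Compose-spec alternative, assuming the goal is simply to hand all GPUs to the exporter (the same reservation block already appears in this PR's compose file):

```yaml
# Sketch: GPU access via the Compose specification instead of `runtime: nvidia`.
services:
  dcgm-exporter:
    image: nvidia/dcgm-exporter:4.2.3-4.1.3-ubi9
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all          # or a specific count / device_ids list
              capabilities: [gpu]
```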


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 3

🧹 Nitpick comments (4)
deploy/metrics/docker-compose.yml (4)

16-61: Align ASCII diagram with actual service names and network topology

The diagram refers to nats-prom-exp but the service is named nats-prometheus-exporter, and it describes nats-server on a host network, whereas the Compose file uses the default bridge network. Please sync the ASCII art and network description with the real service names and the actual network mode.


85-95: Add restart policy and healthcheck to exporters

To improve resilience, add restart: unless-stopped and a basic healthcheck for nats-prometheus-exporter. This ensures Prometheus scrapes only healthy endpoints and restarts crashed exporters.
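A minimal sketch of what that could look like, assuming the exporter image ships a shell with wget (if it does not, substitute another probe):

```yaml
  nats-prometheus-exporter:
    image: natsio/prometheus-nats-exporter:0.17.3
    restart: unless-stopped
    healthcheck:
      # Probe the exporter's own metrics endpoint on port 7777.
      test: ["CMD", "wget", "-q", "--spider", "http://localhost:7777/metrics"]
      interval: 10s
      timeout: 3s
      retries: 5
```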


115-137: Enhance Prometheus startup reliability

depends_on only dictates container start order—it doesn’t wait for service readiness. Add healthcheck blocks for dcgm-exporter and nats-prometheus-exporter, then use condition: service_healthy so Prometheus only starts once exporters are healthy.
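Assuming each exporter defines a healthcheck along the lines sketched above, the Prometheus service could gate its start roughly like this:

```yaml
  prometheus:
    depends_on:
      nats-prometheus-exporter:
        condition: service_healthy
      dcgm-exporter:
        condition: service_healthy
```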


166-168: Define external network intent

If the monitoring network is shared across multiple Compose projects, consider declaring it as external: true. This prevents accidental redeclarations and ensures a single bridge network is reused.
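A sketch of the external declaration, assuming the network is created once out of band (e.g. with docker network create monitoring):

```yaml
networks:
  monitoring:
    external: true   # reuse an existing bridge network instead of creating a per-project one
```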

📜 Review details

Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between bd45cbc and 67f2259.

📒 Files selected for processing (2)
  • deploy/metrics/docker-compose.yml (2 hunks)
  • deploy/metrics/prometheus.yml (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
  • deploy/metrics/prometheus.yml
⏰ Context from checks skipped due to timeout of 90000ms (1)
  • GitHub Check: Build and Test - vllm
🔇 Additional comments (2)
deploy/metrics/docker-compose.yml (2)

62-63: Good practice: images are pinned to specific versions

Pinning images prevents unexpected breaking changes. Ensure you have a process in place to update these versions regularly and scan for vulnerabilities.


73-81: Profile flag missing on etcd-server

etcd-server is connected to the monitoring network but lacks profiles: [metrics]. If it’s only meant for metrics scraping, add the profile; otherwise confirm it should run unconditionally.
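If it is metrics-only, the change would be roughly (sketch; confirm first that core services do not rely on etcd being up by default):

```yaml
  etcd-server:
    image: bitnami/etcd:3.6.1
    profiles: [metrics]   # started only with `docker compose --profile metrics up`
```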

Comment on lines +142 to +160
    image: grafana/grafana-enterprise:12.0.1
    container_name: grafana
    volumes:
-      - ./metrics/grafana.json:/etc/grafana/provisioning/dashboards/llm-worker-dashboard.json
-      - ./metrics/grafana-datasources.yml:/etc/grafana/provisioning/datasources/datasources.yml
-      - ./metrics/grafana-dashboard-providers.yml:/etc/grafana/provisioning/dashboards/dashboard-providers.yml
+      - ./grafana.json:/etc/grafana/provisioning/dashboards/llm-worker-dashboard.json
+      - ./grafana-datasources.yml:/etc/grafana/provisioning/datasources/datasources.yml
+      - ./grafana-dashboard-providers.yml:/etc/grafana/provisioning/dashboards/dashboard-providers.yml
    environment:
-      # Port 3000 is used by "dynamo serve", so use 3001
+      # Port 3000 is already used by "dynamo serve", so use 3001
      - GF_SERVER_HTTP_PORT=3001
-      - GF_SECURITY_ADMIN_USER=admin
-      - GF_SECURITY_ADMIN_PASSWORD=admin
+      - GF_SECURITY_ADMIN_USER=dynamo
+      - GF_SECURITY_ADMIN_PASSWORD=dynamo
      - GF_USERS_ALLOW_SIGN_UP=false
      - GF_INSTALL_PLUGINS=grafana-piechart-panel
      # Default min interval is 5s, but can be configured lower
      - GF_DASHBOARDS_MIN_REFRESH_INTERVAL=2s
    restart: unless-stopped
-    # TODO: Use more explicit networking setup when metrics is containerized
-    #ports:
-    #  - "3001:3001"
-    #networks:
-    #  - monitoring
-    network_mode: "host"
+    ports:
+      - "3001:3001"
+    networks:


🛠️ Refactor suggestion

Secure Grafana credentials and port configuration

Changing Grafana’s HTTP port to 3001 is fine, but verify it doesn’t conflict with other host services. Avoid hardcoding admin credentials in the Compose file; use Docker secrets or external environment variables to manage them securely.

🤖 Prompt for AI Agents
In deploy/metrics/docker-compose.yml around lines 142 to 160, the Grafana
service hardcodes admin credentials and sets the HTTP port to 3001 without
verifying conflicts. To fix this, remove the hardcoded GF_SECURITY_ADMIN_USER
and GF_SECURITY_ADMIN_PASSWORD environment variables and instead reference them
via Docker secrets or external environment variables. Also, confirm that port
3001 is free on the host before assigning it to Grafana to avoid conflicts with
other services.
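One hedged way to do this is Compose file-based secrets combined with Grafana's __FILE environment-variable convention (file paths and secret names below are illustrative):

```yaml
  grafana:
    environment:
      - GF_SERVER_HTTP_PORT=3001
      - GF_SECURITY_ADMIN_USER__FILE=/run/secrets/grafana_admin_user
      - GF_SECURITY_ADMIN_PASSWORD__FILE=/run/secrets/grafana_admin_password
    secrets:
      - grafana_admin_user
      - grafana_admin_password

secrets:
  grafana_admin_user:
    file: ./secrets/grafana_admin_user.txt        # illustrative paths; keep out of version control
  grafana_admin_password:
    file: ./secrets/grafana_admin_password.txt
```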

Comment on lines 96 to 112
dcgm-exporter:
image: nvidia/dcgm-exporter:4.2.3-4.1.3-ubi9
ports:
- 9401:9400
deploy:
resources:
reservations:
devices:
- driver: nvidia
count: all
capabilities: [gpu]
environment:
- NVIDIA_VISIBLE_DEVICES=all # Make all GPUs visible to the container
runtime: nvidia # Specify the NVIDIA runtime
networks:
- monitoring
profiles: [metrics]


⚠️ Potential issue

Fix port mapping inconsistency for dcgm-exporter

The diagram and comments reference port 9400, but the Compose file maps host port 9401 to container port 9400. Align the documentation with this mapping or adjust the binding to use 9400:9400 for consistency.

🤖 Prompt for AI Agents
In deploy/metrics/docker-compose.yml around lines 96 to 112, the port mapping
for dcgm-exporter maps host port 9401 to container port 9400, which is
inconsistent with the diagram and comments referencing port 9400. Fix this by
changing the port mapping to 9400:9400 to align the host and container ports and
maintain consistency with the documentation.


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 2

♻️ Duplicate comments (2)
deploy/metrics/docker-compose.yml (2)

105-109: Align dcgm-exporter port mapping with documentation
The ASCII diagram shows port 9400, but the Compose file binds 9401:9400. Either update the diagram to reference 9401 or revert to "9400:9400" for consistency.


160-162: Avoid hardcoding Grafana admin credentials
Storing GF_SECURITY_ADMIN_USER/GF_SECURITY_ADMIN_PASSWORD in cleartext poses a security risk. Use Docker secrets or inject via environment variables at deploy time.

🧹 Nitpick comments (6)
deploy/metrics/docker-compose.yml (6)

16-38: Improve ASCII diagram consistency
The diagram uses abbreviations (e.g., nats-prom-exp) that don't match your service names exactly (nats-prometheus-exporter). Align labels and port references with actual Compose definitions for clarity.


123-128: Mount Prometheus config read-only
Make the prometheus.yml mount read-only to prevent runtime changes:

- - ./prometheus.yml:/etc/prometheus/prometheus.yml
+ - ./prometheus.yml:/etc/prometheus/prometheus.yml:ro

136-139: Restrict Prometheus port exposure
Binding "9090:9090" publicly opens the Prometheus UI. Consider 127.0.0.1:9090:9090 or use expose: ["9090"] if host-wide access isn’t needed.


140-146: Enhance startup ordering with healthchecks
depends_on controls container start order but doesn’t wait for services to become healthy. Add healthcheck blocks for dcgm-exporter, nats-prometheus-exporter, and etcd-server, then use condition: service_healthy.


154-156: Mount Grafana provisioning files read-only
Prevent accidental edits by marking dashboard and datasource mounts as read-only:

- - ./grafana-datasources.yml:/etc/grafana/provisioning/datasources/datasources.yml
+ - ./grafana-datasources.yml:/etc/grafana/provisioning/datasources/datasources.yml:ro

167-168: Consider internal-only Grafana access
Binding 3001:3001 publicly may not be necessary. Use expose: ["3001"] or bind to localhost (127.0.0.1:3001:3001) to restrict external access.
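For reference, a sketch of the two patterns suggested in these nitpicks (choose per service depending on whether host access is actually needed):

```yaml
  prometheus:
    ports:
      - "127.0.0.1:9090:9090"   # reachable from the host, but only via localhost

  grafana:
    expose:
      - "3001"                  # reachable only from containers on the same Docker network
```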

📜 Review details

Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 67f2259 and 9a77f72.

📒 Files selected for processing (1)
  • deploy/metrics/docker-compose.yml (2 hunks)
⏰ Context from checks skipped due to timeout of 90000ms (1)
  • GitHub Check: Build and Test - vllm
🔇 Additional comments (3)
deploy/metrics/docker-compose.yml (3)

61-65: Networks defined with clear segmentation
Declaring both server and monitoring bridge networks provides proper isolation between core and metrics services.


81-86: Consider isolating etcd metrics endpoint
Port 2379 serves both etcd API and metrics but is publicly bound. If external etcd access isn’t required, bind it to localhost (127.0.0.1:2379:2379) and use expose for container-only access.


169-170: Grafana attached to monitoring network
Adding Grafana to the monitoring network ensures it can reach Prometheus without exposing core services.

Comment on lines +70 to +75
image: nats:2.11.4
command: [ "-js", "--trace", "-m", "8222" ]
ports:
- 4222:4222
- 6222:6222
-      - 8222:8222
+      - 8222:8222 # the endpoints include /varz, /healthz, ...


⚠️ Potential issue

Restrict NATS metrics port exposure
Binding port 8222 publicly (- 8222:8222) exposes internal metrics on all host interfaces. Switch to expose: ["8222"] or bind to localhost (127.0.0.1:8222:8222) to limit access.

🤖 Prompt for AI Agents
In deploy/metrics/docker-compose.yml around lines 70 to 75, the NATS metrics
port 8222 is currently bound publicly, exposing internal metrics on all host
interfaces. To restrict access, replace the public port binding '- 8222:8222'
with either 'expose: ["8222"]' to limit exposure to linked services only, or
bind the port to localhost by changing it to '127.0.0.1:8222:8222' to restrict
access to the local machine.

Comment on lines +94 to +100
nats-prometheus-exporter:
image: natsio/prometheus-nats-exporter:0.17.3
command: ["-varz", "-connz", "-routez", "-subz", "-gatewayz", "-leafz", "-jsz=all", "http://nats-server:8222"]
ports:
- 7777:7777
networks:
- monitoring

🛠️ Refactor suggestion

Limit NATS exporter to the monitoring network
Exposing 7777:7777 makes the exporter publicly accessible. Replace ports with expose: ["7777"] to keep it internal to the monitoring network.

🤖 Prompt for AI Agents
In deploy/metrics/docker-compose.yml around lines 94 to 100, the NATS exporter
service exposes port 7777 publicly using the ports directive. To restrict access
to the monitoring network only, replace the ports section with expose: ["7777"].
This change keeps the port accessible internally within the monitoring network
but not exposed externally.

@keivenchang keivenchang self-assigned this Jun 12, 2025
services:
  nats-server:
-    image: nats
-    command: [ "-js", "--trace" ]
+    image: nats:2.11.4

will want to confirm with @nv-anants , @saturley-hall if ok to pin here


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

♻️ Duplicate comments (2)
deploy/metrics/docker-compose.yml (2)

114-130: 🛠️ Refactor suggestion

Avoid hardcoding Grafana credentials
Storing admin credentials in-file is insecure. Use Docker secrets or environment variables to inject GF_SECURITY_ADMIN_USER and GF_SECURITY_ADMIN_PASSWORD.

-    environment:
-      - GF_SECURITY_ADMIN_USER=dynamo
-      - GF_SECURITY_ADMIN_PASSWORD=dynamo
+    environment:
+      - GF_SECURITY_ADMIN_USER=${GRAFANA_ADMIN_USER}
+      - GF_SECURITY_ADMIN_PASSWORD=${GRAFANA_ADMIN_PASSWORD}

Define these in a .env file or via Docker secrets.
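For example, a hypothetical .env next to the Compose file (placeholder values; keep the real file out of version control):

```env
GRAFANA_ADMIN_USER=dynamo
GRAFANA_ADMIN_PASSWORD=change-me
```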


60-78: ⚠️ Potential issue

Align dcgm-exporter host port with documentation
The README and diagram reference port 9400, but the compose mapping uses 9401:9400. Either update the docs to note 9401, or change the binding to 9400:9400 for consistency.

   dcgm-exporter:
-    ports:
-      - 9401:9400
+    ports:
+      - 9400:9400
🧹 Nitpick comments (2)
deploy/metrics/README.md (1)

10-13: Specify fenced code block language
The ASCII topology diagram is enclosed in a fenced code block without a language tag, causing markdownlint MD040. Adding a language (e.g., text) will improve readability and satisfy lint rules.

🧰 Tools
🪛 markdownlint-cli2 (0.17.2)

13-13: Fenced code blocks should have a language specified
null

(MD040, fenced-code-language)

deploy/metrics/docker-compose.yml (1)

80-103: Consider persisting Prometheus TSDB data
Without a volume for /prometheus, metrics are lost on container restarts. To retain history, mount a named volume:

 services:
   prometheus:
     volumes:
       - ./prometheus.yml:/etc/prometheus/prometheus.yml
+      - prometheus-data:/prometheus
 ...
volumes:
+ prometheus-data:
📜 Review details

Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 9a77f72 and ee91fd7.

⛔ Files ignored due to path filters (1)
  • deploy/metrics/grafana1.png is excluded by !**/*.png
📒 Files selected for processing (3)
  • deploy/metrics/README.md (2 hunks)
  • deploy/metrics/docker-compose.yml (2 hunks)
  • deploy/metrics/prometheus.yml (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
  • deploy/metrics/prometheus.yml
🧰 Additional context used
🪛 markdownlint-cli2 (0.17.2)
deploy/metrics/README.md

13-13: Fenced code blocks should have a language specified
null

(MD040, fenced-code-language)


54-54: Unordered list indentation
Expected: 0; Actual: 2

(MD007, ul-indent)


55-55: Unordered list indentation
Expected: 0; Actual: 2

(MD007, ul-indent)

⏰ Context from checks skipped due to timeout of 90000ms (1)
  • GitHub Check: Build and Test - vllm
🔇 Additional comments (4)
deploy/metrics/docker-compose.yml (4)

16-21: Bridge networks configuration is correct
The explicit server and monitoring bridge networks isolate core and metrics services as intended.


25-33: nats-server image pin and networking look good
Pinning nats:2.11.4, enabling the monitoring port, and attaching to both server and monitoring meets isolation and reachability requirements.


36-45: etcd-server setup is solid
Using bitnami/etcd:3.6.1, allowing unauthenticated access for metrics, and attaching to both networks aligns with the overall architecture.


49-59: nats-prometheus-exporter is configured correctly
The exporter is pinned to 0.17.3, covers all relevant flags, lives solely on the monitoring network, and declares depends_on properly.
