
Commit c8a852d

Add a user guide for DCGM usage (#1274)
Address feature request in [this MONAI issue](Project-MONAI/MONAI#6190).

### Description

Closes Project-MONAI/MONAI#6190.

### Checks

- [x] Avoid including large-size files in the PR.
- [x] Ensure (1) hyperlinks and markdown anchors are working, (2) relative paths are used for tutorial repo files, (3) figures and graphs are put in the `./figure` folder.

---------

Signed-off-by: Mingxin Zheng <[email protected]>
1 parent 4b749ef commit c8a852d

File tree

3 files changed: +73 −0 lines changed


acceleration/README.md

Lines changed: 3 additions & 0 deletions
@@ -26,3 +26,6 @@ Demonstrates the use of the `ThreadBuffer` class used to generate data batches d
 Illustrate reading NIfTI files and test speed of different transforms on different devices.
 #### [TensorRT_inference_acceleration](./TensorRT_inference_acceleration.ipynb)
 This notebook shows how to use TensorRT to accelerate the model and achieve a better inference latency.
+
+#### [Tutorials for resource monitoring](./monitoring/README.md)
+Information about how to set up and apply existing tools to monitor the computing resources.

acceleration/monitoring/README.md

Lines changed: 5 additions & 0 deletions
@@ -0,0 +1,5 @@
# Monitoring Resources

## List of tutorials:

- [Tutorial to show how to run Data Center GPU Manager (DCGM) locally](using-dcgm.md)

acceleration/monitoring/using-dcgm.md

Lines changed: 65 additions & 0 deletions
@@ -0,0 +1,65 @@
# Monitoring and measuring GPU metrics with DCGM

## Introduction

[NVIDIA Data Center GPU Manager (DCGM)](https://developer.nvidia.com/dcgm) is a suite of tools for managing and monitoring NVIDIA datacenter GPUs in cluster environments. It includes active health monitoring, comprehensive diagnostics, system alerts, and governance policies including power and clock management. It can be used standalone by infrastructure teams and easily integrates into cluster management, resource scheduling, and monitoring products from NVIDIA partners. In this tutorial, we provide basic examples of how to log these metrics.
## Installation

1. Follow [this guide](https://docs.nvidia.com/datacenter/dcgm/latest/user-guide/getting-started.html#supported-linux-distributions) to install `datacenter-gpu-manager` on a local or remote machine. Users can verify the installation with `dcgmi discovery -l` (a minimal end-to-end check is sketched at the end of this section).

2. Pull and run the [`dcgm-exporter`](https://github.com/NVIDIA/dcgm-exporter) container on the same machine so that the metrics can be queried with `curl`. For example:
```
DCGM_EXPORTER_VERSION=3.1.6-3.1.3 &&
docker run -itd --rm \
   --gpus all \
   --net host \
   --cap-add SYS_ADMIN \
   nvcr.io/nvidia/k8s/dcgm-exporter:${DCGM_EXPORTER_VERSION}-ubuntu20.04 \
   -r localhost:5555 -f /etc/dcgm-exporter/dcp-metrics-included.csv -a ":<port>"
```
`localhost:5555` points to the `nv-hostengine` running on the local machine. `<port>` has a default value of 9400, but users can specify another port based on their environment. `/etc/dcgm-exporter/dcp-metrics-included.csv` is the list of metrics to export; it provides more information about GPU usage than the default list.
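As a rough end-to-end check of the two steps above, something like the following can be run on the host. This is only a sketch: it assumes an Ubuntu machine, that the installed DCGM package provides the `nvidia-dcgm` systemd service and the `nv-hostengine` binary (service names may differ between DCGM releases), and that `<port>` was left at its default of 9400.

```
# Step 1: make sure the DCGM host engine is running and can see the GPUs.
sudo systemctl --now enable nvidia-dcgm   # or start it manually with: sudo nv-hostengine
dcgmi discovery -l                        # should list every GPU on this machine

# Step 2: after launching the dcgm-exporter container above,
# confirm that metrics are being served on the chosen port.
curl -s localhost:9400/metrics | head
```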
## Quick start

After the container has been up for about 2-3 minutes, the user can use `curl` to fetch the GPU metrics from the `<host-ip>` machine: `curl <host-ip>:<port>/metrics`.
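For example, to look at only a couple of fields and skip the `# HELP`/`# TYPE` comment lines of the Prometheus text format (a sketch; the metric names are the ones listed later in this tutorial, and `<port>` is assumed to be the default 9400):

```
# Keep only the GPU memory usage and graphics-engine activity samples.
curl -s <host-ip>:9400/metrics | grep -E '^DCGM_FI_(DEV_FB_USED|PROF_GR_ENGINE_ACTIVE)'
```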
To use it in a container, the user can create a `log.sh` that keeps logging the metrics indefinitely:
```
#!/bin/bash

set -e

file="output.log"
url="<host-ip>:<port>/metrics"

# Append a timestamped snapshot of all exported metrics every 30 seconds.
while true; do
  timestamp=$(date +%Y-%m-%d_%H:%M:%S)
  message=$(curl -s $url)
  echo -e "$timestamp:\n$message\n" >> $file
  sleep 30
done
```
The logger can then be started in the background, the workload run, and the logger stopped afterwards, for example:

```
DATETIME=$(date +"%Y%m%d-%H%M%S")
# Start the logger in the background and remember its process id.
nohup ./log.sh >/dev/null 2>&1 & echo $! > log_process_${DATETIME}.pid
# Run the workload to be monitored.
python ...
# Stop the logger and clean up the pid file.
kill $(cat log_process_${DATETIME}.pid) && rm log_process_${DATETIME}.pid
```
The GPU utilization, as well as other metrics such as DRAM usage and PCIe Tx/Rx, can then be found in `output.log`. Depending on the GPU model, the following metrics may be recorded:

- DCGM_FI_PROF_GR_ENGINE_ACTIVE (GPU utilization, if the architecture supports it)
- DCGM_FI_PROF_PIPE_TENSOR_ACTIVE (Tensor Core utilization)
- DCGM_FI_PROF_PCIE_TX_BYTES (PCIe transmit bytes)
- DCGM_FI_PROF_PCIE_RX_BYTES (PCIe receive bytes)
- DCGM_FI_DEV_FB_USED (GPU memory usage)
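Because each snapshot in `output.log` starts with a timestamp followed by the full metric dump, a single metric's time series can be extracted with standard text tools. A minimal sketch, assuming the `log.sh` above was used unchanged and that GPU memory usage is the metric of interest:

```
# Print each timestamp line (written as YYYY-MM-DD_HH:MM:SS:) followed by
# the DCGM_FI_DEV_FB_USED samples recorded at that time.
grep -E '^(2[0-9]{3}-[0-9]{2}-[0-9]{2}_|DCGM_FI_DEV_FB_USED)' output.log
```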
## GPU Utilization

The overview of the [NVIDIA DCGM Documentation](https://docs.nvidia.com/datacenter/dcgm/latest/user-guide/index.html) provides a detailed description of how to read the GPU utilization in the [metrics section](https://docs.nvidia.com/datacenter/dcgm/latest/user-guide/feature-overview.html?#metrics).

Typically, the "GPU Utilization" reported by `nvidia-smi` or `NVML` is a coarse metric of how busy the GPU is. It is defined as the "Percent of time over the past sample period during which one or more kernels was executing on the GPU". In the extreme case, the metric can read 100% even if only a single thread launched a kernel on the GPU during the past sample period.
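To see the difference in practice, the coarse NVML number and the finer-grained DCGM profiling metric can be sampled side by side while a workload is running. This is a sketch only; it assumes `dcgm-exporter` is running on the local machine with the default port 9400.

```
# Coarse utilization reported by NVML/nvidia-smi (percent of time any kernel was running).
nvidia-smi --query-gpu=utilization.gpu --format=csv,noheader
# Graphics-engine activity exported by DCGM at roughly the same moment.
curl -s localhost:9400/metrics | grep '^DCGM_FI_PROF_GR_ENGINE_ACTIVE'
```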
