# Monitoring and measuring GPU metrics with DCGM

## Introduction

[NVIDIA Data Center GPU Manager (DCGM)](https://developer.nvidia.com/dcgm) is a suite of tools for managing and monitoring NVIDIA datacenter GPUs in cluster environments. It includes active health monitoring, comprehensive diagnostics, system alerts, and governance policies including power and clock management. It can be used standalone by infrastructure teams and integrates easily into cluster management, resource scheduling, and monitoring products from NVIDIA partners. In this tutorial, we provide basic examples of how to log these metrics.

## Installation

1. Follow [this guide](https://docs.nvidia.com/datacenter/dcgm/latest/user-guide/getting-started.html#supported-linux-distributions) to install `datacenter-gpu-manager` on a local or remote machine. The user can verify the installation with `dcgmi discovery -l`.

2. Pull and run the [`dcgm-exporter`](https://github.com/NVIDIA/dcgm-exporter) container on the same machine so that the metrics endpoint can be queried with `curl`. For example:
```
DCGM_EXPORTER_VERSION=3.1.6-3.1.3 &&
docker run -itd --rm \
--gpus all \
--net host \
--cap-add SYS_ADMIN \
nvcr.io/nvidia/k8s/dcgm-exporter:${DCGM_EXPORTER_VERSION}-ubuntu20.04 \
-r localhost:5555 -f /etc/dcgm-exporter/dcp-metrics-included.csv -a ":<port>"
```
`localhost:5555` points to the `nv-hostengine` running on the local host. `<port>` defaults to 9400, but the user can specify a different port to match their environment. `/etc/dcgm-exporter/dcp-metrics-included.csv` is the list of metrics to export; it provides more detailed information about GPU usage than the default list.

## Quick start

After the container has been up for about 2-3 minutes, the user can retrieve the GPU metrics from the `<host-ip>` machine with `curl <host-ip>:<port>/metrics`.
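
If querying from Python is more convenient than `curl`, a minimal sketch along these lines could be used. This assumes the third-party `requests` package is installed; `<host-ip>` and `<port>` are the same placeholders as above:
```
import requests

# Placeholders: the machine running dcgm-exporter and its listening port (9400 by default)
HOST = "<host-ip>"
PORT = 9400

# dcgm-exporter exposes metrics in the Prometheus text format at /metrics
response = requests.get(f"http://{HOST}:{PORT}/metrics", timeout=10)
response.raise_for_status()
print(response.text)
```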

To use it in a container, the user can create a `log.sh` script that keeps logging indefinitely:
```
#!/bin/bash

set -e

file="output.log"
url="<host-ip>:<port>/metrics"

# Append a timestamped snapshot of the exporter metrics every 30 seconds
while true; do
    timestamp=$(date +%Y-%m-%d_%H:%M:%S)
    message=$(curl -s "$url")
    echo -e "$timestamp:\n$message\n" >> "$file"
    sleep 30
done
```

The script can then be started in the background before the workload and stopped once the workload finishes:
```
DATETIME=$(date +"%Y%m%d-%H%M%S")
# Start logging in the background and record its PID
nohup ./log.sh >/dev/null 2>&1 & echo $! > log_process_${DATETIME}.pid
# Run the workload while metrics are collected, then stop the logger
python ...
kill $(cat log_process_${DATETIME}.pid) && rm log_process_${DATETIME}.pid
```
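
Alternatively, the logging can be done from inside the Python workload itself, for example with a small background thread. The sketch below is only an illustration under the same assumptions (the `requests` package and the `<host-ip>`/`<port>` placeholders); it is not part of DCGM or dcgm-exporter:
```
import threading
import time

import requests

def log_metrics(url, path="output.log", interval=30, stop_event=None):
    """Append a timestamped snapshot of the exporter metrics every `interval` seconds."""
    with open(path, "a") as f:
        while stop_event is None or not stop_event.is_set():
            timestamp = time.strftime("%Y-%m-%d_%H:%M:%S")
            f.write(f"{timestamp}:\n{requests.get(url, timeout=10).text}\n")
            f.flush()
            time.sleep(interval)

stop = threading.Event()
logger = threading.Thread(
    target=log_metrics,
    args=("http://<host-ip>:<port>/metrics",),
    kwargs={"stop_event": stop},
    daemon=True,
)
logger.start()
# ... run the training workload here ...
stop.set()
```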

The GPU utilization, as well as other metrics such as DRAM usage and PCIe Tx/Rx, can be found in `output.log`. Depending on the GPU model, the following metrics may be recorded (a short parsing sketch is shown after the list):

- DCGM_FI_PROF_GR_ENGINE_ACTIVE (GPU utilization, if the architecture supports it)
- DCGM_FI_PROF_PIPE_TENSOR_ACTIVE (Tensor Core utilization)
- DCGM_FI_PROF_PCIE_TX_BYTES (PCIe transmit bytes)
- DCGM_FI_PROF_PCIE_RX_BYTES (PCIe receive bytes)
- DCGM_FI_DEV_FB_USED (GPU memory usage)

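As a rough post-processing sketch, the samples of one of the metrics above can be pulled out of `output.log` with a few lines of Python. This assumes the timestamped format written by `log.sh` and the usual Prometheus text format emitted by dcgm-exporter (metric name, a `{...}` label block, then the value):
```
import re

# Any metric name from the list above can be used instead
METRIC = "DCGM_FI_PROF_GR_ENGINE_ACTIVE"

# Match Prometheus-style lines such as: DCGM_FI_PROF_GR_ENGINE_ACTIVE{gpu="0",...} 0.85
pattern = re.compile(r"^" + re.escape(METRIC) + r"\{(.*?)\}\s+(\S+)")

values = []
with open("output.log") as f:
    for line in f:
        match = pattern.match(line)
        if match:
            values.append(float(match.group(2)))

if values:
    print(f"{METRIC}: {len(values)} samples, mean = {sum(values) / len(values):.4f}")
```
The same approach can be applied per GPU by also keying on the `gpu` label captured inside the braces.
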
## GPU Utilization

The overview in the [NVIDIA DCGM documentation](https://docs.nvidia.com/datacenter/dcgm/latest/user-guide/index.html) provides a detailed description of how to read the GPU utilization in the [metrics section](https://docs.nvidia.com/datacenter/dcgm/latest/user-guide/feature-overview.html?#metrics).

Typically, the “GPU Utilization” reported by `nvidia-smi` or `NVML` is a rough metric that reflects how busy the GPU cores are. It is defined as the “percent of time over the past sample period during which one or more kernels was executing on the GPU”. In extreme cases, the metric can read 100% even if only a single thread launched a kernel on the GPU during the past sample period.