
Update notebooks of acceleration and performance #1179


Merged
merged 14 commits into from
Jan 20, 2023
111 changes: 55 additions & 56 deletions acceleration/automatic_mixed_precision.ipynb

Large diffs are not rendered by default.

81 changes: 44 additions & 37 deletions acceleration/dataset_type_performance.ipynb

Large diffs are not rendered by default.

202 changes: 56 additions & 146 deletions acceleration/fast_training_tutorial.ipynb

Large diffs are not rendered by default.

18 changes: 9 additions & 9 deletions performance_profiling/pathology/profiling_train_base_nvtx.md
@@ -4,30 +4,30 @@

# NVIDIA Tools Extension (NVTX)

The [NVIDIA® Tools Extension Library (NVTX)](https://github.com/NVIDIA/NVTX) is a powerful mechanism that allows users to manually instrument their application. With a C-based and a python-based Application Programming Interface (API) for annotating events, code ranges, and resources in your applications. Applications which integrate NVTX can use NVIDIA Nsight, Tegra System Profiler, and Visual Profiler to capture and visualize these events and ranges. In general, the NVTX can bring valuable insight into the application while incurring almost no overhead.
The [NVIDIA® Tools Extension Library (NVTX)](https://github.com/NVIDIA/NVTX) is a powerful mechanism that allows users to manually instrument their application. It provides C-based and Python-based Application Programming Interfaces (APIs) for annotating events, code ranges, and resources in your applications. Applications that integrate NVTX can use NVIDIA Nsight, Tegra System Profiler, and Visual Profiler to capture and visualize these events and ranges. In general, NVTX brings valuable insight into the application while incurring almost no overhead.

# MONAI Training Pipeline and NVTX

[MONAI](https://github.com/Project-MONAI/MONAI) is a high level framework for deep learning in healthcare imaging.
[MONAI](https://github.com/Project-MONAI/MONAI) is a high-level framework for deep learning in healthcare imaging.

For performance profiling, we mainly focus on two fronts: data loading/transforms, and training/validation iterations.

[Transforms](https://github.com/Project-MONAI/MONAI/tree/dev/monai/transforms) is one core concept of data handling in MONAI, similar to [TorchVision Transforms](https://pytorch.org/vision/stable/transforms.html). Several of these transforms are usually chained together, using a [Compose](https://github.com/Project-MONAI/MONAI/blob/2f1c7a5d1b47c8dd21681dbe1b67213aa3278cd7/monai/transforms/compose.py#L35) class, to create a preprocessing or postprocessing pipeline that performs manipulation of the input data and make it suitable for training a deep learning model or inference. To dig into the cost from each individual transform, we enable the insertion of NVTX annotations via [MONAI NVTX Transforms](https://github.com/Project-MONAI/MONAI/blob/dev/monai/utils/nvtx.py).
[Transforms](https://github.com/Project-MONAI/MONAI/tree/dev/monai/transforms) is one core concept of data handling in MONAI, similar to [TorchVision Transforms](https://pytorch.org/vision/stable/transforms.html). Several of these transforms are usually chained together, using a [Compose](https://github.com/Project-MONAI/MONAI/blob/2f1c7a5d1b47c8dd21681dbe1b67213aa3278cd7/monai/transforms/compose.py#L35) class, to create a preprocessing or postprocessing pipeline that performs manipulation of the input data and makes it suitable for training a deep learning model or inference. To dig into the cost of each transform, we enable the insertion of NVTX annotations via [MONAI NVTX Transforms](https://github.com/Project-MONAI/MONAI/blob/dev/monai/utils/nvtx.py).

Training and validation steps are easier to track: we set NVTX annotations within the loop.
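A minimal sketch of in-loop annotation (hypothetical phase names; the helper no-ops when CUDA/NVTX is unavailable so the pattern is portable):

```python
import contextlib

try:
    import torch
    _HAS_NVTX = torch.cuda.is_available()
except ImportError:
    _HAS_NVTX = False

@contextlib.contextmanager
def nvtx_range(name: str):
    """Annotate a code region; no-op when CUDA/NVTX is unavailable."""
    if _HAS_NVTX:
        torch.cuda.nvtx.range_push(name)
    try:
        yield
    finally:
        if _HAS_NVTX:
            torch.cuda.nvtx.range_pop()

def train_step(batch, step_fn):
    # Hypothetical phases; in the tutorial, similar tags mark transforms,
    # forward, backward, etc. within the training loop.
    with nvtx_range("forward"):
        out = step_fn(batch)
    with nvtx_range("backward"):
        pass  # loss.backward() would go here
    return out
```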

# Profiling Pathology Metastasis Detection Pipeline

## Data Preparation

The pipeline that we are profiling `rain_evaluate_nvtx_profiling.py` requires [Camelyon-16 Challenge](https://camelyon16.grand-challenge.org/) dataset. You can download all the images for "CAMELYON16" data set from sources listed [here](https://camelyon17.grand-challenge.org/Data/). Location information for training/validation patches (the location on the whole slide image where patches are extracted) are adopted from [NCRF/coords](https://github.com/baidu-research/NCRF/tree/master/coords). The reformatted coordinations and labels in CSV format for training (`training.csv`) can be found [here](https://drive.google.com/file/d/1httIjgji6U6rMIb0P8pE0F-hXFAuvQEf/view?usp=sharing) and for validation (`validation.csv`) can be found [here](https://drive.google.com/file/d/1tJulzl9m5LUm16IeFbOCoFnaSWoB6i5L/view?usp=sharing).
The pipeline that we are profiling, `train_evaluate_nvtx.py`, requires the [Camelyon-16 Challenge](https://camelyon16.grand-challenge.org/) dataset. You can download all the images for the "CAMELYON16" data set from the sources listed [here](https://camelyon17.grand-challenge.org/Data/). Location information for training/validation patches (the location on the whole slide image where patches are extracted) is adopted from [NCRF/coords](https://github.com/baidu-research/NCRF/tree/master/coords). The reformatted coordinates and labels in CSV format for training (`training.csv`) can be found [here](https://drive.google.com/file/d/1httIjgji6U6rMIb0P8pE0F-hXFAuvQEf/view?usp=sharing) and for validation (`validation.csv`) can be found [here](https://drive.google.com/file/d/1tJulzl9m5LUm16IeFbOCoFnaSWoB6i5L/view?usp=sharing).

> [`training_sub.csv`](https://drive.google.com/file/d/1rO8ZY-TrU9nrOsx-Udn1q5PmUYrLG3Mv/view?usp=sharing) and [`validation_sub.csv`](https://drive.google.com/file/d/130pqsrc2e9wiHIImL8w4fT_5NktEGel7/view?usp=sharing) are also provided to check the functionality of the pipeline using only two of the whole slide images: `tumor_001` (for training) and `tumor_101` (for validation). This subset should not be used for real training.
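The CSV files above are plain coordinate lists; a minimal sketch of reading them is shown below. The column layout here (slide name, patch x/y location, label) is an assumption for illustration — check the downloaded files for the actual format:

```python
import csv
from io import StringIO

def read_patch_coords(fp):
    """Parse rows of (slide, x, y, label); this layout is an assumption."""
    rows = []
    for slide, x, y, label in csv.reader(fp):
        rows.append((slide, int(x), int(y), int(label)))
    return rows

# Tiny in-memory example standing in for training.csv (made-up values).
sample = StringIO("tumor_001,25417,127565,1\ntumor_001,30001,120000,0\n")
coords = read_patch_coords(sample)
```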

## Run Nsight Profiling

In `requirements.txt`, `cupy-cuda114` is set in default. If your cuda version is different, you may need to modify it into a suitable version, you can refer to [here](https://docs.cupy.dev/en/stable/install.html) for more details.
With environment prepared `requirements.txt`, we use `nsys profile` to get the information regarding the training pipeline's behavior across several steps. Since an epoch for pathology is long (covering 400,000 images), here we run profile on the trainer under basic settings for 30 seconds, with 50 seconds' delay. All results shown below are from experiments performed on a DGX-2 workstation using a single V-100 GPU over the full dataset.
In `requirements.txt`, `cupy-cuda114` is set by default. If your CUDA version is different, you may need to change it to a matching version; refer to the [CuPy installation guide](https://docs.cupy.dev/en/stable/install.html) for details.
With the environment prepared per `requirements.txt`, we use `nsys profile` to gather information about the training pipeline's behavior across several steps. Since an epoch for pathology is long (covering 400,000 images), here we profile the trainer under basic settings for 30 seconds, after a 50-second delay. All results shown below are from experiments performed on a DGX-2 workstation using a single V100 GPU over the full dataset.

```shell
nsys profile \
@@ -63,11 +63,11 @@ Let's now take a closer look at what operations are being performed during the long "data loading" gap.

![png](Figure/nsight_transform.png)

As shown in the zoomed view, during the above "data loading" gap, the major operation is data transforms. To be more specific, most of the time is spent on "ColorJitter" operation (orange dashed region). This augmentation technique is a necessary transform for the task of pathology metastasis detection. For this pipeline, it is performed on CPU. On the other hand, the GPU training is so fast that it need to wait a long time for the data augmentation to finish, which comparably is much slower.
As shown in the zoomed view, during the above "data loading" gap, the major operation is data transforms. To be more specific, most of the time is spent on the "ColorJitter" operation (orange dashed region). This augmentation technique is a necessary transform for the task of pathology metastasis detection. For this pipeline, augmentation is performed on the CPU. On the other hand, the GPU training is so fast that it must wait a long time for the comparatively much slower data augmentation to finish.

Therefore, as we identify this major bottleneck, we need to find a mechanism for faster data transform in order to achieve performance improvement.
Having identified this major bottleneck, we need a mechanism for faster data transforms in order to achieve a performance improvement.

One optimized solution is to utilize CuCIM library's GPU transforms for data augmentation, so that all steps are performed on GPU, and thus this bottleneck from slow CPU augmentation can be removed. The code for this part is included in the same python script.
One optimized solution is to utilize the CuCIM library's GPU transforms for data augmentation, so that all steps are performed on GPU, and thus this bottleneck from slow CPU augmentation can be removed. The code for this part is included in the same Python script.
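As a rough sketch of the idea (not the CuCIM code used in the tutorial), a brightness/contrast-style jitter can be applied directly to tensors already resident on the GPU, so the augmentation no longer blocks on the CPU. The function below is a simplified, hypothetical stand-in:

```python
import torch

def gpu_color_jitter(img: torch.Tensor, brightness: float = 0.25,
                     contrast: float = 0.25) -> torch.Tensor:
    """Randomized brightness/contrast jitter on the tensor's own device.

    img: float tensor in [0, 1], shape (C, H, W). Runs on the GPU when img
    lives on a CUDA device; a simplified stand-in for the CuCIM transforms.
    """
    # Draw random factors in [1 - strength, 1 + strength].
    b = 1.0 + (torch.rand(1, device=img.device).item() * 2 - 1) * brightness
    c = 1.0 + (torch.rand(1, device=img.device).item() * 2 - 1) * contrast
    mean = img.mean()
    out = (img * b - mean) * c + mean  # scale brightness, then contrast about the mean
    return out.clamp(0.0, 1.0)

device = "cuda" if torch.cuda.is_available() else "cpu"
img = torch.rand(3, 8, 8, device=device)
aug = gpu_color_jitter(img)
```

Because the whole operation is element-wise tensor math, it stays on the same device as the input and fits between other GPU-side steps without a host round trip.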

# Analyzing Performance Improvement

1 change: 1 addition & 0 deletions performance_profiling/pathology/train_evaluate_nvtx.py
@@ -393,6 +393,7 @@ def main(cfg):
optimizer = SGD(model.parameters(), lr=cfg["lr"], momentum=0.9)

# AMP scaler
cfg["amp"] = cfg["amp"] and monai.utils.get_torch_version_tuple() >= (1, 6)
if cfg["amp"] is True:
scaler = GradScaler()
else:
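The one-line guard above disables AMP on PyTorch versions that predate native AMP support (1.6). A minimal sketch of how the scaler is then used in a training step (hypothetical model and optimizer; AMP is active only when CUDA is available):

```python
import torch
from torch import nn
from torch.cuda.amp import GradScaler, autocast

use_amp = torch.cuda.is_available()  # mirrors cfg["amp"] in the script
model = nn.Linear(4, 2)              # hypothetical stand-in for the real model
optimizer = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9)
scaler = GradScaler(enabled=use_amp)

x, y = torch.randn(8, 4), torch.randn(8, 2)
optimizer.zero_grad()
with autocast(enabled=use_amp):      # run forward pass in mixed precision
    loss = nn.functional.mse_loss(model(x), y)
scaler.scale(loss).backward()        # scale the loss to avoid fp16 underflow
scaler.step(optimizer)               # unscale grads, then optimizer.step()
scaler.update()                      # adjust the scale factor for next step
```

With `enabled=False` the scaler and autocast become no-ops, so the same code path works for both AMP and full-precision training.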
16 changes: 8 additions & 8 deletions performance_profiling/radiology/profiling_train_base_nvtx.md
@@ -2,14 +2,14 @@
[NVIDIA Nsight™ Systems](https://developer.nvidia.com/nsight-systems) is a system-wide performance analysis tool designed to visualize an application’s algorithms, help to identify the largest opportunities to optimize, and tune to scale efficiently across any quantity or size of CPUs and GPUs.

# NVIDIA Tools Extension (NVTX)
The [NVIDIA® Tools Extension Library (NVTX)](https://github.com/NVIDIA/NVTX) is a powerful mechanism that allows users to manually instrument their application. With a C-based and a python-based Application Programming Interface (API) for annotating events, code ranges, and resources in your applications. Applications which integrate NVTX can use NVIDIA Nsight, Tegra System Profiler, and Visual Profiler to capture and visualize these events and ranges. In general, the NVTX can bring valuable insight into the application while incurring almost no overhead.
The [NVIDIA® Tools Extension Library (NVTX)](https://github.com/NVIDIA/NVTX) is a powerful mechanism that allows users to manually instrument their application. It provides C-based and Python-based Application Programming Interfaces (APIs) for annotating events, code ranges, and resources in your applications. Applications that integrate NVTX can use NVIDIA Nsight, Tegra System Profiler, and Visual Profiler to capture and visualize these events and ranges. In general, NVTX brings valuable insight into the application while incurring almost no overhead.

# MONAI Training Pipeline and NVTX
[MONAI](https://github.com/Project-MONAI/MONAI) is a high level framework for deep learning in healthcare imaging.
[MONAI](https://github.com/Project-MONAI/MONAI) is a high-level framework for deep learning in healthcare imaging.

For performance profiling, we mainly focus on two fronts: data loading/transforms, and training/validation iterations.

[Transforms](https://github.com/Project-MONAI/MONAI/tree/dev/monai/transforms) is one core concept of data handling in MONAI, similar to [TorchVision Transforms](https://pytorch.org/vision/stable/transforms.html). Several of these transforms are usually chained together, using a [Compose](https://github.com/Project-MONAI/MONAI/blob/2f1c7a5d1b47c8dd21681dbe1b67213aa3278cd7/monai/transforms/compose.py#L35) class, to create a preprocessing or postprocessing pipeline that performs manipulation of the input data and make it suitable for training a deep learning model or inference. To dig into the cost from each individual transform, we enable the insertion of NVTX annotations via [MONAI NVTX Transforms](https://github.com/Project-MONAI/MONAI/blob/dev/monai/utils/nvtx.py).
[Transforms](https://github.com/Project-MONAI/MONAI/tree/dev/monai/transforms) is one core concept of data handling in MONAI, similar to [TorchVision Transforms](https://pytorch.org/vision/stable/transforms.html). Several of these transforms are usually chained together, using a [Compose](https://github.com/Project-MONAI/MONAI/blob/2f1c7a5d1b47c8dd21681dbe1b67213aa3278cd7/monai/transforms/compose.py#L35) class, to create a preprocessing or postprocessing pipeline that performs manipulation of the input data and makes it suitable for training a deep learning model or inference. To dig into the cost of each transform, we enable the insertion of NVTX annotations via [MONAI NVTX Transforms](https://github.com/Project-MONAI/MONAI/blob/dev/monai/utils/nvtx.py).

Training and validation steps are easier to track: we set NVTX annotations within the loop.

@@ -34,19 +34,19 @@ After profiling, the computing details can be visualized via the Nsight Systems GUI.
## Observations
As shown in the above figure, we focus on two sections: CUDA (first row), and NVTX (last two rows). Nsight provides information regarding GPU utilization (CUDA), and specific NVTX tags we added to track certain behaviors.
In this example, we added NVTX tags to track each epoch, as well as operations within each step (data transforms, forward, backward, etc.). As shown in the second-to-last row, each solid red block represents a single epoch.
As we perform validation every two epochs, the even epochs will be longer than odd ones since it includes both training (green dashed region) and validation (blue dashed region).
As we perform validation every two epochs, the even epochs will be longer than the odd ones since they include both training (green dashed region) and validation (blue dashed region).
Also, in this pipeline we used `CacheDataset`, and the initial 40–50 seconds (red dashed region) are spent loading all training images into CPU RAM.
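The alternating epoch lengths described above follow directly from the validation schedule; a minimal sketch (the interval and epoch count here are illustrative):

```python
def validation_schedule(num_epochs: int, val_interval: int = 2):
    """Return, per 1-based epoch, whether validation runs that epoch."""
    schedule = []
    for epoch in range(1, num_epochs + 1):
        do_val = epoch % val_interval == 0  # validate every 2nd epoch
        schedule.append((epoch, do_val))
    return schedule
```

Epochs where validation runs include both the training and validation phases, which is why every second block in the timeline is longer.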

As can be observed from the figure, there are data loading/IO gaps between epochs (pointed by orange arrows).
As can be observed from the figure, there are data loading/IO gaps between epochs (pointed out by orange arrows).

Let's zoom-in and look closer at the beginning of the second epoch.
Let's zoom in and look closer at the beginning of the second epoch.

![png](Figure/nsight_base_zoom.png)

As shown in the zoomed view:
- Between epochs, there are considerable amount of time cost for data loading (pointed by the blue arrow);
- Between epochs, a considerable amount of time is spent on data loading (pointed out by the blue arrow);
- Between training steps, the data loading time is much smaller (pointed out by green arrows);
- The GPU utilization (CUDA HW) is decent during training step (pointed by red arrows);
- The GPU utilization (CUDA HW) is decent during the training step (pointed out by red arrows);

Upon further analysis, it appears that convergence is relatively slow (blue curve in the TensorBoard figure below). Therefore, there are two directions for performance improvement:
