628 Update performance profiling #632

Merged
merged 7 commits into from
Mar 30, 2022
@@ -21,6 +21,7 @@ The pipeline that we are profiling `rain_evaluate_nvtx_profiling.py` requires [C
Instead of the whole dataset, just for the experiment of this performance analysis, users can also download a single whole slide image `tumor_091.tif` from [here](https://drive.google.com/uc?id=1OxAeCMVqH9FGpIWpAXSEJe6cLinEGQtF), as well as its coordinates and labels (`dataset_0.json`), from [here](https://drive.google.com/uc?id=1F-lR9tXoFkPkC1yueM-_TyaFk3CO7v0s).

## Run Nsight Profiling
In `requirements.txt`, `cupy-cuda114` is set by default. If your CUDA version is different, you may need to change it to a suitable version; see [here](https://docs.cupy.dev/en/stable/install.html) for more details.
With the environment prepared from `requirements.txt`, we use `nsys profile` to collect information about the training pipeline's behavior across several steps. Since an epoch for pathology is long (covering 400,000 images), we profile the trainer under basic settings for 30 seconds, after a 50-second delay. All results shown below are from experiments performed on a DGX-2 workstation using a single V100 GPU over the full dataset.

```python
3 changes: 1 addition & 2 deletions performance_profiling/pathology/requirements.txt
@@ -4,6 +4,5 @@ torchvision
cucim==21.8.2
cupy-cuda114
pytorch-ignite
nvidia-pyindex
nvidia-dlprof[pytorch]
nvtx
tensorboard
26 changes: 13 additions & 13 deletions performance_profiling/radiology/profiling_train_base_nvtx.md
@@ -15,19 +15,19 @@ For training and validation steps, they are easier to track by setting NVTX anno

# Profiling Spleen Segmentation Pipeline
## Run Nsight Profiling
With environment prepared `requirements.txt`, we run DLprof (v1.4.0 / r21.08) on the trainer under basic settings for 6 epochs (with validation every 2 epochs). All results shown below are from experiments performed on a DGX-2 workstation using a single V-100 GPU.
With the environment prepared from `requirements.txt`, we run `nsys profile` on the trainer under basic settings for 6 epochs (with validation every 2 epochs). All results shown below are from experiments performed on a DGX-2 workstation using a single V100 GPU.

```python
!dlprof --mode pytorch \
--reports=summary \
--formats json \
--output_path ./outputs_base \
python3 train_base_nvtx.py
nsys profile \
--output ./output_base \
--force-overwrite true \
--trace-fork-before-exec true \
python3 train_base_nvtx.py
```
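
The `nsys profile` invocation above can also be assembled from Python, which makes it easier to add a collection window such as the 30-second capture after a 50-second delay described for the pathology pipeline. The following is a minimal sketch and not part of this PR: `build_nsys_command` and `run_if_available` are hypothetical helpers, and the sketch assumes the installed `nsys` accepts `--delay` and `--duration` (both in seconds).

```python
import shlex
import shutil
import subprocess


def build_nsys_command(script, output, delay=None, duration=None):
    """Return the argument list for an `nsys profile` run (hypothetical helper)."""
    cmd = [
        "nsys", "profile",
        "--output", output,
        "--force-overwrite", "true",
        "--trace-fork-before-exec", "true",
    ]
    # Optional collection window: wait `delay` seconds, then record `duration` seconds.
    if delay is not None:
        cmd += ["--delay", str(delay)]
    if duration is not None:
        cmd += ["--duration", str(duration)]
    cmd += ["python3", script]
    return cmd


def run_if_available(cmd):
    """Launch the command only when `nsys` is on PATH; return whether it ran."""
    if shutil.which(cmd[0]) is None:
        return False
    subprocess.run(cmd, check=True)
    return True


# Print the command line instead of launching it.
print(shlex.join(build_nsys_command("train_base_nvtx.py", "./output_base")))
```

Calling `run_if_available(build_nsys_command("train_base_nvtx.py", "./output_base", delay=50, duration=30))` would start the profiled run only on a machine where `nsys` is installed.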

# Identify Potential Performance Improvements
## Profile Results
After profiling, DLProf provides summary regarding the training process. Also, the computing details can be visualized via Nsight System GUI. (The version of Nsight used in the tutorial is 2021.3.1.54-ee9c30a OSX)
After profiling, the computing details can be visualized in the Nsight Systems GUI. (The version of Nsight Systems used in this tutorial is 2021.3.1.54-ee9c30a for OSX.)

![png](Figure/nsight_base.png)
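
Named ranges only show up on the timeline if the script marks them with NVTX annotations. Below is a minimal sketch of such annotations using the `nvtx` package listed in `requirements.txt`. The function names and the toy workload are illustrative assumptions and are not taken from `train_base_nvtx.py`; the fallback exists only so the sketch runs where `nvtx` is not installed.

```python
import contextlib

try:
    import nvtx  # provided by the `nvtx` package in requirements.txt

    def annotate(name, color="green"):
        # Context manager that opens an NVTX range visible to the profiler.
        return nvtx.annotate(name, color=color)
except ImportError:
    def annotate(name, color="green"):
        # No-op stand-in so the sketch still runs without nvtx installed.
        return contextlib.nullcontext()


def train_step(batch):
    """Toy stand-in for one training iteration (illustrative only)."""
    with annotate("forward"):
        outputs = [x * 2 for x in batch]
    with annotate("loss", color="red"):
        loss = sum(outputs) / len(outputs)
    return loss


def train_epoch(batches):
    """Wrap every step in its own range so steps can be told apart."""
    losses = []
    for step, batch in enumerate(batches):
        with annotate(f"step_{step}", color="blue"):
            losses.append(train_step(batch))
    return losses


print(train_epoch([[1, 2, 3], [4, 5, 6]]))
```

Nesting the `forward` and `loss` ranges inside a per-step range is one way to see where the time goes within each step when the script runs under `nsys profile`.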

@@ -59,14 +59,14 @@ One optimized solution can be found [here](https://github.com/Project-MONAI/tuto

# Analyzing Performance Improvement
## Profile Results
We again use DLProf to further analyze the optimized training script.
We again use `nsys profile` to further analyze the optimized training script.

```python
!dlprof --mode pytorch \
--reports=summary \
--formats json \
--output_path ./outputs_fast \
python3 train_fast_nvtx.py
nsys profile \
--output ./outputs_fast \
--force-overwrite true \
--trace-fork-before-exec true \
python3 train_fast_nvtx.py
```
And the profiling result is

3 changes: 1 addition & 2 deletions performance_profiling/radiology/requirements.txt
@@ -2,6 +2,5 @@ git+https://github.com/Project-MONAI/MONAI
pytorch-ignite
nibabel
tqdm
nvidia-pyindex
nvidia-dlprof[pytorch]
nvtx
tensorboard
6 changes: 2 additions & 4 deletions performance_profiling/radiology/train_base_nvtx.py
@@ -22,7 +22,6 @@
from torch.utils.tensorboard import SummaryWriter
torch.backends.cudnn.benchmark = True

import nvidia_dlprof_pytorch_nvtx
import nvtx

from monai.apps import download_and_extract
@@ -47,7 +46,6 @@
)
from monai.utils import Range, set_determinism

nvidia_dlprof_pytorch_nvtx.init()

# set directories
random.seed(0)
@@ -143,7 +141,7 @@
num_workers=8
)
train_loader = DataLoader(
train_ds, num_workers=8, batch_size=4, shuffle=True
train_ds, num_workers=0, batch_size=4, shuffle=True
)
val_ds = CacheDataset(
data=val_files,
@@ -152,7 +150,7 @@
num_workers=8
)
val_loader = DataLoader(
val_ds, num_workers=8, batch_size=1
val_ds, num_workers=0, batch_size=1
)

# standard PyTorch program style: create UNet, DiceLoss and Adam optimizer
2 changes: 0 additions & 2 deletions performance_profiling/radiology/train_fast_nvtx.py
@@ -22,7 +22,6 @@
from torch.utils.tensorboard import SummaryWriter
torch.backends.cudnn.benchmark = True

import nvidia_dlprof_pytorch_nvtx
import nvtx

from monai.apps import download_and_extract
@@ -51,7 +50,6 @@
from monai.utils import set_determinism
from monai.utils.nvtx import Range

nvidia_dlprof_pytorch_nvtx.init()

# set directories
random.seed(0)