Project-MONAI · wyli · Mar 30, 2022 · Mar 25, 2022 · Mar 29, 2022 · Mar 29, 2022
diff --git a/performance_profiling/pathology/profiling_train_base_nvtx.md b/performance_profiling/pathology/profiling_train_base_nvtx.md
@@ -21,6 +21,7 @@ The pipeline that we are profiling `rain_evaluate_nvtx_profiling.py` requires [C
 Instead of the whole dataset, just for the experiment of this performance analysis, users can also download a single whole slide image `tumor_091.tif` from [here](https://drive.google.com/uc?id=1OxAeCMVqH9FGpIWpAXSEJe6cLinEGQtF), as well as its coordinates and labels (`dataset_0.json`), from [here](https://drive.google.com/uc?id=1F-lR9tXoFkPkC1yueM-_TyaFk3CO7v0s).
 
 ## Run Nsight Profiling
+In `requirements.txt`, `cupy-cuda114` is set in default. If your cuda version is different, you may need to modify it into a suitable version, you can refer to [here](https://docs.cupy.dev/en/stable/install.html) for more details.
 With environment prepared `requirements.txt`, we use `nsys profile` to get the information regarding the training pipeline's behavior across several steps. Since an epoch for pathology is long (covering 400,000 images), here we run profile on the trainer under basic settings for 30 seconds, with 50 seconds' delay. All results shown below are from experiments performed on a DGX-2 workstation using a single V-100 GPU over the full dataset.
 
 ```python

diff --git a/performance_profiling/pathology/requirements.txt b/performance_profiling/pathology/requirements.txt
@@ -4,6 +4,5 @@ torchvision
 cucim==21.8.2
 cupy-cuda114
 pytorch-ignite
-nvidia-pyindex
-nvidia-dlprof[pytorch]
 nvtx
+tensorboard
diff --git a/performance_profiling/radiology/profiling_train_base_nvtx.md b/performance_profiling/radiology/profiling_train_base_nvtx.md
@@ -15,19 +15,19 @@ For training and validation steps, they are easier to track by setting NVTX anno
 
 # Profiling Spleen Segmentation Pipeline
 ## Run Nsight Profiling
-With environment prepared `requirements.txt`, we run DLprof (v1.4.0 / r21.08) on the trainer under basic settings for 6 epochs (with validation every 2 epochs). All results shown below are from experiments performed on a DGX-2 workstation using a single V-100 GPU.
+With environment prepared `requirements.txt`, we use `nsys profile` on the trainer under basic settings for 6 epochs (with validation every 2 epochs). All results shown below are from experiments performed on a DGX-2 workstation using a single V-100 GPU.
 
 ```python
-!dlprof --mode pytorch \
-        --reports=summary \
-        --formats json \
-        --output_path ./outputs_base \
-        python3 train_base_nvtx.py
+nsys profile \
+     --output ./output_base \
+     --force-overwrite true \
+     --trace-fork-before-exec true \
+     python3 train_base_nvtx.py
 ```
 
 # Identify Potential Performance Improvements
 ## Profile Results
-After profiling, DLProf provides summary regarding the training process. Also, the computing details can be visualized via Nsight System GUI. (The version of Nsight used in the tutorial is 2021.3.1.54-ee9c30a OSX)
+After profiling, the computing details can be visualized via Nsight System GUI. (The version of Nsight used in the tutorial is 2021.3.1.54-ee9c30a OSX)
 
 ![png](Figure/nsight_base.png)
 
@@ -59,14 +59,14 @@ One optimized solution can be found [here](https://github.com/Project-MONAI/tuto
 
 # Analyzing Performance Improvement
 ## Profile Results
-We again use DLProf to further analyze the optimized training script.
+We again use `nsys profile` to further analyze the optimized training script.
 
 ```python
-!dlprof --mode pytorch \
-        --reports=summary \
-        --formats json \
-        --output_path ./outputs_fast \
-        python3 train_fast_nvtx.py
+nsys profile \
+     --output ./outputs_fast \
+     --force-overwrite true \
+     --trace-fork-before-exec true \
+     python3 train_fast_nvtx.py
 ```
 And the profiling result is
 

diff --git a/performance_profiling/radiology/requirements.txt b/performance_profiling/radiology/requirements.txt
@@ -2,6 +2,5 @@ git+https://github.com/Project-MONAI/MONAI
 pytorch-ignite
 nibabel
 tqdm
-nvidia-pyindex
-nvidia-dlprof[pytorch]
 nvtx
+tensorboard
diff --git a/performance_profiling/radiology/train_base_nvtx.py b/performance_profiling/radiology/train_base_nvtx.py
@@ -22,7 +22,6 @@
 from torch.utils.tensorboard import SummaryWriter
 torch.backends.cudnn.benchmark = True
 
-import nvidia_dlprof_pytorch_nvtx
 import nvtx
 
 from monai.apps import download_and_extract
@@ -47,7 +46,6 @@
 )
 from monai.utils import Range, set_determinism
 
-nvidia_dlprof_pytorch_nvtx.init()
 
 # set directories
 random.seed(0)
@@ -143,7 +141,7 @@
     num_workers=8
 )
 train_loader = DataLoader(
-    train_ds, num_workers=8, batch_size=4, shuffle=True
+    train_ds, num_workers=0, batch_size=4, shuffle=True
 )
 val_ds = CacheDataset(
     data=val_files,
@@ -152,7 +150,7 @@
     num_workers=8
 )
 val_loader = DataLoader(
-    val_ds, num_workers=8, batch_size=1
+    val_ds, num_workers=0, batch_size=1
 )
 
 # standard PyTorch program style: create UNet, DiceLoss and Adam optimizer

diff --git a/performance_profiling/radiology/train_fast_nvtx.py b/performance_profiling/radiology/train_fast_nvtx.py
@@ -22,7 +22,6 @@
 from torch.utils.tensorboard import SummaryWriter
 torch.backends.cudnn.benchmark = True
 
-import nvidia_dlprof_pytorch_nvtx
 import nvtx
 
 from monai.apps import download_and_extract
@@ -51,7 +50,6 @@
 from monai.utils import set_determinism
 from monai.utils.nvtx import Range
 
-nvidia_dlprof_pytorch_nvtx.init()
 
 # set directories
 random.seed(0)