Long training time for detection #7619

Thibescobar · 2024-04-10T15:12:32Z

Thibescobar
Apr 10, 2024

Hello, here are the details of the long training time problem I am facing, following up on my previous post: Project-MONAI/model-zoo#577

I am using the model zoo's lung nodule CT detection bundle to train the other folds (the zoo already provides a trained model for fold 0): https://monai.io/model-zoo.html

To make sure the long training time does not come from the way I use the bundle, I switched to the plain Python script approach from the lung nodule detection tutorial here: https://github.com/Project-MONAI/tutorials/tree/main/detection

Unfortunately, the time is the same... So I investigated the code and added time markers using print().

In the following piece of code, with AMP enabled, execution is slow at the lines right after the time markers "5.1 (amp)" and "5.4 (amp)" (~10s for the whole iteration). Without AMP (amp = False), it stalls at "5.1 (no amp)" and "6", and the whole iteration takes even longer (~120s). Could you please help me find out why and how to fix it, if possible, or give me some hints?

for epoch in range(max_epochs):
        # ------------- Training -------------
        print("-" * 10)
        print(f"epoch {epoch + 1}/{max_epochs}")
        detector.train()
        epoch_loss = 0
        epoch_cls_loss = 0
        epoch_box_reg_loss = 0
        step = 0
        start_time = time.time()
        scheduler_warmup.step()
        # Training
        for batch_data in train_loader:
            #Start a timer
            start_it_time = time.time()
            
            print("----time marker 1")
            step += 1

            print("----time marker 2")    
            inputs = [
                batch_data_ii["image"].to(device) for batch_data_i in batch_data for batch_data_ii in batch_data_i
            ]
            
            print("----time marker 3")
            targets = [
                dict(
                    label=batch_data_ii["label"].to(device),
                    box=batch_data_ii["box"].to(device),
                )
                for batch_data_i in batch_data
                for batch_data_ii in batch_data_i
            ]

            print("----time marker 4")
            for param in detector.network.parameters():
                param.grad = None

            print("----time marker 5")
            if amp and (scaler is not None):
                with torch.cuda.amp.autocast():
                    print("----time marker 5.1 (amp)") #long here
                    outputs = detector(inputs, targets)
                    print("----time marker 5.2 (amp)")
                    loss = w_cls * outputs[detector.cls_key] + outputs[detector.box_reg_key]
                print("----time marker 5.3 (amp)")
                scaler.scale(loss).backward()
                print("----time marker 5.4 (amp)") #long here
                scaler.step(optimizer)
                print("----time marker 5.5 (amp)")
                scaler.update()
            else:
                print("----time marker 5.1 (no amp)") #long here
                outputs = detector(inputs, targets)
                print("----time marker 5.2 (no amp)")
                loss = w_cls * outputs[detector.cls_key] + outputs[detector.box_reg_key]
                print("----time marker 5.3 (no amp)")
                loss.backward()
                print("----time marker 5.4 (no amp)")
                optimizer.step()
                print("----time marker 5.5 (no amp)")

            # save to tensorboard
            print("----time marker 6") #long here
            epoch_loss += loss.detach().item()
            print("----time marker 7")
            epoch_cls_loss += outputs[detector.cls_key].detach().item()
            print("----time marker 8")
            epoch_box_reg_loss += outputs[detector.box_reg_key].detach().item()
            print("----time marker 9")
            tensorboard_writer.add_scalar("train_loss", loss.detach().item(), epoch_len * epoch + step)
            print("----time marker 10")
            end_it_time = time.time()
            print(f"{step}/{epoch_len} (epoch {epoch + 1}), train_loss: {loss.item():.4f}, time for iteration: {end_it_time-start_it_time}s")
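
One caveat when reading the markers above: CUDA kernels are launched asynchronously, so a print() placed right after detector(inputs, targets) or loss.backward() may fire before the GPU has finished that work. The wait then shows up at the next call that needs a result on the host, such as loss.detach().item() after marker 6. scaler.step(optimizer) after marker 5.4 probably behaves the same way, since the scaler has to inspect the gradients for inf/NaN before stepping. Below is a minimal sketch of a timer that synchronizes before reading the clock. `timed` is a hypothetical helper, not part of MONAI or PyTorch, and passing `torch.cuda.synchronize` as `sync` is an assumption about how it would be wired into this loop.

```python
import time
from contextlib import contextmanager


@contextmanager
def timed(label, sync=None, results=None):
    """Time the enclosed block and print the elapsed wall time.

    sync:    optional callable run before starting and before stopping the
             clock. For GPU code this would be torch.cuda.synchronize
             (assumption: a CUDA device is in use), so that queued kernels
             are finished before the time is read.
    results: optional dict that accumulates elapsed seconds per label.
    """
    if sync is not None:
        sync()
    start = time.perf_counter()
    try:
        yield
    finally:
        if sync is not None:
            sync()
        elapsed = time.perf_counter() - start
        if results is not None:
            results[label] = results.get(label, 0.0) + elapsed
        print(f"{label}: {elapsed:.3f}s")


# Stand-in workload so the sketch runs without a GPU.
timings = {}
with timed("forward", results=timings):
    time.sleep(0.01)
```

In the training loop this would be used as `with timed("forward", sync=torch.cuda.synchronize): outputs = detector(inputs, targets)`, and likewise around the backward pass and the optimizer step, so that each phase is charged with its own GPU time.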

My configuration is:

Computer: Laptop Dell Precision 7670
OS: Windows 10 Professional (22H2)
System type: x64
GPU: NVIDIA RTX A3000 12GB
CPU: 12th Gen Intel(R) Core(TM) i7-12850HX 2.10 GHz
RAM: 32GB
Python version: 3.10.14
MONAI version: 1.3.0
MONAI Weekly version: 1.4.dev2414
Pytorch version: torch 2.2.2+cu118
cuDNN version:
- 8.7, reported by torch.backends.cudnn.version() inside the active conda env (conda activate monailuna) (I think this is the one in use)
- 8.6, installed outside the active conda env at C:\Program Files\NVIDIA GPU Computing Toolkit\CUDNN
CUDA version:
- 11.8, reported by torch.version.cuda inside the active conda env (conda activate monailuna) (I think this is the one in use)
- 11.7, reported by nvcc --version (version installed outside the active conda env)
- 12.2, reported by nvidia-smi (compatible version, but not the one installed?)
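
On the version question: the pip/conda PyTorch binaries ship their own CUDA runtime and cuDNN, so the values reported from inside Python are the ones the training actually uses. nvcc reports the separately installed toolkit, and nvidia-smi reports the highest CUDA version the driver supports. A small sketch to print the relevant values in one go follows. The function name `torch_build_info` is made up for illustration, and the import is guarded so the snippet also runs where torch is missing.

```python
def torch_build_info():
    """Collect the CUDA/cuDNN versions that the installed PyTorch build uses."""
    try:
        import torch
    except ImportError:
        # torch is not installed in this environment
        return {"torch": None}
    info = {
        "torch": torch.__version__,
        "cuda_runtime": torch.version.cuda,        # None on CPU-only builds
        "cudnn": torch.backends.cudnn.version(),   # None if cuDNN is unavailable
        "cuda_available": torch.cuda.is_available(),
    }
    if info["cuda_available"]:
        info["gpu"] = torch.cuda.get_device_name(0)
    return info


for key, value in torch_build_info().items():
    print(f"{key}: {value}")
```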

Thank you very much in advance.

KumoLiu · 2024-04-10T16:01:49Z

KumoLiu
Apr 10, 2024
Maintainer

Hi @Can-Zhao, would you mind sharing your insights on this matter?
If I recall correctly, for our benchmark on the A100, it took over 700 hours, right?
Thanks for your help in advance!

5 replies

Thibescobar Apr 10, 2024
Author

Thank you for your quick answer. It would indeed be great if you have some measurements/benchmarks of training times on different GPUs.

Based on your message @KumoLiu, is my RTX A3000 comparable to the A100 you mention? Sorry, I do not know much about hardware...

If so, that would save me a lot of time searching for what is wrong with the code, data, versions, etc., since it would be a hardware limitation that I cannot overcome without changing hardware, right?

Thank you very much!

Can-Zhao Apr 10, 2024
Collaborator

10s for one batch seems too long... Could you first check whether CPU RAM is the bottleneck?

Thibescobar Apr 10, 2024
Author

I checked the CPU and RAM usage while running the training with AMP enabled, but could not reach a conclusion...

Using the print() markers, I tried to understand what happens by relating the code lines where it stalls to the GPU, CPU, and RAM usage curves.

The marker "5.1 (amp)" corresponds to the line executing outputs = detector(inputs, targets)
The "5.4 (amp)" is when doing scaler.step(optimizer)

I do not know how to interpret this behavior, but there is clearly a correlation.

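During the training execution, the GPU usage is like this:

[Screenshot: GPU usage]

The CPU usage is like this:

[Screenshot: per-core CPU usage]

The CPU usage when merging all cores into one curve is like this, which clearly shows the peaks at outputs = detector(inputs, targets) (5.1):

[Screenshot: merged CPU usage]

The RAM usage follows:

[Screenshot: RAM usage]

The disk usage is like this:

[Screenshot: disk usage]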

Do you see a bottleneck?
I suspect something related to the AMP cast, but disabling AMP is even worse: about 10 times slower.
What about the 700 hours mentioned by @KumoLiu? What were the causes? Am I in this situation?

KumoLiu Apr 11, 2024
Maintainer

Hi @Thibescobar,

Your observation that AMP enables faster execution than its disabled counterpart is indeed valid, especially given that the "time marker 5.1" corresponds to the forward pass which is generally the most time-consuming step.
For your reference, I've come across some benchmarking data that might be of interest and could potentially account for the observed difference, as it could be attributed to the hardware utilized. You can find it at this link: https://bizon-tech.com/gpu-benchmarks/NVIDIA-RTX-A6000-vs-NVIDIA-A100-40-GB-(PCIe)/585vs592

What do you think? cc @Can-Zhao

Thibescobar Apr 16, 2024
Author

Hello,

Have you got any advice @Can-Zhao, @KumoLiu please?

Have a nice day.

Thibescobar · 2024-05-23T13:14:17Z

Thibescobar
May 23, 2024
Author

Hello @KumoLiu, @Can-Zhao,

I found out that a colleague had the same problem with nnDetection and fixed it by feeding in zarr files instead of nii.gz ones.

Does that sound adaptable to the bundle-based use of MONAI detection? If so, could you give me a hint please?

Have a nice day.

3 replies

chanuan Dec 28, 2024

Excuse me, have you found the cause of the problem? I am using a 3090 24G. It takes about 1000 seconds to run one epoch. I don't know if this speed is normal. I am also looking for ways to speed it up.

Thibescobar Dec 29, 2024
Author

Hello. Unfortunately, no. I switched to nnDetection, which is very efficient in training time and accuracy, at the price of losing all the work done by MONAI for deployment... Are you working on lung nodules? Best

chanuan Feb 27, 2025

Hello. Unfortunately, no. I switched to nnDetection, which is very efficient in training time and accuracy. At the cost of... sacrificing the MONAI deployment work. Are you researching lung nodules? Best

I'm sorry for the late reply. I recently found that the data loading step takes a very long time, accounting for 90% of the total execution time. I haven't found a good solution yet. Could you please tell me the runtime and configuration you use with nnDetection? I'm considering switching to nnDetection. Thank you
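
One way to check a figure like the 90% mentioned above is to time the wait for each batch separately from the training step. The following is only a sketch using the standard library. `timed_batches` and `loading_fraction` are hypothetical helper names, and `train_step` stands for whatever runs the forward/backward pass (it should synchronize the GPU before returning, otherwise queued GPU work can be misattributed to the next batch wait).

```python
import time


def timed_batches(loader):
    """Yield (batch, wait_seconds), where wait_seconds is the time spent
    blocked in next(), i.e. waiting for the loader to produce the batch."""
    it = iter(loader)
    while True:
        start = time.perf_counter()
        try:
            batch = next(it)
        except StopIteration:
            return
        yield batch, time.perf_counter() - start


def loading_fraction(loader, train_step):
    """Return the fraction of wall time spent waiting on data (0.0 to 1.0)."""
    wait_total = 0.0
    step_total = 0.0
    for batch, wait in timed_batches(loader):
        wait_total += wait
        start = time.perf_counter()
        train_step(batch)
        step_total += time.perf_counter() - start
    total = wait_total + step_total
    return wait_total / total if total > 0 else 0.0
```

With a real loop, `loading_fraction(train_loader, run_one_iteration)` close to 1.0 would point at the data pipeline (decompression of nii.gz files, transforms, too few workers) rather than at the GPU. `run_one_iteration` is again a placeholder name.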
