✨ Add Multi-GPU Support to v1.1 #1449

lemonbuilder · 2023-10-30T08:08:40Z

What is the motivation for this task?

I'm going to train on a custom dataset using the EfficientAd model.
How do I train or test using multiple GPUs?
Please tell me which command to use.

Describe the solution you'd like

Currently, I'm training on a single device only:
$ python3 tools/train.py --model efficient_ad

Additional context

No response

lemonbuilder · 2023-11-05T06:30:18Z

@samet-akcay Is there any code implementation for using multiple GPUs?

samet-akcay · 2024-02-09T15:01:18Z

@lemonbuilder, this has now been added to the roadmap.

This task would close the following issues: #930 #1110 #1398

nguyenanhtuan1008 · 2024-02-12T06:17:08Z

@samet-akcay, sorry, I got an error when training with multiple GPUs on v1. How can I use only one GPU, for example ID 3, for training? Right now I'm using this code:

# Import the required modules
from anomalib.data import MVTec
from anomalib.models import EfficientAd
from anomalib.engine import Engine

# Initialize the datamodule, model and engine
datamodule = MVTec()
model = EfficientAd()

engine = Engine()

# Train the model
engine.fit(datamodule=datamodule, model=model)

samet-akcay · 2024-02-12T10:20:56Z

@nguyenanhtuan1008, you could refer to this link.
https://lightning.ai/docs/pytorch/stable/accelerators/gpu_basic.html#choosing-gpu-devices

In this case, you could initialize the Engine class as:

engine = Engine(accelerator="gpu", devices=[3])
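For reference, the linked Lightning page describes how the devices argument is read: a bare number, whether passed as an int or as a string such as "3", is a count of GPUs, while a list or a comma-separated string names specific GPU indices. The sketch below mirrors that rule in plain Python. `parse_devices` is a hypothetical helper written for illustration, not part of Lightning or Anomalib, and `available` stands in for the number of GPUs on the machine.

```python
def parse_devices(devices, available):
    """Return the GPU indices a `devices` value selects (illustrative only)."""
    if isinstance(devices, str):
        text = devices.strip()
        if "," in text:
            # A comma-separated string such as "1, 3" names explicit GPU indices.
            return [int(part) for part in text.split(",") if part.strip()]
        # A string holding a single number is treated like that integer.
        devices = int(text)
    if isinstance(devices, int):
        if devices == -1:
            # -1 means every available GPU.
            return list(range(available))
        # A positive integer is a count: the first `devices` GPUs.
        return list(range(devices))
    # A list or tuple already names explicit GPU indices.
    return list(devices)


print(parse_devices("3", available=8))     # [0, 1, 2]
print(parse_devices([3], available=8))     # [3]
print(parse_devices("1, 3", available=8))  # [1, 3]
```

So a one-element list is the form that selects a single GPU by its ID.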

nguyenanhtuan1008 · 2024-02-15T02:37:09Z

@samet-akcay
Thank you so much.
I got the training to work, but I still got an error after 1 epoch, so I gave up and am using a single GPU for now.

Bepitic · 2024-03-13T22:22:02Z

Hello, I wish to take this issue.
Thank you @samet-akcay, and the good work.

RitikaxShakya · 2024-03-28T12:43:04Z

Hi @samet-akcay I would like to work on this issue. Can I take this issue?

samet-akcay · 2024-03-28T12:51:42Z

@RitikaxShakya, thanks for your interest. I totally missed this one, but it looks like @Bepitic has already shown interest in this. If he doesn't want to work on it, it could be all yours. How does that sound?

samet-akcay · 2024-03-28T12:52:07Z

@Bepitic, are you still interested in this issue? If not, @RitikaxShakya can take it.

Bepitic · 2024-03-28T13:01:42Z

Yes, for sure. Since no one confirmed it with me, I also forgot about the multi-GPU one 😅

samet-akcay · 2024-03-28T13:07:37Z

sorry about that

samet-akcay · 2024-03-28T13:07:48Z

@RitikaxShakya, all yours then

RitikaxShakya · 2024-03-28T13:41:39Z

.take

RitikaxShakya · 2024-04-06T21:52:45Z

@blaz-r @samet-akcay Hello! I need help with the parts of the code that deal with GPU initialization, data parallelization, and GPU-specific operations, as these are the areas I think I'll need to modify to add multi-GPU support.

blaz-r · 2024-04-10T10:56:51Z

I am not that familiar with these topics within Anomalib. @ashwinvaidya17 could you provide some insight here?

RitikaxShakya · 2024-04-12T06:32:35Z

@ashwinvaidya17 Hello! Please help me with the parts of the code that deal with GPU initialization, data parallelization, and GPU-specific operations, as these are the areas I think I'll need to modify to add multi-GPU support.

ashwinvaidya17 · 2024-04-12T08:18:28Z

@RitikaxShakya currently we override the number of devices to 1 in Engine and the CLI.

To start with, we should remove these lines.

anomalib/src/anomalib/engine/engine.py

Line 305 in debdae7

self._cache.args["devices"] = 1

anomalib/src/anomalib/utils/config.py

Line 130 in debdae7

config.trainer.devices = 1

Doing this will break a bunch of stuff across the repo.

For example, all the trainer.model calls will break.

anomalib/src/anomalib/callbacks/checkpoint.py

Line 38 in debdae7

is_zero_or_few_shot = trainer.model.learning_type in [LearningType.ZERO_SHOT, LearningType.FEW_SHOT]

These should be replaced with trainer.lightning_module
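Why trainer.model breaks can be shown without any GPU. Under DDP, trainer.model is the DistributedDataParallel wrapper, which does not expose custom attributes of the module it wraps, whereas trainer.lightning_module is the unwrapped module. The sketch below mimics that behaviour with plain Python stand-ins; `FakeModel`, `FakeDDP` and `FakeTrainer` are hypothetical names used only for illustration and are not Lightning or Anomalib classes.

```python
class FakeModel:
    """Stand-in for an Anomalib Lightning module with a custom attribute."""

    learning_type = "one_class"


class FakeDDP:
    """Stand-in for a wrapper that keeps the model in `.module` only."""

    def __init__(self, module):
        self.module = module


class FakeTrainer:
    """Stand-in trainer: `model` may be wrapped, `lightning_module` never is."""

    def __init__(self, module, distributed):
        self._module = module
        self.model = FakeDDP(module) if distributed else module

    @property
    def lightning_module(self):
        return self._module


def learning_type_via_model(trainer):
    # Works on a single device, fails once the model is wrapped.
    return trainer.model.learning_type


def learning_type_via_lightning_module(trainer):
    # Works in both the single-device and the wrapped case.
    return trainer.lightning_module.learning_type


trainer = FakeTrainer(FakeModel(), distributed=True)
print(learning_type_via_lightning_module(trainer))  # one_class
try:
    learning_type_via_model(trainer)
except AttributeError as err:
    print(f"AttributeError: {err}")
```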
You will also need to test each model and replace all .cpu() operations, which we use to move large tensors out of CUDA memory to mitigate OOM issues. In the case of Padim, the following line will break

anomalib/src/anomalib/models/image/padim/lightning_model.py

Line 86 in debdae7

self.stats = self.model.gaussian.fit(embeddings)

as the embeddings are on the CPU. They should be moved to CUDA before calling fit; something as simple as .to(self.device) should do it. From my initial experiments this isn't sufficient to make the model work, but it's a good start.
I am not sure whether this is affected by distributed training, but you might also need to look at thresholding and metrics computation

anomalib/src/anomalib/callbacks/thresholding.py

Line 182 in debdae7

output = output.cpu()

I might have missed something so feel free to report any difficulties you run into.

haimat · 2024-06-05T18:49:03Z

Using latest anomalib 1.1.0 from pip I create the Engine like so:

    engine = Engine(
        max_epochs=100,
        task=task_type,
        accelerator="gpu",
        devices=-1,
    )

By passing devices=-1 I thought training would utilize all my available GPUs.
PyTorch can see them, I get this output from training:

LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0,1,2,3]

However, when looking at the output from nvidia-smi I can see that only one GPU is used.
I also tried to pass strategy="ddp" when creating the engine, however, then I get this error:

Traceback (most recent call last):
  File "/data/scratch/mkw-anomalib/anomalib-test.py", line 41, in <module>
    train()
  File "/data/scratch/mkw-anomalib/anomalib-test.py", line 36, in train
    engine.fit(datamodule=datamodule, model=model)
  File "/home/sinntelligence/.local/lib/python3.10/site-packages/anomalib/engine/engine.py", line 540, in fit
    self.trainer.fit(model, train_dataloaders, val_dataloaders, datamodule, ckpt_path)
  File "/home/sinntelligence/.local/lib/python3.10/site-packages/lightning/pytorch/trainer/trainer.py", line 544, in fit
    call._call_and_handle_interrupt(
  File "/home/sinntelligence/.local/lib/python3.10/site-packages/lightning/pytorch/trainer/call.py", line 43, in _call_and_handle_interrupt
    return trainer.strategy.launcher.launch(trainer_fn, *args, trainer=trainer, **kwargs)
  File "/home/sinntelligence/.local/lib/python3.10/site-packages/lightning/pytorch/strategies/launchers/subprocess_script.py", line 105, in launch
    return function(*args, **kwargs)
  File "/home/sinntelligence/.local/lib/python3.10/site-packages/lightning/pytorch/trainer/trainer.py", line 580, in _fit_impl
    self._run(model, ckpt_path=ckpt_path)
  File "/home/sinntelligence/.local/lib/python3.10/site-packages/lightning/pytorch/trainer/trainer.py", line 987, in _run
    results = self._run_stage()
  File "/home/sinntelligence/.local/lib/python3.10/site-packages/lightning/pytorch/trainer/trainer.py", line 1033, in _run_stage
    self.fit_loop.run()
  File "/home/sinntelligence/.local/lib/python3.10/site-packages/lightning/pytorch/loops/fit_loop.py", line 205, in run
    self.advance()
  File "/home/sinntelligence/.local/lib/python3.10/site-packages/lightning/pytorch/loops/fit_loop.py", line 363, in advance
    self.epoch_loop.run(self._data_fetcher)
  File "/home/sinntelligence/.local/lib/python3.10/site-packages/lightning/pytorch/loops/training_epoch_loop.py", line 140, in run
    self.advance(data_fetcher)
  File "/home/sinntelligence/.local/lib/python3.10/site-packages/lightning/pytorch/loops/training_epoch_loop.py", line 269, in advance
    call._call_callback_hooks(trainer, "on_train_batch_end", batch_output, batch, batch_idx)
  File "/home/sinntelligence/.local/lib/python3.10/site-packages/lightning/pytorch/trainer/call.py", line 208, in _call_callback_hooks
    fn(trainer, trainer.lightning_module, *args, **kwargs)
  File "/home/sinntelligence/.local/lib/python3.10/site-packages/lightning/pytorch/callbacks/model_checkpoint.py", line 295, in on_train_batch_end
    if self._should_skip_saving_checkpoint(trainer):
  File "/home/sinntelligence/.local/lib/python3.10/site-packages/anomalib/callbacks/checkpoint.py", line 38, in _should_skip_saving_checkpoint
    is_zero_or_few_shot = trainer.model.learning_type in [LearningType.ZERO_SHOT, LearningType.FEW_SHOT]
  File "/home/sinntelligence/.local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1688, in __getattr__
    raise AttributeError(f"'{type(self).__name__}' object has no attribute '{name}'")
AttributeError: 'DistributedDataParallel' object has no attribute 'learning_type'

Is this a bug, or how can I use all my GPUs for training with anomalib 1.1.0?

samet-akcay · 2024-06-05T19:06:52Z

@haimat, we aim to enable multi-gpu support in v1.2

haimat · 2024-06-05T19:12:17Z

@samet-akcay Thanks for your quick reply.
So around end of July?

samet-akcay · 2024-06-05T19:14:43Z

yeah, that's the plan hopefully :)

haimat · 2024-08-02T08:46:42Z

Hello guys, any news on this? Do you have an ETA for 1.2 and multi-GPU training?

haimat · 2024-08-08T07:41:18Z

@samet-akcay Hello Samet, can you estimate when multi-GPU training will be available?

ashwinvaidya17 · 2024-08-09T07:12:07Z

@haimat unfortunately we don't have an exact timeline for this. Currently, we are busy with some other high-priority tasks.

watertianyi · 2024-08-19T01:26:22Z

@samet-akcay
I used the following parameters for multi-GPU training, but only a single GPU is used. Also, how do I set the number of epochs? This is quite urgent for me.

from anomalib.models import Patchcore
from anomalib.engine import Engine

# Create the model and engine
model = Patchcore()
engine = Engine(max_epochs=30, task="classification", accelerator="gpu", devices=3)

# Train a Patchcore model on the given datamodule
engine.train(datamodule=datamodule, model=model)

What is the default number of epochs?

samet-akcay · 2024-08-19T07:09:58Z

@goldwater668, as mentioned above, multi-GPU is not currently supported. The devices parameter is overwritten here to avoid any errors caused by multi-GPU issues.

anomalib/src/anomalib/engine/engine.py

Lines 327 to 328 in 2bd2842

        # Temporarily set devices to 1 to avoid issues with multiple processes
        self._cache.args["devices"] = 1

watertianyi · 2024-08-19T07:17:39Z

@samet-akcay Can you specify the GPU ID?
engine = Engine(max_epochs=10,task="classification",accelerator='gpu',devices=[1,2])
I specify the GPU ID number according to the above code. Why do the following results still appear? Can’t I specify it?
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0,1,2,3,4,5,6,7]
When I am training Patchcore, I have insufficient CUDA memory. How should I adjust the parameters?

samet-akcay · 2024-08-19T08:01:54Z

Yes, you could specify the GPU ID.

LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0,1,2,3,4,5,6,7]
This is an automated log generated by Lightning. You could ignore this.

If you are experiencing out-of-memory issues with Patchcore, your dataset is probably too large to fit in a PatchCore memory bank. You could configure the Patchcore arguments to make it more memory efficient, for example by switching to a lighter backbone or changing the layers to extract features from.
https://anomalib.readthedocs.io/en/v1.1.0/markdown/guides/reference/models/image/patchcore.html
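To get a feel for how much the backbone choice matters, here is a rough back-of-the-envelope estimate of the embedding size. It is only a sketch under stated assumptions: float32 values, one embedding per location of the layer2 feature map (stride 8), channel counts taken from the standard torchvision ResNet definitions (512 + 1024 for wide_resnet50_2, 128 + 256 for resnet18 with layer2 and layer3), and a hypothetical dataset of 1000 images at 224x224. `embedding_bytes` is an illustrative helper, not an Anomalib function, and the estimate ignores everything else held in GPU memory.

```python
def embedding_bytes(num_images, image_size, channels, stride=8, bytes_per_value=4):
    """Bytes needed to hold one patch embedding per feature-map location."""
    side = image_size // stride  # feature-map height and width
    return num_images * side * side * channels * bytes_per_value


GIB = 1024 ** 3

# Concatenated channel counts of layer2 + layer3 (assumed, see lead-in).
channels_by_backbone = {
    "wide_resnet50_2": 512 + 1024,
    "resnet18": 128 + 256,
}

for backbone, channels in channels_by_backbone.items():
    full = embedding_bytes(num_images=1000, image_size=224, channels=channels)
    bank = full * 0.1  # default coreset_sampling_ratio of 0.1
    print(f"{backbone}: all embeddings {full / GIB:.2f} GiB, memory bank {bank / GIB:.2f} GiB")
```

Under these assumptions the full set of embeddings is about 4.5 GiB for wide_resnet50_2 and about 1.1 GiB for resnet18, so a lighter backbone, fewer images, or a smaller input size each reduce the footprint directly.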

anomalib/src/anomalib/models/image/patchcore/lightning_model.py

Lines 25 to 48 in 2bd2842

class Patchcore(MemoryBankMixin, AnomalyModule):
    """PatchcoreLightning Module to train PatchCore algorithm.

    Args:
        backbone (str): Backbone CNN network
            Defaults to ``wide_resnet50_2``.
        layers (list[str]): Layers to extract features from the backbone CNN
            Defaults to ``["layer2", "layer3"]``.
        pre_trained (bool, optional): Boolean to check whether to use a pre_trained backbone.
            Defaults to ``True``.
        coreset_sampling_ratio (float, optional): Coreset sampling ratio to subsample embedding.
            Defaults to ``0.1``.
        num_neighbors (int, optional): Number of nearest neighbors.
            Defaults to ``9``.
    """

    def __init__(
        self,
        backbone: str = "wide_resnet50_2",
        layers: Sequence[str] = ("layer2", "layer3"),
        pre_trained: bool = True,
        coreset_sampling_ratio: float = 0.1,
        num_neighbors: int = 9,
    ) -> None:

watertianyi · 2024-08-19T08:14:48Z

@samet-akcay I specified GPU cards 1 and 2 for training. However, during training, I still trained on card 0. Is there anything wrong with the GPU specified in the above settings?
engine = Engine(max_epochs=10,task="classification",accelerator='gpu',devices=[1,2])

samet-akcay · 2024-08-19T09:54:21Z

@goldwater668, you currently cannot set multiple GPUs as it will be mapped back to a single GPU.

With that being said, I noticed that Engine always configures the device to run on the default GPU even when the user explicitly chooses a specific GPU. I've created a PR to fix this
#2256
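Independently of that PR, a common way to pin a process to one physical GPU is the CUDA_VISIBLE_DEVICES environment variable, which must be set before CUDA is initialised (in practice, before importing torch or anomalib). The sketch below is a generic CUDA-level workaround rather than an Anomalib feature, and `pin_visible_gpus` is a hypothetical helper name.

```python
import os


def pin_visible_gpus(gpu_ids):
    """Restrict the CUDA devices this process can see (illustrative helper)."""
    value = ",".join(str(gpu_id) for gpu_id in gpu_ids)
    os.environ["CUDA_VISIBLE_DEVICES"] = value
    return value


# Call this at the very top of the training script, before any CUDA-using import.
pin_visible_gpus([1])
# Frameworks initialised afterwards see physical GPU 1 as device index 0.
```

With only one GPU visible, the default device is the one you chose. This does not enable multi-GPU training, since the number of devices is still mapped back to one.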

samet-akcay · 2024-08-19T10:33:22Z

I have created an official feature issue here: #2258

I'm closing this one. Those who are interested in this feature can follow the above issue.

watertianyi · 2024-08-20T02:01:14Z

@samet-akcay I used EfficientAd for 10 epochs and single category training to get the following results:

Epoch 9/9  ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 43944/43944 0:23:59 ? 0:00:00 30.44it/s train_st_step: 0.530 train_ae_step: 0.337 train_stae_step: 0.042 train_loss_step: 0.909            
                                                                                            image_AUROC: 0.906 image_F1Score: 0.709 train_st_epoch: 0.553 train_ae_epoch: 0.428                
Epoch 9/9  ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 43944/43944 0:23:59 ? 0:00:00 30.44it/s train_st_step: 0.530 train_ae_step: 0.337 train_stae_step: 0.042 train_loss_step: 0.909            
                                                                                            image_AUROC: 0.956 image_F1Score: 0.699 train_st_epoch: 0.541 train_ae_epoch: 0.425                
                                                                                            train_stae_epoch: 0.056 train_loss_epoch: 1.022                                                    
F1Score class exists for backwards compatibility. It will be removed in v1.1. Please use BinaryF1Score from torchmetrics instead
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0,1,2,3,4,5,6,7]
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━┓
┃        Test metric        ┃       DataLoader 0        ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━┩
│        image_AUROC        │    0.6983320713043213     │
│       image_F1Score       │    0.6690346598625183     │
└───────────────────────────┴───────────────────────────┘

predictions = engine.predict(
        datamodule=datamodule,
        model=model,
        ckpt_path="latest/weights/lightning/model.ckpt",
    )

When predicting data, do I have to go through datamodule = Folder()? Is there a way to test the image directly?

samet-akcay · 2024-08-20T08:31:57Z

@goldwater668, the post above is not related to this issue. Can you create a Q&A in the Discussions section?

lemonbuilder added the Task label Oct 30, 2023

samet-akcay added Feature and removed Task labels Feb 9, 2024

samet-akcay changed the title ~~[Task]: Multi-GPU~~ Add Multi-GPU Support to v1 Feb 9, 2024

samet-akcay changed the title ~~Add Multi-GPU Support to v1~~ ✨ Add Multi-GPU Support to v1 Feb 9, 2024

samet-akcay modified the milestones: v1.2.0, v1.1.0 Feb 9, 2024

samet-akcay added this to Anomalib Feb 9, 2024

github-project-automation bot moved this to 📝 To Do in Anomalib Feb 9, 2024

samet-akcay moved this from 📝 To Do to 🗂️ Backlog in Anomalib Feb 9, 2024

samet-akcay changed the title ~~✨ Add Multi-GPU Support to v1~~ ✨ Add Multi-GPU Support to v1.1 Mar 5, 2024

This was referenced Mar 5, 2024

[Bug]: A Multi-GPU Parallel Training error with API #1821

Closed

An error occurred when using multiple gpus #1809

Closed

samet-akcay mentioned this issue Mar 22, 2024

Training With Mult-GPUS stopped #1110

Closed

1 task

samet-akcay added the Good First Issue Issues that can be picked up by someone unfamiliar with the repo and would like to contribute. label Mar 25, 2024

samet-akcay assigned RitikaxShakya Mar 28, 2024

samet-akcay mentioned this issue Apr 28, 2024

[Bug]: torch.cuda.OutOfMemoryError: CUDA out of memory - PatchCore #2016

Closed

1 task

samet-akcay modified the milestones: v1.1.0, v1.2.0 May 14, 2024

samet-akcay unassigned RitikaxShakya Jun 5, 2024

samet-akcay moved this from 🗂️ Backlog to 📝 To Do in Anomalib Jun 5, 2024

samet-akcay closed this as completed Aug 19, 2024

github-project-automation bot moved this from 📝 To Do to ✅ Done in Anomalib Aug 19, 2024

samet-akcay moved this from ✅ Done to 🗂️ Backlog in Anomalib Aug 19, 2024

samet-akcay removed this from Anomalib Aug 19, 2024

samet-akcay removed this from the v1.2.0 milestone Aug 19, 2024
