✨ Add Multi-GPU Support to v1.1 #1449

Closed
lemonbuilder opened this issue Oct 30, 2023 · 33 comments
Labels
Feature Good First Issue Issues that can be picked up by someone unfamiliar with the repo and would like to contribute.

Comments

@lemonbuilder

lemonbuilder commented Oct 30, 2023

What is the motivation for this task?

I'm going to train the EfficientAd model on a custom dataset.
How do I train or test using multiple GPUs?
Please tell me which command to use.

Describe the solution you'd like

Currently, I'm training on a single device only.
$ python3 tools/train.py --model efficient_ad

Additional context

No response

@lemonbuilder
Author

@samet-akcay Is there any code implementation for using multiple GPUs?

@samet-akcay samet-akcay added Feature and removed Task labels Feb 9, 2024
@samet-akcay samet-akcay changed the title [Task]: Multi-GPU Add Multi-GPU Support to v1 Feb 9, 2024
@samet-akcay samet-akcay changed the title Add Multi-GPU Support to v1 ✨ Add Multi-GPU Support to v1 Feb 9, 2024
@samet-akcay samet-akcay modified the milestones: v1.2.0, v1.1.0 Feb 9, 2024
@github-project-automation github-project-automation bot moved this to 📝 To Do in Anomalib Feb 9, 2024
@samet-akcay
Contributor

samet-akcay commented Feb 9, 2024

@lemonbuilder, this has now been added to the roadmap.

This task would close the following issues: #930 #1110 #1398

@samet-akcay samet-akcay moved this from 📝 To Do to 🗂️ Backlog in Anomalib Feb 9, 2024
@nguyenanhtuan1008

nguyenanhtuan1008 commented Feb 12, 2024

@samet-akcay, sorry, I got an error when training with multiple GPUs on v1. How can I use only one GPU, for example GPU ID 3, for training? I'm currently using this code for training:

# Import the required modules
from anomalib.data import MVTec
from anomalib.models import EfficientAd
from anomalib.engine import Engine

# Initialize the datamodule, model and engine
datamodule = MVTec()
model = EfficientAd()

engine = Engine()

# Train the model
engine.fit(datamodule=datamodule, model=model)

@samet-akcay
Contributor

@nguyenanhtuan1008, you could refer to this link.
https://lightning.ai/docs/pytorch/stable/accelerators/gpu_basic.html#choosing-gpu-devices

In this case, you could initialize the Engine class as follows:

engine = Engine(accelerator="gpu", devices=[3])  # a list selects GPUs by index
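
Another common way to pin a process to one GPU, independent of how the devices argument is interpreted, is to restrict which GPUs the process can see through the CUDA_VISIBLE_DEVICES environment variable. The following is a minimal sketch, not part of anomalib; the helper name restrict_visible_gpus is hypothetical, and it assumes the variable is set before CUDA is initialized.

```python
import os


def restrict_visible_gpus(gpu_ids):
    """Expose only the given physical GPU ids to this process.

    Hypothetical helper. It must run before the first CUDA call (safest:
    before importing torch), because the CUDA runtime reads
    CUDA_VISIBLE_DEVICES once, at initialization.
    """
    value = ",".join(str(i) for i in gpu_ids)
    os.environ["CUDA_VISIBLE_DEVICES"] = value
    return value


# Make physical GPU 3 the only visible device.
# Inside the process it then shows up as device index 0.
restrict_visible_gpus([3])

# Afterwards (not executed here), the engine can be created as usual, e.g.:
# from anomalib.engine import Engine
# engine = Engine(accelerator="gpu", devices=1)
```

The same effect can be had from the shell by prefixing the training command with CUDA_VISIBLE_DEVICES=3.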

@nguyenanhtuan1008

nguyenanhtuan1008 commented Feb 15, 2024

@samet-akcay
Thank you so much.
I got training to work but still got an error after 1 epoch, so I gave up and am using a single GPU for now.

@samet-akcay samet-akcay changed the title ✨ Add Multi-GPU Support to v1 ✨ Add Multi-GPU Support to v1.1 Mar 5, 2024
@Bepitic
Contributor

Bepitic commented Mar 13, 2024

Hello, I would like to take this issue.
Thank you, @samet-akcay, for the good work.

@samet-akcay samet-akcay added the Good First Issue Issues that can be picked up by someone unfamiliar with the repo and would like to contribute. label Mar 25, 2024
@RitikaxShakya

Hi @samet-akcay I would like to work on this issue. Can I take this issue?

@samet-akcay
Contributor

@RitikaxShakya, thanks for your interest. I totally missed this one, but it looks like @Bepitic has already shown interest in this. If he doesn't want to work on it, it could be all yours. How does that sound?

@samet-akcay
Contributor

@Bepitic, are you still interested in this issue? If not, @RitikaxShakya can take it.

@Bepitic
Contributor

Bepitic commented Mar 28, 2024

Yes, for sure. Since no one confirmed it with me, I also forgot about the multi-GPU one 😅

@samet-akcay
Contributor

sorry about that

@samet-akcay
Contributor

@RitikaxShakya, all yours then

@RitikaxShakya

.take

@RitikaxShakya

RitikaxShakya commented Apr 6, 2024

@blaz-r @samet-akcay Hello! I need help with the parts of the code that deal with GPU initialization, data parallelization, and GPU-specific operations, as these are the areas I think I'll need to modify to add multi-GPU support.

@blaz-r
Contributor

blaz-r commented Apr 10, 2024

I am not that familiar with these topics within Anomalib. @ashwinvaidya17 could you provide some insight here?

@RitikaxShakya

@ashwinvaidya17 Hello! Please help me with the parts of the code that deal with GPU initialization, data parallelization, and GPU-specific operations, as these are the areas I think I'll need to modify to add multi-GPU support.

@ashwinvaidya17
Collaborator

@RitikaxShakya currently we override the number of devices to 1 in Engine and the CLI.

To start with, we should remove these lines.

self._cache.args["devices"] = 1

config.trainer.devices = 1

Doing this will break a bunch of stuff across the repo.

  1. For example, all the trainer.model calls will break:
    is_zero_or_few_shot = trainer.model.learning_type in [LearningType.ZERO_SHOT, LearningType.FEW_SHOT]

    These should be replaced with trainer.lightning_module.
  2. You will also need to test each model and replace all .cpu() operations, since we move large tensors out of CUDA memory to mitigate OOM issues. In the case of Padim, the following line will break
    self.stats = self.model.gaussian.fit(embeddings)
    because the embeddings are on the CPU. They should be moved to CUDA before calling model.fit. Something as simple as to(self.device) should fix it. From my initial experiments, this isn't sufficient to make the model work, but it's a good start.
  3. I am not sure if this is affected by distributed training, but you might also need to look at thresholding and metrics computation:
    output = output.cpu()

I might have missed something so feel free to report any difficulties you run into.
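
To illustrate point 1, here is a toy sketch of why attribute access through a wrapper breaks. Under DDP, trainer.model is the DistributedDataParallel wrapper, which keeps the original model in its .module attribute, while trainer.lightning_module returns the unwrapped model. The two classes and the unwrap helper below are hypothetical stand-ins written for this example; they are not anomalib or Lightning code.

```python
class FakeLightningModule:
    """Hypothetical stand-in for an anomalib model."""

    learning_type = "one_class"


class FakeDDPWrapper:
    """Hypothetical stand-in for DistributedDataParallel.

    Like the real wrapper, it stores the wrapped model in `.module`
    and does not forward custom attributes such as `learning_type`.
    """

    def __init__(self, module):
        self.module = module


def unwrap(model):
    """Return the wrapped module if there is one, else the model itself."""
    return getattr(model, "module", model)


wrapped = FakeDDPWrapper(FakeLightningModule())

# Direct access on the wrapper fails with AttributeError.
try:
    wrapped.learning_type
    failed = False
except AttributeError:
    failed = True

# Going through the unwrapped module works, which is what
# trainer.lightning_module provides in Lightning.
learning_type = unwrap(wrapped).learning_type
```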

@haimat

haimat commented Jun 5, 2024

Using the latest anomalib 1.1.0 from pip, I create the Engine like so:

    engine = Engine(
        max_epochs=100,
        task=task_type,
        accelerator="gpu",
        devices=-1,
    )

By passing devices=-1 I thought training would utilize all my available GPUs.
PyTorch can see them, I get this output from training:

LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0,1,2,3]

However, when looking at the output from nvidia-smi I can see that only one GPU is used.
I also tried to pass strategy="ddp" when creating the engine, however, then I get this error:

Traceback (most recent call last):
  File "/data/scratch/mkw-anomalib/anomalib-test.py", line 41, in <module>
    train()
  File "/data/scratch/mkw-anomalib/anomalib-test.py", line 36, in train
    engine.fit(datamodule=datamodule, model=model)
  File "/home/sinntelligence/.local/lib/python3.10/site-packages/anomalib/engine/engine.py", line 540, in fit
    self.trainer.fit(model, train_dataloaders, val_dataloaders, datamodule, ckpt_path)
  File "/home/sinntelligence/.local/lib/python3.10/site-packages/lightning/pytorch/trainer/trainer.py", line 544, in fit
    call._call_and_handle_interrupt(
  File "/home/sinntelligence/.local/lib/python3.10/site-packages/lightning/pytorch/trainer/call.py", line 43, in _call_and_handle_interrupt
    return trainer.strategy.launcher.launch(trainer_fn, *args, trainer=trainer, **kwargs)
  File "/home/sinntelligence/.local/lib/python3.10/site-packages/lightning/pytorch/strategies/launchers/subprocess_script.py", line 105, in launch
    return function(*args, **kwargs)
  File "/home/sinntelligence/.local/lib/python3.10/site-packages/lightning/pytorch/trainer/trainer.py", line 580, in _fit_impl
    self._run(model, ckpt_path=ckpt_path)
  File "/home/sinntelligence/.local/lib/python3.10/site-packages/lightning/pytorch/trainer/trainer.py", line 987, in _run
    results = self._run_stage()
  File "/home/sinntelligence/.local/lib/python3.10/site-packages/lightning/pytorch/trainer/trainer.py", line 1033, in _run_stage
    self.fit_loop.run()
  File "/home/sinntelligence/.local/lib/python3.10/site-packages/lightning/pytorch/loops/fit_loop.py", line 205, in run
    self.advance()
  File "/home/sinntelligence/.local/lib/python3.10/site-packages/lightning/pytorch/loops/fit_loop.py", line 363, in advance
    self.epoch_loop.run(self._data_fetcher)
  File "/home/sinntelligence/.local/lib/python3.10/site-packages/lightning/pytorch/loops/training_epoch_loop.py", line 140, in run
    self.advance(data_fetcher)
  File "/home/sinntelligence/.local/lib/python3.10/site-packages/lightning/pytorch/loops/training_epoch_loop.py", line 269, in advance
    call._call_callback_hooks(trainer, "on_train_batch_end", batch_output, batch, batch_idx)
  File "/home/sinntelligence/.local/lib/python3.10/site-packages/lightning/pytorch/trainer/call.py", line 208, in _call_callback_hooks
    fn(trainer, trainer.lightning_module, *args, **kwargs)
  File "/home/sinntelligence/.local/lib/python3.10/site-packages/lightning/pytorch/callbacks/model_checkpoint.py", line 295, in on_train_batch_end
    if self._should_skip_saving_checkpoint(trainer):
  File "/home/sinntelligence/.local/lib/python3.10/site-packages/anomalib/callbacks/checkpoint.py", line 38, in _should_skip_saving_checkpoint
    is_zero_or_few_shot = trainer.model.learning_type in [LearningType.ZERO_SHOT, LearningType.FEW_SHOT]
  File "/home/sinntelligence/.local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1688, in __getattr__
    raise AttributeError(f"'{type(self).__name__}' object has no attribute '{name}'")
AttributeError: 'DistributedDataParallel' object has no attribute 'learning_type'

Is this a bug, or how can I use all my GPUs for training with anomalib 1.1.0?
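
For anyone wanting to confirm programmatically which GPUs are actually busy, rather than reading the nvidia-smi table by eye, a small sketch follows. It relies on nvidia-smi's CSV query mode; the function names, the 10% utilization threshold, and the sample output are assumptions made up for this example, and the parser assumes every field is numeric.

```python
import subprocess

# Query index, utilization (%) and used memory (MiB) as bare CSV rows.
QUERY = [
    "nvidia-smi",
    "--query-gpu=index,utilization.gpu,memory.used",
    "--format=csv,noheader,nounits",
]


def parse_gpu_usage(csv_text):
    """Parse 'index, utilization, memory' rows into a list of dicts."""
    rows = []
    for line in csv_text.strip().splitlines():
        index, util, mem = (field.strip() for field in line.split(","))
        rows.append({"index": int(index), "util_pct": int(util), "mem_mib": int(mem)})
    return rows


def busy_gpus(rows, min_util_pct=10):
    """Return the indices of GPUs at or above the utilization threshold."""
    return [row["index"] for row in rows if row["util_pct"] >= min_util_pct]


def query_gpu_usage():
    """Run nvidia-smi (requires an NVIDIA driver) and parse its output."""
    out = subprocess.run(QUERY, capture_output=True, text=True, check=True).stdout
    return parse_gpu_usage(out)


# Made-up sample for a 4-GPU machine where only GPU 0 is doing work.
sample = "0, 97, 10240\n1, 0, 3\n2, 0, 3\n3, 0, 3\n"
print(busy_gpus(parse_gpu_usage(sample)))
```

Calling busy_gpus(query_gpu_usage()) while training is running would show whether more than one device is in use.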

@samet-akcay samet-akcay moved this from 🗂️ Backlog to 📝 To Do in Anomalib Jun 5, 2024
@samet-akcay
Contributor

@haimat, we aim to enable multi-gpu support in v1.2

@haimat

haimat commented Jun 5, 2024

@samet-akcay Thanks for your quick reply.
So around end of July?

@samet-akcay
Contributor

yeah, that's the plan hopefully :)

@haimat

haimat commented Aug 2, 2024

Hello guys, any news on this? Do you have an ETA for 1.2 and multi-GPU training?

@haimat

haimat commented Aug 8, 2024

@samet-akcay Hello Samet, can you estimate when multi-GPU training will be available?

@ashwinvaidya17
Collaborator

@haimat unfortunately we don't have an exact timeline for this. Currently, we are busy with some other high-priority tasks.

@watertianyi

@samet-akcay
I used the following parameters to train on multiple GPUs, but training still runs on a single GPU. Also, how do I set the number of epochs? I am quite anxious about this.
from anomalib.models import Patchcore
from anomalib.engine import Engine

# Create the model and engine
model = Patchcore()
engine = Engine(max_epochs=30, task="classification", accelerator="gpu", devices=3)

# Train a Patchcore model on the given datamodule
engine.train(datamodule=datamodule, model=model)

What is the default number of epochs?

@samet-akcay
Contributor

@goldwater668, as mentioned above, multi-GPU is not currently supported. The devices parameter is overwritten here to avoid any errors caused by multi-GPU issues.

# Temporarily set devices to 1 to avoid issues with multiple processes
self._cache.args["devices"] = 1

@watertianyi

@samet-akcay Can I specify the GPU ID?
engine = Engine(max_epochs=10, task="classification", accelerator="gpu", devices=[1, 2])
I specified the GPU IDs as in the code above. Why does the following output still appear? Can't I specify them?
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0,1,2,3,4,5,6,7]
Also, when training Patchcore, I run out of CUDA memory. How should I adjust the parameters?

@samet-akcay
Contributor

samet-akcay commented Aug 19, 2024

Yes, you could specify the GPU ID.

LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0,1,2,3,4,5,6,7]
This is an automated log generated by Lightning. You could ignore this.

If you are experiencing out-of-memory issues with Patchcore, your dataset is probably too large to fit in a PatchCore memory bank. You could configure the Patchcore arguments to make it more memory efficient, for example by switching to a lighter backbone or changing the layers to extract features from.
https://anomalib.readthedocs.io/en/v1.1.0/markdown/guides/reference/models/image/patchcore.html

class Patchcore(MemoryBankMixin, AnomalyModule):
    """PatchcoreLightning Module to train PatchCore algorithm.

    Args:
        backbone (str): Backbone CNN network
            Defaults to ``wide_resnet50_2``.
        layers (list[str]): Layers to extract features from the backbone CNN
            Defaults to ``["layer2", "layer3"]``.
        pre_trained (bool, optional): Boolean to check whether to use a pre_trained backbone.
            Defaults to ``True``.
        coreset_sampling_ratio (float, optional): Coreset sampling ratio to subsample embedding.
            Defaults to ``0.1``.
        num_neighbors (int, optional): Number of nearest neighbors.
            Defaults to ``9``.
    """

    def __init__(
        self,
        backbone: str = "wide_resnet50_2",
        layers: Sequence[str] = ("layer2", "layer3"),
        pre_trained: bool = True,
        coreset_sampling_ratio: float = 0.1,
        num_neighbors: int = 9,
    ) -> None:
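
To get a feel for how these arguments affect memory, here is a rough back-of-the-envelope sketch. It is an estimate, not anomalib's actual accounting: it assumes float32 embeddings, a 28x28 feature map at layer2 (which corresponds to a 224x224 input), and concatenated layer2 + layer3 channels of 512 + 1024 = 1536 for wide_resnet50_2 versus 128 + 256 = 384 for resnet18. The function names and the dataset size of 1000 images are made up for the example.

```python
BYTES_PER_FLOAT32 = 4


def embedding_bytes(num_images, feat_h, feat_w, feat_dim):
    """Size of the full patch embedding collected before coreset subsampling."""
    return num_images * feat_h * feat_w * feat_dim * BYTES_PER_FLOAT32


def memory_bank_bytes(num_images, feat_h, feat_w, feat_dim, coreset_sampling_ratio):
    """Approximate size of the memory bank kept after coreset subsampling."""
    num_patches = num_images * feat_h * feat_w
    kept = int(num_patches * coreset_sampling_ratio)
    return kept * feat_dim * BYTES_PER_FLOAT32


def to_gib(num_bytes):
    return num_bytes / 2**30


# Hypothetical dataset of 1000 training images.
for name, dim in [("wide_resnet50_2", 1536), ("resnet18", 384)]:
    full = embedding_bytes(1000, 28, 28, dim)
    bank = memory_bank_bytes(1000, 28, 28, dim, coreset_sampling_ratio=0.1)
    print(f"{name}: full embedding {to_gib(full):.2f} GiB, memory bank {to_gib(bank):.2f} GiB")
```

Under these assumptions the full embedding, which has to exist before subsampling, dominates the footprint, so a narrower backbone or fewer layers reduces memory more directly than lowering coreset_sampling_ratio alone.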

@watertianyi

@samet-akcay I specified GPUs 1 and 2 for training. However, training still ran on GPU 0. Is there anything wrong with how the GPUs are specified in the settings above?
engine = Engine(max_epochs=10,task="classification",accelerator='gpu',devices=[1,2])

@samet-akcay
Contributor

@goldwater668, you currently cannot use multiple GPUs, as the setting will be mapped back to a single GPU.

With that being said, I noticed that Engine always configures the device to run on the default GPU, even when the user explicitly chooses a specific GPU. I've created a PR to fix this: #2256

@samet-akcay
Contributor

I have created an official feature issue here: #2258

I'm closing this one. Those who are interested in this feature can follow the above issue.

@github-project-automation github-project-automation bot moved this from 📝 To Do to ✅ Done in Anomalib Aug 19, 2024
@samet-akcay samet-akcay moved this from ✅ Done to 🗂️ Backlog in Anomalib Aug 19, 2024
@samet-akcay samet-akcay removed this from Anomalib Aug 19, 2024
@samet-akcay samet-akcay removed this from the v1.2.0 milestone Aug 19, 2024
@watertianyi

watertianyi commented Aug 20, 2024

@samet-akcay I trained EfficientAd for 10 epochs on a single category and got the following results:

Epoch 9/9  ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 43944/43944 0:23:59 ? 0:00:00 30.44it/s train_st_step: 0.530 train_ae_step: 0.337 train_stae_step: 0.042 train_loss_step: 0.909            
                                                                                            image_AUROC: 0.906 image_F1Score: 0.709 train_st_epoch: 0.553 train_ae_epoch: 0.428                
Epoch 9/9  ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 43944/43944 0:23:59 ? 0:00:00 30.44it/s train_st_step: 0.530 train_ae_step: 0.337 train_stae_step: 0.042 train_loss_step: 0.909            
                                                                                            image_AUROC: 0.956 image_F1Score: 0.699 train_st_epoch: 0.541 train_ae_epoch: 0.425                
                                                                                            train_stae_epoch: 0.056 train_loss_epoch: 1.022                                                    
F1Score class exists for backwards compatibility. It will be removed in v1.1. Please use BinaryF1Score from torchmetrics instead
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0,1,2,3,4,5,6,7]
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━┓
┃        Test metric        ┃       DataLoader 0        ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━┩
│        image_AUROC        │    0.6983320713043213     │
│       image_F1Score       │    0.6690346598625183     │
└───────────────────────────┴───────────────────────────┘
predictions = engine.predict(
    datamodule=datamodule,
    model=model,
    ckpt_path="latest/weights/lightning/model.ckpt",
)

When predicting, do I have to go through datamodule = Folder()? Is there a way to test an image directly?

@samet-akcay
Contributor

@goldwater668, the post above is not related to this issue. Could you create a Q&A in the Discussions section?

9 participants