
how to carry out multi-GPU with lightning-pose? #138

Open
Wulin-Tan opened this issue Apr 7, 2024 · 7 comments · Fixed by #206
Assignees: themattinthehatt
Labels: enhancement (New feature or request)

Comments

@Wulin-Tan

Hi, Lightning Pose team,
I am very interested in this super cool tool.
How do I do multi-GPU training with Lightning Pose?
I can't find any information on multi-GPU in the tutorial.

@themattinthehatt
Collaborator

We do not currently support multi-GPU training, but it should not be difficult to implement with our current setup (at least for the supervised models). We'll be happy to look into this, but it might take us a week or so; we'll keep you updated.

@themattinthehatt themattinthehatt self-assigned this Apr 19, 2024
@themattinthehatt themattinthehatt added the enhancement New feature or request label Apr 19, 2024
@YitingChang

I'm also interested in multi-GPU training and would like to follow this. Thank you!

@ksikka
Collaborator

ksikka commented Oct 7, 2024

Hi @Wulin-Tan and @YitingChang, I am starting to investigate multi-GPU support for Lightning Pose. Could you provide some more context about your goals with multi-GPU, i.e., are you requesting this to accelerate training or inference, or are you running into memory limitations with one GPU? This will help guide our development. Thanks!

@YitingChang

Hi @ksikka ,

Thank you for following up! I'm encountering memory limitations with a single GPU, so I'd like to explore running Lightning Pose with multiple GPUs.

To give you more context on my current setup: as I add more features—such as unsupervised losses and temporal context networks—the memory demands increase. I've had to significantly resize images and reduce batch sizes to fit within the memory constraints. Currently, I’m running Lightning Pose on a cluster with one GPU that has 40 GB of memory. Allocating more GPUs shouldn't be an issue if Lightning Pose can support multi-GPU functionality.

@Wulin-Tan
Author

@ksikka Yes, I need more GPU memory, as well as faster training if possible.

@ksikka
Collaborator

ksikka commented Oct 17, 2024

Hi @YitingChang and @Wulin-Tan, support for multi-GPU was just added for supervised training. This works for both heatmap and heatmap_mhcrnn models, as long as losses_to_use: [].

To use it, set num_gpus in the config to the number of GPUs on your machine. train_batch_size and val_batch_size must be divisible by num_gpus.
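For example, a minimal sketch of the relevant settings (the values and the 4-GPU machine are illustrative assumptions; check the configuration file docs for where each key lives in the config):

```yaml
# Illustrative values only, assuming a machine with 4 GPUs.
num_gpus: 4            # number of GPUs on your machine
train_batch_size: 16   # must be divisible by num_gpus (16 / 4 = 4 per GPU)
val_batch_size: 32     # must be divisible by num_gpus
losses_to_use: []      # supervised training only; multi-GPU with unsupervised losses is WIP
```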

So the request is fulfilled for supervised training. Support for multi-GPU when using unsupervised losses is a work in progress.

@ksikka
Collaborator

ksikka commented Oct 17, 2024

Until multi-GPU is supported for the unsupervised losses, you might have success with gradient accumulation (see accumulate_grad_batches in the configuration file docs). Say you want an effective batch size of 8 but can only fit batch size 2 in memory; then you could set train_batch_size: 2 and accumulate_grad_batches: 4. This is an experimental feature, so if you do try it, please report back with your feedback. Thanks!

Update: also reduce sequence_length by the same factor as you reduce train_batch_size, since sequence_length is the batch size for unlabeled frames when using unsupervised losses. However, when using the context model (heatmap_mhcrnn), do not reduce it below 5, as the number of predictions is n - 4 for the context model. I'll add this to the docs in the next PR.
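For concreteness, a sketch of the settings described above (values are illustrative; where each key sits in the config file may differ, so check the configuration docs):

```yaml
# Illustrative values for an effective labeled batch size of 8 when only 2 fit in memory.
train_batch_size: 2          # reduced 4x to fit in GPU memory
accumulate_grad_batches: 4   # 2 x 4 = effective batch size of 8
sequence_length: 8           # reduce by the same 4x factor; keep >= 5 for heatmap_mhcrnn
```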
