
Distributed training issue #983

Open
jakubMitura14 opened this issue Oct 17, 2024 · 2 comments

jakubMitura14 commented Oct 17, 2024

Hello, I have 2 GPUs, as shown by nvidia-smi:
[screenshot: nvidia-smi output showing two GPUs]

Then I try

```julia
using Lux, MPI, NCCL, CUDA  # MPI.jl and NCCL.jl need to be loaded for the NCCL backend

DistributedUtils.initialize(NCCLBackend)
distributed_backend = DistributedUtils.get_distributed_backend(NCCLBackend)
DistributedUtils.local_rank(distributed_backend)      # returns 0
DistributedUtils.total_workers(distributed_backend)   # returns 1
```

local_rank evaluates to 0 and total_workers evaluates to 1. That seems incorrect, if I understand the idea correctly.
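
For reference, a minimal sketch of how multiple workers are typically obtained with Lux's DistributedUtils, assuming the script is started through an MPI launcher such as MPI.jl's mpiexecjl (the script name below is hypothetical, and the exact launch command depends on the local MPI installation):

```julia
# Launch with, e.g.:  mpiexecjl -n 2 julia --project distributed_script.jl
# (distributed_script.jl is a hypothetical file name for this snippet)
using Lux, MPI, NCCL, CUDA

DistributedUtils.initialize(NCCLBackend)
backend = DistributedUtils.get_distributed_backend(NCCLBackend)

# With two MPI ranks, each process should report:
#   local_rank(backend)    -> 0 on one process, 1 on the other
#   total_workers(backend) -> 2 on both
@info "worker" rank=DistributedUtils.local_rank(backend) nworkers=DistributedUtils.total_workers(backend)
```

The worker count comes from the MPI world size rather than from the number of GPUs nvidia-smi lists, so a single Julia process started without an MPI launcher reporting total_workers == 1 would be consistent with that.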

I am using:
Lux v1.1.0
CUDA v5.5.2
Julia 1.10

avik-pal (Member) commented

jakubMitura14 (Author) commented

For now it still does not work, but I need to dig deeper into MPI first. Thanks for the guidance!
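
A quick way to check the MPI layer on its own (a sketch independent of Lux; the file name mpi_check.jl is hypothetical): if this prints only one rank when started with `mpiexecjl -n 2`, the problem is in the MPI setup rather than in Lux.

```julia
# mpi_check.jl -- run with:  mpiexecjl -n 2 julia --project mpi_check.jl
using MPI

MPI.Init()
comm = MPI.COMM_WORLD
# Expect two lines of output: ranks 0 and 1, each reporting a world size of 2.
println("rank ", MPI.Comm_rank(comm), " of ", MPI.Comm_size(comm))
MPI.Finalize()
```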
