
Distributed training issue #983

Open
jakubMitura14 opened this issue Oct 17, 2024 · 2 comments

jakubMitura14 commented Oct 17, 2024

Hello, I have 2 GPUs, as shown by nvidia-smi:
[screenshot: nvidia-smi output showing two GPUs]

Then I try

```julia
using Lux, MPI, NCCL, CUDA  # MPI.jl and NCCL.jl need to be loaded for the NCCL backend

DistributedUtils.initialize(NCCLBackend)
distributed_backend = DistributedUtils.get_distributed_backend(NCCLBackend)
DistributedUtils.local_rank(distributed_backend)      # returns 0
DistributedUtils.total_workers(distributed_backend)   # returns 1
```

local_rank evaluates to 0 and total_workers evaluates to 1. That seems incorrect, if I understand the idea correctly.
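
For reference, a minimal sketch of how multiple workers are typically obtained with Lux's DistributedUtils, assuming the script is started through an MPI launcher such as MPI.jl's mpiexecjl (the script name below is hypothetical, and the exact launch command depends on the local MPI installation):

```julia
# Launch with, e.g.:  mpiexecjl -n 2 julia --project distributed_script.jl
# (distributed_script.jl is a hypothetical file name for this snippet)
using Lux, MPI, NCCL, CUDA

DistributedUtils.initialize(NCCLBackend)
backend = DistributedUtils.get_distributed_backend(NCCLBackend)

# With two MPI ranks, each process should report:
#   local_rank(backend)    -> 0 on one process, 1 on the other
#   total_workers(backend) -> 2 on both
@info "worker" rank=DistributedUtils.local_rank(backend) nworkers=DistributedUtils.total_workers(backend)
```

The worker count comes from the MPI world size rather than from the number of GPUs nvidia-smi lists, so a single Julia process started without an MPI launcher reporting total_workers == 1 would be consistent with that.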

I am using:
Lux v1.1.0
CUDA v5.5.2
Julia 1.10

avik-pal (Member) commented

jakubMitura14 (Author) commented

For now it still does not work, but I need to dig deeper into MPI first. Thanks for the guidance!
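
A quick way to check the MPI layer on its own (a sketch independent of Lux; the file name mpi_check.jl is hypothetical): if this prints only one rank when started with `mpiexecjl -n 2`, the problem is in the MPI setup rather than in Lux.

```julia
# mpi_check.jl -- run with:  mpiexecjl -n 2 julia --project mpi_check.jl
using MPI

MPI.Init()
comm = MPI.COMM_WORLD
# Expect two lines of output: ranks 0 and 1, each reporting a world size of 2.
println("rank ", MPI.Comm_rank(comm), " of ", MPI.Comm_size(comm))
MPI.Finalize()
```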
