Does torchrun + FSDP create multiple copies of the same dataset and model? #1289

tsengalb99 · 2024-09-25T03:59:24Z

In the example T5 training code, the main function creates the model and dataset on every worker, regardless of rank, before passing the model to FSDP. Does this mean there are n copies of the model and dataset when the script is run with torchrun and n processes?

tsengalb99 · 2024-09-25T04:25:53Z

My code is set up similarly to the T5 example code, and the memory consumption per GPU is the same regardless of the number of torchrun processes I use, so it does seem that I am creating n copies of the model. How can I avoid this?
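One way to sanity-check the observation above is to compare the measured per-GPU memory against a rough estimate of parameter memory per rank. The sketch below is only back-of-envelope arithmetic under stated assumptions: the function name `param_bytes_per_rank` and the parameter count are hypothetical and not taken from the example code. It counts parameters only and ignores gradients, optimizer state, activations, and any temporarily gathered parameters. If parameters are fully sharded, the estimate should shrink roughly in proportion to the number of processes. If they are replicated, it stays constant.

```python
import math


def param_bytes_per_rank(num_params: int, bytes_per_param: int,
                         world_size: int, sharded: bool) -> int:
    """Rough parameter memory per rank, in bytes (hypothetical helper).

    Assumes parameters are either fully replicated on every rank or split
    evenly across ranks, rounding the shard size up to a whole parameter.
    """
    if world_size < 1:
        raise ValueError("world_size must be at least 1")
    if not sharded:
        # Every rank holds the full set of parameters.
        return num_params * bytes_per_param
    # Each rank holds about 1/world_size of the parameters.
    return math.ceil(num_params / world_size) * bytes_per_param


if __name__ == "__main__":
    num_params = 220_000_000  # illustrative count, not taken from the thread
    bytes_per_param = 4       # assumes 32-bit floating point parameters
    for world_size in (1, 2, 4, 8):
        replicated = param_bytes_per_rank(num_params, bytes_per_param, world_size, sharded=False)
        sharded = param_bytes_per_rank(num_params, bytes_per_param, world_size, sharded=True)
        print(f"n={world_size}: replicated={replicated / 2**20:.0f} MiB, "
              f"sharded={sharded / 2**20:.0f} MiB")
```

Because this estimate covers parameters only, a per-GPU total that does not change with n is not by itself proof that the parameters are replicated. Other terms, such as activations at a fixed per-GPU batch size, do not shrink as processes are added.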
