
Interface with Hugging Face Accelerate for distributed training #11

Open
rosikand opened this issue Dec 29, 2022 · 2 comments

Comments

@rosikand
Owner

Create a new distributed_train function in torchplate.experiment.Experiment that interfaces with Hugging Face Accelerate for zero-overhead distributed training of PyTorch models. Avoid explicit .to(device) placements, as the accelerate library handles device placement for you. This function can be called even with a single GPU.
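
A rough sketch of what this could look like, following Accelerate's standard training-loop pattern. The free-function signature and loss computation here are illustrative only; in torchplate this would presumably be a method on Experiment that pulls the model, optimizer, and dataloader off self and gets the loss from the experiment's evaluate hook:

```python
import torch
from accelerate import Accelerator

def distributed_train(model, optimizer, trainloader, num_epochs=1):
    accelerator = Accelerator()
    # prepare() moves everything to the right device(s) and wraps the
    # model for DDP when launched with multiple processes, so no manual
    # .to(device) calls are needed. Works unchanged on a single GPU or CPU.
    model, optimizer, trainloader = accelerator.prepare(
        model, optimizer, trainloader
    )
    model.train()
    for _ in range(num_epochs):
        for x, y in trainloader:
            optimizer.zero_grad()
            # Illustrative loss; torchplate would get this from evaluate().
            loss = torch.nn.functional.cross_entropy(model(x), y)
            accelerator.backward(loss)  # replaces loss.backward()
            optimizer.step()

# Toy usage: runs as-is on one device; launching the same script with
# `accelerate launch` distributes it with no code changes.
model = torch.nn.Linear(10, 2)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
dataset = torch.utils.data.TensorDataset(
    torch.randn(32, 10), torch.randint(0, 2, (32,))
)
loader = torch.utils.data.DataLoader(dataset, batch_size=8)
distributed_train(model, optimizer, loader)
```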

@rosikand
Owner Author

Optional parameters:

  • split_batches=True: if True, Accelerate splits each batch yielded by the dataloader across processes, so the configured batch size is the true global batch size; if False (the default), every process draws a full batch, making the effective batch size batch_size × num_processes. See the sketch below.
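
A minimal sketch of how the flag maps onto Accelerate (exposing it as a passthrough to the Accelerator constructor is an assumption about how distributed_train would be written):

```python
from accelerate import Accelerator

# With split_batches=True, a dataloader batch size of 64 stays 64
# globally: each of N processes sees 64 // N examples per step. With
# the default False, each process sees 64, for a global batch size
# of 64 * N.
accelerator = Accelerator(split_batches=True)
print(accelerator.num_processes)  # 1 when run on a single GPU/CPU
```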

@rosikand
Owner Author

Note: I think heavy edits will be needed to get this to interface with the metrics properly (see this function), since each process only sees its own shard of the data. The same is true for model serialization.
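
For reference, the two Accelerate idioms this would involve, sketched with a placeholder tensor and a toy model (how they slot into torchplate's metrics and serialization code is the open question):

```python
import torch
from accelerate import Accelerator

accelerator = Accelerator()

# Metrics: each process only sees its shard of the data, so per-batch
# statistics must be gathered across processes before reducing them.
local_correct = torch.tensor([3], device=accelerator.device)  # placeholder count
total_correct = accelerator.gather(local_correct).sum()

# Serialization: unwrap the DDP wrapper and save from the main process
# only, so N processes don't race to write N copies of the checkpoint.
model = accelerator.prepare(torch.nn.Linear(4, 2))
accelerator.wait_for_everyone()
unwrapped = accelerator.unwrap_model(model)
if accelerator.is_main_process:
    accelerator.save(unwrapped.state_dict(), "model.pt")
```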
