Multi-GPU training error: Directory not empty #948
Comments
Should be fixed by huggingface/transformers#27925.
@manishiitg I still get the same checkpoint-saving error with the latest version of transformers. I used three workers, each with two GPUs, and tried saving the fine-tuned checkpoints to both shared and non-shared storage; in both cases I still got the same error: `FileNotFoundError: [Errno 2] No such file or directory: 'model/tmp-checkpoint-49' -> 'model/checkpoint-49'`
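For context, here is a minimal, hedged sketch (not the actual transformers or axolotl code) of the save pattern behind this traceback: the trainer writes the checkpoint into a temporary directory and then renames it into place, and if more than one process attempts that rename, the losers fail with exactly this `FileNotFoundError` (or "Directory not empty" if the target already exists).

```python
import os
import torch.distributed as dist

# Minimal sketch (not the actual transformers code) of the save pattern that
# can race across processes: each process writes into a temporary directory
# and then renames it to the final checkpoint directory. If more than one
# process performs the rename, the losers see FileNotFoundError (the tmp dir
# is already gone) or "Directory not empty" (the target already exists).
def save_checkpoint(output_dir: str, step: int, is_main_process: bool) -> None:
    checkpoint_dir = os.path.join(output_dir, f"checkpoint-{step}")
    tmp_dir = os.path.join(output_dir, f"tmp-checkpoint-{step}")

    if is_main_process:
        os.makedirs(tmp_dir, exist_ok=True)
        # ... write model weights, optimizer state, etc. into tmp_dir ...
        # Atomic publish: only one process may perform this rename.
        os.rename(tmp_dir, checkpoint_dir)

    # Everyone waits until the rename is done before continuing, so no
    # process races ahead and touches a half-finished directory.
    if dist.is_available() and dist.is_initialized():
        dist.barrier()
```

A fix along these lines restricts the rename to a single writer and synchronizes the other ranks behind it; if every rank calls `os.rename` unguarded, the first one wins and the rest fail as in the traceback above.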
It works fine for me. You should raise this issue in the transformers repo if it still exists.
@manishiitg did you use multi-node with multiple GPUs, or a single machine with multiple GPUs?
Single machine with multiple GPUs.
Thanks @manishiitg
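Based on the exchange above (works on a single machine, still fails with three multi-GPU workers), one plausible explanation is which process is allowed to perform the rename. The helper below is purely illustrative (its name and the `storage_is_shared` flag are assumptions, not an axolotl or transformers API): on shared storage only the global main process should publish the checkpoint, while on node-local storage each node's local main process must.

```python
import os
import torch.distributed as dist

# Hypothetical helper (not from axolotl or transformers) showing why
# multi-node runs can behave differently from a single machine: on shared
# storage only the global rank-0 process should publish the checkpoint,
# while on node-local storage each node's local rank-0 process must do it
# for its own disk.
def should_publish_checkpoint(storage_is_shared: bool) -> bool:
    global_rank = dist.get_rank() if dist.is_initialized() else 0
    local_rank = int(os.environ.get("LOCAL_RANK", 0))

    if storage_is_shared:
        # One writer for the whole job; otherwise several node leaders
        # race on the same tmp-checkpoint-* directory.
        return global_rank == 0
    # One writer per node; each node has its own output directory on disk.
    return local_rank == 0
```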
Please check that this issue hasn't been reported before.
Expected Behavior
Multi-GPU training should save checkpoints without errors.
Current behaviour
Checkpoint saving fails with a "Directory not empty" error during multi-GPU training.
Steps to reproduce
Train a Mistral model on multiple GPUs.
Config yaml
No response
Possible solution
No response
Which Operating Systems are you using?
Python Version
3.10
axolotl branch-commit
main
Acknowledgements