Transformer.Trainer fails in creating optimizer for optim adamw_torch_fused when launched with deepspeed. #31867
Comments
Hi @amyeroberts! I'll take a look at this issue.
Hello, I ran into some issues installing deepspeed, and I am not sure how quickly I will be able to resolve them. I think it would be better if someone else takes this task.
I guess I'll take a crack at this one.
Hi @princethewinner, this seems to be caused by a version mismatch between PyTorch and DeepSpeed. It can be resolved by rolling PyTorch forward from 2.2.1 to 2.4 (make sure you uninstall and reinstall deepspeed too, because its installation is conditioned on your torch install). Alternatively, an older version of deepspeed might work too, but I didn't experiment with that. @ArthurZucker I think this issue can be closed; I don't think there is anything transformers-related here.
Please let us know, @princethewinner, if the issue is fixed by upgrading torch!
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread. Please note that issues that do not follow the contributing guidelines are likely to be ignored.
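Before reinstalling anything, it may help to confirm which versions are actually present. The following is a minimal sketch, not part of the original thread: the helper names (version_tuple, installed_version, torch_needs_upgrade) are hypothetical, and the 2.4 threshold is simply the version suggested in the comment above, not a verified minimum.

```python
import itertools
from importlib import metadata


def version_tuple(version):
    """Parse a version string such as '2.2.1+cu121' into (2, 2, 1).

    Local build tags ('+cu121') and non-numeric suffixes are ignored.
    """
    release = version.split("+")[0]
    parts = []
    for piece in release.split(".")[:3]:
        digits = "".join(itertools.takewhile(str.isdigit, piece))
        parts.append(int(digits) if digits else 0)
    return tuple(parts)


def installed_version(package):
    """Return the installed version string of a package, or None if absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def torch_needs_upgrade(torch_version, minimum=(2, 4)):
    """True if torch_version is older than `minimum`.

    The default (2, 4) is an assumption taken from the suggestion in this
    thread, not an officially documented requirement.
    """
    return version_tuple(torch_version) < minimum


if __name__ == "__main__":
    for name in ("torch", "deepspeed", "transformers"):
        print(f"{name}: {installed_version(name) or 'not installed'}")
    torch_version = installed_version("torch")
    if torch_version and torch_needs_upgrade(torch_version):
        print("torch is older than 2.4; consider upgrading, then reinstalling deepspeed")
```

Running the script in the same environment that launches training prints the three versions and flags a torch install older than the one suggested here.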
System Info
transformers version: 4.42.3
Who can help?
@muellerzr
The issue arises when the script is launched with deepspeed. It seems that the model is not yet loaded onto the GPU when create_optimizer is called, and so creating the optimizer fails.
Launch command
Output:
However, setting deepspeed_dict=None and using the same launch command does not cause any error, and training continues as usual. So I am guessing it could be caused by conflicting deepspeed settings or incorrect parsing of the deepspeed settings.
Information
Tasks
examples folder (such as GLUE/SQuAD, ...)
Reproduction
Expected behavior
Training should be completed.
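The hypothesis in this report, that the model's parameters are not yet on the GPU when create_optimizer runs, can be checked with a small diagnostic. The sketch below is an assumption-laden illustration rather than code from the report: param_device_types, all_params_on_cuda, and FakeModel are hypothetical names, and FakeModel only stands in for a torch.nn.Module so the example runs without torch installed.

```python
from types import SimpleNamespace


def param_device_types(model):
    """Return the set of device types (e.g. {'cpu'} or {'cuda'}) of the parameters.

    Works with any object whose .parameters() yields items exposing
    .device.type, which is the interface a torch.nn.Module provides.
    """
    return {p.device.type for p in model.parameters()}


def all_params_on_cuda(model):
    """True only if the model has parameters and every one of them is on CUDA."""
    return param_device_types(model) == {"cuda"}


class FakeModel:
    """Hypothetical stand-in for a torch.nn.Module, used only for this demo."""

    def __init__(self, device_types):
        self._params = [
            SimpleNamespace(device=SimpleNamespace(type=t)) for t in device_types
        ]

    def parameters(self):
        return iter(self._params)


if __name__ == "__main__":
    on_cpu = FakeModel(["cpu", "cpu"])
    on_gpu = FakeModel(["cuda", "cuda"])
    print(param_device_types(on_cpu), all_params_on_cuda(on_cpu))
    print(param_device_types(on_gpu), all_params_on_cuda(on_gpu))
```

With a real model, one option would be to call param_device_types(self.model) from an overridden create_optimizer in a Trainer subclass before delegating to the parent method, and compare the result with and without the deepspeed configuration.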