Async model checkpoint upload fails when the file is renamed #123
Comments
https://allegroai-trains.slack.com/archives/CTK20V944/p1586771229061800
Hi @iakremnev,
Hi @iakremnev, Please see if the latest RC fixes the problem:
Yes, the issue is solved! Thanks for the quick fix ❤️
@bmartinn Is this really fixed? I'm facing the same issue. I use `logger = TrainsLogger(auto_connect_frameworks=True)`
@S-aiueo32 what exactly are you experiencing?
Bug
While using `TrainsLogger` in PyTorch-Lightning, the latter saves model checkpoints via the `_atomic_save` function, which renames the file right after calling `torch.save`. Consequently, Trains' hook on `torch.save` tries to asynchronously upload the `.ckpt.part` file and fails.

The original issue on the PyTorch-Lightning GitHub: Lightning-AI/pytorch-lightning#1466
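
The failure mode described above can be reproduced without either library. The following is a minimal sketch, not the actual Trains or PyTorch-Lightning code: `save` stands in for `torch.save`, `hooked_save` for a patched save that queues the written path for a later upload, and `atomic_save` for a write-then-rename helper. All names, and the assumption that the hook records the path passed to the save call, are hypothetical.

```python
import os
import pickle
import tempfile

pending_uploads = []  # paths queued by the hook, processed later


def save(obj, path):
    """Stand-in for torch.save: serialize obj to path."""
    with open(path, "wb") as f:
        pickle.dump(obj, f)


def hooked_save(obj, path):
    """Hypothetical patched save: write, then queue the path for a deferred upload."""
    save(obj, path)
    pending_uploads.append(path)


def atomic_save(obj, path):
    """Write to a temporary '.part' file, then rename it to the final path."""
    tmp_path = str(path) + ".part"
    hooked_save(obj, tmp_path)   # the hook only ever sees the temporary name
    os.replace(tmp_path, path)   # the temporary file no longer exists after this


def process_uploads():
    """Simulated background uploader: report whether each queued path still exists."""
    results = [(p, os.path.exists(p)) for p in pending_uploads]
    pending_uploads.clear()
    return results


def demo():
    with tempfile.TemporaryDirectory() as d:
        ckpt = os.path.join(d, "model.ckpt")
        atomic_save({"weights": [1, 2, 3]}, ckpt)
        results = process_uploads()
        return results, os.path.exists(ckpt)
```

Calling `demo()` shows that the queued path ends in `.ckpt.part` and is already gone by the time the simulated uploader runs, while the final `model.ckpt` exists. A hook that captured the file contents synchronously, or that was told the final path, would not hit this problem in the sketch.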