-
Notifications
You must be signed in to change notification settings - Fork 3.5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add atomic save to checkpoint routine #20011
Conversation
… should make the save more robust to interrupts.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks great. Just a few suggestions
Co-authored-by: awaelchli <aedu.waelchli@gmail.com>
Codecov ReportAll modified and coverable lines are covered by tests ✅
Additional details and impacted files@@ Coverage Diff @@
## master #20011 +/- ##
=========================================
- Coverage 84% 59% -25%
=========================================
Files 427 422 -5
Lines 35504 35407 -97
=========================================
- Hits 29781 20773 -9008
- Misses 5723 14634 +8911 |
Thanks @corwinjoy |
Thanks for the help @awaelchli !
…On Tue, Jun 25, 2024, 1:53 AM awaelchli ***@***.***> wrote:
Thanks @corwinjoy <https://github.com/corwinjoy>
I also made a follow-up issue for the distributed checkpoint saving logic.
—
Reply to this email directly, view it on GitHub
<#20011 (comment)>,
or unsubscribe
<https://github.com/notifications/unsubscribe-auth/ADBF42XZMLPG43N4Z5Z4RR3ZJEVZFAVCNFSM6AAAAABJ2XXZXKVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMZDCOBYGM2DINZVGM>
.
You are receiving this because you were mentioned.Message ID:
***@***.***>
|
Hi, this PR changes the permissions of the saved checkpoints on macOS and Linux (not sure about Windows). I suspect that was unintentional, and if so, should a new ticket be filed? On macOS: This snippet shows previous vs new behavior.
Thanks |
Yes, this change to the default file permissions is unintended. This seems
like a problem with fsspec, though, I don't see why they would use
nonstandard file permissions there. I think a first step might be to open a
ticket there and understand the rationale behind it. Unless you already
have a reference on why they are doing this?
…On Mon, Feb 24, 2025, 3:46 PM brandon-smith-xperi ***@***.***> wrote:
Hi, this PR changes the permissions of the saved checkpoints on macOS and
Linux (not sure about Windows). I suspect that was unintentional, and if
so, should a new ticket be filed?
On macOS:
Previous permissions: 644 (this is the expected permissions for a new file)
New permissions: 600 (this is not the expected permissions for a new file)
This snippet shows previous vs new behavior.
import io
import fsspec
import torch
from fsspec.core import url_to_fs
x = torch.zeros(1)
bytesbuffer = io.BytesIO()
torch.save(x, bytesbuffer)
# original correct permissions
with fsspec.open('correct_permissions.pt', "wb") as f:
f.write(bytesbuffer.getvalue())
# new incorrect permissions
fs, urlpath = fsspec.core.url_to_fs('incorrect_permissions.pt')
with fs.transaction, fs.open(urlpath, "wb") as f:
f.write(bytesbuffer.getvalue())
Thanks
—
Reply to this email directly, view it on GitHub
<#20011 (comment)>,
or unsubscribe
<https://github.com/notifications/unsubscribe-auth/ADBF42W4745LN7UG6WW6VGD2ROVLVAVCNFSM6AAAAABXZIQU2KVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMZDMNZZHE2TMNJVGA>
.
You are receiving this because you were mentioned.Message ID:
***@***.***>
[image: brandon-smith-xperi]*brandon-smith-xperi* left a comment
(Lightning-AI/pytorch-lightning#20011)
<#20011 (comment)>
Hi, this PR changes the permissions of the saved checkpoints on macOS and
Linux (not sure about Windows). I suspect that was unintentional, and if
so, should a new ticket be filed?
On macOS:
Previous permissions: 644 (this is the expected permissions for a new file)
New permissions: 600 (this is not the expected permissions for a new file)
This snippet shows previous vs new behavior.
import io
import fsspec
import torch
from fsspec.core import url_to_fs
x = torch.zeros(1)
bytesbuffer = io.BytesIO()
torch.save(x, bytesbuffer)
# original correct permissions
with fsspec.open('correct_permissions.pt', "wb") as f:
f.write(bytesbuffer.getvalue())
# new incorrect permissions
fs, urlpath = fsspec.core.url_to_fs('incorrect_permissions.pt')
with fs.transaction, fs.open(urlpath, "wb") as f:
f.write(bytesbuffer.getvalue())
Thanks
—
Reply to this email directly, view it on GitHub
<#20011 (comment)>,
or unsubscribe
<https://github.com/notifications/unsubscribe-auth/ADBF42W4745LN7UG6WW6VGD2ROVLVAVCNFSM6AAAAABXZIQU2KVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMZDMNZZHE2TMNJVGA>
.
You are receiving this because you were mentioned.Message ID:
***@***.***>
|
Thanks @corwinjoy, I'll file a ticket with fsspec |
What does this PR do?
Upgrade the _atomic_checkpoint routine to use a transaction. This should make saves atomic so that if an interrupt happens during a file save, any existing file will not be corrupted.
The following links the related issue to the PR: #19970
Fixes #19970
I have skipped updating the changelog since it looks like this gets updated as part of the release.
Before submitting
PR review
Anyone in the community is welcome to review the PR.
Before you start reviewing, make sure you have read the review guidelines. In short, see the following bullet-list:
Reviewer checklist
📚 Documentation preview 📚: https://pytorch-lightning--20011.org.readthedocs.build/en/20011/