Edge case causes incorrect filesystem to be selected for finding cloud checkpoints #17912
Comments
A similar bug occurs in
@awaelchli kindly bringing this issue to your attention. Lmk if I can help fix it.
@schmidt-ai Thanks for the detailed explanation. This is very helpful, and I was able to understand where the problem is. However, I'm not sure how we can prevent the paths from being stripped. We need to figure out what the best practice is here. Any help would of course be appreciated.
@awaelchli two ideas:
Pros and cons to each. #2 would probably be easier though. Wdyt?
Bug description
When both of the following happen together:
- the logger's save dir is a cloud path (`s3://` or `gcs://` protocol)
- `ModelCheckpoint` is used without passing a `dirpath`
The desired behaviors are:
1. The checkpoint directory resolves (in `ModelCheckpoint.__resolve_ckpt_dir`) to `$logger.save_dir/$logger.name/$logger.version/checkpoints`, and the `ModelCheckpoint` callback saves checkpoints there.
2. `ModelCheckpoint._find_last_checkpoints` will find `$logger.save_dir/$logger.name/$logger.version/checkpoints/last.ckpt`. It will first check whether that path exists on the filesystem instantiated in `ModelCheckpoint.__init_ckpt_dir`.
Desired behavior 1 works, 2 does not. There are two bugs:
1. `ModelCheckpoint.__init_ckpt_dir` will select the wrong filesystem when `dirpath` is `None`, causing `ModelCheckpoint._find_last_checkpoints` to not find the cloud filepaths.
2. Even if the correct `ModelCheckpoint._fs` were used, `_find_last_checkpoints` returns a set of paths with their protocols stripped (due to the call to `_fs.ls`). This causes `_CheckpointConnector._parse_ckpt_path` to then also select the wrong filesystem, resulting in no checkpoints found.
What version are you seeing the problem on?
v2.0; but likely also present on others
How to reproduce the bug
s3://.../logger_name/logger_version/checkpoints/last.ckpt
ckpt_path="last"
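To illustrate the first bug without needing cloud access, here is a minimal standard-library sketch of protocol-based filesystem selection. It is not Lightning's actual code: `protocol_of` is a hypothetical helper, and the bucket, logger name, and version in the example path are made up. It only shows why picking the filesystem while `dirpath` is still `None` yields a local filesystem, whereas picking it from the resolved directory yields the cloud one.

```python
from typing import Optional
from urllib.parse import urlsplit


def protocol_of(path: Optional[str]) -> str:
    """Return the URL scheme of `path`, falling back to "file" (hypothetical helper)."""
    if not path:
        # No path at all: a scheme-based lookup can only assume the local filesystem.
        return "file"
    return urlsplit(path).scheme or "file"


# ModelCheckpoint was created without a dirpath, so it is None at init time.
dirpath = None
# Hypothetical directory that the callback later resolves from the logger.
resolved_dir = "s3://example-bucket/my_logger/version_0/checkpoints"

protocol_at_init = protocol_of(dirpath)
protocol_after_resolve = protocol_of(resolved_dir)

print(protocol_at_init, protocol_after_resolve)
```

Under these assumptions, a fix for the first bug would need to choose (or re-choose) the filesystem after the directory has been resolved, not before.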
Error messages and logs
Environment
Current environment
More info
Here is my current workaround for S3 checkpoints:
The Universal Pathlib project fixes the behavior of cloud paths so that the protocols aren't stripped off. It could be worth looking into, to prevent these sorts of edge cases from occurring.
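As a rough illustration of what keeping the protocol would mean for the second bug, the sketch below re-attaches a known protocol to paths that come back stripped from a listing call. This is neither Universal Pathlib's API nor a proposed Lightning patch: `with_protocol` is a hypothetical helper, the paths are invented, and only the standard library is used.

```python
from urllib.parse import urlsplit


def with_protocol(path: str, protocol: str) -> str:
    """Prefix `path` with `protocol://` unless it is local or already has a scheme."""
    if protocol == "file" or urlsplit(path).scheme:
        return path
    return f"{protocol}://{path.lstrip('/')}"


# Hypothetical result of listing a cloud checkpoint directory: the scheme is gone.
stripped = ["example-bucket/my_logger/version_0/checkpoints/last.ckpt"]

# Re-attach the protocol of the filesystem that produced the listing.
restored = {with_protocol(p, "s3") for p in stripped}

print(restored)
```

With the protocol restored, any later code that picks a filesystem from the path string would see `s3://` again instead of treating the path as local.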
cc @awaelchli