Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Edge case causes incorrect filesystem to be selected for finding cloud checkpoints #17912

Open
schmidt-ai opened this issue Jun 24, 2023 · 6 comments · May be fixed by #18252
Open

Edge case causes incorrect filesystem to be selected for finding cloud checkpoints #17912

schmidt-ai opened this issue Jun 24, 2023 · 6 comments · May be fixed by #18252
Labels
3rd party Related to a 3rd-party bug Something isn't working checkpointing Related to checkpointing help wanted Open to be worked on ver: 2.0.x
Milestone

Comments

@schmidt-ai
Copy link
Contributor

schmidt-ai commented Jun 24, 2023

Bug description

When both of the following happen together:

  1. a logger is used with a cloud (e.g s3:// or gcs:// protocol) save dir
  2. a ModelCheckpoint is used without passing a dirpath

The desired behaviors are:

  1. the checkpoint directory is resolved (via ModelCheckpoint.__resolve_ckpt_dir) to $logger.save_dir/$logger.name/$logger.version/checkpoints and the ModelCheckpoint callback saves them there.
  2. ModelCheckpoint._find_last_checkpoints will find $logger.save_dir/$logger.name/$logger.version/checkpoints/last.ckpt. If will first check if that path exists on the filesystem instantiated in ModelCheckpoint.__init_ckpt_dir.

Desired behavior 1 works, 2 does not. There are two bugs:

  1. ModelCheckpoint.__init_ckpt_dir will select the wrong filesystem when dirpath is None, causing ModelCheckpoint._find_last_checkpoints to not find the cloud filepaths.
  2. Even if the correct ModelCheckpoint._fs were used, _find_last_checkpoints returns a set of paths with their protocols stripped (due to the call to _fs.ls). This causes _CheckpointConnector_parse_ckpt_path to then also select the wrong filesystem, resulting in no checkpoints found.

What version are you seeing the problem on?

v2.0; but likely also present on others

How to reproduce the bug

  1. Use a logger with a cloud save dir
  2. Create some cloud checkpoint, e.g. s3://.../logger_name/logger_version/checkpoints/last.ckpt
  3. From a new job, try to resume training using ckpt_path="last"
  4. A warning will be emitted about how lightning couldn't find the checkpoint

Error messages and logs

UserWarning: .fit(ckpt_path="last") is set, but there is no last checkpoint available. No checkpoint will be loaded.

Environment

Current environment
#- Lightning Component: ModelCheckpoint
#- PyTorch Lightning Version: 2.0.3
#- Lightning App Version: N/A
#- PyTorch Version: 2.0.1
#- Python version: 3.10.11
#- OS: Linux
#- CUDA/cuDNN version: 11.7
#- GPU models and configuration: 1x T4
#- How you installed Lightning: `pip`
#- Running environment of LightningApp (e.g. local, cloud): AWS Sagemaker

More info

Here is my current workaround for S3 checkpoints:

from s3fs import S3FileSystem


class S3ModelCheckpoint(ModelCheckpoint):
    def __init__(self, *args: str | None, **kwargs: str | None) -> None:
        super().__init__(*args, **kwargs)
        self._fs = S3FileSystem()

    def _find_last_checkpoints(self, trainer: "L.Trainer") -> set[str]:
        ckpts = {"s3://" + ckpt for ckpt in super()._find_last_checkpoints(trainer)}
        return ckpts

The Universal Pathlib project fixes the behavior of cloud paths so that the procotols aren't stripped off. Could be worth looking into, to prevent these sorts of edge cases from occurring.

cc @awaelchli

@schmidt-ai schmidt-ai added bug Something isn't working needs triage Waiting to be triaged by maintainers labels Jun 24, 2023
@schmidt-ai
Copy link
Contributor Author

@schmidt-ai schmidt-ai changed the title Edge case causes incorrect filesystem to be selected for finding checkpoints Edge case causes incorrect filesystem to be selected for finding cloud checkpoints Jun 24, 2023
@schmidt-ai
Copy link
Contributor Author

@awaelchli kindly bringing this issue to your attention. Lmk if I can help fix.

@awaelchli awaelchli added checkpointing Related to checkpointing 3rd party Related to a 3rd-party and removed needs triage Waiting to be triaged by maintainers labels Aug 1, 2023
@awaelchli awaelchli added this to the 2.0.x milestone Aug 1, 2023
@awaelchli
Copy link
Contributor

@schmidt-ai Thanks for the detailed explanation, this is very helpful and I was able to understand where the problem. However, I'm not sure how we can prevent the paths to be stripped. Need to figure out what the best practice is here. Any help would of course be appreciated.

@awaelchli awaelchli added the help wanted Open to be worked on label Aug 1, 2023
@schmidt-ai
Copy link
Contributor Author

@awaelchli two ideas:

  1. adopt upath
  2. manually store the URL protocol and manually prepend it where needed

Pros and cons to each. #2 would probably be easier though. Wdyt?

@awaelchli
Copy link
Contributor

awaelchli commented Aug 1, 2023

  1. Thanks, I didn't know about upath. Adding this as dependency would require a discussion, probably better as a separate issue proposal. We heavily use pathlib.Path and we would have to evaluate whether the effort to make such a change is worth it and provides enough value for the user or developer. The short-term solution should be to fix your problem without upath, if possible.
  2. Would that look similar as in your code snippet above? We would store it in the model checkpoint object directly and then access it there? This would be fine with me.

@schmidt-ai
Copy link
Contributor Author

  1. yep, I agree with that.
  2. More or less. I would generalize it to all fsspec protocols, not just s3. I think it would be fine to store in the model checkpoint.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
3rd party Related to a 3rd-party bug Something isn't working checkpointing Related to checkpointing help wanted Open to be worked on ver: 2.0.x
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants