
Add AWS S3 i/o #2175

Closed
wants to merge 6 commits into from

Conversation

@Laksh1997 commented Jun 13, 2020

Before submitting

  • Was this discussed/approved via a GitHub issue? (not needed for typos and docs improvements)
  • Did you read the contributor guideline, Pull Request section?
  • Did you make sure to update the docs?
  • Did you write any new necessary tests?
  • If you made a notable change (that affects users), did you update the CHANGELOG?

What does this PR do?

Allows the user to save and load checkpoints to and from S3 paths (e.g. "s3://my-bucket/my-folder/my-checkpoint.ckpt").
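For illustration, usage would presumably look something like this (a hedged sketch; the exact ModelCheckpoint/Trainer wiring is assumed from the 0.8-era API and is not spelled out in this description):

```python
from pytorch_lightning import Trainer
from pytorch_lightning.callbacks import ModelCheckpoint

# Point the checkpoint callback at an S3 URI instead of a local directory.
checkpoint_callback = ModelCheckpoint(filepath="s3://my-bucket/my-folder/my-checkpoint.ckpt")
trainer = Trainer(checkpoint_callback=checkpoint_callback)

# Loading back would use the same URI, e.g. (MyModel is a placeholder LightningModule):
# model = MyModel.load_from_checkpoint("s3://my-bucket/my-folder/my-checkpoint.ckpt")
```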

PR review

Anyone in the community is free to review the PR once the tests have passed.
If we didn't discuss your PR in GitHub issues, there's a high chance it will not be merged.

Did you have fun?

Make sure you had fun coding 🙃



def load_s3_checkpoint(checkpoint_path, map_location, **pickle_load_args):
from torch.serialization import _legacy_load
Member

why is _legacy_load required here? I think we should aim for torch.load here

Author

We could use boto3 to download an S3 file, then use torch.load. Thoughts?
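A minimal sketch of that approach (the bucket/key parsing and the exact signature are assumptions, not code from this PR):

```python
import tempfile
from urllib.parse import urlparse

import boto3
import torch


def load_s3_checkpoint(checkpoint_path, map_location=None, **pickle_load_args):
    # e.g. checkpoint_path = "s3://my-bucket/my-folder/my-checkpoint.ckpt"
    parsed = urlparse(checkpoint_path)
    bucket, key = parsed.netloc, parsed.path.lstrip("/")

    # Download to a temporary local file, then defer to the regular torch.load.
    with tempfile.NamedTemporaryFile(suffix=".ckpt") as tmp:
        boto3.client("s3").download_fileobj(bucket, key, tmp)
        tmp.seek(0)
        return torch.load(tmp, map_location=map_location, **pickle_load_args)
```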



def save_s3_checkpoint(checkpoint, checkpoint_path):
from torch.serialization import _legacy_save
Member

same as above. IMO we should aim for torch.save

Author

torch.save can only save locally. We could use boto3 to upload locally saved files. Thoughts?
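A rough sketch of that direction, assuming boto3 (the helper name and parsing are illustrative):

```python
import tempfile
from urllib.parse import urlparse

import boto3
import torch


def save_s3_checkpoint(checkpoint, checkpoint_path):
    # e.g. checkpoint_path = "s3://my-bucket/my-folder/my-checkpoint.ckpt"
    parsed = urlparse(checkpoint_path)
    bucket, key = parsed.netloc, parsed.path.lstrip("/")

    # Save with the regular torch.save to a local temporary file, then upload it.
    with tempfile.NamedTemporaryFile(suffix=".ckpt") as tmp:
        torch.save(checkpoint, tmp)
        tmp.flush()
        tmp.seek(0)
        boto3.client("s3").upload_fileobj(tmp, bucket, key)
```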

@Borda Borda added the feature Is an improvement or enhancement label Jun 13, 2020
@williamFalcon (Contributor)

@Borda this has failing tests

@Laksh1997 (Author)

Thoughts on _legacy_save and direct write to S3 vs normal torch.save to local, then boto3 upload to S3?

@Borda Borda changed the title (WIP) Add AWS S3 i/o [blocked by #2176] Add AWS S3 i/o Jun 13, 2020
@Borda (Member) commented Jun 13, 2020

> @Borda this has failing tests

I am fixing master here: #2176

> Thoughts on _legacy_save and direct write to S3 vs normal torch.save to local, then boto3 upload to S3?

Depends on how long the legacy func will be available in PyTorch.
@Laksh1997 mind adding a test for these new functions?

@Laksh1997 (Author)

I'm thinking about how to add tests.

One option is to actually save and load a file against an open-access S3 bucket.

Another is to use the moto framework; this would require using boto3 for the upload, however.

Still thinking about the best solution, any help appreciated.
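A sketch of what a moto-based test could look like (assuming the boto3-based save/load helpers sketched above; the bucket name is made up, and mock_s3 is the decorator name used by moto versions of that era):

```python
import boto3
import torch
from moto import mock_s3


@mock_s3
def test_s3_checkpoint_roundtrip():
    # moto intercepts boto3 calls, so no real AWS credentials or bucket are needed.
    boto3.client("s3", region_name="us-east-1").create_bucket(Bucket="test-bucket")

    path = "s3://test-bucket/checkpoints/test.ckpt"
    original = {"epoch": 3, "state_dict": {"weight": torch.zeros(2, 2)}}

    save_s3_checkpoint(original, path)
    loaded = load_s3_checkpoint(path)

    assert loaded["epoch"] == original["epoch"]
    assert torch.equal(loaded["state_dict"]["weight"], original["state_dict"]["weight"])
```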

@awaelchli (Contributor)

> Thoughts on _legacy_save and direct write to S3 vs normal torch.save to local, then boto3 upload to S3?

Anything that has "legacy" in its name is a big alarm sign for me hehe :)
Aren't we supposed to use TorchServe to upload stuff to the Amazon servers?

@Laksh1997 (Author)

@awaelchli I agree that _legacy_save raises alarms.

Shall we do this then?

  • The user passes in a checkpoint filepath, e.g. s3://my-bucket/my-key/checkpoint.ckpt

  • First, we save it locally as usual with torch.save, using the filepath my-key/checkpoint.ckpt

  • We then upload the file to the specified S3 path using boto3

The only complication with the above is the following: if model checkpointing is configured to keep only the best 5 models, we'll end up saving many more to S3, because an upload happens every time a new model entry occurs. To get around this, we would also have to delete the corresponding files on S3.

Thoughts?

@Laksh1997 (Author)

Thoughts on deleting the cloud checkpoint if it's not in the top K models anymore?

We would need to edit _del_model in model_checkpoint.py on line 158
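For context, a hedged sketch of what an S3-aware _del_model might look like (the method name comes from model_checkpoint.py; everything else below is purely illustrative and assumes boto3):

```python
import os
from urllib.parse import urlparse

import boto3


class ModelCheckpoint:  # illustrative stand-in for the real callback class
    def _del_model(self, filepath):
        # Hypothetical S3-aware variant: delete the remote object for s3:// paths,
        # otherwise fall back to removing the local file as before.
        if filepath.startswith("s3://"):
            parsed = urlparse(filepath)
            boto3.client("s3").delete_object(Bucket=parsed.netloc, Key=parsed.path.lstrip("/"))
        elif os.path.isfile(filepath):
            os.remove(filepath)
```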

@Laksh1997 (Author)

Alright, moving everything over to boto3. That will make the tests easier with moto.

@f4hy (Contributor) commented Jun 13, 2020

This seems related to #2164, which would support s3/gcs/hdfs.

@Laksh1997 (Author)

@f4hy Yeah, it seems so. What shall we do?

@f4hy (Contributor) commented Jun 13, 2020

> @f4hy Yeah, it seems so. What shall we do?

So I need support for writing all logs and such to a remote path, not just the checkpoints. And I need support for paths other than just s3.

That's why I set up #2164 to use gfile. This problem has been solved elsewhere, so we should just use some other lib. Gfile is great, but it requires pip install tensorflow to get full support. Tensorboard has a pruned-down implementation that uses the full tf gfile if present and otherwise gives just local and s3 support, so hopefully that is acceptable and requires no extra dependency.

I think it's a mistake to do this in a way that just implements s3. Then we would need to implement the same thing again for gcs or hdfs or others. If the gfile solution is not acceptable, I hope we can find some other lib that would do this for us.
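For illustration, a rough sketch of the gfile approach (assuming TensorBoard's tf compat shim, which as I understand it is what #2164 wraps; the function names are made up):

```python
import io

import torch
from tensorboard.compat import tf  # stub falls back to local + S3 support without full TF


def save_checkpoint(checkpoint, path):
    # Serialize in memory, then write through gfile so local, s3://, gs://, etc.
    # paths are all handled by the same code path.
    buffer = io.BytesIO()
    torch.save(checkpoint, buffer)
    with tf.io.gfile.GFile(path, "wb") as f:
        f.write(buffer.getvalue())


def load_checkpoint(path, map_location=None):
    with tf.io.gfile.GFile(path, "rb") as f:
        buffer = io.BytesIO(f.read())
    return torch.load(buffer, map_location=map_location)
```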

@Laksh1997 (Author)

Hmmm, I agree @f4hy. Let's think of a unified way to do this long term.

I feel, however, that because S3 is an AWS service, the cleanest way to get S3 support would be to use boto3. One also gets a nice testing framework with it (moto).

@Borda @williamFalcon I've just redone the PR to work using boto3. Will be adding moto tests soon. Any comments, please let me know.

@Borda Borda changed the title [blocked by #2176] Add AWS S3 i/o Add AWS S3 i/o Jun 14, 2020
@yukw777 (Contributor) commented Jun 15, 2020

I mentioned this in #1532, but https://github.com/RaRe-Technologies/smart_open is really nice. It provides a file-like interface for all kinds of transports. I highly recommend using it for this PR!
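For illustration, a small sketch of what checkpoint I/O through smart_open could look like (not code from this PR; the function names are made up, and the bytes are buffered in memory to avoid relying on the stream being seekable):

```python
import io

import torch
from smart_open import open as smart_open


def save_checkpoint(checkpoint, path):
    # Serialize in memory first, then stream the bytes out through smart_open,
    # which handles local paths as well as s3://, gs://, hdfs://, etc.
    buffer = io.BytesIO()
    torch.save(checkpoint, buffer)
    with smart_open(path, "wb") as f:
        f.write(buffer.getvalue())


def load_checkpoint(path, map_location=None):
    with smart_open(path, "rb") as f:
        buffer = io.BytesIO(f.read())
    return torch.load(buffer, map_location=map_location)
```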

@mergify bot commented Jun 16, 2020

This pull request is now in conflict... :(

@f4hy (Contributor) commented Jun 16, 2020

> I mentioned this in #1532, but https://github.com/RaRe-Technologies/smart_open is really nice. It provides a file-like interface for all kinds of transports. I highly recommend using it for this PR!

How is this better than the same gfile interface provided by tensorboard? Again, see #2164, where we don't need to add a new dependency.

@yukw777 (Contributor) commented Jun 16, 2020

@f4hy smart_open supports a lot more protocols than the gfile interface, through a clean, unified interface with nifty extra features (not that we necessarily need them), so we wouldn't need to keep writing new code to support protocols beyond s3, gcs, and hdfs. I personally think the benefit of introducing this new dependency is worth the effort (I'm currently using it at work, and it's been working quite nicely for us), but ultimately, if the core team decides to go with another route, I'm fine with that too.

@@ -95,7 +96,7 @@ class ModelCheckpoint(Callback):

     def __init__(self, filepath: Optional[str] = None, monitor: str = 'val_loss', verbose: bool = False,
                  save_last: bool = False, save_top_k: int = 1, save_weights_only: bool = False,
-                 mode: str = 'auto', period: int = 1, prefix: str = ''):
+                 mode: str = 'auto', period: int = 1, prefix: str = '', remove_non_top_k_s3_files: bool = True):
Member

Here I would keep it synced, so I would drop remove_non_top_k_s3_files.

@Borda Borda added this to the 0.8.x milestone Jun 26, 2020
@mergify bot commented Jun 27, 2020

This pull request is now in conflict... :(

@Borda Borda changed the title Add AWS S3 i/o [blocked by #2164] Add AWS S3 i/o Jun 27, 2020
@Borda Borda modified the milestones: 0.8.x, 0.9.0 Aug 6, 2020
@awaelchli awaelchli modified the milestones: 0.9.0, 0.9.x Aug 8, 2020
@Borda Borda changed the title [blocked by #2164] Add AWS S3 i/o Add AWS S3 i/o Aug 9, 2020
@Borda (Member) commented Aug 9, 2020

@Laksh1997 can we finish this one? :]

@f4hy (Contributor) commented Aug 9, 2020

I believe #2164 resolved this as well. After the first part in #2894, I am able to train and have all outputs go to AWS S3.

@Borda (Member) commented Aug 9, 2020

cool, so let's close this one for now and reopen if needed :]

@Borda Borda closed this Aug 9, 2020
@Borda Borda modified the milestones: 0.9.x, 0.9.0 Aug 20, 2020
Labels
feature Is an improvement or enhancement
7 participants