Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Cloud implementation of shutil.copy and shutil.copytree #125

Merged
merged 12 commits into from
Apr 21, 2021

Conversation

genziano
Copy link
Contributor

@genziano genziano commented Feb 26, 2021

Closes #108 and closes #58.

Work in progress.

Three methods of CloudPath implemented:

  • upload_from, to upload a local file to a cloud file path;
  • copy, to copy a cloud file to a cloud file or directory path;
  • copytree, to copy a cloud directory path to another cloud directory path recursively.

Still to implement copy differently, which avoids download / upload when the client is the same.

(I will take care of the tests after we're set with the changes.)

@jayqi
Copy link
Member

jayqi commented Feb 27, 2021

A few thoughts:

  • I'm not convinced about the shutil-style functions. The multiple cases could either be considered as more convenient or more confusing. I'm leaning that it's more confusing.
    • For cloud-to-local, we already CloudPath.download_to which feels more explicit and clear to me.
    • For local-to-cloud, I think we should have an upload_from method which is the subject of API for creating a cloud file by uploading it #58
    • Local-to-local that just dispatches to shutil feels unnecessary
  • Then for the cloud-to-cloud, there are actually two cases, which Peter had mentioned in this comment:
    • We download the file and reupload it, like what you have.
    • If we're within the same cloud provider, we could copy directly without needing to download. I think the way you would check for that case is if the two CloudPath instances share the same client in self.client.

@genziano
Copy link
Contributor Author

A few thoughts:

  • I'm not convinced about the shutil-style functions. The multiple cases could either be considered as more convenient or more confusing. I'm leaning that it's more confusing.

    • For cloud-to-local, we already CloudPath.download_to which feels more explicit and clear to me.
    • For local-to-cloud, I think we should have an upload_from method which is the subject of API for creating a cloud file by uploading it #58
    • Local-to-local that just dispatches to shutil feels unnecessary
  • Then for the cloud-to-cloud, there are actually two cases, which Peter had mentioned in this comment:

    • We download the file and reupload it, like what you have.
    • If we're within the same cloud provider, we could copy directly without needing to download. I think the way you would check for that case is if the two CloudPath instances share the same client in self.client.

I will move the implementation as methods of CloudPath, also because I couldn't do what I thought because of circular imports.

For the cloud-to-cloud, I was planning indeed to implement differently the case with the same client. And about this, do you have a preferred approach of checking whether the client of src and dst is the same? Checking equality on object level might be too strict.

@jayqi
Copy link
Member

jayqi commented Feb 27, 2021

For the cloud-to-cloud, I was planning indeed to implement differently the case with the same client. And about this, do you have a preferred approach of checking whether the client of src and dst is the same? Checking equality on object level might be too strict.

I think actually that checking for the same instance is the right way to go, i.e., self.client is dst.client. cloudpathlib is implemented in a way that cloud path instances should share references to the same client instance unless explicitly created under a different one.

"""
if not src.is_file():
raise ValueError('src must be a file.')
if isinstance(src, CloudPath) & isinstance(dst, CloudPath):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One little note that all these should be and not &

@codecov
Copy link

codecov bot commented Mar 4, 2021

Codecov Report

Merging #125 (541a1b1) into master (a19902e) will decrease coverage by 1.9%.
The diff coverage is 11.5%.

@@           Coverage Diff            @@
##           master    #125     +/-   ##
========================================
- Coverage    91.8%   89.8%   -2.0%     
========================================
  Files          19      19             
  Lines        1050    1076     +26     
========================================
+ Hits          964     967      +3     
- Misses         86     109     +23     
Impacted Files Coverage Δ
cloudpathlib/cloudpath.py 84.8% <11.5%> (-5.4%) ⬇️

)
temp_file = self._local
temp_file.parent.mkdir(parents=True)
self.download_to(temp_file)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Instead of download to a temp_file, you can do self.fspath which will use the cached version if it is downloaded and download it to the cache if not. So the lines below can just use that. For example destination.upload_from(self.fspath)

raise ValueError(
f"Origin path {self} must be a directory. To copy a single file use the method copy."
)
if not destination.is_dir():
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This check might require that destination already exists, but I don't think that should be a requirement. Maybe use destination.exists() and destination.is_file() instead?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed. is_dir will return False if destination does not exist. And, shutil.copytree will create the destination directory, so we should mirror that behavior.

@pjbull
Copy link
Member

pjbull commented Mar 13, 2021

Hi @genziano, seems like this is heading in the right direction—excited to get it merged!

Anything that we can help with in terms of implementation questions or writing tests?

@genziano
Copy link
Contributor Author

Hi @genziano, seems like this is heading in the right direction—excited to get it merged!

Anything that we can help with in terms of implementation questions or writing tests?

Hi @pjbull! I have unfortunately been busy these last weeks, but next week I will work on finishing it up, and hopefully picking up some new issue.

Unrelated to this PR, I have a question: do you guys have some examples of usage with Azure blob storage? I'm not sure about how to create paths pointing to azure blob accounts. I'd also be happy to include some examples if needed!

@pjbull
Copy link
Member

pjbull commented Mar 16, 2021

do you guys have some examples of usage with Azure blob storage

Is something like this what you mean or also including the authentication/credential flow (which is in the docs)

from cloudpathlib import CloudPath, AzureBlobPath

# generic version starting with az:// is dispatched to AzureBlobPath
az_path = CloudPath("az://container/directory/file.txt")

# explicit version
az_path = AzureBlobPath("az://container/directory/file.txt")

@genziano
Copy link
Contributor Author

do you guys have some examples of usage with Azure blob storage

Is something like this what you mean or also including the authentication/credential flow (which is in the docs)

from cloudpathlib import CloudPath, AzureBlobPath

# generic version starting with az:// is dispatched to AzureBlobPath
az_path = CloudPath("az://container/directory/file.txt")

# explicit version
az_path = AzureBlobPath("az://container/directory/file.txt")

Yes thanks, exactly that. I didn't figure out that the account name was just part of the connection string.

@BenAsaf
Copy link

BenAsaf commented Apr 12, 2021

Hello!
Any news on this? I am looking forward to this PR :)
Can I assist in anyway?

@pjbull pjbull changed the base branch from master to 108-copy-upload April 21, 2021 05:43
@pjbull pjbull marked this pull request as ready for review April 21, 2021 05:44
@pjbull pjbull merged commit 26337a9 into drivendataorg:108-copy-upload Apr 21, 2021
pjbull added a commit that referenced this pull request May 29, 2021
* Cloud implementation of shutil.copy and shutil.copytree (#125)

* Draft of cloudpathlib.copy, that replicates shutil.copy

* Import only PathLike from os

* Reuse _local_to_cloud_copy for cached files

* try to use the intended cached _local version

* Update cloudshutil.py

* Implemented the CloudPath methods: upload_from, copy, copytree

* Add docstrings

* Fix indentation

* fix unnecessary imports

* Fix typing issues

* Allow CloudPath.copy to a directory as well

* Add remove_src option to enable copy

* update copy, copytree, upload_from

* Fixes for bugs from writing tests

* Tests for copy, copytree, upload

* Mypy checks

* Add sleeps to ensure write times in tests

* Update functions table

* extra sleep time for live tests

* Address code review comments

* Add chagelog updates for release

Co-authored-by: Alessandro Gentile <alemaudit@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Add copy and copytree methods API for creating a cloud file by uploading it
4 participants