Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

GH-38705: [C++][FS][Azure] Implement CopyFile() #39058

Merged
merged 3 commits into from
Dec 7, 2023

Conversation

kou
Copy link
Member

@kou kou commented Dec 4, 2023

Rationale for this change

CopyFile() copies the given source to the given destination. Both of source and destination must be blob name like other filesystem implementations.

What changes are included in this PR?

Use CopyFromUri() API that should use server-side copy.

Are these changes tested?

Yes.

Are there any user-facing changes?

No.

Copy link

github-actions bot commented Dec 4, 2023

⚠️ GitHub issue #38705 has been automatically assigned in GitHub to PR creator.

@kou
Copy link
Member Author

kou commented Dec 4, 2023

@Tom-Newton @felipecrv What do you think about this behavior? I think that this is convenient but this is different from other filesystem implementations...

If the given destination has only container (container) or has trailing slash (container/directory/), base name of the given source is automatically appended. For example:

  • Source: container1/directory1/path.txt
  • Destination: container2
  • Real destination: container2/path.txt (path.txt is only appended to container2)

This is the same behavior as cp command. But other filesystem implementations require path not directory as destination.

@Tom-Newton
Copy link
Contributor

This behaviour sounds fine to me.

I think there is a problem with the CopyFromUri() API though. I believe it requires using SAS token auth on the source, because the source URI needs to be self authenticated.

@Tom-Newton
Copy link
Contributor

I think there is a problem with the CopyFromUri() API though.

Nevermind, it looks like you've got a working implementation using CopyFromUri(). Maybe it does auth automatically if it detects that the source is in the same storage account. Or possibly the Azure SDK automatically switches between Copy Blob From URL and Copy Blob

@pitrou
Copy link
Member

pitrou commented Dec 4, 2023

I think that this is convenient but this is different from other filesystem implementations...

We should not diverge from the filesystem spec (as described in the FileSystem class docstrings ).
The end goal is for all implementations to expose a similar behavior.

Also, at some point the Azure fs implementation will have to implement and pass the generic filesystem tests, which do have a test for this situation.

@felipecrv
Copy link
Contributor

I think that this is convenient but this is different from other filesystem implementations...

We should not diverge from the filesystem spec (as described in the FileSystem class docstrings ). The end goal is for all implementations to expose a similar behavior.

Also, at some point the Azure fs implementation will have to implement and pass the generic filesystem tests, which do have a test for this situation.

I agree. I think we should fix #38772 before adding more operations. I implemented the GetFileInfo(select) with this as an assumption. CopyFile shouldn't implicitly create directories, so we need a way to check an empty directory exists.

@kou
Copy link
Member Author

kou commented Dec 5, 2023

Oh, sorry. I didn't read the docstring. I'll remove the behavior.

Also, at some point the Azure fs implementation will have to implement and pass the generic filesystem tests, which do have a test for this situation.

I've opened a new issue for it: #39069

@kou
Copy link
Member Author

kou commented Dec 5, 2023

I think there is a problem with the CopyFromUri() API though.

Nevermind, it looks like you've got a working implementation using CopyFromUri(). Maybe it does auth automatically if it detects that the source is in the same storage account. Or possibly the Azure SDK automatically switches between Copy Blob From URL and Copy Blob

I think that "Copy Blob From URL" and "Copy Blob" are the same API internally.
Both of them use the same URI and PUT:

And BlobClient provides only CopyFromUri(). (It doesn't provide Copy().)
So I used CopyFromUri().

@kou kou force-pushed the cpp-azurefs-copy-file branch from df57308 to f1fd538 Compare December 5, 2023 02:29
@kou
Copy link
Member Author

kou commented Dec 5, 2023

Updated:

  • Accept only blob path for destination like other implementations.

@kou
Copy link
Member Author

kou commented Dec 6, 2023

I'll merge this in a few days if nobody objects it.

Copy link
Contributor

@Tom-Newton Tom-Newton left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@Tom-Newton
Copy link
Contributor

Tom-Newton commented Dec 6, 2023

Personally I'm not particularly worried but to be rigorous @felipecrv's comment should probably be addressed.

CopyFile shouldn't implicitly create directories, so we need a way to check an empty directory exists.

@kou
Copy link
Member Author

kou commented Dec 7, 2023

OK. I've implemented it.

@kou kou merged commit 70ccf33 into apache:main Dec 7, 2023
34 checks passed
@kou kou deleted the cpp-azurefs-copy-file branch December 7, 2023 05:26
@kou kou removed the awaiting committer review Awaiting committer review label Dec 7, 2023
Copy link

After merging your PR, Conbench analyzed the 6 benchmarking runs that have been run so far on merge-commit 70ccf33.

There were 10 benchmark results indicating a performance regression:

The full Conbench report has more details. It also includes information about 3 possible false positives for unstable benchmarks that are known to sometimes produce them.

dgreiss pushed a commit to dgreiss/arrow that referenced this pull request Feb 19, 2024
### Rationale for this change

`CopyFile()` copies the given source to the given destination. Both of source and destination must be blob name like other filesystem implementations.

### What changes are included in this PR?

Use `CopyFromUri()` API that should use server-side copy.

### Are these changes tested?

Yes.

### Are there any user-facing changes?

No.
* Closes: apache#38705

Authored-by: Sutou Kouhei <kou@clear-code.com>
Signed-off-by: Sutou Kouhei <kou@clear-code.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[C++][FS][Azure] Implement CopyFile()
4 participants