Implement multipart copy and copying a particular version #308

Open
wants to merge 1 commit into
base: master
from

ararslan
Member

@ararslan ararslan commented Oct 4, 2024

I've had these changes locally for months (possibly a year or more?) but hadn't committed or pushed them. I don't know if/when I'll have the bandwidth to ensure this gets over the finish line, so if someone is interested in picking this up then please feel free to do so.

Summary of changes:

  • `s3_copy` now supports a `version` keyword argument that facilitates copying a specified version of an object.
  • A new function `s3_multipart_copy` to mirror `s3_multipart_upload` has been added, which calls `UploadPartCopy` in the API.
  • An explicit `cp(::S3Path, ::S3Path)` method has been implemented, which avoids the fallback `cp(::AbstractPath, ::AbstractPath)` method that reads the source file into memory before writing to the destination.
    • To avoid breaking the convenient but possibly unintended prior behavior of using different credentials for the source and destination paths, the fallback method is called when the source and destination credentials differ.
  • `cp(::S3Path, ::S3Path)` allows the user to opt into a multipart copy, in which case multipart is used when the source is larger than the specified part size (50 MiB by default). A multipart copy is unconditionally used when the source is at least 5 GiB. This behavior mimics that of the AWS CLI. Note that this now requires an additional API call to `HeadObject` in order to retrieve the source size.
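
The size-based decision described in the last bullet can be sketched roughly as follows. This is only an illustration of the rule as stated above, not the code in this PR: the function name `should_multipart` and its keyword names are hypothetical, and the source size is assumed to have already been retrieved (for example via `HeadObject`).

```julia
const MiB = Int64(1024)^2
const GiB = Int64(1024)^3

# Hypothetical helper (not part of the PR): decide whether a copy of
# `source_size` bytes should go through the multipart path.
function should_multipart(
    source_size::Integer; multipart::Bool=false, part_size_mb::Integer=50
)
    # Per the description above, sources of at least 5 GiB always use multipart.
    source_size >= 5 * GiB && return true
    # Otherwise multipart is opt-in and only applies when the source is
    # larger than a single part.
    return multipart && source_size > part_size_mb * MiB
end

should_multipart(100 * MiB)                  # false: the caller did not opt in
should_multipart(100 * MiB; multipart=true)  # true: opted in and larger than 50 MiB
should_multipart(5 * GiB)                    # true: at least 5 GiB
```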

Comment on lines 507 to 512
to_bucket,
to_path,
"$bucket/$path",
source,
Dict("headers" => headers);
aws_config=aws,
kwargs...,
Contributor

[JuliaFormatter] reported by reviewdog 🐶

Suggested change
to_bucket,
to_path,
"$bucket/$path",
source,
Dict("headers" => headers);
aws_config=aws,
kwargs...,
to_bucket, to_path, source, Dict("headers" => headers); aws_config=aws, kwargs...

@ararslan
Member Author

ararslan commented Oct 4, 2024

bors try

    args=Dict{String,Any}(),
    kwargs...,
)
    args["x-amz-copy-source-range"] = string(first(byte_range), '-', last(byte_range))
Member Author

Suggested change
- args["x-amz-copy-source-range"] = string(first(byte_range), '-', last(byte_range))
+ args["x-amz-copy-source-range"] = string("bytes=", first(byte_range), '-', last(byte_range))

https://docs.aws.amazon.com/AmazonS3/latest/API/API_UploadPartCopy.html#API_UploadPartCopy_RequestSyntax

Member Author

Suggested change
- args["x-amz-copy-source-range"] = string(first(byte_range), '-', last(byte_range))
+ headers = Dict(
+     "x-amz-copy-source-range" => string(
+         "bytes=", first(byte_range), '-', last(byte_range)
+     )
+ )
+ mergewith!(_merge, args, Dict("headers" => headers))

Otherwise it gets added as a query parameter rather than a header
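
A minimal sketch of the idea in the suggestion above, assuming a recursive merge so that an existing `"headers"` entry in `args` is extended rather than overwritten. The helper names `copy_source_range` and `merge_nested` are hypothetical; `merge_nested` is only a stand-in for the package's `_merge`, whose definition is not shown in this conversation.

```julia
# Hypothetical helper: format a 0-based, inclusive byte range as the value of
# the `x-amz-copy-source-range` header.
copy_source_range(r::AbstractUnitRange) = string("bytes=", first(r), '-', last(r))

# Hypothetical stand-in for `_merge`: merge nested dictionaries recursively,
# otherwise let the right-hand value win.
merge_nested(a, b) = b
merge_nested(a::AbstractDict, b::AbstractDict) = mergewith(merge_nested, a, b)

# Caller-supplied arguments that already carry a header.
args = Dict{String,Any}(
    "headers" => Dict{String,Any}("x-amz-copy-source" => "bucket/key")
)

headers = Dict{String,Any}("x-amz-copy-source-range" => copy_source_range(0:99))
mergewith!(merge_nested, args, Dict{String,Any}("headers" => headers))

# The range now lives under args["headers"], next to the existing header,
# and not at the top level of `args`.
```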

@ararslan
Member Author

Relevant: JuliaCloud/AWS.jl#695


upload = s3_begin_multipart_upload(aws, bucket, path)
tags = map(enumerate(0:part_size:file_size)) do (part, byte_offset)
    byte_range = byte_offset:min(byte_offset + part_size - 1, file_size)
Member Author

Suggested change
- byte_range = byte_offset:min(byte_offset + part_size - 1, file_size)
+ byte_range = byte_offset:(min(byte_offset + part_size, file_size) - 1)

Since it's 0-based, the last valid byte offset is `file_size - 1`, not `file_size`.
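
To illustrate the 0-based arithmetic, here is a small standalone sketch. The helper name `part_byte_ranges` is hypothetical and not part of the PR. It also assumes the part offsets should stop at `file_size - 1`, which differs from the `0:part_size:file_size` iteration in the diff: with that iteration, a `file_size` that is an exact multiple of `part_size` would yield one extra, empty trailing range.

```julia
# Hypothetical helper: 0-based, inclusive byte ranges that together cover
# `file_size` bytes in parts of at most `part_size` bytes.
function part_byte_ranges(file_size::Integer, part_size::Integer)
    # Offsets stop at file_size - 1 so that no empty trailing part is produced
    # when file_size is a multiple of part_size.
    return map(0:part_size:(file_size - 1)) do byte_offset
        byte_offset:(min(byte_offset + part_size, file_size) - 1)
    end
end

part_byte_ranges(11, 5)  # [0:4, 5:9, 10:10]
part_byte_ranges(10, 5)  # [0:4, 5:9]
```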

[multipart copy](https://docs.aws.amazon.com/AmazonS3/latest/userguide/mpuoverview.html).

# Optional Arguments
- `part_size_mb`: maximum size per uploaded part, in mebibytes (MiB).
Member Author

I wonder if it's worth exposing an option that allows matching the part size between the source and destination. IIUC, that should make the range-based accesses faster while copying. If a file is big enough for a multipart copy, it was probably uploaded with a multipart upload, in which case the parts and their sizes can be obtained with `S3.get_object_attributes`. Lacking that permission, one can also get the part size with `S3.head_object` by passing `Dict("partNumber" => 1)` as a query parameter, and the number of parts will be in the entity tag of the source object.
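
As a rough sketch of the last point, the part count could be read from the entity tag like this, assuming the conventional `<hex>-<N>` form that S3 uses for objects created by a multipart upload. The helper name `etag_part_count` is hypothetical, and no API call is made here; the ETag string is taken as given.

```julia
# Hypothetical helper: extract the number of parts from a multipart-style ETag
# such as "\"9b2cf535f27731c974343645a3985328-12\"". Returns `nothing` when the
# ETag has no "-N" suffix (i.e. it does not look like a multipart upload).
function etag_part_count(etag::AbstractString)
    # ETags are typically returned wrapped in double quotes; drop them first.
    m = match(r"-(\d+)$", strip(etag, '"'))
    return m === nothing ? nothing : parse(Int, m.captures[1])
end

etag_part_count("\"9b2cf535f27731c974343645a3985328-12\"")  # 12
etag_part_count("\"9b2cf535f27731c974343645a3985328\"")     # nothing
```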
