Implement multipart copy and copying a particular version #308
base: master
Conversation
Summary of changes:
- `s3_copy` now supports a `version` keyword argument that facilitates copying a specified version of an object.
- A new function `s3_multipart_copy`, mirroring `s3_multipart_upload`, has been added; it calls `UploadPartCopy` in the API.
- An explicit `cp(::S3Path, ::S3Path)` method has been implemented, which avoids the fallback `cp(::AbstractPath, ::AbstractPath)` method that reads the source file into memory before writing to the destination.
- `cp(::S3Path, ::S3Path)` allows the user to opt into a multipart copy, in which case multipart is used when the source is larger than the specified part size (50 MiB by default). A multipart copy is unconditionally used when the source is at least 5 GiB. This behavior mimics that of the AWS CLI. Note that this now requires an additional API call to `HeadObject` to retrieve the source size.
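The copy-strategy rules described above can be sketched as a small predicate. This is an illustrative sketch, not the package's actual implementation; the helper name `use_multipart_copy` is hypothetical. The 5 GiB ceiling is real: AWS rejects single-operation `CopyObject` requests for objects at or above that size.

```julia
const MiB = 1024^2
const GiB = 1024^3

# Decide whether to perform a multipart copy, mirroring the behavior
# described in the PR summary (and the AWS CLI):
#   - at or above 5 GiB, multipart is mandatory;
#   - otherwise it is used only if the caller opted in AND the source
#     exceeds the configured part size (50 MiB by default).
function use_multipart_copy(source_size::Integer;
                            multipart::Bool=false,
                            part_size::Integer=50 * MiB)
    source_size >= 5 * GiB && return true
    return multipart && source_size > part_size
end
```

Note that applying this rule requires knowing `source_size` up front, which is why the new `cp` method issues an extra `HeadObject` call.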
```julia
to_bucket,
to_path,
"$bucket/$path",
source,
Dict("headers" => headers);
aws_config=aws,
kwargs...,
```
[JuliaFormatter] reported by reviewdog 🐶

Suggested change:
```julia
to_bucket, to_path, source, Dict("headers" => headers); aws_config=aws, kwargs...
```
bors try
```julia
args=Dict{String,Any}(),
kwargs...,
)
args["x-amz-copy-source-range"] = string(first(byte_range), '-', last(byte_range))
```
Suggested change:
```julia
args["x-amz-copy-source-range"] = string("bytes=", first(byte_range), '-', last(byte_range))
```
Suggested change:
```julia
headers = Dict(
    "x-amz-copy-source-range" => string(
        "bytes=", first(byte_range), '-', last(byte_range)
    )
)
mergewith!(_merge, args, Dict("headers" => headers))
```
Otherwise it gets added as a query parameter rather than a header.
Relevant: JuliaCloud/AWS.jl#695
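For reference, `x-amz-copy-source-range` uses the standard HTTP byte-range syntax `bytes=first-last`, where both offsets are 0-based and inclusive. A minimal sketch of building that header value (the helper name `copy_source_range` is illustrative, not part of the package):

```julia
# Format a 0-based, inclusive byte range as an HTTP range spec,
# e.g. 0:5242879 -> "bytes=0-5242879" (the first 5 MiB of an object).
copy_source_range(byte_range::AbstractUnitRange) =
    string("bytes=", first(byte_range), '-', last(byte_range))
```

Without the `bytes=` prefix the value is not a valid range specifier, which is why the suggestion above adds it.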
```julia
upload = s3_begin_multipart_upload(aws, bucket, path)
tags = map(enumerate(0:part_size:file_size)) do (part, byte_offset)
    byte_range = byte_offset:min(byte_offset + part_size - 1, file_size)
```
Suggested change:
```julia
byte_range = byte_offset:(min(byte_offset + part_size, file_size) - 1)
```
Since the offsets are 0-based, the last valid byte is at offset `file_size - 1`, so the final part must be clipped to end there.
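The corrected range computation can be checked in isolation. This sketch is illustrative (the helper `part_ranges` is not part of the package, and the iteration bound is ended at `file_size - 1` here so that an exact multiple of `part_size` does not produce an empty trailing range):

```julia
# Partition a 0-based byte space [0, file_size) into inclusive ranges
# of at most `part_size` bytes each, using the corrected end offset
# `min(byte_offset + part_size, file_size) - 1` from the suggestion.
function part_ranges(file_size::Integer, part_size::Integer)
    return [byte_offset:(min(byte_offset + part_size, file_size) - 1)
            for byte_offset in 0:part_size:(file_size - 1)]
end
```

With the original expression `min(byte_offset + part_size - 1, file_size)`, the last range would have ended at `file_size` itself, one byte past the end of the object.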
```
[multipart copy](https://docs.aws.amazon.com/AmazonS3/latest/userguide/mpuoverview.html).

# Optional Arguments
- `part_size_mb`: maximum size per uploaded part, in mebibytes (MiB).
```
I wonder if it's worth exposing an option that allows matching the part size between the source and destination. IIUC, that should make the range-based accesses faster while copying. If a file is big enough for a multipart copy, it was probably uploaded with a multipart upload, in which case the parts and their sizes can be obtained with `S3.get_object_attributes`. Lacking that permission, one can also get the part size with `S3.head_object` by passing `Dict("partNumber" => 1)` as a query parameter, and the number of parts will be in the entity tag of the source object.
I've had these changes locally for months (possibly a year or more?) but hadn't committed or pushed them. I don't know if/when I'll have the bandwidth to ensure this gets over the finish line, so if someone is interested in picking this up then please feel free to do so.