-
Notifications
You must be signed in to change notification settings - Fork 8
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add a MultipartUploadStream IO object #46
Changes from all commits
e6173ab
57dfc36
4eabf65
57c2d16
b711349
2273210
1f16ffc
174464c
1d7ef83
7d7ef7b
5f903f1
2ed517d
a9232f4
cdd1d23
86f8ece
456c32c
b5a7ad5
97c9ca6
737c408
b2cf37d
0458ef5
6072e72
e6ba825
0214836
6439a3e
8910fb1
ac71f14
f72716d
0790f79
c18952b
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -372,3 +372,209 @@ function Base.read(io::PrefetchedDownloadStream, ::Type{UInt8}) | |
end | ||
return b | ||
end | ||
|
||
function _upload_task(io; kw...) | ||
try | ||
(part_n, upload_buffer) = take!(io.upload_queue) | ||
# upload the part | ||
parteTag, wb = uploadPart(io.store, io.url, upload_buffer, part_n, io.uploadState; io.credentials, kw...) | ||
Base.release(io.sem) | ||
# add part eTag to our collection of eTags | ||
Base.@lock io.cond_wait begin | ||
if length(io.eTags) < part_n | ||
resize!(io.eTags, part_n) | ||
end | ||
io.eTags[part_n] = parteTag | ||
io.ntasks -= 1 | ||
notify(io.cond_wait) | ||
end | ||
catch e | ||
isopen(io.upload_queue) && close(io.upload_queue, e) | ||
Base.@lock io.cond_wait begin | ||
io.exc = e | ||
notify(io.cond_wait, e, all=true, error=true) | ||
end | ||
end | ||
return nothing | ||
end | ||
|
||
""" | ||
This is an *experimental* API. | ||
|
||
MultipartUploadStream <: IO | ||
MultipartUploadStream(args...; kwargs...) -> MultipartUploadStream | ||
|
||
An in-memory IO stream that uploads chunks to a URL in blob storage. | ||
|
||
For every data chunk we call write(io, data;) to write it to a channel. We spawn one task | ||
per chunk to read data from this channel and uploads it as a distinct part to blob storage | ||
to the same remote object. | ||
We expect the chunks to be written in order. | ||
For cases where there is no need to upload data in parts or the data size is too small, `put` | ||
can be used instead. | ||
|
||
# Arguments | ||
* `store::AbstractStore`: The S3 Bucket / Azure Container object | ||
* `key::String`: S3 key / Azure blob resource name | ||
|
||
# Keywords | ||
* `credentials::Union{CloudCredentials, Nothing}=nothing`: Credentials object used in HTTP | ||
requests | ||
* `concurrent_writes_to_channel::Int=(4 * Threads.nthreads())`: represents the max number of | ||
chunks in flight. Defaults to 4 times the number of threads. We use this value to | ||
initialize a semaphore to perform throttling in case the writing to the channel is much | ||
faster to uploading to blob storage, i.e. `write` will block as a result of this limit | ||
being reached. | ||
* `kwargs...`: HTTP keyword arguments are forwarded to underlying HTTP requests, | ||
|
||
## Examples | ||
``` | ||
# Get an IO stream for a remote CSV file `test.csv` living in your S3 bucket | ||
io = MultipartUploadStream(bucket, "test.csv"; credentials) | ||
|
||
# Write a chunk of data (Vector{UInt8}) to the stream | ||
write(io, data;) | ||
|
||
# Wait for all chunks to be uploaded | ||
wait(io) | ||
|
||
# Close the stream | ||
close(io; credentials) | ||
|
||
# Alternative syntax that encapsulates all these steps | ||
MultipartUploadStream(bucket, "test.csv"; credentials) do io | ||
write(io, data;) | ||
end | ||
``` | ||
|
||
## Note on upload size | ||
``` | ||
Some cloud storage providers might have a lower limit on the size of the uploaded object. | ||
For example it seems that S3 requires at minimum an upload of 5MB: | ||
https://github.com/minio/minio/issues/11076. | ||
We haven't found a similar setting for Azure. | ||
For such cases where the size of the data is too small, one can use `put` | ||
``` | ||
""" | ||
mutable struct MultipartUploadStream{T <: AbstractStore} <: IO | ||
store::T | ||
url::String | ||
credentials::Union{Nothing, AWS.Credentials, Azure.Credentials} | ||
uploadState::Union{String, Nothing} | ||
eTags::Vector{String} | ||
upload_queue::Channel{Tuple{Int, Vector{UInt8}}} | ||
cond_wait::Threads.Condition | ||
cur_part_id::Int | ||
ntasks::Int | ||
exc::Union{Exception, Nothing} | ||
sem::Base.Semaphore | ||
|
||
function MultipartUploadStream( | ||
Drvi marked this conversation as resolved.
Show resolved
Hide resolved
|
||
store::AWS.Bucket, | ||
key::String; | ||
credentials::Union{Nothing, AWS.Credentials}=nothing, | ||
concurrent_writes_to_channel::Int=(4 * Threads.nthreads()), | ||
kw... | ||
) | ||
url = makeURL(store, key) | ||
uploadState = API.startMultipartUpload(store, key; credentials, kw...) | ||
io = new{AWS.Bucket}( | ||
store, | ||
url, | ||
credentials, | ||
uploadState, | ||
Vector{String}(), | ||
Channel{Tuple{Int, Vector{UInt8}}}(Inf), | ||
Threads.Condition(), | ||
0, | ||
0, | ||
nothing, | ||
Base.Semaphore(concurrent_writes_to_channel) | ||
) | ||
return io | ||
end | ||
|
||
function MultipartUploadStream( | ||
store::Azure.Container, | ||
key::String; | ||
credentials::Union{Nothing, Azure.Credentials}=nothing, | ||
concurrent_writes_to_channel::Int=(4 * Threads.nthreads()), | ||
kw... | ||
) | ||
url = makeURL(store, key) | ||
uploadState = API.startMultipartUpload(store, key; credentials, kw...) | ||
io = new{Azure.Container}( | ||
store, | ||
url, | ||
credentials, | ||
uploadState, | ||
Vector{String}(), | ||
Channel{Tuple{Int, Vector{UInt8}}}(Inf), | ||
Threads.Condition(), | ||
0, | ||
0, | ||
nothing, | ||
Base.Semaphore(concurrent_writes_to_channel) | ||
) | ||
return io | ||
end | ||
|
||
# Alternative syntax that applies the function `f` to the result of | ||
# `MultipartUploadStream(args...; kwargs...)`, waits for all parts to be uploaded and | ||
# and closes the stream. | ||
function MultipartUploadStream(f::Function, args...; kw...) | ||
io = MultipartUploadStream(args...; kw...) | ||
try | ||
f(io) | ||
wait(io) | ||
close(io; kw...) | ||
catch e | ||
# todo, we need a function here to signal abort to S3/Blobs. We don't have that | ||
# yet in CloudStore.jl | ||
rethrow() | ||
end | ||
end | ||
end | ||
|
||
# Writes a data chunk to the channel and spawn | ||
function Base.write(io::MultipartUploadStream, bytes::Vector{UInt8}; kw...) | ||
local part_n | ||
Base.@lock io.cond_wait begin | ||
io.ntasks += 1 | ||
io.cur_part_id += 1 | ||
part_n = io.cur_part_id | ||
notify(io.cond_wait) | ||
end | ||
Base.acquire(io.sem) | ||
Drvi marked this conversation as resolved.
Show resolved
Hide resolved
|
||
# We expect the data chunks to be written in order in the channel. | ||
put!(io.upload_queue, (part_n, bytes)) | ||
Threads.@spawn _upload_task($io; $(kw)...) | ||
return nothing | ||
end | ||
|
||
# Waits for all parts to be uploaded | ||
function Base.wait(io::MultipartUploadStream) | ||
try | ||
Base.@lock io.cond_wait begin | ||
while true | ||
!isnothing(io.exc) && throw(io.exc) | ||
io.ntasks == 0 && break | ||
wait(io.cond_wait) | ||
end | ||
end | ||
catch e | ||
rethrow() | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I think we are missing a way to signal to S3/Blobs that we're aborting the multi-part upload a la AbortMultipartUpload... can you open an issue about it? Also, I think we should consider the following syntax to the user MultipartUploadStream(...) do io
...
end something like function MultipartUploadStream(f::Function, args...; kwargs...)
io = MultipartUploadStream(args...; kwargs...)
try
f(io)
wait(io)
close(io)
catch e
abort(io, e) # todo, we don't have this yet
rethrow()
end
end There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. There is already an issue for adding more low-level functionality regarding multipart uploads including list parts and and list parts that are in progress, which are related to aborting: #3. Should I add a comment there about abort? There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. For the alternative syntax, There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. ah, sorry, I didn't realize there was an issue already
Yes, similar to what one would do with a local file open("file") do io
write(io, "my stuff")
end |
||
end | ||
end | ||
|
||
# When there are no more data chunks to upload, this function closes the channel and sends | ||
# a POST request with a single id for the entire upload. | ||
function Base.close(io::MultipartUploadStream; kw...) | ||
try | ||
close(io.upload_queue) | ||
return API.completeMultipartUpload(io.store, io.url, io.eTags, io.uploadState; kw...) | ||
catch e | ||
io.exc = e | ||
rethrow() | ||
end | ||
end |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@Drvi Could you help me with adding a test case that triggers an error during uploading and covers these lines? I tried with uploading a part that was smaller than the minimum S3 size and got the following error, but it didn't exercise this catch block. 🤔
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think you had a good idea, try writing a small file and then testing, that the channel is closed and that the exp field is populated.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I tried that, but when the errors happens I'm getting
Exiting on signal: TERMINATED
and the test terminates before I get to check the channel and the exception.