Adopting byte range PATCH in resumable upload #2501
Comments
For reference, here is @awwright's presentation from IETF 116: https://datatracker.ietf.org/meeting/116/materials/slides-116-httpapi-byte-range-patch |
I am open to the idea of using a byte range instead of the Upload-Offset header field for signaling the current upload progress. However, I am not sure if we would want to support the full byte range syntax. Our current draft specifies append-only uploads, but byte range allows for non-sequential writes and even overwriting previous data. Some upload servers might be interested in implementing this full functionality, but many applications do not need more than an append-only upload. Implementing the full byte range syntax would be just more effort. So, if we say a server must support append operations and can optionally support non-sequential writes and overwrites, I can also live with the byte range syntax. That might open the door for more sophisticated upload schemes in the future. |
+1. Assuming that we'd switch to using byte-range PATCH for appending the upload, are we also going to switch to using |
I am not sure if Content-Length would be appropriate then. If clients can use byte range PATCH for uploading, they might upload non-contiguous chunks (e.g. two ranges: bytes 0-100 and bytes 200-300). These ranges should then also be correctly present in the response for HEAD requests. But Content-Length cannot capture these ranges, so we would need a response header similar to the Range header field, which can capture multiple ranges: https://httpwg.org/specs/rfc9110.html#field.range Of course, if the server only allows appending, the returned ranges will always be contiguous and start at 0. |
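As a rough illustration of the point above, here is a minimal Python sketch of how a server could track received ranges and decide whether a single offset is enough to describe them. The function names `coalesce` and `single_offset` are hypothetical and not part of any draft; ranges are inclusive (first, last) pairs, as in RFC 9110.

```python
def coalesce(ranges):
    """Merge inclusive (first, last) byte ranges that overlap or touch."""
    merged = []
    for first, last in sorted(ranges):
        if merged and first <= merged[-1][1] + 1:
            # Overlaps or directly follows the previous range: extend it.
            merged[-1] = (merged[-1][0], max(merged[-1][1], last))
        else:
            merged.append((first, last))
    return merged


def single_offset(ranges):
    """Return the next expected offset if the data is one run starting at byte 0, else None."""
    merged = coalesce(ranges)
    if not merged:
        return 0
    if len(merged) == 1 and merged[0][0] == 0:
        return merged[0][1] + 1
    return None
```

With the two ranges from the comment (0-100 and 200-300), `single_offset` returns `None`, which is the case a plain length or offset value cannot express. For an append-only server the result is always an integer.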
Yes, when using a byte range PATCH, this is acceptable; a server may choose to accept only writes that append; or only the writes that overwrite. You may also require that the data be contiguous, and reject uploads that contain multiple parts.
Clients should not attempt this, and servers that receive this ought to return an error, so in the event of a network problem, clients may re-synchronize their state with the server. For example, suppose the client uploads bytes 0-99, and suppose the client never receives acknowledgement. The client could retry 0-99, or go ahead and try 100-199. (Or, the client could re-synchronize and find out the last received data ended at byte 55.) Regardless of how the client handles it, the server should error if the upload would repeat data that has already been uploaded but not acknowledged, or if this would skip over data that was sent, but lost on the way to the server.
Byte range PATCH discusses "sparse documents", however I don't think this makes sense in resumable uploads. The server should simply reject requests that would require support for sparse or non-continuous data. |
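The append-only rule described in this comment can be sketched in a few lines of Python. This is only an assumption about how a server might enforce it: `check_append` and `OffsetMismatch` are hypothetical names, and the status code a real server would return is left open.

```python
class OffsetMismatch(Exception):
    """Raised when a write does not start at the server's current offset."""


def check_append(server_offset, first_byte):
    """Accept a write only if it starts exactly where the stored data ends."""
    if first_byte < server_offset:
        # The write would repeat data the server already holds.
        raise OffsetMismatch(
            f"bytes {first_byte}-{server_offset - 1} were already received")
    if first_byte > server_offset:
        # The write would skip data that never arrived (a sparse document).
        raise OffsetMismatch(
            f"bytes {server_offset}-{first_byte - 1} are missing")
    return True
```

In the scenario from the comment, where the server stopped receiving at byte 55, both a retry from 0 and a jump to 100 raise `OffsetMismatch`, which prompts the client to re-synchronize; only a write starting at 55 is accepted.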
I think We can take this opportunity to reduce the number of new headers introduced in this draft. Servers are only required to support appending. Any other types of operations would be optional, pending agreement between clients and servers or a negotiation mechanism not defined in this draft. |
The
This indicates the upload is providing the last byte of a file that's 300 bytes long, and so, this must be the final upload. (Since the final byte of a 300 byte large file will be byte number 299.) |
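The "last byte is length minus one" reasoning can be checked mechanically. The header shown in the comment did not survive in this copy of the thread, so the sample value `bytes 200-299/300` below is an assumption that matches the description. The sketch parses the `bytes first-last/complete` form of Content-Range from RFC 9110; `parse_content_range` and `is_final_write` are hypothetical helper names.

```python
import re

# Content-Range with a known range: "bytes first-last/complete" or "bytes first-last/*".
_CONTENT_RANGE = re.compile(r"bytes (\d+)-(\d+)/(\d+|\*)")


def parse_content_range(value):
    """Parse the value into (first, last, complete), with complete=None for '*'."""
    m = _CONTENT_RANGE.fullmatch(value.strip())
    if m is None:
        raise ValueError(f"unsupported Content-Range: {value!r}")
    first, last = int(m.group(1)), int(m.group(2))
    complete = None if m.group(3) == "*" else int(m.group(3))
    if last < first or (complete is not None and last >= complete):
        raise ValueError(f"invalid Content-Range: {value!r}")
    return first, last, complete


def is_final_write(value):
    """True when the range ends on the last byte of a known complete length."""
    _first, last, complete = parse_content_range(value)
    return complete is not None and last == complete - 1
```

For example, `is_final_write("bytes 200-299/300")` is true because byte 299 is the last byte of a 300-byte file, while `bytes 0-99/300` and `bytes 0-99/*` are not final.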
That requires the length to be known when we start sending the request. If we are stream-compressing a file, for example, we do not know if the body will end in 10 bytes or in 100MB. |
@guoye-zhang If the final length is unknown then you would use the indeterminate length form, for example:
This would indicate an upload of the first 100 bytes, and some unknown amount of bytes is to follow. |
I disagree on that point. I think we should either go full byte range mode or just stick to the current offset. That is, either we use byte ranges for both the PATCH and HEAD requests, even if our draft only defines append-only modification, or we stick to the method of having an offset that is acknowledged by the client in the PATCH request. Mixing these approaches across the PATCH and HEAD requests would be confusing and inconsistent. In both cases, I do not mind if we reuse existing headers or have to introduce new ones. Reusing fields is always great, of course. |
@awwright How do we say the length is unknown but the upload is complete? |
In what situations would the upload be complete but the client or server do not know its length? Both should be able to keep track of the upload's length until it is complete. |
The example is that we are stream-compressing a file. We don't know how big the compressed file will be, but we need to tell the server that the end of the request body is the end of the upload. |
Yeah, right. At the beginning of the request we do not know its length. That is a good point. We could circumvent this by leaving the length undeclared in the first request and then using a second request with an empty body to effectively declare the length. That is similar to how we handle it in tus right now. |
@guoye-zhang If the upload is complete, then you'll know the size of the upload, and you can specify it at that time. For example, here are three requests uploading a 300 byte file:
|
@awwright I think @guoye-zhang is talking about something else: Assume we want to upload a stream in a single request but we do not know the full size of the stream at the beginning of the request (e.g. because the stream is the result of a compression). What range would we use in this case where we do not know the request body size upfront? We would need something like
but even that does not convey the information that the upload will be complete if the entire body has been received. Currently, we convey that information using the |
This is the syntax for requesting a range in the response, would that be used in resumable uploads? Or are you looking for a syntax like In any event, if you want to begin uploading a "last piece" of indeterminate length, this poses a logical contradiction: If you make the "final" upload, but then that last piece is interrupted, how do you resume it? You shouldn't be able to mark an upload as finished, until the last byte is actually received by the server. The most straightforward way to do this is to tell the server how long the final length will be. |
One of the main goals of the current draft is that we can transparently upgrade a regular upload into a resumable upload. The We can resume an interrupted upload by sending a HEAD request asking for an offset. I don't think there is a difference whether it's the last piece or not. In automatic upgrade mode, we are always sending the last piece but we don't always know the final length until the upload body ends. |
An upload with a header like What I was wondering is what happens if a continuation is interrupted... if I resume the upload, how does the server know that the end of my stream is the conclusion of my upload, and not another network interruption? In HTTP/1.1 you can typically detect the end of the message, and distinguish it from an interruption. But in some cases (I think limited to HTTP/1.0, hopefully rare nowadays), a network error is indistinguishable from the true end of the message. I don't see any language warning about this possibility. But it sounds like this is a separate concern that should be looked at separately. |
The discussion was about the necessity of I'm not aware of HTTP/1.0 interruption issues. HTTP/1.0 supports chunked encoding and can properly end the request with an empty chunk. Even if such an issue exists, I don't think it's within the scope of this protocol to address since we don't want to reinvent a way to identify the end of a message because we don't trust HTTP to do so. We can always say that this protocol can't be used with HTTP/1.0. |
FWIW use of chunked encoding with HTTP/1.0 is forbidden; see https://httpwg.org/specs/rfc9112.html#field.transfer-encoding and the related discussion in httpwg/http-core#879. That said, I do not think the problem of handling of chunked encoding in HTTP/1.0 has an effect on our outcome here, because Resumable Upload relies on an informational response. I do not think we'd want to send informational responses to arbitrary HTTP/1.x clients, because doing so is known to cause interoperability issues. Back to the topic, I tend to believe that the necessity of having I see others pointing out that at the moment a client tries to upload the final chunk by generating the I think that is a valid argument and therefore my +1 to retaining |
That is not entirely true. If the client is not able to receive informational responses, it can also perform an empty Upload Creation Procedure without any body. The (normal) response will then also contain the Location header with the upload URL, just as the informational response would. Then, the client can start sending PATCH requests to this endpoint without relying on informational responses.
I agree on this point. |
I just published draft-wright-http-patch-byterange-02 which ought to address some of the considerations here, particularly, specifying how to make writes of indeterminate length. I believe this has everything required to replace the Upload-Offset header. For example, the following patch represents data of an unknown length, but starting at a 200 byte offset:
I would suggest saying that servers MUST support this indeterminate length form, and MAY support other forms. |
@awwright I was currently looking into your latest draft and was wondering if it is already adopted by any working group or you intend to bring it any WG? |
@Acconut I believe a Call for Adoption with the HTTP APIs WG is on the IETF 117 schedule. |
I just read through -03, in which a
|
Guoye mentioned we could use
|
I would think that if you're making a HEAD request, you would see the number of bytes received in a
This ought to be possible if you use the unsatisfied-range form, to set the size of the target resource without writing any data. I will add this to the draft once I-D submissions open up (or see the changes here), since I realized there was no way to set the total length of the target document, which is a common filesystem operation that should be supported. A typical use of this would be to truncate the document (e.g. In resumable uploads, you would use this to send a "trailer" that indicates the size of the upload once it becomes known, indicating that the upload is complete. For example,
PATCH /transfers/1 HTTP/1.1
Content-Type: multipart/byteranges; boundary=PART

--PART
Content-Range: bytes 200-/*

01234
--PART
Content-Range: bytes */205

--PART--
(Note in HTTP/1.1 this would be sent with In the first part, the client is sending the remainder of the upload starting at the 200th byte; it continues to byte 204, which is the last byte of the entire request. To signal that it is the last byte, the client sends one final part, saying the complete length of the document is 205 bytes, i.e. the 204th byte is the last, and is therefore complete (as opposed to a larger number, which would indicate the upload has not finished.)
Content-Range is used inside one of the media types, like message/byterange. However, for resumable uploads, most clients will probably want to use the application/byteranges media type, which is binary, and somewhat easier to parse. Each part in that patch will contain a Content-Range field. |
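To make the "trailer" idea concrete, here is a small Python sketch of how a server might interpret the two part forms shown in the example above: `bytes 200-/*` (data of indeterminate length) and `bytes */205` (length only, no data). It assumes the multipart framing has already been parsed into `(content_range, body)` pairs, handles only these two forms, and uses the hypothetical name `apply_parts`; the draft itself may define further forms or different rules.

```python
import re

_INDETERMINATE = re.compile(r"bytes (\d+)-/\*")  # e.g. "bytes 200-/*"
_LENGTH_ONLY = re.compile(r"bytes \*/(\d+)")     # e.g. "bytes */205"


def apply_parts(stored, parts):
    """Apply (content_range, body) parts to bytearray `stored`; return True if complete."""
    complete = False
    for content_range, body in parts:
        m = _INDETERMINATE.fullmatch(content_range)
        if m:
            # Append-only: the part must start where the stored data ends.
            if int(m.group(1)) != len(stored):
                raise ValueError("write does not start at the current offset")
            stored.extend(body)
            continue
        m = _LENGTH_ONLY.fullmatch(content_range)
        if m:
            if body:
                raise ValueError("length-only part must not carry data")
            # Complete only if every byte up to the declared length has arrived.
            complete = int(m.group(1)) == len(stored)
            continue
        raise ValueError(f"unsupported Content-Range: {content_range!r}")
    return complete
```

Applied to a resource that already holds 200 bytes, the parts from the example (`01234` at offset 200, then a declared length of 205) yield `True`. If the data part was cut short, or the length part never arrived, the result is `False` and the upload stays open.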
We want to avoid the overhead of multipart framing, especially since the request might already be multipart itself; a nested multipart body would be a serious test of some parser implementations. Upload creation is also a concern. We need it to be fully compatible with a regular upload when Ultimately, we are just adopting ranged PATCH to perform mutations on the resource, but whether or not this is the last mutation doesn't need to be inferred from the mutation we are performing. If the client and the server agree, you can even use this protocol to perform live editing of a document and save it in the end by setting |
There is no reason this should pose a problem. There is no overhead with the binary And nested multipart responses do not pose a problem to the multipart format; the contents of a part body are completely opaque, and can be anything, including another multipart document.
If this is so, then I'm struggling to figure out what On the other hand, if the server can detect that the upload was interrupted or cleanly finished, even in indeterminate length uploads, why is this flag necessary at all? What does the error handling look like if a continued upload is itself interrupted? |
I was talking about the
That is true in theory, and request smuggling is also impossible in theory. But parsers are not perfect in reality so we would want to defend against it by not putting ourselves in such a situation.
We either have Content-Length header or chunked encoding / framing to indicate the end of the request body. These are existing mechanisms in HTTP today used by regular uploads / downloads.
This flag is for an advanced use case where you need to chunk uploads to a certain limit, e.g. to get around the CDN's 100MB request body size limit. As implemented today on Apple platforms, the flag is always set to complete, and we don't support chunked uploads, which is allowed by the draft.
There is no difference (in the draft today) between the interruption of the creation procedure and appending procedure. You query the offset and continue appending. |
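The "query the offset and continue appending" loop can be simulated without a network. The Python sketch below uses an in-memory stand-in for the upload resource; `UploadResource`, `head`, `append`, and the `fail_after` knob are all hypothetical and only model the behaviour described in this thread, not any particular server.

```python
class UploadResource:
    """In-memory stand-in for a server-side upload resource."""

    def __init__(self):
        self.data = bytearray()
        self.complete = False

    def head(self):
        """Stand-in for the HEAD request: report the current offset."""
        return len(self.data)

    def append(self, offset, body, complete, fail_after=None):
        """Stand-in for an append; `fail_after` simulates a drop after that many bytes."""
        if offset != len(self.data):
            raise ValueError("offset mismatch")
        if fail_after is not None and fail_after < len(body):
            # Only part of the body arrives, then the connection is lost.
            self.data.extend(body[:fail_after])
            raise ConnectionError("interrupted")
        self.data.extend(body)
        self.complete = complete


def upload(resource, payload, failures=()):
    """Send `payload`, re-querying the offset after every interruption."""
    failures = list(failures)
    while not resource.complete:
        offset = resource.head()
        fail_after = failures.pop(0) if failures else None
        try:
            resource.append(offset, payload[offset:], complete=True,
                            fail_after=fail_after)
        except ConnectionError:
            continue  # same recovery path for creation and for appends
    return bytes(resource.data)
```

The client code is identical whether the first request or a later append was interrupted, which is the property the comment points out.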
Yes, but I only wrote the example using
This argument could be made of any parser, including resumable uploads itself; I'm not sure why Also note, the PATCH method must use some media type, otherwise the server has no way to know what the body of the PATCH payload means. I have no way to implement a PATCH method in an HTTP server without also defining a media type and its behavior. The only alternative I know of is to create a new method (perhaps call it APPEND).
So then, what about clients who have no way of knowing this ahead of time? If you start uploading a request, and then halfway through you realize it's going to be 200M and you need to split it up, you can't do this if you've already specified in the headers "upload complete". I think it makes the whole request more complicated if you have to declare this in advance. You have to process the end of the request differently depending on the value for Upload-Incomplete (or Upload-Complete?). There are at least two options to avoid this: use an HTTP trailer, or Or put it another way, the Upload-Incomplete header is a work-around for the fact that the end of the resumed upload is tied to the end of the original upload, and this inflexibility results in added complexity elsewhere. Instead, the end of the original request body should be signaled separately from the end of resumed upload request. |
OK, I'll look into it.
We are proposing the minimum set of features to support a resumable upload. You have to define an offset and whether it is the end of the upload; the length of the upload is already provided by HTTP. Arguably even signaling the upload completion is unnecessary, but chunked upload is a commonly requested and used feature, and people will just reinvent it poorly if we don't define it. My objection is to having additional chunking inside a single append operation.
A new method is an additional implementation hurdle since there are HTTP libraries that don't support that.
Well, if you know about the specific limitation and you might go over, you should always set the upload to incomplete. Otherwise, it's OK to optimistically do complete uploads, and attempt a resumption when the CDN terminates the oversized upload.
Upload complete makes the upload creation procedure compatible with a non-resumable upload, and allows feature detection and transparent upgrade to resumable uploads. Upload incomplete is an advanced feature for people who cannot depend on 1xx responses, or need chunked uploads. The design tradeoff has been extensively debated and iterated to meet the requirements of various adopters. We are willing to switch to a mechanism which has direct translations of our current operation mode, with an equivalent or strictly more powerful feature set. "Append this request body to an existing upload, treat this as the final operation or not" isn't very complicated semantics. |
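For the chunked case discussed here (known length, request bodies capped by an intermediary), a client could plan its appends as in the Python sketch below. `plan_chunks` is a hypothetical helper and the limit is whatever the deployment imposes; only the last planned append would be marked as completing the upload.

```python
def plan_chunks(total_length, max_body):
    """Yield (offset, length, is_last) for appends no larger than max_body bytes."""
    if max_body <= 0:
        raise ValueError("max_body must be positive")
    offset = 0
    while True:
        length = min(max_body, total_length - offset)
        # The append that reaches total_length is the one that completes the upload.
        is_last = offset + length == total_length
        yield offset, length, is_last
        if is_last:
            return
        offset += length
```

A 250-byte upload with a 100-byte limit becomes three appends at offsets 0, 100, and 200, with only the third flagged as final. A client that does not know the length up front cannot plan this way, which is the case the preceding comments debate.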
We talked about this again after the httpapi session at IETF 118 and there are two relevant points. draft-wright-http-patch-byterange might move away from changing the Content-Range header to instead using a new header, for example Content-Offset (or Upload-Offset or some other name). This could be the same header that is used in resumable uploads, allowing both drafts to use a common header field without running into issues with Content-Range. In addition, an explicit design goal of the message/byterange media type from draft-wright-http-patch-byterange is to include the range (or offset) of the partial content in the request body itself. That's why the request body consists of a set of headers before the actual partial content (notice the additional empty line and the Content-Length: 272):
The purpose is that the message is an encapsulated patch document that can be stored on disk or sent to clients in responses to GET requests. This document not only contains the partial chunk but also the range at which it should be applied. That's why it is preferable to include the range in the message body instead of a header field. While this might make sense for some applications, I don't think this is a good approach for resumable uploads. Here we should rather include the offset in an actual header field instead of having to parse it from the request body. @awwright Is this a good summary of our discussion? Let me know if I missed something. |
@Acconut Yes, this is an excellent summary; using Content-Offset as a Partial PUT would look like:
PUT /uploads/foo HTTP/1.1
Content-Offset: bytes 5000
Transfer-Encoding: Chunked

[bytes...]
Additionally, I was thinking about your proposal on the list, what about using "application/octet-stream" as the Content-Type in the manner you were proposing, where the upload body is only the data to write, and the metadata is provided in HTTP headers? I think this would be reasonable as there's not very many things that "application/octet-stream" in a PATCH upload could possibly do. Naturally, you can assume the upload body is an opaque binary blob, and so the only place the metadata could possibly be would be out-of-band, i.e. the HTTP headers:
PATCH /uploads/foo HTTP/1.1
Content-Type: application/octet-stream
Content-Offset: bytes 5000
Transfer-Encoding: Chunked

[bytes...]
The rules for using "application/octet-stream" like this might be "the body MUST be an HTTP partial response, a message without recognized partial headers (Content-Range, Content-Offset) MUST be an error." |
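A server-side check for the proposed rule could look like the Python sketch below. It is only an illustration under stated assumptions: Content-Offset is the header name floated in this thread rather than a registered field, the `bytes N` value syntax is taken from the example above, Content-Range is not handled for brevity, and `offset_for_patch` is a hypothetical name.

```python
def offset_for_patch(headers):
    """Return the write offset for an application/octet-stream PATCH, or raise ValueError."""
    # Header field names are case-insensitive, so normalize them first.
    lowered = {name.lower(): value.strip() for name, value in headers.items()}
    if lowered.get("content-type") != "application/octet-stream":
        raise ValueError("unsupported patch media type")
    value = lowered.get("content-offset")
    if value is None:
        # Without a recognized partial-content header the body cannot be applied.
        raise ValueError("missing Content-Offset")
    unit, _, number = value.partition(" ")
    if unit != "bytes" or not (number.isascii() and number.isdigit()):
        raise ValueError(f"malformed Content-Offset: {value!r}")
    return int(number)
```

For the PATCH request shown above this returns 5000; the same request without Content-Offset is rejected instead of being treated as a full replacement.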
Well. Using PUT with a new header field risks that existing servers will ignore it. |
@reschke Indeed that's the whole motivation for Byte Range PATCH. However for resumable uploads, where the server establishes support for resumption with a 1xx code, that's unlikely to be a problem. |
@awwright I see little benefit of using PUT over PATCH here. Yes, if the server returns an upload resource URL to the client, it should make sure that that endpoint does support the method and the needed headers properly. However, with PATCH we have a more explicit approach to this, where the media type ensures that the server knows how to apply the patch. With this, we can avoid any possible compatibility issues with PUT in existing infrastructure. My current preference would be to add an For example:
|
Having thought about this a bit since the session, I find myself coming back to the definitions of PUT vs. PATCH. PUT creates a new resource out of the thing being uploaded in the body; PATCH transforms an existing resource using some description defined by the media type. If resumable upload were specified as a special-case of coming back to the same upload endpoint with special headers to indicate it's a continuation of the same PUT, then I think a PUT with appropriate decoration would be appropriate. But it's not -- it creates an "upload resource" to represent the pending upload, and the client is able to act on that temporary upload resource in subsequent calls. Semantically, I think PATCH is appropriate -- the client is transforming the upload resource into what should have been the complete body of the interrupted request, and at the end the server processes the completed resource as if it were the original request. |
Exactly, that is how the current draft is laid out. This is also how most people implementing the protocol on the server-side will approach their implementation. A temporary resource is created, which gets modified over time until the upload is considered complete. Your argument is a great formulation about why PATCH is the correct choice here. Thanks for sharing it. I agree that PUT is not the best fit here. Now we just need to think about the media type to properly describe the semantics of the PATCH request. |
Closing in favor of #2610. |
Let's look into adopting byte range PATCH proposed in httpapi WG by @awwright