
WebDAV extension MPUT for multiple file upload [WIP] #12001

Closed
DeepDiver1975 opened this issue Nov 6, 2014 · 32 comments
@DeepDiver1975 (Member)

To increase sync performance when transferring multiple files, we will introduce a new HTTP verb: MPUT.

Requirements:

  • the target folder has to exist
  • uploaded files are restricted to this folder - no sub-folder uploads

Request:

MPUT  /test/sub-folder-1/sub-folder-2/ HTTP/1.1
   Host: www.foo.bar
   Content-Type: text/xml; charset="utf-8"
   Content-Length: xxxx

   <?xml version="1.0" encoding="utf-8" ?>
   <D:mput xmlns:D="http://www.owncloud.org/">
     <D:file>
          <D:name>test1.txt</D:name>
          <D:ctime>2009-06-30T18:00:00+02:00</D:ctime>
          <D:mtime>2009-06-30T18:30:00+02:00</D:mtime>
          <D:etag>aabbccddeeff</D:etag>
          <D:size>1024</D:size>
          <D:data encoding="base64">ABCDEFGH</D:data>
     </D:file>
     <D:file>
          <D:name>test2.jpeg</D:name>
          <D:ctime>2009-06-30T18:00:00+02:00</D:ctime>
          <D:mtime>2009-06-30T18:30:00+02:00</D:mtime>
          <D:etag>aabbccddeeff</D:etag>
          <D:size>4024000</D:size>
          <D:data encoding="base64">ABCDEFGH</D:data>
     </D:file>
   </D:mput>
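For illustration only, a minimal Python sketch of how a server might unpack such an MPUT body (element names follow the example above; `parse_mput` and its return shape are invented for this sketch, not ownCloud code):

```python
import base64
import xml.etree.ElementTree as ET

NS = "{http://www.owncloud.org/}"

def parse_mput(body: bytes):
    """Unpack a hypothetical MPUT body into a list of file dicts."""
    files = []
    for f in ET.fromstring(body).findall(NS + "file"):
        data_el = f.find(NS + "data")
        raw = data_el.text or ""
        # The encoding attribute selects the transfer encoding of the payload.
        data = base64.b64decode(raw) if data_el.get("encoding") == "base64" else raw.encode()
        files.append({
            "name": f.findtext(NS + "name"),
            "mtime": f.findtext(NS + "mtime"),
            "etag": f.findtext(NS + "etag"),
            "size": int(f.findtext(NS + "size")),
            "data": data,
        })
    return files
```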
@DeepDiver1975 (Member, Author)

@dragotin @jnweiger

@LukasReschke (Member)

Base64 encoding the files will make the files about 33% larger. This is a massive performance hit.

Can't we use multipart for that? Also what is the etag that the client sends to the server?
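For reference, the ~33% overhead follows directly from base64 mapping every 3 input bytes to 4 output characters:

```python
import base64
import os

payload = os.urandom(3_000_000)          # 3 MB of binary data
encoded = base64.b64encode(payload)      # 4 MB after encoding
overhead = len(encoded) / len(payload)   # exactly 4/3, i.e. +33.3%
```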

@LukasReschke (Member)

Also, will this work with *.part uploads here? Otherwise, decoding the base64 might hit the PHP memory limit very quickly, to my understanding.

@PVince81 (Contributor) commented Nov 6, 2014

Too bad we can't just send pure binary instead of XML.

@LukasReschke you are talking about chunks, right?

@LukasReschke (Member)

@LukasReschke you are talking about chunks, right?

Yes. - Sorry for the incorrect term ;-)

@PVince81 (Contributor) commented Nov 6, 2014

I'd say chunks and MPUT here are quite the opposite. The purpose of chunks is to split a file in pieces and reduce the size of requests.

With an MPUT request I guess we'd rather have much bigger requests, or at least we'd bundle multiple small files as long as the total payload is less than 10 MB.
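As a sketch of that bundling idea (the 10 MB cap comes from the comment above; the function and its names are hypothetical):

```python
MAX_PAYLOAD = 10 * 1024 * 1024  # assumed bundle cap from the discussion

def batch_files(files):
    """Group (name, size) pairs into MPUT bundles below the cap; files at or
    above the cap are left to ordinary PUT/chunked upload."""
    batches, singles = [], []
    current, current_size = [], 0
    for name, size in files:
        if size >= MAX_PAYLOAD:
            singles.append(name)
            continue
        if current and current_size + size > MAX_PAYLOAD:
            batches.append(current)
            current, current_size = [], 0
        current.append(name)
        current_size += size
    if current:
        batches.append(current)
    return batches, singles
```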

@DeepDiver1975 (Member, Author)

MPUT is designed for multiple small files; chunking is out of scope.

@DeepDiver1975 (Member, Author)

Base64 encoding the files will make the files about 33% larger. This is a massive performance hit.

We're free to use whatever encoding we like; that's why the encoding can be specified as an attribute.

@DeepDiver1975 (Member, Author)

Also what is the etag that the client sends to the server?

The etag is to be used for change detection, in line with the If-Match header.

@DeepDiver1975 (Member, Author)

Too bad we can't just send pure binary instead of XML.

If we plan to use this as a WebDAV extension, we should stick to XML; otherwise we could start implementing our own interface ..... 🙈

@karlitschek (Contributor)

Definitely a very interesting idea. It would be good to evaluate the possible performance improvements a bit more.

@guruz (Contributor) commented Nov 7, 2014

I'd prefer HTTP Multipart as there you can use binary.
This is also supported by our HTTP library: http://qt-project.org/doc/qt-5/qnetworkaccessmanager.html#put-2

In general I think this is the wrong way to go. You'll open a big can of worms. What about HTTP timeouts? What if some files fail and some succeed?

The problem of stuffing more into one pipe has already been solved and is called SPDY (which is also supported by our HTTP library). Then the server-side code can stay the same and each file can be handled individually.
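A rough sketch of the multipart idea (hand-rolled multipart/mixed framing, not ownCloud or Qt code; binary payloads need no base64 here):

```python
import secrets

def build_multipart(files):
    """files: list of (filename, payload bytes). Returns (content_type, body)
    for a hypothetical bulk upload using multipart/mixed framing."""
    boundary = "owncloud-" + secrets.token_hex(8)
    chunks = []
    for name, data in files:
        chunks.append(
            b"--" + boundary.encode() + b"\r\n" +
            b'Content-Disposition: attachment; filename="' + name.encode() + b'"\r\n' +
            b"Content-Type: application/octet-stream\r\n" +
            b"Content-Length: " + str(len(data)).encode() + b"\r\n\r\n" +
            data + b"\r\n"
        )
    body = b"".join(chunks) + b"--" + boundary.encode() + b"--\r\n"
    return "multipart/mixed; boundary=" + boundary, body
```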

@DeepDiver1975 (Member, Author)

In general I think this is the wrong way to go. You'll open a big can of worms. What about HTTP timeouts? What if some files fail and some succeed?

WebDAV defines the multi-status-response which can return individual status for each file. But yes - I understand your concerns.

The problem of stuffing more into one pipe has already been solved and is called SPDY (which is also supported by our HTTP library). Then the server-side code can stay the same and each file can be handled individually.

This would be worth testing, as would comparing mod_php and php-fpm.

@evert commented Dec 2, 2014

Hi guys,

Is it possible that your client is simply not optimized enough? The idea that a lot of small files can take a long time to upload makes a lot of sense to me, due to latency.

But this should be largely solvable by using HTTP pipelining, which afaik is pretty widely supported. With HTTP pipelining the client keeps sending requests without waiting for a response from the server.

A server would just send back all the response bodies in order, and clients that support this can figure it all out.

Trying to combine this all in one HTTP request seems like a hack. It would be a valid hack if pipelining is for some reason impossible.
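The pipelining idea can be demonstrated with Python's standard library alone (a toy server and a raw socket; none of this is ownCloud or Qt code): both requests go out before any reply is read, and the replies come back strictly in request order.

```python
import socket
import threading
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer

class Handler(BaseHTTPRequestHandler):
    protocol_version = "HTTP/1.1"  # keep-alive is a prerequisite for pipelining

    def do_GET(self):
        body = ("response for %s" % self.path).encode()
        self.send_response(200)
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):  # keep the demo quiet
        pass

server = ThreadingHTTPServer(("127.0.0.1", 0), Handler)
threading.Thread(target=server.serve_forever, daemon=True).start()

# Pipelining: write both requests back-to-back, then read the replies.
request = "GET /file%d HTTP/1.1\r\nHost: localhost\r\n\r\n"
conn = socket.create_connection(server.server_address)
conn.sendall((request % 1 + request % 2).encode())

data = b""
while b"response for /file2" not in data:
    data += conn.recv(4096)
conn.close()
server.shutdown()
```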

@guruz (Contributor) commented Dec 2, 2014

@evert Pipelining makes sense for GET requests to a reliable web server. In our case though:
The ownCloud server can behave arbitrarily in terms of response time because of

  • slow remote server storage backends
  • the amount of SQL queries it does (because of architecture oddities or plain bugs)
  • PHP session locking issues

So if the current request in the pipeline takes long, the other requests are not processed, and because of our HTTP stack we don't have good insight into what is going on. That's why the client currently limits itself to 3 parallel requests and doesn't pipeline.

I started investigating SPDY which I see as superior to Pipelining:

  • You push in all the requests and they can be answered in any order (pipelining is FIFO)
  • We can do all the SSL/TCP negotiation only once instead of doing it for all 3 sync connections (plus some other connections for things mirall does)

@evert commented Dec 2, 2014

My understanding was that pipelining is something that the webserver handles (e.g.: nginx) and would just fire up multiple PHP "threads" without waiting for requests to be fully handled.

I suppose that's a wrong assumption then?
Man.. I would love to do some more testing to figure this bit out then.

Lacking pipelining though, I would suggest using POST with a unique Content-Type over inventing a new HTTP verb. And remember that you're not stuck with XML; multipart may be a great format for this, as @LukasReschke mentioned.

@guruz (Contributor) commented Dec 2, 2014

@evert Yes, from PHP's perspective it is parallel. However, the webserver can only send back the replies FIFO (it's a single pipeline after all, not a line with multiple channels).

Another problem (why I mentioned only GET/idempotent is OK) is that one request might error out and then break the TCP connection and then you don't really know what happened to the other requests in the pipeline.

I had implemented pipelining in QNetworkAccessManager some years ago :-)
But I think SPDY is more interesting; the only problem I see right now is that QNetworkAccessManager only speaks SPDY 3 while everyone on the web uses 3.1.

@evert commented Dec 2, 2014

Yes, from PHP's perspective it is parallel. However, the webserver can only send back the replies FIFO (it's a single pipeline after all, not a line with multiple channels).

Isn't that the same for putting the requests in a single HTTP body? You're simply doing that on a different layer.

Another problem (why I mentioned only GET/idempotent is OK) is that one request might error out and then break the TCP connection and then you don't really know what happened to the other requests in the pipeline.

My point is the same here.

Idempotence is actually a great point here, because you can actually just repeat the PUT requests you didn't get an answer for. The ones that succeeded earlier would return a 412, and the others would just come through. This is not true for a custom method or POST.

If SPDY makes this easier, this would be great as well though. It addresses the issue in a similar way: by fixing it on a protocol level, as opposed to the application level.
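The idempotent-retry argument above can be modeled in a few lines (a toy in-memory store; the function and field names are invented, not ownCloud's API): a creating PUT sent with If-None-Match: * returns 412 when blindly retried after the file already exists, so repeating unanswered PUTs is safe.

```python
import hashlib

def conditional_put(storage, name, data, if_none_match=None, if_match=None):
    """Toy model of HTTP conditional PUT semantics; returns a status code."""
    exists = name in storage
    if if_none_match == "*" and exists:
        return 412  # Precondition Failed: creation already happened
    if if_match is not None and (not exists or storage[name]["etag"] != if_match):
        return 412  # Precondition Failed: file changed (or missing)
    storage[name] = {"etag": hashlib.md5(data).hexdigest(), "data": data}
    return 204 if exists else 201
```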

@dragotin (Contributor) commented Dec 4, 2014

How would the server side need to be changed if we use SPDY? As I understand it, it comes as a web server plugin which handles more requests through shared TCP connections. That would mean the server does not need general code changes to support it, but would "only" need to optimize the handling of parallel requests.

Another issue might be that not all cheap hosting offerings provide the SPDY module for the web server, but those would continue to run, just with less impressive performance.

Sounds interesting!

@DeepDiver1975 (Member, Author)

My plan was to actually test different setups and measure the performance - @MorrisJobke you still owe me some dockers 😉

@MorrisJobke (Contributor)

@DeepDiver1975 I know :(

@oparoz (Contributor) commented Dec 4, 2014

I don't know how quickly you want to move to SPDY, but mod_spdy is not yet available on Apache 2.4, requires PHP-FPM in order to run at full speed and a special version of mod_ssl which will be obsolete very soon.

@guruz (Contributor) commented Dec 5, 2014

@oparoz No worries, SPDY is just an alternative to HTTPS; it does not change the sync algorithm. All optional... and we're still brainstorming here right now :)

@PVince81 (Contributor) commented Feb 4, 2015

Should we continue this discussion for 8.1?

@PVince81 (Contributor) commented Feb 4, 2015

(or during 8.1 but implement in a later release)

@DeepDiver1975 DeepDiver1975 added this to the 8.1-next milestone Feb 5, 2015
@DeepDiver1975 (Member, Author)

Discussion during 8.1 for sure

@jturcotte (Member)

[Moved from #15264]
Here is a patch for the performance-tests-c++ that uses HTTP multipart to send multiple PUTs in a single HTTP request:
https://gist.github.com/jturcotte/d787bf8c0c6f3f92405d

You can apply it on top of https://github.com/owncloud/administration

If you notice performance improvements worth the extra code complexity, we can then make sure that the client takes advantage of it.

@jturcotte (Member)

Unfortunately the Qt network stack currently doesn't support sending multipart requests except when using PUT and POST, so on the client side it would be easier to go that way instead of MPUT.

Could you guys detect this from a different URI or from the multipart/mixed Content-Type?
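Detecting bulk uploads from the Content-Type, as asked above, could be as simple as this hypothetical helper (not ownCloud code):

```python
def is_bulk_upload(method, headers):
    """Treat a plain PUT/POST as a bulk upload when the body is
    multipart/mixed, instead of introducing a new MPUT verb."""
    ctype = headers.get("Content-Type", "")
    return method in ("PUT", "POST") and ctype.split(";")[0].strip() == "multipart/mixed"
```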

@guruz (Contributor) commented Mar 27, 2015

@jturcotte Let's not go crazy here, POST/PUT is just fine. Everything else brings other issues with load balancers or proxies :-)

@evert commented Mar 30, 2015

I agree that it will be more robust to define a new content type for a POST request than to introduce a new HTTP method.

@PVince81 (Contributor)

@DeepDiver1975 defer to 8.2 ?

@DeepDiver1975 DeepDiver1975 modified the milestones: 8.2-next, 8.1-current Apr 23, 2015
@DeepDiver1975
Copy link
Member Author

We think we don't need bulk operations - closing this for now.

@MorrisJobke MorrisJobke removed this from the 8.2-next milestone Jun 22, 2015
@lock lock bot locked as resolved and limited conversation to collaborators Aug 11, 2019