Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Chunked upload performance improvements #47682

Open
provokateurin opened this issue Sep 2, 2024 · 26 comments · May be fixed by #49753
Open

Chunked upload performance improvements #47682

provokateurin opened this issue Sep 2, 2024 · 26 comments · May be fixed by #49753
Assignees
Labels
Milestone

Comments

@provokateurin
Copy link
Member

Motivation

Chunked upload is useful in many cases, but also slower than uploading the entire file directly.
Overall it is slower because multiple network requests have to be made.
The speed of requests varies during their lifetime, as it is slower at the beginning and then flattens out at the maximum over time.
The smaller the requests are and the more we make, the worse performance penalty.
Thus to increase chunked upload speed the size of chunks should be increased.

A single upload using cURL is the upper limit of the possible upload speed for any configuration and upload method.
A chunked upload with the chunk size equal to or greater than the file size represents the upper limit for chunked uploads as it only uploads a single chunk.
While reaching the former would be nice, only the latter is achievable (without general performance improvements in WebDAV regardless of the maximum chunk size) and thus represents the theoretical goal.

Testing methodology

Input

dd if=/dev/random of=1G.bin bs=1G count=1

Scenarios

All tests are running on a local instance using the PHP standalone web server with 10 workers and no extra apps enabled.
The machine has a Ryzen 5 5800X (8 threads, 16 cores), 48GB RAM and a NVMe M.2 Samsung 980 SSD 1TB.
Hardware should not be a bottleneck on this setup and external networking can not have an effect either.

1. cURL single upload

Take the Real timing value.

time curl -X PUT "http://localhost:8080/remote.php/webdav/1G.bin" -u admin:admin --upload-file 1G.bin

Runs:
5.412s
5.223s
5.100s
Average: 5.245s

Note: I once saw an outlier that only took about 4.7s, but this never happened again.

Chunked upload via browser

Open Firefox Devtools and filter network requests by dav/uploads.
Upload 1G.bin via web interface.
Take Started of first (MKCOL) and the Downloaded of the last (MOVE) request and subtract them (See the Timings tab of each request).
This includes some constant overhead for the MKCOL and MOVE requests which is not relevant for comparing chunked upload timing results as they all have the same overhead, but when comparing to the cURL scenario it accurately measures the overall time for the upload process.

According to https://firefox-source-docs.mozilla.org/devtools-user/network_monitor/throttling/index.html "Wi-Fi" throttling means a maximum speed of 15 Mbps.
Sadly this is the "fastest" speed one can select for throttling and there is no way to set a custom speed.
It should represent a worst-case, while most uploads are probably done with 2-3x that speed in the real world if the Nextcloud instance is not on the same network.

Adjusting the default maximum chunk size can be done in

$maxChunkSize = (int)Server::get(IConfig::class)->getAppValue('files', 'max_chunk_size', (string)(10 * 1024 * 1024));

2. Chunk size 10MiB (current default), unlimited bandwidth

Chunks: 103
Runs:
47.16s
47.65s
47.33s
Average: 47.38s

3. Chunk size 100MiB, unlimited bandwidth

Chunks: 11
Runs:
8.53s
8.64s
8.63s
Average: 8.6s

4. Chunk size 1024MiB, unlimited bandwidth

Chunks: 1
Runs:
6.37s
6.34s
6.34s
Average: 6.35s

5. Chunk size 10MiB (current default), throttled "Wi-Fi"

Chunks: 103
Runs:
551.40s
551.40s
551.40s
Average: 551.40s

6. Chunk size 100MiB, throttled "Wi-Fi"

Chunks: 11
Runs:
552.60s
549.60s
551.40s
Average: 551.2s

7. Chunk size 1024MiB, throttled "Wi-Fi"

Chunks: 1
Runs:
568.20s
555.60s
553.11s
Average: 558.97s

Conclusions

  1. Upload speed in Nextcloud is very consistent regardless of upload method. Great!

  2. Chunked upload in general takes about 21% longer in scenarios with unlimited bandwidth (scenario 1 and 4). Whether this overhead can be eliminated easily is not clear, but at least there is no hard limitation since both uploads are done through WebDAV and thus use the same underlying stack (also see other interesting findings section below).

  3. In the current default configuration with unlimited bandwidth chunked upload is takes 646% longer than the maximum speed (scenario 2 and 4). By increasing the chunk size by 10x keeping the bandwidth ulimited it only takes 35% longer than the maximum speed (scenario 3 and 4). This is a 5.5x increase in total throughput (scenario 2 and 3).

  4. In bandwidth limited scenarios increasing the chunk size has almost no positive (and no negative effect; scenario 5 and 6). This is expected as the slow speed at the beginning of each chunk is a lot smaller on relation to the overall speed or even exactly the same.

  5. Increasing the chunk size helps uploads on fast connections while it has no downsides on slow connections speed wise. Slow networks can be correlated with unstable networks, so having fewer and larger chunks could result in a higher rate of aborted chunk uploads. This downside should be taken into consideration when choosing a new maximum chunk size.

  6. A new maximum chunk size still needs to be figured out by collecting more data for different chunk sizes. It needs to hit a sweet spot of maximum speed with minimum size to account for the before mentioned drawback on unstable networks (basically the point of diminishing returns). This investigation was only to prove that we can increase the chunked upload speed.

Other interesting findings

While uploading with a single chunk and unlimited bandwidth Firefox displayed that the request needed 637ms to send but had to wait 2.10s after that (reproducible). This might show that we have a bottleneck in processing the uploads on the backend side. Maybe it would be possible to stream the request data directly into the file while should cut down the waiting time a lot. It should be possible to profile these requests and figure out where the time is spent.

For single chunks the MOVE request still takes quite some time. I assume this happens because it concatenates the chunks while there is only one (which is slow because it has to read and write all the data). This case could be detected and the file moved without reading and writing it (which is not possible for all storages AFAICT, i.e. it needs to be on the same physical disk to take advantage of it). This only affects uploads where the file size is less than the maximum chunk size. Due to the current low limit it is not really noticable, but with a higher maximum chunk size this would affect a lot more and bigger uploads and could lead to quite a performance improvement for those.

Upload ETA calculations are all over the place due to the varying speed of uploading chunks over their lifetime. It could be improved by taking the overall time a chunk needs to upload and multiplying it by number of remaining chunks. This should be a lot more reliable as every chunk should have a similar upload characteristics. To smooth out the the value it could take into account the last 3-5 chunks.

@provokateurin provokateurin added 1. to develop Accepted and waiting to be taken care of feature: dav performance 🚀 labels Sep 2, 2024
@provokateurin provokateurin self-assigned this Sep 2, 2024
@sorbaugh sorbaugh added this to the Nextcloud 31 milestone Sep 2, 2024
@joshtrichards
Copy link
Member

Good stuff.

A couple thoughts:

  • The Desktop client has already traveled the bumps from 10MB -> 1000MB -> 5000MB so there are some data points there. The main problem with the desktop client's approach at the moment is that we don't have back-off logic in place when the higher chunk size doesn't work. This creates problems when it runs up against web/proxy limits (default or otherwise)1. But that is fixable (I think).

  • To avoid unnecessary support issues, invalid bug reports, and a poor user experience: Increasing the default chunk size should probably account for widespread limitations2 (or) have a good dynamic/back-off logic to reconfigure if the end-to-end path doesn't seem to be working with the larger chunk size (or) be really well documented. Preferably either/both of the former two. :-)

  • We have variance in behavior among our various clients. Ideally we unify defaults and behavior. Unless there's a compelling reason not too of course.

  • A shared ability to set the chunk size either globally or on a per client type basis might be nice. Currently the server-side parameter is only applicable to the Web client. And the chunk size isn't even configurable in our Android client (for example).

Footnotes

  1. e.g. See "Connection closed" message when syncing files larger then +- 100Mb desktop#4278

  2. Reasons this happens: Apache LimitRequestBody defaults to 1GB these days (used to be unlimited by default), CloudFlare freebie level (100MB) which leads to lots of false bug reports/poor user experiences, and various other web/proxy body/transaction size limits.

@provokateurin
Copy link
Member Author

provokateurin commented Sep 2, 2024

@joshtrichards thanks for your input! Highly appreciated having pointers to related topics :)

@provokateurin
Copy link
Member Author

Hi, @nextcloud/desktop would it be possible to sit down with some of you at the Conference in two weeks and discuss this? I think we can figure out something together, solve the problems and unify the behavior across clients and the web.

@provokateurin
Copy link
Member Author

Also @joshtrichards (and anyone else) feel free to join in case this takes place. You seem to know quite a lot of paint points of this problem which would be really helpful to consider.

@camilasan
Copy link
Member

@provokateurin Matthieu and me will be at the conference, could you put something on the calendar for this?
will you join us @joshtrichards? :)

@provokateurin
Copy link
Member Author

I'll try to find a time slot (the calendar is really full huh) and get it added to the official agenda so everyone interested can join. To avoid collisions with any other event it probably has to be late afternoon.

@provokateurin provokateurin moved this to 📄 To do (~10 entries) in 📁 Files team Sep 3, 2024
@provokateurin
Copy link
Member Author

provokateurin commented Sep 3, 2024

Ok, the meeting will take place Wednesday 18.9. 16:00 at Motion Lab. The room is still to be determined, but will be available in the public agenda and calendar on time.

@joshtrichards
Copy link
Member

I wish I could join, but I'll not in be town for the conference.

@provokateurin
Copy link
Member Author

I just realized I might have made a mistake. Instead of the 10 workers I think only 4 were available. Given that the chunked upload tries to upload 5 concurrent chunks at least one of the might have been stalled the whole time.
I will make new measurements to confirm if this is the case and affected the results. Overall it shouldn't changed the picture though, as only up-to 20% of the performance was lost due to this error which is a lot less than the noticed overall performance loss utilizing multiple concurrent chunks.

@provokateurin provokateurin added 2. developing Work in progress and removed 1. to develop Accepted and waiting to be taken care of labels Sep 9, 2024
@cfiehe
Copy link

cfiehe commented Sep 11, 2024

During my analysis of #47856, I have found some interesting points regarding file upload optimization in Nextcloud:

One problem in case of local storage is that OCA\DAV\Connector\Sabre\Directory does not handle moveInto efficiently. It always falls back to Sabre’s default copy-delete handling instead of making use of a more efficient renaming way when source and target are located on the same storage.

Another problem is in View.php, where Nextcloud tells the storage to use rename instead of moveFromStorage in the only case, when source and target storage are represented by the same object $storage1 === $storage2. This criterion is too strict in my opinion and will never be true in case of groupfolders, where an individual storage representation gets used. The criterion does also not play well with all those storage wrappers and prevents a more efficient way to handle file renamings on the same storage.

@provokateurin
Copy link
Member Author

@cfiehe thanks for your insights! We will consider them in the discussion, although this issue itself is more about the chunked upload performance itself. I only noticed similar issues and noted them down here so they can be further investigated in the future.

@provokateurin
Copy link
Member Author

provokateurin commented Sep 19, 2024

Hi everyone,
we had a really good discussion about chunked upload performance.
I will try to summarize the outcome and ideas we had:

  1. The upload speed for the existing chunked upload methods can be improved by choosing an appropriate chunk size and we settled on 100MiB based on the benchmarking and the input @joshtrichards gave. This setting should be configurable by the admin in config.php and exposed to the clients and web via capabilities. The same goes for the number of parallel chunks.
  2. To fully saturate the upload bandwidth and reduce the number of connections a continious upload using HTTP Content-Range (https://racineennis.ca/2017/03/21/Resumable-file-transfers-with-content-range-header) can be implemented. This is not backwards compatible with the existing chunked upload method and needs to be implemented differently. The main benefit is that we don't manually control the chunk size by guessing a sensible value and that we instead let the request upload as long as possible. At some point a timeout will occur, but then we can just continue where we stopped. This means there is only 1 connection per file upload needed and we use the maximum time possible to upload which minimizes the overhead of each request as much as possible. This also has benefits on unstable networks as the chunk size is dynamic by definition, so even if your connection drops after every 1KiB, you will be able to upload large files.
  3. Assembly of single chunks can be optimized by just moving the chunk to be the file instead of reading the chunk to memory and writing it as the file. See comment below
  4. The assembly of the chunks is very time consuming in the upload process. As I already discovered earlier, it even happens when you only upload a single chunk which can be optimized (just move the chunk/file instead of reading and writing it). Due to relying on an HTTP request that will also run into a timeout eventually there is actually a hard limit on the maximum file size one can upload to a particular server, as any file that can not be assembled within the timeout time can never be uploaded to the server. We had the idea to do the assembly "lazy" in the background, such that the final MOVE request would only indicate that the upload is done. There are two variants to implement this:
    1. Do the assembly in the background using a background job and display to the user that the file still needs to be processed. This needs adaption in the UI and doesn't provide a great UX.
    1. Same as i, but we display the file like any regular file to the user (including read&write) and keep a reference to the unassembled chunks as long as they have not been assembled by the background job yet. When reading the data of the file we just chain the streams of the chunks, such that they represent the entire file. Performance wise this could have some downsides as the speed would go up and down at the chunk borders, but it provides a much better UX as the file is directly usable after the upload. This way to implement it also has the benefit that it doesn't need changes in the UI and business logic on the web and clients.

Items 1, 2, 3 and 4 can all be implemented indepently of another, while 1 and 3 and very easy, 2 needs quite some work on the backend and client sides and 4 needs more research/talking to experts to figure out if option 4.ii is possible.
The combination of all these ideas should lead to the best possible performance and UX which is of course a super nice goal to achieve.

Thank you again to all the people who attended, the outcome is way better than what I had hoped for.

CC @sorbaugh @AndyScherzinger @tobiasKaminsky

@provokateurin
Copy link
Member Author

Assembly of single chunks can be optimized by just moving the chunk to be the file instead of reading the chunk to memory and writing it as the file.

I implemented this, but it actually makes no difference, because the chunk is read into memory anyway. It even seemed to slow down the process a few ms, maybe because it no longer used streams but the entire data directly.

@juliusknorr
Copy link
Member

juliusknorr commented Oct 17, 2024

Assembly of single chunks can be optimized by just moving the chunk to be the file instead of reading the chunk to memory and writing it as the file.

I implemented this, but it actually makes no difference, because the chunk is read into memory anyway. It even seemed to slow down the process a few ms, maybe because it no longer used streams but the entire data directly.

One additional limitation there is on the PHP side (ref #43636 (comment)). I could imagine some performance gain if we'd use a POST for upload of chunks where PHP exposes the temporary file directly, but doesn't seem to be possible with the current PUT approach for chunks. In the current state there is always a file written by PHP itself which we then read into memory through fopen('input://) as a stream and write again to the actual destination.

@provokateurin
Copy link
Member Author

@juliushaertl thanks for the insight. I will keep it in mind for point 2, then it can be even faster 👍

@JMarcosHP

This comment has been minimized.

@provokateurin

This comment has been minimized.

@provokateurin
Copy link
Member Author

provokateurin commented Nov 22, 2024

For point 2:
Instead of using Content-Range it makes more sense to implement https://datatracker.ietf.org/doc/html/draft-ietf-httpbis-resumable-upload, as it is standardized and they way it works looks very solid to me. This also makes it much easier for others to use the API, because there are already clients out there than implement the protocol.
I also found https://github.com/ankitpokhrel/tus-php which could either be used directly if possible (I doubt that can be done easily) or at least used as a reference for how it can be implemented in PHP. I learned that this is actually not an implementation of the RFC... it's a bit confusing because the RFC is inspired by tus (v2) but not the same.

Thanks @Leptopoda for pointing me to this RFC :)

@provokateurin
Copy link
Member Author

provokateurin commented Nov 22, 2024

Actually now that I think about it, we can probably get rid of the assembly step completely, which makes point 4 irrelevant (except for older chunked upload implementations). With a single upload connection we can just append the already uploaded data. Having the chunks separate and then merging them is only necessary due to parallel connections. I need to think a bit more about it, but I hope it is possible to save a lot of work by just taking the simple route which is also way more performant and less resource intensive.

@provokateurin
Copy link
Member Author

Implementation of the draft RFC is here: #49753
Please see the comment with the explanation why it won't be merged yet (and probably not for 31 at all).

@brunofin
Copy link

Hi I would like to know if this would possibly also address the following issue I've been having: https://help.nextcloud.com/t/long-running-request-to-move-file-after-large-file-upload-times-out-causing-del-to-be-called/212394

To summarize the post from the community forum: A long-running request to MOVE .file at the end of a large chunked file upload (10GB) times out behind CloudFlare argo tunnel for taking longer than the imposed 1min30s for proxied connections, causing the frontend to see this as an error (even though it was not) and react to it by making a request to DELETE <file id> cancelling the assembly and deleting everything.

Would the changes proposed in the PR have a positive impact in any way in this situation? Thanks.

@provokateurin
Copy link
Member Author

Yes the PR above will fix that issue. The problem you are describing is unfortunately not fixable with the current architecture of chunked upload. The only way to work around it is to increase timeouts if possible (but that can have other downsides as well).

@brunofin
Copy link

Thanks for your reply. I am glad to hear that would solve the issue. That's been a blocker for me to move myself and me entire family to NC, but knowing there's a solution to be released eventually is nice.

In order to speed up things, I can offer my help. I'm not experienced with PHP at all so I couldn't help there but I'm very experienced with a range of frontend tech and some open source contributions as all, if needed at all I could try to provide some contribution.

@provokateurin
Copy link
Member Author

Thanks for the offer to help, but currently the problem is that the RFC is still in draft and we want to wait for it to be finalized. Hopefully that is going to happen within the next few months and the feature will land in 31 or 32 later next year.

@modernNeo

This comment has been minimized.

@provokateurin
Copy link
Member Author

@brunofin is experiencing a different problem which can not be fixed by changing chunk sizes. In this case the assembly of the chunks can never finish because the timeouts always end the request early.

Let's stay on topic please and use other issues or the forum to discuss existing problems that are not directly related to this issue.

@sorbaugh sorbaugh moved this from 🏗️ In progress to ☑️ Done in 📁 Files team Jan 15, 2025
@sorbaugh sorbaugh moved this from ☑️ Done to 🏗️ In progress in 📁 Files team Jan 17, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
Status: 🏗️ In progress
Development

Successfully merging a pull request may close this issue.

9 participants