
Asynchronously upload pack files #3489

Merged
7 commits merged into restic:master from async-pack-uploads on Jul 3, 2022

Conversation

MichaelEischer
Member

What does this PR change? What problem does it solve?

The backup pipeline can stall if lots of small files are backed up while using a high-latency backend, see #2696. This is caused by the fileSaver and blobSaver having to wait for a pack file to be uploaded whenever a blob fills it up. Blocking the fileSaver prevents further files from being processed and thus effectively limits the upload concurrency to 2 for small files.

This PR lets the repository upload finished packs in the background (in SaveAndEncrypt), without blocking the caller. The caller is only blocked if all uploader goroutines are busy. That way the fileSaver is no longer blocked when backing up small files. The archiver concurrency has been adapted to remove excess goroutines.

The concurrency of the pack uploader is based on the number of connections configured for each backend: the local backend defaults to 2 connections and sftp to 5. Configurable connections for these two backends are implemented in #3475.

To keep the number of temporary pack files low despite the increased upload concurrency, the packer manager now maintains only a single incomplete pack file. The pack file hash calculation is deferred to the uploader goroutines, as otherwise the single pack file would become a bottleneck. Simply dumping data chunks into the temporary pack file is rather fast: for me, BenchmarkPackerManager, which measures the pack assembly step, reports a throughput of 3 GB/s.
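A minimal sketch of the background upload scheme, with illustrative names (`uploadRequest`, `startPackUploaders`, and the `upload` callback are not restic's actual API): completed packs are handed off over a channel to a fixed pool of uploader goroutines sized by the backend's connection count, and failed uploads are reported through an errgroup.

```go
package uploader

import (
	"context"
	"os"

	"golang.org/x/sync/errgroup"
)

// uploadRequest describes a finished pack file waiting to be uploaded.
type uploadRequest struct {
	tmpfile *os.File
}

// startPackUploaders launches `connections` background uploader goroutines.
// Callers enqueue finished packs on the returned channel and only block once
// every uploader is busy; an upload error is returned through the errgroup,
// whose cancelled context signals the failure back to the producers.
func startPackUploaders(ctx context.Context, wg *errgroup.Group, connections int,
	upload func(ctx context.Context, req uploadRequest) error) chan<- uploadRequest {

	queue := make(chan uploadRequest)
	for i := 0; i < connections; i++ {
		wg.Go(func() error {
			for req := range queue {
				if err := upload(ctx, req); err != nil {
					return err
				}
			}
			return nil
		})
	}
	// The producer is expected to close(queue) once all packs are enqueued
	// and then wait on the errgroup to collect upload errors.
	return queue
}
```

With an unbuffered queue, the caller only stalls when all `connections` uploaders are already uploading, which matches the behavior described above.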

Was the change previously discussed in an issue or on the forum?

Fixes #2696

Should be combined with #3475.

Checklist

  • I have read the contribution guidelines.
  • I have enabled maintainer edits.
  • I have added tests for all code changes. Only the pack uploader has no tests of its own; everything else is already covered by existing tests.
  • I have added documentation for relevant changes (in the manual). I think we should document the `-o <backend>.connections=42` parameter somewhere.
  • There's a new file in changelog/unreleased/ that describes the changes for our users (see template).
  • I have run gofmt on the code in all commits.
  • All commit messages are formatted in the same style as the other commits in the repo.
  • I'm done! This pull request is ready for review.

@MichaelEischer MichaelEischer force-pushed the async-pack-uploads branch 2 times, most recently from ea2055c to 3fc0e8a on August 22, 2021 17:22
@metalsp0rk
Contributor

metalsp0rk commented Dec 3, 2021

After testing this, I'm seeing much better and more consistent usage of my network connection. Hope to see this implemented soon.

The SFTP backend on my local network goes from 550 Mbps to 875 Mbps on average when combined with #3475 (non-scientific anecdote).

@Slind14

Slind14 commented Jun 28, 2022

We ran a benchmark with this in comparison to 0.13.1:

| 10G Uplink | 0.13.1 | MichaelEischer:async-pack-uploads | diff |
| --- | --- | --- | --- |
| ext. S3 | 550M | 550M | - |
| Minio | 800M | 1.4G | 👍 |
| Rest Server | 2.2G | 1.6G | 👎 |

@MichaelEischer
Member Author

I've rebased the PR onto the current master. The upload statistics introduced by #3733 required quite a few changes, as the asynchronous upload no longer allows retrieving the pack header overhead from the call to Finalize. As a replacement, there's now a new method Packer.HeaderOverhead which estimates the pack header overhead before finalization. The removal of the tomb package from the archiver by #3785 also required a few changes.

Previously, SaveAndEncrypt would assemble blobs into packs and either
return immediately if the pack was not yet full or upload the pack file
otherwise. The upload blocked the current goroutine until it finished.

Now, the upload is done using separate goroutines. This requires changes
to the error handling. As uploads are no longer tied to a SaveAndEncrypt
call, failed uploads are signaled using an errgroup.

To count the uploaded amount of data, the pack header overhead is no
longer returned by `packer.Finalize` but rather by
`packer.HeaderOverhead`. This helper method is necessary to continue
returning the pack header overhead directly to the responsible call to
`repository.SaveBlob`. Without the method this would not be possible,
as packs are finalized asynchronously.
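
A hedged sketch of the HeaderOverhead idea (the constants and fields below are illustrative, not restic's actual on-disk format): because the header grows by one fixed-size entry per blob plus a constant tail, its size can be estimated from the blob count alone, before the pack is finalized.

```go
package pack

// Illustrative constants; restic's real entry and crypto sizes live in its
// pack and crypto packages and may differ.
const (
	entrySize      = 37 // per-blob header entry (type + length + ID), assumed
	cryptoOverhead = 32 // nonce + MAC added when the header is encrypted, assumed
	lengthField    = 4  // trailing header-length field, assumed
)

// Packer collects blobs for a single pack file (simplified).
type Packer struct {
	blobCount int
}

// HeaderOverhead estimates the size of the not-yet-written pack header. It
// can be called before Finalize, which lets SaveBlob report the overhead to
// its caller even though packs are now finalized asynchronously.
func (p *Packer) HeaderOverhead() int {
	return p.blobCount*entrySize + cryptoOverhead + lengthField
}
```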
Now, with the asynchronous uploaders, there is no longer any benefit from
using more blob savers than we have CPUs. Thus use just one blob saver for
each CPU we are allowed to use. A large number of tree savers has no obvious
benefit either; it can, however, increase the number of (potentially large)
trees kept in memory.
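
A small illustration of that concurrency choice, under the assumption that the saver count is derived from the CPUs the process may use (the function name is made up for this sketch):

```go
package archiver

import "runtime"

// saverConcurrency returns how many blob/tree saver goroutines to start.
// With pack uploads running in the background the savers are CPU-bound, so
// one goroutine per usable CPU suffices; extra savers would mainly keep more
// (potentially large) trees in memory.
func saverConcurrency() int {
	return runtime.GOMAXPROCS(0)
}
```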
Use only a single incomplete pack file to keep the number of open and
active pack files low. The main change here is to defer hashing the pack
file to the upload step. This prevents the pack assembly step from becoming
a bottleneck, as its only task is now to write data to the temporary pack
file.

The tests are cleaned up to no longer reimplement packer manager
functions.
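
A rough sketch of the deferred hashing, with assumed names rather than restic's actual code: pack assembly only appends bytes to the temporary file, and the uploader goroutine computes the pack's SHA-256 ID later by reading the finished file back.

```go
package repository

import (
	"crypto/sha256"
	"io"
	"os"
)

// hashPackFile computes the pack ID in the uploader goroutine by re-reading
// the finished temporary pack file from the start. Because hashing happens
// here rather than during assembly, writing blobs into the single
// in-progress pack stays cheap and does not become a bottleneck.
func hashPackFile(tmpfile *os.File) ([sha256.Size]byte, error) {
	var id [sha256.Size]byte
	if _, err := tmpfile.Seek(0, io.SeekStart); err != nil {
		return id, err
	}
	h := sha256.New()
	if _, err := io.Copy(h, tmpfile); err != nil {
		return id, err
	}
	copy(id[:], h.Sum(nil))
	return id, nil
}
```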
@MichaelEischer
Member Author

@Slind14 Thanks for testing. Does adding `-o s3.connections=10` / `-o rest.connections=10` improve the upload performance?

I'm going to merge this PR despite the rest-server performance regression, but we should still investigate it separately. The problem for me right now is that I have no way to reproduce the performance issues: a local backup to an NVMe disk achieves 500 MB/s throughput for me with several 1 GB files and apparently does not suffer from the performance problem.

@MichaelEischer MichaelEischer merged commit cd50feb into restic:master Jul 3, 2022
@MichaelEischer MichaelEischer deleted the async-pack-uploads branch July 3, 2022 09:56
@Slind14

Slind14 commented Jul 3, 2022

Going into the hundreds of connections, we can reach 3G.

3G seems to be the max restic can achieve; we get the same with a server right next to it on the LAN, no matter whether we use 10 or 100 connections.
