
Quadratic complexity when reading large entries #21

Open
nigeltao opened this issue Aug 20, 2021 · 3 comments

Comments


nigeltao commented Aug 20, 2021

Make a .tar archive containing a single large file, mount it with archivemount, and copy out that single file:

$ truncate --size=256M zeroes
$ tar cf zeroes-256mib.tar zeroes
$ archivemount zeroes-256mib.tar ~/mnt
$ dd if=~/mnt/zeroes of=/dev/null status=progress
267633152 bytes (268 MB, 255 MiB) copied, 77 s, 3.5 MB/s
524288+0 records in
524288+0 records out
268435456 bytes (268 MB, 256 MiB) copied, 77.5176 s, 3.5 MB/s

Notice that the "MB/s" throughput dd reports (visible in the interactive output, not in the copy/paste above) keeps dropping as the copy progresses. It looks like there's some sort of super-linear algorithm involved: the whole copy takes more than a minute, while a straight tar extraction takes less than a second.

$ time tar xf zeroes-256mib.tar
real    0m0.216s

Sprinkling some logging into the _ar_read function in archivemount.c shows that the dd run turns into many _ar_read calls. In the steady state, each call reads size = 128 KiB, with the offset argument advancing by that amount on each call.
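
For reference, _ar_read is archivemount's implementation of the libfuse read callback, which has this shape (the stub below is only to illustrate the calling convention, not archivemount's code; for the dd run above, FUSE issues roughly 2048 such calls, i.e. 256 MiB / 128 KiB):

#define FUSE_USE_VERSION 26
#include <fuse.h>

/* Illustrative stub: dd's sequential copy arrives here as a stream of calls
 * with size = 128 KiB and offset advanced by the previous call's size. */
static int ar_read(const char *path, char *buf, size_t size, off_t offset,
                   struct fuse_file_info *fi)
{
    (void)path; (void)buf; (void)size; (void)offset; (void)fi;
    return 0;   /* a real implementation returns the number of bytes copied */
}

const struct fuse_operations ar_ops = { .read = ar_read };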

Inside that _ar_read function, if we take the if (node->modified) false branch, then for each call:

  • we call archive_read_new, even though it's the same archive every time.
  • we iterate from the start of that archive until we find the archive_entry for the file. Again, this happens once per call, even though it's conceptually the same archive_entry object as in previous calls.
  • we copy out offset bytes from the start of the entry's contents to a 'trash' buffer, before finally copying out the payload that _ar_read is actually interested in.

The total number of bytes produced by archive_read_data calls is therefore quadratic in the decompressed size of the archive entry. That's already slow for .tar files and probably worse for .tar.gz files, where skipping the prefix also means decompressing it. The whole thing is reminiscent of the Shlemiel the Painter story.
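
Schematically, each _ar_read call on this path boils down to something like the following (a simplified sketch against the libarchive API, not the actual archivemount code; quadratic_read and the buffer sizes are illustrative):

#include <archive.h>
#include <archive_entry.h>
#include <string.h>
#include <sys/types.h>

static ssize_t quadratic_read(const char *archive_path, const char *entry_name,
                              char *buf, size_t size, off_t offset)
{
    struct archive *a = archive_read_new();        /* fresh reader on every call */
    struct archive_entry *entry;
    char trash[16 * 1024];
    ssize_t got = -1;

    archive_read_support_filter_all(a);
    archive_read_support_format_all(a);
    if (archive_read_open_filename(a, archive_path, 10240) != ARCHIVE_OK)
        goto out;

    /* Linear scan from the start of the archive to the wanted entry. */
    while (archive_read_next_header(a, &entry) == ARCHIVE_OK) {
        if (strcmp(archive_entry_pathname(entry), entry_name) != 0)
            continue;

        /* Decompress and throw away everything before `offset`. */
        off_t skipped = 0;
        while (skipped < offset) {
            size_t want = sizeof trash;
            if ((off_t)want > offset - skipped)
                want = (size_t)(offset - skipped);
            ssize_t n = archive_read_data(a, trash, want);
            if (n <= 0)
                goto out;
            skipped += n;
        }

        got = archive_read_data(a, buf, size);     /* the bytes the caller wanted */
        break;
    }
out:
    archive_read_free(a);
    return got;
}

Summed over a full sequential read, the skip loop alone touches offset bytes per call, i.e. 0 + 128 KiB + 256 KiB + ..., which for the 256 MiB example works out to roughly 256 GiB pushed through archive_read_data in total; that is consistent with the observed 77 seconds at a few GiB/s of copying.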

There may be some complications if we're re-writing the archive, but when mounting read-only, we should be able to get much better performance.
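
For illustration, a read-only fast path could cache the libarchive handle and the current position inside the entry, so a sequential reader like dd pays for each decompressed byte only once and only a backward seek forces a re-open. A minimal sketch (hypothetical names; reopen_at_entry stands for the open-and-scan logic from the sketch above):

#include <archive.h>
#include <sys/types.h>

struct cached_reader {
    struct archive *a;   /* reader positioned somewhere inside the wanted entry */
    off_t pos;           /* bytes of the entry's data consumed so far */
};

/* Assumed helper (not shown): (re)open the archive and scan to the entry,
 * leaving r->a positioned at byte 0 of the entry's data. */
int reopen_at_entry(struct cached_reader *r,
                    const char *archive_path, const char *entry_name);

static ssize_t cached_read(struct cached_reader *r,
                           const char *archive_path, const char *entry_name,
                           char *buf, size_t size, off_t offset)
{
    char trash[16 * 1024];

    /* First use or backward seek: fall back to a full re-open. */
    if (r->a == NULL || offset < r->pos) {
        if (reopen_at_entry(r, archive_path, entry_name) != 0)
            return -1;
        r->pos = 0;
    }

    /* Forward seek: decompress and discard only the gap, not the whole prefix. */
    while (r->pos < offset) {
        size_t want = sizeof trash;
        if ((off_t)want > offset - r->pos)
            want = (size_t)(offset - r->pos);
        ssize_t n = archive_read_data(r->a, trash, want);
        if (n <= 0)
            return -1;
        r->pos += n;
    }

    ssize_t n = archive_read_data(r->a, buf, size);
    if (n > 0)
        r->pos += n;
    return n;
}

With that, dd's strictly increasing offsets never trigger the re-open branch, so the total work stays linear in the entry size.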

@nigeltao (Author)

https://github.com/google/fuse-archive is an alternative implementation with linear (not quadratic) complexity.


mxmlnkn commented Feb 13, 2022

I did some benchmarks to visualize this problem. The setup is simple: create an archive containing a single file and measure how long it takes to read that file. Repeat for different file sizes, tools, and compression backends.

[Figure: bandwidth comparison — read throughput versus archive size for archivemount and alternative tools across compression backends]


nabijaczleweli commented Jun 16, 2024

Fixed in archivemount-ng in https://git.sr.ht/~nabijaczleweli/archivemount-ng/commit/6353505d958a3db6d91dba90921bd660ab530c77#archivemount.cpp
