
Quadratic complexity when reading large entries #21

Open
nigeltao opened this issue Aug 20, 2021 · 3 comments

Comments


nigeltao commented Aug 20, 2021

Make a .tar archive containing a single large file, mount it with archivemount, and copy out that single file:

$ truncate --size=256M zeroes
$ tar cf zeroes-256mib.tar zeroes
$ archivemount zeroes-256mib.tar ~/mnt
$ dd if=~/mnt/zeroes of=/dev/null status=progress
267633152 bytes (268 MB, 255 MiB) copied, 77 s, 3.5 MB/s
524288+0 records in
524288+0 records out
268435456 bytes (268 MB, 256 MiB) copied, 77.5176 s, 3.5 MB/s

Notice that the "MB/s" throughput dd reports (visible in the interactive output, not in the copy/paste above) keeps dropping as the copy progresses. It looks like there's some sort of super-linear algorithm involved: the whole copy takes more than a minute, while a straight tar extraction takes less than a second.

$ time tar xf zeroes-256mib.tar
real    0m0.216s

Sprinkling some logging into the _ar_read function in archivemount.c shows that the dd run turns into many _ar_read calls. In the steady state, each call reads size = 128 KiB, with the offset argument advancing by that amount on each call.
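
For reference, _ar_read is archivemount's implementation of the libfuse read callback, which has this shape (the stub below is only to illustrate the calling convention, not archivemount's code; for the dd run above, FUSE issues roughly 2048 such calls, i.e. 256 MiB / 128 KiB):

#define FUSE_USE_VERSION 26
#include <fuse.h>

/* Illustrative stub: dd's sequential copy arrives here as a stream of calls
 * with size = 128 KiB and offset advanced by the previous call's size. */
static int ar_read(const char *path, char *buf, size_t size, off_t offset,
                   struct fuse_file_info *fi)
{
    (void)path; (void)buf; (void)size; (void)offset; (void)fi;
    return 0;   /* a real implementation returns the number of bytes copied */
}

const struct fuse_operations ar_ops = { .read = ar_read };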

Inside that _ar_read function, if we take the if (node->modified) false branch, then for each call:

  • we call archive_read_new, even though it's the same archive every time.
  • we iterate from the start of that archive until we find the archive_entry for the file. Again, this happens once per call, even though it's conceptually the same archive_entry object as in previous calls.
  • we copy out offset bytes from the start of the entry's contents to a 'trash' buffer, before finally copying out the payload that _ar_read is actually interested in.

The total number of bytes produced by archive_read_data calls is therefore quadratic in the decompressed size of the archive entry. That's already slow for .tar files and probably worse for .tar.gz files, where skipping the prefix also means decompressing it. The whole thing is reminiscent of the Shlemiel the Painter story.
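
Schematically, each _ar_read call on this path boils down to something like the following (a simplified sketch against the libarchive API, not the actual archivemount code; quadratic_read and the buffer sizes are illustrative):

#include <archive.h>
#include <archive_entry.h>
#include <string.h>
#include <sys/types.h>

static ssize_t quadratic_read(const char *archive_path, const char *entry_name,
                              char *buf, size_t size, off_t offset)
{
    struct archive *a = archive_read_new();        /* fresh reader on every call */
    struct archive_entry *entry;
    char trash[16 * 1024];
    ssize_t got = -1;

    archive_read_support_filter_all(a);
    archive_read_support_format_all(a);
    if (archive_read_open_filename(a, archive_path, 10240) != ARCHIVE_OK)
        goto out;

    /* Linear scan from the start of the archive to the wanted entry. */
    while (archive_read_next_header(a, &entry) == ARCHIVE_OK) {
        if (strcmp(archive_entry_pathname(entry), entry_name) != 0)
            continue;

        /* Decompress and throw away everything before `offset`. */
        off_t skipped = 0;
        while (skipped < offset) {
            size_t want = sizeof trash;
            if ((off_t)want > offset - skipped)
                want = (size_t)(offset - skipped);
            ssize_t n = archive_read_data(a, trash, want);
            if (n <= 0)
                goto out;
            skipped += n;
        }

        got = archive_read_data(a, buf, size);     /* the bytes the caller wanted */
        break;
    }
out:
    archive_read_free(a);
    return got;
}

Summed over a full sequential read, the skip loop alone touches offset bytes per call, i.e. 0 + 128 KiB + 256 KiB + ..., which for the 256 MiB example works out to roughly 256 GiB pushed through archive_read_data in total; that is consistent with the observed 77 seconds at a few GiB/s of copying.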

There may be some complications if we're re-writing the archive, but when mounting read-only, we should be able to get much better performance.
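
For illustration, a read-only fast path could cache the libarchive handle and the current position inside the entry, so a sequential reader like dd pays for each decompressed byte only once and only a backward seek forces a re-open. A minimal sketch (hypothetical names; reopen_at_entry stands for the open-and-scan logic from the sketch above):

#include <archive.h>
#include <sys/types.h>

struct cached_reader {
    struct archive *a;   /* reader positioned somewhere inside the wanted entry */
    off_t pos;           /* bytes of the entry's data consumed so far */
};

/* Assumed helper (not shown): (re)open the archive and scan to the entry,
 * leaving r->a positioned at byte 0 of the entry's data. */
int reopen_at_entry(struct cached_reader *r,
                    const char *archive_path, const char *entry_name);

static ssize_t cached_read(struct cached_reader *r,
                           const char *archive_path, const char *entry_name,
                           char *buf, size_t size, off_t offset)
{
    char trash[16 * 1024];

    /* First use or backward seek: fall back to a full re-open. */
    if (r->a == NULL || offset < r->pos) {
        if (reopen_at_entry(r, archive_path, entry_name) != 0)
            return -1;
        r->pos = 0;
    }

    /* Forward seek: decompress and discard only the gap, not the whole prefix. */
    while (r->pos < offset) {
        size_t want = sizeof trash;
        if ((off_t)want > offset - r->pos)
            want = (size_t)(offset - r->pos);
        ssize_t n = archive_read_data(r->a, trash, want);
        if (n <= 0)
            return -1;
        r->pos += n;
    }

    ssize_t n = archive_read_data(r->a, buf, size);
    if (n > 0)
        r->pos += n;
    return n;
}

With that, dd's strictly increasing offsets never trigger the re-open branch, so the total work stays linear in the entry size.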

@nigeltao (Author)

https://github.com/google/fuse-archive is an alternative implementation with linear (not quadratic) complexity.


mxmlnkn commented Feb 13, 2022

I did some benchmarks to visualize this problem. The setup is simple: create an archive containing a single file and measure how long it takes to read that file. Repeat for different file sizes, tools, and compression backends.

[Figure: bandwidth comparison — read throughput versus archive size for archivemount and alternative tools across compression backends]


nabijaczleweli commented Jun 16, 2024

Fixed in archivemount-ng in https://git.sr.ht/~nabijaczleweli/archivemount-ng/commit/6353505d958a3db6d91dba90921bd660ab530c77#archivemount.cpp
