feat: optimize LZMA processing with enhanced multiReader implementation for io.ByteReader #95

orisano · 2023-06-17T17:20:32Z

This pull request aims to optimize the LZMA2 processing in the github.com/bodgit/sevenzip project by enhancing the usage of github.com/ulikunitz/xz library for io.ByteReader.

The current implementation doesn't fully leverage the optimizations provided by github.com/ulikunitz/xz due to its usage of io.MultiReader. As a result, the proposed change modifies the code to return a custom implementation of multiReader based on io.ReadCloser when io.ByteReader is implemented.

The optimized multiReader implementation exhibits significant performance improvements, especially when handling large 7z archives. While the impact might be limited for smaller inputs, it can provide an efficiency boost of nearly 10% for larger files.

before:

goos: linux
goarch: amd64
pkg: github.com/bodgit/sevenzip
cpu: Intel(R) Core(TM) i7-8650U CPU @ 1.90GHz
BenchmarkLargeLZMA2-8   	       1	323613023997 ns/op
BenchmarkLZMA2-8        	     816	   1648719 ns/o

after:

goos: linux
goarch: amd64
pkg: github.com/bodgit/sevenzip
cpu: Intel(R) Core(TM) i7-8650U CPU @ 1.90GHz
BenchmarkLargeLZMA2-8   	       1	307210122907 ns/op
BenchmarkLZMA2-8        	     926	   1512918 ns/op

I kindly request your review and feedback on this pull request. Thank you for considering this contribution.

bodgit · 2023-06-19T10:35:26Z

I'm slightly confused as your PR and benchmark performance references LZMA2, but the reader you've changed is only used for LZMA(1) streams; there's a separate reader used for LZMA2.

orisano · 2023-06-19T10:57:40Z

@bodgit thanks for reply.
Sorry for your confusion.
The folder 0 of testdata/lzma2.7z is compressed with lzma and the lzma reader is called instead of lzma2.
The 7z I am working with now has folder 0 as lzma as testdata/lzma2.7z, and its large size affects performance.

bodgit · 2023-06-19T14:54:43Z

The folder 0 of testdata/lzma2.7z is compressed with lzma and the lzma reader is called instead of lzma2.

In some cases there are actually two folder 0 in an archive. If the archive headers are compressed and/or encrypted then there is a separate compressed stream which contains a single folder that holds the archive headers/metadata, as well as the usual folder 0 (... & 1, 2, etc.) that holds the compressed file contents.

In the case of the testdata/lzma2.7z, I think this why you've noticed it use the LZMA reader. I think the headers are compressed, (7z a -mhc=on ...), which seems to always use LZMA, I'm not sure if there's any way to tell 7z to use something else. The file contents are compressed with LZMA2 though:

$ 7z l -slt testdata/lzma2.7z | grep ^Method
Method = LZMA2:48k
...
Method = LZMA2:48k

The 7z I am working with now has folder 0 as lzma as testdata/lzma2.7z, and its large size affects performance.

So I suppose my question is, is this folder 0 in your archive the compressed headers, or is it the first folder containing some/all of the compressed files?

I don't necessarily have a problem with the PR as such, just trying to understand where the slowdown is. Is the LZMA2 reader going to suffer the same problem, given it uses the same github.com/ulikunitz/xz package?

orisano · 2023-06-19T15:10:10Z

So I suppose my question is, is this folder 0 in your archive the compressed headers, or is it the first folder containing some/all of the compressed files?

Compressed headers.

I don't necessarily have a problem with the PR as such, just trying to understand where the slowdown is. Is the LZMA2 reader going to suffer the same problem, given it uses the same github.com/ulikunitz/xz package?

LZMA2 Reader passes io.ReadCloser directly to the xz package, so this is not a problem.
The xz package wraps io.ReadCloser with io.LimitReader, which degrades performance.
If ulikunitz/xz#56 is merged, the optimization will take effect in the body of lzma2.
This is an xz issue, so it does not matter.

orisano · 2023-06-20T12:55:56Z

benchstat:

goos: linux
goarch: amd64
pkg: github.com/bodgit/sevenzip
cpu: Intel(R) Core(TM) i7-8650U CPU @ 1.90GHz
             │  base.txt   │             xz#55.txt              │    xz#55_xz#56_sevenzip#95.txt     │
             │   sec/op    │   sec/op    vs base                │   sec/op    vs base                │
LargeLZMA2-8   10.775 ± 0%   8.704 ± 0%  -19.22% (p=0.000 n=10)   8.384 ± 0%  -22.19% (p=0.000 n=10)

bodgit · 2023-07-03T11:24:39Z

I've merged this PR and released it as v1.4.3. If/when the xz package is updated and contains your fix, I'll create a further release.

Thanks for the contribution!

feat: add io.ByteReader optimization

9291662

orisano changed the title ~~feat: optimize LZMA processing with enhanced multiReader implementation for io.ByteReader~~ feat: optimize LZMA2 processing with enhanced multiReader implementation for io.ByteReader Jun 17, 2023

bodgit self-assigned this Jun 19, 2023

bodgit added the enhancement New feature or request label Jun 19, 2023

orisano changed the title ~~feat: optimize LZMA2 processing with enhanced multiReader implementation for io.ByteReader~~ feat: optimize LZMA processing with enhanced multiReader implementation for io.ByteReader Jun 19, 2023

bodgit merged commit 9291662 into bodgit:master Jul 3, 2023

orisano deleted the feat/byte-reader-optimization branch July 3, 2023 11:15

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat: optimize LZMA processing with enhanced multiReader implementation for io.ByteReader #95

feat: optimize LZMA processing with enhanced multiReader implementation for io.ByteReader #95

orisano commented Jun 17, 2023 •

edited

Loading

bodgit commented Jun 19, 2023

orisano commented Jun 19, 2023

bodgit commented Jun 19, 2023

orisano commented Jun 19, 2023 •

edited

Loading

orisano commented Jun 20, 2023

bodgit commented Jul 3, 2023

feat: optimize LZMA processing with enhanced multiReader implementation for io.ByteReader #95

feat: optimize LZMA processing with enhanced multiReader implementation for io.ByteReader #95

Conversation

orisano commented Jun 17, 2023 • edited Loading

bodgit commented Jun 19, 2023

orisano commented Jun 19, 2023

bodgit commented Jun 19, 2023

orisano commented Jun 19, 2023 • edited Loading

orisano commented Jun 20, 2023

bodgit commented Jul 3, 2023

orisano commented Jun 17, 2023 •

edited

Loading

orisano commented Jun 19, 2023 •

edited

Loading