v1.5.0 speed regressions #2662
There is another speed regression commit. EDIT: I ran the pyzstd module's unit-tests on Windows 10 (msvc2019, i5-4570): On WSL2 (gcc-9.3.0, i5-4570): On Windows 10 (msvc2019, AMD 3600X): How to run the unit-tests:
|
I have recently upgraded C-Blosc2 to Zstd 1.5.0 (from 1.4.9), and I am detecting performance regressions too, especially on the compression side of things. The differences are very apparent on this Intel box: Clear Linux, GCC 11, i9-10940X @ 3.30GHz: Before (zstd 1.4.9):
After (zstd 1.5.0):
As you see, the compression speed can be more than 2x slower with 1.5.0. Decompression does not seem to be affected. Curiously, with Apple M1 (Apple clang 12.0.5) the differences are almost negligible: Before (zstd 1.4.9):
After (zstd 1.5.0):
To reproduce the benchmarks, build the library following the instructions in: and the |
@FrancescAlted I can indeed measure a speed regression as well with your
So I'm measuring around a 30% regression on gcc with an i9-9900k. The file is two long strings repeated many times, so it's not inconceivable that the new match finder doesn't handle this particular case as well. One way to mitigate this in your tool in particular could be simply disabling the new match finder. Basically, you just need to
and for
Though maybe this warrants some more investigation. A |
I think what is happening is that this file has a lot of very long matches. The row-based match finder's update function is slower, but it makes up for it in much faster searches. But, when you have very long matches (100s of bytes at least), the update function can start to dominate. And the hash chain's function is very, very simple. I did a simple experiment. I added this code:

if (target - idx > 100) {
    idx = target - 100;
    if (useCache)
        ZSTD_row_fillHashCache(ms, base, rowLog, mls, idx, ip + 1);
}

here. I don't think the code is quite right, but it produces a valid result so whatever. The result is:
So, I think that we should investigate the right skipping strategy for (in)compressible sections. |
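The cap-the-update idea above can be sketched outside of zstd. The following is a toy Python model (hypothetical helper, not zstd's actual code) of the cost of inserting positions into a match finder's table: after a very long match, the region between the last inserted index and the current position would normally be inserted wholesale, and skipping all but the last `cap` positions bounds that cost.

```python
# Toy model of capping match-finder index updates after a long match
# (an illustrative sketch, not zstd's real data structures).

def positions_to_insert(last_idx: int, target: int, cap: int = 100) -> range:
    """Return the range of positions actually inserted into the table."""
    if target - last_idx > cap:
        last_idx = target - cap   # skip ahead, as in the experiment above
    return range(last_idx, target)

# A 10_000-byte match normally forces 10_000 insertions; capped, only 100.
assert len(positions_to_insert(0, 10_000)) == 100
# Short gaps are unaffected by the cap.
assert len(positions_to_insert(9_950, 10_000)) == 50
```

The trade-off is that skipped positions can never be found as matches later, which is why the comment above frames this as a strategy for (in)compressible sections rather than a universal change.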
The two regressions I reported are invalid: 634bfd3, 980f3bb |
fixed in #2755 |
This should be fixed by #2755 |
Hi! We're still experiencing this regression very badly with the latest release in Pijul (https://pijul.org), which uses the zstd-seekable format. We have been pinning our version of ZStd for quite a while, causing endless issues with the different platforms we support. ZStd 1.5 is at least 10 times slower than ZStd 1.4.8, even on our most basic test cases. |
This is a really huge regression, and we haven't observed anything that bad in our tests so far. We would be interested in understanding better your specific scenario, |
@P-E-Meunier this is likely unrelated to this Issue. Could you please provide a way for us to reproduce the issue? Can you also double check that you're building zstd with asserts disabled |
I'm actually using the zstd provided by my Linux distribution (NixOS), but the exact same build script is impressively fast with 1.4.8 and extremely slow with 1.5.0. Reproducing the issue is not particularly easy, since ZStd really sits at the bottom of our stack. I will try to make a more minimal example than what I have. |
If you want to try it now anyway, the way to reproduce this is:
On ZStd 1.4.8, the last command ( One way to debug this could be to download the latest source code for Pijul by running |
Have you tried |
Yes, 1.5.4 is still affected, see https://nest.pijul.com/pijul/pijul/discussions/761 |
Just a minor note in case you want to test: we're now packaging ZStd 1.4.8 along with our bindings in order to avoid the speed regression, so this regression is harder to observe. I'll still try to put together a test case. |
I would like to reproduce the faulty scenario. A few initial questions come to mind: in the pijul scenario experiencing the slowdown
Looking at the proposed reproduction scenario: |
By the way,
with the local system providing Btw, does |
The compression level is 10 and the frame size is 256. I don't know about the compression ratio. There is no read test. Here is the Rust code we use to compress; you should be able to reproduce this by compiling with the zstd-seekable crate at version 1.7.0 (later versions use an older zstd-seekable).

const LEVEL: usize = 10;
const FRAME_SIZE: usize = 256;

fn compress(input: &[u8], w: &mut Vec<u8>) -> Result<(), ChangeError> {
    info!("compressing with ZStd {}", zstd_seekable::version().to_str().unwrap());
    let mut cstream = zstd_seekable::SeekableCStream::new(LEVEL, FRAME_SIZE).unwrap();
    let mut output = [0; 4096];
    let mut input_pos = 0;
    while input_pos < input.len() {
        let (out_pos, inp_pos) = cstream.compress(&mut output, &input[input_pos..])?;
        w.write_all(&output[..out_pos])?;
        input_pos += inp_pos;
    }
    while let Ok(n) = cstream.end_stream(&mut output) {
        if n == 0 {
            break;
        }
        w.write_all(&output[..n])?;
    }
    Ok(())
}

The latest Pijul (beta.4) first tries to find ZStd >= 1.4.0 and < 1.5.0 on the system via pkg-config, and uses its own version (ZStd 1.4.8) if that doesn't work. On Windows, it always uses its own version. Pijul doesn't do any of that on its own; the much smaller zstd-seekable crate is responsible for handling all that. |
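For readers without a Rust toolchain, the shape of that streaming loop (feed fixed-size chunks, write whatever the stream emits, then drain it at the end) can be illustrated with Python's stdlib `zlib` as a stand-in compressor. This only mirrors the loop pattern, not the seekable zstd format:

```python
import zlib

def compress_chunked(data: bytes, chunk_size: int = 4096) -> bytes:
    # Mirrors the Rust loop above: feed fixed-size chunks into a
    # streaming compressor, then flush the stream (like end_stream()).
    c = zlib.compressobj(level=6)
    out = bytearray()
    for pos in range(0, len(data), chunk_size):
        out += c.compress(data[pos:pos + chunk_size])
    out += c.flush()
    return bytes(out)

payload = b"two long strings repeated many times " * 10_000
compressed = compress_chunked(payload)
assert zlib.decompress(compressed) == payload
assert len(compressed) < len(payload)
```

One difference worth noting: in the seekable format each frame is independently compressed, so a small `FRAME_SIZE` resets the compressor's history every 256 bytes, whereas a single zlib stream keeps its full history.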
About the faulty reproduction scenario: yes, this is a large file filled with zeros, we observe the same problem with actual text code files if they are large enough, which is how this issue was found. |
A frame size of 256 is rather small. @P-E-Meunier - mind sharing why you have chosen such a small frame size? While a file of zeroes compresses very well, more complex files will not compress as well when the frame size is so small. Additionally, you pay a large overhead for compression / decompression of every frame. |
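A rough back-of-the-envelope shows why the per-frame overhead matters. Assuming roughly 10 bytes of zstd framing (magic number, frame header, final block header) plus an 8-byte seek-table entry per frame, both of which are assumptions here rather than exact figures:

```python
def frame_count(file_size: int, frame_size: int) -> int:
    # Ceiling division: number of frames needed to cover the file.
    return -(-file_size // frame_size)

def overhead_bytes(file_size: int, frame_size: int, per_frame: int = 18) -> int:
    # per_frame is an assumed ~10 bytes of zstd framing + 8-byte seek-table entry.
    return frame_count(file_size, frame_size) * per_frame

mb100 = 100 * 1024 * 1024
assert frame_count(mb100, 256) == 409_600     # 256-byte frames
assert frame_count(mb100, 4096) == 25_600     # 4 KiB frames
# ~7 MiB of pure overhead at 256 bytes vs ~0.44 MiB at 4 KiB,
# before counting the ratio lost to resetting history every frame.
assert overhead_bytes(mb100, 256) // (1024 * 1024) == 7
```

And that is only the storage side; the time cost of setting up and tearing down a compression job every 256 bytes is what the reported slowdown is about.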
I did this a few years ago. I remember benchmarking a few sizes and not noticing a large difference there, but I will try a larger frame size, thanks for the advice. I haven't followed the changes closely in ZStd, but does that explain why ZStd 1.5 was orders of magnitude slower than 1.4.9 in our particular case? |
Yes, it does.
That part is surprising. |
According to the posted code 256 is the frame size fed into the seekable-zstd library, so I believe it's the same thing. |
Just tested our slowest benchmark, and 4K is indeed very fast with ZStd ≥ 1.5. Thanks again! |
As reported by @P-E-Meunier in #2662 (comment), seekable format ingestion speed can be particularly slow when the selected `FRAME_SIZE` is very small, especially in combination with the recent row_hash compression mode. The specific scenario mentioned was `pijul`, using frame sizes of 256 bytes and level 10. This is improved in this PR, by providing approximate parameter adaptation to the compression process. Tested locally on an M1 laptop, ingestion of `enwik8` using `pijul` parameters went from 35 sec. (before this PR) to 2.5 sec. (with this PR). For the specific corner case of a file full of zeroes, this is even more pronounced, going from 45 sec. to 0.5 sec. These benefits are unrelated to (and come on top of) other improvement efforts currently being made by @yoniko for the row_hash compression method specifically. The `seekable_compress` test program has been updated to allow setting the compression level, in order to produce these performance results.
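The adaptation idea is roughly this: when each frame holds only a few hundred bytes, a large search window buys nothing, so window-related parameters can be clamped to the frame size. A hedged sketch of such a policy follows (hypothetical function and constants, not the PR's exact logic):

```python
import math

def adapt_window_log(requested_window_log: int, frame_size: int,
                     min_window_log: int = 10) -> int:
    # No point keeping a search window larger than one frame: clamp
    # windowLog to ceil(log2(frame_size)), floored at an assumed minimum.
    needed = max(min_window_log, math.ceil(math.log2(max(frame_size, 2))))
    return min(requested_window_log, needed)

# A 256-byte frame needs only the assumed 2^10 floor, however high the
# requested level's windowLog is; a 1 MiB frame keeps a 2^20 window.
assert adapt_window_log(24, 256) == 10
assert adapt_window_log(24, 1 << 20) == 20
```

Smaller windows also shrink the match-finder tables that have to be initialized per frame, which is plausibly where most of the reported 35 sec. to 2.5 sec. improvement comes from.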
A msvc2019 speed regression commit: 634bfd3
EDIT:
This regression only occurs when using the /Ob3 compiler option, AND on the i5-4570. With the /Ob2 option, OR on the Ryzen 3600X, 634bfd3 shows no speed regression. /Ob3 specifies more aggressive inlining than /Ob2: https://docs.microsoft.com/en-us/cpp/build/reference/ob-inline-function-expansion
before:
after:
The unit of c_speed/d_speed is MB/s.