v1.5.0 speed regressions #2662
There is another speed regression commit. EDIT: I ran the pyzstd module's unit-tests on Windows 10 (msvc2019, i5-4570): On WSL2 (gcc-9.3.0, i5-4570): On Windows 10 (msvc2019, AMD 3600X): How to run the unit-tests:
|
I have recently upgraded C-Blosc2 to Zstd 1.5.0 (from 1.4.9), and I am detecting performance regressions too, especially on the compression side of things. The differences are very apparent on this Intel box: Clear Linux, GCC 11, i9-10940X @ 3.30GHz: Before (zstd 1.4.9):
After (zstd 1.5.0):
As you see, the compression speed can be more than 2x slower with 1.5.0. Decompression does not seem to be affected. Curiously, with Apple M1 (Apple clang 12.0.5) the differences are almost negligible: Before (zstd 1.4.9):
After (zstd 1.5.0):
To reproduce the benchmarks, build the library following the instructions in: and the |
@FrancescAlted I can indeed measure a speed regression as well with your
So I'm measuring around a 30% regression on gcc with an i9-9900k. The file is two long strings repeated many times, so it's not inconceivable that the new match finder doesn't handle this particular case as well. One way to mitigate this in your tool in particular could be simply disabling the new match finder. Basically, you just need to
and for
Though maybe this warrants some more investigation. A |
I think what is happening is that this file has a lot of very long matches. The row-based match finder's update function is slower, but it makes up for it in much faster searches. But, when you have very long matches (100s of bytes at least), the update function can start to dominate. And the hash chain's function is very, very simple. I did a simple experiment. I added this code:

if (target - idx > 100) {
    idx = target - 100;
    if (useCache)
        ZSTD_row_fillHashCache(ms, base, rowLog, mls, idx, ip + 1);
}

here. I don't think the code is quite right, but it produces a valid result so whatever. The result is:
So, I think that we should investigate the right skipping strategy for (in)compressible sections. |
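The cap-the-update idea above can be sketched outside of zstd. The following is a toy Python model (hypothetical helper, not zstd's actual code) of the cost of inserting positions into a match finder's table: after a very long match, the region between the last inserted index and the current position would normally be inserted wholesale, and skipping all but the last `cap` positions bounds that cost.

```python
# Toy model of capping match-finder index updates after a long match
# (an illustrative sketch, not zstd's real data structures).

def positions_to_insert(last_idx: int, target: int, cap: int = 100) -> range:
    """Return the range of positions actually inserted into the table."""
    if target - last_idx > cap:
        last_idx = target - cap   # skip ahead, as in the experiment above
    return range(last_idx, target)

# A 10_000-byte match normally forces 10_000 insertions; capped, only 100.
assert len(positions_to_insert(0, 10_000)) == 100
# Short gaps are unaffected by the cap.
assert len(positions_to_insert(9_950, 10_000)) == 50
```

The trade-off is that skipped positions can never be found as matches later, which is why the comment above frames this as a strategy for (in)compressible sections rather than a universal change.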
The two regressions I reported are invalid: 634bfd3, 980f3bb |
fixed in #2755 |
This should be fixed by #2755 |
Hi! We're still experiencing this regression very badly with the latest release in Pijul (https://pijul.org), which uses the zstd-seekable format. We have been pinning our version of ZStd for quite a while, causing endless issues with the different platforms we support. ZStd 1.5 is at least 10 times slower than ZStd 1.4.8, even on our most basic test cases. |
This is a really huge regression, and we haven't observed anything that bad in our tests so far. We would be interested in understanding better your specific scenario, |
@P-E-Meunier this is likely unrelated to this Issue. Could you please provide a way for us to reproduce the issue? Can you also double check that you're building zstd with asserts disabled |
I'm actually using the zstd provided by my Linux distribution (NixOS), but the exact same build script is impressively fast with 1.4.8 and extremely slow with 1.5.0. Reproducing the issue is not particularly easy, since ZStd really sits at the bottom of our stack. I will try to make a more minimal example than what I have. |
If you want to try it now anyway, the way to reproduce this is:
On ZStd 1.4.8, the last command ( One way to debug this could be to download the latest source code for Pijul by running |
Have you tried |
Yes, 1.5.4 is still affected, see https://nest.pijul.com/pijul/pijul/discussions/761 |
Just a minor note in case you want to test: we're now packaging ZStd 1.4.8 along with our bindings in order to avoid the speed regression, so this regression is harder to observe. I'll still try to put together a test case. |
I would like to reproduce the faulty scenario. A few initial questions come to mind: in the pijul scenario experiencing the slowdown
Looking at the proposed reproduction scenario: |
By the way,
with the local system providing Btw, does |
The compression level is 10 and the frame size is 256. I don't know about the compression ratio. There is no read test. Here is the Rust code we use to compress; you should be able to reproduce this by compiling with the zstd-seekable crate at version 1.7.0 (later versions use an older zstd-seekable).

const LEVEL: usize = 10;
const FRAME_SIZE: usize = 256;

fn compress(input: &[u8], w: &mut Vec<u8>) -> Result<(), ChangeError> {
    info!("compressing with ZStd {}", zstd_seekable::version().to_str().unwrap());
    let mut cstream = zstd_seekable::SeekableCStream::new(LEVEL, FRAME_SIZE).unwrap();
    let mut output = [0; 4096];
    let mut input_pos = 0;
    while input_pos < input.len() {
        let (out_pos, inp_pos) = cstream.compress(&mut output, &input[input_pos..])?;
        w.write_all(&output[..out_pos])?;
        input_pos += inp_pos;
    }
    while let Ok(n) = cstream.end_stream(&mut output) {
        if n == 0 {
            break;
        }
        w.write_all(&output[..n])?;
    }
    Ok(())
}

The latest Pijul (beta.4) first tries to find ZStd >= 1.4.0 and < 1.5.0 on the system via pkg-config, and uses its own version (ZStd 1.4.8) if that doesn't work. On Windows, it always uses its own version. Pijul doesn't do any of that on its own; the much smaller zstd-seekable crate is responsible for handling all that. |
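For readers without a Rust toolchain, the shape of that streaming loop (feed fixed-size chunks, write whatever the stream emits, then drain it at the end) can be illustrated with Python's stdlib `zlib` as a stand-in compressor. This only mirrors the loop pattern, not the seekable zstd format:

```python
import zlib

def compress_chunked(data: bytes, chunk_size: int = 4096) -> bytes:
    # Mirrors the Rust loop above: feed fixed-size chunks into a
    # streaming compressor, then flush the stream (like end_stream()).
    c = zlib.compressobj(level=6)
    out = bytearray()
    for pos in range(0, len(data), chunk_size):
        out += c.compress(data[pos:pos + chunk_size])
    out += c.flush()
    return bytes(out)

payload = b"two long strings repeated many times " * 10_000
compressed = compress_chunked(payload)
assert zlib.decompress(compressed) == payload
assert len(compressed) < len(payload)
```

One difference worth noting: in the seekable format each frame is independently compressed, so a small `FRAME_SIZE` resets the compressor's history every 256 bytes, whereas a single zlib stream keeps its full history.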
About the faulty reproduction scenario: yes, this is a large file filled with zeros, we observe the same problem with actual text code files if they are large enough, which is how this issue was found. |
A frame size of 256 is rather small. @P-E-Meunier - mind sharing why you have chosen such a small frame size? While a file of zeroes compresses very well, more complex files will not compress as well when the frame size is so small. Additionally, you pay a large overhead for compression / decompression of every frame. |
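A rough back-of-the-envelope shows why the per-frame overhead matters. Assuming roughly 10 bytes of zstd framing (magic number, frame header, final block header) plus an 8-byte seek-table entry per frame, both of which are assumptions here rather than exact figures:

```python
def frame_count(file_size: int, frame_size: int) -> int:
    # Ceiling division: number of frames needed to cover the file.
    return -(-file_size // frame_size)

def overhead_bytes(file_size: int, frame_size: int, per_frame: int = 18) -> int:
    # per_frame is an assumed ~10 bytes of zstd framing + 8-byte seek-table entry.
    return frame_count(file_size, frame_size) * per_frame

mb100 = 100 * 1024 * 1024
assert frame_count(mb100, 256) == 409_600     # 256-byte frames
assert frame_count(mb100, 4096) == 25_600     # 4 KiB frames
# ~7 MiB of pure overhead at 256 bytes vs ~0.44 MiB at 4 KiB,
# before counting the ratio lost to resetting history every frame.
assert overhead_bytes(mb100, 256) // (1024 * 1024) == 7
```

And that is only the storage side; the time cost of setting up and tearing down a compression job every 256 bytes is what the reported slowdown is about.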
I did this a few years ago. I remember benchmarking a few sizes and not noticing a large difference there, but I will try a larger frame size, thanks for the advice. I haven't followed the changes closely in ZStd, but does that explain why ZStd 1.5 was orders of magnitude slower than 1.4.9 in our particular case? |
Yes, it does.
That part is surprising. |
According to the posted code 256 is the frame size fed into the seekable-zstd library, so I believe it's the same thing. |
Just tested our slowest benchmark, and 4K is indeed very fast with ZStd ≥ 1.5. Thanks again! |
As reported by @P-E-Meunier in #2662 (comment), seekable format ingestion speed can be particularly slow when the selected `FRAME_SIZE` is very small, especially in combination with the recent row_hash compression mode. The specific scenario mentioned was `pijul`, using frame sizes of 256 bytes and level 10. This is improved in this PR, by providing approximate parameter adaptation to the compression process. Tested locally on an M1 laptop, ingestion of `enwik8` using `pijul` parameters went from 35 sec. (before this PR) to 2.5 sec. (with this PR). For the specific corner case of a file full of zeroes, this is even more pronounced, going from 45 sec. to 0.5 sec. These benefits are unrelated to (and come on top of) other improvement efforts currently being made by @yoniko for the row_hash compression method specifically. The `seekable_compress` test program has been updated to allow setting the compression level, in order to produce these performance results.
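The adaptation idea is roughly this: when each frame holds only a few hundred bytes, a large search window buys nothing, so window-related parameters can be clamped to the frame size. A hedged sketch of such a policy follows (hypothetical function and constants, not the PR's exact logic):

```python
import math

def adapt_window_log(requested_window_log: int, frame_size: int,
                     min_window_log: int = 10) -> int:
    # No point keeping a search window larger than one frame: clamp
    # windowLog to ceil(log2(frame_size)), floored at an assumed minimum.
    needed = max(min_window_log, math.ceil(math.log2(max(frame_size, 2))))
    return min(requested_window_log, needed)

# A 256-byte frame needs only the assumed 2^10 floor, however high the
# requested level's windowLog is; a 1 MiB frame keeps a 2^20 window.
assert adapt_window_log(24, 256) == 10
assert adapt_window_log(24, 1 << 20) == 20
```

Smaller windows also shrink the match-finder tables that have to be initialized per frame, which is plausibly where most of the reported 35 sec. to 2.5 sec. improvement comes from.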
A msvc2019 speed regression commit: 634bfd3
EDIT:
This regression only occurs when using the /Ob3 compiler option, AND on the i5-4570. With the /Ob2 option, OR on the Ryzen 3600X, 634bfd3 shows no speed regression. /Ob3 specifies more aggressive inlining than /Ob2: https://docs.microsoft.com/en-us/cpp/build/reference/ob-inline-function-expansion
before:
after:
The unit of c_speed/d_speed is MB/s.