Performance regression (decompress) from 1.4.0 -> 1.4.3 #1758
Comments
Did you mean an increase in decompression time (i.e., slower)? |
Can you include the compilation flags you're using and the CPU you're seeing this on? |
Yes, sorry for the mixed wording. It takes longer to decompress the same file with 1.4.3 than it does with 1.4.0. The times listed in that table are measured in seconds.
I'm building zstd with zlib (1.2.11) and lzma (5.2.4) in combination with many other parts of code via a CMake super-build. The output I'm seeing in the logs shows the normal "Release" build flags.
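For reference, the optimization level a CMake "Release" build actually uses can be confirmed from the build cache; the path below is illustrative and would point at the zstd sub-build inside the super-build.

# Print the flags CMake applies to Release builds of the zstd sub-project.
grep CMAKE_C_FLAGS_RELEASE build/zstd/CMakeCache.txt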
Concerning the system, it's an older Xeon, lacking AVX or AVX2. This is the output of $ cat /proc/cpuinfo:
I'd be happy to provide more info, just let me know what's needed. Thanks, Kyle |
A little known feature of the zstd command-line tool is its built-in benchmark mode (zstd -b). The advantage is that the in-memory benchmark is free of I/O side-effects, which can dominate results at high speed, and it uses a very precise timer. I made a comparison of v1.4.0 and v1.4.3, first compiling both with -O2 to match your setup.
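For instance, a whole range of levels can be benchmarked in a single invocation; the level range, minimum measurement time, and file name below are only illustrative.

# Benchmark compression and decompression for levels 1 through 19, in memory.
# -b sets the first level, -e the last, -i the minimum time per measurement in seconds.
zstd -b1 -e19 -i5 sample.vdb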
The results are so-so: the expected benefits are not clearly present. It's roughly the same, maybe very slightly faster in a few cases. This surprised me, so I re-ran the test using -O3.
Now we are talking: the gains are more visible. It's a bit short of the 7% advertised, but it's definitely there. But there is a bit more to it. Compare both tables: moving from -O2 to -O3 is not a win for v1.4.0.
It's actually rather a loss! Which means, transitively, that -O2 must have been the less favorable setting for v1.4.3.
Yes, it was worse. Conclusions:
These experiments explain why, at -O2, you are not seeing the advertised gains; I would suggest trying -O3. FYI, I tried to time the CLI decompression performance on my test platform, but could not reproduce any sensible difference so far (measurement noise was higher than any potential difference between the two versions). |
That is some great insight! Thank you for looking into it so thoroughly. After I had given it a little more thought, I was questioning why I wasn't compiling it at O3, instead of O2, to take advantage of vectorization. I'm glad you tested it as well. I will try that tomorrow, alongside gcc-{6,7,8} with O{2,3} and see if I can't help pin down the issue. I'm glad you reminded me of the benchmark mode. I knew it was in there, but it completely slipped my mind when I was testing this morning. I'll post a representative file too. Thanks, Kyle |
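A rough sketch of that build matrix, assuming a checkout of the zstd sources and that the Makefile still honors MOREFLAGS for appending flags; compiler names and paths are illustrative.

# Build zstd with each compiler at each optimization level and keep each binary.
# MOREFLAGS is appended after the default flags, so the last -O option wins with GCC.
for cc in gcc-6 gcc-7 gcc-8; do
  for opt in -O2 -O3; do
    make -C programs clean
    make -C programs zstd CC="$cc" MOREFLAGS="$opt"
    cp programs/zstd "zstd-$cc$opt"
  done
done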
I think it's probable that, for GCC and Clang, vectorization is almost entirely bad for zstd performance. It certainly is in every instance I looked at, but I stopped short of disabling it completely. Compiler vectorization introduces high startup costs for the loop (checking for length, overlap) that have to be amortized against an assumed high trip count for the loop. In the case of zstd decoding, that average trip count is actually very low -- most likely 1. I turned off auto-vectorization for decoding in PR1668 and replaced it with a hand-vectorized version that I wrote with processors >= Sandy Bridge (2012) in mind, for which 16-byte operations are not meaningfully more expensive than 8-byte ones. It is possible that for an E5620, which is a bit older, that assumption is not valid. |
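One cheap way to probe that on an E5620 is to rebuild at -O3 with GCC auto-vectorization disabled and re-run the benchmark; -fno-tree-vectorize is a standard GCC flag, while the MOREFLAGS hook and the file name are assumptions.

# Rebuild v1.4.3 at -O3 without auto-vectorization, then benchmark level 17.
make -C programs clean
make -C programs zstd MOREFLAGS="-O3 -fno-tree-vectorize"
./programs/zstd -b17 -i5 sample.vdb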
Thanks @KBentley57, I'm also impressed by the very large drop in decompression speed between levels 16 and 19. Edit: it may be related to the 4 MB cache size of the target CPU, since higher levels increase the window size up to 8 MB, resulting in more cache misses. |
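If someone wants to test the cache theory, the window can be capped at compression time through the advanced --zstd= parameters; the wlog value and file names below are illustrative.

# Compress at level 19 with the default window and with a 4 MB (2^22) window,
# then benchmark decompression of both archives.
zstd -19 sample.vdb -o sample.default.zst
zstd -19 --zstd=wlog=22 sample.vdb -o sample.wlog22.zst
zstd -b -d sample.default.zst
zstd -b -d sample.wlog22.zst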
@Cyan4973 I'm a little perplexed, however. On one hand it doesn't really matter, since they're so close, but on my laptop there's clearly a performance difference between the two versions, at least for my type of data. On the other hand, the accumulation of small time / energy savings is significant over the course of a large Monte Carlo run consisting of > 1 million trials. Am I looking too hard at what is likely an unpredictable quantity? I saw that there were a few commits on the PR page, have you done any investigations into this? Not pushing, just asking if anything obvious poked its head out. Thanks, Kyle |
No, unfortunately, no easy conclusion here. |
@KBentley57 what is the compression ratio of your file? I'm going to start the following benchmark on my machine, which produces very stable benchmark results, and I want to make sure to include a representative file.
#!/usr/bin/env sh
PROG="$0"
USAGE="$PROG 0zstd 1zstd FILES..."
PREFIX="taskset --cpu-list 0"
ZSTD0="$1"
shift
ZSTD1="$1"
shift
if [ "x$ZSTD0" == "x" ]; then
echo $USAGE
exit 1
fi
if [ "x$ZSTD1" == "x" ]; then
echo $USAGE
echi 1
fi
levels=$(seq 1 19)
echo "Compressing each file with each level"
for file in "$@"; do
for level in $levels; do
ofile="$file.zst.$level"
if [ ! -f "$ofile" ]; then
$ZSTD1 "$file" -$level -o "$file.zst.$level"
fi
done
done
echo "Disabling Turbo"
echo 1 | sudo tee /sys/devices/system/cpu/intel_pstate/no_turbo
for file in "$@"; do
echo "Benchmarking on $file"
for ZSTD in $ZSTD0 $ZSTD1; do
echo $ZSTD
for level in $levels; do
$ZSTD -b -d "$file.zst.$level"
done
done
done
echo "Enabling Turbo"
echo 0 | sudo tee /sys/devices/system/cpu/intel_pstate/no_turbo |
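Regarding the compression-ratio question at the top of this comment, zstd can report the ratio directly from an existing archive; the file name is illustrative.

# -l lists frame information: compressed size, decompressed size, and ratio.
zstd -l sample.vdb.zst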
The difference in decompression speed between levels 16, 17, 18, and 19 can be explained by the minimum match length. There must be somewhat more 4-byte matches than 5-byte matches, and a lot more 3-byte matches. |
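One way to isolate that effect, assuming the advanced mml (minimum match) parameter is accepted by this build, would be to pin the minimum match length while keeping the level fixed; the value and file names are illustrative.

# Recompress at level 19 but force a 4-byte minimum match, then benchmark decompression.
zstd -19 --zstd=mml=4 sample.vdb -o sample.19-mml4.zst
zstd -b -d sample.19-mml4.zst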
@KBentley57 would it be possible to regenerate an OpenVDB file containing data that you can share and that shows the same performance regression? We suspect it is either the file or the CPU specs, but we're not sure which. Having the file would really help us narrow down the issue. |
@Cyan4973 Thanks, I'll do what I can to help out, if it's needed. @terrelln I am working on getting the OK for that right now. I can't promise it'll be tomorrow, but by Friday afternoon I should have a few test cases that you can try. Generally, we see compression ratios of anywhere from 1.4 and up. That depends on a few parameters, but I'd put it in the range (1.4, 2.0). |
All, sorry for the delay. While I'm afraid I can't provide the actual data file yet, OpenVDB has many sample voxel files that are similar enough to my use case that I think they should be sufficient. Here is one that is nearly identical in file size and roughly the same density, etc. There are a few differences, notably that I'm using HalfFloat (an Industrial Light and Magic (ILM) component of OpenVDB), so the leaf nodes in this structure would be twice as large as mine. I'm not certain whether that matters. They are storing two grids; their second grid is a vec3, whereas I'm storing a second grid of scalar values, again HalfFloat. The compression ratios for this file aren't nearly as high as my data, achieving somewhere around 1.1. Here are a few of my numbers on the old work machine. It still remains the case that 1.4.0 at O2 beats 1.4.3 at O3 in a significant number of cases, though not by much. Under 1.4.3, O2 wins a few and loses a few. Here's a look at the differences between the two versions when compiled under O2 vs O3 (1.4.0 - 1.4.3). @terrelln I effectively ran your script and experienced similar results. BTW, I'm not sure if it was a copy/paste error, but there was a typo in the second if block of the script as posted (echi where exit 1 was intended). And finally, the tabular data, in case it reveals anything more:
|
We have a project to look more into these differences. On the specific issue of the performance comparison between 1.4.0 and 1.4.3, the topic is becoming obsolete, because the next version, v1.4.4, features substantial changes to the decompression algorithm, resulting in dramatically better speed. As a consequence, the code already merged into the dev branch no longer behaves like either of the versions measured here. I would suggest a test of the current code in dev. |
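A minimal sketch of such a test, assuming a fresh checkout of the repository and an existing archive to decompress; the file name is illustrative.

# Build the current dev branch and benchmark decompression of an existing archive.
git clone https://github.com/facebook/zstd.git
cd zstd
git checkout dev
make -C programs zstd
./programs/zstd -b -d ../sample.vdb.zst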
I'll give that a look sometime soon. If that's the case, should this issue be marked as closed? I don't want it to run on forever now that it has fulfilled its purpose. Thanks, Kyle |
All,
I was excited to test the new release (1.4.3) in our company's code, expecting the gains touted in the release notes (7% average, if that is correct) compared to the version I'm using, 1.4.0. I made a sample file and compressed it with a few standard options, choosing level 17 for compression since I've found the smallest ratios at that level for my data.
The file was compressed with zstd 1.4.0 built from source with GCC 8 on CentOS 7, using the same flags for both versions. The data size is about 365 MB uncompressed. Compressed, the file is around 215 MB. I put the file in /dev/shm in an attempt to isolate the I/O, and ran a simple script to time the decompression of the file, delete the uncompressed output, and repeat. The time was reported as the real output from the bash time command. The descriptive statistics for the two experiments are summarized in the table below.
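Roughly, the procedure looked like the following; the file names and iteration count are illustrative.

# Compress once at level 17, then repeatedly time decompression from /dev/shm.
zstd -17 sample.vdb -o /dev/shm/sample.vdb.zst
for i in $(seq 1 50); do
  time zstd -d -f /dev/shm/sample.vdb.zst -o /dev/shm/sample.vdb
  rm -f /dev/shm/sample.vdb
done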
Can anyone comment on why I may be seeing a significant increase in decompression time? It is on the order of 10%. I am afraid I cannot share the file that was being compressed, but it seems somewhat immaterial.