Performance regression (decompress) from 1.4.0 -> 1.4.3 #1758

Closed
KBentley57 opened this issue Aug 29, 2019 · 18 comments

Comments

@KBentley57

KBentley57 commented Aug 29, 2019

All,

I was excited to test the new release (1.4.3) in our company's code, expecting the gains touted in the release notes (7% on average, if that is correct) compared to the version I'm using, 1.4.0. I made a sample file and compressed it with a few standard options, using level 17 for compression, as I've found the smallest compressed sizes with that level for my data.

The file was compressed with zstd built from source using version 1.4.0, with GCC 8 on CentOS 7, using the same flags on both versions. The data size is about 365 MB uncompressed. Compressed, the file is around 215 MB. I put the file in /dev/shm in an attempt to isolate the I/O and ran a simple script to time the decompression of the file, delete the uncompressed output, and repeat. The time reported is the real value from the bash time command. The descriptive statistics for the two experiments are summarized in the table below.

Can anyone comment on why I may be seeing a significant increase in decompression time? The increase is on the order of 10%. I am afraid I cannot share the file that was being compressed, but it seems somewhat immaterial.

Statistic | 1.4.0 | 1.4.3
Mean | 2.153 | 2.358
Standard Error | 0.003 | 0.002
Mode | 2.143 | 2.365
Median | 2.151 | 2.369
First Quartile | 2.136 | 2.364
Third Quartile | 2.176 | 2.375
Variance | 0.007 | 0.003
Standard Deviation | 0.081 | 0.052
Kurtosis | 13.845 | 29.457
Skewness | 2.017 | 1.228
Range | 0.934 | 0.807
Minimum | 1.977 | 2.206
Maximum | 2.911 | 3.013
Sum | 2153.070 | 2358.416
Count | 1000 | 1000

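
A rough check of how far apart the two means are, as a minimal sketch. It uses only the summary statistics posted in the table (mean, standard deviation, count) and assumes the two sets of 1000 runs are independent samples; the function name is illustrative and not part of any tool used in this thread.

```python
import math

def compare_means(mean_a, sd_a, n_a, mean_b, sd_b, n_b):
    """Return (relative change of b versus a, Welch t statistic)."""
    rel_change = (mean_b - mean_a) / mean_a
    # Standard error of the difference between two independent sample means.
    se_diff = math.sqrt(sd_a ** 2 / n_a + sd_b ** 2 / n_b)
    t_stat = (mean_b - mean_a) / se_diff
    return rel_change, t_stat

# Summary statistics from the table (seconds, 1000 runs per version).
rel, t = compare_means(2.153, 0.081, 1000, 2.358, 0.052, 1000)
print(f"1.4.3 takes {rel:.1%} longer, t = {t:.1f}")
```

With the posted numbers the slowdown comes out near 9.5%, and the difference is many standard errors away from zero, so run-to-run noise alone is an unlikely explanation.
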
@Cyan4973
Contributor

Can anyone comment on why I may be seeing a significant increase in decompression speed?

Did you mean an increase in decompression time (aka, slower) ?

@felixhandte
Contributor

felixhandte commented Aug 29, 2019

Can you include those compilation flags you're using and what CPU you're seeing this on?

@KBentley57
Author

Can anyone comment on why I may be seeing a significant increase in decompression speed?

Did you mean an increase in decompression time (aka, slower) ?

Yes, sorry for the mixed wording. It takes longer to decompress the same file with 1.4.3 than it does with 1.4.0. The times listed in that table are measured in seconds.

Can you include those compilation flags you're using and what CPU you're seeing this on?

I'm building zstd with zlib (1.2.11) and lzma (5.2.4) in combination with many other parts of code via a cmake super-build. The output that I'm seeing is the normal "Release" build flags in the logs,

# compile C with /opt/rh/devtoolset-8/root/usr/bin/gcc
C_FLAGS =  -std=c99 -Wall -Wextra -Wundef -Wshadow -Wcast-align -Wcast-qual -Wstrict-prototypes -O2 -DNDEBUG  

C_DEFINES = -DXXH_NAMESPACE=ZSTD_ -DZSTD_GZCOMPRESS -DZSTD_GZDECOMPRESS -DZSTD_LEGACY_SUPPORT=0 -DZSTD_LZMACOMPRESS -DZSTD_LZMADECOMPRESS -DZSTD_MULTITHREAD

Concerning the system, it's an older Xeon, lacking AVX or AVX2. This is the output of $ cat /proc/cpuinfo:

vendor_id	: GenuineIntel
cpu family	: 6
model		: 44
model name	: Intel(R) Xeon(R) CPU           E5620  @ 2.40GHz
stepping	: 2
microcode	: 0x1f
cpu MHz		: 2394.248
cache size	: 12288 KB
physical id	: 0
siblings	: 8
core id		: 10
cpu cores	: 4
apicid		: 21
initial apicid	: 21
fpu		: yes
fpu_exception	: yes
cpuid level	: 11
wp		: yes
flags		: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc aperfmperf eagerfpu pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 cx16 xtpr pdcm pcid dca sse4_1 sse4_2 popcnt aes lahf_lm epb ssbd ibrs ibpb stibp tpr_shadow vnmi flexpriority ept vpid dtherm ida arat spec_ctrl intel_stibp flush_l1d
bogomips	: 4788.49
clflush size	: 64
cache_alignment	: 64
address sizes	: 40 bits physical, 48 bits virtual
power management:

I'd be happy to provide more info, just let me know what's needed.

Thanks,

Kyle

@Cyan4973
Contributor

Cyan4973 commented Aug 29, 2019

A little-known feature of zstd's internal benchmark is that it can benchmark decompression speed only. For that, you'll need to load a *.zst compressed file and use the flags -b -d. This is useful when measuring decompression speed on files compressed at high compression levels, as compression times can be punishing, especially on large files.

The advantage is that the in-memory benchmark is free of I/O side-effects, which can dominate results at high speed, and uses a very precise timer.
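
To illustrate why an in-memory benchmark sidesteps I/O effects, here is a minimal sketch of the idea. It is not zstd's benchmark code: it uses Python's standard zlib module as a stand-in codec, and the payload, repeat count, and function name are all assumptions made for the example.

```python
import time
import zlib

def bench_decompress(compressed, repeats=20):
    """Return the best in-memory decompression speed, in MB/s of output."""
    best = float("inf")
    out_size = 0
    for _ in range(repeats):
        start = time.perf_counter()
        out = zlib.decompress(compressed)      # no file I/O inside the timed region
        elapsed = time.perf_counter() - start
        out_size = len(out)
        best = min(best, max(elapsed, 1e-9))   # guard against a zero timer reading
    return out_size / best / 1e6

payload = bytes(range(256)) * 4096             # hypothetical 1 MiB test payload
compressed = zlib.compress(payload, 6)
speed = bench_decompress(compressed)
print(f"{speed:.0f} MB/s")
```

The compressed input is loaded once and every timed iteration works purely on memory, so disk and page-cache behaviour cannot leak into the measurement. Taking the best of several runs is one common way to reduce scheduler noise.
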

I made a comparison of v1.4.0 vs v1.4.3 decompression speeds using this technique on a desktop system with a Core i7-9700k, compiling with gcc v8.3.0 and the -O2 optimization flag (to reproduce @KBentley57's scenario; the default is -O3). The files are individual components of the silesia corpus compressed at level 17.

file | v1.4.0 | v1.4.3 | diff
dickens | 792 | 795 | 0.38%
mozilla | 837 | 823 | -1.67%
mr | 753 | 758 | 0.66%
nci | 2183 | 2228 | 2.06%
ooffice | 619 | 614 | -0.81%
osdb | 1057 | 1050 | -0.66%
reymont | 982 | 990 | 0.81%
samba | 1314 | 1308 | -0.46%
sao | 626 | 628 | 0.32%
webster | 925 | 940 | 1.62%
xml | 1848 | 1845 | -0.16%
x-ray | 465 | 466 | 0.22%
silesia.tar | 950 | 954 | 0.42%

The results are so-so, i.e. the decompression speed gains are not clearly present. It's roughly the same, maybe very slightly faster (mostly for nci).

This surprised me, so I re-ran the test using -O3 (the default) :

file | v1.4.0 | v1.4.3 | diff
dickens | 729 | 775 | 6.31%
mozilla | 795 | 827 | 4.03%
mr | 698 | 741 | 6.16%
nci | 2172 | 2236 | 2.95%
ooffice | 583 | 611 | 4.80%
osdb | 991 | 1046 | 5.55%
reymont | 918 | 971 | 5.77%
samba | 1242 | 1300 | 4.67%
sao | 589 | 631 | 7.13%
webster | 861 | 923 | 7.20%
xml | 1774 | 1841 | 3.78%
x-ray | 429 | 456 | 6.29%
silesia.tar | 908 | 948 | 4.41%

Now we are talking. The gains are more visible. It's a bit short of the 7% advertised, but it's definitely there.

But there is a bit more to it : compare both tables : moving from -O2 to -O3 is not a gain. Here is a comparison for v1.4.3 :

file | -O2 | -O3 | diff
dickens | 795 | 775 | -2.52%
mozilla | 823 | 827 | 0.49%
mr | 758 | 741 | -2.24%
nci | 2228 | 2236 | 0.36%
ooffice | 614 | 611 | -0.49%
osdb | 1050 | 1046 | -0.38%
reymont | 990 | 971 | -1.92%
samba | 1308 | 1300 | -0.61%
sao | 628 | 631 | 0.48%
webster | 940 | 923 | -1.81%
xml | 1845 | 1841 | -0.22%
x-ray | 466 | 456 | -2.15%
silesia.tar | 954 | 948 | -0.63%

It's actually rather a loss ! Which means, transitively, that it must have been worse for v1.4.0 :

file | -O2 | -O3 | diff
dickens | 792 | 729 | -7.95%
mozilla | 837 | 795 | -5.02%
mr | 753 | 698 | -7.30%
nci | 2183 | 2172 | -0.50%
ooffice | 619 | 583 | -5.82%
osdb | 1057 | 991 | -6.24%
reymont | 982 | 918 | -6.52%
samba | 1314 | 1242 | -5.48%
sao | 626 | 589 | -5.91%
webster | 925 | 861 | -6.92%
xml | 1848 | 1774 | -4.00%
x-ray | 465 | 429 | -7.74%
silesia.tar | 950 | 908 | -4.42%

Yes, it was worse.

Conclusions :

  • -O3 is actually a bad setting (for decompression speed and gcc v8.3.0).
    • Of course, it's unclear if this is a bad setting also for compression, or for other compilers (clang), or for different versions. That's a mess. Not being able to rely on -O3 > -O2 makes the situation a lot more complex.
  • The decompression speed gains offered in v1.4.1 effectively achieve parity between -O2 and -O3. That means that the most important contribution was to prevent the compiler from doing too much harm by trying too hard to be clever.
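
The four tables can be cross-checked with a short script. This is a sketch that simply copies the speeds posted in the first two tables and recomputes the -O2 to -O3 change for each version; the helper names are illustrative.

```python
from statistics import mean

# Decompression speeds from the tables: file -> (v1.4.0, v1.4.3).
O2 = {
    "dickens": (792, 795), "mozilla": (837, 823), "mr": (753, 758),
    "nci": (2183, 2228), "ooffice": (619, 614), "osdb": (1057, 1050),
    "reymont": (982, 990), "samba": (1314, 1308), "sao": (626, 628),
    "webster": (925, 940), "xml": (1848, 1845), "x-ray": (465, 466),
    "silesia.tar": (950, 954),
}
O3 = {
    "dickens": (729, 775), "mozilla": (795, 827), "mr": (698, 741),
    "nci": (2172, 2236), "ooffice": (583, 611), "osdb": (991, 1046),
    "reymont": (918, 971), "samba": (1242, 1300), "sao": (589, 631),
    "webster": (861, 923), "xml": (1774, 1841), "x-ray": (429, 456),
    "silesia.tar": (908, 948),
}

def pct(new, old):
    """Percent change from old to new."""
    return (new - old) / old * 100.0

def mean_o2_to_o3(version_index):
    """Average percent change when moving from -O2 to -O3 for one version."""
    return mean(pct(O3[f][version_index], O2[f][version_index]) for f in O2)

m140 = mean_o2_to_o3(0)   # v1.4.0
m143 = mean_o2_to_o3(1)   # v1.4.3
print(f"-O2 -> -O3: v1.4.0 {m140:+.2f}%, v1.4.3 {m143:+.2f}%")
```

On these samples, moving from -O2 to -O3 costs v1.4.0 between 5 and 6 percent on average, while v1.4.3 loses under 1 percent, which is the near-parity described in the conclusions.
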

These experiments explain why, at -O2 setting, there is no perceived benefit between v1.4.0 and v1.4.3, but it doesn't explain why @KBentley57's experiment perceives a sizable loss of performance.

I would suggest trying -b -d on your platform to see if it reproduces the issue.
If it does, we will have to look into the library, and find a sample which reproduces the issue.
If it doesn't, then the issue could be in the CLI instead, or in I/O conditions.

FYI, I tried to time the CLI decompression performance on my test platform, but could not reproduce any noticeable difference so far (measurement noise was higher than any potential difference between v1.4.0 and v1.4.3).

@KBentley57
Author

@Cyan4973

That is some great insight! Thank you for looking into it so thoroughly.

After I had given it a little more thought, I was questioning why I wasn't compiling it at O3, instead of O2, to take advantage of vectorization. I'm glad you tested it as well. I will try that tomorrow, alongside gcc-{6,7,8} with O{2,3} and see if I can't help pin down the issue. I'm glad you reminded me of the benchmark mode. I knew it was in there, but it completely slipped my mind when I was testing this morning. I'll post a representative file too.

Thanks,

Kyle

@mgrice
Contributor

mgrice commented Aug 30, 2019

I think it's probable that, for GCC and clang, auto-vectorization is almost entirely bad for zstd performance. It certainly is in every instance I looked at, but I stopped short of disabling it completely. Compiler vectorization introduces high startup costs for the loop (checking for length, overlap) that have to be amortized against an assumed high trip count for the loop. In the case of zstd decoding, that average trip count is actually very low -- most likely 1.
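
The amortization argument can be made concrete with a toy cost model. Everything numeric here is an assumption chosen for illustration (a fixed startup cost of 8 units, 16 elements per vector step, 1 unit per scalar element); none of it is measured from zstd or any compiler.

```python
import math

def scalar_cost(n, per_elem=1.0):
    """Cost of a plain loop that handles one element per iteration."""
    return n * per_elem

def vector_cost(n, width=16, startup=8.0, per_chunk=1.0):
    """Cost of a vectorized loop: fixed setup plus one step per chunk.

    `startup` stands in for the length and overlap checks performed
    before the vector loop can run (hypothetical value).
    """
    return startup + math.ceil(n / width) * per_chunk

def break_even(width=16, startup=8.0):
    """Smallest trip count at which the vectorized loop becomes cheaper."""
    n = 1
    while vector_cost(n, width, startup) >= scalar_cost(n):
        n += 1
    return n

print(vector_cost(1), scalar_cost(1), break_even())
```

Under these made-up costs the vectorized loop only wins once the trip count reaches 10, so a loop that usually runs once pays the startup cost every time and never recovers it.
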

I turned off auto-vectorization for decoding in PR1668 and replaced it with a hand-vectorized version that I wrote with processors >= Sandy Bridge (2012) in mind, for which 16-byte operations are not meaningfully more expensive than 8-byte ones. It is possible that for an E5620, which is a bit older, that assumption is not valid.

Cyan4973 removed the bug label Sep 10, 2019

@KBentley57
Author

KBentley57 commented Sep 11, 2019

I had some time to run a few tests on my laptop at home; I'm afraid I haven't made the time for it yet at work. The results are interesting, and similar to what I observed with different data.

I used a representative sample of data that is about 18 MB (it's an OpenVDB grid, for the curious), and put it in /dev/shm again. I compiled 1.4.0 and 1.4.3 under the Release build, which adds O3, and the RelWithDebInfo build, which adds O2. My tests show that 1.4.0 compiled under O3 beats 1.4.3 under O3 by about 2-3% across the board, and 1.4.3 O2 loses to 1.4.0 O2 in levels 1-6, but beats it in levels >= 7.

Specs: Debian 10, Intel Core i7-5600U, GCC 8.3.0-6

First up are the compression tests. These don't really affect me, but here are the results for the sake of completeness. It's worth noting that, as suspected, O2 beats O3 in some cases.
image

The decompression test is next. Note that here I'm not timing the binary unzstd like in the original post, but using the results of the internal benchmark as suggested. Levels 1-19 were tested with the command shown on the plot. This is a single run, not a statistical analysis. Note that the Y-axis doesn't start at 0, so don't be misled by the heights of the bars.
image

To highlight the differences, here's a plot of the percent difference between 1.4.0 and 1.4.3 when compiled under O3. The mean value is 2% in favor of 1.4.0, but for the most part, the difference is in the 3-4% range.
image

I'll do my best to carve out a few minutes at work to try this again, but I think the results here make the case that at least some reconsideration ought to be given to the hand-rolled vectorized loop in the 1.4.x patch, as this is a pretty modern cpu.

Here's the data in tabular form.

Compression
--
Level | 1.4.0 O3 | 1.4.0 O2 | 1.4.3 O3 | 1.4.3 O2
1 | 336.200 | 313.400 | 329.200 | 332.100
2 | 255.600 | 242.200 | 255.500 | 255.900
3 | 139.800 | 140.900 | 140.000 | 144.600
4 | 103.700 | 103.400 | 114.900 | 120.400
5 | 42.400 | 37.900 | 42.600 | 38.100
6 | 29.200 | 26.400 | 28.500 | 25.600
7 | 27.700 | 25.400 | 27.300 | 24.400
8 | 26.700 | 24.400 | 25.500 | 23.400
9 | 25.100 | 22.700 | 23.600 | 22.000
10 | 20.100 | 18.200 | 18.800 | 17.700
11 | 19.700 | 18.300 | 18.600 | 17.500
12 | 18.900 | 17.800 | 18.100 | 17.100
13 | 13.500 | 13.100 | 13.400 | 13.000
14 | 13.500 | 13.100 | 13.300 | 13.000
15 | 11.800 | 11.400 | 11.400 | 11.200
16 | 10.110 | 9.900 | 9.970 | 9.230
17 | 7.310 | 7.090 | 7.210 | 6.990
18 | 5.950 | 5.850 | 5.890 | 5.720
19 | 5.000 | 4.860 | 4.920 | 4.840

Decompression
--
Level | 1.4.0 O3 | 1.4.0 O2 | 1.4.3 O3 | 1.4.3 O2
1 | 775.200 | 755.600 | 777.600 | 754.200
2 | 753.100 | 733.400 | 752.400 | 713.600
3 | 743.300 | 710.200 | 719.900 | 693.500
4 | 735.000 | 690.100 | 710.200 | 684.000
5 | 719.200 | 683.400 | 693.600 | 676.900
6 | 720.000 | 685.900 | 699.900 | 685.100
7 | 735.600 | 696.000 | 711.600 | 701.000
8 | 733.900 | 693.600 | 710.300 | 703.600
9 | 734.100 | 695.300 | 708.300 | 708.900
10 | 723.100 | 689.000 | 702.100 | 705.400
11 | 714.600 | 685.800 | 698.400 | 703.700
12 | 717.300 | 687.100 | 698.600 | 705.500
13 | 718.500 | 691.400 | 704.300 | 709.800
14 | 715.000 | 689.400 | 702.600 | 709.200
15 | 721.700 | 692.000 | 709.200 | 717.200
16 | 712.300 | 683.600 | 697.700 | 705.800
17 | 610.700 | 609.600 | 615.200 | 625.300
18 | 455.100 | 464.500 | 453.900 | 463.300
19 | 417.200 | 428.500 | 427.400 | 439.100

Comparison
--
Level | O3 Delta
1 | -0.31%
2 | 0.09%
3 | 3.25%
4 | 3.49%
5 | 3.69%
6 | 2.87%
7 | 3.37%
8 | 3.32%
9 | 3.64%
10 | 2.99%
11 | 2.32%
12 | 2.68%
13 | 2.02%
14 | 1.76%
15 | 1.76%
16 | 2.09%
17 | -0.73%
18 | 0.26%
19 | -2.39%
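
The O3 Delta column can be reproduced from the two O3 columns of the decompression table. The sketch below assumes the delta is computed as (1.4.0 - 1.4.3) / 1.4.3, which is the formula that matches the posted percentages; the variable names are illustrative.

```python
# O3 decompression speeds for levels 1..19, copied from the table above.
V140_O3 = [775.2, 753.1, 743.3, 735.0, 719.2, 720.0, 735.6, 733.9, 734.1,
           723.1, 714.6, 717.3, 718.5, 715.0, 721.7, 712.3, 610.7, 455.1, 417.2]
V143_O3 = [777.6, 752.4, 719.9, 710.2, 693.6, 699.9, 711.6, 710.3, 708.3,
           702.1, 698.4, 698.6, 704.3, 702.6, 709.2, 697.7, 615.2, 453.9, 427.4]

# Positive values mean 1.4.0 is faster than 1.4.3 at that level.
deltas = [(a - b) / b * 100.0 for a, b in zip(V140_O3, V143_O3)]
mean_delta = sum(deltas) / len(deltas)

for level, delta in enumerate(deltas, start=1):
    print(f"{level:2d} | {delta:+.2f}%")
print(f"mean: {mean_delta:+.2f}%")
```

The mean lands a little under 2%, pulled down by levels 1, 2 and 17 to 19, where the two versions are nearly tied or 1.4.3 is ahead.
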

The complete CPU specs are below. I'd also be interested to see how the recent Intel bug fixes affect this.

vendor_id	: GenuineIntel
cpu family	: 6
model		: 61
model name	: Intel(R) Core(TM) i7-5600U CPU @ 2.60GHz
stepping	: 4
microcode	: 0x2d
cpu MHz		: 1335.493
cache size	: 4096 KB
physical id	: 0
siblings	: 4
core id		: 1
cpu cores	: 2
apicid		: 3
initial apicid	: 3
fpu		: yes
fpu_exception	: yes
cpuid level	: 20
wp		: yes
flags		: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch cpuid_fault epb invpcid_single pti ssbd ibrs ibpb stibp tpr_shadow vnmi flexpriority ept vpid ept_ad fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm rdseed adx smap intel_pt xsaveopt dtherm ida arat pln pts md_clear flush_l1d
bugs		: cpu_meltdown spectre_v1 spectre_v2 spec_store_bypass l1tf mds swapgs
bogomips	: 5188.22
clflush size	: 64
cache_alignment	: 64
address sizes	: 39 bits physical, 48 bits virtual
power management:

@KBentley57
Author

KBentley57 commented Sep 11, 2019

I ran the same test as above on my work PC. The results are a little different from what I was expecting. The data is identical to the case above, but different from that in the original post. The results here show 1.4.0 and 1.4.3 being closer in performance than in the comparison on the newer PC. Here are the decompression results for discussion.

1.4.0 O3 beats or ties 1.4.3 O3 across a large range of levels, but not by a lot. Again, check the y-axis. Besides providing good evidence for an upgrade of my development rig, there's not much going on. It does run contrary to the reported gains, though.
image

Shown here is the percent difference. The lead is small or none in more cases than not.
image

The specs of the PC are the same as in the first few posts.

@Cyan4973
Contributor

Cyan4973 commented Sep 11, 2019

Thanks @KBentley57 ,
that's interesting indeed,
it shows the picture is not that clear.

I'm also impressed by the very large drop in decompression speed between level 16 and 19.
That's something I'm not used to, though it could be sample specific.

edit : or maybe related to the 4 MB cache size of target cpu, since higher levels increase window size up to 8 MB, resulting in more cache misses.

@KBentley57
Author

@Cyan4973
I want to stress that the axes could be misleading at first glance: the decompression speed doesn't approach zero, but in both cases drops to just a little over half the maximum speed of the other levels.

I'm a little perplexed, however. On one hand it doesn't really matter, since they're so close, but on my laptop there's clearly a performance difference between the two versions, at least for my type of data. On the other hand, the accumulation of small time / energy savings is significant over the course of a large Monte Carlo run consisting of > 1 million trials. Am I looking too hard at what is likely an unpredictable quantity?

I saw that there were a few commits on the PR page; have you done any investigation into this? Not pushing, just asking if anything obvious poked its head out.

Thanks,

Kyle

@Cyan4973
Contributor

No, unfortunately, no easy conclusion here.
It's likely worth an investigation, so we'll start one.

@terrelln
Contributor

terrelln commented Sep 11, 2019

@KBentley57 what is the compression ratio of your file?

I'm going to start the following benchmark on my machine, which produces very stable benchmark results, and I want to make sure to include a representative file.

#!/usr/bin/env sh

PROG="$0"
USAGE="$PROG 0zstd 1zstd FILES..."

PREFIX="taskset --cpu-list 0"

ZSTD0="$1"
shift
ZSTD1="$1"
shift

if [ "x$ZSTD0" = "x" ]; then
        echo $USAGE
        exit 1
fi

if [ "x$ZSTD1" = "x" ]; then
        echo $USAGE
        echi 1
fi

levels=$(seq 1 19)

echo "Compressing each file with each level"
for file in $@; do
        for level in $levels; do
                ofile="$file.zst.$level"
                if [ ! -f "$ofile" ]; then
                        $ZSTD1 "$file" -$level -o "$file.zst.$level"
                fi
        done
done

echo "Disabling Turbo"
echo 1 | sudo tee /sys/devices/system/cpu/intel_pstate/no_turbo

for file in $@; do
        echo "Benchmarking on $file"
        for ZSTD in $ZSTD0 $ZSTD1; do
                echo $ZSTD
                for level in $levels; do
                        $ZSTD -b -d "$file.zst.$level"
                done
        done
done

echo "Enabling Turbo"
echo 0 | sudo tee /sys/devices/system/cpu/intel_pstate/no_turbo

@terrelln
Contributor

The difference in decompression speed between levels 16, 17, 18, and 19 can be explained by the minimum match length.

There must be a few more 4-byte matches than 5-byte matches, and a lot more 3-byte matches.

@terrelln
Contributor

@KBentley57 would it be possible to regenerate an OpenVDB file containing data that you can share that has the same performance regression? We suspect it is either the file, or the CPU specs, but we're not sure which. Having the file would really help us narrow down the issue.

@KBentley57
Author

KBentley57 commented Sep 11, 2019

@Cyan4973 Thanks, I'll do what I can to help out, if it's needed.

@terrelln I am working on getting the OK for that right now. I can't promise it'll be tomorrow, but by Friday afternoon I should have a few test cases that you can try. Generally, we see compression ratios of anywhere from 1.4 and up. That depends on a few parameters, but I'd put it in the range (1.4, 2.0).

@KBentley57
Author

All,

Sorry for the delay. While I'm afraid I can't provide the actual data file yet, OpenVDB has many sample voxel files that are similar enough to my use case that I think they should be sufficient. Here is one that is nearly identical in file size and roughly the same density, etc.

https://nexus.aswf.io/content/repositories/releases/io/aswf/openvdb/models/smoke2.vdb/1.0.0/smoke2.vdb-1.0.0.zip

There are a few differences, notably that I'm using HalfFloat (an Industrial Light and Magic (ILM) component of OpenVDB), so that the leaf nodes in this structure would be twice as large. I'm not certain if that matters. They are storing two grids; their second grid is a vec3, whereas I'm storing a second grid of scalar values, again HalfFloat. The compression ratios for this file aren't nearly as high as for my data, achieving somewhere around 1.1.

Here are a few of my numbers on the old work machine. The case still remains that 1.4.0 O2 beats 1.4.3 O3 in a significant number of cases, though not by much. Under 1.4.3, O2 wins a few, loses a few, etc.

image

Here's a look at the differences between the two versions when compiled under O2 vs O3 (1.4.0 - 1.4.3).

image

@terrelln I effectively ran your script, and experienced similar results. BTW, I'm not sure if it was a copy/paste error, but there's a typo in the second if [ ... ] fi block where echi should be exit.

And finally, the tabular data in case it reveals anything more -

Level | 1.4.0 O3 | 1.4.0 O2 | 1.4.3 O3 | 1.4.3 O2
1 | 2592.3 | 2649.3 | 2578.2 | 2561.8
2 | 2348.3 | 2396.2 | 2355.3 | 2355.3
3 | 2204.3 | 2226.3 | 2218 | 2116.2
4 | 2174.1 | 2174.1 | 2152.8 | 2187.9
5 | 2085 | 2106.5 | 2053.3 | 1980.8
6 | 2044.3 | 2085 | 2064 | 2104.3
7 | 2209.5 | 2263.7 | 2218 | 2236.5
8 | 2150.9 | 2302.3 | 2256.8 | 2294.9
9 | 2302.9 | 2325.1 | 2287.3 | 2341.2
10 | 2271.8 | 2294.9 | 2264.7 | 2287.3
11 | 2248.8 | 2264.7 | 2264.7 | 2240.6
12 | 2294.9 | 2302.3 | 2279.6 | 2295.5
13 | 2317.8 | 2325.1 | 2348.3 | 2372
14 | 2302.3 | 2336 | 2355.3 | 2372
15 | 2325.1 | 2355.3 | 2378.8 | 2409.3
16 | 2226.3 | 2240.6 | 2287.3 | 2256.8
17 | 2054.6 | 2073.8 | 2125.6 | 2052.2
18 | 1358.8 | 1405.3 | 1402.9 | 1280.9
19 | 1465.1 | 1495.6 | 1421.3 | 1433.4
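
To put a number on "a significant number of cases", the following sketch tallies, level by level, how often 1.4.0 O2 is faster than 1.4.3 O3 in the table above. The values are copied from the 1.4.0 O2 and 1.4.3 O3 columns; the variable names are illustrative.

```python
# Decompression speeds for levels 1..19, copied from the table above.
V140_O2 = [2649.3, 2396.2, 2226.3, 2174.1, 2106.5, 2085.0, 2263.7, 2302.3,
           2325.1, 2294.9, 2264.7, 2302.3, 2325.1, 2336.0, 2355.3, 2240.6,
           2073.8, 1405.3, 1495.6]
V143_O3 = [2578.2, 2355.3, 2218.0, 2152.8, 2053.3, 2064.0, 2218.0, 2256.8,
           2287.3, 2264.7, 2264.7, 2279.6, 2348.3, 2355.3, 2378.8, 2287.3,
           2125.6, 1402.9, 1421.3]

pairs = list(zip(V140_O2, V143_O3))
wins = sum(a > b for a, b in pairs)     # levels where 1.4.0 O2 is faster
ties = sum(a == b for a, b in pairs)
losses = sum(a < b for a, b in pairs)   # levels where 1.4.3 O3 is faster

print(f"1.4.0 O2 vs 1.4.3 O3: {wins} wins, {ties} ties, {losses} losses")
```

Since this table comes from a single run per level, small margins (such as level 18) should be read with the same caution as the plots.
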

@Cyan4973
Contributor

We have a project to look more into these -O2 / -O3 differences.
It should start soon, and will give us better visibility into what's going on and what the best settings are.

On the specific issue of performance comparison between 1.4.0 and 1.4.3, the topic is becoming obsolete, because the next version, v1.4.4, features substantial changes to the decompression algorithm, resulting in dramatically better speed. As a consequence, the code already merged in the dev branch is no longer comparable.

I would suggest a test of the current code in dev to ensure that it does indeed provide better performance for your setup too.

@KBentley57
Author

@Cyan4973

I'll give that a look sometime soon. If that's the case, should this issue be marked as closed? I don't want it to run on forever now that it has fulfilled its purpose.

Thanks,

Kyle
