Use pure Go based ZSTD implementation #1176

Closed
jarifibrahim wants to merge 6 commits into master from ibrahim/klauspost-compress

Conversation


@jarifibrahim commented Dec 26, 2019

Fixes #1162

This PR proposes to use https://github.com/klauspost/compress/tree/master/zstd instead of the CGO-based https://github.com/DataDog/zstd.

This PR also removes the CompressionLevel options, since https://github.com/klauspost/compress/tree/master/zstd supports only two levels of ZSTD compression: the default level corresponds to ZSTD level 3, and the fastest corresponds to ZSTD level 1. ZSTD level 1 will be the default level in badger.

I've experimented with all the suggestions mentioned in klauspost/compress#196 (comment). Setting WithSingleSegment didn't seem to make much of a speed difference (~1 MB/s). WithNoEntropyCompression seemed to make a ~3% speed difference (but that could also be due to the non-deterministic nature of benchmarks); a sketch of these options follows the table below.

name                                 old time/op    new time/op (NoEntropy)   delta
Compression/ZSTD_-_Go_-_level1-16    35.7µs ± 1%    36.9µs ± 5%               +3.41%  (p=0.008 n=5+5)
Decompression/ZSTD_-_Go-16           16.0µs ± 0%    15.9µs ± 1%               -0.77%  (p=0.016 n=5+5)

name                                 old speed      new speed (NoEntropy)     delta
Compression/ZSTD_-_Go_-_level1-16    115MB/s ± 1%   111MB/s ± 5%              -3.24%  (p=0.008 n=5+5)
Decompression/ZSTD_-_Go-16           256MB/s ± 0%   258MB/s ± 1%              +0.78%  (p=0.016 n=5+5)
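
For reference, here's a minimal sketch (illustrative only, not the PR's actual code) of how the two supported levels and the options above are set with klauspost/compress/zstd:

package main

import (
	"fmt"

	"github.com/klauspost/compress/zstd"
)

func main() {
	data := []byte("some block of table data to compress")

	// SpeedFastest maps to roughly ZSTD level 1 (the proposed badger
	// default); SpeedDefault maps to roughly ZSTD level 3.
	enc, err := zstd.NewWriter(nil,
		zstd.WithEncoderLevel(zstd.SpeedFastest),
		// The two options experimented with above:
		zstd.WithSingleSegment(true),
		zstd.WithNoEntropyCompression(true),
	)
	if err != nil {
		panic(err)
	}
	compressed := enc.EncodeAll(data, nil)

	dec, err := zstd.NewReader(nil)
	if err != nil {
		panic(err)
	}
	decompressed, err := dec.DecodeAll(compressed, nil)
	if err != nil {
		panic(err)
	}
	fmt.Printf("compressed %d -> %d bytes\n", len(decompressed), len(compressed))
}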

Benchmarks

  1. Table Data (contains some randomly generated data).
Compression Ratio Datadog ZSTD level 1: 3.1993720565149135
Compression Ratio Datadog ZSTD level 3: 3.099619771863118

Compression Ratio Go ZSTD level 1 (fastest): 3.2170481452249406
Compression Ratio Go ZSTD level 3 (default): 3.1474903474903475
name                                        time/op
Compression/ZSTD_-_Datadog-level1-16    17.6µs ± 3%
Compression/ZSTD_-_Datadog-level3-16    20.7µs ± 3%

Compression/ZSTD_-_Go_-_level1-16       27.8µs ± 2%
Compression/ZSTD_-_Go_-_Default-16      39.1µs ± 1%

Decompression/ZSTD_-_Datadog-16         7.12µs ± 2%
Decompression/ZSTD_-_Go-16              13.7µs ± 2%

name                                       speed
Compression/ZSTD_-_Datadog-level1-16   231MB/s ± 3%
Compression/ZSTD_-_Datadog-level3-16   197MB/s ± 3%

Compression/ZSTD_-_Go_-_level1-16      147MB/s ± 2%
Compression/ZSTD_-_Go_-_Default-16     104MB/s ± 1%

Decompression/ZSTD_-_Datadog-16        573MB/s ± 2%
Decompression/ZSTD_-_Go-16             298MB/s ± 2%
  2. 4KB of text taken from https://gist.github.com/StevenClontz/4445774
Compression Ratio Datadog ZSTD level 1: 1.9294781382228492
Compression Ratio Datadog ZSTD level 3: 1.9322033898305084

Compression Ratio Go ZSTD level 1 (fastest): 1.894736842105263
Compression Ratio Go ZSTD level 3 (default): 1.927665570690465
name                                       time/op
Compression/ZSTD_-_Datadog-level1-16    22.7µs ± 4%
Compression/ZSTD_-_Datadog-level3-16    29.6µs ± 4%

Compression/ZSTD_-_Go_-_level1-16       35.7µs ± 1%
Compression/ZSTD_-_Go_-_Default-16      97.9µs ± 1%

Decompression/ZSTD_-_Datadog-16         8.36µs ± 0%
Decompression/ZSTD_-_Go-16              16.0µs ± 0%

name                                       speed
Compression/ZSTD_-_Datadog-level1-16   181MB/s ± 4%
Compression/ZSTD_-_Datadog-level3-16   139MB/s ± 4%

Compression/ZSTD_-_Go_-_level1-16      115MB/s ± 1%
Compression/ZSTD_-_Go_-_Default-16    41.9MB/s ± 1%

Decompression/ZSTD_-_Datadog-16        489MB/s ± 2%
Decompression/ZSTD_-_Go-16             256MB/s ± 0%

Here's the script I used: https://gist.github.com/jarifibrahim/91920e93d1ecac3006b269e0c05d6a24



@coveralls commented Dec 26, 2019

Coverage decreased (-0.09%) to 69.851% when pulling a288897 on ibrahim/klauspost-compress into 0f2e629 on master.

@jarifibrahim

I had a chat with @manishrjain and we've decided not to use the pure Go ZSTD implementation, because it's about 1.5x slower than the CGO-based implementation.

Compression/ZSTD_-_Datadog-level1-16    22.7µs ± 4%
Compression/ZSTD_-_Go_-_level1-16       35.7µs ± 1%

Compression/ZSTD_-_Datadog-level3-16    29.6µs ± 4%
Compression/ZSTD_-_Go_-_Default-level3-16      97.9µs ± 1%

Decompression/ZSTD_-_Datadog-16         8.36µs ± 0%
Decompression/ZSTD_-_Go-16              16.0µs ± 0%

@jarifibrahim deleted the ibrahim/klauspost-compress branch January 13, 2020 12:14
@klauspost

@jarifibrahim Reran your script.

BenchmarkComp/Compression/ZSTD_-_Datadog-32        31495             37848 ns/op         107.70 MB/s
BenchmarkComp/Compression/ZSTD_-_Go_-_Fastest-32                   49791             23325 ns/op         174.75 MB/s
BenchmarkComp/Compression/ZSTD_-_Go_-_Default-32                   30686             36927 ns/op         110.38 MB/s
BenchmarkComp/Decompression/ZSTD_-_Datadog-32                     166665              7074 ns/op         576.18 MB/s
BenchmarkComp/Decompression/ZSTD_-_Go-32                          109090             10542 ns/op         386.65 MB/s

So compression in both 'fast' and 'default' modes is faster than cgo default.

"1.5x slower" doesn't make sense to me. If you mean that runtime is 150% of cgo, that does seem to be the case for decompression on this particular payload.

This has a very skewed performance profile since blocks are so small. I can look into the 'very small block' decompression performance.

@jarifibrahim

Hey @klauspost, the 1.5x number (which is technically 2x in my original comment) was for the decompression speed. In badger, we compress a block once but we have to decompress it multiple times. Slow compression is okay since we compress in the background, but slow decompression would mean reads become slower.

@klauspost we'd love to use the pure Go implementation if we can improve the decompression speed. I don't have experience with compression algorithms, but if there's any way I can help, please do let me know.

Thanks for looking into this issue :)

@klauspost

@jarifibrahim It depends a lot on how you run the benchmark, and one or two payloads can skew the numbers significantly.

For example, look at these numbers:

BenchmarkComp/Decompression/ZSTD_-_Datadog-32             472735              2475 ns/op        1646.84 MB/s
BenchmarkComp/Decompression/ZSTD_-_Go-32                 1124998              1014 ns/op        4018.83 MB/s

These are the numbers from simply running the benchmark in parallel:

		b.Run("ZSTD - Datadog", func(b *testing.B) {
			b.SetBytes(int64(len(data)))
			b.RunParallel(func(pb *testing.PB) {
				buf := make([]byte, len(data))
				for pb.Next() {
					d, err := zstd.Decompress(buf, ZSTDCompressed)
					if err != nil {
						panic(err)
					}
					_ = d
				}
			})
		})
		b.Run("ZSTD - Go", func(b *testing.B) {
			b.SetBytes(int64(len(data)))
			dec, err := gozstd.NewReader(nil)
			if err != nil {
				panic(err)
			}
			b.ResetTimer()
			b.RunParallel(func(pb *testing.PB) {
				buf := make([]byte, len(data))
				for pb.Next() {
					d, err := dec.DecodeAll(ZSTDCompressed, buf[:0])
					if err != nil {
						panic(err)
					}
					_ = d
				}
			})
		})

With this small change the Go version beats the CGO version's decompression speed by more than 2x. I would say this leads to quite a different conclusion, and it is a bit closer to what you would see in the real world.

@jarifibrahim

@klauspost wow, I did not anticipate that. Why does the parallel version run so much faster? Or why is the CGO one significantly slower?

We have compression/decompression running in parallel all the time, and from this new benchmark it looks like the Go-based implementation would be very efficient. This is awesome.
Quick question: how big was your data for this decompression benchmark? 4 KB? Can you share your benchmark script?

@klauspost commented Jun 2, 2020

TBH I am a bit surprised myself ;) My guess is that the cgo version allocates new memory on every run and thrashes the cache.

It is your script linked above with only the lines above changed.

@klauspost

The Go version will only allocate GOMAXPROCS decompressors and reuse them across goroutines, thus limiting the total amount of memory used. The cgo version just allocates on every run.

I suspect actual performance is somewhere in between and not as extreme as seen above, since other stuff is going on between compression runs.
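
To illustrate the reuse pattern (a minimal sketch of the idea only; the real zstd.Decoder manages this internally, and DecodeAll is already safe for concurrent use), a bounded pool via a buffered channel looks roughly like this:

package zstdpool

import (
	"runtime"

	"github.com/klauspost/compress/zstd"
)

// decoderPool bounds memory use by keeping at most GOMAXPROCS decoders
// and reusing them across goroutines instead of allocating per call.
type decoderPool struct {
	ch chan *zstd.Decoder
}

func newDecoderPool() (*decoderPool, error) {
	p := &decoderPool{ch: make(chan *zstd.Decoder, runtime.GOMAXPROCS(0))}
	for i := 0; i < cap(p.ch); i++ {
		d, err := zstd.NewReader(nil)
		if err != nil {
			return nil, err
		}
		p.ch <- d
	}
	return p, nil
}

func (p *decoderPool) decompress(dst, src []byte) ([]byte, error) {
	d := <-p.ch                  // check out a decoder (blocks if all are busy)
	defer func() { p.ch <- d }() // return it for the next caller
	return d.DecodeAll(src, dst[:0])
}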

@jarifibrahim

@klauspost why are my results so different from yours?

 go test -run xxx -bench Comp/Decompression        
data size 4096
goos: linux
goarch: amd64
pkg: github.com/dgraph-io/badger/v2/table
BenchmarkComp/Decompression/Snappy-8  	  756020	      1607 ns/op	2549.00 MB/s
BenchmarkComp/Decompression/LZ4-8     	 1399564	       862 ns/op	4749.56 MB/s
BenchmarkComp/Decompression/ZSTD_-_Datadog-8         	  426633	      2859 ns/op	1432.49 MB/s
BenchmarkComp/Decompression/ZSTD_-_Go-8              	  194971	      6082 ns/op	 673.50 MB/s
PASS
ok  	github.com/dgraph-io/badger/v2/table	7.432s

	b.Run("Decompression", func(b *testing.B) {
		buf := make([]byte, len(data))
		b.Run("Snappy", func(b *testing.B) {
			b.SetBytes(int64(len(data)))
			b.ResetTimer()
			b.RunParallel(func(pb *testing.PB) {
				for pb.Next() {
					d, err := snappy.Decode(buf, snappyCompressed)
					if err != nil {
						panic(err)
					}
					_ = d
					if validate {
						require.Equal(b, d, data)
					}
				}
			})
		})
		b.Run("LZ4", func(b *testing.B) {
			b.SetBytes(int64(len(data)))
			b.ResetTimer()
			b.RunParallel(func(pb *testing.PB) {
				for pb.Next() {
					n, err := lz4.UncompressBlock(LZ4Compressed, buf)
					if err != nil {
						fmt.Println(err)
					}
					buf = buf[:n] // uncompressed data
					if validate {
						require.Equal(b, buf, data)
					}
				}
			})
		})
		b.Run("ZSTD - Datadog", func(b *testing.B) {
			b.SetBytes(int64(len(data)))
			b.ResetTimer()
			b.RunParallel(func(pb *testing.PB) {
				for pb.Next() {
					d, err := zstd.Decompress(buf, ZSTDCompressed)
					if err != nil {
						panic(err)
					}
					_ = d
					if validate {
						require.Equal(b, d, data)
					}
				}
			})
		})
		b.Run("ZSTD - Go", func(b *testing.B) {
			b.SetBytes(int64(len(data)))
			dec, err := gozstd.NewReader(nil)
			if err != nil {
				panic(err)
			}
			b.ResetTimer()
			b.RunParallel(func(pb *testing.PB) {
				for pb.Next() {
					d, err := dec.DecodeAll(ZSTDCompressed, buf[:0])
					if err != nil {
						panic(err)
					}
					_ = d
					if validate {
						require.Equal(b, d, data)
					}
				}
			})
		})
	})

@klauspost

You are sharing the output buffer between goroutines.

Also, are you using v1.10.7?
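
The fix, following the earlier snippet, is to move the allocation inside RunParallel so each goroutine gets its own output buffer:

	b.RunParallel(func(pb *testing.PB) {
		buf := make([]byte, len(data)) // per-goroutine output buffer
		for pb.Next() {
			d, err := dec.DecodeAll(ZSTDCompressed, buf[:0])
			if err != nil {
				panic(err)
			}
			_ = d
		}
	})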

@klauspost commented Jun 2, 2020

@jarifibrahim Other than that, the core count 4 vs. 16 is probably making a difference. The cgo version seems to saturate at a low concurrency level.

@jarifibrahim

> You are sharing the output buffer between goroutines.

Yeah. Fixed that.

> Also, are you using v1.10.7?

Yes, I'm using v1.10.7.

> @jarifibrahim Other than that, the core count 4 vs. 16 is probably making a difference. The cgo version seems to saturate at a low concurrency level.

That could be the reason. Let me try this on a different machine and get back.

@klauspost commented Jun 2, 2020

Cool. I'm using the generated table, btw.

edit: Weird, only seeing 2863.11 MB/s now. Still nice, but makes me wonder what happened ;)
edit 2: ok, with more stuff closed it is at 3740.22 MB/s... benchmarking sucks.

cgo remains at ~1700 MB/s.

@klauspost

go test -bench=Comp/Decompression/ZST -cpu=1,2,4,8,16,32 -test.run=none

BenchmarkComp/Decompression/ZSTD_-_Datadog                166670              7188 ns/op         567.07 MB/s
BenchmarkComp/Decompression/ZSTD_-_Datadog-2              324328              3648 ns/op        1117.47 MB/s
BenchmarkComp/Decompression/ZSTD_-_Datadog-4              585801              2038 ns/op        1999.77 MB/s
BenchmarkComp/Decompression/ZSTD_-_Datadog-8             1034521              1147 ns/op        3552.41 MB/s
BenchmarkComp/Decompression/ZSTD_-_Datadog-16             666648              1812 ns/op        2249.38 MB/s
BenchmarkComp/Decompression/ZSTD_-_Datadog-32             521748              2382 ns/op        1710.90 MB/s
BenchmarkComp/Decompression/ZSTD_-_Go                     114285             10220 ns/op         398.82 MB/s
BenchmarkComp/Decompression/ZSTD_-_Go-2                   230766              5148 ns/op         791.75 MB/s
BenchmarkComp/Decompression/ZSTD_-_Go-4                   461523              2641 ns/op        1543.21 MB/s
BenchmarkComp/Decompression/ZSTD_-_Go-8                   800047              1365 ns/op        2986.26 MB/s
BenchmarkComp/Decompression/ZSTD_-_Go-16                 1291708               927 ns/op        4398.49 MB/s
BenchmarkComp/Decompression/ZSTD_-_Go-32                  999940              1010 ns/op        4035.40 MB/s

It seems that on my system the switchover is somewhere between 8 and 16 cores.

@klauspost commented Jun 2, 2020

Interestingly, in the concurrent setup, 40% of all CPU time is spent dealing with channels trying to get decoders when blocks are this small. Rather surprising.

There should be a reasonable chance for a quick win there.

@klauspost commented Jun 2, 2020

Managed to reduce the number of channel operations from 2 to 1, which gave a nice boost on the top end:

./ZSTD_-_Datadog                171428              6983 ns/op         583.74 MB/s
./ZSTD_-_Datadog-2              324331              3574 ns/op        1140.62 MB/s
./ZSTD_-_Datadog-4              631578              1947 ns/op        2093.86 MB/s
./ZSTD_-_Datadog-8              909066              1122 ns/op        3632.53 MB/s
./ZSTD_-_Datadog-16             749985              1576 ns/op        2586.24 MB/s
./ZSTD_-_Datadog-32             571471              2135 ns/op        1909.28 MB/s
./ZSTD_-_Go                     124996              9442 ns/op         431.68 MB/s
./ZSTD_-_Go-2                   249990              4733 ns/op         861.27 MB/s
./ZSTD_-_Go-4                   480024              2389 ns/op        1705.89 MB/s
./ZSTD_-_Go-8                  1000000              1237 ns/op        3294.05 MB/s
./ZSTD_-_Go-16                 1591513               745 ns/op        5469.66 MB/s
./ZSTD_-_Go-32                12779566               971 ns/op        4196.37 MB/s

(edited for ease of reading)

Also, for higher concurrency I find that longer benchmark times make the results more stable: -benchtime=10s
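
For example, combined with the invocation above:

go test -bench=Comp/Decompression/ZST -cpu=1,2,4,8,16,32 -benchtime=10s -test.run=none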

@klauspost commented Jun 3, 2020

FYI, I found about a 5% improvement when looking at small blocks single-threaded: klauspost/compress#265

Most of it is bounds-check eliminations, so it doesn't show up too much in single-file benchmarks, but it will affect performance with a 'cold' branch predictor.
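
As a generic illustration of the idea (a hypothetical example, not the actual patch): re-slicing once up front proves the slice length to the compiler, so the individual index expressions below compile without per-access bounds checks.

func readUint32(b []byte, i int) uint32 {
	b = b[i : i+4] // single bounds check here
	return uint32(b[0]) | uint32(b[1])<<8 | uint32(b[2])<<16 | uint32(b[3])<<24
}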

Fuzz testing is looking good, so it will probably be merged soon.

@klauspost

Managed to get single-block decodes up to around a 1.15x speedup. The test payload from here ran at 1.12x the previous speed.

@jarifibrahim

@klauspost

> edit 2: ok, with more stuff closed it is at 3740.22 MB/s... benchmarking sucks.

Ouch, tell me about it :) You clearly have a much better understanding of benchmarking than I do; you should write a blog post about it. I'd love to read it.

> Managed to get single-block decodes up to around a 1.15x speedup. The test payload from here ran at 1.12x the previous speed.

This is awesome. If I understand correctly, the new implementation is 15% faster than in the last benchmark we ran against v1.10.7. Is that right? I didn't get a chance to run the benchmark on a 16-core machine, but I'll do that once your PR klauspost/compress#265 is merged.

> Cool. I'm using the generated table, btw.

I was using the first 4KB from mobydick. I'll run the next benchmark with the generated table 👍

@klauspost

@jarifibrahim Yeah. It will probably be merged soon. I will fuzz test a bit more before doing a release.

Tried another couple of changes today, but no gains.

@jarifibrahim

@klauspost what else do you do to stabilize benchmark results, apart from increasing the benchtime?

@klauspost

@jarifibrahim I could do a long talk on that :)

Other than getting a thermally stable CPU, I tend to use many short (1s) benchmarks instead of a single long one. As you can see in my bench, when working with compression you often see regressions in one case and improvements in others, so having a diverse test set is more important than a single stable one.

So I look more for general trends, but I basically benchmark every single change along the way, since it is almost impossible to predict.

So in the bench above the trend is clear, but html_x_4.zst shows a minor regression. I am not super worried since it is fast to begin with. Mostly I am looking to improve the worst cases since they have a bigger impact.
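
For instance, an illustrative workflow (the delta tables earlier in this thread are in benchstat's output format):

go test -bench=Comp -count=10 -benchtime=1s > old.txt
# ...apply the change under test...
go test -bench=Comp -count=10 -benchtime=1s > new.txt
benchstat old.txt new.txt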

@jarifibrahim commented Jun 5, 2020

Thanks @klauspost! The explanation was very helpful.

@klauspost

@jarifibrahim I have merged it and released v1.10.8.

@jarifibrahim

> @jarifibrahim I have merged it and released v1.10.8.

Got it 👍
