Use pure Go based ZSTD implementation #1176

Closed
jarifibrahim wants to merge 6 commits into master from ibrahim/klauspost-compress

Conversation


@jarifibrahim commented Dec 26, 2019

Fixes #1162

This PR proposes to use https://github.com/klauspost/compress/tree/master/zstd instead of the CGO-based https://github.com/DataDog/zstd.

This PR also removes the CompressionLevel options, since https://github.com/klauspost/compress/tree/master/zstd supports only two levels of ZSTD compression: the default level corresponds to ZSTD level 3, and the fastest corresponds to ZSTD level 1. ZSTD level 1 will be the default level in badger.

I've experimented with all the suggestions mentioned in klauspost/compress#196 (comment). Setting WithSingleSegment didn't seem to make much of a speed difference (~1 MB/s). WithNoEntropyCompression seemed to make a ~3% speed difference (but that could also be due to the non-deterministic nature of benchmarks); a sketch of these options follows the table below.

name                                 old time/op    new time/op (NoEntropy)   delta
Compression/ZSTD_-_Go_-_level1-16    35.7µs ± 1%    36.9µs ± 5%               +3.41%  (p=0.008 n=5+5)
Decompression/ZSTD_-_Go-16           16.0µs ± 0%    15.9µs ± 1%               -0.77%  (p=0.016 n=5+5)

name                                 old speed      new speed (NoEntropy)     delta
Compression/ZSTD_-_Go_-_level1-16    115MB/s ± 1%   111MB/s ± 5%              -3.24%  (p=0.008 n=5+5)
Decompression/ZSTD_-_Go-16           256MB/s ± 0%   258MB/s ± 1%              +0.78%  (p=0.016 n=5+5)
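
For reference, here's a minimal sketch (illustrative only, not the PR's actual code) of how the two supported levels and the options above are set with klauspost/compress/zstd:

package main

import (
	"fmt"

	"github.com/klauspost/compress/zstd"
)

func main() {
	data := []byte("some block of table data to compress")

	// SpeedFastest maps to roughly ZSTD level 1 (the proposed badger
	// default); SpeedDefault maps to roughly ZSTD level 3.
	enc, err := zstd.NewWriter(nil,
		zstd.WithEncoderLevel(zstd.SpeedFastest),
		// The two options experimented with above:
		zstd.WithSingleSegment(true),
		zstd.WithNoEntropyCompression(true),
	)
	if err != nil {
		panic(err)
	}
	compressed := enc.EncodeAll(data, nil)

	dec, err := zstd.NewReader(nil)
	if err != nil {
		panic(err)
	}
	decompressed, err := dec.DecodeAll(compressed, nil)
	if err != nil {
		panic(err)
	}
	fmt.Printf("compressed %d -> %d bytes\n", len(decompressed), len(compressed))
}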

Benchmarks

  1. Table Data (contains some randomly generated data).
Compression Ratio Datadog ZSTD level 1: 3.1993720565149135
Compression Ratio Datadog ZSTD level 3: 3.099619771863118

Compression Ratio Go ZSTD level 1 (fastest): 3.2170481452249406
Compression Ratio Go ZSTD level 3 (default): 3.1474903474903475
name                                        time/op
Compression/ZSTD_-_Datadog-level1-16    17.6µs ± 3%
Compression/ZSTD_-_Datadog-level3-16    20.7µs ± 3%

Compression/ZSTD_-_Go_-_level1-16       27.8µs ± 2%
Compression/ZSTD_-_Go_-_Default-16      39.1µs ± 1%

Decompression/ZSTD_-_Datadog-16         7.12µs ± 2%
Decompression/ZSTD_-_Go-16              13.7µs ± 2%

name                                       speed
Compression/ZSTD_-_Datadog-level1-16   231MB/s ± 3%
Compression/ZSTD_-_Datadog-level3-16   197MB/s ± 3%

Compression/ZSTD_-_Go_-_level1-16      147MB/s ± 2%
Compression/ZSTD_-_Go_-_Default-16     104MB/s ± 1%

Decompression/ZSTD_-_Datadog-16        573MB/s ± 2%
Decompression/ZSTD_-_Go-16             298MB/s ± 2%
  2. 4KB of text taken from https://gist.github.com/StevenClontz/4445774
Compression Ratio Datadog ZSTD level 1: 1.9294781382228492
Compression Ratio Datadog ZSTD level 3: 1.9322033898305084

Compression Ratio Go ZSTD level 1 (fastest): 1.894736842105263
Compression Ratio Go ZSTD level 3 (default): 1.927665570690465
name                                       time/op
Compression/ZSTD_-_Datadog-level1-16    22.7µs ± 4%
Compression/ZSTD_-_Datadog-level3-16    29.6µs ± 4%

Compression/ZSTD_-_Go_-_level1-16       35.7µs ± 1%
Compression/ZSTD_-_Go_-_Default-16      97.9µs ± 1%

Decompression/ZSTD_-_Datadog-16         8.36µs ± 0%
Decompression/ZSTD_-_Go-16              16.0µs ± 0%

name                                       speed
Compression/ZSTD_-_Datadog-level1-16   181MB/s ± 4%
Compression/ZSTD_-_Datadog-level3-16   139MB/s ± 4%

Compression/ZSTD_-_Go_-_level1-16      115MB/s ± 1%
Compression/ZSTD_-_Go_-_Default-16    41.9MB/s ± 1%

Decompression/ZSTD_-_Datadog-16        489MB/s ± 2%
Decompression/ZSTD_-_Go-16             256MB/s ± 0%

Here's the script I used: https://gist.github.com/jarifibrahim/91920e93d1ecac3006b269e0c05d6a24



@coveralls commented Dec 26, 2019

Coverage decreased (-0.09%) to 69.851% when pulling a288897 on ibrahim/klauspost-compress into 0f2e629 on master.

@jarifibrahim

I had a chat with @manishrjain and we've decided not to use the pure Go ZSTD implementation, because it's about 1.5x slower than the CGO-based implementation.

Compression/ZSTD_-_Datadog-level1-16    22.7µs ± 4%
Compression/ZSTD_-_Go_-_level1-16       35.7µs ± 1%

Compression/ZSTD_-_Datadog-level3-16    29.6µs ± 4%
Compression/ZSTD_-_Go_-_Default-level3-16      97.9µs ± 1%

Decompression/ZSTD_-_Datadog-16         8.36µs ± 0%
Decompression/ZSTD_-_Go-16              16.0µs ± 0%

@jarifibrahim deleted the ibrahim/klauspost-compress branch January 13, 2020 12:14
@klauspost

@jarifibrahim Reran your script.

BenchmarkComp/Compression/ZSTD_-_Datadog-32        31495             37848 ns/op         107.70 MB/s
BenchmarkComp/Compression/ZSTD_-_Go_-_Fastest-32                   49791             23325 ns/op         174.75 MB/s
BenchmarkComp/Compression/ZSTD_-_Go_-_Default-32                   30686             36927 ns/op         110.38 MB/s
BenchmarkComp/Decompression/ZSTD_-_Datadog-32                     166665              7074 ns/op         576.18 MB/s
BenchmarkComp/Decompression/ZSTD_-_Go-32                          109090             10542 ns/op         386.65 MB/s

So compression in both 'fast' and 'default' modes is faster than cgo default.

"1.5x slower" doesn't make sense to me. If you mean that runtime is 150% of cgo, that does seem to be the case for decompression on this particular payload.

This has a very skewed performance profile since blocks are so small. I can look into the 'very small block' decompression performance.

@jarifibrahim

Hey @klauspost, the 1.5x number (which is technically 2x in my original comment) was for the decompression speed. In badger, we compress a block once but we have to decompress it multiple times. Slow compression is okay since we compress in the background, but slow decompression would mean reads become slower.

@klauspost we'd love to use the pure Go implementation if we can improve the decompression speed. I don't have experience with compression algorithms, but if there's any way I can help, please do let me know.

Thanks for looking into this issue :)

@klauspost

@jarifibrahim It depends a lot on how you run the benchmark, and one or two payloads can skew the numbers significantly.

For example, look at these numbers:

BenchmarkComp/Decompression/ZSTD_-_Datadog-32             472735              2475 ns/op        1646.84 MB/s
BenchmarkComp/Decompression/ZSTD_-_Go-32                 1124998              1014 ns/op        4018.83 MB/s

These are the numbers from simply running the benchmark in parallel:

		b.Run("ZSTD - Datadog", func(b *testing.B) {
			b.SetBytes(int64(len(data)))
			b.RunParallel(func(pb *testing.PB) {
				buf := make([]byte, len(data))
				for pb.Next() {
					d, err := zstd.Decompress(buf, ZSTDCompressed)
					if err != nil {
						panic(err)
					}
					_ = d
				}
			})
		})
		b.Run("ZSTD - Go", func(b *testing.B) {
			b.SetBytes(int64(len(data)))
			dec, err := gozstd.NewReader(nil)
			if err != nil {
				panic(err)
			}
			b.ResetTimer()
			b.RunParallel(func(pb *testing.PB) {
				buf := make([]byte, len(data))
				for pb.Next() {
					d, err := dec.DecodeAll(ZSTDCompressed, buf[:0])
					if err != nil {
						panic(err)
					}
					_ = d
				}
			})
		})

With this small change the Go version beats the CGO version's decompression speed by more than 2x. I would say this leads to quite a different conclusion, and it is a bit closer to what you would see in the real world.

@jarifibrahim

@klauspost wow, I did not anticipate that. Why does the parallel version run so much faster? Or why is the CGO one significantly slower?

We have compression/decompression running in parallel all the time, and from this new benchmark it looks like the Go-based implementation would be very efficient. This is awesome.
Quick question: how big was your data for this decompression benchmark? 4 KB? Can you share your benchmark script?

@klauspost commented Jun 2, 2020

TBH I am a bit surprised myself ;) My guess is that the cgo version allocates new memory on every run and thrashes the cache.

It is your script linked above with only the lines above changed.

@klauspost

The Go version will only allocate GOMAXPROCS decompressors and reuse them across goroutines, thus limiting the total amount of memory used. The cgo version just allocates on every run.

I suspect actual performance is somewhere in between and not as extreme as seen above, since other stuff is going on between compression runs.
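
To illustrate the reuse pattern (a minimal sketch of the idea only; the real zstd.Decoder manages this internally, and DecodeAll is already safe for concurrent use), a bounded pool via a buffered channel looks roughly like this:

package zstdpool

import (
	"runtime"

	"github.com/klauspost/compress/zstd"
)

// decoderPool bounds memory use by keeping at most GOMAXPROCS decoders
// and reusing them across goroutines instead of allocating per call.
type decoderPool struct {
	ch chan *zstd.Decoder
}

func newDecoderPool() (*decoderPool, error) {
	p := &decoderPool{ch: make(chan *zstd.Decoder, runtime.GOMAXPROCS(0))}
	for i := 0; i < cap(p.ch); i++ {
		d, err := zstd.NewReader(nil)
		if err != nil {
			return nil, err
		}
		p.ch <- d
	}
	return p, nil
}

func (p *decoderPool) decompress(dst, src []byte) ([]byte, error) {
	d := <-p.ch                  // check out a decoder (blocks if all are busy)
	defer func() { p.ch <- d }() // return it for the next caller
	return d.DecodeAll(src, dst[:0])
}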

@jarifibrahim

@klauspost why are my results so different from yours?

 go test -run xxx -bench Comp/Decompression        
data size 4096
goos: linux
goarch: amd64
pkg: github.com/dgraph-io/badger/v2/table
BenchmarkComp/Decompression/Snappy-8  	  756020	      1607 ns/op	2549.00 MB/s
BenchmarkComp/Decompression/LZ4-8     	 1399564	       862 ns/op	4749.56 MB/s
BenchmarkComp/Decompression/ZSTD_-_Datadog-8         	  426633	      2859 ns/op	1432.49 MB/s
BenchmarkComp/Decompression/ZSTD_-_Go-8              	  194971	      6082 ns/op	 673.50 MB/s
PASS
ok  	github.com/dgraph-io/badger/v2/table	7.432s

	b.Run("Decompression", func(b *testing.B) {
		buf := make([]byte, len(data))
		b.Run("Snappy", func(b *testing.B) {
			b.SetBytes(int64(len(data)))
			b.ResetTimer()
			b.RunParallel(func(pb *testing.PB) {
				for pb.Next() {
					d, err := snappy.Decode(buf, snappyCompressed)
					if err != nil {
						panic(err)
					}
					_ = d
					if validate {
						require.Equal(b, d, data)
					}
				}
			})
		})
		b.Run("LZ4", func(b *testing.B) {
			b.SetBytes(int64(len(data)))
			b.ResetTimer()
			b.RunParallel(func(pb *testing.PB) {
				for pb.Next() {
					n, err := lz4.UncompressBlock(LZ4Compressed, buf)
					if err != nil {
						fmt.Println(err)
					}
					buf = buf[:n] // uncompressed data
					if validate {
						require.Equal(b, buf, data)
					}
				}
			})
		})
		b.Run("ZSTD - Datadog", func(b *testing.B) {
			b.SetBytes(int64(len(data)))
			b.ResetTimer()
			b.RunParallel(func(pb *testing.PB) {
				for pb.Next() {
					d, err := zstd.Decompress(buf, ZSTDCompressed)
					if err != nil {
						panic(err)
					}
					_ = d
					if validate {
						require.Equal(b, d, data)
					}
				}
			})
		})
		b.Run("ZSTD - Go", func(b *testing.B) {
			b.SetBytes(int64(len(data)))
			dec, err := gozstd.NewReader(nil)
			if err != nil {
				panic(err)
			}
			b.ResetTimer()
			b.RunParallel(func(pb *testing.PB) {
				for pb.Next() {
					d, err := dec.DecodeAll(ZSTDCompressed, buf[:0])
					if err != nil {
						panic(err)
					}
					_ = d
					if validate {
						require.Equal(b, d, data)
					}
				}
			})
		})
	})

@klauspost

You are sharing the output buffer between goroutines.

Also, are you using v1.10.7?
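
The fix, following the earlier snippet, is to move the allocation inside RunParallel so each goroutine gets its own output buffer:

	b.RunParallel(func(pb *testing.PB) {
		buf := make([]byte, len(data)) // per-goroutine output buffer
		for pb.Next() {
			d, err := dec.DecodeAll(ZSTDCompressed, buf[:0])
			if err != nil {
				panic(err)
			}
			_ = d
		}
	})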

@klauspost commented Jun 2, 2020

@jarifibrahim Other than that, the core count 4 vs. 16 is probably making a difference. The cgo version seems to saturate at a low concurrency level.

@jarifibrahim

> You are sharing the output buffer between goroutines.

Yeah. Fixed that.

> Also, are you using v1.10.7?

Yes, I'm using v1.10.7.

> @jarifibrahim Other than that, the core count 4 vs. 16 is probably making a difference. The cgo version seems to saturate at a low concurrency level.

That could be the reason. Let me try this on a different machine and get back.

@klauspost commented Jun 2, 2020

Cool. I'm using the generated table, btw.

edit: Weird, only seeing 2863.11 MB/s now. Still nice, but makes me wonder what happened ;)
edit 2: ok, with more stuff closed it is at 3740.22 MB/s... benchmarking sucks.

cgo remains at ~1700 MB/s.

@klauspost

go test -bench=Comp/Decompression/ZST -cpu=1,2,4,8,16,32 -test.run=none

BenchmarkComp/Decompression/ZSTD_-_Datadog                166670              7188 ns/op         567.07 MB/s
BenchmarkComp/Decompression/ZSTD_-_Datadog-2              324328              3648 ns/op        1117.47 MB/s
BenchmarkComp/Decompression/ZSTD_-_Datadog-4              585801              2038 ns/op        1999.77 MB/s
BenchmarkComp/Decompression/ZSTD_-_Datadog-8             1034521              1147 ns/op        3552.41 MB/s
BenchmarkComp/Decompression/ZSTD_-_Datadog-16             666648              1812 ns/op        2249.38 MB/s
BenchmarkComp/Decompression/ZSTD_-_Datadog-32             521748              2382 ns/op        1710.90 MB/s
BenchmarkComp/Decompression/ZSTD_-_Go                     114285             10220 ns/op         398.82 MB/s
BenchmarkComp/Decompression/ZSTD_-_Go-2                   230766              5148 ns/op         791.75 MB/s
BenchmarkComp/Decompression/ZSTD_-_Go-4                   461523              2641 ns/op        1543.21 MB/s
BenchmarkComp/Decompression/ZSTD_-_Go-8                   800047              1365 ns/op        2986.26 MB/s
BenchmarkComp/Decompression/ZSTD_-_Go-16                 1291708               927 ns/op        4398.49 MB/s
BenchmarkComp/Decompression/ZSTD_-_Go-32                  999940              1010 ns/op        4035.40 MB/s

It seems that on my system the switchover is somewhere between 8 and 16 cores.

@klauspost commented Jun 2, 2020

Interestingly, in the concurrent setup, 40% of all CPU time is spent dealing with channels trying to get decoders when blocks are this small. Rather surprising.

There should be a reasonable chance for a quick win there.

@klauspost commented Jun 2, 2020

Managed to reduce the number of channel operations from 2 to 1, which gave a nice boost on the top end:

./ZSTD_-_Datadog                171428              6983 ns/op         583.74 MB/s
./ZSTD_-_Datadog-2              324331              3574 ns/op        1140.62 MB/s
./ZSTD_-_Datadog-4              631578              1947 ns/op        2093.86 MB/s
./ZSTD_-_Datadog-8              909066              1122 ns/op        3632.53 MB/s
./ZSTD_-_Datadog-16             749985              1576 ns/op        2586.24 MB/s
./ZSTD_-_Datadog-32             571471              2135 ns/op        1909.28 MB/s
./ZSTD_-_Go                     124996              9442 ns/op         431.68 MB/s
./ZSTD_-_Go-2                   249990              4733 ns/op         861.27 MB/s
./ZSTD_-_Go-4                   480024              2389 ns/op        1705.89 MB/s
./ZSTD_-_Go-8                  1000000              1237 ns/op        3294.05 MB/s
./ZSTD_-_Go-16                 1591513               745 ns/op        5469.66 MB/s
./ZSTD_-_Go-32                12779566               971 ns/op        4196.37 MB/s

(edited for ease of reading)

Also, for higher concurrency I find that longer benchmark times make the results more stable: -benchtime=10s
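
For example, combined with the invocation above:

go test -bench=Comp/Decompression/ZST -cpu=1,2,4,8,16,32 -benchtime=10s -test.run=none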

@klauspost commented Jun 3, 2020

FYI, I found about a 5% improvement when looking at small blocks single-threaded: klauspost/compress#265

Most of it is bounds-check eliminations, so it doesn't show up too much in single-file benchmarks, but it will affect performance with a 'cold' branch predictor.
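
As a generic illustration of the idea (a hypothetical example, not the actual patch): re-slicing once up front proves the slice length to the compiler, so the individual index expressions below compile without per-access bounds checks.

func readUint32(b []byte, i int) uint32 {
	b = b[i : i+4] // single bounds check here
	return uint32(b[0]) | uint32(b[1])<<8 | uint32(b[2])<<16 | uint32(b[3])<<24
}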

Fuzz testing is looking good, so it will probably be merged soon.

@klauspost

Managed to get single-block decodes up to around a 1.15x speedup. The test payload from here ran at 1.12x the previous speed.

@jarifibrahim

@klauspost

> edit 2: ok, with more stuff closed it is at 3740.22 MB/s... benchmarking sucks.

Ouch, tell me about it :) You clearly have a much better understanding of benchmarking than I do; you should write a blog post about it. I'd love to read it.

> Managed to get single-block decodes up to around a 1.15x speedup. The test payload from here ran at 1.12x the previous speed.

This is awesome. If I understand correctly, the new implementation is 15% faster than in the last benchmark we ran against v1.10.7. Is that right? I didn't get a chance to run the benchmark on a 16-core machine, but I'll do that once your PR klauspost/compress#265 is merged.

> Cool. I'm using the generated table, btw.

I was using the first 4KB from mobydick. I'll run the next benchmark with the generated table 👍

@klauspost

@jarifibrahim Yeah. It will probably be merged soon. I will fuzz test a bit more before doing a release.

Tried another couple of changes today, but no gains.

@jarifibrahim

@klauspost what else do you do to stabilize benchmark results, apart from increasing the benchtime?

@klauspost

@jarifibrahim I could do a long talk on that :)

Other than getting a thermally stable CPU, I tend to use many short (1s) benchmarks instead of a single long one. As you can see in my bench, when working with compression you often see regressions in one case and improvements in others, so having a diverse test set is more important than a single stable one.

So I look more for general trends, but I basically benchmark every single change along the way, since it is almost impossible to predict.

So in the bench above the trend is clear, but html_x_4.zst shows a minor regression. I am not super worried since it is fast to begin with. Mostly I am looking to improve the worst cases since they have a bigger impact.
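
For instance, an illustrative workflow (the delta tables earlier in this thread are in benchstat's output format):

go test -bench=Comp -count=10 -benchtime=1s > old.txt
# ...apply the change under test...
go test -bench=Comp -count=10 -benchtime=1s > new.txt
benchstat old.txt new.txt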

@jarifibrahim commented Jun 5, 2020

Thanks @klauspost! The explanation was very helpful.

@klauspost

@jarifibrahim I have merged it and released v1.10.8.

@jarifibrahim

> @jarifibrahim I have merged it and released v1.10.8.

Got it 👍
