Use pure Go based ZSTD implementation #1176
Conversation
I had a chat with @manishrjain and we've decided to not use the pure Go based implementation for now.
@jarifibrahim Reran your script.
So compression in both 'fast' and 'default' is faster than cgo default. "1.5x slower" doesn't make sense to me. If you mean that runtime is 150% of cgo, that does seem to be the case for decompression on this particular payload. This has a very skewed performance profile since blocks are so small. I can look into the 'very small block' decompression performance.
Hey @klauspost, we'd love to use the pure Go based implementation if we can improve the decompression speed. I don't have experience with compression algorithms, but if there's any way I can help, please do let me know. Thanks for looking into this issue :)
@jarifibrahim It depends so much on how you run the benchmark, and one (or two) payloads can skew the numbers significantly. For example, look at these numbers:

These are the numbers from simply running a parallel benchmark:

```go
b.Run("ZSTD - Datadog", func(b *testing.B) {
	b.SetBytes(int64(len(data)))
	b.RunParallel(func(pb *testing.PB) {
		// Each goroutine allocates its own output buffer.
		buf := make([]byte, len(data))
		for pb.Next() {
			d, err := zstd.Decompress(buf, ZSTDCompressed)
			if err != nil {
				panic(err)
			}
			_ = d
		}
	})
})
b.Run("ZSTD - Go", func(b *testing.B) {
	b.SetBytes(int64(len(data)))
	// One shared decoder, reused by all goroutines via DecodeAll.
	dec, err := gozstd.NewReader(nil)
	if err != nil {
		panic(err)
	}
	b.ResetTimer()
	b.RunParallel(func(pb *testing.PB) {
		// Each goroutine allocates its own output buffer.
		buf := make([]byte, len(data))
		for pb.Next() {
			d, err := dec.DecodeAll(ZSTDCompressed, buf[:0])
			if err != nil {
				panic(err)
			}
			_ = d
		}
	})
})
```

With this small change the Go version beats the decompression speed of the CGO version by more than 2x. I would say this leads to a quite different conclusion, AND it is a bit closer to what you would see in the real world.
@klauspost wow, I did not anticipate that. Why does the parallel version work so much faster? Or why is the CGO one significantly slower? We have compression/decompression running in parallel all the time, and from this new benchmark it looks like the Go based implementation would be very efficient. This is awesome.
TBH I am a bit surprised myself ;) My guess is that the cgo version allocates new memory on every run and thrashes the cache. It is your script linked above with only the lines above changed.
The Go version will only allocate GOMAXPROCS decompressors and re-use them across goroutines, thus limiting the total amount of memory used. The cgo version just allocates on every run. I suspect actual performance is somewhere in between and not as extreme as seen above, since other stuff is going on between compression runs.
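To make that reuse pattern concrete, here is a minimal, self-contained sketch (my illustration, not code from this thread) of one decoder shared across callers with klauspost/compress; `WithDecoderConcurrency` is the library's option for sizing the internal decoder pool:

```go
package main

import (
	"fmt"
	"runtime"

	"github.com/klauspost/compress/zstd"
)

func main() {
	// Compress something to decode. EncodeAll appends to dst.
	enc, err := zstd.NewWriter(nil)
	if err != nil {
		panic(err)
	}
	compressed := enc.EncodeAll([]byte("hello zstd"), nil)
	enc.Close()

	// One shared Decoder. With a nil reader it is used only via
	// DecodeAll, which is safe for concurrent use; internally it keeps
	// a bounded pool of block decoders that goroutines check out.
	dec, err := zstd.NewReader(nil,
		zstd.WithDecoderConcurrency(runtime.GOMAXPROCS(0)))
	if err != nil {
		panic(err)
	}
	defer dec.Close()

	// Callers pass their own destination buffer (here nil), so the
	// only shared state is the pooled decoders; memory stays bounded
	// no matter how many goroutines call DecodeAll.
	out, err := dec.DecodeAll(compressed, nil)
	if err != nil {
		panic(err)
	}
	fmt.Println(string(out))
}
```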
@klauspost why are my results so different from yours?

```go
b.Run("Decompression", func(b *testing.B) {
	// NOTE: this buffer is shared by every goroutine spawned in the
	// RunParallel calls below (the bug pointed out in the next reply).
	buf := make([]byte, len(data))
	b.Run("Snappy", func(b *testing.B) {
		b.SetBytes(int64(len(data)))
		b.ResetTimer()
		b.RunParallel(func(pb *testing.PB) {
			for pb.Next() {
				d, err := snappy.Decode(buf, snappyCompressed)
				if err != nil {
					panic(err)
				}
				_ = d
				if validate {
					require.Equal(b, d, data)
				}
			}
		})
	})
	b.Run("LZ4", func(b *testing.B) {
		b.SetBytes(int64(len(data)))
		b.ResetTimer()
		b.RunParallel(func(pb *testing.PB) {
			for pb.Next() {
				n, err := lz4.UncompressBlock(LZ4Compressed, buf)
				if err != nil {
					fmt.Println(err)
				}
				buf = buf[:n] // uncompressed data; note this also resizes the shared buf
				if validate {
					require.Equal(b, buf, data)
				}
			}
		})
	})
	b.Run("ZSTD - Datadog", func(b *testing.B) {
		b.SetBytes(int64(len(data)))
		b.ResetTimer()
		b.RunParallel(func(pb *testing.PB) {
			for pb.Next() {
				d, err := zstd.Decompress(buf, ZSTDCompressed)
				if err != nil {
					panic(err)
				}
				_ = d
				if validate {
					require.Equal(b, d, data)
				}
			}
		})
	})
	b.Run("ZSTD - Go", func(b *testing.B) {
		b.SetBytes(int64(len(data)))
		dec, err := gozstd.NewReader(nil)
		if err != nil {
			panic(err)
		}
		b.ResetTimer()
		b.RunParallel(func(pb *testing.PB) {
			for pb.Next() {
				d, err := dec.DecodeAll(ZSTDCompressed, buf[:0])
				if err != nil {
					panic(err)
				}
				_ = d
				if validate {
					require.Equal(b, d, data)
				}
			}
		})
	})
})
```
You are sharing the output buffer between goroutines. Also, are you using the latest version?
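For clarity, a minimal version of the fix (my illustration, reusing the identifiers from the benchmark above; it mirrors what the earlier snippet in this thread already does): allocate the buffer inside `RunParallel`, so each goroutine gets its own scratch space:

```go
b.RunParallel(func(pb *testing.PB) {
	// Per-goroutine buffer: no goroutine ever writes into
	// another goroutine's output.
	buf := make([]byte, len(data))
	for pb.Next() {
		d, err := dec.DecodeAll(ZSTDCompressed, buf[:0])
		if err != nil {
			panic(err)
		}
		_ = d
	}
})
```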
@jarifibrahim Other than that, the core count (4 vs. 16) is probably making a difference. The cgo version seems to saturate at a low concurrency level.
Yeah. Fixed that.
Yes, I'm using v1.10.7
That could be the reason. Let me try this on a different machine and get back.
Cool. I'm using the generated table, btw. edit: Weird, cgo remains at ~1700/s for me.
It seems that on my system the switchover is somewhere between 8 and 16 cores.
Interestingly, in a concurrent setup 40% of all CPU is spent dealing with channels trying to get decoders when blocks are this small. Rather surprising. There should be a reasonable chance for a quick win there.
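As an illustration of where those channel operations come from, here is a toy sketch of a channel-based decoder pool (my simplification, not the library's actual code); each block decode pays one receive to check a decoder out and one send to return it:

```go
package main

import "fmt"

// blockDecoder stands in for the per-block decode state.
type blockDecoder struct{ id int }

// decoderPool hands out decoders over a buffered channel.
type decoderPool struct{ free chan *blockDecoder }

func newDecoderPool(n int) *decoderPool {
	p := &decoderPool{free: make(chan *blockDecoder, n)}
	for i := 0; i < n; i++ {
		p.free <- &blockDecoder{id: i}
	}
	return p
}

func (p *decoderPool) decode(block []byte) {
	d := <-p.free // channel op 1: check out a decoder
	_ = block     // ... decode the block with d ...
	p.free <- d   // channel op 2: return it
}

func main() {
	p := newDecoderPool(4)
	p.decode([]byte("compressed block"))
	fmt.Println("decoded one block")
}
```

With tiny blocks the real decode work is so small that these two operations start to dominate, which is why cutting them from two to one helps.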
Managed to reduce the number of channel operations from 2 to 1, which gave a nice boost on the top end:
(edited for ease of reading) Also, for higher concurrency I find that longer benchmark times make it more stable.
FYI, I found about 5% when looking at small blocks single threaded: klauspost/compress#265. Most of it is bounds check eliminations, so it doesn't show up too much in single file benchmarks, but it will affect performance with a 'cold' branch predictor. Fuzz testing is looking good, so it will probably be merged soon.
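For readers unfamiliar with bounds check elimination in Go, here is a generic example of the pattern (not the actual compress/zstd code): a single slice access up front proves the length to the compiler, so the later indexes compile without per-access checks:

```go
// le32 decodes a little-endian uint32. The b[3] access up front acts
// as a bounds hint: the compiler proves indexes 0..3 are in range and
// emits one bounds check instead of four.
func le32(b []byte) uint32 {
	_ = b[3]
	return uint32(b[0]) | uint32(b[1])<<8 | uint32(b[2])<<16 | uint32(b[3])<<24
}
```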
Managed to get the single block decodes up to around a 1.15x speedup. The test payload from here was 1.12x the speed.
Ouch, tell me more about it :) I believe you have a much better understanding of benchmarks than I do, and you should write a blog post about benchmarking. I'd love to read it.
This is awesome. If I understand this correctly, the new implementation is about 1.15x faster on single block decodes.
I was using the first 4KB from mobydick. I'll run the next benchmark with the generated table 👍
@jarifibrahim Yeah. It will probably be merged soon. I will fuzz test a bit more before doing a release. Tried another couple of changes today, but no gains.
@klauspost what else did you try apart from increasing the benchmark time?
@jarifibrahim I could do a long talk on that :) Other than getting a thermally stable CPU, I tend to use many short (1s) benchmarks instead of one long one. As you can see in my bench, when working with compression you often see regressions in one case and improvements in others, so having a diverse test set is more important than a single stable one. So I am more looking for general trends, but basically benchmarking every single change along the way, since it is almost impossible to predict. In the bench above the trend is clear.
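If you want to apply this style yourself, one possible workflow (my suggestion, not commands from the thread) is many short runs per benchmark, compared with benchstat:

```sh
# ten 1s runs per benchmark, unit tests skipped
go test -run '^$' -bench . -benchtime=1s -count=10 > new.txt
# compare against a baseline; benchstat is golang.org/x/perf/cmd/benchstat
benchstat old.txt new.txt
```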
Thanks @klauspost! The explanation was very helpful.
@jarifibrahim I have merged it and released v1.10.8.
Got it 👍 |
Fixes #1162

This PR proposes to use https://github.com/klauspost/compress/tree/master/zstd instead of the CGO based https://github.com/DataDog/zstd.

This PR also removes the `CompressionLevel` option, since https://github.com/klauspost/compress/tree/master/zstd supports only two levels of ZSTD compression. The default level is `ZSTD Level 3` and the `fastest` level is `ZSTD Level 1`. `ZSTD Level 1` will be the default level in badger (see the sketch after the list below for how the two levels are selected).

I've experimented with all the suggestions mentioned in klauspost/compress#196 (comment):
- Setting `WithSingleSegment` didn't seem to make much of a speed difference (~1 MB/s).
- `WithNoEntropyCompression` seemed to give a ~3% speed improvement (but that could also be due to the non-deterministic nature of benchmarks).
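A minimal sketch (my illustration based on the library's documented options, not code from this PR) of selecting the two levels discussed above:

```go
package main

import (
	"fmt"

	"github.com/klauspost/compress/zstd"
)

func main() {
	payload := []byte("some block of badger data")

	// SpeedDefault corresponds to ZSTD level 3.
	def, err := zstd.NewWriter(nil, zstd.WithEncoderLevel(zstd.SpeedDefault))
	if err != nil {
		panic(err)
	}
	defer def.Close()

	// SpeedFastest corresponds to ZSTD level 1, the proposed
	// default for badger.
	fast, err := zstd.NewWriter(nil, zstd.WithEncoderLevel(zstd.SpeedFastest))
	if err != nil {
		panic(err)
	}
	defer fast.Close()

	fmt.Println("default:", len(def.EncodeAll(payload, nil)), "bytes")
	fmt.Println("fastest:", len(fast.EncodeAll(payload, nil)), "bytes")
}
```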
Benchmarks

Here's the script I've used: https://gist.github.com/jarifibrahim/91920e93d1ecac3006b269e0c05d6a24