
BuildZStdDict Testing #950

Closed
macneale4 opened this issue Apr 10, 2024 · 6 comments · Fixed by #951

Comments

@macneale4

macneale4 commented Apr 10, 2024

Hello!

I'm hoping to use the compress package for packing bytes in our archive format for Dolt (https://github.com/dolthub/dolt). Our data is very amenable to custom dictionaries for small sets (5-50) of binary blobs which average about 4K in size.

We built our POC using the gozstd package which has Dictionary support: https://pkg.go.dev/github.com/valyala/gozstd

Now that I'm convinced this is a good approach for us, I'm curious how we can move the dict.BuildZstdDict function out of the experimental phase, as described in dict/README.md. In particular, the README says it doesn't work well with small sets, and indeed I got a panic when using a set of 32 samples.

How far from complete do you think it is? How can I help get this tested and get the bugs ironed out? I have one local test case I can share with you if that's helpful.

FWIW, using the dictionary produced by valyala/gozstd, I can compress/decompress successfully with the compress/zstd package. So I think it's just a matter of generating a correct dict and we'll be able to move forward.

Thanks!

@klauspost
Owner

> I'm curious how we can move the dict.BuildZstdDict function out of the experimental phase

TBH I don't see it happening. I don't have the bandwidth for it at the moment.

However if you can make a simple reproducer for any crash I will fix it.

@klauspost
Owner

Also sorry for being a bit distracted in my last replies... Let me provide some context:

dict.BuildZstdDict(input [][]byte, o Options) is the higher-level API.

What it mainly does is create a common dictionary from the input samples you provide, deduplicating and truncating to opts.MaxDictSize. Once that is done, it calls zstd.BuildDict to construct the actual dictionary.

It will attempt to compress all of Contents [][]byte with History []byte as a common history to generate the optimal tables for these. These tables are then all combined with the history and written in the zstd dictionary format.

There are for sure untested corner cases in this process, and it is mostly a "released PoC": it works some/most of the time, and it is "ok" in terms of (compression) efficiency.

@macneale4
Author

That's good context. I'll try to bypass the higher-level API and use zstd.BuildDict next. In the meantime, here is a test which is failing.

macneale4@956c229

There is a divide-by-zero error, which I worked around with a hack (see commit). I doubt that is right. But once the dictionary is created, the loadDict method fails with the error "tableLog too large".

I'm out of my element when it comes to compression, but I will probably be pecking at this for the next few days. We'd really like to avoid using CGO in our product.

@klauspost
Owner

klauspost commented Apr 12, 2024

Fix in #951

Made a few tweaks to your test.

A) You are writing to the input buffer when decompressing.
B) EncodeAll/DecodeAll append to the destination buffer; I added a [:0] to make that clear.
C) MaxDictSize is for the backreference size, not the total size.

package dict

import (
	"bytes"
	"io"
	"math/rand"
	"os"
	"testing"

	"github.com/klauspost/compress/zstd"
)

func TestZStdDict(t *testing.T) {
	for _, level := range []zstd.EncoderLevel{zstd.SpeedFastest, zstd.SpeedDefault, zstd.SpeedBetterCompression, zstd.SpeedBestCompression} {
		testZStdDict(t, level)
	}
}

func testZStdDict(t *testing.T, level zstd.EncoderLevel) {
	out := io.Discard
	if testing.Verbose() {
		out = os.Stdout
	}
	opts := Options{
		MaxDictSize:    2048,
		HashBytes:      4,
		Output:         out,
		ZstdDictID:     0,
		ZstdDictCompat: false,
		ZstdLevel:      level,
	}

	inBuf := make([]byte, 0, 4096)
	outBuf := make([]byte, 0, 4096)

	// This is 32K worth of data, but it's all very similar. Only fits in 4K if compressed with a dictionary.
	samples := generateSimilarByteSlices(42, 32)

	dict, err := BuildZstdDict(samples, opts)
	if err != nil {
		t.Fatal(err.Error())
	}

	totalSize := 0
	for _, blob := range samples {
		compressed, err := zCompressDict(inBuf, dict, blob)
		if err != nil {
			t.Fatal(err.Error())
		}
		totalSize += len(compressed)

		// Check round trip.
		decompressed, err := zDecompressDict(outBuf, dict, compressed)
		if err != nil {
			t.Fatal(err.Error())
		}
		if !bytes.Equal(decompressed, blob) {
			t.Fatal("Round trip failed")
		}
	}
	if totalSize > 4096 {
		t.Fatal("Total compressed size exceeds 4096 bytes")
	}
	t.Log("Total compressed size:", totalSize)
}

func zCompressDict(dst, dict, data []byte) ([]byte, error) {
	encoder, err := zstd.NewWriter(nil, zstd.WithEncoderDict(dict))
	if err != nil {
		return nil, err
	}
	defer encoder.Close()

	result := encoder.EncodeAll(data, dst[:0])
	return result, nil
}

func zDecompressDict(dst, dict, data []byte) ([]byte, error) {
	decoder, err := zstd.NewReader(nil, zstd.WithDecoderDicts(dict))
	if err != nil {
		return nil, err
	}
	defer decoder.Close()

	result, err := decoder.DecodeAll(data, dst[:0])
	if err != nil {
		return nil, err
	}

	return result, nil
}

// Creates a slice of byte slices, each generated from the same random seed, so they are very similar. The length
// of each slice is 1024 plus the index of the slice.
func generateSimilarByteSlices(seed int64, count int) [][]byte {
	chks := make([][]byte, count)
	for i := 0; i < count; i++ {
		chks[i] = generateRandomByteSlice(seed, 1024+i)
		if false {
			// Generate a small diff.
			chks[i][i] = byte(seed)
		}
	}

	return chks
}

func generateRandomByteSlice(seed int64, n int) []byte {
	r := rand.NewSource(seed)

	data := make([]byte, n)
	for i := range data {
		data[i] = byte(r.Int63())
	}
	return data
}

With your permission I'd like to include that as a test.

@macneale4
Author

I pulled your changes, and they do result in our unit tests passing. Great!

When exercising against real data, I ran into another edge case. I pushed a new test to my fork branch:

https://github.com/macneale4/compress/tree/zstd-dict-test

The tip commit has a speculative change (allowing v == threshold) that allows my testing to continue. I have no idea if that's correct, though.

You are welcome to take any of this and stick it into your tests. The data comes from one of our public data repos.

@klauspost
Owner

The threshold is mostly to filter out candidates in large sets. Though everything being discarded could indicate that there is little of actual value. I'll merge your change back into the PR.

For production use you'd probably want to set up a fuzz test to check that border cases are handled.
