feat: add approximate compression ratio for messages and values #191

brayniac · 2024-04-04T17:01:51Z

Adds approximate compression ratio targeting for messages and values.

This is done by limiting the number of random bytes in the payload to the first N bytes and iteratively estimating the compression ratio using gzip.
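For readers who want a feel for the approach, here is a minimal std-only sketch of the iterative estimate, not the code in this PR. It binary-searches the number of leading random bytes until the measured ratio lands near the target. Every name in it is hypothetical: `XorShift64` stands in for the real PRNG, and `rle_compressed_len` is a toy run-length stand-in for gzip so the example has no crate dependencies.

```rust
/// Tiny xorshift64 PRNG (hypothetical stand-in for the real RNG).
struct XorShift64(u64);

impl XorShift64 {
    fn next_u8(&mut self) -> u8 {
        self.0 ^= self.0 << 13;
        self.0 ^= self.0 >> 7;
        self.0 ^= self.0 << 17;
        (self.0 >> 32) as u8
    }
}

/// Build `length` bytes: the first `random_bytes` are pseudo-random,
/// the remainder is zero-filled (and therefore highly compressible).
fn build_payload(length: usize, random_bytes: usize, seed: u64) -> Vec<u8> {
    let mut rng = XorShift64(seed | 1); // xorshift state must be non-zero
    let mut buf = vec![0u8; length];
    for b in buf.iter_mut().take(random_bytes) {
        *b = rng.next_u8();
    }
    buf
}

/// Toy stand-in for gzip (assumption): size of a byte-wise run-length
/// encoding, 2 bytes per run, with runs capped at 255 bytes.
fn rle_compressed_len(data: &[u8]) -> usize {
    let mut len = 0;
    let mut i = 0;
    while i < data.len() {
        let mut run = 1;
        while i + run < data.len() && data[i + run] == data[i] && run < 255 {
            run += 1;
        }
        len += 2;
        i += run;
    }
    len
}

/// Binary-search the number of leading random bytes so that
/// `length / compressed_len(payload)` is approximately `target_ratio`.
fn estimate_random_bytes<F>(length: usize, target_ratio: f64, compressed_len: F) -> usize
where
    F: Fn(&[u8]) -> usize,
{
    // A ratio of 1.0 or lower means the payload should not compress at all.
    if length == 0 || target_ratio <= 1.0 {
        return length;
    }
    let (mut lo, mut hi) = (0usize, length);
    while lo < hi {
        let mid = lo + (hi - lo) / 2;
        let payload = build_payload(length, mid, 42);
        let ratio = length as f64 / compressed_len(&payload[..]).max(1) as f64;
        if ratio > target_ratio {
            lo = mid + 1; // still too compressible: add random bytes
        } else {
            hi = mid; // at or below the target: try fewer random bytes
        }
    }
    lo
}
```

Calling `estimate_random_bytes(1024, 4.0, rle_compressed_len)` returns a random-byte count for a 1024-byte payload. Swapping in a real gzip length function would change the count but not the search itself.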


Adds a new smoketest config


mihirn · 2024-04-11T15:46:28Z

src/workload/mod.rs

@@ -162,7 +163,7 @@ impl Generator {
        // add a header
        [m[0], m[1], m[2], m[3], m[4], m[5], m[6], m[7]] =
            [0x54, 0x45, 0x53, 0x54, 0x49, 0x4E, 0x47, 0x21];
-        rng.fill(&mut m[32..topics.message_len]);
+        rng.fill(&mut m[32..(topics.message_random_bytes + 32)]);


To Yuri's point about the entropy being evenly distributed across the payload, is it worth shuffling the vector after this if message_random_bytes != message_len?

I think it really depends on what content we're trying to emulate. Some further analysis and design decisions would be necessary before deciding on a strategy. I expect that random bytes spread throughout the message still won't look like, for example, English text or JSON in terms of how the entropy is distributed or even which symbols are expected.

Do we even find this has an impact for the compression algorithms we anticipate being used for transport and/or storage?

I'm voting to defer this to a follow-up PR. We don't have enough information to inform the design right now but we do have the need to produce payloads that are compressible.

#196 created to track
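For reference while #196 is open, the shuffle suggested above could look roughly like the sketch below. It is a std-only illustration, not code from this PR: `XorShift64` and `shuffle_body` are hypothetical names, and the 32-byte header length is taken from the `m[32..]` offset in the hunk above. It runs a Fisher-Yates shuffle over the bytes after the header, which spreads the random bytes across the body without changing how many there are.

```rust
/// Tiny xorshift64 PRNG (hypothetical stand-in for the real RNG).
struct XorShift64(u64);

impl XorShift64 {
    fn next_u64(&mut self) -> u64 {
        self.0 ^= self.0 << 13;
        self.0 ^= self.0 >> 7;
        self.0 ^= self.0 << 17;
        self.0
    }

    /// Index in `0..bound`. Uses modulo, so it is slightly biased;
    /// acceptable for a sketch. `bound` must be non-zero.
    fn below(&mut self, bound: usize) -> usize {
        (self.next_u64() % bound as u64) as usize
    }
}

/// Fisher-Yates shuffle of everything after the first `header_len` bytes.
/// The header bytes are left untouched.
fn shuffle_body(m: &mut [u8], header_len: usize, rng: &mut XorShift64) {
    if m.len() <= header_len {
        return;
    }
    let body = &mut m[header_len..];
    for i in (1..body.len()).rev() {
        let j = rng.below(i + 1);
        body.swap(i, j);
    }
}
```

Whether the shuffled layout changes what a given compressor achieves is exactly the open question in this thread, so this only shows the mechanics.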


mihirn · 2024-04-11T20:33:33Z

src/workload/mod.rs

+
+fn estimate_random_bytes_needed(length: usize, compression_ratio: f64) -> usize {
+    // if compression ratio is low, all bytes should be random
+    if compression_ratio <= 1.0 {


Do we want to short-circuit and return early if length == 0 as well?

I don't think it's needed. Initializing the PRNG shouldn't be too expensive, and it happens only once per keyspace/topic-space.

brayniac added 3 commits April 4, 2024 09:59

feat: add approximate compression ratio for messgaes and values

a226af5

Adds approximate compression ratio targeting for messages and values. This is done by limiting the number of random bytes in the payload to the first N bytes and iteratively estimating the compression ratio using gzip.

add example to configs

ecb7a01

add smoketest config

b756b8b

Adds a new smoketest config

brayniac changed the title ~~feat: add approximate compression ratio for messgaes and values~~ feat: add approximate compression ratio for messages and values Apr 4, 2024

Merge branch 'main' into compression-ratio

cd813eb

brayniac force-pushed the compression-ratio branch from f18415e to cd813eb April 10, 2024 18:21

brayniac requested a review from mihirn April 10, 2024 18:23

mihirn reviewed Apr 10, 2024

src/config/workload.rs (outdated)

brayniac commented Apr 10, 2024

configs/smoketest.toml (outdated)

brayniac commented Apr 10, 2024

configs/segcache.toml (outdated)

brayniac added 2 commits April 10, 2024 15:10

simplify naming, fix errors in example

16e3abb

rustfmt

dbceb8a