-
Notifications
You must be signed in to change notification settings - Fork 10
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat: add approximate compression ratio for messages and values #191
Conversation
Adds approximate compression ratio targeting for messages and values. This is done by limiting the number of random bytes in the payload to the first N bytes and iteratively estimating the compression ratio using gzip.
Adds a new smoketest config
f18415e
to
cd813eb
Compare
@@ -162,7 +163,7 @@ impl Generator { | |||
// add a header | |||
[m[0], m[1], m[2], m[3], m[4], m[5], m[6], m[7]] = | |||
[0x54, 0x45, 0x53, 0x54, 0x49, 0x4E, 0x47, 0x21]; | |||
rng.fill(&mut m[32..topics.message_len]); | |||
rng.fill(&mut m[32..(topics.message_random_bytes + 32)]); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To Yuri's point about the entropy being evenly distributed across the payload, is it worth shuffling the vector after this if message_random_bytes != message_len
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it really depends what content we're trying to emulate. I think some further analysis and design decisions would be necessary before deciding on a strategy. I expect that random bytes spread throughout the message still doesn't look like, as an example, english text or json in terms of how the entropy is distributed and what the expected symbols even are.
Do we even find this has an impact for the compression algorithms we anticipate being used for transport and/or storage?
I'm voting to defer this to a follow-up PR. We don't have enough information to inform the design right now but we do have the need to produce payloads that are compressible.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
#196 created to track
|
||
fn estimate_random_bytes_needed(length: usize, compression_ratio: f64) -> usize { | ||
// if compression ratio is low, all bytes should be random | ||
if compression_ratio <= 1.0 { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we want to short circuit exit if length == 0
as well?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think it's needed. Initializing the PRNG shouldn't be too expensive, and it happens only once per keyspace/topic-space.
Adds approximate compression ratio targeting for messages and values.
This is done by limiting the number of random bytes in the payload to the first N bytes and iteratively estimating the compression ratio using gzip.