Discussion on compression changes #102
gzip uses DEFLATE, which is block-based, and uses a 32KB sliding window dictionary. So why does it matter if we stream data in, compared to buffering and then compressing? If there's an improvement, can you please provide some benchmark comparisons?
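For reference, here is a minimal sketch (not from go-docappender; the event payloads and the 100-event batch size are made-up assumptions) of one way to benchmark the question above: it compares the compressed size of a single continuous gzip stream, where all events share the 32KB sliding window, against independent gzip streams per batch, where the DEFLATE dictionary restarts for every batch.

```go
// Sketch: compare one continuous gzip stream vs. independent per-batch streams.
// Errors from gzip.Writer are ignored because the sink is an in-memory buffer.
package main

import (
	"bytes"
	"compress/gzip"
	"fmt"
)

func gzipSize(chunks [][]byte) int {
	var buf bytes.Buffer
	zw := gzip.NewWriter(&buf)
	for _, c := range chunks {
		zw.Write(c)
	}
	zw.Close()
	return buf.Len()
}

func main() {
	// Fake NDJSON-style events; real documents would come from the bulk indexer.
	var events [][]byte
	for i := 0; i < 10000; i++ {
		events = append(events, []byte(fmt.Sprintf(`{"@timestamp":"2024-01-01T00:00:%02dZ","msg":"event %d"}`+"\n", i%60, i)))
	}

	single := gzipSize(events) // one stream: all events share the 32KB window

	// Independent streams per batch: the DEFLATE dictionary restarts each time.
	const batch = 100
	batched := 0
	for i := 0; i < len(events); i += batch {
		end := i + batch
		if end > len(events) {
			end = len(events)
		}
		batched += gzipSize(events[i:end])
	}

	fmt.Printf("single stream: %d bytes, %d-event batches: %d bytes\n", single, batch, batched)
}
```

The gap between the two numbers is a rough proxy for how much compression ratio a batched "container" approach gives up compared to one continuous stream.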
@kruskall could you provide an update on this?
We had a discussion and decided to revert the compression changes and stop compressing events in batches. We've also decided to discard the optimization to send uncompressed data if the request is too small. The new approach doesn't need offsets and just iterates over the array. It does not decompress the whole request in memory but reads chunks of data.
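As a rough illustration of that last point, here is a hedged sketch of reading a gzip-compressed body in fixed-size chunks rather than inflating it fully into memory (the 32KB chunk size and the handleChunk callback are illustrative, not go-docappender API):

```go
// Sketch: process a gzip-compressed request body in fixed-size chunks
// instead of decompressing the whole thing into memory at once.
package sketch

import (
	"compress/gzip"
	"io"
)

func forEachChunk(body io.Reader, handleChunk func([]byte) error) error {
	zr, err := gzip.NewReader(body)
	if err != nil {
		return err
	}
	defer zr.Close()

	buf := make([]byte, 32*1024) // only this much decompressed data is held at a time
	for {
		n, err := zr.Read(buf)
		if n > 0 {
			if cbErr := handleChunk(buf[:n]); cbErr != nil {
				return cbErr
			}
		}
		if err == io.EOF {
			return nil
		}
		if err != nil {
			return err
		}
	}
}
```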
Please provide a bit more detail on some of the decisions and the rationale behind them:
Do I understand correctly that this can be solved by iterating over an array and reading chunks of data without decompressing the whole body into memory?
Can you explain the motivation for this decision, please? I'm not against it, but the rationale should be documented since it had been an open discussion before.
What does slower mean and can this be quantified?
Are you reverting back to this solution, or are you looking into optimizing, given that you showed that the previous approach wasn't efficient with respect to compression ratio?
Sorry about that!
Yes, see #99, we're reading batches of data to avoid decompressing the whole request.
Skipping compression when there is not enough data was an optimization triggered when the flush interval was hit before enough data had accumulated. If that happens, the application has low throughput and there is not much benefit in saving some CPU time (see the flush-time sketch after this comment).
We have to iterate over the array instead of looking up an offset at a specific point like we did before. This approach is going to be slower.
This is unrelated to the retries. I've opened a new issue to discuss it and share a proof of concept: #108
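Here is a hedged sketch of the flush-time decision mentioned above, i.e. the discarded optimization of skipping compression when the flush interval fires with little buffered data. flushBuffer, compressThreshold, and send are hypothetical names, not go-docappender API.

```go
// Sketch: at flush time, skip compression when there is too little data for
// it to pay off; otherwise gzip the buffered payload before sending.
package sketch

import (
	"bytes"
	"compress/gzip"
)

func flushBuffer(buf *bytes.Buffer, compressThreshold int, send func(payload []byte, gzipped bool) error) error {
	if buf.Len() < compressThreshold {
		// Low-throughput path: the flush interval fired before enough data
		// accumulated, so spend no CPU on compression.
		return send(buf.Bytes(), false)
	}
	var out bytes.Buffer
	zw := gzip.NewWriter(&out)
	if _, err := zw.Write(buf.Bytes()); err != nil {
		return err
	}
	if err := zw.Close(); err != nil {
		return err
	}
	return send(out.Bytes(), true)
}
```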
Compression changes have been reverted. Buffering before compressing can be discussed in #108.
Goals:
What changed?
Why?
If events are compressed in batches of size compressThreshold, we can store the offsets and decompress only that specific "container", avoiding decompressing the whole request in memory. A buffer can also skip compression completely if the flush timeout is hit and there's not enough data for compression.
Additional optimization technique: if we know there's only one event in a "container", we don't even need to decompress at all; we can just copy the compressed bytes since we have the offsets.
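To make the reverted design above concrete, here is a hedged sketch of a "container with offsets" buffer (types and names are illustrative, not the original implementation): each batch is compressed as an independent gzip stream, its start offset is recorded, and a single container can later be copied verbatim or decompressed on its own.

```go
// Sketch: events are gzip-compressed in batches ("containers"); the byte
// offset of each container is recorded so one container can be extracted
// without touching the rest of the request.
package sketch

import (
	"bytes"
	"compress/gzip"
	"io"
)

type containerBuffer struct {
	buf     bytes.Buffer
	offsets []int // start offset of each compressed container in buf
}

// addContainer compresses one batch of events as an independent gzip stream
// and remembers where it starts.
func (c *containerBuffer) addContainer(events [][]byte) error {
	c.offsets = append(c.offsets, c.buf.Len())
	zw := gzip.NewWriter(&c.buf)
	for _, e := range events {
		if _, err := zw.Write(e); err != nil {
			return err
		}
	}
	return zw.Close()
}

// container returns the raw compressed bytes of container i, which can be
// copied verbatim (e.g. the single-event case) without decompressing.
func (c *containerBuffer) container(i int) []byte {
	start := c.offsets[i]
	end := c.buf.Len()
	if i+1 < len(c.offsets) {
		end = c.offsets[i+1]
	}
	return c.buf.Bytes()[start:end]
}

// decompressContainer inflates only container i.
func (c *containerBuffer) decompressContainer(i int) ([]byte, error) {
	zr, err := gzip.NewReader(bytes.NewReader(c.container(i)))
	if err != nil {
		return nil, err
	}
	defer zr.Close()
	return io.ReadAll(zr)
}
```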
On the compression ratio/efficiency
There was some discussion on how compression worked before. To further clarify the picture above:
The compression logic happened inside the bulk indexer:
go-docappender/bulk_indexer.go, lines 174 to 184 in 0de03a5
The event was written to the gzip.Writer immediately; we were not buffering and then compressing all the events. (The gzip.Writer buffers data internally, but that buffering is arbitrary and can't be configured.) Now we write to a buffer and compress if we reach a certain size (configurable).
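A hedged sketch contrasting the two write paths just described (names are illustrative; the real code is in go-docappender/bulk_indexer.go): before, each event went straight into the gzip.Writer; now, events accumulate in a plain buffer and are compressed only once a configurable threshold is crossed.

```go
// Sketch: "write-through gzip" vs. "buffer, then compress at a threshold".
package sketch

import (
	"bytes"
	"compress/gzip"
)

// Before: every event was written straight into the gzip.Writer, whose
// internal buffering is fixed and not configurable.
type streamingIndexer struct {
	out bytes.Buffer
	zw  *gzip.Writer
}

func (s *streamingIndexer) add(event []byte) error {
	if s.zw == nil {
		s.zw = gzip.NewWriter(&s.out)
	}
	_, err := s.zw.Write(event)
	return err
}

// Now: events accumulate in a plain buffer and are compressed only once the
// buffer crosses a configurable threshold.
type bufferedIndexer struct {
	pending   bytes.Buffer
	out       bytes.Buffer
	zw        *gzip.Writer
	threshold int // configurable, unlike gzip.Writer's internal buffer
}

func (b *bufferedIndexer) add(event []byte) error {
	b.pending.Write(event)
	if b.pending.Len() < b.threshold {
		return nil // keep accumulating uncompressed data
	}
	if b.zw == nil {
		b.zw = gzip.NewWriter(&b.out)
	}
	if _, err := b.zw.Write(b.pending.Bytes()); err != nil {
		return err
	}
	b.pending.Reset()
	return nil
}
```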
Testing
The open PR for document retries (#99) has testing to ensure the feature is working as intended.
Additional testing can be performed; see #98
Infinite retries
IMO this is an issue with the PR, so it should be discussed there, but we can also discuss it more holistically.
In the current implementation, events are submitted to the buffer and flushed in the next flush request. There's no max retry setting.
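Purely as a hypothetical, here is a sketch of what a retry cap could look like if one were added: failed events re-enter the buffer with a retry count and are dropped once a configurable maximum is exceeded. None of these names exist in go-docappender or the PR.

```go
// Sketch: a hypothetical per-event retry cap on top of the flush buffer.
package sketch

type retriableEvent struct {
	doc     []byte
	retries int
}

type retryBuffer struct {
	events     []retriableEvent
	maxRetries int // hypothetical setting; today there is no such cap
}

// requeue puts a failed event back so the next flush picks it up, unless it
// has already been retried maxRetries times.
func (b *retryBuffer) requeue(ev retriableEvent) (dropped bool) {
	ev.retries++
	if b.maxRetries > 0 && ev.retries > b.maxRetries {
		return true
	}
	b.events = append(b.events, ev)
	return false
}
```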