Skip to content

Commit

Permalink
Avoid 32kB decompression lag + compact less often. (#447)
Browse files Browse the repository at this point in the history
Avoiding 32kB decompression lag
===============================

Before this commit, decompressed data would be accumulated in
`ZlibStream::out_buffer` and returned via `image_data` with 32kB lag
corresponding to `CHUNCK_BUFFER_SIZE`:

    ```
    fn transfer_finished_data(&mut self, image_data: &mut Vec<u8>) -> usize {
        let safe = self.out_pos.saturating_sub(CHUNCK_BUFFER_SIZE);
        image_data.extend(self.out_buffer.drain(..safe));
        ...
    ```

32kB is a typical size of L1 cache, so the lag would mean that the data
passed to `image_data.extend(...)` would be already cold and evicted
from the L1 cache.

This commit avoids the lag by always returning into `image_data` all the
data from `out_buffer` (i.e. data up to `out_pos`):

    ```
    fn transfer_finished_data(&mut self, image_data: &mut Vec<u8>) -> usize {
        let transferred = &self.out_buffer[self.read_pos..self.out_pos];
        image_data.extend_from_slice(transferred);
        self.read_pos = self.out_pos;
        ...
    ```

Compacting less often
=====================

The changes above mean that `Vec::drain` no longer compacts `out_buffer`.
Therefore this commit also refactors how this compaction works.

Before this commit, not-yet-returned data would be shifted to the
beginning of `out_buffer` every time `transfer_finished_data` is called.
This could potentially mean that for 1 returned byte, N bytes have to be
copied during compaction.

After this commit, compaction is only done when the compaction cost if
offset by many read bytes - for 3 returned bytes 1 byte has to be copied
during compaction.

Performance impact
==================

The commit has a positive impact on performance, except for:

* `decode/Transparency.png` - regression between 15% and 20% is reported
  in 3-out-of-3 measurements.
* `decode/kodim17.png` - a regression of 2.1% has been reported in
  1-out-of-3 measurements (an improvement of 0.6% - 1.13% has been
  reported in the other 2-out-of-3 measurements).
* `generated-noncompressed-64k-idat/128x128.png` - a regression of 25%
  has been reported in 1-out-of-3 measurements (an improvement of 21% -
  29% has been reported in the other 2-out-of-3 measurements).

The results below have been gathered by running the `decoder` benchmark.
First a baseline was saved before this commit, and then a comparison
was done after the commit.  This (the baseline + the comparison) was
repeated a total of 3 times.  All results below are for the relative
impact on the runtime.  All results are with p = 0.00 < 0.05.

* decode/kodim23.png:
    * [-2.9560% -2.7112% -2.4009%]
    * [-3.4876% -3.3406% -3.1928%]
    * [-3.0559% -2.9208% -2.7787%]

* decode/kodim07.png:
    * [-1.2527% -1.0110% -0.6780%]
    * [-1.7851% -1.6558% -1.5164%]
    * [-1.6576% -1.5216% -1.3856%]

* decode/kodim02.png:
    * [-0.5108% -0.2806% -0.0112%]
    * [-1.0885% -0.9493% -0.8118%]
    * [-0.5563% -0.4239% -0.2874%]

* decode/kodim17.png:
    * [+1.8649% +2.1138% +2.4169%] (**regression**)
    * [-1.2891% -1.1322% -0.9736%]
    * [-0.7753% -0.6276% -0.4866%]

* decode/Lohengrin_-_Illustrated_Sporting_and_Dramatic_News.png:
    * [-1.7165% -1.4968% -1.2650%]
    * [-1.7051% -1.4473% -1.2229%]
    * [-1.2544% -1.0457% -0.8375%]

* decode/Transparency.png:
    * [+19.329% +19.789% +20.199%] (**regression**)
    * [+15.337% +15.798% +16.294%] (**regression**)
    * [+18.694% +19.106% +19.518%] (**regression**)

* generated-noncompressed-4k-idat/8x8.png:
    * [-2.3295% -1.9940% -1.5912%]
    * [-6.1285% -5.8872% -5.6091%]
    * [-2.8814% -2.6787% -2.4820%]

* generated-noncompressed-4k-idat/128x128.png:
    * [-59.793% -59.599% -59.417%]
    * [-63.930% -63.846% -63.756%]
    * [-62.377% -62.248% -62.104%]

* generated-noncompressed-4k-idat/2048x2048.png:
    * [-67.678% -67.579% -67.480%]
    * [-65.616% -65.519% -65.429%]
    * [-65.824% -65.647% -65.413%]

* generated-noncompressed-4k-idat/12288x12288.png:
    * [-60.932% -60.774% -60.528%]
    * [-62.088% -62.016% -61.940%]
    * [-61.663% -61.604% -61.546%]

* generated-noncompressed-64k-idat/128x128.png:
    * [-22.237% -21.975% -21.701%]
    * [-29.656% -29.480% -29.311%]
    * [+24.812% +25.190% +25.571%] (**regression**)

* generated-noncompressed-64k-idat/2048x2048.png:
    * [-21.826% -21.499% -21.087%]
    * [-54.279% -54.049% -53.715%]
    * [-11.174% -10.828% -10.482%]

* generated-noncompressed-64k-idat/12288x12288.png:
    * [-40.421% -40.311% -40.180%]
    * [-39.496% -39.183% -38.871%]
    * [-41.443% -41.367% -41.295%]

* generated-noncompressed-2g-idat/2048x2048.png:
    * [-40.136% -40.010% -39.865%]
    * [-58.507% -58.333% -58.060%]
    * [-35.822% -35.457% -35.038%]

* generated-noncompressed-2g-idat/12288x12288.png:
    * [-37.196% -37.107% -37.014%]
    * [-36.125% -36.049% -35.970%]
    * [-35.636% -35.477% -35.350%]

Co-authored-by: Lukasz Anforowicz <lukasza@marcin-mx>
  • Loading branch information
anforowicz and Lukasz Anforowicz authored Jan 6, 2024
1 parent c4121b5 commit 1636b55
Showing 1 changed file with 43 additions and 8 deletions.
51 changes: 43 additions & 8 deletions src/decoder/zlib.rs
Original file line number Diff line number Diff line change
Expand Up @@ -12,8 +12,12 @@ pub(super) struct ZlibStream {
/// The decoder sometimes wants inspect some already finished bytes for further decoding. So we
/// keep a total of 32KB of decoded data available as long as more data may be appended.
out_buffer: Vec<u8>,
/// The cursor position in the output stream as a buffer index.
/// The first index of `out_buffer` where new data can be written.
out_pos: usize,
/// The first index of `out_buffer` that hasn't yet been passed to our client
/// (i.e. not yet appended to the `image_data` parameter of `fn decompress` or `fn
/// finish_compressed_chunks`).
read_pos: usize,
/// Limit on how many bytes can be decompressed in total. This field is mostly used for
/// performance optimizations (e.g. to avoid allocating and zeroing out large buffers when only
/// a small image is being decoded).
Expand All @@ -33,6 +37,7 @@ impl ZlibStream {
started: false,
out_buffer: Vec::new(),
out_pos: 0,
read_pos: 0,
max_total_output: usize::MAX,
ignore_adler32: true,
}
Expand All @@ -42,6 +47,7 @@ impl ZlibStream {
self.started = false;
self.out_buffer.clear();
self.out_pos = 0;
self.read_pos = 0;
self.max_total_output = usize::MAX;
*self.state = Decompressor::new();
}
Expand Down Expand Up @@ -94,6 +100,7 @@ impl ZlibStream {
self.started = true;
self.out_pos += out_consumed;
self.transfer_finished_data(image_data);
self.compact_out_buffer_if_needed();

Ok(in_consumed)
}
Expand Down Expand Up @@ -128,11 +135,12 @@ impl ZlibStream {
transferred > 0 || out_consumed > 0,
"No more forward progress made in stream decoding."
);
self.compact_out_buffer_if_needed();
}
}

self.out_buffer.truncate(self.out_pos);
image_data.append(&mut self.out_buffer);
self.transfer_finished_data(image_data);
self.out_buffer.clear();
Ok(())
}

Expand Down Expand Up @@ -179,10 +187,37 @@ impl ZlibStream {
}

fn transfer_finished_data(&mut self, image_data: &mut Vec<u8>) -> usize {
let safe = self.out_pos.saturating_sub(CHUNCK_BUFFER_SIZE);
// TODO: allocation limits.
image_data.extend(self.out_buffer.drain(..safe));
self.out_pos -= safe;
safe
let transferred = &self.out_buffer[self.read_pos..self.out_pos];
image_data.extend_from_slice(transferred);
self.read_pos = self.out_pos;
transferred.len()
}

fn compact_out_buffer_if_needed(&mut self) {
// [PNG spec](https://www.w3.org/TR/2003/REC-PNG-20031110/#10Compression) says that
// "deflate/inflate compression with a sliding window (which is an upper bound on the
// distances appearing in the deflate stream) of at most 32768 bytes".
//
// `fdeflate` requires that we keep this many most recently decompressed bytes in the
// `out_buffer` - this allows referring back to them when handling "length and distance
// codes" in the deflate stream).
const LOOKBACK_SIZE: usize = 32768;

// Compact `self.out_buffer` when "needed". Doing this conditionally helps to put an upper
// bound on the amortized cost of copying the data within `self.out_buffer`.
//
// TODO: The factor of 4 is an ad-hoc heuristic. Consider measuring and using a different
// factor. (Early experiments seem to indicate that factor of 4 is faster than a factor of
// 2 and 4 * `LOOKBACK_SIZE` seems like an acceptable memory trade-off. Higher factors
// result in higher memory usage, but the compaction cost is lower - factor of 4 means
// that 1 byte gets copied during compaction for 3 decompressed bytes.)
if self.out_pos > LOOKBACK_SIZE * 4 {
// Only preserve the `lookback_buffer` and "throw away" the earlier prefix.
let lookback_buffer = self.out_pos.saturating_sub(LOOKBACK_SIZE)..self.out_pos;
let preserved_len = lookback_buffer.len();
self.out_buffer.copy_within(lookback_buffer, 0);
self.read_pos = preserved_len;
self.out_pos = preserved_len;
}
}
}

0 comments on commit 1636b55

Please sign in to comment.