Hello!
I wanted to share some of my ideas for performance opportunities in the png crate.
I hope that we can discuss these ideas together to refine them - for example:
Maybe it’s better to invest in performance-sensitive areas of code other than the ones identified below?
Maybe there are better ways to benchmark and/or profile the code?
Maybe some of these changes are good for microbenchmarks but not for end-to-end performance?
etc.
I hope that this issue is a good way to have those discussions, but please shout if you think there might be better ways to engage. For example - maybe I should instead try joining a discord server somewhere? Joining a mailing list or discussion group? Writing all of this in a google doc (and discussing in doc comments)?
I apologize in advance if you find the text below rather long. I tried to keep things structured and avoid rambling, but I also wanted to brainstorm (and avoid immediately rejecting) half-baked or even silly ideas. Please bear with me :-).
My ultimate goal is to bring the performance of the png crate to parity with Chromium’s C++ implementation. A description of my (quite possibly flawed) comparison/measurement methodology can be found in the (incomplete / work-in-progress / please-be-gentle) “Discussion of benchmarking methodology” section in a separate doc (the link should be viewable by anyone, I’d be happy to grant commenting access if needed).
Performance improvement ideas
Idea 1: Explicit SIMD-ification of unfilter
Why this idea seems promising:
perf record of png crate’s benchmarks says that png::filter::unfilter accounts for 18% of runtime
libpng uses explicit SIMD intrinsics to SIMD-ify unfilter.
There are implementations for 3-bytes-wide and 4-bytes-wide pixels (e.g. png_read_filter_row_avg3_sse2 and png_read_filter_row_avg4_sse2)
There are implementations for Avg, Sub, and Paeth filter types (e.g. png_read_filter_row_sub3_sse2, png_read_filter_row_avg3_sse2, and png_read_filter_row_paeth3_sse2)
Status: PR is under review:
unfilter benchmarks landed in Scaffolding for direct benchmarking of crate::filter::unfilter #413
std::simd to speed-up unfilter for Paeth for bpp=3 and bpp=6 #414 (this yields 20% improvements in unfilter benchmarks)
Discussion of next steps:
(not in my short-term plan, help welcomed) Double-check if investing more time here is worthwhile:
Maybe one can use godbolt to compare disassembly of Rust vs C++ (for all filter types and/or bytes-per-pixels?). Example for Paeth/bpp=3: https://godbolt.org/z/j8fK39oEE.
FWIW, after the SIMD-ification png::filter::unfilter went down to 8.6% of runtime - this is the upper bound on possible further improvements.
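For concreteness, here is a minimal scalar sketch of Sub unfiltering (hypothetical, not the crate's actual code, and deliberately not SIMD): the per-byte dependency on the reconstructed pixel to the left is exactly what the SSE2/std::simd versions have to work around, since bytes within one pixel are independent lanes but consecutive pixels are not.

```rust
// Minimal scalar sketch of Sub unfiltering (hypothetical, not the crate's
// actual code). recon[i] = raw[i] + recon[i - bpp] (wrapping); the first
// pixel is stored verbatim. The recurrence on recon[i - bpp] is why the
// libpng SSE2 kernels process one pixel per step rather than a full vector.
fn unfilter_sub(bpp: usize, row: &mut [u8]) {
    for i in bpp..row.len() {
        row[i] = row[i].wrapping_add(row[i - bpp]);
    }
}

fn main() {
    // Filtered row (bpp=3): first pixel verbatim, later pixels store deltas.
    let mut row = vec![10u8, 20, 30, 1, 2, 3, 1, 2, 3];
    unfilter_sub(3, &mut row);
    assert_eq!(row, vec![10, 20, 30, 11, 22, 33, 12, 24, 36]);
}
```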
Idea 2: Avoid copying data within the png crate
Why this idea seems promising:
This seems to make intuitive sense - avoiding unnecessary work (moving bytes around) should improve performance (if we can avoid somehow hurting performance as a side effect - via extra computations, or extra cache pressure…)
perf record of png crate’s benchmarks (after SIMD-ification of unfilter) shows:
__memmove_avx_unaligned_erms: 9.25% of runtime
__memset_avx2_unaligned_erms: 1.45% of runtime
__memmove_avx_unaligned: 0.98% of runtime
Idea 2.1: Reduce copying of raw rows
Goal: get rid of the copies done in the two places here:
Preserving the previous row for unfiltering here: self.prev[..rowlen].copy_from_slice(&row[..rowlen])
Compacting the raw rows buffer here: self.current.drain(..self.scan_start).for_each(drop); (copies all bytes from self.scan_start to the end of the self.current vector)
Status: in progress - actively working on this:
Prototype: see here
Problem: I need to convince myself that this really helps (see also the benchmarking write up in a separate section below). Strategy:
High confidence / I plan to work on this soon: Implement a benchmark that isolates the new RawRowsBuffer while still replicating decoding/decompressing patterns (of say kodim02.png). This can be done by abstracting away ReadDecoder by hiding it behind an equivalent AbstractReadDecoder trait. The abstract trait can also be implemented as a pass-thru to capture/log the decoding/decompressing patterns (that should be replicated by the microbenchmark).
Low confidence: Somehow reduce noise for end-to-end benchmarks…
Maybe creating an artificial PNG that replicates most behavioral patterns of kodim02.png, but uses no unfilter and uses no Huffman encoding… This seems hard…
Other ideas from the benchmarking write-up below?
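As a thought experiment for the prev-row copy above, the O(rowlen) copy_from_slice could in principle become an O(1) pointer swap if the decoder kept two row buffers. A hypothetical sketch (RowBuffers and its methods are made-up names, not the crate's actual types):

```rust
use std::mem;

// Hypothetical sketch (RowBuffers is a made-up name, not the crate's actual
// type): keep two row buffers and swap them after each row, so preserving
// the previous row costs an O(1) swap instead of an O(rowlen)
// copy_from_slice. After the swap, `current` holds stale bytes that the
// next row overwrites anyway.
struct RowBuffers {
    prev: Vec<u8>,
    current: Vec<u8>,
}

impl RowBuffers {
    fn new(rowlen: usize) -> Self {
        RowBuffers { prev: vec![0; rowlen], current: vec![0; rowlen] }
    }

    // Called after `current` has been unfiltered in place.
    fn advance_row(&mut self) {
        mem::swap(&mut self.prev, &mut self.current);
    }
}

fn main() {
    let mut rows = RowBuffers::new(4);
    rows.current.copy_from_slice(&[1, 2, 3, 4]);
    rows.advance_row();
    assert_eq!(rows.prev, vec![1, 2, 3, 4]); // previous row preserved, no memcpy
}
```

Whether this is compatible with how the raw-rows buffer is actually laid out in the crate is exactly what the isolated benchmark above should help answer.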
Idea 2.2: Reduce copying within ZlibStream::transfer_finished_data
There seem to be 2 performance improvement opportunities within ZlibStream::transfer_finished_data:
Avoid copying not-yet-decompressed/populated bytes in self.out_buffer[self.out_pos..].
Maybe we can drop fdeflate’s assumption that out_buffer has been zero-ed out. In this case, these bytes can simply be ignored.
OTOH, we can keep fdeflate's assumption and still avoid copying these bytes by relying on zeroing them out in prepare_vec_for_appending. In other words, we can replace a (slower?) memcpy with a (faster?) memset.
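To illustrate the memset-for-memcpy trade, here is a hypothetical sketch (function and parameter names are made up, not the crate's actual code): during compaction, memmove only the live decompressed bytes and rebuild the zeroed tail with a fill, instead of also memmoving the not-yet-populated zero bytes.

```rust
// Hypothetical sketch (names are made up): compact the output buffer by
// memmoving only the live decompressed bytes (consumed..out_pos) to the
// front, then re-zero the tail with fill (memset) instead of memmoving the
// not-yet-populated zeros in out_buffer[out_pos..]. The buffer length is
// unchanged, so fdeflate's zeroed-output assumption still holds.
fn compact_keeping_zeroed_tail(out_buffer: &mut [u8], consumed: usize, out_pos: usize) -> usize {
    out_buffer.copy_within(consumed..out_pos, 0); // memmove: live bytes only
    let new_pos = out_pos - consumed;
    out_buffer[new_pos..].fill(0); // memset: restore the zeroed tail
    new_pos // the new out_pos
}

fn main() {
    let mut buf = vec![1u8, 2, 3, 4, 5, 0, 0, 0];
    let new_pos = compact_keeping_zeroed_tail(&mut buf, 2, 5);
    assert_eq!(new_pos, 3);
    assert_eq!(buf, vec![3, 4, 5, 0, 0, 0, 0, 0]);
}
```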
Reduce how often out_buffer is compacted
Today transfer_finished_data compacts the buffer every time (copying 32kB every time safe number of bytes is returned to the client; possibly copying multiple bytes per single returned byte)
Idea: only compact when safe > CHUNCK_BUFFER_SIZE * 3 (copying 32kB after returning at least 3*32kB of data; copying at most 1 byte per 3 returned bytes). The number 3 is somewhat arbitrary and there is a trade-off here: a higher constant means more memory/cache pressure (but less copying).
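The deferred-compaction idea can be sketched roughly as follows (OutBuffer, WINDOW, and COMPACT_FACTOR are hypothetical names for illustration; the real logic lives in ZlibStream and must keep the zlib back-reference window intact):

```rust
// Hypothetical sketch (OutBuffer is a made-up type): defer compaction until
// the dead prefix exceeds COMPACT_FACTOR * WINDOW, instead of compacting on
// every transfer, so each moved byte is amortized over several returned bytes.
const WINDOW: usize = 32 * 1024; // zlib back-reference window (32kB)
const COMPACT_FACTOR: usize = 3; // arbitrary; higher = less copying, more memory

struct OutBuffer {
    buf: Vec<u8>,
    start: usize, // bytes before `start` were already returned to the caller
}

impl OutBuffer {
    fn transfer_finished(&mut self, safe: usize) -> Vec<u8> {
        let out = self.buf[self.start..self.start + safe].to_vec();
        self.start += safe;
        // Keep the last WINDOW bytes before `start` reachable for zlib
        // back-references; only compact once the dead prefix is big enough
        // to make the memmove worthwhile.
        if self.start > WINDOW * COMPACT_FACTOR {
            self.buf.drain(..self.start - WINDOW);
            self.start = WINDOW;
        }
        out
    }
}

fn main() {
    let mut ob = OutBuffer { buf: vec![7u8; 200_000], start: 0 };
    let out = ob.transfer_finished(100_000);
    assert_eq!(out.len(), 100_000);
    assert_eq!(ob.start, WINDOW); // compacted, since 100_000 > 3 * 32kB
    assert_eq!(ob.buf.len(), 200_000 - (100_000 - WINDOW));
}
```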
Status: started - plan to resume work on this soon:
Prototyped here (passes functional tests, not measured yet due to irrational fear of benchmark noise… 🙂)
Problem: I need to convince myself that this really helps (see also the benchmarking write up in a separate section below). Strategy: similar to what I plan for idea 2.1 above:
High confidence / I plan to work on this soon: Implement a benchmark that isolates the ZlibStream while still replicating decoding/decompressing patterns (of say kodim02.png)
Idea 2.3: Avoid BufReader when unnecessary
Sometimes the input to the png crate has already been buffered into memory (Chromium’s //ui/gfx/codec/png_codec.cc takes a span of memory as input; png crate’s benchmarks use Vec<u8>). In such scenarios using a BufReader will needlessly copy the already-buffered/already-in-memory bytes into BufReader’s internal buffer.
Status: thinking how to best address this:
Maybe the public API of the png and image crates should take BufRead instead of Read as input?
Maybe the public API of the png (and image?) crate can be bifurcated (internally holding Box<dyn BufRead> instead of BufReader<R> in ReadDecoder::reader):
new can continue taking Read as input (maybe deprecate new + rename it to from_unbuffered_input)
A new API (from_buf_read? from_buffered_input?) can take BufRead as input. Alternatively we could also introduce a from_slice API (OTOH all slices are BufRead but not all BufReads are slices, so this isn't a fully generic solution).
Other options?
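One possible shape for the bifurcated API could look like the following sketch (constructor names like from_buffered_input are made up, not the crate's actual API; &[u8] implements BufRead, so slices bypass BufReader entirely):

```rust
use std::io::{BufRead, BufReader, Read};

// Hypothetical API sketch (constructor names are made up, not the crate's
// actual API): holding Box<dyn BufRead> lets already-in-memory inputs
// (&[u8] implements BufRead) bypass BufReader's internal buffer and its
// extra copy, while unbuffered readers still get wrapped as before.
struct Decoder<'a> {
    reader: Box<dyn BufRead + 'a>,
}

impl<'a> Decoder<'a> {
    // Unbuffered input: wrap in a BufReader, as today.
    fn from_unbuffered_input<R: Read + 'a>(r: R) -> Self {
        Decoder { reader: Box::new(BufReader::new(r)) }
    }

    // Already-buffered input: use it directly; no intermediate copy.
    fn from_buffered_input<R: BufRead + 'a>(r: R) -> Self {
        Decoder { reader: Box::new(r) }
    }
}

fn main() {
    // A slice is already "buffered": no BufReader needed.
    let mut decoder = Decoder::from_buffered_input(&b"\x89PNG"[..]);
    let mut signature = [0u8; 4];
    decoder.reader.read_exact(&mut signature).unwrap();
    assert_eq!(&signature, b"\x89PNG");

    // An unbuffered reader still works through the same field.
    let mut decoder2 = Decoder::from_unbuffered_input(&b"rest"[..]);
    let mut word = [0u8; 4];
    decoder2.reader.read_exact(&mut word).unwrap();
    assert_eq!(&word, b"rest");
}
```

The cost of this shape is one dynamic dispatch per read call; whether that matters relative to the avoided memcpy is something the benchmarks would have to show.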
Idea 2.4: Other copy avoidance
Half-baked ideas:
ZlibStream::out_buffer (extension of idea 2.2): maybe we can avoid compacting at all
Silly idea 2.4.1: What if we had a data structure like Vec (i.e. elements occupy contiguous memory) but that supports efficient dropping of large prefixes (bypassing the allocator and returning whole memory pages to the OS?)
Silly idea 2.4.2: Circular buffer instead of compacting (probably detrimental to performance of fdeflate...)
ZlibStream::in_buffer:
Seems impossible to avoid because decompression needs to work with a single buffer, but PNG can split compressed data stream across multiple IDAT chunks.
Idea 3: Minimize the number of allocations
Why this idea seems promising:
Chromium benchmarks show limited, but non-zero performance opportunities
0.78% of runtime: partition_alloc::internal::PartitionBucket::SlowPathAlloc
0.76% of runtime: allocator_shim::internal::PartitionMalloc
0.66% of runtime: allocator_shim::internal::PartitionFree
There are some low-hanging fruit here - most allocations are triggered (directly or indirectly) underneath (as measured by heaptrack for png crate's decoder benchmarks):
26.1% allocations in fdeflate::decompress::Decompressor::build_tables
17.1% allocations in png::decoder::zlib::ZlibStream::new
15.7% allocations in png::decoder::stream::StreamingDecoder::parse_text
12.2% allocations in png::decoder::zlib::ZlibStream::transfer_finished_data
Status: not started yet:
Idea 3.1: image crate and Chromium ignore text chunks - we should just call png::Decoder::set_ignore_text_chunk. (This should help with parse_text.)
Box in ZlibStream is not needed?
Idea 4: Try to improve decompression speed
Why this idea seems promising:
Other Chromium engineers point out that some Chromium patches may be responsible for significant performance improvements of zlib. In particular, there are SIMD-related patches here (see also a slide here by ARM engineers)
perf record of png crate’s benchmarks says that:
56.44% of runtime is spent in fdeflate::decompress::Decompressor::read
9.20% of runtime is spent in fdeflate::decompress::Decompressor::build_tables
Status: haven’t really started looking at fdeflate source code yet:
Idea 5: Try to improve expand_paletted
Why this idea seemed promising:
I am fairly sure that I saw expand_paletted relatively high (11% I think?) in a profile (but I didn’t take sufficient notes and now I have trouble replicating this…)
(std::simd::Simd::gather_or_default? mimicking Chromium SIMD code that I don’t understand?)
Status: tentatively abandoned for now:
The prototype seems to improve expand_palette microbenchmarks by 21% to 56%, but
I’ve realized that expand_paletted barely registers in a runtime profile of png crate’s end-to-end benchmarks (even when using the original testcase proposed by ARM engineers in https://crbug.com/706134#c6; I tested after tweaking that png crate's benchmark to use the EXPAND transformation similarly to how the png crate is used from the image crate)
What’s good for microbenchmarks isn’t necessarily good for end-to-end performance. See also https://abseil.io/fast/39.
Benchmarking is hard…
I find it challenging to evaluate commits/prototypes/changes to see whether they result in a performance improvement. Let me share some of my thoughts below (hoping to get feedback that will help me navigate this area of software engineering).
The best case is when a change has such a big positive impact that it clearly shows up on benchmarks - for example the unfilter changes (idea 1 above) show a clear improvement in png crate’s benchmarks (not only a 20% improvement on unfilter microbenchmarks for Paeth/bpp=3, but also a 5-8% improvement on some end-to-end decoder benchmarks and a “change within noise threshold” for other end-to-end benchmarks). OTOH, there are changes that seem beneficial (e.g. avoiding memory copies - idea 2 above) that have relatively modest gains (maybe 1% - 7% in end-to-end benchmarks) and that may sometimes (randomly - due to noise?) seem to be performance regressions. I struggle a bit figuring out how to cope with the latter scenario.
Maybe I should try to run Criterion benchmarks for longer (e.g. --sample-size=1000 --measurement-time=500)? (This doesn’t seem to help. I still get wild swings in results… :-/)
Maybe I should look into cachegrind-based, iai-based instruction-count-focused benchmarks? OTOH it seems to me that the estimated cycle count can be rather inaccurate. The unfilter change for example shows the following changes in estimated cycles: kodim02=-8.1% (-5.1% runtime), kodim07=-19.2% (-8.4% runtime), kodim17=-0.01% (+0.87% runtime), kodim23=-5.7% (-2.4% runtime).
Maybe measuring an average and doing fancy statistics is misguided and one should just measure the minimum possible time? This assumes that the external noise can only make a benchmark slower. This idea is based on a blog post here.
Maybe I should give up microbenchmarks and instead rely on Chromium’s A/B experimentation framework.
Maybe I should introduce and use more focused microbenchmarks (and to some extent ignore the end-to-end decoder benchmarks of the png crate):
I do in fact plan to use dependency injection to isolate performance of RawRowsBuffer (idea 2.1) and ZlibStream (idea 2.2)
I also have a tentative idea to craft a PNG file that has minimal decompression and unfiltering overhead while still preserving most of the other behavioral patterns (e.g. preserving the number and length of IDAT and ZLIB chunks, how they get decompressed, etc.). This seems hard. Maybe I should try compressing a black PNG?
Maybe I should attempt to reduce the measurement noise somehow?
nice -19 is one way (sadly I didn’t think of using that initially…)
Maybe there is a way to check and/or guarantee that the CPU is always using a consistent clock speed/frequency (e.g. doesn’t sleep or enter a low power mode)
Maybe I can somehow dedicate a single CPU core to the benchmarks? nice -19 helps to some extent, but I think that the benchmark may still yield to other processes (?) and/or interrupt handlers. OTOH, I am not sure if there is an easy way to do that: I don’t have the cset command on my Linux box; writing to /proc/irq/0/smp_affinity seems icky; and I am not sure if I can control boot flags on my datacenter-based machine to use isolcpus.
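To make the "measure the minimum" option above concrete, here is a tiny hypothetical harness sketch. The underlying assumption (which the blog post argues) is that external noise - scheduling, interrupts, frequency scaling - can only ever add time, so the fastest of N runs is the least-noisy estimate of the true cost:

```rust
use std::time::{Duration, Instant};

// Hypothetical sketch of the "measure the minimum" idea: if noise only ever
// adds time, the minimum over N runs approximates the code's true cost
// better than the mean, with no statistics needed.
fn min_time<F: FnMut()>(mut f: F, runs: usize) -> Duration {
    (0..runs)
        .map(|_| {
            let start = Instant::now();
            f();
            start.elapsed()
        })
        .min()
        .expect("runs must be > 0")
}

fn main() {
    let data: Vec<u64> = (0..10_000).collect();
    let mut sink = 0u64;
    // The sink prevents the compiler from optimizing the workload away.
    let best = min_time(|| sink = sink.wrapping_add(data.iter().sum::<u64>()), 100);
    println!("fastest of 100 runs: {:?} (sink: {})", best, sink);
}
```

The obvious caveat: the minimum hides cache-warmup and allocator effects that real decodes do pay, so it answers "how fast can this code go", not "how fast is it on average".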
So...
So, what do you think? Does any of the above make sense? Am I wildly off anywhere? Any specific/actionable suggestions?
Feedback welcomed!