This issue was moved to a discussion.


Discussion: PNG decoding performance improvement opportunities #415

Closed
anforowicz opened this issue Sep 29, 2023 · 0 comments

anforowicz commented Sep 29, 2023

Hello!

I wanted to share some of my ideas for performance opportunities in the png crate.
I hope that we can discuss these ideas together to refine them - for example:

  • Maybe it’s better to invest in performance-sensitive areas of code other than the ones identified below?
  • Maybe there are better ways to benchmark and/or profile the code?
  • Maybe some of these changes are good for microbenchmarks but not for end-to-end performance?
  • etc.

I hope that this issue is a good way to have those discussions, but please shout if you think there might be better ways to engage. For example - maybe I should instead try joining a discord server somewhere? Joining a mailing list or discussion group? Writing all of this in a google doc (and discussing in doc comments)?

I apologize in advance if you find the text below rather long. I tried to keep things structured and avoid rambling, but I also wanted to brainstorm (and avoid immediately rejecting) half-baked or even silly ideas. Please bear with me :-).

My ultimate goal is to bring the performance of the png crate to parity with Chromium’s C++ implementation. A description of my (quite possibly flawed) comparison/measurement methodology can be found in the (incomplete / work-in-progress / please-be-gentle) “Discussion of benchmarking methodology” section of a separate doc (the link should be viewable by anyone; I’d be happy to grant commenting access if needed).

Performance improvement ideas

Idea 1: Explicit SIMD-ification of unfilter

Why this idea seems promising:

  • perf record of png crate’s benchmarks says that png::filter::unfilter accounts for 18% of runtime

  • libpng uses explicit SIMD intrinsics to SIMD-ify unfilter.

    • There are implementations for 3-bytes-wide and 4-bytes-wide pixels (e.g. png_read_filter_row_avg3_sse2 and png_read_filter_row_avg4_sse2)
    • There are implementations for Avg, Sub, and Paeth filter types (e.g. png_read_filter_row_sub3_sse2, png_read_filter_row_avg3_sse2, and png_read_filter_row_paeth3_sse2)
    • There are separate Intel (intel/filter_sse2_intrinsics.c) and Arm (arm/filter_neon_intrinsics.c) implementations; the Arm one also SIMD-ifies the Up filter type
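To make the discussion concrete, here is a minimal scalar sketch of what unfiltering the Sub filter does (the function name and in-place convention are illustrative; the real png::filter::unfilter covers all filter types and pixel widths, and the SIMD intrinsics vectorize this same recurrence):

```rust
// Sub unfilter: each reconstructed byte is the filtered byte plus the
// corresponding byte of the previous pixel (bpp bytes to the left).
// Scalar sketch for illustration only; this is the recurrence that
// libpng's SSE2/NEON intrinsics vectorize.
fn unfilter_sub(row: &mut [u8], bpp: usize) {
    for i in bpp..row.len() {
        row[i] = row[i].wrapping_add(row[i - bpp]);
    }
}

fn main() {
    // First pixel is stored raw; later pixels store deltas.
    let mut row = [10u8, 20, 30, 1, 2, 3];
    unfilter_sub(&mut row, 3);
    assert_eq!(row, [10, 20, 30, 11, 22, 33]);
}
```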

Status: PR is under review:

Discussion of next steps:

Idea 2: Avoid copying data within the png crate

Why this idea seems promising:

  • This seems to make intuitive sense - avoiding unnecessary work (moving bytes around) should improve performance, as long as we avoid hurting performance as a side effect (via extra computations, extra cache pressure, …)

  • perf record of png crate’s benchmarks (after SIMD-ification of unfilter) shows:

    • __memmove_avx_unaligned_erms: 9.25% of runtime
    • __memset_avx2_unaligned_erms: 1.45% of runtime
    • __memmove_avx_unaligned: 0.98% of runtime

Idea 2.1: Reduce copying of raw rows

Goal: get rid of the copies done in the two places here:

  • Preserving the previous row for unfiltering here: self.prev[..rowlen].copy_from_slice(&row[..rowlen])
  • Compacting the raw rows buffer here: self.current.drain(..self.scan_start).for_each(drop); (copies all bytes from self.scan_start to the end of the self.current vector)
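For the first copy, one possible shape of the fix (a sketch under assumed names, not the actual prototype) is to keep two row buffers and swap them in O(1) instead of copying rowlen bytes:

```rust
use std::mem;

// Hypothetical sketch: instead of copying the just-unfiltered row into
// `prev` (self.prev[..rowlen].copy_from_slice(&row[..rowlen])), keep two
// row buffers and swap them after each row. `RowBuffers` is an
// illustration, not the crate's actual type.
struct RowBuffers {
    prev: Vec<u8>,
    current: Vec<u8>,
}

impl RowBuffers {
    fn finish_row(&mut self) {
        // O(1) pointer swap instead of an O(rowlen) memcpy.
        mem::swap(&mut self.prev, &mut self.current);
    }
}

fn main() {
    let mut b = RowBuffers { prev: vec![0; 4], current: vec![1, 2, 3, 4] };
    b.finish_row();
    assert_eq!(b.prev, vec![1, 2, 3, 4]);
    assert_eq!(b.current, vec![0; 4]);
}
```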

Status: in progress - actively working on this:

  • Prototype: see here

  • Problem: I need to convince myself that this really helps (see also the benchmarking write up in a separate section below). Strategy:

    • High confidence / I plan to work on this soon: Implement a benchmark that isolates the new RawRowsBuffer while still replicating decoding/decompressing patterns (of say kodim02.png). This can be done by abstracting away ReadDecoder by hiding it behind an equivalent AbstractReadDecoder trait. The abstract trait can be also implemented as a pass-thru to capture/log the decoding/decompressing patterns (that should be replicated by the microbenchmark).
    • Low confidence: Somehow reduce noise for end-to-end benchmarks…
      • Maybe creating an artificial PNG that replicates most behavioral patterns of kodim02.png, but uses no unfilter and uses no Huffman encoding… This seems hard…
      • Other ideas from the benchmarking write-up below?
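The pass-thru/replay idea could look roughly like this (AbstractReadDecoder is the name from the bullet above; ReplayDecoder and the method signature are my own illustration):

```rust
// Hypothetical sketch of the dependency-injection idea: hide the real
// decoder behind a trait so a benchmark can substitute a replayed log of
// decompression calls. `AbstractReadDecoder` is the name proposed above;
// `ReplayDecoder` and `decode_next` are illustrative, not the crate's API.
trait AbstractReadDecoder {
    /// Returns the number of freshly "decompressed" bytes appended to `out`.
    fn decode_next(&mut self, out: &mut Vec<u8>) -> usize;
}

/// Replays a pre-recorded sequence of chunk sizes (e.g. captured while
/// decoding kodim02.png), producing zeroed bytes with no real inflate work.
struct ReplayDecoder {
    chunk_sizes: Vec<usize>,
    pos: usize,
}

impl AbstractReadDecoder for ReplayDecoder {
    fn decode_next(&mut self, out: &mut Vec<u8>) -> usize {
        let Some(&n) = self.chunk_sizes.get(self.pos) else { return 0 };
        self.pos += 1;
        out.extend(std::iter::repeat(0u8).take(n));
        n
    }
}

fn main() {
    let mut d = ReplayDecoder { chunk_sizes: vec![3, 5], pos: 0 };
    let mut out = Vec::new();
    assert_eq!(d.decode_next(&mut out), 3);
    assert_eq!(d.decode_next(&mut out), 5);
    assert_eq!(out.len(), 8);
}
```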

Idea 2.2: Reduce copying within ZlibStream::transfer_finished_data

There seem to be 2 performance improvement opportunities within ZlibStream::transfer_finished_data:

  • Avoid copying not-yet-decompressed/populated bytes in self.out_buffer[self.out_pos..].

    • Maybe we can drop fdeflate’s assumption that out_buffer has been zero-ed out. In this case, these bytes can simply be ignored.
    • OTOH, we can keep fdeflate's assumption and still avoid copying these bytes by relying on prepare_vec_for_appending to zero them out. In other words, we can replace a (slower?) memcpy with a (faster?) memset.
  • Reduce how often out_buffer is compacted

    • Today transfer_finished_data compacts the buffer every time (copying 32kB every time safe bytes are returned to the client; possibly copying multiple bytes per single returned byte)
    • Idea: only compact when safe > CHUNCK_BUFFER_SIZE * 3 (copying 32kB after returning at least 3*32kB of data; copying at most 1 byte per 3 returned bytes). The number 3 is somewhat arbitrary and there is a trade-off here: a higher constant means more memory/cache pressure (but less copying).
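A sketch of the thresholded compaction (the struct and field names are illustrative; CHUNCK_BUFFER_SIZE mirrors the crate's constant name):

```rust
// Illustrative sketch of "compact less often". The constant name mirrors
// the crate's CHUNCK_BUFFER_SIZE; the struct is a stand-in, not ZlibStream.
const CHUNCK_BUFFER_SIZE: usize = 32 * 1024;

struct OutBuffer {
    buf: Vec<u8>,
    /// Bytes at the front already handed back to the caller.
    consumed: usize,
}

impl OutBuffer {
    fn maybe_compact(&mut self) {
        // Only pay the memmove once at least 3 buffers' worth of data has
        // been returned, bounding copy cost to <= 1 byte per 3 returned.
        if self.consumed > CHUNCK_BUFFER_SIZE * 3 {
            self.buf.drain(..self.consumed);
            self.consumed = 0;
        }
    }
}

fn main() {
    let mut b = OutBuffer { buf: vec![0u8; CHUNCK_BUFFER_SIZE * 4], consumed: CHUNCK_BUFFER_SIZE };
    b.maybe_compact();
    assert_eq!(b.consumed, CHUNCK_BUFFER_SIZE); // below threshold: no copy
    b.consumed = CHUNCK_BUFFER_SIZE * 3 + 1;
    b.maybe_compact();
    assert_eq!(b.consumed, 0); // above threshold: compacted
}
```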

Status: started - plan to resume work on this soon:

  • Prototyped here (passes functional tests, not measured yet due to irrational fear of benchmark noise… 🙂)
  • Problem: I need to convince myself that this really helps (see also the benchmarking write-up in a separate section below). Strategy: similar to what I plan for idea 2.1 above:
    • High confidence / I plan to work on this soon: Implement a benchmark that isolates the ZlibStream while still replicating decoding/decompressing patterns (of say kodim02.png)

Idea 2.3: Avoid BufReader when unnecessary

Sometimes the input to the png crate has already been buffered into memory (Chromium’s //ui/gfx/codec/png_codec.cc takes a span of memory as input; the png crate’s benchmarks use Vec<u8>). In such scenarios, using a BufReader needlessly copies the already-buffered/already-in-memory bytes into BufReader’s internal buffer.

Status: thinking how to best address this:

  • Maybe the public API of the png and image crates should take BufRead instead of Read as input?
  • Maybe the public API of the png (and image?) crate can be bifurcated (internally holding Box<dyn BufRead> instead of BufReader<R> in ReadDecoder::reader):
    • new can continue taking Read as input (maybe deprecate new + rename it into from_unbuffered_input)
    • A new API (from_buf_read? from_buffered_input?) can take BufRead as input. Alternatively, we could also introduce a from_slice API (OTOH, all slices are BufRead but not all BufReads are slices, so this isn't a fully generic solution).
  • Other options?
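The bifurcated-constructor option could be sketched like this (Decoder here is a stand-in type and the constructor names are the tentative ones from above, not the crate's actual API):

```rust
use std::io::{BufRead, BufReader, Read};

// Sketch of the bifurcated-constructor idea. `Decoder` is a stand-in, not
// the png crate's actual type; the point is that a BufRead input can skip
// the extra copy into BufReader's internal buffer.
struct Decoder<R: BufRead> {
    reader: R,
}

impl<R: Read> Decoder<BufReader<R>> {
    /// For unbuffered inputs: wrap in a BufReader (one extra copy).
    fn from_unbuffered_input(r: R) -> Self {
        Decoder { reader: BufReader::new(r) }
    }
}

impl<R: BufRead> Decoder<R> {
    /// For already-buffered inputs (e.g. &[u8]): no extra copy.
    fn from_buffered_input(r: R) -> Self {
        Decoder { reader: r }
    }
}

fn main() {
    let bytes: &[u8] = b"\x89PNG";
    // &[u8] implements BufRead, so this path avoids the BufReader copy.
    let mut d = Decoder::from_buffered_input(bytes);
    let buf = d.reader.fill_buf().unwrap();
    assert_eq!(buf, b"\x89PNG");

    // An unbuffered Read still works via the wrapping constructor.
    let _d2 = Decoder::from_unbuffered_input(std::io::empty());
}
```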

Idea 2.4: Other copy avoidance

Half-baked ideas:

  • ZlibStream::out_buffer (extension of idea 2.2): maybe we can avoid compacting at all
    • Silly idea 2.4.1: What if we had a data structure like Vec (i.e. elements occupy contiguous memory) but that supports efficient dropping of large prefixes (bypassing the allocator and returning whole memory pages to the OS?)
    • Silly idea 2.4.2: Circular buffer instead of compacting (probably detrimental to performance of fdeflate...)
  • ZlibStream::in_buffer:
    • Seems impossible to avoid, because decompression needs to work with a single buffer, but PNG can split the compressed data stream across multiple IDAT chunks.

Idea 3: Minimize the number of allocations

Why this idea seems promising:

  • Chromium benchmarks show limited, but non-zero performance opportunities
    • 0.78% of runtime: partition_alloc::internal::PartitionBucket::SlowPathAlloc
    • 0.76% of runtime: allocator_shim::internal::PartitionMalloc
    • 0.66% of runtime: allocator_shim::internal::PartitionFree
  • There is some low-hanging fruit here - most allocations are triggered (directly or indirectly) by the following (as measured by heaptrack for the png crate's decoder benchmarks):
    • 26.1% allocations in fdeflate::decompress::Decompressor::build_tables
    • 17.1% allocations in png::decoder::zlib::ZlibStream::new
    • 15.7% allocations in png::decoder::stream::StreamingDecoder::parse_text
    • 12.2% allocations in png::decoder::zlib::ZlibStream::transfer_finished_data

Status: not started yet:

Idea 4: Try to improve decompression speed

Why this idea seems promising:

  • Other Chromium engineers point out that some Chromium patches may be responsible for significant performance improvements of zlib. In particular, there are SIMD-related patches here (see also a slide here by ARM engineers)
  • perf record of png crate’s benchmarks says that:
    • 56.44% of runtime is spent in fdeflate::decompress::Decompressor::read
    • 9.20% of runtime is spent in fdeflate::decompress::Decompressor::build_tables

Status: haven’t really started looking at fdeflate source code yet:

Idea 5: Try to improve expand_paletted

Why this idea seemed promising:

  • I am fairly sure that I saw expand_paletted relatively high (11% I think?) in a profile (but I didn’t take sufficient notes and now I have trouble replicating this…)
  • ARM engineers landed some related improvements in Chromium (see https://crbug.com/706134 and the slide here)

Status: tentatively abandoned for now

  • Prototyped here
  • The prototype seems to improve expand_palette microbenchmarks by 21% to 56%, but
    • I’ve realized that expand_paletted barely registers in a runtime profile of png crate’s end-to-end benchmarks (even when using the original testcase proposed by ARM engineers in https://crbug.com/706134#c6; I tested after tweaking that png crate's benchmark to use the EXPAND transformation similarly to how the png crate is used from the image crate)
    • What’s good for microbenchmarks isn’t necessarily good for end-to-end performance. See also https://abseil.io/fast/39.
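For reference, the LUT-style technique the prototype explored can be sketched as follows (names and signature are illustrative): precompute a 256-entry RGBA table once, so each palette index becomes a single 4-byte table load.

```rust
// Illustrative sketch of LUT-based palette expansion: the PLTE/tRNS data
// is folded into a 256-entry RGBA table up front, and the per-pixel work
// is just an indexed 4-byte copy. Names are not the crate's actual API.
fn expand_paletted(indices: &[u8], palette_rgba: &[[u8; 4]; 256], out: &mut Vec<u8>) {
    for &i in indices {
        out.extend_from_slice(&palette_rgba[i as usize]);
    }
}

fn main() {
    let mut palette = [[0u8; 4]; 256];
    palette[1] = [255, 0, 0, 255]; // opaque red at index 1
    let mut out = Vec::new();
    expand_paletted(&[0, 1], &palette, &mut out);
    assert_eq!(out, vec![0, 0, 0, 0, 255, 0, 0, 255]);
}
```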

Benchmarking is hard…

I find it challenging to evaluate commits/prototypes/changes to see whether they result in a performance improvement. Let me share some of my thoughts below (hoping to get feedback that will help me navigate this area of software engineering).

The best case is when a change has such a big positive impact that it clearly shows up in benchmarks - for example, the unfilter changes (idea 1 above) show a clear improvement in the png crate’s benchmarks (not only a 20% improvement on unfilter microbenchmarks for Paeth/bpp=3, but also a 5-8% improvement on some end-to-end decoder benchmarks and a “change within noise threshold” for other end-to-end benchmarks). OTOH, there are changes that seem beneficial (e.g. avoiding memory copies - idea 2 above) that have relatively modest gains (maybe 1%-7% in end-to-end benchmarks) and that may sometimes (randomly - due to noise?) appear to be performance regressions. I struggle a bit figuring out how to cope with the latter scenario.

  • Maybe I should try to run Criterion benchmarks for longer (e.g. --sample-size=1000 --measurement-time=500)? (This doesn’t seem to help. I still get wild swings in results… :-/)
  • Maybe I should look into cachegrind-based, iai-based instruction-count-focused benchmarks? OTOH it seems to me that the estimated cycle count can be rather inaccurate. The unfilter change for example shows the following changes in estimated cycles: kodim02=-8.1% (-5.1% runtime), kodim07=-19.2% (-8.4% runtime), kodim17=-0.01% (+0.87% runtime), kodim23=-5.7% (-2.4% runtime).
  • Maybe measuring an average and doing fancy statistics is misguided and one should just measure the minimum possible time? This assumes that the external noise can only make a benchmark slower. This idea is based on a blog post here.
  • Maybe I should give up microbenchmarks and instead rely on Chromium’s A/B experimentation framework.
  • Maybe I should introduce and use more focused microbenchmarks (and to some extent ignore the end-to-end decoder benchmarks of the png crate):
    • I do in fact plan to use dependency injection to isolate performance of RawRowsBuffer (idea 2.1) and ZlibStream (idea 2.2)
    • I also have a tentative idea to craft a PNG file that has minimal decompression and unfiltering overhead while still preserving most of the other behavioral patterns (e.g. preserving the number and length of IDAT and ZLIB chunks, how they get decompressed, etc.). This seems hard. Maybe I should try compressing a black PNG?
  • Maybe I should attempt to reduce the measurement noise somehow?
    • nice -19 is one way (sadly I didn’t think of using that initially…)
    • Maybe there is a way to check and/or guarantee that the CPU always uses a consistent clock speed/frequency (e.g. doesn’t sleep or enter a low-power mode)
    • Maybe I should look into discounting code/memory-layout effects as described in https://people.cs.umass.edu/~emery/pubs/stabilizer-asplos13.pdf?
    • Maybe I can somehow dedicate a single CPU core to the benchmarks? nice -19 helps to some extent, but I think that the benchmark may still yield to other processes (?) and/or interrupt handlers? OTOH, I am not sure if there is an easy way to do that: I don’t have csets command on my Linux box; writing to /proc/irq/0/smp_affinity seems icky; I am not sure if I can control boot flags in my datacenter-based machine to use isolcpus.

So...

So, what do you think? Does any of the above make sense? Am I wildly off anywhere? Any specific/actionable suggestions?

Feedback welcomed!

@image-rs image-rs locked and limited conversation to collaborators Sep 29, 2023
@fintelia fintelia converted this issue into discussion #416 Sep 29, 2023
