Hello!
I wanted to share some of my ideas for performance opportunities in the png crate.
I hope that we can discuss these ideas together to refine them - for example:
Maybe it’s better to invest in performance-sensitive areas of code other than the ones identified below?
Maybe there are better ways to benchmark and/or profile the code?
Maybe some of these changes are good for microbenchmarks but not for end-to-end performance?
etc.
I hope that this issue is a good way to have those discussions, but please shout if you think there might be better ways to engage. For example - maybe I should instead try joining a discord server somewhere? Joining a mailing list or discussion group? Writing all of this in a google doc (and discussing in doc comments)?
I apologize in advance if you find the text below rather long. I tried to keep things structured and avoid rambling, but I also wanted to brainstorm (and avoid immediately rejecting) half-baked or even silly ideas. Please bear with me :-).
My ultimate goal is to bring the performance of the png crate to parity with Chromium’s C++ implementation. A description of my (quite possibly flawed) comparison/measurement methodology can be found in the (incomplete / work-in-progress / please-be-gentle) “Discussion of benchmarking methodology” section in a separate doc (the link should be viewable by anyone, I’d be happy to grant commenting access if needed).
Performance improvement ideas
Idea 1: Explicit SIMD-ification of unfilter
Why this idea seems promising:
perf record of png crate’s benchmarks says that png::filter::unfilter accounts for 18% of runtime
libpng uses explicit SIMD intrinsics to SIMD-ify unfilter.
There are implementations for 3-bytes-wide and 4-bytes-wide pixels (e.g. png_read_filter_row_avg3_sse2 and png_read_filter_row_avg4_sse2)
There are implementations for Avg, Sub, and Paeth filter types (e.g. png_read_filter_row_sub3_sse2, png_read_filter_row_avg3_sse2, and png_read_filter_row_paeth3_sse2)
Status: PR is under review:
unfilter benchmarks landed in Scaffolding for direct benchmarking of crate::filter::unfilter #413
std::simd to speed-up unfilter for Paeth for bpp=3 and bpp=6 #414 (this yields 20% improvements in unfilter benchmarks)
Discussion of next steps:
(not in my short-term plan, help welcomed) Double-check if investing more time here is worthwhile:
Maybe one can use godbolt to compare disassembly of Rust vs C++ (for all filter types and/or bytes-per-pixels?). Example for Paeth/bpp=3: https://godbolt.org/z/j8fK39oEE.
FWIW, after the SIMD-ification png::filter::unfilter went down to 8.6% of runtime - this is the upper bound on possible further improvements.
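For concreteness, here is a minimal scalar sketch of Sub unfiltering (hypothetical, not the crate's actual code, and deliberately not SIMD): the per-byte dependency on the reconstructed pixel to the left is exactly what the SSE2/std::simd versions have to work around, since bytes within one pixel are independent lanes but consecutive pixels are not.

```rust
// Minimal scalar sketch of Sub unfiltering (hypothetical, not the crate's
// actual code). recon[i] = raw[i] + recon[i - bpp] (wrapping); the first
// pixel is stored verbatim. The recurrence on recon[i - bpp] is why the
// libpng SSE2 kernels process one pixel per step rather than a full vector.
fn unfilter_sub(bpp: usize, row: &mut [u8]) {
    for i in bpp..row.len() {
        row[i] = row[i].wrapping_add(row[i - bpp]);
    }
}

fn main() {
    // Filtered row (bpp=3): first pixel verbatim, later pixels store deltas.
    let mut row = vec![10u8, 20, 30, 1, 2, 3, 1, 2, 3];
    unfilter_sub(3, &mut row);
    assert_eq!(row, vec![10, 20, 30, 11, 22, 33, 12, 24, 36]);
}
```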
Idea 2: Avoid copying data within the png crate
Why this idea seems promising:
This seems to make intuitive sense - avoiding unnecessary work (moving bytes around) should improve performance (if we can avoid somehow hurting performance as a side effect - via extra computations, or extra cache pressure…)
perf record of png crate’s benchmarks (after SIMD-ification of unfilter) shows:
__memmove_avx_unaligned_erms: 9.25% of runtime
__memset_avx2_unaligned_erms: 1.45% of runtime
__memmove_avx_unaligned: 0.98% of runtime
Idea 2.1: Reduce copying of raw rows
Goal: get rid of the copies done in the two places here:
Preserving the previous row for unfiltering here: self.prev[..rowlen].copy_from_slice(&row[..rowlen])
Compacting the raw rows buffer here: self.current.drain(..self.scan_start).for_each(drop); (copies all bytes from self.scan_start to the end of the self.current vector)
Status: in progress - actively working on this:
Prototype: see here
Problem: I need to convince myself that this really helps (see also the benchmarking write up in a separate section below). Strategy:
High confidence / I plan to work on this soon: Implement a benchmark that isolates the new RawRowsBuffer while still replicating decoding/decompressing patterns (of say kodim02.png). This can be done by abstracting away ReadDecoder by hiding it behind an equivalent AbstractReadDecoder trait. The abstract trait can also be implemented as a pass-thru to capture/log the decoding/decompressing patterns (that should be replicated by the microbenchmark).
Low confidence: Somehow reduce noise for end-to-end benchmarks…
Maybe creating an artificial PNG that replicates most behavioral patterns of kodim02.png, but uses no unfilter and uses no Huffman encoding… This seems hard…
Other ideas from the benchmarking write-up below?
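As a thought experiment for the prev-row copy above, the O(rowlen) copy_from_slice could in principle become an O(1) pointer swap if the decoder kept two row buffers. A hypothetical sketch (RowBuffers and its methods are made-up names, not the crate's actual types):

```rust
use std::mem;

// Hypothetical sketch (RowBuffers is a made-up name, not the crate's actual
// type): keep two row buffers and swap them after each row, so preserving
// the previous row costs an O(1) swap instead of an O(rowlen)
// copy_from_slice. After the swap, `current` holds stale bytes that the
// next row overwrites anyway.
struct RowBuffers {
    prev: Vec<u8>,
    current: Vec<u8>,
}

impl RowBuffers {
    fn new(rowlen: usize) -> Self {
        RowBuffers { prev: vec![0; rowlen], current: vec![0; rowlen] }
    }

    // Called after `current` has been unfiltered in place.
    fn advance_row(&mut self) {
        mem::swap(&mut self.prev, &mut self.current);
    }
}

fn main() {
    let mut rows = RowBuffers::new(4);
    rows.current.copy_from_slice(&[1, 2, 3, 4]);
    rows.advance_row();
    assert_eq!(rows.prev, vec![1, 2, 3, 4]); // previous row preserved, no memcpy
}
```

Whether this is compatible with how the raw-rows buffer is actually laid out in the crate is exactly what the isolated benchmark above should help answer.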
Idea 2.2: Reduce copying within ZlibStream::transfer_finished_data
There seem to be 2 performance improvement opportunities within ZlibStream::transfer_finished_data:
Avoid copying not-yet-decompressed/populated bytes in self.out_buffer[self.out_pos..].
Maybe we can drop fdeflate’s assumption that out_buffer has been zero-ed out. In this case, these bytes can simply be ignored.
OTOH, we can keep fdeflate's assumption and still avoid copying these bytes by relying on zeroing them out in prepare_vec_for_appending. In other words, we can replace a (slower?) memcpy with a (faster?) memset.
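To illustrate the memset-for-memcpy trade, here is a hypothetical sketch (function and parameter names are made up, not the crate's actual code): during compaction, memmove only the live decompressed bytes and rebuild the zeroed tail with a fill, instead of also memmoving the not-yet-populated zero bytes.

```rust
// Hypothetical sketch (names are made up): compact the output buffer by
// memmoving only the live decompressed bytes (consumed..out_pos) to the
// front, then re-zero the tail with fill (memset) instead of memmoving the
// not-yet-populated zeros in out_buffer[out_pos..]. The buffer length is
// unchanged, so fdeflate's zeroed-output assumption still holds.
fn compact_keeping_zeroed_tail(out_buffer: &mut [u8], consumed: usize, out_pos: usize) -> usize {
    out_buffer.copy_within(consumed..out_pos, 0); // memmove: live bytes only
    let new_pos = out_pos - consumed;
    out_buffer[new_pos..].fill(0); // memset: restore the zeroed tail
    new_pos // the new out_pos
}

fn main() {
    let mut buf = vec![1u8, 2, 3, 4, 5, 0, 0, 0];
    let new_pos = compact_keeping_zeroed_tail(&mut buf, 2, 5);
    assert_eq!(new_pos, 3);
    assert_eq!(buf, vec![3, 4, 5, 0, 0, 0, 0, 0]);
}
```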
Reduce how often out_buffer is compacted
Today transfer_finished_data compacts the buffer every time (copying 32kB every time safe number of bytes is returned to the client; possibly copying multiple bytes per single returned byte)
Idea: only compact when safe > CHUNCK_BUFFER_SIZE * 3 (copying 32kB after returning at least 3*32kB of data; copying at most 1 byte per 3 returned bytes). The number 3 is somewhat arbitrary and there is a trade-off here: a higher constant means more memory/cache pressure (but less copying).
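The deferred-compaction idea can be sketched roughly as follows (OutBuffer, WINDOW, and COMPACT_FACTOR are hypothetical names for illustration; the real logic lives in ZlibStream and must keep the zlib back-reference window intact):

```rust
// Hypothetical sketch (OutBuffer is a made-up type): defer compaction until
// the dead prefix exceeds COMPACT_FACTOR * WINDOW, instead of compacting on
// every transfer, so each moved byte is amortized over several returned bytes.
const WINDOW: usize = 32 * 1024; // zlib back-reference window (32kB)
const COMPACT_FACTOR: usize = 3; // arbitrary; higher = less copying, more memory

struct OutBuffer {
    buf: Vec<u8>,
    start: usize, // bytes before `start` were already returned to the caller
}

impl OutBuffer {
    fn transfer_finished(&mut self, safe: usize) -> Vec<u8> {
        let out = self.buf[self.start..self.start + safe].to_vec();
        self.start += safe;
        // Keep the last WINDOW bytes before `start` reachable for zlib
        // back-references; only compact once the dead prefix is big enough
        // to make the memmove worthwhile.
        if self.start > WINDOW * COMPACT_FACTOR {
            self.buf.drain(..self.start - WINDOW);
            self.start = WINDOW;
        }
        out
    }
}

fn main() {
    let mut ob = OutBuffer { buf: vec![7u8; 200_000], start: 0 };
    let out = ob.transfer_finished(100_000);
    assert_eq!(out.len(), 100_000);
    assert_eq!(ob.start, WINDOW); // compacted, since 100_000 > 3 * 32kB
    assert_eq!(ob.buf.len(), 200_000 - (100_000 - WINDOW));
}
```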
Status: started - plan to resume work on this soon:
Prototyped here (passes functional tests, not measured yet due to irrational fear of benchmark noise… 🙂)
Problem: I need to convince myself that this really helps (see also the benchmarking write up in a separate section below). Strategy: similar to what I plan for idea 2.1 above:
High confidence / I plan to work on this soon: Implement a benchmark that isolates the ZlibStream while still replicating decoding/decompressing patterns (of say kodim02.png)
Idea 2.3: Avoid BufReader when unnecessary
Sometimes the input to the png crate has already been buffered into memory (Chromium’s //ui/gfx/codec/png_codec.cc takes a span of memory as input; png crate’s benchmarks use Vec<u8>). In such scenarios using a BufReader will needlessly copy the already-buffered/already-in-memory bytes into BufReader’s internal buffer.
Status: thinking how to best address this:
Maybe the public API of the png and image crates should take BufRead instead of Read as input?
Maybe the public API of the png (and image?) crate can be bifurcated (internally holding Box<dyn BufRead> instead of BufReader<R> in ReadDecoder::reader):
new can continue taking Read as input (maybe deprecate new + rename it to from_unbuffered_input)
A new API (from_buf_read? from_buffered_input?) can take BufRead as input. Alternatively we could also introduce a from_slice API (OTOH all slices are BufRead but not all BufReads are slices, so this isn't a fully generic solution).
Other options?
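One possible shape for the bifurcated API could look like the following sketch (constructor names like from_buffered_input are made up, not the crate's actual API; &[u8] implements BufRead, so slices bypass BufReader entirely):

```rust
use std::io::{BufRead, BufReader, Read};

// Hypothetical API sketch (constructor names are made up, not the crate's
// actual API): holding Box<dyn BufRead> lets already-in-memory inputs
// (&[u8] implements BufRead) bypass BufReader's internal buffer and its
// extra copy, while unbuffered readers still get wrapped as before.
struct Decoder<'a> {
    reader: Box<dyn BufRead + 'a>,
}

impl<'a> Decoder<'a> {
    // Unbuffered input: wrap in a BufReader, as today.
    fn from_unbuffered_input<R: Read + 'a>(r: R) -> Self {
        Decoder { reader: Box::new(BufReader::new(r)) }
    }

    // Already-buffered input: use it directly; no intermediate copy.
    fn from_buffered_input<R: BufRead + 'a>(r: R) -> Self {
        Decoder { reader: Box::new(r) }
    }
}

fn main() {
    // A slice is already "buffered": no BufReader needed.
    let mut decoder = Decoder::from_buffered_input(&b"\x89PNG"[..]);
    let mut signature = [0u8; 4];
    decoder.reader.read_exact(&mut signature).unwrap();
    assert_eq!(&signature, b"\x89PNG");

    // An unbuffered reader still works through the same field.
    let mut decoder2 = Decoder::from_unbuffered_input(&b"rest"[..]);
    let mut word = [0u8; 4];
    decoder2.reader.read_exact(&mut word).unwrap();
    assert_eq!(&word, b"rest");
}
```

The cost of this shape is one dynamic dispatch per read call; whether that matters relative to the avoided memcpy is something the benchmarks would have to show.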
Idea 2.4: Other copy avoidance
Half-baked ideas:
ZlibStream::out_buffer (extension of idea 2.2): maybe we can avoid compacting at all
Silly idea 2.4.1: What if we had a data structure like Vec (i.e. elements occupy contiguous memory) but that supports efficient dropping of large prefixes (bypassing the allocator and returning whole memory pages to the OS?)
Silly idea 2.4.2: Circular buffer instead of compacting (probably detrimental to performance of fdeflate...)
ZlibStream::in_buffer:
Seems impossible to avoid because decompression needs to work with a single buffer, but PNG can split compressed data stream across multiple IDAT chunks.
Idea 3: Minimize the number of allocations
Why this idea seems promising:
Chromium benchmarks show limited, but non-zero performance opportunities
0.78% of runtime: partition_alloc::internal::PartitionBucket::SlowPathAlloc
0.76% of runtime: allocator_shim::internal::PartitionMalloc
0.66% of runtime: allocator_shim::internal::PartitionFree
There are some low-hanging fruit here - most allocations are triggered (directly or indirectly) underneath (as measured by heaptrack for png crate's decoder benchmarks):
26.1% allocations in fdeflate::decompress::Decompressor::build_tables
17.1% allocations in png::decoder::zlib::ZlibStream::new
15.7% allocations in png::decoder::stream::StreamingDecoder::parse_text
12.2% allocations in png::decoder::zlib::ZlibStream::transfer_finished_data
Status: not started yet:
Idea 3.1: image crate and Chromium ignore text chunks - we should just call png::Decoder::set_ignore_text_chunk. (This should help with parse_text.)
Box in ZlibStream is not needed?
Idea 4: Try to improve decompression speed
Why this idea seems promising:
Other Chromium engineers point out that some Chromium patches may be responsible for significant performance improvements of zlib. In particular, there are SIMD-related patches here (see also a slide here by ARM engineers)
perf record of png crate’s benchmarks says that:
56.44% of runtime is spent in fdeflate::decompress::Decompressor::read
9.20% of runtime is spent in fdeflate::decompress::Decompressor::build_tables
Status: haven’t really started looking at fdeflate source code yet:
Idea 5: Try to improve expand_paletted
Why this idea seemed promising:
I am fairly sure that I saw expand_paletted relatively high (11% I think?) in a profile (but I didn’t take sufficient notes and now I have trouble replicating this…)
(std::simd::Simd::gather_or_default? mimicking Chromium SIMD code that I don’t understand?)
Status: tentatively abandoned for now:
The prototype seems to improve expand_palette microbenchmarks by 21% to 56%, but
I’ve realized that expand_paletted barely registers in a runtime profile of png crate’s end-to-end benchmarks (even when using the original testcase proposed by ARM engineers in https://crbug.com/706134#c6; I tested after tweaking that png crate's benchmark to use the EXPAND transformation similarly to how the png crate is used from the image crate)
What’s good for microbenchmarks isn’t necessarily good for end-to-end performance. See also https://abseil.io/fast/39.
Benchmarking is hard…
I find it challenging to evaluate commits/prototypes/changes to see whether they result in a performance improvement. Let me share some of my thoughts below (hoping to get feedback that will help me navigate this area of software engineering).
The best case is when a change has such a big positive impact that it clearly shows up on benchmarks - for example the unfilter changes (idea 1 above) show a clear improvement in png crate’s benchmarks (not only a 20% improvement on unfilter microbenchmarks for Paeth/bpp=3, but also a 5-8% improvement on some end-to-end decoder benchmarks and a “change within noise threshold” for other end-to-end benchmarks). OTOH, there are changes that seem beneficial (e.g. avoiding memory copies - idea 2 above) that have relatively modest gains (maybe 1% - 7% in end-to-end benchmarks) and that may sometimes (randomly - due to noise?) seem to be performance regressions. I struggle a bit figuring out how to cope with the latter scenario.
Maybe I should try to run Criterion benchmarks for longer (e.g. --sample-size=1000 --measurement-time=500)? (This doesn’t seem to help. I still get wild swings in results… :-/)
Maybe I should look into cachegrind-based, iai-based instruction-count-focused benchmarks? OTOH it seems to me that the estimated cycle count can be rather inaccurate. The unfilter change for example shows the following changes in estimated cycles: kodim02=-8.1% (-5.1% runtime), kodim07=-19.2% (-8.4% runtime), kodim17=-0.01% (+0.87% runtime), kodim23=-5.7% (-2.4% runtime).
Maybe measuring an average and doing fancy statistics is misguided and one should just measure the minimum possible time? This assumes that the external noise can only make a benchmark slower. This idea is based on a blog post here.
Maybe I should give up microbenchmarks and instead rely on Chromium’s A/B experimentation framework.
Maybe I should introduce and use more focused microbenchmarks (and to some extent ignore the end-to-end decoder benchmarks of the png crate):
I do in fact plan to use dependency injection to isolate performance of RawRowsBuffer (idea 2.1) and ZlibStream (idea 2.2)
I also have a tentative idea to craft a PNG file that has minimal decompression and unfiltering overhead while still preserving most of the other behavioral patterns (e.g. preserving the number and length of IDAT and ZLIB chunks, how they get decompressed, etc.). This seems hard. Maybe I should try compressing a black PNG?
Maybe I should attempt to reduce the measurement noise somehow?
nice -19 is one way (sadly I didn’t think of using that initially…)
Maybe there is a way to check and/or guarantee that the CPU is always using a consistent clock speed/frequency (e.g. doesn’t sleep or enter a low power mode)
Maybe I can somehow dedicate a single CPU core to the benchmarks? nice -19 helps to some extent, but I think that the benchmark may still yield to other processes (?) and/or interrupt handlers. OTOH, I am not sure if there is an easy way to do that: I don’t have the cset command on my Linux box; writing to /proc/irq/0/smp_affinity seems icky; and I am not sure if I can control boot flags on my datacenter-based machine to use isolcpus.
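To make the "measure the minimum" option above concrete, here is a tiny hypothetical harness sketch. The underlying assumption (which the blog post argues) is that external noise - scheduling, interrupts, frequency scaling - can only ever add time, so the fastest of N runs is the least-noisy estimate of the true cost:

```rust
use std::time::{Duration, Instant};

// Hypothetical sketch of the "measure the minimum" idea: if noise only ever
// adds time, the minimum over N runs approximates the code's true cost
// better than the mean, with no statistics needed.
fn min_time<F: FnMut()>(mut f: F, runs: usize) -> Duration {
    (0..runs)
        .map(|_| {
            let start = Instant::now();
            f();
            start.elapsed()
        })
        .min()
        .expect("runs must be > 0")
}

fn main() {
    let data: Vec<u64> = (0..10_000).collect();
    let mut sink = 0u64;
    // The sink prevents the compiler from optimizing the workload away.
    let best = min_time(|| sink = sink.wrapping_add(data.iter().sum::<u64>()), 100);
    println!("fastest of 100 runs: {:?} (sink: {})", best, sink);
}
```

The obvious caveat: the minimum hides cache-warmup and allocator effects that real decodes do pay, so it answers "how fast can this code go", not "how fast is it on average".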
So...
So, what do you think? Does any of the above make sense? Am I wildly off anywhere? Any specific/actionable suggestions?
Feedback welcomed!