Discussion: PNG decoding performance improvement opportunities #416
Replies: 16 comments 71 replies
-
General Optimization TipsMy first suggestion for benchmarking would be to collect a representative sample of a couple thousand PNGs to use as a corpus. In my experience, it is a much better use of time to measure many different images once than the same image many times. The corpus-bench was initially created to work on encoder performance and compression ratio, but with some tweaks it should be suitable for optimizing decoding (specifically, you'll want to measure the time decoding the original version of PNGs, rather than re-encoded versions). I've personally used the QOI Corpus, but I suspect Chromium might already have -- or be able to collect -- its own corpuses Ideally, you'd measure both this crate and Chromium's existing PNG decoder against the same corpus, which should give an indication of how much potential for further improvement is left. (I'd caution you not to rely on other people's benchmarks here. As an example, the numbers quoted in the Wuffs blog post are quite out of date) Having representative end-to-end benchmarks has two advantages: it lets you double check that any optimizations are worthwhile, and profiling benchmarks runs can help identify portions of the code worth optimizing. If you can compare traces between multiple decoders, that's even better for identifying optimization targets! |
Beta Was this translation helpful? Give feedback.
-
Idea 5: Try to improve expand_palettedThe core problem here is that the vast majority of PNGs aren't paletted (though only a representative corpus will tell you the exact number...). There's probably a lot of low hanging fruit here, so on the images that do have palettes you can make a big improvement. But optimizing a function that is never called for the vast majority of images isn't going to improve average performance very much, no matter how good a job you do! |
Beta Was this translation helpful? Give feedback.
-
Idea 4: Try to improve decompression speedThe
And another notable optimization that it uses is multi-byte decoding. This isn't to say there aren't further optimization opportunities. The decoder uses heuristics to decide how much time to spend building decoding tables versus performing decoding (bigger tables take longer to build but make decoding go faster). And there are surely other optimizations to be found. But this is all a long way of saying that there probably isn't a ton of easy performance wins to be found here. |
Beta Was this translation helpful? Give feedback.
-
Idea 3: Minimize the number of allocationsThere hasn't been a ton of investigation here. With focused attention, it is likely that small gains may be possible. |
Beta Was this translation helpful? Give feedback.
-
Idea 2: Avoid copying data within the png crateProfiling can be misleading here. Decoding an image requires pulling the bytes of the image from RAM into the CPU cache. The first time you do this will be quite slow. Subsequent copies of data that's already in the L1 cache will be extremely fast. If you avoid the first copy, you'll find that the next time the data is copied suddenly gets way slower. Avoiding copies is good for code clarity, but I personally don't think doing lots of copies is holding us back much. Idea 2.3: Avoid BufReader when unnecessaryI actually thought this was already removed. The convention elsewhere in image-rs is that decoders take an |
Beta Was this translation helpful? Give feedback.
-
Idea 1: Explicit SIMD-ification of unfilterI'm honestly thrilled this worked. Paeth with 3 bpp is probably the single most important case: 3 bpp is the most common bit depth and the paeth filter gets the highest compression ratio*. The net result is that it gets used a lot. And despite those two factors, paeth unfiltering was by far the slowest. *An adaptive filter that picks among all filter methods wins by maybe one percentage point, but a large portion of the rows still get encoded with paeth. |
Beta Was this translation helpful? Give feedback.
-
Benchmarking noiseSpecial thanks to @marshallpierce and @veluca93 for their tips on how to reduce benchmarking noise. In particular, the https://rigtorp.se/low-latency-guide/ link I got from @marshallpierce has been very helpful. For replicability, let me share the set of steps that I’ve followed when running benchmarks reported below:
Other notes:
Benchmarking corpusThank you very much @fintelia for pointing out the QOI corpus of PNG images. That does indeed seem like a much better way to evaluate performance of PNG decoding. I plan to switch to this corpus after completing (and measuring) the current batch of in-progress performance improvements. Idea 2: Avoid copying data within the png crateI’ve tried combining ideas 2.1 and 2.2 into something that is both simpler, and avoids even more copies (maybe let’s call this idea 2.5 when referring to this later?): 87b44a8 This seems to help: 1 image wasn’t affected by much, 1 image improved by more than 10%, 4 images improved by 3-5%:
OTOH, I am a bit reluctant to pursue this direction further, because of the idea 6 below… ( Idea 6: Avoiding 32kB delay / L1 cache friendlinessIdeas 2.1, 2.2, or 2.5 above don’t change the fact that unfiltering happens with a delay of 32kB. And when discussing this with @veluca93 and @marshallpierce, they pointed out that such a delay means that things may drop out of the L1 cache most of the time (Skylake, for instance, has a 32K L1; and a bigger L1 cache doesn’t necessarily help outside of microbenchmarks, where other things put additional pressure on the cache). And avoiding the 32kB delay necessitates some copying (because we can't mutate the most recently decompressed 32kB), and therefore maybe idea 2.5 is not that great in the long run (i.e. avoiding the 32kB delay would require reverting parts of idea 2.5 and e.g. reintroducing some form of the I think I’ll focus on this area in the next couple of days. Quick-and-dirty notes about what this may encompass: |
Beta Was this translation helpful? Give feedback.
-
More realistic benchmarking corpusI have run some additional benchmarks on PNG images from the top 500 websites, and on the QOI corpus Fetching images from the top 500 websitesI have fetched images from the main pages of the top 500 most popular websites (according to https://moz.com/top500). The pages and their subresources were fetched using the following script:
Benchmarking processI have measured performance using a WIP Chromium CL at https://crrev.com/c/4980955/5. After patching the CL (and the chain of 11 upstream WIP CLs) I have built
The benchmarking process has been kicked off against the top 500 websites and against the QOI corpus (recommended above) by running the following commands:
The benchmark tested 2 implementations of PNG decoding under //ui/gfx/codec:
Disclaimers and other notes:
ResultsSpreadsheet with the results can be found here: For each tested file, the benchmark has recorded the following information about the behavior of the PNG decoder under
The spreadsheet has additionally computed the following data:
Discussion of some of the resultsIt is interesting to note the distribution of various PNG properties - this should help guide future optimization work:
The Rust-vs-C++ speed difference is more pronounced for small images, but small images don’t contribute as much as big images to the total runtime (at least from Chromium website rendering perspective - the conclusion may be different in other scenarios that are more heavily skewed toward small images). Revisiting copy avoidanceEffects on the bigger test corpusI have tried using the bigger testing corpus to test the performance impact of the copy avoidance approach (see the series of commits here. The numbers below show the ratio between the total, cumulative Rust runtime and the C/C++ runtime:
Effects on non-compressed imagesI have also created bare-bones PNG images, where no
The images used for these benchmarks are somewhat artificial, but they do help to measure the performance effects of the code outside of decompression (by magnifying this effect and therefore avoiding noise). My initial assumption was that if a change has a positive performance effect on such images, then it should also have a positive (although smaller) effect on more realistic images. It seems that the results from the QOI test corpus have invalidated this assumption. Next stepsI will break the copy avoidance commits into a set of a few smaller PRs. Hopefully proceeding in this way will shed some light on the source of the unexpected slowdown observed with the QOI test corpus (i.e. maybe I made a mistake in one of the commits and we can catch this with a collective review and/or with the new “noncompressed” benchmarks; one suspicious part is somewhat arbitrary choice of constants - we probably should use less than 1MB for Various notes and arguments for proceeding with merging the PRs:
|
Beta Was this translation helpful? Give feedback.
-
I just wanted to share that I plan to try implementing and measuring the impact of the following on the noncompressed-8x8 benchmark. (But I promise that no cookies have been licked - please just gives me a heads-up if you plan to work on these items yourself.) Hopefully today or tomorrow:
Hopefully later this week:
/cc @fintelia |
Beta Was this translation helpful? Give feedback.
-
EDIT 2024-01-12: WARNING: Conclusions of this comment/post are incorrect because of a mistake in measurement methodology (see here). I just wanted to note another negative result and confirm that investing in When profiling all 5 of these files via And when trying the improvement ideas from the commit here, the resulting runtime gets worse:
|
Beta Was this translation helpful? Give feedback.
-
There is a known performance issue on paletted images: #393 It is possible to do much better than the |
Beta Was this translation helpful? Give feedback.
-
The
|
Beta Was this translation helpful? Give feedback.
-
Hi, thought it would be nice to inform you, I ran some benchmarks on the latest The summary is that but kudos to everyone involved |
Beta Was this translation helpful? Give feedback.
-
A significant performance improvement can be obtained by switching the underlying Zlib implementation from With zlib-rs and the |
Beta Was this translation helpful? Give feedback.
-
stb_image is really good, Fabian, one of the developers has done
interesting optimizations in the decoder, e.g their paeth implementation is
https://github.com/nothings/stb/blob/5c205738c191bcb0abc65c4febfa9bd25ff35234/stb_image.h#L4657
which
looks a bit weird but boosted decoding perf for some paeth heavy images by
20% on some corpus I had when the implementation fell back to scalar so I
am not surprised its doing competitively well
…On Thu, 28 Nov 2024 at 08:05, Jonathan Behrens ***@***.***> wrote:
I'm glad you're seeing similar numbers. One interesting reference is this
2021 blog post
<https://nigeltao.github.io/blog/2021/fastest-safest-png-decoder.html>
comparing wuffs' PNG decoder to others. The sample size is very limited,
but at the time they measured wuffs being 2-3x faster than their C-based
competitors and libpng being overall a bit slower than spng and stb_image.
(They also measured this crate; it is incredible to see just how far we've
come since then!)
I'd totally believe that something strange is going on with the benchmark
harness, but I haven't been able to find anything that would account for
the size of differences we're seeing. Just minor differences in the
decoding options used and how input/output buffers are managed. Though it
is entirely possible that one or more of the libraries are being used
incorrectly somehow.
I also just added logic to write a measurements.csv file when the harness
finishes. It includes the the decoding time in milliseconds for each image
across the different decoders to make it easier to investigate outliers.
—
Reply to this email directly, view it on GitHub
<#416 (reply in thread)>,
or unsubscribe
<https://github.com/notifications/unsubscribe-auth/AFZRVEYJL2KGLYZ5NIZ3CAL2C2QATAVCNFSM6AAAAAA5NFHHF6VHI2DSMVQWIX3LMV43URDJONRXK43TNFXW4Q3PNVWWK3TUHMYTCNBQGIYTCMY>
.
You are receiving this because you commented.Message ID:
***@***.***>
|
Beta Was this translation helpful? Give feedback.
-
Below I am sharing some data that may help estimate the impact of PGO (Profile-guided Optimization) on PNG decoding runtime. (This is relevant for me, because by default Chromium's PGO data is generated using only one experiment arm.) The TL;DR is that we can probably expect 1.3% - 5.9% improvement: Summary of results:
My repro steps were based on https://doc.rust-lang.org/rustc/profile-guided-optimization.html: Step 1: gather baseline measurements without PGO:
Step 2: gather PGO data and build an optimized binary:
Step 3: measure and compare PGO vs non-PGO (I am eliding results for artificial /
|
Beta Was this translation helpful? Give feedback.
-
Hello!
I wanted to share some of my ideas for performance opportunities in the
png
crate.I hope that we can discuss these ideas together to refine them - for example:
I hope that this issue is a good way to have those discussions, but please shout if you think there might be better ways to engage. For example - maybe I should instead try joining a discord server somewhere? Joining a mailing list or discussion group? Writing all of this in a google doc (and discussing in doc comments)?
I apologize in advance if you find the text below rather long. I tried to keep things structured and avoid rambling, but I also wanted to brainstorm (and avoid immediately rejecting) half-baked or even silly ideas. Please bear with me :-).
My ultimate goal is to bring the performance of the
png
crate to parity with Chromium’s C++ implementation. Description of my (quite possibly flawed) comparison/measurement methodology can be found in the (incomplete / work-in-progress / please-be-gentle) “Discussion of benchmarking methodology” section in a separate doc (the link should be viewable by anyone, I’d be happy to grant commenting access if needed)Performance improvement ideas
Idea 1: Explicit SIMD-ification of
unfilter
Why this idea seems promising:
perf record
ofpng
crate’s benchmarks says thatpng::filter::unfilter
accounts for 18% of runtimelibpng
uses explicit SIMD intrinsics to SIMD-ify unfilter.png_read_filter_row_avg3_sse2
andpng_read_filter_row_avg4_sse2
)png_read_filter_row_sub3_sse2
,png_read_filter_row_avg3_sse2
, andpng_read_filter_row_paeth3_sse2
Status: PR is under review:
unfilter
benchmarks landed in Scaffolding for direct benchmarking ofcrate::filter::unfilter
. #413std::simd
to speed-upunfilter
forPaeth
for bpp=3 and bpp=6 #414 (this yields 20% improvements inunfilter
benchmarks)Discussion of next steps:
std::simd
to speed-upunfilter
forPaeth
for bpp=3 and bpp=6 #414png::filter::unfilter
went down to 8.6% of runtime - this is the upper bound on possible further improvements.Idea 2: Avoid copying data within the
png
crateWhy this idea seems promising:
This seems to make intuitive sense - avoiding unnecessary work (moving bytes around) should improve performance (if we can avoid somehow hurting performance as a side effect - via extra computations, or extra cache pressure…)
perf record
ofpng
crate’s benchmarks (after SIMD-ification ofunfilter
) shows:__memmove_avx_unaligned_erms
: 9.25% of runtime__memset_avx2_unaligned_erms
: 1.45% of runtime__memmove_avx_unaligned
: 0.98% of runtimeIdea 2.1: Reduce copying of raw rows
Goal: get rid of the copies done in the two places here:
self.prev[..rowlen].copy_from_slice(&row[..rowlen])
self.current.drain(..self.scan_start).for_each(drop);
(copies all bytes fromself.scan_start
to the end of theself.current
vector)Status: in progress - actively working on this:
Prototype: see here
Problem: I need to convince myself that this really helps (see also the benchmarking write up in a separate section below). Strategy:
RawRowsBuffer
while still replicating decoding/decompressing patterns (of saykodim02.png
). This can be done by abstracting awayReadDecoder
by hiding it behind an equivalentAbstractReadDecoder
trait. The abstract trait can be also implemented as a pass-thru to capture/log the decoding/decompressing patterns (that should be replicated by the microbenchmark).unfilter
and uses no Huffman encoding… This seems hard…Idea 2.2: Reduce copying within
ZlibStream::transfer_finished_data
There seem to be 2 performance improvement opportunities within
ZlibStream::transfer_finished_data
:Avoid copying not-yet-decompressed/populated bytes in
self.out_buffer[self.out_pos..]
.fdeflate
’s assumption thatout_buffer
has been zero-ed out. In this case, these bytes can simply be ignored.fdeflate
's assumption and still avoid copying these bytes by rely on zeroing them out inprepare_vec_for_appending
. In other words, we can replace (slower?)memcpy
with (faster?)memset
.Reduce how often
out_buffer
is compactedtransfer_finished_data
compacts the buffer every time (copying 32kB every timesafe
number of bytes is returned to the client; possibly copying multiple bytes per single returned byte)safe > CHUNCK_BUFFER_SIZE * 3
(copying 32kB after returning at least 32*3kB of data; copying at most 1 byte per 3 returned bytes). Number 3 is somewhat arbitrary and there is a trade-off here: higher constant means more memory/cache pressure (but less copying).Status: started - plan to resume work on this soon:
ZlibStream
while still replicating decoding/decompressing patterns (of saykodim02.png
)Idea 2.3: Avoid
BufReader
when unnecessarySometimes the input the
png
crate has already been buffered into memory (Chromium’s//ui/gfx/codec/png_codec.cc
takes a span of memory as input;png
crate’s benchmarks useVec<u8>
). In such scenarios using aBufReader
will needlessly copy the already-buffered/already-in-memory bytes intoBufReader
’s internal buffer.Status: thinking how to best address this:
png
andimage
crates should takeBufRead
instead ofRead
as input?png
(andimage
?) crate can be bifurcated (internally holdingdyn Box<BufRead>
instead ofBufReader<R>
inReadDecoder::reader
):new
can continue takingRead
as input (maybe deprecatenew
+ rename it intofrom_unbuffered_input
)from_buf_read
?from_buffered_input
?) can takeBufRead
as input. Alternatively we could also introduce afrom_slice
API (OTOH all slices areBufRead
but not allBufRead
s are slices, so this isn't a fully generic solution).Idea 2.4: Other copy avoidance
Half-baked ideas:
ZlibStream::out_buffer
(extension of idea 2.2): maybe we can avoid compacting at allVec
(i.e. elements occupy contiguous memory) but that supports efficient dropping of large prefixes (bypassing the allocator and returning whole memory pages to the OS?)fdeflate
...)ZlibStream::in_buffer
:Idea 3: Minimize the number of allocations
Why this idea seems promising:
partition_alloc::internal::PartitionBucket::SlowPathAlloc
allocator_shim::internal::PartitionMalloc
allocator_shim::internal::PartitionFree
heaptrack
forpng
crate'sdecoder
benchmarks):fdeflate::decompress::Decompressor::build_tables
png::decoder::zlib::ZlibStream::new
png::decoder::stream::StreamingDecoder::parse_text
png::decoder::zlib::ZlibStream::transfer_finished_data
Status: not started yet:
image
crate and Chromium ignore text chunks - we should just callpng::Decoder::set_ignore_text_chunk
. (This should help withparse_text
.)Box
inZlibStream
is not needed?Idea 4: Try to improve decompression speed
Why this idea seems promising:
zlib
. In particular, there are SIMD-related patches here (see also a slide here by ARM engineers)perf record
ofpng
crate’s benchmarks says that:fdeflate::decompress::Decompressor::read
fdeflate::decompress::Decompressor::build_tables
Status: haven’t really started looking at
fdeflate
source code yet:fdeflate
... I haven’t started reading the code yet…)Idea 5: Try to improve
expand_paletted
Why this idea seemed promising:
expand_paletted
relatively high (11% I think?) in a profile (but I didn’t take sufficient notes and now I have trouble replicating this…)std::simd::Simd::gather_or_default
? mimicking Chromium SIMD code that I don’t understand?)Status: tentatively abandoned for now
expand_palette
microbenchmarks by 21% to 56%, butexpand_paletted
barely registers in a runtime profile ofpng
crate’s end-to-end benchmarks (even when using the original testcase proposed by ARM engineers in https://crbug.com/706134#c6; I tested after tweaking thatpng
crate's benchmark to use theEXPAND
transformation similarly to how thepng
crate is used from theimage
crate)Benchmarking is hard…
I find it challenging to evaluate commits/prototypes/changes to see whether they result in a performance improvement. Let me share some of my thoughts below (hoping to get feedback that will help me navigate this area of software engineering).
The best case is if a change has such a big positive impact, that it clearly show up on benchmarks - for example the
unfilter
changes (idea 1 above) show a clear improvement inpng
crate’s benchmarks (not only 20% improvement onunfilter
microbenchmarks for Paeth/bpp=3, but also 5-8% improvement on some end-to-enddecoder
benchmarks and “change within noise threshold” for other end-to-end benchmarks). OTOH, there are changes that seem beneficial (e.g. avoiding memory copies - idea 2 above) that have relatively modest gains (maybe 1% - 7% in end-to-end benchmarks) and that may sometimes (randomly - due to noise?) seem to be performance regressions. I struggle a bit figuring how to cope with the latter scenario.--sample-size=1000 --measurement-time=500
? (This doesn’t seem to help. I still get wild swings in results… :-/)cachegrind
-based,iai
-based instruction-count-focused benchmarks? OTOH it seems to me that the estimated cycle count can be rather inaccurate. Theunfilter
change for example shows the following changes in estimated cycles: kodim02=-8.1% (-5.1% runtime), kodim07=-19.2% (-8.4% runtime), kodim17=-0.01% (+0.87% runtime), kodim23=-5.7% (-2.4% runtime).decoder
benchmarks of thepng
crate):RawRowsBuffer
(idea 2.1) andZlibStream
(idea 2.2)nice -19
is one way (sadly I didn’t think of using that initially…)nice -19
helps to some extent, but I think that the benchmark may still yield to other processes (?) and/or interrupt handlers? OTOH, I am not sure if there is an easy way to do that: I don’t havecsets
command on my Linux box; writing to /proc/irq/0/smp_affinity seems icky; I am not sure if I can control boot flags in my datacenter-based machine to useisolcpus
.So...
So, what do you think? Does any of the above make sense? Am I wildly off anywhere? Any specific/actionable suggestions?
Feedback welcomed!
Beta Was this translation helpful? Give feedback.
All reactions