NEON implementation for Adler32 #251

Adenilson · 2017-04-12T01:28:43Z

The checksum is calculated over the uncompressed PNG data and can be made much faster by using SIMD.

Tests on ARMv8 yielded an improvement of about 3x (e.g. wall time dropped from 350ms to 125ms for a 4096x4096-byte buffer processed 30 times). That results in at least an 18% improvement in PNG image decoding in Chromium.

Further details at:
https://bugs.chromium.org/p/chromium/issues/detail?id=688601
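
For readers unfamiliar with the checksum, here is a minimal scalar sketch of Adler-32 in C. It is not the NEON code from this pull request; the function name `adler32_scalar` is made up for illustration, and the two constants are the ones zlib uses for the modulus and for the number of bytes that can be summed before a reduction is needed. The inner loop is just two running sums, which is what makes the checksum a natural fit for SIMD.

```c
#include <stddef.h>
#include <stdint.h>

#define ADLER_BASE 65521u  /* largest prime below 2^16 */
#define ADLER_NMAX 5552u   /* bytes that can be summed before the 32-bit sums must be reduced */

/* Hypothetical scalar reference. 'adler' is the running checksum (start with 1). */
uint32_t adler32_scalar(uint32_t adler, const unsigned char *buf, size_t len)
{
    uint32_t a = adler & 0xffffu;          /* low half: 1 + sum of all bytes */
    uint32_t b = (adler >> 16) & 0xffffu;  /* high half: sum of every intermediate 'a' */

    while (len > 0) {
        /* Defer the expensive modulo: process up to ADLER_NMAX bytes per block. */
        size_t n = len < ADLER_NMAX ? len : ADLER_NMAX;
        len -= n;
        while (n--) {
            a += *buf++;
            b += a;
        }
        a %= ADLER_BASE;
        b %= ADLER_BASE;
    }
    return (b << 16) | a;
}
```

Calling `adler32_scalar(1, data, size)` on a whole buffer gives the same value as feeding the buffer in pieces and passing each result back in as `adler`.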

ProgramMax · 2017-04-13T00:59:46Z

Hello everyone. cblume@google.com / @chromium.org here.

Don't forget to add yourself to contrib/README.contrib

Adenilson · 2017-04-13T17:56:23Z

@ProgramMax nice catch! Fixed.

CRC32 affects performance both for image decompression (PNG) and for general browsing of websites that serve compressed content (e.g. Content-Encoding: gzip).

This first patch implements an optimized CRC32 function using the dedicated instruction available in ARMv8. It should be between 6x (A53: 116ms vs. 22ms for a 4Kx4Kx4 buffer) and 10x (A72: 91ms vs. 9ms) faster than the C implementation currently used by zlib.

Details: https://bugs.chromium.org/p/chromium/issues/detail?id=709716

Change-Id: I069408ebc06c49a3c2be4ba3253319e025ee09d7

Adenilson · 2017-04-25T19:22:30Z

Just uploaded a function that uses the new ARMv8 crc32 instruction (it is about 10x faster than the C function in zlib).
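
As a rough illustration of the idea (not the uploaded patch itself), a byte-at-a-time CRC-32 built on the ACLE intrinsic `__crc32b` could look like the sketch below. The names `crc32_step` and `crc32_bytewise` are hypothetical. So that the sketch also builds on machines without the instruction, it falls back to a plain bitwise update whenever the compiler does not define `__ARM_FEATURE_CRC32`.

```c
#include <stddef.h>
#include <stdint.h>

#if defined(__ARM_FEATURE_CRC32)
#include <arm_acle.h>
/* One input byte per CRC32B instruction. */
static uint32_t crc32_step(uint32_t crc, uint8_t byte)
{
    return __crc32b(crc, byte);
}
#else
/* Portable fallback: bit-at-a-time update with the reflected polynomial. */
static uint32_t crc32_step(uint32_t crc, uint8_t byte)
{
    crc ^= byte;
    for (int k = 0; k < 8; k++)
        crc = (crc >> 1) ^ (0xEDB88320u & (0u - (crc & 1u)));
    return crc;
}
#endif

/* Hypothetical wrapper using the zlib convention: invert on entry and on exit. */
uint32_t crc32_bytewise(uint32_t crc, const unsigned char *buf, size_t len)
{
    crc = ~crc;
    while (len--)
        crc = crc32_step(crc, *buf++);
    return ~crc;
}
```

Both branches compute the same checksum, so the fallback doubles as a reference when testing on non-ARM hosts.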

To build the optimized versions, install an ARM compiler (e.g. from Linaro, https://releases.linaro.org/components/toolchain/binaries/6.3-2017.02/), export CC to point to it (e.g. export CC=arm-linux-gnueabihf-gcc-5), and then enable the feature in the CMake build system, either through the ncurses GUI (i.e. ccmake ..) or by passing the options directly:
cmake .. -DARMv8=ON -DARMv8CRC=ON

I didn't touch the configure script, and I tried my best to integrate the ARM-specific functions using the same strategy used for the AMD64 and ASM686 specific code.

Adenilson · 2017-04-25T19:47:14Z

Also validated the changes with 'make test':
(xenial)adenilson@localhost:~/canonical-fork/build$ make test
Running tests...
Test project /home/adenilson/canonical-fork/build
Start 1: example
1/2 Test #1: example .......................... Passed 0.01 sec
Start 2: example64
2/2 Test #2: example64 ........................ Passed 0.01 sec

100% tests passed, 0 tests failed out of 2

Total Test time (real) = 0.01 sec

Adenilson · 2017-05-12T19:37:59Z

Touching base on this one, as over a month has passed already.

Adenilson · 2017-06-17T00:57:39Z

Friendly ping on the matter.

Adenilson · 2017-08-08T18:54:16Z

@madler is there anything you would like to change in the original patch?

Adenilson · 2017-09-03T01:07:25Z

@madler ping?

diizzyy · 2017-09-04T12:19:43Z

@Adenilson
Is this ARMv8 only or does it also work on ARMv7?

Adenilson · 2017-09-04T21:07:14Z

The first patch should work on both ARMv7 (assuming that the SoC has a NEON unit) and ARMv8.

The second (CRC32) is ARMv8-only (both AArch32 and AArch64).
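
One way to see which of the two paths a given build could use is to check the ACLE feature macros that GCC and Clang define. The sketch below is an assumption about how one might probe this at compile time, not code from either patch; the enum and function names are hypothetical.

```c
/* Hypothetical bit flags describing what the compiler was allowed to target. */
enum { ARM_OPT_NEON = 1, ARM_OPT_CRC32 = 2 };

int arm_compile_time_features(void)
{
    int flags = 0;
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
    flags |= ARM_OPT_NEON;   /* Advanced SIMD available: the Adler-32 patch applies */
#endif
#if defined(__ARM_FEATURE_CRC32)
    flags |= ARM_OPT_CRC32;  /* ARMv8 CRC32 instructions enabled: the CRC32 patch applies */
#endif
    return flags;
}
```

On a 32-bit ARMv7 toolchain NEON typically has to be requested (e.g. with -mfpu=neon), and the CRC32 macro usually appears only with something like -march=armv8-a+crc. On a non-ARM host the function returns 0.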

Adenilson · 2017-09-21T19:06:41Z

@madler friendly ping?

Adenilson · 2017-11-18T00:32:37Z

For the record, Chrome M62 is shipping the inflate_fast optimization and Chrome M63 has a variant of the Adler-32 optimization.

I'm working towards having the patch using the ARMv8 crc32 instruction included in M64 (branching in 2 weeks).

Adenilson · 2017-11-18T01:07:35Z

contrib/arm/armv8_crc32.c

+uint32_t armv8_crc32_little(uint32_t crc,
+                            const unsigned char *buf,
+                            size_t len) {
+    uint32_t c;


On second thought, this should handle the case of buf == Z_NULL.
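
A minimal sketch of such a guard follows, assuming the zlib convention that crc32() returns the initial value 0 when the buffer is Z_NULL. The function names are hypothetical, and `crc32_body_stub` is only a portable stand-in for the accelerated routine under review.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the accelerated routine: plain bitwise CRC-32. */
static uint32_t crc32_body_stub(uint32_t crc, const unsigned char *buf, size_t len)
{
    crc = ~crc;
    while (len--) {
        crc ^= *buf++;
        for (int k = 0; k < 8; k++)
            crc = (crc >> 1) ^ (0xEDB88320u & (0u - (crc & 1u)));
    }
    return ~crc;
}

/* Guarded entry point: never dereferences a null buffer. */
uint32_t crc32_guarded(uint32_t crc, const unsigned char *buf, size_t len)
{
    if (buf == NULL)   /* Z_NULL in zlib terms */
        return 0;      /* the required initial CRC value */
    return crc32_body_stub(crc, buf, len);
}
```

With the guard in place, the usual zlib idiom of obtaining the starting value by passing a null buffer keeps working with the optimized function.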

Adenilson · 2017-12-13T01:52:25Z

An update on the issue: the crc32 optimization landed in Chrome M64, but it was reverted because it broke the build for android_x86 and android_x64.

I'm working on a new version of the crc32 optimization using __crc32d(), as it seems to be up to 2x faster than the original ARMv8 crc32 code (which was itself 10x faster than the vanilla C code).
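
To make the idea concrete, here is a hedged sketch of consuming eight bytes per instruction with `__crc32d` and finishing the tail with `__crc32b`. It is not the code that landed in Chromium; the helper names are invented, and little-endian byte order is assumed for the intrinsic path (hence the `__ARM_BIG_ENDIAN` check). When the CRC32 feature macro is absent, a bitwise fallback is used so the sketch still builds and produces the same result.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#if defined(__ARM_FEATURE_CRC32) && !defined(__ARM_BIG_ENDIAN)
#include <arm_acle.h>
static uint32_t step1(uint32_t crc, const unsigned char *p)
{
    return __crc32b(crc, *p);
}
static uint32_t step8(uint32_t crc, const unsigned char *p)
{
    uint64_t chunk;
    memcpy(&chunk, p, sizeof chunk);  /* alignment-safe 8-byte load */
    return __crc32d(crc, chunk);
}
#else
/* Portable fallback with identical results, one bit at a time. */
static uint32_t step1(uint32_t crc, const unsigned char *p)
{
    crc ^= *p;
    for (int k = 0; k < 8; k++)
        crc = (crc >> 1) ^ (0xEDB88320u & (0u - (crc & 1u)));
    return crc;
}
static uint32_t step8(uint32_t crc, const unsigned char *p)
{
    for (int i = 0; i < 8; i++)
        crc = step1(crc, p + i);
    return crc;
}
#endif

/* Hypothetical driver: 8-byte chunks first, then the remaining 0..7 bytes. */
uint32_t crc32_by8(uint32_t crc, const unsigned char *buf, size_t len)
{
    crc = ~crc;
    for (; len >= 8; buf += 8, len -= 8)
        crc = step8(crc, buf);
    for (; len > 0; buf++, len--)
        crc = step1(crc, buf);
    return ~crc;
}
```

The gain over the byte-wise loop comes from issuing one instruction per eight input bytes instead of one per byte.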

I've posted some data in:
https://bugs.chromium.org/p/chromium/issues/detail?id=709716#c22

It can be from 32x to 45x faster (big core vs. little core) than the original zlib C function for vectors of 8KB.

As soon as we land this in Chromium, I plan to update this merge request.

Some samples:

random data length 16384 bytes
--test--     crc32      Min time    Max Rate      Median time   Median Rate
crc32_zlib   d2a888f6   65 u-secs   0.24 MB/s     65 u-secs     0.24 MB/s
crc32_neon   d2a888f6   1 u-secs    15.62 MB/s    1 u-secs      15.62 MB/s

random data length 8192 bytes
--test--     crc32      Min time    Max Rate      Median time   Median Rate
crc32_zlib   09bc142e   32 u-secs   0.24 MB/s     33 u-secs     0.24 MB/s
crc32_neon   09bc142e   0 u-secs    inf MB/s      1 u-secs      7.81 MB/s
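
The rate columns can be cross-checked with a little arithmetic. The helper below is a hypothetical sketch, not part of the benchmark tool: it converts a byte count and a time in microseconds to decimal megabytes per second. For 16384 bytes in 65 u-secs it gives roughly 252 MB/s, so the printed 0.24 appears to be in a unit about a thousand times larger than the label suggests. That is an inference from the numbers, not something stated in the thread.

```c
/* Hypothetical helper: decimal megabytes per second for 'bytes' processed in 'usecs'. */
double throughput_mb_per_s(double bytes, double usecs)
{
    if (usecs <= 0.0)
        return 0.0;        /* timer too coarse to measure this run */
    return bytes / usecs;  /* bytes per microsecond equals 10^6 bytes per second */
}
```

The 0 u-secs and 1 u-secs readings for crc32_neon also show that a microsecond timer is too coarse for buffers this small, so the speedup ratios taken from these rows are only approximate.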

This adds two optimizations for ARM:
NEON optimized Adler(-)32 checksum algorithm (ARMv7 and newer NEON CPUs)
ARM(v7+) specific optimization for inflate

I've also connected inflate optimization to the build using the following source as template.
mirror/chromium@0397489#diff-a62ad2db6c83dbc205d34bb9a8884f16

Additional info:
https://codereview.chromium.org/2676493007/
https://codereview.chromium.org/2722063002/

Sources:
madler/zlib#251 (only the first commit)
madler/zlib#256

Signed-off-by: Daniel Engberg <daniel.engberg.lists@pyret.net>

Adenilson · 2018-01-24T01:41:49Z

Friendly ping on the subject.

Any hope to merge this upstream?

tbeu · 2018-01-24T09:55:41Z

Any hope to merge this upstream?

Hope never dies.

Adenilson · 2018-01-24T20:31:40Z

We did further testing: the average gain from the ARMv8 crc32 in PNG decoding ranges from 2.1% (140PNGs corpus) to 3.2% (Doodles), and up to 7.1% (Kodak).

The latest version of the crc32 patch (https://chromium-review.googlesource.com/c/chromium/src/+/801108) has code to perform the CPU feature detection on ARM.
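
For context, one common way to do this kind of detection on Linux and Android is to read the hardware capability bits with getauxval(). The sketch below shows that approach under stated assumptions; it is not taken from the Chromium patch, and the names `cpu_has_arm_crc32`, `crc32_fn` and `select_crc32` are hypothetical. On other platforms it simply reports that the instruction is unavailable.

```c
#include <stddef.h>
#include <stdint.h>

#if defined(__linux__) && (defined(__aarch64__) || defined(__arm__))
#include <sys/auxv.h>   /* getauxval, AT_HWCAP, AT_HWCAP2 */
#include <asm/hwcap.h>  /* HWCAP_CRC32 (AArch64), HWCAP2_CRC32 (AArch32) */
#endif

/* Returns 1 when the kernel reports the ARMv8 CRC32 instructions, else 0. */
int cpu_has_arm_crc32(void)
{
#if defined(__linux__) && defined(__aarch64__)
    return (getauxval(AT_HWCAP) & HWCAP_CRC32) != 0;
#elif defined(__linux__) && defined(__arm__)
    return (getauxval(AT_HWCAP2) & HWCAP2_CRC32) != 0;
#else
    return 0;  /* unknown platform: assume the instruction is absent */
#endif
}

/* Hypothetical dispatch helper: pick an implementation once at startup. */
typedef uint32_t (*crc32_fn)(uint32_t crc, const unsigned char *buf, size_t len);

crc32_fn select_crc32(crc32_fn accelerated, crc32_fn portable)
{
    return cpu_has_arm_crc32() ? accelerated : portable;
}
```

Checking at run time matters because a single binary may be installed on devices with and without the instruction, which a compile-time flag alone cannot distinguish.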

Adenilson · 2018-01-24T20:33:22Z

The image corpora:
a) Kodak: http://r0k.us/graphics/kodak/
b) doodles: https://drive.google.com/drive/folders/1BaqNYg9jUbuUlYLxaBDLCo8him30qyZV?usp=sharing

Adenilson · 2018-01-25T16:12:28Z

The 140PNGs corpus is internal to Chromium developers (and access is granted by request).

Adenilson · 2018-03-03T05:02:11Z

@madler friendly ping on the subject.

The optimized crc32 function improved the performance of decompressing gzipped content by 29%.

Adenilson · 2018-03-14T23:59:25Z

Hey, look at this... we are just about to complete 1 year (since the pull request)!
:-)

Adenilson · 2018-03-15T00:13:50Z

The ARMv8-a optimized crc32 code (an improved version of it) landed in Chromium almost 1 month ago and made the cut to Chromium M66.

If all goes well, M66 will be shipping to users in the first week of May with this optimization enabled for ARM.

Adenilson · 2018-04-11T21:49:05Z

Look at this, time really flies: 1 (one) year since the request was opened!

To celebrate the anniversary, I opened another merge request (the inflate_fast NEON optimization, which should be around 20% faster at decompression on average) in:
#345

And if one day we manage to merge the Chromium optimizations (ARM + Intel):
#346

Adenilson · 2018-04-25T17:35:00Z

Friendly ping... @madler any comment?

Adenilson · 2018-07-10T10:19:41Z

@madler any comment?

Adenilson · 2018-08-15T07:20:27Z

@madler ping?

This introduces ARM/NEON optimizations to zlib.

The first two patches are a NEON optimization relating to zlib's inflate function. They increase decompression speed and have been shipping in Chromium since release 62 (Oct. 2017). The patches have been pulled from a PR to zlib upstream: madler/zlib#345.

Patches 003 and 004 have been pulled from Fedora Core's aarch64 zlib package. They improve zlib compression speed and have been there for 4 months.

Patch 005 is pulled from a PR to zlib upstream: madler/zlib#251. It has been shipping in Chromium since release 63 and increases decompression speed.

Patch 006 is my own, to allow 005 to merge without conflict with the previous patches.

Signed-off-by: Ian Leonard <antonlacon@gmail.com>

Adenilson · 2018-10-10T21:18:09Z

It has been over a year and a half (i.e. 18 months), so I'm closing this merge request for the following reasons:

a) We are shipping faster (and better tested) versions of these optimizations in Chromium.
b) These optimizations have also been ported to x86.

For anyone interested, please check the latest code in:
a) NEON-ized Adler-32: https://cs.chromium.org/chromium/src/third_party/zlib/adler32_simd.c?l=203
b) ARMv8 crypto accelerated CRC32: https://cs.chromium.org/chromium/src/third_party/zlib/crc32_simd.c?l=159

tbeu · 2018-10-11T06:44:51Z

Too sad.

praiskup · 2018-10-11T09:03:49Z

Mark, I'm curious what would help with reviews of contributions like this one:

a) Is this something which should never go into zlib?
b) Is this only a lack of time? Could you e.g. specify how to become a contributor to zlib, so you could enlarge the set of people you trust enough to do the code reviews?
c) Would some automatic testing help here? I mean, if we had some hardware able to test the added code, would you be more confident that the code from a pull request is acceptable?

Adenilson mentioned this pull request Apr 12, 2017

ARM specific optimizations #216

Open

Adenilson force-pushed the NEON03 branch 2 times, most recently from bd6476f to aa4d660 Compare April 12, 2017 19:13

Adenilson force-pushed the NEON03 branch from aa4d660 to d2f06cd Compare April 13, 2017 17:54

diizzyy mentioned this pull request Sep 16, 2017

package/libs/zlib: Tidy up and add optimization lede-project/source#1329

Closed

Adenilson mentioned this pull request Sep 28, 2017

ARM-specific optimisations for inflate. #256

Closed

grooverdan mentioned this pull request Jan 17, 2018

Optimized CRC32 framework with Power crc32 c optimized function #335

Closed

Adenilson mentioned this pull request Jul 10, 2018

Port Chromium optimizations to zlib #346

Open

Adenilson closed this Oct 10, 2018

qmfrederik mentioned this pull request Mar 4, 2021

Failure when using compressed disk TrungNguyen1909/qemu-t8030#5

Closed
