-
Notifications
You must be signed in to change notification settings - Fork 2.5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
NEON implementation for Adler32 #251
Conversation
bd6476f
to
aa4d660
Compare
Hello everyone. cblume@google.com / @chromium.org here. Don't forget to add yourself to contrib/README.contrib |
The checksum is calculated in the uncompressed PNG data and can be made much faster by using SIMD. Tests in ARMv8 yielded an improvement of about 3x (e.g. walltime was 350ms x 125ms for a 4096x4096 bytes executed 30 times). That results in at least 18% improvement in image decoding in Chromium. Further details at: https://bugs.chromium.org/p/chromium/issues/detail?id=688601
@ProgramMax nice catch! Fixed. |
CRC32 affects performance for both image decompression (PNG) as also in general browsing while accessing websites that serve content using compression (i.e. Content-Encoding: gzip). This first patch implements an optimized CRC32 function using the dedicated instruction available in ARMv8. It should be between 6x (A53: 116ms X 22ms for a 4Kx4Kx4 buffer) to 10x faster (A72: 91ms x 9ms) than the C implementation currently used by zlib. Details: https://bugs.chromium.org/p/chromium/issues/detail?id=709716 Change-Id: I069408ebc06c49a3c2be4ba3253319e025ee09d7
Just uploaded a function that uses the new ARMv8 crc32 instruction (it is about 10x faster than the C function in zlib). To build using the optimized versions, just have an ARM compiler (e.g. from linaro, https://releases.linaro.org/components/toolchain/binaries/6.3-2017.02/) and export CC to point to it (e.g. export CC=arm-linux-gnueabihf-gcc-5) and next enable the feature in CMake buildsystem, either using the ncurses GUI (i.e. ccmake ..) or passing the options as: I didn't touch the configure script and tried my best to integrate the ARM specific functions using the same strategy used for AMD64 and ASM686 specific code. |
Also validated the changes with 'make test': 100% tests passed, 0 tests failed out of 2 Total Test time (real) = 0.01 sec |
Touching the base on this one as it has passed over 1 month already. |
Friendly ping on the matter. |
@madler anything you like to change on the original patch? |
@madler ping? |
@Adenilson |
First patch should work in both ARMv7 (assuming that the SoC has a NEON unit) + ARMv8. Second (CRC32) is ARMv8 (both AArch32 and AArch64) only. |
@madler friendly ping? |
For record, Chrome M62 is shipping the inflate_fast optimization and Chrome M63 has a variant of the Adler-32 optimization. I'm working towards having the patch using the ARMv8 crc32 instruction included in M64 (branching in 2 weeks). |
uint32_t armv8_crc32_little(uint32_t crc, | ||
const unsigned char *buf, | ||
size_t len) { | ||
uint32_t c; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In a second thought, this should handle the case of buf == Z_NULL.
An update on the issue: the crc32 optimization landed on Chrome M64 but it was reverted because it broke the build for android_x86 and android_x64. I'm working in a new version for the crc32 optimization, using __crc32d() as it seems to be up to 2x faster than the original ARMv8 crc32 code (which was itself 10x faster than the vanilla C code). I've posted some data in: It can be from 32x to 45x faster (big core X little core) than the original zlib C function for vectors of 8KB. As soon we land this on Chromium I'm planning to update this merge request. Some samples:random data length 16384 bytes random data length 8192 bytes |
This adds two optimizations for ARM: NEON optimized Adler(-)32 checksum algorithm (ARMv7 and newer NEON CPUs) ARM(v7+) specific optimization for inflate I've also connected inflate optimization to the build using the following source as template. mirror/chromium@0397489#diff-a62ad2db6c83dbc205d34bb9a8884f16 Additional info: https://codereview.chromium.org/2676493007/ https://codereview.chromium.org/2722063002/ Sources: madler/zlib#251 (only the first commit) madler/zlib#256 Signed-off-by: Daniel Engberg <daniel.engberg.lists@pyret.net>
This adds two optimizations for ARM: NEON optimized Adler(-)32 checksum algorithm (ARMv7 and newer NEON CPUs) ARM(v7+) specific optimization for inflate I've also connected inflate optimization to the build using the following source as template. mirror/chromium@0397489#diff-a62ad2db6c83dbc205d34bb9a8884f16 Additional info: https://codereview.chromium.org/2676493007/ https://codereview.chromium.org/2722063002/ Sources: madler/zlib#251 (only the first commit) madler/zlib#256 Signed-off-by: Daniel Engberg <daniel.engberg.lists@pyret.net>
This adds two optimizations for ARM: NEON optimized Adler(-)32 checksum algorithm (ARMv7 and newer NEON CPUs) ARM(v7+) specific optimization for inflate I've also connected inflate optimization to the build using the following source as template. mirror/chromium@0397489#diff-a62ad2db6c83dbc205d34bb9a8884f16 Additional info: https://codereview.chromium.org/2676493007/ https://codereview.chromium.org/2722063002/ Sources: madler/zlib#251 (only the first commit) madler/zlib#256 Signed-off-by: Daniel Engberg <daniel.engberg.lists@pyret.net>
Friendly ping on the subject. Any hope to merge this upstream? |
Hope never dies. |
We did further testing, the average gain of the ARMv8 crc32 in PNG decoding ranges from 2.1% (140PNGs corpus) to 3.2% (Doodles) and up to 7.1% (Kodak). The latest version of the crc32 patch (https://chromium-review.googlesource.com/c/chromium/src/+/801108) has code to perform the CPU feature detection on ARM. |
The image corpus: |
The 140PNGs corpus is internal to Chromium developers (and access is granted by request). |
@madler friendly ping on the subject. The crc32 optimized function improved the performance decompressing gzipped content in 29%. |
Hey, look at this... we are just about to complete 1 year (since the pull request)! |
The ARMv8-a optimized crc32 code (an improved version of it) landed in Chromium almost 1 month ago and made the cut to Chromium M66. If all goes well, M66 will be shipping to users in the first week of May with this optimization enabled for ARM. |
Look at this, time really flies: 1 (one) year since the request was open! To celebrate the anniversary, I opened another merge request (the inflate_fast NEON optimization, should be around 20% faster at decompression in average) in: And if one day we manage to merge the Chromium optimizations (ARM + Intel): |
Friendly ping... @madler any comment? |
@madler any comment? |
@madler ping? |
This introduces arm/neon optimizations to zlib. The first two patches are a neon optimization relating to zlib's inflate function. They increase decompression speed. It has been shipping in Chromimum since release 62 (Oct. 2017). The patches have been pulled from a PR to zlib upstream: madler/zlib#345. Patches 003 and 004 have been pulled from Fedora Core's aarch64 zlib package. They improve zlib compression speed and have been there for 4 months. Patch 005 is pulled from a PR to zlib upstream. madler/zlib#251. It's been shipping in Chromium since release 63, and increases decompression speed. Patch 006 is my own to allow 005 to merge without conflict with the previous patches. Signed-off-by: Ian Leonard <antonlacon@gmail.com>
It is being over 1 1/2 year (i.e. 18 months), I'm closing this merge request for the given reasons: a) We are shipping faster (and better tested) versions of this optimizations on Chromium. For anyone interested, please check the latest code in: |
Too sad. |
Mark, I'm curious what would help with reviews of similar contributions like this one:
|
The checksum is calculated in the uncompressed PNG data and can be made much faster by using SIMD.
Tests in ARMv8 yielded an improvement of about 3x (e.g. walltime was 350ms x 125ms for 4096x4096 bytes executed 30 times). That results in at least 18% improvement in PNG image decoding in Chromium.
Further details at:
https://bugs.chromium.org/p/chromium/issues/detail?id=688601