Inflate fast NEON optimization #345

Adenilson · 2018-04-05T18:11:33Z

This uses SIMD to perform wide loads/stores in inflate_fast, which should improve performance on ARM by 18% to 30%, depending on the data.

It also includes the fix for the InflateBack() corner case (details in: https://bugs.chromium.org/p/chromium/issues/detail?id=769880).

This optimization has been shipping in Chromium since M62 (it landed in the repository around September/October 2017).

In inflate_fast() the output pointer always has plenty of room to write. This means that, so long as the target is capable, wide unaligned loads and stores can be used to transfer several bytes at once. When the reference distance is too short, simply unroll the data a little to increase the distance.

For reference, please see: https://chromium.googlesource.com/chromium/src/+/78104f4d73e3bbb4155fa804d00ed66682180556

PS: this is still missing the fix for the inflate_back corner case.

Change-Id: I5216424ab584e069b77ddf04000a313d5ca99839
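The wide-copy idea described in this commit message can be sketched in portable C. This is an illustration only, not the code from the patch: `CHUNK`, `copy_chunk` and `chunk_copy_match` are hypothetical names, and `memcpy` of a 16-byte block stands in for the NEON load/store pair (for example `vld1q_u8`/`vst1q_u8`) that a real ARM build would presumably use.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define CHUNK 16 /* bytes moved per wide load/store (assumed width) */

/* Move one chunk with unaligned access. memcpy stands in for the SIMD
 * load/store pair so the sketch stays portable. */
static void copy_chunk(unsigned char *dst, const unsigned char *src)
{
    unsigned char tmp[CHUNK];
    memcpy(tmp, src, CHUNK);
    memcpy(dst, tmp, CHUNK);
}

/* Copy an LZ77 match of len bytes to out, reading from out - dist.
 * May write up to CHUNK - 1 bytes past out + len, so the caller must
 * guarantee that much slack in the output buffer. Returns out + len. */
static unsigned char *chunk_copy_match(unsigned char *out, size_t dist,
                                       size_t len)
{
    unsigned char *end = out + len;
    size_t d = dist;
    const unsigned char *from;

    assert(dist > 0);

    /* Distance too short for a wide copy: the match is a repeating
     * pattern of period dist, so replicate it with non-overlapping
     * copies. Each full pass doubles the source-to-output distance. */
    while (d < CHUNK && out < end) {
        size_t left = (size_t)(end - out);
        size_t n = d < left ? d : left;
        memcpy(out, out - d, n);
        out += n;
        d += n;
    }

    /* Source and output are now at least CHUNK apart (or the match is
     * already complete), so whole chunks can be moved at once. */
    from = out - d;
    while (out < end) {
        copy_chunk(out, from);
        out += CHUNK;
        from += CHUNK;
    }
    return end;
}
```

The last chunk may run past `out + len`; those extra bytes are overwritten by later output, which is why the technique depends on the spare room in the output buffer.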

This handles the case where a zlib user relies on the InflateBack API to decompress content. The NEON optimization assumes that it can perform wide stores, sometimes overwriting data on the output pointer (but never overflowing the buffer end, as it has enough room for the write). For infback there are no such guarantees (i.e. no extra wiggle room), which can result in illegal operations. This patch fixes the potential issue by falling back to the non-optimized code for such cases.

It also adds some comments about the entry assumptions in inflate and writes out a defined value at the write buffer to identify where the real data has ended (helpful while debugging).

For reference, please see: https://chromium.googlesource.com/chromium/src/+/0bb11040792edc5b28fcb710fc4c01fedd98c97c

Change-Id: Iffbda9eb5e08a661aa15c6e3d1c59b678cc23b2c
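To illustrate why the missing wiggle room matters, here is a minimal sketch of choosing between a wide path and a byte-wise path based on how much space is left in the output buffer. It is an assumption-laden illustration: the commit message only says the patch falls back to the non-optimized code, and the per-copy check, the `avail` parameter and the names `copy_bytes`, `copy_chunks` and `copy_match` are all hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define CHUNK 16 /* bytes moved per wide store (assumed width) */

/* Safe path: never writes past out + len. */
static void copy_bytes(unsigned char *out, const unsigned char *from,
                       size_t len)
{
    while (len--)
        *out++ = *from++;
}

/* Fast path: moves whole chunks, so it may write up to CHUNK - 1 bytes
 * past out + len. Requires out - from >= CHUNK. */
static void copy_chunks(unsigned char *out, const unsigned char *from,
                        size_t len)
{
    unsigned char *end = out + len;
    while (out < end) {
        unsigned char tmp[CHUNK];
        memcpy(tmp, from, CHUNK);
        memcpy(out, tmp, CHUNK);
        out += CHUNK;
        from += CHUNK;
    }
}

/* Copy a match of len bytes from out - dist to out. avail is the number
 * of writable bytes from out to the end of the buffer. Returns 1 if the
 * wide path was taken, 0 if the byte-wise fallback was used. */
static int copy_match(unsigned char *out, size_t dist, size_t len,
                      size_t avail)
{
    assert(dist > 0 && len <= avail);
    if (dist >= CHUNK && avail - len >= CHUNK - 1) {
        copy_chunks(out, out - dist, len);
        return 1;
    }
    /* Not enough slack (or distance too short): stay within bounds. */
    copy_bytes(out, out - dist, len);
    return 0;
}
```

With an exact-fit buffer (`avail == len`) the fallback is taken and nothing beyond the match is touched, which is the situation the commit message describes for infback.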

Adenilson · 2018-04-05T18:18:29Z

Ideally this should be applied first, followed by updated (WIP) versions of the checksum patches (i.e. optimized crc32 and adler32).

Adenilson · 2018-04-05T18:18:58Z

@madler any suggestions?

Adenilson · 2018-04-05T18:32:48Z

For further details concerning the optimization, please see:
https://bugs.chromium.org/p/chromium/issues/detail?id=697280

Adenilson · 2018-04-05T19:26:05Z

contrib/arm/inffast_chunk.c

@@ -0,0 +1,311 @@
+/* inffast.c -- fast decoding
+ * Copyright (C) 1995-2017 Mark Adler
+ * For conditions of distribution and use, see copyright notice in zlib.h


I wonder if we should point out clearly that this is a modified inffast.c (i.e. by adding the respective Copyright).

Adenilson · 2018-04-05T19:27:53Z

contrib/arm/inflate.c

@@ -0,0 +1,1582 @@
+/* inflate.c -- zlib decompression
+ * Copyright (C) 1995-2016 Mark Adler


I wonder if we should point out clearly that this is a modified inflate.c (i.e. by adding the respective Copyright).

Adenilson · 2018-04-25T17:33:02Z

Some benchmarking data from an ARM CPU (big core A72, snappy data set) shows an average performance improvement of 31%:

a) Vanilla
(xenial)adenilson@localhost:~/canonical-fork/build$ time taskset -c 3 ./zlib_bench gzip ~/corpora/snappy/testdata/*
/home/adenilson/corpora/snappy/testdata/alice29.txt :
GZIP: [b 1M] bytes 152089 -> 54426 35.8% comp 7.1 ( 7.2) MB/s uncomp 127.7 (127.9) MB/s
/home/adenilson/corpora/snappy/testdata/asyoulik.txt :
GZIP: [b 1M] bytes 125179 -> 48949 39.1% comp 6.5 ( 6.5) MB/s uncomp 120.5 (120.6) MB/s
/home/adenilson/corpora/snappy/testdata/baddata1.snappy :
GZIP: [b 1M] bytes 27512 -> 22920 83.3% comp 18.6 ( 18.7) MB/s uncomp 88.2 ( 88.3) MB/s
/home/adenilson/corpora/snappy/testdata/baddata2.snappy :
GZIP: [b 1M] bytes 27483 -> 23000 83.7% comp 18.6 ( 18.6) MB/s uncomp 88.4 ( 88.4) MB/s
/home/adenilson/corpora/snappy/testdata/baddata3.snappy :
GZIP: [b 1M] bytes 28384 -> 23705 83.5% comp 18.5 ( 18.5) MB/s uncomp 87.9 ( 87.9) MB/s
/home/adenilson/corpora/snappy/testdata/fireworks.jpeg :
GZIP: [b 1M] bytes 123093 -> 122927 99.9% comp 21.8 ( 21.8) MB/s uncomp 314.5 (314.8) MB/s
/home/adenilson/corpora/snappy/testdata/geo.protodata :
GZIP: [b 1M] bytes 118588 -> 15143 12.8% comp 34.4 ( 34.7) MB/s uncomp 237.2 (237.3) MB/s
/home/adenilson/corpora/snappy/testdata/html :
GZIP: [b 1M] bytes 102400 -> 13711 13.4% comp 27.3 ( 27.5) MB/s uncomp 220.2 (220.4) MB/s
/home/adenilson/corpora/snappy/testdata/html_x_4 :
GZIP: [b 1M] bytes 409600 -> 53299 13.0% comp 24.3 ( 24.5) MB/s uncomp 220.7 (221.1) MB/s
/home/adenilson/corpora/snappy/testdata/kppkn.gtb :
GZIP: [b 1M] bytes 184320 -> 38789 21.0% comp 5.2 ( 5.3) MB/s uncomp 162.3 (162.5) MB/s
/home/adenilson/corpora/snappy/testdata/lcet10.txt :
GZIP: [b 1M] bytes 426754 -> 144904 34.0% comp 7.2 ( 7.2) MB/s uncomp 129.4 (129.6) MB/s
/home/adenilson/corpora/snappy/testdata/paper-100k.pdf :
GZIP: [b 1M] bytes 102400 -> 81276 79.4% comp 22.1 ( 22.1) MB/s uncomp 146.2 (146.4) MB/s
/home/adenilson/corpora/snappy/testdata/plrabn12.txt :
GZIP: [b 1M] bytes 481861 -> 195220 40.5% comp 5.3 ( 5.3) MB/s uncomp 117.1 (117.4) MB/s
/home/adenilson/corpora/snappy/testdata/urls.10K :
GZIP: [b 1M] bytes 702087 -> 222381 31.7% comp 14.0 ( 14.0) MB/s uncomp 141.4 (141.5) MB/s

b) inflate_fast
(xenial)adenilson@localhost:~/canonical-fork/build$ time taskset -c 3 ./zlib_bench gzip ~/corpora/snappy/testdata/*
/home/adenilson/corpora/snappy/testdata/alice29.txt :
GZIP: [b 1M] bytes 152089 -> 54426 35.8% comp 7.2 ( 7.2) MB/s uncomp 177.1 (177.2) MB/s
/home/adenilson/corpora/snappy/testdata/asyoulik.txt :
GZIP: [b 1M] bytes 125179 -> 48949 39.1% comp 6.5 ( 6.5) MB/s uncomp 164.5 (164.6) MB/s
/home/adenilson/corpora/snappy/testdata/baddata1.snappy :
GZIP: [b 1M] bytes 27512 -> 22920 83.3% comp 18.8 ( 18.8) MB/s uncomp 90.8 ( 91.0) MB/s
/home/adenilson/corpora/snappy/testdata/baddata2.snappy :
GZIP: [b 1M] bytes 27483 -> 23000 83.7% comp 18.8 ( 18.8) MB/s uncomp 90.7 ( 90.7) MB/s
/home/adenilson/corpora/snappy/testdata/baddata3.snappy :
GZIP: [b 1M] bytes 28384 -> 23705 83.5% comp 18.7 ( 18.7) MB/s uncomp 90.4 ( 90.5) MB/s
/home/adenilson/corpora/snappy/testdata/fireworks.jpeg :
GZIP: [b 1M] bytes 123093 -> 122927 99.9% comp 21.8 ( 21.9) MB/s uncomp 311.1 (311.3) MB/s
/home/adenilson/corpora/snappy/testdata/geo.protodata :
GZIP: [b 1M] bytes 118588 -> 15143 12.8% comp 34.9 ( 35.1) MB/s uncomp 299.1 (299.1) MB/s
/home/adenilson/corpora/snappy/testdata/html :
GZIP: [b 1M] bytes 102400 -> 13711 13.4% comp 27.7 ( 27.7) MB/s uncomp 284.6 (284.9) MB/s
/home/adenilson/corpora/snappy/testdata/html_x_4 :
GZIP: [b 1M] bytes 409600 -> 53299 13.0% comp 24.7 ( 24.8) MB/s uncomp 284.9 (285.5) MB/s
/home/adenilson/corpora/snappy/testdata/kppkn.gtb :
GZIP: [b 1M] bytes 184320 -> 38789 21.0% comp 5.3 ( 5.3) MB/s uncomp 222.0 (222.1) MB/s
/home/adenilson/corpora/snappy/testdata/lcet10.txt :
GZIP: [b 1M] bytes 426754 -> 144904 34.0% comp 7.2 ( 7.3) MB/s uncomp 180.0 (180.1) MB/s
/home/adenilson/corpora/snappy/testdata/paper-100k.pdf :
GZIP: [b 1M] bytes 102400 -> 81276 79.4% comp 20.2 ( 21.8) MB/s uncomp 147.9 (149.5) MB/s
/home/adenilson/corpora/snappy/testdata/plrabn12.txt :
GZIP: [b 1M] bytes 481861 -> 195220 40.5% comp 5.3 ( 5.3) MB/s uncomp 163.4 (163.7) MB/s
/home/adenilson/corpora/snappy/testdata/urls.10K :
GZIP: [b 1M] bytes 702087 -> 222381 31.7% comp 14.0 ( 14.0) MB/s uncomp 175.1 (175.2) MB/s

Adenilson · 2018-07-10T10:19:13Z

@madler any comment?

Adenilson · 2018-08-15T07:20:37Z

@madler ping?

This introduces arm/neon optimizations to zlib.

The first two patches are a neon optimization relating to zlib's inflate function. They increase decompression speed and have been shipping in Chromium since release 62 (Oct. 2017). The patches have been pulled from a PR to zlib upstream: madler/zlib#345.

Patches 003 and 004 have been pulled from Fedora Core's aarch64 zlib package. They improve zlib compression speed and have been there for 4 months.

Patch 005 is pulled from a PR to zlib upstream: madler/zlib#251. It has been shipping in Chromium since release 63, and increases decompression speed.

Patch 006 is my own, to allow 005 to merge without conflict with the previous patches.

Signed-off-by: Ian Leonard <antonlacon@gmail.com>


PolynomialDivision · 2022-10-29T15:12:46Z

Can you rebase on the latest master? :)

Adenilson Cavalcanti added 2 commits April 5, 2018 11:05


Adenilson mentioned this pull request Apr 11, 2018

NEON implementation for Adler32 #251

Closed

Adenilson mentioned this pull request Jul 10, 2018

Port Chromium optimizations to zlib #346

Open

GerHobbelt pushed a commit to GerHobbelt/zlib that referenced this pull request Nov 20, 2021

Remove old zlib artifacts (madler#345)

3b52fdd

* Remove old zlib readme.
* Remove old zlib change history from inflate.c.
* Remove old treebuild.xml and zlib pdf.
