Clean up and optimize crc32c #22385

yuyichao · 2017-06-15T18:32:26Z

This is split out of #21849. A summary of the changes (all of which should also be in the commit messages):

* Move crc32c so that it can use logic in julia_internal.h and libjulia (the latter will be useful for ARM).
* Improve CPU dispatch, remove jl_crc32c_init.
  * On new enough glibc+gcc/clang no dispatch will be needed thanks to ifunc.
* Implement hardware accelerated crc32c on AArch64 and 32bit X86.
  * 32bit ARMv8 should work too, but hardware feature detection is much harder (due to old glibc version support), so that'll wait until #21849 (Implement function multi versioning in sysimg) is merged.
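For the case where ifunc is not available, the dispatch described above can be pictured as a function pointer that resolves itself on first use. The following is only a minimal sketch under stated assumptions, not the code in this PR: `my_crc32c`, `crc32c_resolve`, `have_hw_crc32c` and `crc32c_hw` are hypothetical names, and the feature probe is a stub that always reports "no hardware support".

```c
#include <stddef.h>
#include <stdint.h>

typedef uint32_t (*crc32c_fn)(uint32_t crc, const char *buf, size_t len);

/* Portable bitwise CRC-32C (reflected Castagnoli polynomial 0x82F63B78). */
static uint32_t crc32c_sw(uint32_t crc, const char *buf, size_t len)
{
    crc ^= 0xFFFFFFFFu;
    for (size_t i = 0; i < len; i++) {
        crc ^= (unsigned char)buf[i];
        for (int k = 0; k < 8; k++)
            crc = (crc >> 1) ^ (0x82F63B78u & (0u - (crc & 1u)));
    }
    return crc ^ 0xFFFFFFFFu;
}

/* Hypothetical feature probe. A real one would query CPUID on x86 or
 * getauxval(AT_HWCAP) on Linux/AArch64; this stub always says "no". */
static int have_hw_crc32c(void)
{
    return 0;
}

/* Stand-in for a hardware path (assumption: it only forwards to software). */
static uint32_t crc32c_hw(uint32_t crc, const char *buf, size_t len)
{
    return crc32c_sw(crc, buf, len);
}

static uint32_t crc32c_resolve(uint32_t crc, const char *buf, size_t len);

/* Starts out pointing at the resolver; replaced on the first call. */
static crc32c_fn crc32c_impl = crc32c_resolve;

static uint32_t crc32c_resolve(uint32_t crc, const char *buf, size_t len)
{
    crc32c_impl = have_hw_crc32c() ? crc32c_hw : crc32c_sw;
    return crc32c_impl(crc, buf, len);
}

/* Public entry point: one indirect call, no explicit init function needed. */
uint32_t my_crc32c(uint32_t crc, const char *buf, size_t len)
{
    return crc32c_impl(crc, buf, len);
}
```

With this shape no separate initialization call is required, which is the role jl_crc32c_init used to play. The unsynchronized pointer store is a simplification of this sketch; every thread that races through the resolver stores the same value, but a production version would want an atomic store.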

I'll probably post some benchmarks later.

nalimilan · 2017-06-15T18:35:54Z

test/misc.jl

+unsafe_crc32c_sw(a, n, crc) =
+    ccall(:jl_crc32c_sw, UInt32, (UInt32, Ptr{UInt8}, Csize_t), crc, a, n)
+crc32c_sw(a::Union{Array{UInt8},Base.FastContiguousSubArray{UInt8,N,<:Array{UInt8}} where N},
+       crc::UInt32=0x00000000) = unsafe_crc32c_sw(a, length(a), crc)


Incorrect indentation.

yuyichao · 2017-06-16T01:19:12Z

Added another optimization for 32bit archs, and here's the benchmark result. A ratio smaller than 1 represents a speedup.

The only slowdown comes from single-byte crc32c on arm and aarch64. It's unclear what causes that, but it probably doesn't matter.

The speedup on x64 comes only from reducing fixed overhead (which is measurable all the way up to ~1kB sizes), and the speedup for the other archs comes from algorithm optimizations and the use of hardware instructions.
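As a point of reference for what "algorithm optimizations" start from, here is a minimal byte-at-a-time, table-driven software CRC-32C. It is a baseline sketch and not the optimized code in this PR; `crc32c_bytewise` and the table names are hypothetical. It follows the convention of taking the previous CRC as the first argument, so a buffer can be processed in pieces.

```c
#include <stddef.h>
#include <stdint.h>

#define CRC32C_POLY 0x82F63B78u /* reflected Castagnoli polynomial */

static uint32_t crc32c_table[256];
static int crc32c_table_ready = 0;

/* Build the 256-entry lookup table: entry n is the CRC register after
 * feeding the single byte n into an all-zero register. */
static void crc32c_init_table(void)
{
    for (uint32_t n = 0; n < 256; n++) {
        uint32_t c = n;
        for (int k = 0; k < 8; k++)
            c = (c & 1u) ? (c >> 1) ^ CRC32C_POLY : c >> 1;
        crc32c_table[n] = c;
    }
    crc32c_table_ready = 1;
}

/* One table lookup per input byte. Faster software variants process
 * several bytes per step with more tables; hardware variants replace
 * the lookup with a CRC instruction. */
uint32_t crc32c_bytewise(uint32_t crc, const unsigned char *buf, size_t len)
{
    if (!crc32c_table_ready)
        crc32c_init_table();
    crc ^= 0xFFFFFFFFu;
    while (len--)
        crc = crc32c_table[(crc ^ *buf++) & 0xFFu] ^ (crc >> 8);
    return crc ^ 0xFFFFFFFFu;
}
```

The standard check value for CRC-32C is 0xE3069283 for the nine ASCII bytes "123456789", which is a convenient sanity test for any implementation, software or hardware.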

yuyichao · 2017-06-16T01:24:25Z

@jlbuild !filter=arm,aarch64,ppc

jlbuild · 2017-06-16T01:24:36Z

Status of `e6bc9b1` builds:

Builder Name	Build	Download
linuxaarch64	COMPLETE	Download
linuxarmv7l	COMPLETE	Download
linuxppc64le	PENDING	N/A

tkelman · 2017-06-16T08:37:15Z

src/crc32c.c

@@ -0,0 +1,590 @@
+/* crc32c.c -- compute CRC-32C using software table or available hardware instructions


adjust the path to this file in the exceptions list in contrib/add_license_to_files.jl

stevengj · 2017-06-16T16:45:33Z

base/util.jl

@@ -783,9 +783,9 @@ function crc32c end
 unsafe_crc32c(a, n, crc) = ccall(:jl_crc32c, UInt32, (UInt32, Ptr{UInt8}, Csize_t), crc, a, n)

 crc32c(a::Union{Array{UInt8},FastContiguousSubArray{UInt8,N,<:Array{UInt8}} where N}, crc::UInt32=0x00000000) =
-    unsafe_crc32c(a, length(a), crc)
+    unsafe_crc32c(a, length(a) % Csize_t, crc)


What is this for? Maybe you should just add an unsafe_crc32c(a, n::Int, crc) method for this if it is important?

yuyichao · 2017-06-16

It removes the overflow check on the conversion to Csize_t.
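The difference between the two conversions can be illustrated with a rough C analogy. This is a sketch of the semantics only, with hypothetical helper names: passing a signed length where a Csize_t is expected behaves like the checked variant (negative values are rejected), while `length(a) % Csize_t` behaves like the plain cast, which wraps modulo 2^N and needs no branch. Since a length is never negative, both give the same result for valid input.

```c
#include <stddef.h>
#include <stdint.h>

/* Checked flavor: report whether a signed length is representable as size_t.
 * A checked conversion would raise an error when this returns 0. */
int fits_in_size(intptr_t n)
{
    return n >= 0;
}

/* Wrapping flavor: a plain cast, defined by C to reduce modulo 2^N,
 * so it never fails and needs no comparison. */
size_t wrapping_to_size(intptr_t n)
{
    return (size_t)n;
}
```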

stevengj · 2017-06-16T16:48:00Z

src/crc32c.c

+}
+
+#if (defined(_CPU_X86_64_) || defined(_CPU_X86_)) && !defined(_COMPILER_MICROSOFT_)
+#ifdef _CPU_X86_64_


Indent nested preprocessor statements, e.g. # ifdef (keeping # at the start of the line)?
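A sketch of the suggested style, for illustration only. It uses compiler-predefined macros (`__x86_64__`, `__i386__`) rather than Julia's `_CPU_X86_64_` / `_CPU_X86_` so that it compiles on its own; the macro name `EXAMPLE_CRC32_PTR`, the 32-bit mnemonic and the empty fallback are assumptions, and the indentation width is a matter of taste.

```c
/* The '#' stays in column 0; nesting is shown by spaces after it. */
#if defined(__x86_64__) || defined(__i386__)
#  ifdef __x86_64__
#    define EXAMPLE_CRC32_PTR "crc32q" /* pointer-sized operand on 64-bit */
#  else
#    define EXAMPLE_CRC32_PTR "crc32l" /* assumed 32-bit counterpart */
#  endif
#else
#  define EXAMPLE_CRC32_PTR "" /* no x86 CRC32 instruction on this target */
#endif

/* Accessor so the selected mnemonic can be inspected at run time. */
const char *example_crc32_ptr(void)
{
    return EXAMPLE_CRC32_PTR;
}
```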

stevengj · 2017-06-16T16:50:08Z

contrib/add_license_to_files.jl

@@ -58,7 +58,7 @@ const skipfiles = [
    "../src/support/tzfile.h",
    "../src/support/utf8.c",
    "../test/perf/micro/randmtzig.c",
-    "../src/support/crc32c.c",
+    "../src/crc32c.c",


Did you forget to do a git mv for this file? The diff is confusing because it looks as if you removed the file and added it elsewhere.

yuyichao · 2017-06-17

No. The commit changes the file enough that git may not show them as the same file. Also, in general there is no difference between git mv and remove + add.

stevengj · 2017-06-16T16:51:40Z

src/crc32c.c

+
+#if (defined(_CPU_X86_64_) || defined(_CPU_X86_)) && !defined(_COMPILER_MICROSOFT_)
+#ifdef _CPU_X86_64_
+#define CRC32_PTR "crc32q"


CRC32_ASM or CRC32_INSTRUCTION?

yuyichao · 2017-06-17

The PTR is the important part, since this is the instruction that operates on uintptr_t.

stevengj · 2017-06-16T17:07:47Z

src/crc32c.c

+}
+// For ifdef detection below
+#    define crc32c_dispatch() crc32c_dispatch(getauxval(AT_HWCAP))
+#    define crc32c_dispatch_ifunc "crc32c_dispatch"


Since crc32c_dispatch_ifunc is always the same (when it is defined), do we really need a #define for it?

yuyichao · 2017-06-17

I'd rather not make that assumption.

stevengj · 2017-06-17

I don't see much of an improvement from "assume the implementation defines a function called crc32c_dispatch" to "assume that the implementation #defines a symbol named crc32c_dispatch_ifunc". What do you gain by allowing different names for the actual function, since only one can be defined at a time?

yuyichao · 2017-06-17

Errors are easier to catch that way.

stevengj · 2017-06-16T17:20:12Z

The benchmarks here may also be interesting: https://github.com/htot/crc32c (cc @htot). They found that Intel's hand-coded asm implementation is a factor of 2 faster than Adler's code. (I'm not sure that we actually care that much about CRC32c speed, but Intel's asm code appears to be 3-clause BSD if we want it.)

htot · 2017-06-18T22:29:46Z

Adler's code is very fast and so is Intel's assembly version. Differences occur at different buffer lengths due to the way the crc is recombined from the buffer parts. The Intel method also elegantly makes use of the pclmulqdq instruction to simplify the code. See also https://www.intel.com/content/dam/www/public/us/en/documents/white-papers/crc-iscsi-polynomial-crc32-instruction-paper.pdf

If you have a fixed buffer length that is unequal to 4048 you can use my repo to test which is fastest.

My work was only to port back the assembly code to C and make that run on 32 bit as well.
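The recombination step mentioned above relies on the linearity of the CRC: if two parts A and B are checksummed independently, crc(A || B) equals crc(A) advanced across len(B) zero bytes, XORed with crc(B). The sketch below shows that identity in the simplest possible form, with hypothetical function names. It advances bit by bit, so it costs O(len(B)); the fast implementations discussed here get the same result from precomputed tables or a carry-less multiply instead.

```c
#include <stddef.h>
#include <stdint.h>

#define CRC32C_POLY 0x82F63B78u /* reflected Castagnoli polynomial */

/* Plain bitwise CRC-32C with the usual pre- and post-inversion. */
uint32_t crc32c_ref(uint32_t crc, const unsigned char *buf, size_t len)
{
    crc ^= 0xFFFFFFFFu;
    while (len--) {
        crc ^= *buf++;
        for (int k = 0; k < 8; k++)
            crc = (crc & 1u) ? (crc >> 1) ^ CRC32C_POLY : crc >> 1;
    }
    return crc ^ 0xFFFFFFFFu;
}

/* Advance a raw CRC register across n zero bytes (no inversion). */
static uint32_t crc32c_shift_zeros(uint32_t reg, size_t n)
{
    while (n--)
        for (int k = 0; k < 8; k++)
            reg = (reg & 1u) ? (reg >> 1) ^ CRC32C_POLY : reg >> 1;
    return reg;
}

/* Combine crc(A) and crc(B), both computed from an initial crc of 0,
 * into crc(A || B), given only the length of B. */
uint32_t crc32c_combine(uint32_t crc_a, uint32_t crc_b, size_t len_b)
{
    return crc32c_shift_zeros(crc_a, len_b) ^ crc_b;
}
```

For example, combining the CRCs of "1234" and "56789" with `len_b = 5` gives the CRC of "123456789". How the buffer is split and how the zero-shift is computed is where implementations differ in speed at different buffer lengths.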

yuyichao · 2017-06-19T14:54:02Z

This PR is actually centered around getting the hardware accelerated version working on AArch64, including the cleanup needed to get it working with a minimum of ifdefs. Given that this is pretty big as it is, I'll probably leave the pclmulqdq change to another PR. (ARMv8 also has pmul, so I wonder if that can be applied too.)

htot · 2017-06-19T22:18:30Z

I have tried, but haven't been able to figure out the math behind pclmulqdq. I just replaced all the asm with C, then carefully verified that the code compiled with -O2/-O3 is similar to, and as fast as, the original. Of course for some instructions that is not possible, hence the inline asm.

nalimilan reviewed Jun 15, 2017

yuyichao force-pushed the yyc/crc32c branch 2 times, most recently from 5f6b8ef to e6bc9b1 Compare June 16, 2017 00:59

tkelman reviewed Jun 16, 2017

yuyichao force-pushed the yyc/crc32c branch 2 times, most recently from d5bc370 to 7af464f Compare June 16, 2017 14:36

stevengj reviewed Jun 16, 2017

yuyichao force-pushed the yyc/crc32c branch from 7af464f to 116a49d Compare June 18, 2017 16:39

yuyichao added 6 commits June 19, 2017 09:34

Move ifunc detection logic into julia_internal.h

04e8b23

Move crc32c.c out of support so that it can use julia runtime functions.

94bb51b

Clean up CPU dispatch in crc32c.c

3a8fbe6

* Skip runtime dispatch if required features are available at compile time
* Use lazy dispatch
* Remove jl_crc32c_init and change how sw version is tested
* Remove unneeded ifdef
* Use ifunc when available

Implement hardware accelerated CRC32C on AArch64

ec796e7

Enable hardware CRC32C on 32bit x86

811a40e

Optimize software crc32c on 32bit machine

fba468b

yuyichao force-pushed the yyc/crc32c branch from 116a49d to fba468b Compare June 19, 2017 13:34

yuyichao merged commit 0b53b9a into master Jun 19, 2017

yuyichao deleted the yyc/crc32c branch June 19, 2017 21:12

stevengj mentioned this pull request Mar 14, 2018

Undeclared HWCAP_CRC32 compilation error in ARMv8 (Jetson TX2) #26458

Closed

yuyichao mentioned this pull request Aug 8, 2022

Fix checksumming in the presence of large, mostly-zero buffers. rr-debugger/rr#3355

Open

stevengj mentioned this pull request Dec 9, 2023

incorporate upstream fixes to crc32c.c #52326

Merged
