Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Clean up and optimize crc32c #22385

Merged
merged 6 commits into from
Jun 19, 2017
Merged

Clean up and optimize crc32c #22385

merged 6 commits into from
Jun 19, 2017

Conversation

yuyichao
Copy link
Contributor

This is split out of #21849. A summary of the changes (should be all in commit messages)

  • Move crc32c so that it can use logic in julia_internal.h and libjulia. (the latter will be useful for ARM)

  • Improve CPU dispatch, remove jl_crc32c_init.

    On new enough glibc+gcc/clang no dispatch will be needed thanks to ifunc

  • Implement hardware accelerated crc32c on AArch64 and 32bit X86

    32bit ARMv8 should work too but hardware feature detection is much harder (due to old glibc version support) so that'll wait until Implement function multi versioning in sysimg #21849 is merged.

I'll probably post some benchmarks later.

test/misc.jl Outdated
unsafe_crc32c_sw(a, n, crc) =
ccall(:jl_crc32c_sw, UInt32, (UInt32, Ptr{UInt8}, Csize_t), crc, a, n)
crc32c_sw(a::Union{Array{UInt8},Base.FastContiguousSubArray{UInt8,N,<:Array{UInt8}} where N},
crc::UInt32=0x00000000) = unsafe_crc32c_sw(a, length(a), crc)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Incorrect indentation.

@yuyichao yuyichao force-pushed the yyc/crc32c branch 2 times, most recently from 5f6b8ef to e6bc9b1 Compare June 16, 2017 00:59
@yuyichao
Copy link
Contributor Author

Added another optimization for 32bit arch and here's the benchmark result. A ratio smaller than 1 represent speed up.

The only slow down comes from single byte crc32c on arm and aarch64. Unclear what's causing that but it probably doesn't matter.....

The speed up on x64 only comes from reducing fixed overhead (which is measureable all the way to ~1kB size) and the speed up for other archs comes from algorithm optimizations/use of hardware instructions.

@yuyichao
Copy link
Contributor Author

@jlbuild !filter=arm,aarch64,ppc

@jlbuild
Copy link

jlbuild commented Jun 16, 2017

Status of e6bc9b1 builds:

Builder Name Build Download
linuxaarch64 COMPLETE Download
linuxarmv7l COMPLETE Download
linuxppc64le PENDING N/A

@@ -0,0 +1,590 @@
/* crc32c.c -- compute CRC-32C using software table or available hardware instructions
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

adjust the path to this file in the exceptions list in contrib/add_license_to_files.jl

@yuyichao yuyichao force-pushed the yyc/crc32c branch 2 times, most recently from d5bc370 to 7af464f Compare June 16, 2017 14:36
@@ -783,9 +783,9 @@ function crc32c end
unsafe_crc32c(a, n, crc) = ccall(:jl_crc32c, UInt32, (UInt32, Ptr{UInt8}, Csize_t), crc, a, n)

crc32c(a::Union{Array{UInt8},FastContiguousSubArray{UInt8,N,<:Array{UInt8}} where N}, crc::UInt32=0x00000000) =
unsafe_crc32c(a, length(a), crc)
unsafe_crc32c(a, length(a) % Csize_t, crc)
Copy link
Member

@stevengj stevengj Jun 16, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is this for? Maybe you should just add an unsafe_crc32c(a, n::Int, crc) method for this if it is important?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Remove overflow check.

src/crc32c.c Outdated
}

#if (defined(_CPU_X86_64_) || defined(_CPU_X86_)) && !defined(_COMPILER_MICROSOFT_)
#ifdef _CPU_X86_64_
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Indent nested preprocessor statements, e.g. # ifdef (keeping # at the start of the line)?

@@ -58,7 +58,7 @@ const skipfiles = [
"../src/support/tzfile.h",
"../src/support/utf8.c",
"../test/perf/micro/randmtzig.c",
"../src/support/crc32c.c",
"../src/crc32c.c",
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Did you forget to do a git mv for this file? The diff is confusing because it as if you removed the file and added it elsewhere.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No. The commit changes the file enough that git may not show them as the same file. Also there isn't anything different between git mv and remove + add in general.

src/crc32c.c Outdated

#if (defined(_CPU_X86_64_) || defined(_CPU_X86_)) && !defined(_COMPILER_MICROSOFT_)
#ifdef _CPU_X86_64_
#define CRC32_PTR "crc32q"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

CRC32_ASM or CRC32_INSTRUCTION?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The PTR is the important part since it is the instruction that operates on uintptr_t.

}
// For ifdef detection below
# define crc32c_dispatch() crc32c_dispatch(getauxval(AT_HWCAP))
# define crc32c_dispatch_ifunc "crc32c_dispatch"
Copy link
Member

@stevengj stevengj Jun 16, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since crc32c_dispatch_ifunc is always the same (when it is defined), do we really need a #define for it?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd rather not make that assumption.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't see much of an improvement from "assume the implementation defines a function called crc32_dispatch" to "assume that the implementation #defines a symbol named crc32c_dispatch_ifunc". What do you gain by allowing different names for the actual function, since only one can be defined at a time?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The error can be easier to catch.

@stevengj
Copy link
Member

The benchmarks here may also be interesting: https://github.com/htot/crc32c (cc @htot). They found that Intel's hand-coded asm implementation is a factor of 2 faster than Adler's code. (I'm not sure that we actually care that much about CRC32c speed, but Intel's ASM code appears to 3-clause BSD if we want it.)

@htot
Copy link

htot commented Jun 18, 2017

Adler's code is very fast and so is Intel's assembly version. Differences occur at different buffer lengths due to the way of recombining the crc on the buffer parts. The intel method also elegantly makes use of the pclmuldq instruction to simplify the code. See also here https://www.intel.com/content/dam/www/public/us/en/documents/white-papers/crc-iscsi-polynomial-crc32-instruction-paper.pdf

If you have a fixed buffer length that is unequal to 4048 you can use my repo to test which is fastest.

My work was only to port back the assembly code to C and make that run on 32 bit as well.

@yuyichao
Copy link
Contributor Author

This PR is actually centered around making the hardware accelerated version working on AArch64 including the cleanup to get it working with minimum ifdef's. Given this is pretty big as it is, I'll probably leave the pclmuldq change to another PR. (ARMv8 also have pmul so I wonder if that can be applied too.)

@yuyichao yuyichao merged commit 0b53b9a into master Jun 19, 2017
@yuyichao yuyichao deleted the yyc/crc32c branch June 19, 2017 21:12
@htot
Copy link

htot commented Jun 19, 2017

I have tried, but haven't been to figure out the math behind pclmuldq. I just replaced all asm by C, then carefully verified the resulting compiled code with -O2/3 is similar and same speed as the original. Of course for some instructions that is not possible, hence the inline asm.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants