Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

incorporate upstream fixes to crc32c.c #52326

Merged
merged 10 commits into from
Nov 29, 2023
Merged

incorporate upstream fixes to crc32c.c #52326

merged 10 commits into from
Nov 29, 2023

Conversation

stevengj
Copy link
Member

@stevengj stevengj commented Nov 28, 2023

Closes #52325.

I have no only a vague idea what these changes to the __asm__ register arguments actually do; I just copied them from @madler upstream. It would be good if @yuyichao or @vtjnash or someone else who knows __asm__ syntax would take a look. It looks like they are a response to this comment on StackOverflow:

Your asm constraints aren't strictly safe. A pointer in a "r" constraint does not imply that the pointed-to memory is also an input. You could just use memory operands (or the _mm_crc32_u64 intrinsic), but if you want to force the addressing mode you can use a dummy memory operand like asm("crc32 (%1),%0 " : "+r"(crc0) : "r"(next), "m" (*(const char (*)[24]) pStr) );. Inlining or LTO could break this. See at&t asm inline c++ problem (which needs an update: pointer-to-array is cleaner than a struct with a flexible array member).

There should be test coverage of this PR via https://github.com/JuliaLang/julia/blob/master/stdlib/CRC32c/test/runtests.jl The existing test coverage wasn't enough to fully exercise Adler's code, so I added another test.

@stevengj stevengj added the upstream The issue is with an upstream dependency, e.g. LLVM label Nov 28, 2023
src/crc32c.c Show resolved Hide resolved
src/crc32c.c Outdated Show resolved Hide resolved
src/crc32c.c Outdated Show resolved Hide resolved
src/crc32c.c Outdated Show resolved Hide resolved
src/crc32c.c Outdated Show resolved Hide resolved
src/crc32c.c Outdated Show resolved Hide resolved
src/crc32c.c Show resolved Hide resolved
src/crc32c.c Outdated Show resolved Hide resolved
@stevengj
Copy link
Member Author

@vtjnash, I updated the other constraints following your example. Looks good?

@vtjnash vtjnash merged commit fd67cb2 into master Nov 29, 2023
7 checks passed
@vtjnash vtjnash deleted the crc32c_upstream branch November 29, 2023 21:13
@vchuravy
Copy link
Member

vchuravy commented Dec 5, 2023

This causes a build failure on 32bit debug

JuliaPackaging/Yggdrasil#7757 (comment)

@benlorenz
Copy link
Contributor

To reproduce the error:

$ cat Make.user 
MARCH=pentium4
ARCH=i686
$ make julia-src-debug
    CC src/crc32c.dbg.obj
julia-tmp/src/crc32c.c: In function 'crc32c_sse42':
julia-tmp/src/crc32c.c:117:13: error: 'asm' operand has impossible constraints
  117 |             __asm__(CRC32_PTR "\t" "(%3), %0\n\t"
      |             ^~~~~~~
julia-tmp/src/crc32c.c:140:13: error: 'asm' operand has impossible constraints
  140 |             __asm__(CRC32_PTR "\t" "(%3), %0\n\t"
      |             ^~~~~~~
make[1]: *** [Makefile:236: crc32c.dbg.obj] Error 1
make: *** [Makefile:97: julia-src-debug] Error 2

@stevengj
Copy link
Member Author

stevengj commented Dec 5, 2023

Are the constraints actually impossible or is this a compiler bug?

@stevengj
Copy link
Member Author

stevengj commented Dec 5, 2023

Maybe we should just change

            __asm__(CRC32_PTR "\t" "(%3), %0\n\t"
                    CRC32_PTR "\t" LONGx1 "(%3), %1\n\t"
                    CRC32_PTR "\t" LONGx2 "(%3), %2"
                    : "+r"(crc0), "+r"(crc1), "+r"(crc2)
                    : "r"(buf),
                      "m"(* (const char (*)[sizeof(void*)]) &buf[0]),
                      "m"(* (const char (*)[sizeof(void*)]) &buf[LONG]),
                      "m"(* (const char (*)[sizeof(void*)]) &buf[LONG*2]));

to

            __asm__(CRC32_PTR "\t" "(%3), %0\n\t"
                    CRC32_PTR "\t" LONGx1 "(%3), %1\n\t"
                    CRC32_PTR "\t" LONGx2 "(%3), %2"
                    : "+r"(crc0), "+r"(crc1), "+r"(crc2)
                    : "r"(buf),
                      "m"(* (const char (*)[sizeof(void*)]) buf));

(and similarly for the SHORT case) even though it's technically under-constrained?

@benlorenz
Copy link
Contributor

Are the constraints actually impossible or is this a compiler bug?

Both gcc and clang do complain, but only for -O0. clang gives:

error: inline assembly requires more registers than available

Freeing up the frame pointer helps, e.g. -O0 -fomit-frame-pointer works fine. Interestingly also -O1 -fno-omit-frame-pointer also works, but I don't know what other optimizations makes it work in this case.

@vtjnash
Copy link
Member

vtjnash commented Dec 5, 2023

how about this:

            __asm__(CRC32_PTR "\t" "(%1), %0\n"
                    : "+r"(crc0),
                    : "r"(buf),
                      "m"(* (const char (*)[sizeof(void*)]) &buf[0]));
            __asm__(CRC32_PTR "\t" LONGx1 "(%1), %0\n"
                    : "+r"(crc1),
                    : "r"(buf),
                      "m"(* (const char (*)[sizeof(void*)]) &buf[LONG]));
            __asm__(CRC32_PTR "\t" LONGx1 "(%1), %0\n"
                    : "+r"(crc2),
                    : "r"(buf),
                      "m"(* (const char (*)[sizeof(void*)]) &buf[LONG*2]));

we can simplify them also like this:

            __asm__(CRC32_PTR " %1, %0\n"
                    : "+r"(crc2),
                    : "rm"(* (const char (*)[sizeof(void*)]) &buf[LONG*2]));

@stevengj
Copy link
Member Author

stevengj commented Dec 8, 2023

See #52437 for updated constraints to fix the 32-bit debug build.

@madler
Copy link

madler commented Dec 8, 2023

You don't need the r in "rm". The r was a mistake on my part, and leaving it out permits more compiler optmization in addressing modes and unrolling loops. gcc doesn't do much with it, but clang really goes to town. This also requires that the offsets be moved from the instructions to the constraints, as suggested.

If you have different input constraints for the three instructions in one __asm__, then they need to be numbered appropriately in each instruction, %3, %4, and %5 respectively for the three constraints. Or you can just make it three different __asm__'s as suggested.

If you feel the need to tell the compiler the extent of the operand, then you should use 8 instead of sizeof(void*). The machine instruction crc32q always processes exactly eight bytes, regardless of what the compiler thinks the size of a pointer is, which does not have to be eight.

So this works:

            __asm__("crc32q\t%3, %0\n\t"
                    "crc32q\t%4, %1\n\t"
                    "crc32q\t%5, %2"
                    : "+r"(crc0), "+r"(crc1), "+r"(crc2)
                    : "m"(*(uint64_t const *)next),
                      "m"(*(uint64_t const *)(next + LONG)),
                      "m"(*(uint64_t const *)(next + 2*LONG)));

@stevengj
Copy link
Member Author

stevengj commented Dec 9, 2023

@madler, the reason we use sizeof(void*) is that the instruction we are using is crc32q on a 64-bit machine and crc32l on a 32-bit machine. (Patch by @yuyichao in #22385.)

(I guess we could also use uintptr_t.)

@madler
Copy link

madler commented Dec 9, 2023

Ah, ok.

vtjnash pushed a commit that referenced this pull request Dec 13, 2023
@vtjnash
Copy link
Member

vtjnash commented Dec 13, 2023

The r was a mistake on my part, and leaving it out permits more compiler optmization in addressing modes and unrolling loops

What gives that more options, when "rm" seems to mean "select either a register or a memory expression", which is stricly more optimization permission, and the crc32 instruction permits either form? They seem to generate identical code for me (at optimization levels at or above -Og anyways)

@madler
Copy link

madler commented Dec 13, 2023

These examples are with clang 15, using -O3.

The r appears to force the use of a pre-computed pointer:

.LBB0_14:                               #   Parent Loop BB0_13 Depth=1
                                        # =>  This Inner Loop Header: Depth=2
	movq	%rbx, %rbp
	movq	8(%rsi,%rdi), %rbx
	movq	8200(%rsi,%rdi), %rcx
	movq	16392(%rsi,%rdi), %rax
	movq	%rcx, -24(%rsp)
	movq	%rbx, -16(%rsp)
	movq	%rbp, %rbx
	movq	%rax, -32(%rsp)
	#APP
	crc32q	-16(%rsp), %rbx
	crc32q	-24(%rsp), %rdx
	crc32q	-32(%rsp), %r10
	#NO_APP
	addq	$8, %rdi
	cmpq	$8184, %rdi                     # imm = 0x1FF8
	jb	.LBB0_14

Whereas without the r, more creative addressing modes are used:

.LBB0_21:                               #   Parent Loop BB0_20 Depth=1
                                        # =>  This Inner Loop Header: Depth=2
	#APP
	crc32q	(%rsi,%rdi), %rcx
	crc32q	(%rdx,%rdi), %rbx
	crc32q	(%r10,%rdi), %rax
	#NO_APP
	leaq	8(%rdi), %rbp
	cmpq	$8184, %rdi                     # imm = 0x1FF8
	movq	%rbp, %rdi
	jb	.LBB0_21

The subsequent shorter loop with three crc32q's results in a simple loop like above with rm. Without the r, the loop is unrolled, since it can apply offsets in the memory address:

.LBB0_11:                               # =>This Inner Loop Header: Depth=1
	xorl	%ebx, %ebx
	xorl	%edx, %edx
	#APP
	crc32q	(%rsi), %rax
	crc32q	256(%rsi), %rbx
	crc32q	512(%rsi), %rdx
	#NO_APP
	#APP
	crc32q	8(%rsi), %rax
	crc32q	264(%rsi), %rbx
	crc32q	520(%rsi), %rdx
	#NO_APP

...

	#APP
	crc32q	248(%rsi), %rax
	crc32q	504(%rsi), %rbx
	crc32q	760(%rsi), %rdx

The net effect is about a 30% speedup by removing the r's.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
upstream The issue is with an upstream dependency, e.g. LLVM
Projects
None yet
Development

Successfully merging this pull request may close these issues.

incorporate upstream fixes to crc32c code
5 participants