Description
Currently, all the primitives in binary that load a single word do so by loading individual bytes and combining them with bitwise operations. This works and is portable, but it compiles to inefficient machine code. For instance, for the primitive below:
word32le :: B.ByteString -> Word32
word32le = \s ->
(fromIntegral (s `B.unsafeIndex` 3) `unsafeShiftL` 24) .|.
(fromIntegral (s `B.unsafeIndex` 2) `unsafeShiftL` 16) .|.
(fromIntegral (s `B.unsafeIndex` 1) `unsafeShiftL` 8) .|.
(fromIntegral (s `B.unsafeIndex` 0) )
ghc-9.12.2 with -O2 would emit this on x64:
Example_word32le_info:
leaq -8(%rbp),%rax
cmpq %r15,%rax
jb .Lc1pJ
movq $.Lc1or_info,-8(%rbp)
movq %r14,%rbx
addq $-8,%rbp
testb $7,%bl
jne .Lc1or
movq (%rbx),%rax
jmp *%rax
.Lc1or_info:
.Lc1or:
addq $16,%r12
cmpq 856(%r13),%r12
ja .Lc1pN
movq 7(%rbx),%rax
movq 15(%rbx),%rax
movb 3(%rax),%bl
movb 2(%rax),%cl
movb 1(%rax),%dl
movb (%rax),%al
movq $ghczminternal_GHCziInternalziWord_W32zh_con_info,-8(%r12)
movl $4294967295,%esi
movzbl %al,%eax
andq %rsi,%rax
movl $4294967295,%esi
movl $4294967295,%edi
movl $4294967295,%r8d
movzbl %dl,%edx
andq %r8,%rdx
shlq $8,%rdx
andq %rdi,%rdx
movl $4294967295,%edi
movl $4294967295,%r8d
movl $4294967295,%r9d
movzbl %cl,%ecx
andq %r9,%rcx
shlq $16,%rcx
andq %r8,%rcx
movl $4294967295,%r8d
movl $4294967295,%r9d
movzbl %bl,%ebx
andq %r9,%rbx
shlq $24,%rbx
andq %r8,%rbx
orq %rcx,%rbx
andq %rdi,%rbx
orq %rdx,%rbx
andq %rsi,%rbx
orq %rax,%rbx
movl %ebx,(%r12)
leaq -7(%r12),%rbx
addq $8,%rbp
jmp *(%rbp)
.Lc1pN:
movq $16,904(%r13)
jmp stg_gc_unpt_r1
.Lc1pJ:
leaq Example_word32le_closure(%rip),%rbx
jmp *-8(%r13)
Example_word32le_closure:
.quad Example_word32le_info
So GHC is currently unable to fuse these byte loads and bitwise operations into a single load. There is a lot of potential for improving performance here, especially since binary is used in many places in GHC itself, so an improvement would also affect GHC's compile-time performance.
Starting from ghc-9.10, unaligned Addr# primops are available, so it's possible to add logic to binary that uses them when available, loading and storing integer types larger than 8 bits efficiently in a single operation, all without breaking compatibility with existing serialized data or changing any user-facing interface. I can volunteer to write a patch to implement this, but first I'd like to hear some feedback: would such a patch be accepted? Would it warrant a CLC proposal?
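To make the idea concrete, here is a minimal sketch of what such a primitive could look like. It is only an illustration under stated assumptions, not the actual patch: the names `word32leFast` and `word32leBytes` are hypothetical, the CPP guard falls back to the existing byte-wise code on compilers older than ghc-9.10, and, like the `unsafeIndex` version, it assumes the input holds at least 4 bytes. The primop performs a native-endian load, so the sketch byte-swaps on big-endian targets to keep the serialized format unchanged.

```haskell
{-# LANGUAGE CPP #-}
{-# LANGUAGE MagicHash #-}
{-# LANGUAGE UnboxedTuples #-}

import Data.Bits (unsafeShiftL, (.|.))
import qualified Data.ByteString as B
import qualified Data.ByteString.Unsafe as B
import Data.Word (Word32)

#if MIN_VERSION_GLASGOW_HASKELL(9,10,0,0)
import Data.Word (byteSwap32)
import GHC.ByteOrder (ByteOrder (..), targetByteOrder)
import GHC.Exts (readWord8OffAddrAsWord32#)
import GHC.IO (IO (..))
import GHC.Ptr (Ptr (..))
import GHC.Word (Word32 (..))
import System.IO.Unsafe (unsafeDupablePerformIO)
#endif

-- Portable fallback: the byte-at-a-time definition quoted above.
-- Precondition: the ByteString has at least 4 bytes.
word32leBytes :: B.ByteString -> Word32
word32leBytes s =
  (fromIntegral (s `B.unsafeIndex` 3) `unsafeShiftL` 24) .|.
  (fromIntegral (s `B.unsafeIndex` 2) `unsafeShiftL` 16) .|.
  (fromIntegral (s `B.unsafeIndex` 1) `unsafeShiftL` 8) .|.
  (fromIntegral (s `B.unsafeIndex` 0))

-- Hypothetical fast path: one unaligned 32-bit load where the primop exists.
-- Same precondition as word32leBytes.
word32leFast :: B.ByteString -> Word32
#if MIN_VERSION_GLASGOW_HASKELL(9,10,0,0)
word32leFast s = unsafeDupablePerformIO $
  B.unsafeUseAsCString s $ \(Ptr addr) ->
    IO $ \st0 -> case readWord8OffAddrAsWord32# addr 0# st0 of
      (# st1, w #) -> (# st1, fromLE (W32# w) #)
  where
    -- The load is native-endian; swap on big-endian hosts so the result
    -- is always the little-endian interpretation of the bytes.
    fromLE w = case targetByteOrder of
      LittleEndian -> w
      BigEndian    -> byteSwap32 w
#else
-- Older compilers: keep the existing behaviour.
word32leFast = word32leBytes
#endif

main :: IO ()
main = do
  let bs = B.pack [0x78, 0x56, 0x34, 0x12, 0xff]
  -- Both definitions should agree on the same input.
  print (word32leFast bs == word32leBytes bs)
  -- Also check a slice that starts at an odd offset in the buffer.
  print (word32leFast (B.drop 1 bs) == word32leBytes (B.drop 1 bs))
```

Whether binary would go through `unsafeUseAsCString` like this or work on the underlying pointer directly is a design question for the patch itself; the sketch only shows that the fast path can sit behind the same type signature as the current primitive.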