
Use unaligned Addr# primops for performant loads/stores? #215

@TerrorJack


Currently, in binary, every primitive that loads a single word does so by loading individual bytes and combining them with bitwise operations. This works and is portable, but it compiles to inefficient machine code. For instance, for the primitive below:

word32le :: B.ByteString -> Word32
word32le = \s ->
              (fromIntegral (s `B.unsafeIndex` 3) `unsafeShiftL` 24) .|.
              (fromIntegral (s `B.unsafeIndex` 2) `unsafeShiftL` 16) .|.
              (fromIntegral (s `B.unsafeIndex` 1) `unsafeShiftL`  8) .|.
              (fromIntegral (s `B.unsafeIndex` 0) )

ghc-9.12.2 with -O2 would emit this on x64:

Example_word32le_info:

        leaq -8(%rbp),%rax
        cmpq %r15,%rax
        jb .Lc1pJ

        movq $.Lc1or_info,-8(%rbp)
        movq %r14,%rbx
        addq $-8,%rbp

        testb $7,%bl
        jne .Lc1or

        movq (%rbx),%rax
        jmp *%rax
.Lc1or_info:
.Lc1or:

        addq $16,%r12
        cmpq 856(%r13),%r12
        ja .Lc1pN

        movq 7(%rbx),%rax
        movq 15(%rbx),%rax
        movb 3(%rax),%bl
        movb 2(%rax),%cl
        movb 1(%rax),%dl
        movb (%rax),%al
        movq $ghczminternal_GHCziInternalziWord_W32zh_con_info,-8(%r12)
        movl $4294967295,%esi
        movzbl %al,%eax
        andq %rsi,%rax
        movl $4294967295,%esi
        movl $4294967295,%edi
        movl $4294967295,%r8d
        movzbl %dl,%edx
        andq %r8,%rdx
        shlq $8,%rdx
        andq %rdi,%rdx
        movl $4294967295,%edi
        movl $4294967295,%r8d
        movl $4294967295,%r9d
        movzbl %cl,%ecx
        andq %r9,%rcx
        shlq $16,%rcx
        andq %r8,%rcx
        movl $4294967295,%r8d
        movl $4294967295,%r9d
        movzbl %bl,%ebx
        andq %r9,%rbx
        shlq $24,%rbx
        andq %r8,%rbx
        orq %rcx,%rbx
        andq %rdi,%rbx
        orq %rdx,%rbx
        andq %rsi,%rbx
        orq %rax,%rbx
        movl %ebx,(%r12)
        leaq -7(%r12),%rbx
        addq $8,%rbp

        jmp *(%rbp)
.Lc1pN:

        movq $16,904(%r13)
        jmp stg_gc_unpt_r1
.Lc1pJ:

        leaq Example_word32le_closure(%rip),%rbx
        jmp *-8(%r13)
Example_word32le_closure:
        .quad   Example_word32le_info 

So currently GHC is unable to fuse these byte loads and bitwise operations into a single load. There's a lot of potential for improving performance here, especially since binary is used in many places in ghc, so this would also affect ghc compile-time performance.

Starting with ghc-9.10, unaligned Addr# primops are available, so binary could use them where available to load and store integer types wider than 8 bits in a single operation, without breaking compatibility with existing serialized data or changing any user-facing interface. I can volunteer to write a patch to implement this, but first I'd like to hear some feedback: would such a patch be accepted? Would it warrant a CLC proposal? etc.
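
To make the idea concrete, here is a rough sketch of what a single-load word32le could look like. It is only an illustration, not the proposed patch: the name word32leUnaligned is made up for this example, it assumes ghc-9.10 or newer (a real patch would presumably need a CPP guard and the existing byte-wise fallback for older compilers), and like the original it assumes the caller has already checked that at least 4 bytes are available.

```haskell
{-# LANGUAGE MagicHash, UnboxedTuples #-}

import qualified Data.ByteString as B
import qualified Data.ByteString.Unsafe as B
import Data.Word (Word32, byteSwap32)
import GHC.ByteOrder (ByteOrder (..), targetByteOrder)
import GHC.Exts (Ptr (Ptr), readWord8OffAddrAsWord32#)
import GHC.IO (IO (IO))
import GHC.Word (Word32 (W32#))
import System.IO.Unsafe (unsafeDupablePerformIO)

-- Hypothetical replacement for word32le (name invented for this sketch).
-- Reads 4 bytes starting at the beginning of the ByteString with one
-- unaligned load, then fixes up the byte order on big-endian hosts.
-- Precondition (unchecked, same as B.unsafeIndex): B.length s >= 4.
word32leUnaligned :: B.ByteString -> Word32
word32leUnaligned s =
  unsafeDupablePerformIO $
    -- unsafeUseAsCString exposes the underlying buffer without copying.
    B.unsafeUseAsCString s $ \(Ptr addr) ->
      IO $ \s0 ->
        -- The offset is in bytes and the address need not be 4-byte aligned.
        case readWord8OffAddrAsWord32# addr 0# s0 of
          (# s1, w #) -> (# s1, fromHost (W32# w) #)
  where
    -- The primop loads in host byte order; the wire format is little-endian.
    fromHost :: Word32 -> Word32
    fromHost = case targetByteOrder of
      LittleEndian -> id
      BigEndian -> byteSwap32
```

For example, word32leUnaligned (B.pack [0x78, 0x56, 0x34, 0x12]) should give 0x12345678, the same as the byte-wise version, and it should keep working on a slice such as B.drop 1 bs where the start address is odd. I have not checked the generated assembly for this sketch, so whether it really collapses to a single movl is something to confirm with benchmarks.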
