Use unaligned Addr# primops for performant loads/stores?

Currently, in `binary`, every primitive that loads a single word does so by loading individual bytes and combining them with bitwise operations. This works and is portable, but it compiles to inefficient machine code. For instance, take the primitive below:

```haskell
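import Data.Bits (unsafeShiftL, (.|.))
import qualified Data.ByteString as B
import qualified Data.ByteString.Unsafe as B
import Data.Word (Word32)

-- Assemble a little-endian Word32 from four separate byte loads.
word32le :: B.ByteString -> Word32
word32le = \s ->
              (fromIntegral (s `B.unsafeIndex` 3) `unsafeShiftL` 24) .|.
              (fromIntegral (s `B.unsafeIndex` 2) `unsafeShiftL` 16) .|.
              (fromIntegral (s `B.unsafeIndex` 1) `unsafeShiftL`  8) .|.
              (fromIntegral (s `B.unsafeIndex` 0) )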
```

ghc-9.12.2 with `-O2` emits this on x86-64:

```asm
Example_word32le_info:

        leaq -8(%rbp),%rax
        cmpq %r15,%rax
        jb .Lc1pJ

        movq $.Lc1or_info,-8(%rbp)
        movq %r14,%rbx
        addq $-8,%rbp

        testb $7,%bl
        jne .Lc1or

        movq (%rbx),%rax
        jmp *%rax
.Lc1or_info:
.Lc1or:

        addq $16,%r12
        cmpq 856(%r13),%r12
        ja .Lc1pN

        movq 7(%rbx),%rax
        movq 15(%rbx),%rax
        movb 3(%rax),%bl
        movb 2(%rax),%cl
        movb 1(%rax),%dl
        movb (%rax),%al
        movq $ghczminternal_GHCziInternalziWord_W32zh_con_info,-8(%r12)
        movl $4294967295,%esi
        movzbl %al,%eax
        andq %rsi,%rax
        movl $4294967295,%esi
        movl $4294967295,%edi
        movl $4294967295,%r8d
        movzbl %dl,%edx
        andq %r8,%rdx
        shlq $8,%rdx
        andq %rdi,%rdx
        movl $4294967295,%edi
        movl $4294967295,%r8d
        movl $4294967295,%r9d
        movzbl %cl,%ecx
        andq %r9,%rcx
        shlq $16,%rcx
        andq %r8,%rcx
        movl $4294967295,%r8d
        movl $4294967295,%r9d
        movzbl %bl,%ebx
        andq %r9,%rbx
        shlq $24,%rbx
        andq %r8,%rbx
        orq %rcx,%rbx
        andq %rdi,%rbx
        orq %rdx,%rbx
        andq %rsi,%rbx
        orq %rax,%rbx
        movl %ebx,(%r12)
        leaq -7(%r12),%rbx
        addq $8,%rbp

        jmp *(%rbp)
.Lc1pN:

        movq $16,904(%r13)
        jmp stg_gc_unpt_r1
.Lc1pJ:

        leaq Example_word32le_closure(%rip),%rbx
        jmp *-8(%r13)
Example_word32le_closure:
        .quad   Example_word32le_info 
```


So GHC is currently unable to fuse these loads and bitwise operations into a single load. There's a lot of potential for improving performance here, especially since `binary` is used in many places in ghc itself, so this would also affect ghc's compile-time performance.

Starting from ghc-9.10, unaligned Addr# primops are available, so `binary` could use them when available to load/store integer types larger than 8 bits in a single operation. This would not break compatibility with existing serialized data or change any user-facing interface. I can volunteer to write a patch to implement it, but first I'd like to hear some feedback: would such a patch be accepted? Would it warrant a CLC proposal?
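To make the idea concrete, here is a rough sketch of what a single-load variant could look like. It assumes ghc-9.10 or later, where `readWord8OffAddrAsWord32#` is exported from `GHC.Exts`. The names `peekWord32Unaligned` and `word32leFast` are made up for illustration and are not part of `binary`. Like `word32le` above, it does no bounds checking, so the caller must guarantee at least 4 bytes.

```haskell
{-# LANGUAGE MagicHash #-}
{-# LANGUAGE UnboxedTuples #-}

import qualified Data.ByteString as B
import qualified Data.ByteString.Unsafe as B
import Data.Word (byteSwap32)
import GHC.ByteOrder (ByteOrder (..), targetByteOrder)
import GHC.Exts (Ptr (Ptr), readWord8OffAddrAsWord32#)
import GHC.IO (IO (IO))
import GHC.Word (Word32 (W32#))
import System.IO.Unsafe (unsafeDupablePerformIO)

-- Hypothetical helper: read 4 bytes at the pointer as one Word32 in the
-- host byte order, with no alignment requirement on the address.
peekWord32Unaligned :: Ptr a -> IO Word32
peekWord32Unaligned (Ptr addr) = IO $ \s ->
  case readWord8OffAddrAsWord32# addr 0# s of
    (# s', w #) -> (# s', W32# w #)

-- Hypothetical replacement for word32le: one load, plus a byte swap
-- only on big-endian hosts. The caller must ensure length s >= 4.
word32leFast :: B.ByteString -> Word32
word32leFast s =
  unsafeDupablePerformIO $
    B.unsafeUseAsCString s $ \p -> do
      w <- peekWord32Unaligned p
      pure $ case targetByteOrder of
        LittleEndian -> w
        BigEndian -> byteSwap32 w

main :: IO ()
main = do
  -- Drop one byte so the payload may start at an odd address.
  let bytes = B.drop 1 (B.pack [0xAA, 0x78, 0x56, 0x34, 0x12])
  print (word32leFast bytes == 0x12345678)
```

The store direction would presumably mirror this with `writeWord8OffAddrAsWord32#`, and older compilers could keep the existing byte-wise code behind a CPP version check.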

Use unaligned Addr# primops for performant loads/stores? #215
