
memset is terrible and not performant likely (for nonzeroing path) #265

Open
cartazio opened this issue Mar 9, 2020 · 20 comments
Labels: bug, enhancement, task (Something that needs to be done, not a bug fix)

Comments

@cartazio
Contributor

cartazio commented Mar 9, 2020

It falls back to a scalar C loop when not zeroing, which seems disappointing. Also not sure if it correctly handles -0 in the floating-point code.
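For what it's worth on the -0 point (this check is mine, not from the issue): negative zero compares equal to zero, but its IEEE-754 bit pattern is not all zero bytes, so a fill that only takes a zeroing fast path, or splats a single byte, cannot produce it.

import GHC.Float (castDoubleToWord64)

main :: IO ()
main = do
  print (0.0 == (-0.0 :: Double))    -- True: numerically equal
  print (castDoubleToWord64 0.0)     -- 0: all bytes zero
  print (castDoubleToWord64 (-0.0))  -- 9223372036854775808: only the sign bit set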

@cartazio cartazio added bug enhancement task Something that needs to be done (not a bug fix) labels Mar 9, 2020
@cartazio cartazio added this to the 0.7.1 milestone Mar 9, 2020
@cartazio
Contributor Author

cartazio commented Mar 9, 2020

This is going to be some fun hackery to do something sane, portable, and performant :)

@cartazio
Contributor Author

cartazio commented Mar 9, 2020

@bgamari under what circumstances does the GHC code gen do something smart with memset/memmove? As best I can recall, the only instances are internally generated ones and via the LLVM codegen?

@andrewthad
Collaborator

GHC has a primop for memset (setByteArray#) that is not currently used by primitive. There is a PR that switches to using this. In GHC 8.10, GHC is going to become smarter about unrolling this when the size is known and small.
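For reference, a small usage sketch (mine, not taken from that PR) of the primitive-level call that such a change routes through setByteArray#; with a constant length, GHC 8.10+ can unroll the fill instead of emitting a memset call.

import Data.Primitive.ByteArray
  (indexByteArray, newByteArray, setByteArray, unsafeFreezeByteArray)
import Data.Word (Word8)

main :: IO ()
main = do
  marr <- newByteArray 32
  setByteArray marr 0 32 (0xAB :: Word8)  -- fill 32 bytes with 0xAB
  arr  <- unsafeFreezeByteArray marr
  print (indexByteArray arr 0 :: Word8)   -- 171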

@andrewthad
Collaborator

Original issue on GHC is here and Artem's work that resolved the issue is here.

@cartazio
Contributor Author

cartazio commented Mar 9, 2020

I’ll look at how GHC does that primop and figure out what the different choices are. Thanks for helping out :)

@cartazio
Contributor Author

cartazio commented Mar 9, 2020

ok, so
https://gitlab.haskell.org/ghc/ghc/blob/4898df1cc25132dc9e2599d4fa4e1bbc9423cda5/compiler/GHC/StgToCmm/Prim.hs#L2480-2496

-- ----------------------------------------------------------------------------
-- Setting byte arrays

-- | Takes a 'MutableByteArray#', an offset into the array, a length,
-- and a byte, and sets each of the selected bytes in the array to the
-- character.
doSetByteArrayOp :: CmmExpr -> CmmExpr -> CmmExpr -> CmmExpr
                 -> FCode ()
doSetByteArrayOp ba off len c = do
    dflags <- getDynFlags

    let byteArrayAlignment = wordAlignment dflags -- known since BA is allocated on heap
        offsetAlignment = cmmExprAlignment off
        align = min byteArrayAlignment offsetAlignment

    p <- assignTempE $ cmmOffsetExpr dflags (cmmOffsetB dflags ba (arrWordsHdrSize dflags)) off
    emitMemsetCall p c len align

and emitMemsetCall is defined as

-- | Emit a call to @memset@.  The second argument must fit inside an
-- unsigned char.
emitMemsetCall :: CmmExpr -> CmmExpr -> CmmExpr -> Alignment -> FCode ()
emitMemsetCall dst c n align = do
    emitPrimCall
        [ {- no results -} ]
        (MO_Memset (alignmentBytes align))
        [ dst, c, n ]

then for known (constant-size) arguments we get

https://gitlab.haskell.org/ghc/ghc/blob/4898df1cc25132dc9e2599d4fa4e1bbc9423cda5/compiler/nativeGen/X86/CodeGen.hs#L2222-2293

genCCall' dflags _ (PrimTarget (MO_Memset align)) _
         [dst,
          CmmLit (CmmInt c _),
          CmmLit (CmmInt n _)]
         _
    | fromInteger insns <= maxInlineMemsetInsns dflags = do
        code_dst <- getAnyReg dst
        dst_r <- getNewRegNat format
        if format == II64 && n >= 8 then do
          code_imm8byte <- getAnyReg (CmmLit (CmmInt c8 W64))
          imm8byte_r <- getNewRegNat II64
          return $ code_dst dst_r `appOL`
                   code_imm8byte imm8byte_r `appOL`
                   go8 dst_r imm8byte_r (fromInteger n)
        else
          return $ code_dst dst_r `appOL`
                   go4 dst_r (fromInteger n)
  where
    maxAlignment = wordAlignment dflags -- only machine word wide MOVs are supported
    effectiveAlignment = min (alignmentOf align) maxAlignment
    format = intFormat . widthFromBytes $ alignmentBytes effectiveAlignment
    c2 = c `shiftL` 8 .|. c
    c4 = c2 `shiftL` 16 .|. c2
    c8 = c4 `shiftL` 32 .|. c4

    -- The number of instructions we will generate (approx). We need 1
    -- instructions per move.
    insns = (n + sizeBytes - 1) `div` sizeBytes

    -- The size of each move, in bytes.
    sizeBytes :: Integer
    sizeBytes = fromIntegral (formatInBytes format)

    -- Depending on size returns the widest MOV instruction and its
    -- width.
    gen4 :: AddrMode -> Integer -> (InstrBlock, Integer)
    gen4 addr size
        | size >= 4 =
            (unitOL (MOV II32 (OpImm (ImmInteger c4)) (OpAddr addr)), 4)
        | size >= 2 =
            (unitOL (MOV II16 (OpImm (ImmInteger c2)) (OpAddr addr)), 2)
        | size >= 1 =
            (unitOL (MOV II8 (OpImm (ImmInteger c)) (OpAddr addr)), 1)
        | otherwise = (nilOL, 0)

    -- Generates a 64-bit wide MOV instruction from REG to MEM.
    gen8 :: AddrMode -> Reg -> InstrBlock
    gen8 addr reg8byte =
      unitOL (MOV format (OpReg reg8byte) (OpAddr addr))

    -- Unrolls memset when the widest MOV is <= 4 bytes.
    go4 :: Reg -> Integer -> InstrBlock
    go4 dst left =
      if left <= 0 then nilOL
      else curMov `appOL` go4 dst (left - curWidth)
      where
        possibleWidth = minimum [left, sizeBytes]
        dst_addr = AddrBaseIndex (EABaseReg dst) EAIndexNone (ImmInteger (n - left))
        (curMov, curWidth) = gen4 dst_addr possibleWidth

    -- Unrolls memset when the widest MOV is 8 bytes (thus another Reg
    -- argument). Falls back to go4 when all 8 byte moves are
    -- exhausted.
    go8 :: Reg -> Reg -> Integer -> InstrBlock
    go8 dst reg8byte left =
      if possibleWidth >= 8 then
        let curMov = gen8 dst_addr reg8byte
        in  curMov `appOL` go8 dst reg8byte (left - 8)
      else go4 dst left
      where
        possibleWidth = minimum [left, sizeBytes]
        dst_addr = AddrBaseIndex (EABaseReg dst) EAIndexNone (ImmInteger (n - left))

Otherwise it falls back to the platform's C memset call.
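As a small worked illustration (mine) of the c2/c4/c8 splatting above: the fill byte is replicated across a machine word by repeated shift-and-or, so a single wide MOV writes it into every byte position.

import Data.Bits (shiftL, (.|.))
import Data.Word (Word64)
import Numeric (showHex)

main :: IO ()
main = do
  let c  = 0xAB :: Word64
      c2 = c  `shiftL` 8  .|. c    -- 0xABAB
      c4 = c2 `shiftL` 16 .|. c2   -- 0xABABABAB
      c8 = c4 `shiftL` 32 .|. c4   -- 0xABABABABABABABAB
  mapM_ (putStrLn . flip showHex "") [c2, c4, c8]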

@cartazio
Contributor Author

cartazio commented Mar 9, 2020

Now to understand which implementations run on which platforms!

https://github.com/freebsd/freebsd/blob/master/lib/libc/amd64/string/memset.S
is the AMD64 implementation for FreeBSD (alongside the other related operations).

glibc, at least on AMD64, has a whole mess of fallback cases for different microarchitectures,
as can be seen at https://github.com/bminor/glibc/tree/master/sysdeps/x86_64/multiarch or here:
https://sourceware.org/git/gitweb.cgi?p=glibc.git;a=tree;f=sysdeps/x86_64/multiarch;h=942a61db8fc4615bf1998c484d883720b82c1e70;hb=HEAD

@cartazio
Contributor Author

cartazio commented Mar 9, 2020

The default cutoff for GHC-level unrolling in the native code gen is
https://gitlab.haskell.org/ghc/ghc/blob/4898df1cc25132dc9e2599d4fa4e1bbc9423cda5/compiler/main/DynFlags.hs#L2121-2124

        maxInlineAllocSize = 128,
        maxInlineMemcpyInsns = 32,
        maxInlineMemsetInsns = 32,

So roughly, the GHC native code gen can/will unroll a call to memset when it has constant arguments such that
(# of bytes written + native word size in bytes - 1) `div` (native word size in bytes) <= 32

On 64-bit amd64 that is (n + 8 - 1) `div` 8 <= 32, which (with the integer division) works out to n <= 256, so fills of up to 256 bytes will get unrolled with this current scheme (using at most 32 GPR-flavored asm moves); anything larger falls back to the libc call.
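Writing that condition out (helper names are mine; GHC just inlines the arithmetic in genCCall' above):

memsetInsns :: Integer -> Integer -> Integer
memsetInsns n wordBytes = (n + wordBytes - 1) `div` wordBytes  -- ceiling division

willUnrollAmd64 :: Integer -> Bool
willUnrollAmd64 n = memsetInsns n 8 <= 32  -- 8-byte words, maxInlineMemsetInsns = 32

-- willUnrollAmd64 256 == True, willUnrollAmd64 257 == False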

How was this threshold chosen? Fun aside: we could probably improve on it now that we assume >= SSE2 on all AMD64/x86 platforms.

Also there's a question about whether the OS/libc flavor is acceptable on each platform.

@cartazio
Contributor Author

cartazio commented Mar 9, 2020

Interestingly, according to Godbolt, clang optimizes to "call libc" for large constant args:

https://godbolt.org/z/JBhCJz

@cartazio
Contributor Author

https://godbolt.org/z/dxwawW shows some smaller thresholds and how clang handles them.

@cartazio
Contributor Author

https://godbolt.org/z/6Fs3pA shows some interesting variation in code selection depending on which CPU architecture is chosen.

@cartazio
Contributor Author

OK, so I think for small enough arguments (up to 256 bytes on amd64, and up to 128 bytes on x86), GHC does a reasonable thing for this stuff. THOUGH, I think we'd get a genuine win from using SSE registers instead of GPRs, roughly halving the code size. I think we have roughly the right unrolling threshold, at least if the clang and gcc tunings are actually decent.

cc @andrewthad @bgamari @m37 (andreask?)

It does look like on platforms that aren't Linux, we might not be able to assume a GOOD optimized fallback C symbol? (Though some benchmarking will be needed, I guess.)

@cartazio
Contributor Author

My current thinking is: write the primitive code such that, for known args, there's a branch that goes through GHC's unrolling case, and we have a fallback for large sizes that's fancy, fast, and simple.
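A rough sketch of that branch (the names, the cutoff, and the "large" path are all hypothetical, not anything primitive has committed to): small fills go through setByteArray#, which the native code gen can unroll when the length is a known constant, while large fills would dispatch to whatever fast routine benchmarking ends up favouring.

{-# LANGUAGE MagicHash, UnboxedTuples #-}
import GHC.Exts (Int#, MutableByteArray#, RealWorld, isTrue#, setByteArray#, (<=#))
import GHC.IO (IO (IO))

fillBytes :: MutableByteArray# RealWorld -> Int# -> Int# -> Int# -> IO ()
fillBytes mba off len c
  | isTrue# (len <=# 256#) = viaPrimop   -- small / known sizes: unrollable primop path
  | otherwise              = largeFill   -- large sizes: optimized fallback goes here
  where
    viaPrimop = IO (\s -> (# setByteArray# mba off len c s, () #))
    -- Placeholder: a real version might FFI to a SIMD-aware memset on pinned
    -- memory or use wider stores; for now it is just the same primop call.
    largeFill = viaPrimop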

@cartazio
Contributor Author

cartazio commented Mar 10, 2020

As best I can tell, the only OSS libc with any SSE/AVX/SIMD implementation is Linux/glibc; everything else just does hand-written assembly, (probably?) to prevent the C compiler from being too clever and replacing the mem* implementation with a call to itself!

edit: some implementations do use a tiny C routine

@cartazio
Contributor Author

zooming out: the real answer is benchmarks! :)

@andrewthad
Collaborator

I think the only reasonable thing to do in primitive is to use the primop that GHC provides. Optimizing the implementation further is a GHC issue, not a primitive issue. Having GHC use wider registers when possible might be a win, but then you won't be doing aligned writes (which I think doesn't really matter on more recent Intel processors), and you have to load the 128- or 256-bit word from memory rather than including it in the instruction, so for writing 16 or 24 bytes, this may not actually be better.

@cartazio
Contributor Author

cartazio commented Mar 10, 2020 via email

@cartazio
Contributor Author

Andrew: to clarify, I'm not concerned with the GHC unrolling case; I'm concerned with the performance characteristics for input sizes that fall back to libc / system-provided memset and friends. (The conclusion may very well be that the simple libc code is fine on all CPUs we care about, or not.) There's also a subtle mismatch between how we use the mem*-related operations in GHC/Haskell today versus how they're specified.

Also, there are some nontrivial ways in which a more Haskell-specific flavor would actually make a 2x difference in storable vector in some cases (at least asymptotically; not sure if the constant factors work out in practice).

@cartazio
Contributor Author

As a bit of context for why this actually should be done as performantly as possible for unknown-size / large buffers, here is how it comes into storable vector; see
https://github.com/haskell/vector/blob/eeb42ad42aa345ce192086baed80c805bcfc3e72/Data/Vector/Storable/Mutable.hs#L202-L256

@andrewthad andrewthad removed this from the 0.7.1 milestone Jun 12, 2020
@dcoutts

dcoutts commented Mar 26, 2024

> I think the only reasonable thing to do in primitive is to use the primop that GHC provides.

Yes! And currently it doesn't. It uses memset directly, so doesn't benefit from any unrolling or from -fcheck-prim-bounds debug support.

I take it back, shimmedSetWord8Array# does use GHC.Exts.setByteArray#.
