memset is terrible and not performant likely (for nonzeroing path) #265
this is gonna be some fun hackery to do a sane portable performant thingy :) |
@bgamari under what circumstances does the GHC code gen do something smart with memset/memmove? As best I can recall, the only instances are internally generated and via the llvm codegen? |
GHC has a primop for setByteArray#. |
I’ll look at how GHC does that primop and figure out what the different choices are. Thanks for helping out :) |
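(For context, a minimal sketch — mine, not from the thread — of how primitive's public API reaches that primop; setByteArray works in elements, and fillByteArray is the byte-oriented variant. The demo name is made up.)

import Control.Monad.ST (runST)
import Data.Primitive.ByteArray
import Data.Word (Word8)

-- Fill a fresh 32-byte array with 0xAB; setByteArray compiles down to
-- the setByteArray# primop, which is where the codegen below kicks in.
demo :: ByteArray
demo = runST $ do
  arr <- newByteArray 32                 -- size in bytes
  setByteArray arr 0 32 (0xAB :: Word8)  -- offset and length in elements
  unsafeFreezeByteArray arr

GHC's lowering of the primop, quoted below, starts in the Cmm code generator: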
-- ----------------------------------------------------------------------------
-- Setting byte arrays
-- | Takes a 'MutableByteArray#', an offset into the array, a length,
-- and a byte, and sets each of the selected bytes in the array to the
-- character.
doSetByteArrayOp :: CmmExpr -> CmmExpr -> CmmExpr -> CmmExpr
-> FCode ()
doSetByteArrayOp ba off len c = do
dflags <- getDynFlags
let byteArrayAlignment = wordAlignment dflags -- known since BA is allocated on heap
offsetAlignment = cmmExprAlignment off
align = min byteArrayAlignment offsetAlignment
p <- assignTempE $ cmmOffsetExpr dflags (cmmOffsetB dflags ba (arrWordsHdrSize dflags)) off
emitMemsetCall p c len align
and

-- | Emit a call to @memset@. The second argument must fit inside an
-- unsigned char.
emitMemsetCall :: CmmExpr -> CmmExpr -> CmmExpr -> Alignment -> FCode ()
emitMemsetCall dst c n align = do
emitPrimCall
[ {- no results -} ]
(MO_Memset (alignmentBytes align))
[ dst, c, n ]

then for known stuff we get

genCCall' dflags _ (PrimTarget (MO_Memset align)) _
[dst,
CmmLit (CmmInt c _),
CmmLit (CmmInt n _)]
_
| fromInteger insns <= maxInlineMemsetInsns dflags = do
code_dst <- getAnyReg dst
dst_r <- getNewRegNat format
if format == II64 && n >= 8 then do
code_imm8byte <- getAnyReg (CmmLit (CmmInt c8 W64))
imm8byte_r <- getNewRegNat II64
return $ code_dst dst_r `appOL`
code_imm8byte imm8byte_r `appOL`
go8 dst_r imm8byte_r (fromInteger n)
else
return $ code_dst dst_r `appOL`
go4 dst_r (fromInteger n)
where
maxAlignment = wordAlignment dflags -- only machine word wide MOVs are supported
effectiveAlignment = min (alignmentOf align) maxAlignment
format = intFormat . widthFromBytes $ alignmentBytes effectiveAlignment
c2 = c `shiftL` 8 .|. c
c4 = c2 `shiftL` 16 .|. c2
c8 = c4 `shiftL` 32 .|. c4
-- The number of instructions we will generate (approx). We need 1
-- instruction per move.
insns = (n + sizeBytes - 1) `div` sizeBytes
-- The size of each move, in bytes.
sizeBytes :: Integer
sizeBytes = fromIntegral (formatInBytes format)
-- Depending on size returns the widest MOV instruction and its
-- width.
gen4 :: AddrMode -> Integer -> (InstrBlock, Integer)
gen4 addr size
| size >= 4 =
(unitOL (MOV II32 (OpImm (ImmInteger c4)) (OpAddr addr)), 4)
| size >= 2 =
(unitOL (MOV II16 (OpImm (ImmInteger c2)) (OpAddr addr)), 2)
| size >= 1 =
(unitOL (MOV II8 (OpImm (ImmInteger c)) (OpAddr addr)), 1)
| otherwise = (nilOL, 0)
-- Generates a 64-bit wide MOV instruction from REG to MEM.
gen8 :: AddrMode -> Reg -> InstrBlock
gen8 addr reg8byte =
unitOL (MOV format (OpReg reg8byte) (OpAddr addr))
-- Unrolls memset when the widest MOV is <= 4 bytes.
go4 :: Reg -> Integer -> InstrBlock
go4 dst left =
if left <= 0 then nilOL
else curMov `appOL` go4 dst (left - curWidth)
where
possibleWidth = minimum [left, sizeBytes]
dst_addr = AddrBaseIndex (EABaseReg dst) EAIndexNone (ImmInteger (n - left))
(curMov, curWidth) = gen4 dst_addr possibleWidth
-- Unrolls memset when the widest MOV is 8 bytes (thus another Reg
-- argument). Falls back to go4 when all 8 byte moves are
-- exhausted.
go8 :: Reg -> Reg -> Integer -> InstrBlock
go8 dst reg8byte left =
if possibleWidth >= 8 then
let curMov = gen8 dst_addr reg8byte
in curMov `appOL` go8 dst reg8byte (left - 8)
else go4 dst left
where
possibleWidth = minimum [left, sizeBytes]
dst_addr = AddrBaseIndex (EABaseReg dst) EAIndexNone (ImmInteger (n - left))
otherwise it falls back to the C function call on the given platform. |
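(As an illustration — my example, not from the thread — the two paths correspond to whether the arguments reach the backend as Cmm literals; whether constants actually survive to that CmmInt pattern match depends on optimization:)

{-# LANGUAGE MagicHash #-}
import GHC.Exts

-- Constant length and byte: eligible for the unrolled-MOV path above.
smallFill :: MutableByteArray# s -> State# s -> State# s
smallFill mba s = setByteArray# mba 0# 64# 0xAB# s

-- Length only known at runtime: compiles to the MO_Memset fallback,
-- i.e. a call to the platform's memset.
bigFill :: MutableByteArray# s -> Int# -> State# s -> State# s
bigFill mba n s = setByteArray# mba 0# n 0xAB# s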
now to understand what code impls on what platforms! https://github.com/freebsd/freebsd/blob/master/lib/libc/amd64/string/memset.S glibc, at least on AMD64, has a whole mess of fallback cases for different microarchitectures |
the default cutoffs for ghc-level unrolling in the native code gen are

maxInlineAllocSize = 128,
maxInlineMemcpyInsns = 32,
maxInlineMemsetInsns = 32,

so roughly, ghc native code gen can / will unroll any call to memset when it has constant arguments; on 64-bit amd64 the guard simplifies to (n + 8 - 1) `div` 8 <= 32, i.e. n <= 256, so memsets of up to 256 bytes will get unrolled under the current scheme (using at most 32 GPR-flavored asm moves), and otherwise it falls back to libc. how was this threshold chosen? Fun aside: probably could improve this now that we assume >= SSE2 on all AMD64/x86 platforms. Also there's a question about whether the OS / libc flavor is acceptable on each platform. |
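(A small sketch of that threshold arithmetic — my notation, not GHC source:)

-- A constant memset is unrolled when ceil(n / moveBytes) <= maxInsns,
-- i.e. (n + moveBytes - 1) `div` moveBytes <= maxInsns.
willUnroll :: Integer -> Integer -> Integer -> Bool
willUnroll maxInsns moveBytes n =
  (n + moveBytes - 1) `div` moveBytes <= maxInsns

-- The largest n that still unrolls is maxInsns * moveBytes:
--   willUnroll 32 8 256 == True,  willUnroll 32 8 257 == False  (amd64)
--   willUnroll 32 4 128 == True,  willUnroll 32 4 129 == False  (x86)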
interestingly, according to godbolt, clang optimizes to "call libc" for large constant args |
https://godbolt.org/z/dxwawW here are some smaller thresholds and how clang does it |
https://godbolt.org/z/6Fs3pA there's some interesting variation in code selection depending on what cpu architecture is chosen |
ok, so i think for small enough arguments (<= 256 bytes on amd64, <= 128 bytes on x86), ghc does a reasonable thing for this stuff. THOUGH, i think we'd get a genuine win from using sse registers instead of gprs wrt halving the code size. I think we have roughly the right unrolling threshold, at least if clang and gcc tuning are actually decent. cc @andrewthad @bgamari @m37 (andreask?) it does look like on platforms that aren't linux, we might not be able to assume a GOOD optimized fallback c symbol? (though some benchmarking will be needed i guess) |
my current thinking is: |
as best i can tell, the only OSS libc implementation which has an SSE/AVX/simd anything implementation is linux/glibc; everything else just does hand-written assembly to (probably?) prevent the c compiler from being tooooo clever and replacing the mem* implementation with a call to itself! edit: some implementations do a tiny c code thing |
zooming out: the real answer is benchmarks! :) |
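(In that spirit, a sketch of the kind of measurement meant here, using the criterion benchmarking library; the harness and sizes are my assumptions, chosen to straddle the unroll cutoff:)

import Control.Monad.ST (runST)
import Criterion.Main
import Data.Primitive.ByteArray
import Data.Word (Word8)

-- Allocate and fill n bytes; small n should hit the unrolled path,
-- large n the libc memset fallback under discussion.
fill :: Int -> ()
fill n = runST $ do
  arr <- newByteArray n
  setByteArray arr 0 n (0xAB :: Word8)

main :: IO ()
main = defaultMain
  [ bench (show n ++ " bytes") (whnf fill n)
  | n <- [64, 256, 4096, 1048576]
  ]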
I think the only reasonable thing to do in primitive is to use the primop that GHC provides. Optimizing the implementation further is a GHC issue, not a primitive issue. Having GHC use wider registers when possible might be a win, but then you won't be doing aligned writes (which I think doesn't really matter on more recent Intel processors), and you have to load the 128- or 256-bit word from memory rather than including it in the instruction, so for writing 16 or 24 bytes, this may not actually be better. |
The portable simd c is easy to write.
The concern, which I need to do some measurements to determine, is how big
a difference in performance for large non-const-arg calls there is between
the Agner Fog or glibc impls vs say the FreeBSD ones. Especially on really
large memory ranges. If there's even a 20 percent difference on large
inputs on multiple cpu microarchitectures, that's a pretty important thing.
GHC ABI and native support for simd are orthogonal to understanding the
quality of the underlying libc fallback case for non-const arguments. Any
improvements should of course be added to ghc too. But that's a very
different release cadence.
…On Tue, Mar 10, 2020 at 6:22 AM Andrew Martin ***@***.***> wrote:
I think the only reasonable thing to do in primitive is to use the primop
that GHC provides. Optimizing the implementation further is a GHC issue,
not a primitive issue. Having GHC use wider registers when possible might
be a win, but then you won't be doing aligned writes (which I think doesn't
really matter on more recent Intel processors), and you have to load the
128- or 256-bit word from memory rather than including it in the
instruction, so for writing 16 or 24 bytes, this may not actually be better.
|
andrew: to clarify, i'm not concerned with the ghc unrolling case, i'm concerned with the performance characteristics for input sizes that fall back to libc / system-provided memset and friends. (the conclusion may very well be that libc's simple code is fine on all cpus we care about, or not). There's also a subtle mismatch in how we use the … Also there's some nontrivial ways where a more haskell-specific flavor will actually make a 2x difference in storable vector in some cases (at least per asymptotics, not sure if the constant factors work out in practice) |
as a bit of context for why this actually should be done as performantly as possible for unknown-size / large buffers: it comes up in storable vector |
I take it back, |
it falls back to a scalar c loop when not zeroing, which seems disappointing. Also not sure if it correctly handles -0 in the floating point code.
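(On the -0 point: a quick check — mine, not from the thread — of the property a fill path must preserve. A shortcut that tests x == 0.0 and calls memset would turn -0.0 into +0.0, since the two compare equal numerically but differ in their bit patterns:)

import Control.Monad.ST (runST)
import Data.Primitive.ByteArray
import GHC.Float (castDoubleToWord64)

main :: IO ()
main = do
  let d = runST $ do
        arr <- newByteArray (4 * 8)            -- room for four Doubles
        setByteArray arr 0 4 (-0.0 :: Double)  -- fill all four elements
        readByteArray arr 0
  -- -0.0 is 0x8000000000000000; an all-zero memset would give 0 instead.
  print (castDoubleToWord64 d == 0x8000000000000000)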