Introduce basicUnsafeIndexM# #489
This is, however, a very breaking change. Due to the design of unboxed vectors (which I consider a mistake), this change will break all unboxed vectors. They require defining data instances and
@Shimuuar what exactly would break? I tried a couple of packages with manual instances for unboxed vectors, and they seem to compile fine.
Sorry, I missed the mutually recursive
      | otherwise = case basicUnsafeIndexM v i of
                      Box x -> return $ Yield x (i+1)
    step (I# i)
      | I# i >= n = return Done
Here and everywhere in this PR, when we pattern-match on I# but then construct it again in the function's body, is GHC 100% guaranteed to optimise away the allocation of a fresh Int and just reuse whatever we pattern-matched on?
GHC should be quite good at this, so it should eliminate the Int allocations.
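The pattern in question can be reproduced in isolation. Below is a minimal sketch, assuming a hypothetical `step` function shaped like the one in the diff: it unpacks its argument with `I#` and then rebuilds `I# i` in the guard and in the result. Whether the rebuilt box is really removed depends on the optimization level and GHC version, so the sketch is meant for compiling with `-O -ddump-simpl` and inspecting the Core, not as proof of a guarantee.

```haskell
{-# LANGUAGE MagicHash #-}
module Main where

import Data.List (unfoldr)
import GHC.Exts (Int (I#), (+#))

-- Hypothetical stepper (not taken from the PR): the index is unpacked with
-- I#, and 'I# i' is reconstructed wherever a boxed Int is required.
step :: Int -> Int -> Maybe (Int, Int)
step n (I# i)
  | I# i >= n = Nothing                        -- box rebuilt for the comparison
  | otherwise = Just (I# i * 2, I# (i +# 1#))  -- and again for the result

-- Drive the stepper from index 0 up to (but excluding) n.
doubled :: Int -> [Int]
doubled n = unfoldr (step n) 0

main :: IO ()
main = print (doubled 4)  -- prints [0,2,4,6]
```

In the guard, the comparison on `Int` is expected to inline to a primitive comparison on `Int#`, at which point the freshly built box has no remaining use. The Core dump shows whether that actually happened for a given compiler.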
I tried to measure the impact of this optimization. To do so I added a simple benchmark which computes the variance of a vector of doubles (branches

varianceNoInline :: (VG.Vector v Double) => v Double -> Double
{-# NOINLINE varianceNoInline #-}
varianceNoInline xs
  = VG.sum (VG.map (\x -> (x - s)^(2::Int)) xs) / n
  where
    n = fromIntegral $ VG.length xs
    s = VG.sum xs / n

The function specialized by GHC runs in constant space. Changes to indexing make no difference, and performance is stable across all GHC versions at 6 cycles/element. The non-specialized situation is much more interesting. Adding strictness (#485) makes no difference at all in this benchmark. It seems the programs are identical, or at least the number of instructions is exactly the same, so I'll compare baseline (

Allocations

Below are allocations per array element:
This PR is a clear improvement for GHC<=9.4, but somehow it becomes a pessimization for GHC>=9.6. I haven't looked into core, so I have no idea why.

Performance

Runtime performance follows the same pattern:
25% win for GHC<=9.4 and 100% loss for GHC>=9.6. And the latter performs worse even without this optimization. It looks like we have some regression in the GHC optimizer. But I haven't looked into core yet, and the answer lies there.
First of all, some estimations. We need to perform indexing twice, which at least means allocating 2

Cleaned-up core for the non-specialized function can be seen in the gist. Core is very similar between GHC versions and with/without this optimization. Notable differences:

GHC 9.4 → 9.8

GHC changed how the worker-wrapper transformation is done. In GHC 9.4 it passed

9.4

varianceNoInline :: forall (v :: * -> *). Vector v Double => v Double -> Double
varianceNoInline
= \ (@(v_ :: * -> *))
($dVector_s2fQ :: Vector v_ Double)
(xs_s2g2 :: v_ Double) ->
case $dVector_s2fQ of
{ C:Vector ww_s2fS ww1_s2fT ww2_s2fU ww3_s2fV ww4_s2fW ww5_s2fX
ww6_s2fY ww7_s2fZ ww8_s2g0 ->
case $wvarianceNoInline @v_ ww3_s2fV ww6_s2fY xs_s2g2 of ww9_s2g6
{ __DEFAULT -> D# ww9_s2g6 }}

9.8

-- RHS size: {terms: 10, types: 9, coercions: 0, joins: 0/0}
varianceNoInline
:: forall (v :: * -> *). Vector v Double => v Double -> Double
varianceNoInline
= \ (@(v_ :: * -> *))
($dictVec :: Vector v_ Double)
(vec0 :: v_ Double) ->
case $wvarianceNoInline @v_ $dictVec vec0 of ww_sfOR
{ __DEFAULT -> D# ww_sfOR }

Apparently the lookup of functions in the dictionary caused the performance degradation (44 CYC/elt → 65 CYC/elt). But the allocation picture is a mystery:
P.S. The Box trick seems to be terribly wasteful when GHC can't specialize. It doubles allocations for small values.
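For readers unfamiliar with the Box trick, here is a minimal sketch of the idea, assuming a hypothetical list-backed `indexBox` in place of a real vector. Returning the element inside a lazy `Box` lets the caller force the lookup itself without forcing the element. The cost noted above plausibly comes from the extra `Box` cell that each non-inlined call allocates alongside the boxed element, but that reading is an assumption and was not checked against the Core in the gist.

```haskell
module Main where

-- Lazy one-field box: matching on 'Box' forces the lookup, not the element.
data Box a = Box a

-- Hypothetical list-backed lookup, standing in for a vector indexing method.
indexBox :: [a] -> Int -> Box a
indexBox xs i = case drop i xs of
  (x : _) -> Box x
  []      -> error "indexBox: index out of range"

-- Forcing the Box performs the lookup but leaves the element unevaluated,
-- so an undefined element does not crash here.
lookupOnly :: String
lookupOnly = case indexBox [undefined, 2 :: Int] 0 of
  Box _ -> "lookup done, element untouched"

-- Index-driven sum in the style of the benchmark loop. Without inlining,
-- each iteration may allocate a Box in addition to the boxed Double.
sumVia :: [Double] -> Double
sumVia xs = go 0 0
  where
    n = length xs
    go acc i
      | i >= n    = acc
      | otherwise = case indexBox xs i of
          Box x -> go (acc + x) (i + 1)

main :: IO ()
main = do
  putStrLn lookupOnly
  print (sumVia [1, 2, 3])
```

When the indexing call is inlined and specialized, the `case ... of Box x` usually cancels against the `Box` constructor and no cell is allocated, which matches the observation that the specialized benchmark is unaffected.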
In the stddev benchmark with NOINLINE it gives quite significant improvements across all compiler versions:
- 3-10% reduction in CPU cycles depending on GHC version
- 2 fewer branches per indexing in all cases.

No change for the inlined version. Overall this is a cheap and nice change.
I found that using
No changes for the case when specialization happens.
#485 did only half of the job: while GHC now knows that the index is used strictly, it still would not necessarily unpack it, because basicUnsafeIndexM must receive Int, not Int#. Only after inlining will an opportunity to erase boxing arise.

This patch introduces basicUnsafeIndexM# to help GHC further. If it looks good, I'll go for basicUnsafeRead# / basicUnsafeWrite# in another PR.
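The shape of such a change can be sketched with a toy class. `ToyVector`, `ListVec` and `OldVec` are illustration-only names, and the class is a single-parameter simplification rather than the library's real `Vector` class. The sketch assumes the two methods get default definitions in terms of each other, as the "mutually recursive" remark in the conversation hints, so that an instance may define either one.

```haskell
{-# LANGUAGE MagicHash #-}
module Main where

import GHC.Exts (Int (I#), Int#)

-- Lazy one-field box, mirroring the Box used in the discussion.
data Box a = Box a

unBox :: Box a -> a
unBox (Box x) = x

-- Hypothetical miniature of the indexing part of the class.
class ToyVector v where
  -- Boxed entry point: callers must hand over an Int.
  basicUnsafeIndexM :: v a -> Int -> Box a
  basicUnsafeIndexM v (I# i) = basicUnsafeIndexM# v i
  {-# INLINE basicUnsafeIndexM #-}

  -- Unboxed entry point: the index arrives as a raw Int#, so no box is
  -- needed even when the call goes through a dictionary.
  basicUnsafeIndexM# :: v a -> Int# -> Box a
  basicUnsafeIndexM# v i = basicUnsafeIndexM v (I# i)
  {-# INLINE basicUnsafeIndexM# #-}

  {-# MINIMAL basicUnsafeIndexM | basicUnsafeIndexM# #-}

-- New-style instance: defines only the unboxed method.
newtype ListVec a = ListVec [a]

instance ToyVector ListVec where
  basicUnsafeIndexM# (ListVec xs) i = Box (xs !! I# i)

-- Old-style instance: defines only the boxed method and still compiles.
newtype OldVec a = OldVec [a]

instance ToyVector OldVec where
  basicUnsafeIndexM (OldVec xs) i = Box (xs !! i)

main :: IO ()
main = do
  print (unBox (basicUnsafeIndexM (ListVec "abc") 1))                  -- 'b'
  print (unBox (basicUnsafeIndexM# (OldVec [10, 20, 30 :: Int]) 2#))  -- 30
```

With mutual defaults, an instance that defines neither method would loop at runtime, which is why the sketch includes a `MINIMAL` pragma. Whether the actual patch takes the same approach is not visible in this excerpt.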