Introduce basicUnsafeIndexM# #489

Open · wants to merge 2 commits into master
Conversation

@Bodigrim (Contributor) commented Mar 23, 2024

#485 did only half of the job: while GHC now knows that the index is used strictly, it still would not necessarily unbox it, because basicUnsafeIndexM must receive Int, not Int#. Only after inlining does an opportunity to erase the boxing arise.

This patch introduces basicUnsafeIndexM# to help GHC further. If it looks good, I'll follow up with basicUnsafeRead# / basicUnsafeWrite# in another PR.
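For context, here is a minimal self-contained sketch of the shape of such a class extension (simplified names, with Maybe standing in for vector's Box; this is not the actual vector code). The unboxed method gets a default in terms of the boxed one and vice versa, so existing instances that define only basicUnsafeIndexM keep compiling:

```haskell
{-# LANGUAGE MagicHash #-}
{-# LANGUAGE MultiParamTypeClasses #-}
{-# LANGUAGE FlexibleInstances #-}
import GHC.Exts (Int#, Int (I#))

-- Toy analogue of vector's Vector class: two mutually recursive
-- defaults, so an instance may define either method.
class Vector v a where
  basicUnsafeIndexM :: v a -> Int -> Maybe a
  basicUnsafeIndexM v (I# i#) = basicUnsafeIndexM# v i#

  basicUnsafeIndexM# :: v a -> Int# -> Maybe a
  basicUnsafeIndexM# v i# = basicUnsafeIndexM v (I# i#)

-- An existing-style instance defining only the boxed method still compiles,
-- and gets the unboxed method for free via the default.
newtype ListVector a = ListVector [a]

instance Vector ListVector a where
  basicUnsafeIndexM (ListVector xs) i
    | i >= 0 && i < length xs = Just (xs !! i)
    | otherwise               = Nothing
```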

@Shimuuar (Contributor)

This is, however, a very breaking change. Due to the design of unboxed vectors (which I consider a mistake), this change will break all unboxed vectors: they require defining data instances and MVector/Vector instances for each type.

@Bodigrim (Contributor, Author)

@Shimuuar what exactly would break? I tried a couple of packages with manual instances for unboxed vectors; they seem to compile fine.

@Shimuuar (Contributor)

Sorry, I missed that basicUnsafeIndexM and basicUnsafeIndexM# are mutually recursive. I withdraw my objection.

step (I# i)
  | I# i >= n = return Done
  | otherwise = case basicUnsafeIndexM v i of
                  Box x -> return $ Yield x (i+1)

Here and everywhere in this PR: when we pattern-match on I# but then construct it again in the function's body, is GHC 100% guaranteed to optimise away the allocation of a fresh Int and just reuse whatever we pattern-matched on?


GHC should be quite good at this, so it should eliminate the Int allocations.
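A small illustration of the pattern under discussion (hypothetical function f, not from the PR): the Int# is matched out and immediately reboxed in the body. GHC's CSE usually replaces the exact rebox I# i# with the original box, though this is an optimisation rather than a guarantee:

```haskell
{-# LANGUAGE MagicHash #-}
import GHC.Exts (Int (I#), (+#))

-- Match out the Int#, then rebox it. The first component's `I# i#`
-- is syntactically identical to what `n` was matched against, so CSE
-- can reuse n's box; the second component is a genuinely new Int.
f :: Int -> (Int, Int)
f n = case n of
  I# i# -> (I# i#, I# (i# +# 1#))
```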

@Shimuuar (Contributor) commented Apr 1, 2024

I tried to measure the impact of this optimization. To do so I added a simple benchmark which computes the variance of a vector of doubles (branches index-profiling and index-profiling-B) in two otherwise identical variants: one with an INLINE pragma and another with NOINLINE. It reads from the vector, so it should exercise the changes to basicUnsafeIndex. I ran them with GHC 8.10, 9.2, 9.4, 9.6 and 9.8.

varianceNoInline :: (VG.Vector v Double) => v Double -> Double
{-# NOINLINE varianceNoInline #-}
varianceNoInline xs
  = VG.sum (VG.map (\x -> (x - s)^(2::Int)) xs) / n
  where
    n = fromIntegral $ VG.length xs
    s = VG.sum xs / n
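The INLINE counterpart is not shown in the comment; presumably it differs only in the pragma. A sketch of it (with an assumed name, and specialized to plain lists here so the snippet compiles without the vector package):

```haskell
-- Assumed INLINE twin of varianceNoInline: same body, opposite pragma,
-- written over lists as a stand-in for VG.Vector v Double => v Double.
varianceInline :: [Double] -> Double
{-# INLINE varianceInline #-}
varianceInline xs
  = sum (map (\x -> (x - s) ^ (2 :: Int)) xs) / n
  where
    n = fromIntegral (length xs)
    s = sum xs / n
```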

The function specialized by GHC runs in constant space. The changes to indexing make no difference, and performance is stable across all GHC versions at 6 cycles/element.

The non-specialized situation is much more interesting. Adding strictness (#485) makes no difference at all in this benchmark; the programs seem to be identical (at least the instruction count is exactly the same), so I'll compare the baseline (index[base]) and this PR (index[unboxed]).

Allocations

Below are allocations per array element:

  8.10.7     index[base]     96
   9.2.8     index[base]     96
   9.4.8     index[base]     96
   9.6.3     index[base]     96
   9.8.1     index[base]     96
  8.10.7  index[unboxed]     64
   9.2.8  index[unboxed]     64
   9.4.8  index[unboxed]     64
   9.6.3  index[unboxed]    128
   9.8.1  index[unboxed]    128

This PR is a clear improvement for GHC <= 9.4, but somehow it becomes a pessimization for GHC >= 9.6. I haven't looked into the Core yet, so I have no idea why.

Performance

Runtime performance follows the same pattern:

    ghc             tag  INS/element  CYC/element
 8.10.7     index[base]  155.5        44.6
 8.10.7  index[unboxed]  128.6        32.7
  9.2.8     index[base]  153.9        44.9
  9.2.8  index[unboxed]  127.2        33.6
  9.4.8     index[base]  153.9        44.1
  9.4.8  index[unboxed]  127.3        34.1
  9.6.3     index[base]  189.9        65.0
  9.6.3  index[unboxed]  372.6       125.6
  9.8.1     index[base]  189.9        60.8
  9.8.1  index[unboxed]  372.6       120.3

A 25% win for GHC <= 9.4 and a 100% loss for GHC >= 9.6. The latter performs worse even without this optimization; it looks like we have a regression in the GHC optimizer.

But I haven't looked into the Core yet, and the answer lies there.

@Shimuuar (Contributor) commented Apr 3, 2024

First of all, some estimates. We need to perform indexing twice, which at least means allocating 2 Box Double (64 B) per element with this optimization, and 2 Box Double plus 2 Int (96 B) without. This means the generated code is rather tight. Something happened between 9.4 and 9.6.

Cleaned-up Core for the non-specialized function can be seen in a gist. The Core is very similar across GHC versions and with/without this optimization. Notable differences:

GHC 9.4 → 9.8

GHC changed how the worker-wrapper transformation is done. GHC 9.4 passed the basicLength and basicUnsafeIndexM class methods into the worker function, while 9.8 passes the full dictionary:

9.4
varianceNoInline :: forall (v :: * -> *). Vector v Double => v Double -> Double
varianceNoInline
  = \ (@(v_ :: * -> *))
      ($dVector_s2fQ :: Vector v_ Double)
      (xs_s2g2 :: v_ Double) ->
      case $dVector_s2fQ of
      { C:Vector ww_s2fS ww1_s2fT ww2_s2fU ww3_s2fV ww4_s2fW ww5_s2fX
                 ww6_s2fY ww7_s2fZ ww8_s2g0 ->
      case $wvarianceNoInline @v_ ww3_s2fV ww6_s2fY xs_s2g2 of ww9_s2g6
      { __DEFAULT -> D# ww9_s2g6 }}
9.8
-- RHS size: {terms: 10, types: 9, coercions: 0, joins: 0/0}
varianceNoInline
  :: forall (v :: * -> *). Vector v Double => v Double -> Double
varianceNoInline
  = \ (@(v_ :: * -> *))
      ($dictVec :: Vector v_ Double)
      (vec0 :: v_ Double) ->
      case $wvarianceNoInline @v_ $dictVec vec0 of ww_sfOR
      { __DEFAULT -> D# ww_sfOR }

Apparently the lookup of functions in the dictionary caused the performance degradation (44 CYC/elt → 65 CYC/elt).

But the allocations picture is a mystery:

  • case basicUnsafeIndexM @v_ @Double $dict vec (I# sc_sfOy) of allocates a Box Double (32 B) and an Int (16 B).
  • case basicUnsafeIndexM# @v_ @Double $dictVec vec sc_sfPc of allocates a Box Double (32 B) and 32 B somewhere else. The rest of the Core is essentially the same.

P.S. The Box trick seems to be terribly wasteful in the case when GHC can't specialize: it doubles allocations for small values.
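For reference, the Box in question is vector's stream-fusion helper (as in Data.Vector.Fusion.Util); a minimal sketch of it, showing why every monadic index costs one constructor allocation when GHC cannot fuse it away:

```haskell
-- Minimal sketch of vector's Box: an identity monad whose sole job is
-- to delay an indexing result. Each `Box x` is a heap allocation unless
-- the optimiser eliminates it, which is the overhead discussed above.
data Box a = Box { unBox :: a }

instance Functor Box where
  fmap f (Box x) = Box (f x)

instance Applicative Box where
  pure = Box
  Box f <*> Box x = Box (f x)

instance Monad Box where
  return = pure
  Box x >>= f = f x
```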

In the stddev benchmark with NOINLINE it gives quite significant improvements
across all compiler versions:

 - 3-10% reduction in CPU cycles, depending on GHC version
 - 2 fewer branches per indexing in all cases.

No change for the inlined version.

Overall this is a cheap and nice change.
@Shimuuar (Contributor) commented Apr 8, 2024

I found that using coerce in Unbox instances which just wrap primitive vectors gives a nice performance boost when specialization doesn't happen: 3-10% for the benchmark above. It also removes two branch instructions per index. Overall it's a cheap optimization.

No changes in the case when specialization happens.
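A minimal sketch of the coerce trick with invented names (PrimVec standing in for a primitive vector, since the real Unbox instances live inside vector): defining the wrapped method as a coercion of the underlying one lets GHC reuse that code directly, instead of compiling wrap/unwrap glue:

```haskell
import Data.Coerce (coerce)

-- Stand-ins for a primitive vector and an Unbox-style newtype over it.
newtype PrimVec a  = PrimVec [a]
newtype WrappedVec = WrappedVec (PrimVec Double)

indexPrim :: PrimVec Double -> Int -> Double
indexPrim (PrimVec xs) i = xs !! i

-- coerce is free at runtime: no pattern match on WrappedVec, no
-- re-wrapping, hence no extra branches per index.
indexWrapped :: WrappedVec -> Int -> Double
indexWrapped = coerce indexPrim
```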
