Introduce basicUnsafeIndexM# #489

Open · wants to merge 2 commits into master
Conversation

@Bodigrim (Contributor) commented Mar 23, 2024

#485 did only half of the job: while GHC now knows that the index is used strictly, it still would not necessarily unbox it, because basicUnsafeIndexM must receive Int, not Int#. Only after inlining does an opportunity to erase the boxing arise.

This patch introduces basicUnsafeIndexM# to help GHC further. If it looks good, I'll follow up with basicUnsafeRead# / basicUnsafeWrite# in another PR.
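For context, here is a minimal self-contained sketch of the shape of such a class extension (simplified names, with Maybe standing in for vector's Box; this is not the actual vector code). The unboxed method gets a default in terms of the boxed one and vice versa, so existing instances that define only basicUnsafeIndexM keep compiling:

```haskell
{-# LANGUAGE MagicHash #-}
{-# LANGUAGE MultiParamTypeClasses #-}
{-# LANGUAGE FlexibleInstances #-}
import GHC.Exts (Int#, Int (I#))

-- Toy analogue of vector's Vector class: two mutually recursive
-- defaults, so an instance may define either method.
class Vector v a where
  basicUnsafeIndexM :: v a -> Int -> Maybe a
  basicUnsafeIndexM v (I# i#) = basicUnsafeIndexM# v i#

  basicUnsafeIndexM# :: v a -> Int# -> Maybe a
  basicUnsafeIndexM# v i# = basicUnsafeIndexM v (I# i#)

-- An existing-style instance defining only the boxed method still compiles,
-- and gets the unboxed method for free via the default.
newtype ListVector a = ListVector [a]

instance Vector ListVector a where
  basicUnsafeIndexM (ListVector xs) i
    | i >= 0 && i < length xs = Just (xs !! i)
    | otherwise               = Nothing
```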

@Shimuuar (Contributor)

This is, however, a very breaking change. Due to the design of unboxed vectors (which I consider a mistake), this change will break all unboxed vectors: they require defining data instances and MVector/Vector instances for each type.

@Bodigrim (Contributor, Author)

@Shimuuar what exactly would break? I tried a couple of packages with manual instances for unboxed vectors; they seem to compile fine.

@Shimuuar (Contributor)

Sorry, I missed that basicUnsafeIndexM and basicUnsafeIndexM# are mutually recursive. I withdraw my objection.

step (I# i)
  | I# i >= n = return Done
  | otherwise = case basicUnsafeIndexM v i of
                  Box x -> return $ Yield x (i+1)

Here and everywhere in this PR: when we pattern-match on I# but then construct it again in the function's body, is GHC 100% guaranteed to optimise away the allocation of a fresh Int and just reuse whatever we pattern-matched on?


GHC should be quite good at this, so it should eliminate the Int allocations.
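A small illustration of the pattern under discussion (hypothetical function f, not from the PR): the Int# is matched out and immediately reboxed in the body. GHC's CSE usually replaces the exact rebox I# i# with the original box, though this is an optimisation rather than a guarantee:

```haskell
{-# LANGUAGE MagicHash #-}
import GHC.Exts (Int (I#), (+#))

-- Match out the Int#, then rebox it. The first component's `I# i#`
-- is syntactically identical to what `n` was matched against, so CSE
-- can reuse n's box; the second component is a genuinely new Int.
f :: Int -> (Int, Int)
f n = case n of
  I# i# -> (I# i#, I# (i# +# 1#))
```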

@Shimuuar (Contributor) commented Apr 1, 2024

I tried to measure the impact of this optimization. To do so I added a simple benchmark which computes the variance of a vector of doubles (branches index-profiling and index-profiling-B) in two otherwise identical variants: one with an INLINE pragma and another with NOINLINE. It reads from the vector, so it should exercise the changes to basicUnsafeIndex. I ran them with GHC 8.10, 9.2, 9.4, 9.6 and 9.8.

varianceNoInline :: (VG.Vector v Double) => v Double -> Double
{-# NOINLINE varianceNoInline #-}
varianceNoInline xs
  = VG.sum (VG.map (\x -> (x - s)^(2::Int)) xs) / n
  where
    n = fromIntegral $ VG.length xs
    s = VG.sum xs / n
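The INLINE counterpart is not shown in the comment; presumably it differs only in the pragma. A sketch of it (with an assumed name, and specialized to plain lists here so the snippet compiles without the vector package):

```haskell
-- Assumed INLINE twin of varianceNoInline: same body, opposite pragma,
-- written over lists as a stand-in for VG.Vector v Double => v Double.
varianceInline :: [Double] -> Double
{-# INLINE varianceInline #-}
varianceInline xs
  = sum (map (\x -> (x - s) ^ (2 :: Int)) xs) / n
  where
    n = fromIntegral (length xs)
    s = sum xs / n
```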

The function specialized by GHC runs in constant space. The changes to indexing make no difference, and performance is stable across all GHC versions at 6 cycles/element.

The non-specialized situation is much more interesting. Adding strictness (#485) makes no difference at all in this benchmark; the programs seem to be identical (at least the instruction count is exactly the same), so I'll compare the baseline (index[base]) and this PR (index[unboxed]).

Allocations

Below are allocations per array element:

  8.10.7     index[base]     96
   9.2.8     index[base]     96
   9.4.8     index[base]     96
   9.6.3     index[base]     96
   9.8.1     index[base]     96
  8.10.7  index[unboxed]     64
   9.2.8  index[unboxed]     64
   9.4.8  index[unboxed]     64
   9.6.3  index[unboxed]    128
   9.8.1  index[unboxed]    128

This PR is a clear improvement for GHC <= 9.4, but somehow it becomes a pessimization for GHC >= 9.6. I haven't looked into the Core yet, so I have no idea why.

Performance

Runtime performance follows the same pattern:

    ghc             tag  INS/element  CYC/element
 8.10.7     index[base]  155.5        44.6
 8.10.7  index[unboxed]  128.6        32.7
  9.2.8     index[base]  153.9        44.9
  9.2.8  index[unboxed]  127.2        33.6
  9.4.8     index[base]  153.9        44.1
  9.4.8  index[unboxed]  127.3        34.1
  9.6.3     index[base]  189.9        65.0
  9.6.3  index[unboxed]  372.6       125.6
  9.8.1     index[base]  189.9        60.8
  9.8.1  index[unboxed]  372.6       120.3

A 25% win for GHC <= 9.4 and a 100% loss for GHC >= 9.6. The latter performs worse even without this optimization; it looks like we have a regression in the GHC optimizer.

But I haven't looked into the Core yet, and the answer lies there.

@Shimuuar (Contributor) commented Apr 3, 2024

First of all, some estimates. We need to perform indexing twice, which at least means allocating 2 Box Double (64 B) per element with this optimization, and 2 Box Double plus 2 Int (96 B) without. This means the generated code is rather tight. Something happened between 9.4 and 9.6.

Cleaned-up Core for the non-specialized function can be seen in a gist. The Core is very similar across GHC versions and with/without this optimization. Notable differences:

GHC 9.4 → 9.8

GHC changed how the worker-wrapper transformation is done. GHC 9.4 passed the basicLength and basicUnsafeIndexM class methods into the worker function, while 9.8 passes the full dictionary:

9.4
varianceNoInline :: forall (v :: * -> *). Vector v Double => v Double -> Double
varianceNoInline
  = \ (@(v_ :: * -> *))
      ($dVector_s2fQ :: Vector v_ Double)
      (xs_s2g2 :: v_ Double) ->
      case $dVector_s2fQ of
      { C:Vector ww_s2fS ww1_s2fT ww2_s2fU ww3_s2fV ww4_s2fW ww5_s2fX
                 ww6_s2fY ww7_s2fZ ww8_s2g0 ->
      case $wvarianceNoInline @v_ ww3_s2fV ww6_s2fY xs_s2g2 of ww9_s2g6
      { __DEFAULT -> D# ww9_s2g6 }}
9.8
-- RHS size: {terms: 10, types: 9, coercions: 0, joins: 0/0}
varianceNoInline
  :: forall (v :: * -> *). Vector v Double => v Double -> Double
varianceNoInline
  = \ (@(v_ :: * -> *))
      ($dictVec :: Vector v_ Double)
      (vec0 :: v_ Double) ->
      case $wvarianceNoInline @v_ $dictVec vec0 of ww_sfOR
      { __DEFAULT -> D# ww_sfOR }

Apparently the lookup of functions in the dictionary caused the performance degradation (44 CYC/elt → 65 CYC/elt).

But the allocations picture is a mystery:

  • case basicUnsafeIndexM @v_ @Double $dict vec (I# sc_sfOy) of allocates a Box Double (32 B) and an Int (16 B).
  • case basicUnsafeIndexM# @v_ @Double $dictVec vec sc_sfPc of allocates a Box Double (32 B) and 32 B somewhere else. The rest of the Core is essentially the same.

P.S. The Box trick seems to be terribly wasteful in the case when GHC can't specialize: it doubles allocations for small values.
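For reference, the Box in question is vector's stream-fusion helper (as in Data.Vector.Fusion.Util); a minimal sketch of it, showing why every monadic index costs one constructor allocation when GHC cannot fuse it away:

```haskell
-- Minimal sketch of vector's Box: an identity monad whose sole job is
-- to delay an indexing result. Each `Box x` is a heap allocation unless
-- the optimiser eliminates it, which is the overhead discussed above.
data Box a = Box { unBox :: a }

instance Functor Box where
  fmap f (Box x) = Box (f x)

instance Applicative Box where
  pure = Box
  Box f <*> Box x = Box (f x)

instance Monad Box where
  return = pure
  Box x >>= f = f x
```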

In the stddev benchmark with NOINLINE it gives quite significant improvements
across all compiler versions:

 - 3-10% reduction in CPU cycles, depending on GHC version
 - 2 fewer branches per indexing in all cases.

No change for the inlined version.

Overall this is a cheap and nice change.
@Shimuuar (Contributor) commented Apr 8, 2024

I found that using coerce in Unbox instances which just wrap primitive vectors gives a nice performance boost when specialization doesn't happen: 3-10% for the benchmark above. It also removes two branch instructions per index. Overall it's a cheap optimization.

No changes in the case when specialization happens.
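A minimal sketch of the coerce trick with invented names (PrimVec standing in for a primitive vector, since the real Unbox instances live inside vector): defining the wrapped method as a coercion of the underlying one lets GHC reuse that code directly, instead of compiling wrap/unwrap glue:

```haskell
import Data.Coerce (coerce)

-- Stand-ins for a primitive vector and an Unbox-style newtype over it.
newtype PrimVec a  = PrimVec [a]
newtype WrappedVec = WrappedVec (PrimVec Double)

indexPrim :: PrimVec Double -> Int -> Double
indexPrim (PrimVec xs) i = xs !! i

-- coerce is free at runtime: no pattern match on WrappedVec, no
-- re-wrapping, hence no extra branches per index.
indexWrapped :: WrappedVec -> Int -> Double
indexWrapped = coerce indexPrim
```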
