Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Consider design change in light of GHC join point work #156

Open
treeowl opened this issue Feb 21, 2017 · 31 comments
Open

Consider design change in light of GHC join point work #156

treeowl opened this issue Feb 21, 2017 · 31 comments

Comments

@treeowl
Copy link
Contributor

treeowl commented Feb 21, 2017

The join points paper suggests that it may now be possible to avoid the need for Skip altogether. It would be worth checking if that's actually the case. If so, it might be possible to simplify the implementation considerably. Furthermore, I seem to recall that the presence of skips (which allow vectors of the same length to be represented by streams of different lengths) prevents one or more optimizations. I vaguely remember trying to do something a bit clever with replicate and finding it impossible because of skips, but I don't remember any details.

@cartazio
Copy link
Contributor

cartazio commented Feb 23, 2017 via email

@treeowl
Copy link
Contributor Author

treeowl commented Feb 23, 2017 via email

@dolio
Copy link
Contributor

dolio commented Feb 24, 2017

The paper says the new join point handling makes it so that recursive stepper functions are no longer a problem. So Skip is no longer needed.

However, making this change would break (fusion-wise) the functions that require Skip on every compiler version that doesn't have the new join point handling. I assume that's everything before 8.2, or at best 8.0. So I'm not sure such a change can be justified for quite some time.

@cartazio
Copy link
Contributor

cartazio commented Feb 24, 2017 via email

@dolio
Copy link
Contributor

dolio commented Feb 26, 2017

The one way I can think of to accelerate this kind of thing is to decouple the fusion from vectors. Have modules that provide plain vectors, and other modules that somehow add in fusion. Then give a choice of which fusion framework to use.

However, I'm not very sure how to accomplish something like this. The most obvious thing to say is, "this is where you want ML modules," because fusion is sort of a thing that transforms one implementation of an API into another. But backpack is probably more of a limiting factor than the join point stuff. Perhaps a newtype could work:

newtype Fused v a = F (v a)

Where you use Fused Vector a if you want fusion to happen for your choice of Vector.

Anyhow, this is a research project on top of a research project. :)

@cartazio
Copy link
Contributor

cartazio commented Aug 5, 2019

@treeowl whats the best way to explore the join point cleanup in your mind?

@GeorgeCo
Copy link

Any thoughts on this recently? From the paper it sound very interesting:

Result: simpler code, less of it, and faster to execute. It’s a straight win

Would it be worthwhile to do a small prototype to see e.g. how much it improves runtime and/or compiler performance? Perhaps some of the zip or zipWith functions ? I picked those based on the following from the paper, but I'm not sure if it applies to zip for vectors:

Now the stepper function can say to update the state and call again, obviating the need for a loop of its own. This makes
filter fusible, but it complicates everything else! Everything gets three cases instead of two, leading to more code and more runtime tests; and functions like zip that consume two lists become more complicated and less efficient.

@lehins
Copy link
Contributor

lehins commented Mar 21, 2021

It is interesting, but it is also an enormous amount of work. @GeorgeCo You feel like working on this and submitting a PR with a prototype? If I were to work on this I would start with implementing and verifying performance of filter, because that is exactly what we need Skip for.

We already have a plan for putting all the streaming functionality that uses Stream and Step types into a separate package #355 So if this approach with relying on join points proves itself as feasible replacement for newer ghc we could then have two stream packages: one for older ghc with Skip and another package for newer ghc versions that doesn't have to rely on Skip for performance.

@noughtmare
Copy link

noughtmare commented Sep 26, 2022

I've just done a quick scan over the code. Now I'm stuck and it seems to me like it is impossible to remove Skip for monadic streams. Take for example the drop function. I've removed the Step constructor and replaced its usage by a recursive call:

drop :: Monad m => Int -> Stream m a -> Stream m a
{-# INLINE_FUSED drop #-}
drop n (Stream step t) = Stream step' (t, Just n)
  where
    {-# INLINE_INNER step' #-}
    step' (s, Just i) | i > 0 = do
                                  r <- step s
                                  case r of
                                     Yield _ s' -> step' (s', Just (i-1))
                                     Done       -> return Done

    step' (s, _) = liftM (\r ->
                     case r of
                       Yield x s' -> Yield x (s', Nothing)
                       Done       -> Done
                   ) (step s)

But here the step' (s', Just (i - 1)) recursive call is not guaranteed to be a tail call (that depends on what the Monad m is), so GHC can't mark step' as a join point in general (if I understand it correctly).

Or is there another way to write this so that it can use a join point?

@cartazio
Copy link
Contributor

cartazio commented Sep 26, 2022 via email

@noughtmare
Copy link

noughtmare commented Sep 26, 2022

would having a flavor of monads that
guarantee they respect tail calls, in a way ghc understands, solve that?

That's exactly what I was thinking!

I wonder what that looks like. I know somebody asked SPJ once after one of his talks if join points could be made explicit in the surface language and SPJ considered the possibility but didn't see an immediate use case.

I also think this is related to Alexis King's "Effects for Less" talk which mentions that it is really important that the monadic bind operator gets inlined.

Perhaps if we use her Eff monad instead of arbitrary monads in the streams then we would not have this issue.

@GeorgeCo
Copy link

It might be worthwhile to draw SPJ and the other author of the paper into this conversation.

@noughtmare
Copy link

noughtmare commented Sep 26, 2022

ping @simonpj @pdownen @lukemaurer (I couldn't find Zena Ariola's github account if she even has one)

@simonpj
Copy link

simonpj commented Sep 26, 2022

Now I'm stuck and it seems to me like it is impossible to remove Skip for monadic streams. T

We are taking about fusion here, correct? That's only going to happen if you inine the function (in this case drop at the call site, else there is nothing to fuse with. And once you inline, you likely have fixed the monad.

I rather doubt that you'll achieve fusion for an arbitrary monad.

A good way to move this converation on would be to provide a concrete example of

  • a program where you think fusion should take place
  • the fused program that you think GHC should be able to reach by program transformations.

Being concrete aids comprehension, and saves misinderstandings.

@noughtmare
Copy link

noughtmare commented Sep 26, 2022

Oh, that sounds right. Now I feel silly for missing that.

To be much more concrete. Take for example the Quickhull benchmark. Before my changes it takes 55.7 ms and after removing Skip it takes 72.2 ms. Looking at the core reveals a bunch of bindings like the following which I think are to blame for this discrepancy. These are not present in the core dumps from before my changes.

letrec {
  $s$wstep'_sebJ :: Int# -> Int# ->
    Id (Step (Int, Int, Maybe (Double, Double)) ((Double, Double), Double))
  $s$wstep'_sebJ
    = \ (sc_sebF :: Int#) (sc1_sebE :: Int#) ->
        case >=# sc1_sebE ww_sdOU of {
          __DEFAULT ->
            case indexDoubleArray# ww2_sdP0 (+# ww1_sdOY sc1_sebE) of wild2_a6SW { __DEFAULT ->
            case indexDoubleArray# ww4_sdP7 (+# ww3_sdP5 sc1_sebE) of wild1_Xe { __DEFAULT ->
            case >=# sc_sebF ww7_sdOz of {
              __DEFAULT ->
                case indexDoubleArray# ipv3_adv1 sc_sebF of wild3_Xh { __DEFAULT ->
                case >## wild3_Xh 0.0## of {
                  __DEFAULT -> $s$wstep'_sebJ (+# sc_sebF 1#) (+# sc1_sebE 1#); -- <<< recursive call here
                  1# ->
                    (Yield
                       ((D# wild2_a6SW, D# wild1_Xe), D# wild3_Xh)
                       (I# (+# sc1_sebE 1#), I# (+# sc_sebF 1#), Nothing))
                    `cast` <Co:15>
                }};
              1# -> Done `cast` <Co:15>
            }}};
          1# -> Done `cast` <Co:15>
        }; } in

Here are the full pre and post dumps of all the benchmarks:
bench-dumps.zip

And the particular benchmarks that got noticeably worse are: listrank, awshcc, hybcc, quickhull, findIndexR, findIndexR_manual. Especially awshcc got more than 3x slower and hybcc about 2x slower.

@noughtmare
Copy link

noughtmare commented Sep 26, 2022

Maybe it is some kind of cross-function case of case optimization that is lacking. After reducing an example to just a V.map (* 2) . V.filter (> 10) I get this core:

letrec {
  $wstep'_s3Uf :: Int# -> Id (Step Int Double)
  $wstep'_s3Uf
    = \ (ww2_s3Ud :: Int#) ->
        case >=# ww2_s3Ud ipv1_s3Po of {
          __DEFAULT ->
            case indexDoubleArray# ipv2_s3Pp (+# ipv_s3Pn ww2_s3Ud)
            of wild2_i3d4
            { __DEFAULT ->
            case >## wild2_i3d4 10.0## of {
              __DEFAULT -> $wstep'_s3Uf (+# ww2_s3Ud 1#);
              1# ->
                (Yield (D# wild2_i3d4) (I# (+# ww2_s3Ud 1#))) `cast` <Co:5>
            }
            };
          1# -> Done `cast` <Co:5>
        }; } in
joinrec {
  $s$wfoldlM'_loop_s3WN
    :: State# RealWorld -> Int# -> Int# -> Vector Double
  $s$wfoldlM'_loop_s3WN (sc_s3WM :: State# RealWorld)
                        (sc1_s3WK :: Int#)
                        (sc2_s3WJ :: Int#)
    = case ($wstep'_s3Uf sc1_s3WK) `cast` <Co:4> of { -- <<< $wstep' can't be a join point
        Yield x_a3P4 s'_a3P5 ->
          case x_a3P4 of { D# x1_a2wq ->
          case writeDoubleArray#
                 ipv4_i3En
                 sc2_s3WJ
                 (+## x1_a2wq x1_a2wq)
                 (sc_s3WM `cast` <Co:5>)
          of s'#_i3S3
          { __DEFAULT ->
          case s'_a3P5 of { I# ww3_X4 ->
          jump $s$wfoldlM'_loop_s3WN
            (s'#_i3S3 `cast` <Co:4>) ww3_X4 (+# sc2_s3WJ 1#)
          }
          }
          };
        Done ->
          case unsafeFreezeByteArray# ipv4_i3En (sc_s3WM `cast` <Co:5>) of
          { (# ipv5_i3ci, ipv6_i3cj #) ->
          (Vector 0# sc2_s3WJ ipv6_i3cj) `cast` <Co:5>
          }
      }; } in
jump $s$wfoldlM'_loop_s3WN (ipv3_i3Em `cast` <Co:4>) 0# 0#

Maybe I'm running into https://gitlab.haskell.org/ghc/ghc/-/issues/16335 again?

@simonpj
Copy link

simonpj commented Sep 26, 2022

Maybe this is too hard, but could you distil out a standalone repro case?

Or (less good but still OK) supply precise repro instructions for your "reduced example". By precise I mean "check out this repo, cd to there, execute these commands..."

Looking at the code you show, I think that FloatIn should move the binding for $wstep' inwards, into the scrutinee of the case; there it will become a join point and good things should happen. I don't know why this doesn't happen.

@noughtmare
Copy link

noughtmare commented Sep 27, 2022

Here's a standalone reproducer:

https://gist.github.com/noughtmare/f979c0fc9eed30abcd0849d19cf0f746

with the interesting core here:

https://gist.github.com/noughtmare/f979c0fc9eed30abcd0849d19cf0f746#file-all-dump-simpl-L2737-L2794

Commands you can run:

wget https://gist.githubusercontent.com/noughtmare/f979c0fc9eed30abcd0849d19cf0f746/raw/858db7da16a5ba542148d794655057bed33f88b3/All.hs
ghc -package ghc-prim -package primitive -O2 -ddump-simpl -dusppress-all -dno-suppress-type-signatures -ddump-to-file All.hs

@noughtmare
Copy link

noughtmare commented Sep 27, 2022

If I translate the produced core back to Haskell:

module T where

import Control.Monad.Primitive
import Data.Primitive.Array
import System.IO.Unsafe (unsafePerformIO)

data Step s a = Yield a s | Done

uninitialised = undefined

test :: Int -> Int -> Array Double -> (Int, Int, Array Double)
test off n oldArr = unsafePerformIO $ do
  newArr <- newArray n uninitialised
  let
    step' i
      | i >= n = Done
      | otherwise =
        let x = indexArray oldArr (off + i) in
        if x > 10
        then Yield x (i + 1)
        else step' (i + 1)
    loop i j = do
      case step' i of
        Yield x s' -> do
          writeArray newArr j (x + 1)
          loop s' (j + 1)
        Done -> do
          out <- unsafeFreezeArray newArr
          return (0, j, out)
  loop 0 0

And then compile that again I do get the nice core with two join points where the intermediate Yield and Done constructors are eliminated:

$wtest :: Int -> Int# -> Array Double -> (Int, Int, Array Double)
$wtest
  = \ (w :: Int) (ww :: Int#) (w1 :: Array Double) ->
      runRW#
        (\ (s :: State# RealWorld) ->
           case noDuplicate# s of s' { __DEFAULT ->
           case newArray# ww uninitialised (s' `cast` <Co:3>) of
           { (# ipv, ipv1 #) ->
           joinrec {
             $s$wloop
               :: State# RealWorld -> Int# -> Int# -> (Int, Int, Array Double)
             $s$wloop (sc :: State# RealWorld) (sc1 :: Int#) (sc2 :: Int#)
               = join {
                   $j :: (Int, Int, Array Double)
                   $j
                     = case unsafeFreezeArray# ipv1 (sc `cast` <Co:3>) of
                       { (# ipv2, ipv3 #) ->
                       lazy (test1, I# sc1, Array ipv3)
                       } } in
                 case >=# sc2 ww of {
                   __DEFAULT ->
                     case w of { I# x ->
                     case w1 of { Array ds2 ->
                     case indexArray# ds2 (+# x sc2) of { (# ipv2 #) ->
                     case ipv2 of { D# x1 ->
                     case >## x1 10.0## of {
                       __DEFAULT ->
                         joinrec {
                           $wstep' :: Int# -> (Int, Int, Array Double)
                           $wstep' (ww1 :: Int#)
                             = case >=# ww1 ww of {
                                 __DEFAULT ->
                                   case indexArray# ds2 (+# x ww1) of { (# ipv3 #) ->
                                   case ipv3 of { D# x2 ->
                                   case >## x2 10.0## of {
                                     __DEFAULT -> jump $wstep' (+# ww1 1#);
                                     1# ->
                                       case writeArray#
                                              ipv1 sc1 (D# (+## x2 1.0##)) (sc `cast` <Co:3>)
                                       of s'#
                                       { __DEFAULT ->
                                       jump $s$wloop (s'# `cast` <Co:2>) (+# sc1 1#) (+# ww1 1#)
                                       }
                                   } } };
                                 1# -> jump $j
                               }; } in
                         jump $wstep' (+# sc2 1#);
                       1# ->
                         case writeArray# ipv1 sc1 (D# (+## x1 1.0##)) (sc `cast` <Co:3>)
                         of s'#
                         { __DEFAULT ->
                         jump $s$wloop (s'# `cast` <Co:2>) (+# sc1 1#) (+# sc2 1#)
                         }
                     } } } } };
                   1# -> jump $j
                 }; } in
           jump $s$wloop (ipv `cast` <Co:2>) 0# 0#
           } })

So is the simplifier just running out of steam?

@GeorgeCo
Copy link

GeorgeCo commented Sep 27, 2022 via email

@noughtmare
Copy link

noughtmare commented Sep 27, 2022

I did try increasing the max simplifier iterations which did not help.

But now I've rewritten it using the lower level unboxed operations:

{-# LANGUAGE MagicHash #-}
{-# LANGUAGE UnboxedTuples #-}
module T where

import GHC.Exts
import GHC.IO

data Step s a = Yield a s | Done

uninitialised = undefined

test :: Int# -> Int# -> Array# Double -> (# Int#, Int#, Array# Double #)
test off n oldArr = runRW# $ \s0 ->
  case newArray# n uninitialised s0
   of { (# s1, newArr #) ->
  let
    step' i
      | isTrue# (i >=# n) = Done
      | otherwise =
        let (# D# x #) = indexArray# oldArr (off +# i) in
        if isTrue# (x >## 10.0##)
        then Yield (D# x) (I# (i +# 1#))
        else step' (i +# 1#)
    loop i j s2 =
      case step' i of
        Yield x (I# s') ->
          case writeArray# newArr j (x + 1) s2
           of { s3 -> 
          loop s' (j +# 1#) s3
        }
        Done ->
          case unsafeFreezeArray# newArr s2
           of { (# s3, out #) ->
          (# 0#, j, out #)
        }
  in
  loop 0# 0# s1
  }

And compiling this program does show the optimisation failure:

test
  = \ off n oldArr ->
      runRW#
        (\ s0 ->
           case newArray# n uninitialised s0 of { (# ipv, ipv1 #) ->
           letrec {
             step'
               = \ i ->
                   case >=# i n of {
                     __DEFAULT ->
                       case indexArray# oldArr (+# off i) of { (# ipv2 #) ->
                       case ipv2 of wild { D# x ->
                       case >## x 10.0## of {
                         __DEFAULT -> step' (+# i 1#);
                         1# -> Yield wild (I# (+# i 1#))
                       }
                       }
                       };
                     1# -> Done
                   }; } in
           join {
             exit j s2
               = case unsafeFreezeArray# ipv1 s2 of { (# ipv2, ipv3 #) ->
                 (# 0#, j, ipv3 #)
                 } } in
           joinrec {
             loop i j s2
               = case step' i of {
                   Yield x ds1 ->
                     case ds1 of { I# s' ->
                     case writeArray#
                            ipv1 j (case x of { D# x1 -> D# (+## x1 1.0##) }) s2
                     of s3
                     { __DEFAULT ->
                     jump loop s' (+# j 1#) s3
                     }
                     };
                   Done -> jump exit j s2
                 }; } in
           jump loop 0# 0# ipv
           })

@noughtmare
Copy link

I think it is time to open a proper GHC issue for this, so I've done that: https://gitlab.haskell.org/ghc/ghc/-/issues/22227

@cartazio
Copy link
Contributor

cartazio commented Sep 27, 2022 via email

@noughtmare
Copy link

noughtmare commented Nov 1, 2022

Good news! I have been able to manually implement the "loopification" optimization that GHC is missing (see this WIP merge request of GHC) with cfaf0d4. Now almost all benchmarks are the same or faster (tridiag speeds up the most from 86ms to 19.3ms) except awshcc which is still about 2x slower (406ms to 872ms). I'll look into that now. For now, I can already say is that it is able to fuse properly so the problem must be something else.

@GeorgeCo
Copy link

GeorgeCo commented Nov 1, 2022

Wow !

@cartazio
Copy link
Contributor

cartazio commented Nov 1, 2022

Would love to see the table of timing changes

@noughtmare
Copy link

noughtmare commented Nov 1, 2022

I've now reached the point that all benchmarks are (approximately) equal or faster, see #448 (comment).

@lehins
Copy link
Contributor

lehins commented Nov 1, 2022

@noughtmare That is all pretty great news.

If I understand correctly the proper speed up is only observed with the ghc patch

How does the regression look for all the older ghc versions. We do support all the way down to ghc-8.0

@noughtmare
Copy link

noughtmare commented Nov 1, 2022

@lehins

If I understand correctly the proper speed up is only observed with the ghc patch

No, I've manually implemented the optimization that GHC would do with that patch.

For now I've tested with GHC 9.2.4. I'll do some more tests with other GHC versions, but I believe this only really requires join points (I don't know when those were introduced, but I think GHC 8 already had them).

@lehins
Copy link
Contributor

lehins commented Nov 1, 2022

Now I understand, that is great! Awesome work!

Let's make sure we don't mess up performance for anyone and I don't see a reason why we shouldn't get this change into vector.

@sgraf812
Copy link

sgraf812 commented Nov 1, 2022

It's great to see that you can make a loopified definition work today with GHC, but do note that fusion will break across module boundaries. For example,

module A where
import Data.Stream
drop6 :: Stream a -> Stream a
drop6 xs = drop 6 xs

module B where

import A

main = print $ sum $ drop6 [1,2,3]

note that the stepper of drop6 won't be in a loopified form (because the local, non-recursive binding is probably immediately inlined).

This is kind of the same restriction that we have about foldr/build fusion at the moment, because we rely on compiler phases (too much).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

8 participants