50% productivity on particle simulation #140
It's worth noting that things aren't much better with a static FRP network:

```haskell
fireworks :: Event Double -> MomentIO (Behavior [SparkData])
fireworks stepPhysics = do
  sparks <- for (replicate 200 ()) $ \_ -> newSpark stepPhysics
  return (sequenceA (map observeSpark sparks))
  where
    observeSpark Spark {..} =
      SparkData <$> particlePosition sparkParticle <*> sparkIntensity
    deleteOne i s = deleteAt i <$ sparkFaded s
```

Gives: …
I don't know what exactly is going on here, but I do know that reactive-banana currently uses monad transformers internally, and that the last time I worked on performance improvements, it turned out that they were responsible for performance problems. Essentially, the problem is that many uses of …

I have also thought of a solution, but haven't gotten around to implementing it yet. For the record: the key idea is to remove the …

In summary: making an algebraic data type for the different node types (e.g. fmap, filter, union etc.) allows us to get rid of …
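The "algebraic data type for the different node types" idea can be sketched roughly as follows. This is an illustrative sketch only, using hypothetical names (`NodeFn`, `applyNode`), not reactive-banana's actual internals:

```haskell
{-# LANGUAGE GADTs #-}

-- A hypothetical first-order representation of node computations.
-- Instead of each node storing an opaque closure, it stores a
-- constructor naming the combinator that built it.
data NodeFn a b where
  FMapF   :: (a -> b)      -> NodeFn a b        -- from an 'fmap' node
  FilterF :: (a -> Bool)   -> NodeFn a a        -- from a 'filter' node
  UnionF  :: (a -> a -> a) -> NodeFn (a, a) a   -- from a 'union' node

-- A single interpreter replaces many heap-allocated closures: the
-- evaluator pattern-matches on the constructor rather than entering
-- an unknown function pointer.
applyNode :: NodeFn a b -> a -> Maybe b
applyNode (FMapF f)   x      = Just (f x)
applyNode (FilterF p) x      = if p x then Just x else Nothing
applyNode (UnionF c)  (x, y) = Just (c x y)
```

Only the user-supplied functions (the arguments of the constructors) remain closures; the combinator structure itself becomes first-order data the evaluator can branch on directly.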
Interesting, and that all loosely makes sense. I'm also surprised to see …
I think ocharles@37fca51 is what you are describing with replacing …
Yes, that's exactly what I meant! (This is called "defunctionalization", because you replace a general closure (…).)

Now, it's possible to unwrap the … and expand the right-hand side of this line. The point would be to make sure that every application of … and … is precisely building a closure on the heap and discarding it again.

Note that it may not be necessary to do all of this tedious rewriting by hand. If you look at the generated core, you can see whether the inliner can already perform it, or at least parts of it. The sign to look for is whether the core contains calls to the function …

(Also, it may help to write …)

Do use a performance test case to check whether these changes to the code actually make things faster. :-)
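The "unwrap and inline" strategy discussed above can be sketched like this. The names `Build` and `Env` are stand-ins of my own, not reactive-banana's actual types; the point is that instead of stacking `ReaderT` over `IO` (where every bind wraps and immediately unwraps a closure), the monad is defined directly on the unwrapped representation, with INLINE pragmas so GHC can eliminate the plumbing at each call site:

```haskell
-- Hypothetical builder monad, written out by hand rather than as
-- 'ReaderT Env IO', so each bind is a plain function application
-- that the inliner can see through.
type Env = Int

newtype Build a = Build { runBuild :: Env -> IO a }

instance Functor Build where
  fmap f (Build m) = Build $ \env -> fmap f (m env)
  {-# INLINE fmap #-}

instance Applicative Build where
  pure x = Build $ \_ -> pure x
  {-# INLINE pure #-}
  Build mf <*> Build mx = Build $ \env -> mf env <*> mx env
  {-# INLINE (<*>) #-}

instance Monad Build where
  Build m >>= k = Build $ \env -> m env >>= \x -> runBuild (k x) env
  {-# INLINE (>>=) #-}

-- Read the environment (what 'ask' would be in the transformer version).
askEnv :: Build Env
askEnv = Build pure
```

Whether the pragmas are actually needed can only be judged by inspecting the core, as noted above; GHC often manages without them.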
Ok, well ocharles@72ba09b "pushes" …
I successfully swapped out …
I think that goes in the right direction, but it's probably necessary to unwrap everything inside the …

Whether this has to be done or not can only be judged by inspecting the compiled core. The Makefile* contains two (phony) targets …

*The Makefile assumes that a cabal sandbox is used.
(By the way, I have written a couple of notes on the issue of optimizing monad transformer stacks, which is relevant to what we are trying to do here. It also shows a Haskell Core output example and what to look for.)
Thanks, I do get what's being suggested and I'll have a further dig. By the way, are you aware that the latest …
An idle thought that comes up when I stare at …
Alright, I got a bit more organised and took a stab at doing some serious optimisations. My …
With the "john lato" benchmark, I move from 0.120s total time to 0.046s, with a third of the bytes allocated. With my own criterion-based benchmark, I move from a mean execution time of 6.456 μs to 1.363 μs. Unfortunately, my particle physics application is still only 70% productive, so I will continue to see what I can do.
Impressive! 😄 It would be nice to know which of the changes has the most impact, i.e. which yield the biggest reduction in runtime. After all, each of them tends to make the code more "spaghetti", so I would like to limit myself to those that give the best bang for the buck.

Did you have a chance to look at the generated Core? My (admittedly limited) experience was that GHC is pretty good at inlining things, so it's usually not necessary to add INLINE pragmas everywhere.

The nice thing about the …
Unfortunately, I'm quite sure that this is not correct. It should be possible to construct a counterexample along the following lines: if … might get evaluated by the depth-first approach before the pulse corresponding to …

I think the implicit "an IORef per node" approach will find it hard to beat the explicit priority queue, so that's probably not a low-hanging fruit.
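The ordering guarantee the priority queue provides can be illustrated with a small sketch (names are mine, not reactive-banana's). Each node carries a level, its depth in the dependency graph; draining a min-priority queue by level ensures a node never runs before the inputs it depends on, which a depth-first walk over per-node IORefs cannot promise for two-input nodes like union:

```haskell
import qualified Data.Map.Strict as Map

-- A node's level: its depth in the dependency graph.  Inputs always
-- sit at strictly lower levels than the nodes that read them.
type Level = Int

-- Pop the lowest level first, so e.g. both 'fmap' nodes feeding a
-- 'union' node (level 1) run before the union itself (level 2).
drain :: Map.Map Level [node] -> [node]
drain q = case Map.minViewWithKey q of
  Nothing                 -> []
  Just ((_lvl, ns), rest) -> ns ++ drain rest
```

A depth-first traversal starting from one input could reach the union node via one edge while the pulse arriving via the other edge has not been computed yet; the level-ordered drain rules that out by construction.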
I still suspect that this is due to closures being allocated and destroyed in the tight loop that is …
Any further luck with this? I have to finish a couple of other things first, but then I'm happy to take a closer look at this as well.
Still working on different approaches. At the moment I'm exploring running with …
Hello! Any updates on this work? |
I don't have any updates I'm afraid.
It would be great to get some of your work merged, I think the only thing @HeinrichApfelmus objected to was ripping out the priority queue of nodes to evaluate.
I'll have to dig out the branch again and see if I can make some sensible commits from it. I'll see if I can put some time aside.
I resurrected this and applied all my current fixes. Unfortunately there's still a significant space leak: … As this uses …
As #261 is now fixed — perhaps the situation has improved here as well? |
Things seem pretty good on current master, actually! I'm seeing around 88% productivity. |
Oh, actually only around 80%; I was getting higher productivity when doing profiled runs, oddly. Unfortunately there are nameless things. I believe they're all from …
Hi! I've put together a little benchmark of doing some good old 90s particle simulations: https://github.com/ocharles/reactive-banana-fireworks-benchmark.
Essentially every clock tick I spawn a new particle, which has position/velocity/life time, and I move all particles around on each clock tick. This exhibits some pretty disappointing performance, so hopefully we can work out what's going wrong and get it nice and speedy.
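The per-tick step described above can be modelled in plain Haskell roughly like this. The names (`Particle`, `stepParticle`, `tick`) are illustrative, not the benchmark's actual code:

```haskell
-- A particle with position, velocity, and remaining lifetime in ticks.
data Particle = Particle
  { pos  :: (Double, Double)
  , vel  :: (Double, Double)
  , life :: Int
  } deriving (Eq, Show)

-- Advance one particle by its velocity and age it by one tick.
stepParticle :: Particle -> Particle
stepParticle (Particle (x, y) (vx, vy) t) =
  Particle (x + vx, y + vy) (vx, vy) (t - 1)

-- One clock tick: spawn a fresh particle and move the survivors,
-- dropping any particle whose lifetime has expired.
tick :: Particle -> [Particle] -> [Particle]
tick newborn ps = newborn : filter ((> 0) . life) (map stepParticle ps)
```

With one spawn per tick and every live particle touched on every tick, the simulation allocates heavily in a tight loop, which is why it makes a good stress test for the FRP network's per-pulse overhead.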
I've attached the result of running with -p here, but it's not very enlightening (to me) - here's the header: …