optimize par fast path by packaging all scheduler data into single closure #193
Problem
We noticed previously that functions which call `ForkJoin.par` effectively take many more arguments than expected. For example, compiling this simple definition of a parallel fib...... results in this RSSA:
Here, we see an RSSA-level function called `fib_0`, which (as you might hope) takes exactly two arguments: an input `n` (RSSA variable `x_4373`) and an environment `env_8` containing the necessary data to support the call to `ForkJoin.par`.

However, upon entry, `fib_0` immediately unpacks approximately 20 components of the closure into temporaries.

This inefficiency is carried through into codegen, and results in significantly more instructions on the hot path.
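For reference, a parallel fib of the kind discussed here might look roughly like the sketch below. This is not the exact snippet that was compiled for this PR. The `ForkJoin` structure defined here is a hypothetical sequential stand-in, included only so the sketch runs under any Standard ML implementation; under MPL the real `ForkJoin.par` would be used instead.

```sml
(* Hypothetical sequential stand-in for MPL's ForkJoin structure.
 * It runs the two thunks one after the other; the real ForkJoin.par
 * may run them in parallel. *)
structure ForkJoin =
struct
  fun par (f : unit -> 'a, g : unit -> 'b) : 'a * 'b = (f (), g ())
end

(* Parallel fib: the two recursive calls are handed to ForkJoin.par. *)
fun fib n =
  if n < 2 then n
  else
    let
      val (x, y) = ForkJoin.par (fn () => fib (n - 1), fn () => fib (n - 2))
    in
      x + y
    end

val fib20 = fib 20  (* 6765 *)
```

Every recursive call with `n >= 2` goes through `ForkJoin.par`, so any per-call overhead in `par` lands directly on the hot path.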
Diagnosis
Why is this happening?
In short, because `ForkJoin.par` closes over many (many!) components of the scheduler which are each used differently. Some are used on the fast path, others only on the slow path, and therefore MPL is forced to split the environment into temporaries and handle each temporary separately.

It is helpful to consider the code for `ForkJoin.par`, which is implemented by `greedyWorkAmortizedFork`. This code calls a number of functions defined elsewhere, such as `maybeSpawnFunc`, `syncEndAtomic`, etc., each of which has its own associated closure within `env_8` of the generated RSSA code above, but MPL can't statically prove that these closures all have the same lifetimes.

Solution
This patch creates a single "scheduler package" which manually closes over all of the data that `ForkJoin.par` needs to execute; we then always access this data explicitly through the scheduler package, making it easy for MPL to prove that all of this data has the same lifetime.

The advantage is immediately clear in the generated code. This removes approximately 20 instructions from the fast path.
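The source-level shape of the "scheduler package" idea can be sketched as follows. This is a minimal illustration under stated assumptions, not the code in this patch: `sched_package`, `makePackage`, `parWith`, and the counter fields are hypothetical names, and the body of `parWith` is a sequential stand-in for the real fork-join logic.

```sml
(* Hypothetical sketch: names and fields are illustrative stand-ins,
 * not the actual MPL scheduler code. *)

(* One record bundles everything the par fast path needs. *)
type sched_package =
  { maybeSpawn : unit -> unit
  , syncEnd : unit -> unit
  , spawns : int ref
  , syncs : int ref
  }

(* Build the package once; all scheduler state lives behind this value. *)
fun makePackage () : sched_package =
  let
    val spawns = ref 0
    val syncs = ref 0
  in
    { maybeSpawn = fn () => spawns := !spawns + 1
    , syncEnd = fn () => syncs := !syncs + 1
    , spawns = spawns
    , syncs = syncs
    }
  end

(* par reaches every piece of scheduler data explicitly through pkg,
 * so pkg is the only scheduler value it needs to capture. *)
fun parWith (pkg : sched_package) (f : unit -> 'a, g : unit -> 'b) : 'a * 'b =
  let
    val () = #maybeSpawn pkg ()
    val a = f ()
    val b = g ()
    val () = #syncEnd pkg ()
  in
    (a, b)
  end

val pkg = makePackage ()

(* Eta-expanded so that par stays polymorphic. *)
fun par fg = parWith pkg fg

val (x, y) = par (fn () => 1 + 1, fn () => 2 * 3)
```

Because `par` captures only `pkg`, every field it touches is reached through one value, which is the property the patch relies on to let MPL treat the data as having a single lifetime.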
The performance improvement is big! Nearly 2x on parallel fib.
I've similarly measured approximately 60% improvement on `linefit-ng` and 50% improvement on `wc-ng`.