
% vim: ts=4 sw=4 et ft=tex

\chapter{Overlap Analysis for Dependent AND-parallelism}
\label{chap:overlap}

\status{This is work in progress, some sections are ready for review.}

Introducing parallelism into a Mercury program is easy.
A programmer can use the parallel conjunction operator (an ampersand)
instead of the plain conjunction operator (a comma) to tell the compiler
that the conjuncts of the conjunction should be executed in parallel with
one another.
However,
in almost all places where parallelism can be introduced,
it will not be
profitable as it is not worthwhile parallelising small computations.
Making a task available to another CPU
may take thousands of instructions,
so spawning off a task that takes only a hundred instructions is clearly a
loss.
Even spawning off a task of a few thousand instructions is not a win;
it should only be done for computations
that take long enough to benefit from parallelism.
It is often difficult for a programmer to determine if parallelisation of
any particular computation is worthwhile.
Researchers have therefore worked towards automatic parallelisation.
Autoparallelising compilers have long tried to use granularity analysis to
ensure that they only spawn off computations whose cost will probably exceed
the spawn-off cost by a comfortable margin.
However, this is not enough to yield good results,
because data dependencies may \emph{also} limit
the usefulness of running computations in parallel.
If a spawned off computation blocks almost immediately
and can resume only after another computation has completed its work,
then the cost of parallelisation again exceeds the benefit.

%This chapter is based on and extends our paper:
%
%\begin{quote}
%\pubauthor{Paul Bone, Zoltan Somogyi and Peter Schachte.}
%% This spelling of parallelisation is okay.
%\pubtitle{Estimating the overlap between dependent computations for automatic
%parallelization}
%\pubhow{Theory and Practice of Logic Programming,}{11(4--5):575--591, 2011.}
%\end{quote}

%\noindent
This chapter presents a set of algorithms for recognising places in a
program where it is worthwhile to execute two or more computations in
parallel,
algorithms that pay attention to the second of these issues as well as the
first.
Our system uses profiling information to estimate the
times at which a procedure call is expected to consume the values of its
input arguments
and the times at which it is expected to produce the values of its output
arguments.
Given two calls that may be executed in parallel,
our system uses the estimated times of production and consumption
of the variables they share
to determine how much their executions are likely to overlap
when run in parallel,
and therefore whether executing them in parallel is a good idea or not.

We have implemented this technique for Mercury
in the form of a tool
that uses data from Mercury's deep profiler
to generate recommendations about what to parallelise.
The programmer can then execute the compiler with automatic parallelism
enabled and provide the recommendations to generate a parallel version of
the program.
An important benefit of profile-directed parallelisation is that
since programmers do not annotate the source program,
it can be re-parallelised easily after a change to the program
obsoletes some old parallelisation opportunities and creates others.
To do this, the programmer can re-profile the program to generate fresh
recommendations and recompile the program to apply those recommendations.
Nevertheless, if programmers want to parallelise some conjunctions manually,
they can do so: our system will not override the programmer.

We present preliminary results that show that
this technique can yield useful parallelisation speedups,
while requiring nothing more from the programmer
than representative input data for the profiling run.

The structure of this chapter is as follows.
Section~\ref{sec:overlap_aims} states our two aims for this chapter.
Then Section~\ref{sec:overlap_approach} outlines our general approach
including information about the call graph search for parallelisation
opportunities.
Section~\ref{sec:overlap_reccalls}
describes how we calculate information about recursive calls missing from
the profiling data.
Section~\ref{sec:overlap_coverage}
describes our change to coverage profiling, which provides more accurate
coverage data for the new call graph based search for parallelisation
opportunities.
Section~\ref{sec:overlap_overlap_alg} describes our algorithm for
calculating the execution overlap between two or more dependent conjuncts.
A conjunction with more than two conjuncts can be parallelised
in several different ways;
Section~\ref{sec:overlap_howto} shows how we choose the best way.
Section~\ref{sec:overlap_pragmatic} discusses some pragmatic issues.
Section~\ref{sec:overlap_perf} evaluates
how our system works in practice on some example programs, and
Section~\ref{sec:overlap_related} concludes
with comparisons to related work.

\section{Aims}
\label{sec:overlap_aims}

\status{This section is ready for proofreading by Peter S. or Micheal R.}

When parallelising Mercury programs,
the best parallelisation opportunities occur
where two goals take a significant and roughly similar amount of time to
execute.
Their execution time should be as large as possible
so that the relative costs of parallel execution are small,
and they should be independent to minimise synchronisation costs.
Unfortunately, goals expensive enough to be worth executing in parallel
are rarely independent.
For example, in the Mercury compiler itself,
there are 69 conjunctions containing two or more expensive goals,
goals with a cost above 10,000csc (call sequence counts),
but in only three of those conjunctions are the expensive goals independent.
This is why Mercury supports the parallel execution of dependent conjunctions
through the use of futures and a compiler transformation
\citep{wang:2006:hons, wang:2011:dep-par} (Section~\ref{sec:backgnd_deppar}).
If the \emph{consumer} of a shared variable attempts to retrieve the variable's value
before it has been produced, then the consumer's execution is blocked
until the \emph{producer} makes the variable available.

\picfigure{overlap_compare}{Ample vs smaller parallel overlap between \code{p} and \code{q}}

Dependent parallel conjunctions differ widely
in the amount of parallelism they have available.
Consider a parallel conjunction with two similarly-sized conjuncts,
\code{p} and \code{q}, that share a single variable \code{A}.
If \code{p} produces \code{A} late but \code{q} consumes it early,
as shown on the right side of Figure~\ref{fig:overlap_compare},
there will be little parallelism,
since \code{q} will be blocked soon after it starts,
and will be unblocked only when \code{p} is about to finish.
Alternatively, if \code{p} produces \code{A} early
and \code{q} consumes it late,
as shown on the left side of Figure~\ref{fig:overlap_compare},
we would get much more parallelism.
The top part of each scenario
shows the execution of the sequential form of the conjunction.

Unfortunately, in real Mercury programs,
almost all conjunctions are dependent conjunctions,
and in most of them,
shared variables are produced very late and consumed very early.
Parallelising them would therefore yield slowdowns instead of speedups,
because the overheads of parallel execution would far outweigh the
benefit of the small amount of parallelism that is available.
We want to parallelise only conjunctions
in which any shared variables are produced early, consumed late,
or (preferably) both;
such computations expose more parallelism.
The first purpose of this chapter is to show how one can find these conjunctions.

\begin{figure}
\begin{center}
\begin{minipage}[b]{0.49\textwidth}
\subfigure[Sequential \mapfoldl]{%
\label{fig:map_foldl_seq}
{\small
\begin{tabular}{l}
\\
\\
\code{map\_foldl(\_, \_, [], Acc, Acc).} \\
\code{map\_foldl(M, F, [X $|$ Xs], Acc0, Acc) :-} \\
\code{~~~~M(X, Y),} \\
\code{~~~~F(Y, Acc0, Acc1),} \\
\code{~~~~map\_foldl(M, F, Xs, Acc1, Acc).} \\
\end{tabular}}
}
\end{minipage}
%
\begin{minipage}[b]{0.49\textwidth}
\subfigure[Parallel \mapfoldl with overlap]{%
\label{fig:map_foldl_par}
{\small
\begin{tabular}{l}
\code{map\_foldl(\_, \_, [], Acc, Acc).} \\
\code{map\_foldl(M, F, [X $|$ Xs], Acc0, Acc) :-} \\
\code{~~~~(} \\
\code{~~~~~~~~M(X, Y),} \\
\code{~~~~~~~~F(Y, Acc0, Acc1)} \\
\code{~~~~) \&} \\
\code{~~~~map\_foldl(M, F, Xs, Acc1, Acc).} \\
\end{tabular}}
}
\end{minipage}

\end{center}
\caption{Sequential and parallel \mapfoldl}
% the recursive call is less dependent
% on the conjunction of the first two calls.
\label{fig:map_foldl}
%\vspace{-2\baselineskip}
\end{figure}

\picfigure{mapfoldl-overlap}{Overlap of \mapfoldl 
(Figure~\ref{fig:map_foldl_par})}

The second purpose of this chapter is to find the best way to parallelise
these conjunctions.
Consider the \mapfoldl predicate in Figure~\ref{fig:map_foldl_seq}.
The body of the recursive clause has three conjuncts.
We could make each conjunct execute in parallel,
or we could execute two conjuncts in sequence
(either the first and second, or the second and third),
and execute that sequential conjunction in parallel with the remaining conjunct.
In this case, there is little point in executing
the higher order calls to the \M and \F predicates
%(herein map and fold respectively)
in parallel with one another,
since in virtually all cases,
\M will generate \code{Y} very late and
\F will consume \code{Y} very early.
However, executing the sequential conjunction of the calls to \M and \F
in parallel with the recursive call \emph{will} be worthwhile
if \M is time-consuming,
because this implies that
a typical recursive call will consume its fourth argument late.
The recursive call processing the second element of the list
will have significant execution overlap (mainly the cost of \M)
with its parent processing the first element of the list
even if (as is typical) the fold predicate generates \code{Acc1} very late.
This parallel version of \mapfoldl is shown in
Figure~\ref{fig:map_foldl_par}.
A representation of the first three iterations of it
is shown in Figure~\ref{fig:mapfoldl-overlap}.
(This is the kind of computation that
Reform Prolog \citep{bevemyr:reform} was designed to parallelise.)

\section{Our general approach}
\label{sec:overlap_approach}

\status{This section is ready for proofreading by Peter S. or Micheal R.}

\plan{Goal}
We want to find the conjunctions in the program
whose parallelisation would be the most profitable.
This means finding the conjunctions with conjuncts
whose execution cost exceeds the spawning-off cost by the highest margin,
and whose interdependencies, if any,
allow their executions to overlap the most.
It is better to spawn off a medium-sized computation
whose execution can overlap almost completely
with the execution of another medium-sized computation,
than it is to spawn off a big computation
whose execution can overlap only slightly
with the execution of another big computation,
but it is better still to spawn off a big computation
whose execution can overlap almost completely
with the execution of another big computation.
Essentially, the more the tasks' executions can overlap with one another,
the greater the margin by which
the likely runtime of the parallel version of a conjunction beats
the likely runtime of the sequential version (speedup),
and the more beneficial parallelising that conjunction will be.

\plan{Profiler feedback}
To compute this likely benefit,
we need information
both about the likely cost of calls
and the execution overlap allowed by their dependencies.
A compiler may be able to estimate some cost information from static
analysis.
However, this will not be accurate;
static analysis cannot take into account sizes of data terms,
or other values that are only available at runtime.
It may be possible to provide this data by some other means,
such as by requiring the programmer to provide
descriptions of the typical shapes and sizes of
their program's likely input data.
Programming folklore says that programmers are not good at estimating where
their programs' hotspots are.
Some of the reasons for this will affect a programmer's estimate of their
program's likely input data, making it inaccurate.
In fact, misunderstanding a program's typical input is one of the reasons
why a programmer is likely to mis-estimate the location of the
program's hotspots.
Our argument is that an estimate,
even a confident one, can only be verified by measurement,
but a measurement never needs estimation to back it up.
Therefore,
our automatic parallelisation system uses profiler feedback information.
This was introduced in Section~\ref{sec:backgnd_autopar},
which also includes a description of Mercury's deep profiler.
To generate the profiler feedback data,
we require programmers to follow this sequence of actions after they have
tested and debugged the program.

\begin{enumerate}
\item
Compile the program
with options asking for profiling
for automatic parallelisation.
\item
Run the program on a representative set of input data.
This will generate a profiling data file.
\item
Invoke our feedback tool on the profiling data file.
This will generate a parallelisation feedback file.
\item
Compile the program for parallel execution,
specifying the feedback file.
The file tells the compiler
\emph{which} sequential conjunctions to convert to parallel conjunctions,
and exactly \emph{how}.
For example, \code{c1, c2, c3} can be converted
into \code{c1 \& (c2, c3)},
into \code{(c1, c2) \& c3}, or
into \code{c1 \& c2 \& c3},
and as the \code{map\_foldl} example shows,
the speedups you get from them can be strikingly different.
\end{enumerate}

\noindent
A visual representation of such a workflow is shown in 
Figure~\ref{fig:prof_fb} on page~\pageref{fig:prof_fb}.
It is up to the programmer using our system
to select training input for the profiling run in step 2.
Obviously, programmers should pick input that is as representative as
possible;
but even input data that is quite different from the training input can
generate useful parallelisation recommendations.
Variations in our input data will change the numerical results
that we use to decide whether something should be parallelised,
but they rarely change a ``should parallelise'' decision into a
``should not parallelise'' decision or vice versa.
The other source of inaccuracy comes from mis-estimating the hardware's
performance on certain operations such as the cost of spawning off a new
task.
Such mis-estimations will have the same impact as variations in input data.
The main focus of this chapter is on step 3;
we give the main algorithms used by the feedback tool.
However, we will also touch on steps 1 and 4.
We believe that step 2 can only be addressed by the programmer,
as they understand what input is representative for their program.

\plan{DFS \& limits}
Our feedback tool looks for parallelisation opportunities
by doing a depth-first search of the call tree of the profiling run,
each node of which is an SCC (strongly connected component) of procedure
calls.
It explores the subtree below a node in the tree
only if the per-call cost of the subtree is greater than a configurable
threshold,
and if the amount of parallelism it has found at and above that node
is below another configurable threshold.
The first test lets us avoid looking at code
that would take more work to spawn off than to execute,
while the second test lets us avoid creating
more parallel work than the target machine can handle.
Together these tests dramatically reduce the portions of a program that need
analysis,
reducing the time required to search for parallelisation opportunities.

For each procedure in the call tree,
we search its body for conjunctions that contain two or more calls with
execution times above yet another configurable threshold.
This test also reduces the parts of the program that will be analysed
further;
it quickly rejects procedures that cannot contain any profitable
parallelism.
Parallelising a conjunction
requires partitioning the original conjuncts into two or more groups,
with the conjuncts in each group being executed sequentially
but different groups being executed in parallel.
Each group represents a hypothetical sequential conjunction,
and the set of groups represents a hypothetical parallel conjunction.
As this parallel conjunction represents a possible parallelisation of the
original conjunction, we call it a \emph{candidate parallelisation}.
Most conjunctions can be partitioned into several alternative candidate
parallelisations;
for example, we showed above that \mapfoldl has three alternative
parallelisations of its recursive branch.
We use the algorithms of Section~\ref{sec:overlap_overlap_alg}
to compute the expected parallel execution time of each parallelisation.
These algorithms take into account the runtime overheads of parallel execution.
Large conjunctions can have a very large number of
candidate parallelisations ($2^{n-1} - 1$ for $n$ conjuncts).
Therefore,
we use the algorithms of Section~\ref{sec:overlap_howto}
to heuristically reduce the number of parallelisations whose expected
execution time we calculate.
If the best-performing parallelisation we find
shows a nontrivial speedup over sequential execution,
we remember that we want to perform that parallelisation on this conjunction.
A procedure can contain several conjunctions with two or more goals that we
consider parallelising;
therefore multiple candidate parallelisations may be generated for different
conjunctions in a procedure.
The same procedure may also appear more than once in the call graph.
Each time it occurs in the call graph its conjunctions may be parallelised
differently, or not at all;
the procedure is therefore said to be \emph{polyvariant} (having multiple forms).
Currently our implementation compiles a single \emph{monovariant} procedure.
We discuss how the implementation chooses which candidate parallelisations to
include in Section~\ref{sec:overlap_pragmatic}.
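
As a sketch of where the count of candidate parallelisations comes from,
assume (as in the \code{c1, c2, c3} example of step 4 above)
that a candidate parallelisation keeps the conjuncts in their original order
and only chooses, at each of the $n-1$ boundaries between adjacent conjuncts,
whether that boundary is sequential or parallel.
This gives $2^{n-1}$ ways of filling in the boundaries,
one of which (every boundary sequential) is the original sequential
conjunction.
For $n = 3$ the four possibilities are:

\begin{center}
\begin{tabular}{ll}
\code{c1, c2, c3}     & the original sequential conjunction \\
\code{c1 \& (c2, c3)} & first boundary parallel \\
\code{(c1, c2) \& c3} & second boundary parallel \\
\code{c1 \& c2 \& c3} & both boundaries parallel \\
\end{tabular}
\end{center}

\noindent
The last three are the three alternative parallelisations
of \mapfoldl's recursive branch mentioned above.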

% \section{Traversing the call graph}
% \label{sec:overlap_dfs}
%
% % XXX Further work.
% \paul{TODO: This feature is not yet implemented.}
% If the depth first search later finds
% some of the conjuncts to have parallelisable code inside them,
% we revisit this conjunction,
% this time using updated data about the cost of those conjuncts.
% Otherwise,
% we add a recommendation to perform the selected parallelisation
% to the feedback advice we generate for the compiler.


% \paul{This is not yet implemented and will not be for this version of the paper.}
% \peter{Then you need to say that.}

% GREEDY_SEARCH The top level algorithm of the feedback tool
% GREEDY_SEARCH is a traversal of the tree of cliques
% GREEDY_SEARCH recorded in the deep profiling data file.
% GREEDY_SEARCH Each clique has its own unique entry point,
% GREEDY_SEARCH which will be a call site in a higher clique;
% GREEDY_SEARCH this higher clique is the parent node of this clique.
% GREEDY_SEARCH Likewise, every call site
% GREEDY_SEARCH in every procedure in the clique
% GREEDY_SEARCH will be the entry point of another clique,
% GREEDY_SEARCH provided that
% GREEDY_SEARCH (a) it is actually executed and (b) the callee is not in this clique.
% GREEDY_SEARCH These lower cliques are the children of this clique.

% GREEDY_SEARCH % We will describe our traversal algorithm in detail
% GREEDY_SEARCH % in section \ref{sec:bestfirst},
% GREEDY_SEARCH % but for now, consider this traversal
% GREEDY_SEARCH % as operating on a \emph{candidates list},
% GREEDY_SEARCH % a list of cliques sorted on total cost.
% GREEDY_SEARCH Our traversal algorithm operates on a \emph{candidates list},
% GREEDY_SEARCH which contains a list of cliques sorted on total cost.
% GREEDY_SEARCH We start with the list containing only
% GREEDY_SEARCH the clique of the top level call to \code{main},
% GREEDY_SEARCH the predicate where every Mercury program starts execution.
% GREEDY_SEARCH Then, at each step,
% GREEDY_SEARCH \begin{itemize}
% GREEDY_SEARCH \item
% GREEDY_SEARCH we remove the clique at the start of the candidates list;
% GREEDY_SEARCH \item
% GREEDY_SEARCH we process this clique
% GREEDY_SEARCH by looking at the conjunctions in the clique's procedures
% GREEDY_SEARCH to see whether they should be parallelised; and then
% GREEDY_SEARCH \item
% GREEDY_SEARCH we insert the child cliques (if any) of this clique into the candidates list.
% GREEDY_SEARCH \end{itemize}
% GREEDY_SEARCH We stop when either even the highest cost candidate
% GREEDY_SEARCH is too cheap to be worth parallelising,
% GREEDY_SEARCH or we have achieved our target CPU utilisation
% GREEDY_SEARCH for all phases of the program's execution.

% GREEDY_SEARCH This is only an outline of our traversal algorithm.
% GREEDY_SEARCH In section \ref{sec:pragmatic}, we will describe it in detail,
% GREEDY_SEARCH together with our solutions to several issues that come up in practice.


% \zoltan{this is wrong: the overheads should be PART OF the parallel time}
% \begin{equation*}
% Speedup = \frac{Time_{Seq}}{Time_{Par} + ParOverheads}
% \end{equation*}

\section{The cost of recursive calls}
\label{sec:overlap_reccalls}

\status{This section is ready for proofreading by Peter S. or Micheal R.}
% leave discussion of granularity estimation by static analysis
% for the related work section;
% mention that this work has not extended to large programs.

The Mercury deep profiler gives us
the costs of all non-recursive call sites in a clique.
For recursive calls,
the costs of the callee are mingled together
with the costs of the caller,
which is either the same procedure as the callee,
or is mutually recursive with it.
Therefore if we want to know the cost of a recursive call site (and we do),
we have to infer this
from the cost of the clique as a whole,
the cost of each call site within the procedures of the clique,
the structures of the bodies of those procedures,
and the frequency of execution of each path through those bodies.

For now, we will restrict our attention to SCCs
that contain only a single procedure and where that procedure matches one of
the three recursion patterns below.
These are among the most commonly used recursion patterns and the
inference processes for them are also among the simplest.
Later, we will discuss how partial support could be added for mutually
recursive procedures.

{\bf Pattern 1: no recursion at all.}
This is not a very interesting pattern, but we support it completely.
We do not need to compute the costs of recursive calls if a procedure is not
recursive.

{\bf Pattern 2: simply recursive procedures.}
The second pattern consists of procedures whose bodies
have just two types of execution path through them:
base cases, and recursive cases containing a single recursive call site.
Our example for this category is \code{map\_foldl},
whose code is shown in Figure~\ref{fig:map_foldl}.

Let us say that of 100 calls to the procedure,
90 were from the recursive call site
and 10 were from a call site in the parent SCC.
Then we would calculate
that each non-recursive call
(from the parent SCC)
would on average yield nine recursive calls
(from within the SCC).
Note that there are actually ten levels of recursion so we add one for the
highest level of recursion (the call from the parent SCC).
We call this the average deepest recursion:

\begin{equation*}
AvgMaxDepth = Calls_{RecCallSites} / Calls_{ParentCallSite} + 1
\end{equation*}

The deepest recursive call site executes only the non-recursive path,
and incurs only its costs ($CostNonRec$).
We measure the costs of calls in \emph{call sequence counts} (csc),
a unit defined in Section~\ref{sec:backgnd_deep}.
The next deepest would take the recursive path,
and incur one copy of the non-recursive call costs along the recursive path
($CostRec$)
plus the cost of the recursive call itself ($1$)
plus the cost of the non-recursive branch ($CostNonRec$).
The third deepest would incur two copies of the costs
of the non-recursive calls along the recursive path,
plus the costs of the two recursive calls ($2$),
plus the cost of the non-recursive branch executed by the deepest call.
The formulas for these recursive call site costs are:

\[
\begin{array}{r @{}l @{}l}
cost(0)~&= CostNonRec \\
cost(1)~&= CostNonRec + CostRec + 1 \\
cost(2)~&= CostNonRec + 2{\times}CostRec + 2
\end{array}
\]

\noindent
By induction,
the cost of a recursive call site at depth $D$ (0 being the deepest)
is:

\begin{equation*}
cost(D) = CostNonRec + D(CostRec + 1)
\end{equation*}
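
\noindent
The inductive step can be made explicit.
As a sketch, assuming that every execution of the recursive path costs the
same ($CostRec$ for its non-recursive calls plus $1$ for the recursive call
itself),
a call at depth $D > 0$ costs exactly that much more than a call at depth
$D - 1$, and the recurrence unrolls as follows:

\[
\begin{array}{r @{}l}
cost(D)~&= CostRec + 1 + cost(D - 1) \\
&= 2(CostRec + 1) + cost(D - 2) \\
&= \ldots \\
&= D(CostRec + 1) + cost(0) \\
&= CostNonRec + D(CostRec + 1)
\end{array}
\]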

We can now calculate the average cost of a call at any level of the
recursion.
Simply recursive procedures have a uniform number of calls at each depth of
the recursion.
The depth representing the typical use of such a procedure is half of
$AvgMaxDepth - 1$.
We subtract 1 as the first level of recursion is not reached by a recursive
call site.
This allows us to calculate the typical cost of a recursive call from this
call site.
In the example above, the typical depth is $(10 - 1)/2 - 1 = 3.5$.
The extra subtraction of 1 is necessary as we start counting depth from
zero.
For example, if \mapfoldl's non-recursive path cost is 10csc,
its recursive path cost is 10,000csc,
and its typical depth is $3.5$,
then its typical cost is $10 + 3.5(10,000 + 1) = 35,013.5$ call sequence counts.
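
Stated generally, the steps above amount to the following,
where $TypicalDepth$ and $TypicalCost$ are names introduced here
only for illustration:

\[
\begin{array}{r @{}l}
TypicalDepth~&= (AvgMaxDepth - 1)/2 - 1 \\
TypicalCost~&= cost(TypicalDepth) = CostNonRec + TypicalDepth(CostRec + 1)
\end{array}
\]

\noindent
With $AvgMaxDepth = 10$, $CostNonRec = 10$ and $CostRec = 10,000$,
this reproduces the typical depth of $3.5$
and the typical cost of $35,013.5$csc computed above.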

\begin{figure}[tb]
\begin{center}
\begin{minipage}[b][1.9in]{0.49\textwidth}
\subfigure[Accumulator quicksort]{%
\label{fig:quicksort_acc}
\begin{tabular}{l}
\code{quicksort([], Acc, Acc).} \\
\code{quicksort([Pivot $|$ Xs], Acc0, Acc) :-} \\
\code{~~~~partition(Pivot, Xs, Lows, Highs),} \\
\code{~~~~quicksort(Lows, Acc0, Acc1),} \\
\code{~~~~quicksort(Highs, [Pivot $|$ Acc1], Acc).} \\
% Add whitespace to shift the table upwards without also moving the caption.
\\
\\
\\
\end{tabular}
}
\hfill
\end{minipage}
\begin{minipage}[b][1.9in]{0.49\textwidth}
\subfigure[Call graph]{%
\includegraphics[width=0.98\textwidth]{pics/call_tree_dc}
\label{fig:quicksort_acc_callgraph}
}
\hfill
\end{minipage}
\end{center}
\vspace{-2ex}
\caption{Accumulator quicksort, definition and call graph}
\end{figure}

{\bf Pattern 3: Divide-and-conquer procedures.}
The third pattern consists of procedures whose bodies
also have just two types of execution path through them:
base cases, and recursive cases containing \emph{two} recursive call sites.
Our example for this category is an accumulator version of \quicksortacc,
whose code is shown in Figure~\ref{fig:quicksort_acc}.

Calculating the recursion depth of \quicksortacc can be more
problematic.
We know that if a good value for \code{Pivot} is chosen \quicksortacc runs
optimally,
dividing the list in half with each recursion.
Figure~\ref{fig:quicksort_acc_callgraph} shows the call graph for such an
invocation of \quicksortacc.
There are 15 nodes in \code{qs}'s call tree,
each node has exactly one call leading to it;
therefore there is one call from outside \quicksortacc's SCC
(the call from \code{main}),
and 14 calls within the SCC
(the calls from \code{qs} nodes).
By inspection, there are four complete levels of recursion.
There are always
$\ceil{\log_2(N+1)}$
%$\log_2(N+1)$
levels in a divide and conquer call graph of $N$ nodes when
the graph is a \emph{complete binary tree}
(we will cover non-complete binary trees later).
Since there are always $N-1$ calls from within the SCC for a graph with
$N$ nodes (that is, $N = C + 1$), it follows that
there are $\ceil{\log_2(C+2)}$ levels for a divide and conquer call graph
with $C$ recursive calls.
In this example, $C$ is 14 and therefore there are four levels of recursion
as we noted above.
If there were two invocations of \quicksortacc from \code{main/2} then there
would be two calls from outside the SCC and 28 calls from within;
in this case the depth is still four.
The average maximum recursion depth in terms of call counts for divide and
conquer code is therefore:

\begin{equation*}
AvgMaxDepth = \log_2
	\left(\frac{Calls_{RecCallSites}}{Calls_{ParentCallSite}} + 2\right)
\end{equation*}

\noindent
This is the estimated depth of the tree so we omit the ceiling operator.
Non-complete binary trees can be divided into two cases:

\begin{description}
    \item[Pathologically bad trees] (trees which resemble sticks)
    are created when consistently worst-case pivots are chosen.

    These are rare and are considered performance bugs;
    programmers will usually want to remove such bugs in order to improve
    sequential execution performance before they attempt to parallelise
    their program.

    \item[Slightly imbalanced trees] are more common;
    such situations fall into the same class as those where the profiling
    data is not quite representative of the optimised program's future.
    In Section~\ref{sec:overlap_approach} we explained that these variations
    are harmless.
\end{description}

\noindent
Therefore, we assume that all divide and conquer code is, on average,
evenly balanced.

As before, the deepest call executes only the non-recursive path,
and incurs only its costs.
The next deepest takes the recursive path;
it incurs the costs of the goals along that path ($CostRec$),
plus the costs of the two calls to the base case ($2$),
plus twice the base case's cost itself ($2{CostNonRec}$).
The third deepest also takes the recursive path,
and also incurs the costs of the two recursive calls' executions of the
recursive path,
that is three times the cost of the recursive path ($3{CostRec}$);
plus the costs of the two recursive calls and the costs of the four calls to
the base case ($6$);
plus four times the base case's cost ($4{CostNonRec}$).

\[
\begin{array}{r @{}l @{}l}
cost(0)~&= CostNonRec \\
cost(1)~&= CostRec + 2 + 2{CostNonRec} \\
cost(2)~&= 3{CostRec} + 6 + 4{CostNonRec} \\
\end{array}
\]

\noindent
The cost of a recursive call in a perfectly balanced divide and conquer
procedure at depth $D$ is:

\begin{equation*}
cost(D) = (2^D-1)(CostRec + 2) + 2^D{CostNonRec}
\end{equation*}
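
\noindent
This closed form can be checked against the pattern above.
As a sketch, assuming a perfectly balanced tree in which each call at depth
$D > 0$ executes the recursive path once and makes two recursive calls at
depth $D - 1$,
the costs satisfy a simple recurrence that can be unrolled:

\[
\begin{array}{r @{}l}
cost(D)~&= CostRec + 2 + 2\,cost(D - 1) \\
&= (1 + 2)(CostRec + 2) + 4\,cost(D - 2) \\
&= \ldots \\
&= (1 + 2 + \cdots + 2^{D-1})(CostRec + 2) + 2^D\,cost(0) \\
&= (2^D - 1)(CostRec + 2) + 2^D{CostNonRec}
\end{array}
\]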

\plan{interesting depth of d\&c}
Most of the execution of a divide and conquer algorithm occurs at deep
recursion levels as there are many more calls made at these levels than higher
levels.
However, for parallelism the high recursion levels are more interesting:
we know that parallelising the top of the algorithm's call graph can provide
ample coarse-grained parallelism.
We will show how to calculate the cost of a recursive call at the top of the
call graph.
First, depth is measured from zero in the equations above, so we must
subtract one from $AvgMaxDepth$.
Second, the recursive calls at the first level call the second level;
to compute the cost of calls to this level we must subtract one again.
Therefore we use the cost formula with $D = AvgMaxDepth - 2$.

\plan{quicksort example}
For example, let us compute the costs of the recursive calls at the top of
\quicksortacc's call graph.
In this example, we gathered data using Mercury's deep profiler on 10
executions of \quicksortacc sorting a list of 32,768 elements.
The profiler reports that there are 655,370 calls into this SCC,
10 of which come from the parent SCC, leaving 655,360 from the two call sites
within \quicksortacc's SCC.
Using the formulas above we find that the AvgMaxDepth is
$\log_{2}(655,360/10 + 2) \approx 16$.
The total per-call cost of the call site to \partition reported by the profiler
is an estimated 35.5csc;
it is the only other goal in either the base case or recursive case with a
non-zero cost.
We wish to compute the costs of the recursive calls at the
15\textsuperscript{th} level so $D$ must be 14.
The cost of these calls is
$(2^{14} - 1)(35.5 + 2) + 2^{14}\times0 \approx 614,363$.
This is the average cost of each of the two recursive calls,
assuming that the list was partitioned evenly.
The deep profiler reports that the total per-call cost of the
\quicksortacc SCC is 1,229,106csc.
The two recursive calls cost 614,363csc each (their sum is 1,228,726csc);
adding the cost of \partition (35.5csc) gives approximately 1,228,762csc.
This is reasonably close to the total cost of \quicksortacc,
especially given other uncertainties.
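As a rough check on how close the two figures are,
the relative difference between the measured cost and the estimated cost
can be worked out directly from the numbers above
(this is plain arithmetic on the reported values, not an additional
measurement):
\[
\frac{1{,}229{,}106 - 1{,}228{,}762}{1{,}229{,}106}
    = \frac{344}{1{,}229{,}106}
    \approx 0.03\%.
\]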

There are two ways in which this calculation can be inaccurate.
First, poorly chosen pivot values create imbalanced call graphs.
We have already discussed how this can affect the calculation of the
recursion depth.
This can also affect the cost calculation in much the same way,
and the same rebuttals apply.
There is one extra concern:
a pivot that is not perfect will result in two sublists of different sizes
and therefore the recursive calls will have different costs rather than the
same cost we computed above.
It is normal to use \emph{parallel slackness} to reduce the impact of
imbalances.
This means creating slightly more finely grained parallelism than is
necessary in order to increase the chance that a processor can find parallel
work.
This works because on average the computed costs will be close enough to the
real costs.

The second cause of inaccurate results comes from our assumption about
\partition's cost.
We assumed that \partition always has the same cost at every level of
the call graph.
However, \partition's cost is directly proportional to the size of its input
list, which is itself directly proportional to the depth in \quicksortacc's call
graph.
This would seem to affect our calculations of the cost of recursive calls
at different levels within the call tree.
However, the rate at which \partition's cost increases with height in the tree
is linear, while at the same time the number of calls to \partition grows as
a power of two with the tree's height.
Therefore as the tree becomes larger the \emph{average} cost of calls to
\partition within it asymptotically approaches a constant number.
So in significantly large trees the varying costs of \partition do not
matter.

%
%In our example we are using lists of integers and integer comparison costs
%1csc.
%The cost of the call to \partition is one (for the call itself) plus two
%times the lenght of the list (for the integer comparison and for the next
%recursive call within \partition):
%
%\begin{equation*}
%part\_cost(L) = 2L + 1 \\
%\end{equation*}
%
%\noindent
%As we move from the leaves to the root of \quicksortacc's call tree,
%the length of the list is double plus one.
%The costs of the calls to \partition in a tree is double the cost of the
%call in one of the calls in a child plus the cost of the call in the root
%node.
%And the number of calls to \partition in a tree is double the number of calls in
%either of its subtrees plus one, which is also $2^D - 1$
%
%\paul{XXX: Can I turn the second equation into an exponential rather than a
%recursive function?}
%\[
%\begin{array}{l @{}l @{}l}
%length(D)              ~&= 2(D - 1) \\
%total\_part\_cost(D)   ~&= part\_cost(length(D)) +
%    \begin{cases}
%        2total\_part\_cost(D - 1) & \text{if}~ D > 1 \\
%        0                         & \text{if}~ otherwise \\
%    \end{cases} \\
%num\_part\_calls(D)    ~&= 2^D - 1 \\
%\end{array}
%\]
%
%\noindent
%As above,
%these are accruate only when \quicksortacc's call graph is a complete binary
%tree.
%But now we can calculate the average cost of the calls to partition in any
%subtree begining at depth $D$:
%
%\begin{equation*}
%avg\_part\_cost(D) = \frac{total\_part\_cost(D)}{num\_part\_calls(D)}
%\end{equation*}
%
%\paul{This seems to approach 2 as D approaches $\inf$.
%If I knew how to remove the recursive definition above I could prove this
%and show that in a big enough tree inaccuracies does not matter.
%Also the computed average cost is way different to the measured cost, which
%is probably due to imperfect pivots, this is more of an issue when computing
%the cost of the recursive calls.
%}


\begin{figure}
\begin{center}
\begin{tabular}{rlrr}
\C{Line} & \C{Code}         & \C{Coverage}  & \C{Cost}  \\
    & \code{p(X, Y, ...) :-}&         100\% &           \\
    & \code{~~~~(}          &               &           \\
    & \code{~~~~~~~~X = a,} &          60\% &         0 \\
    & \code{~~~~~~~~q(...)} &          60\% &     1,000 \\
 5  & \code{~~~~;}          &               &           \\
    & \code{~~~~~~~~X = b,} &          20\% &         0 \\
    & \code{~~~~~~~~r(...)} &          20\% &     2,000 \\
    & \code{~~~~;}          &               &           \\
    & \code{~~~~~~~~X = c,} &          20\% &         0 \\
10  & \code{~~~~~~~~p(...)} &          20\% &           \\
    & \code{~~~~),}         &               &           \\
    & \code{~~~~(}          &               &           \\
    & \code{~~~~~~~~Y = d,} &          90\% &         0 \\
    & \code{~~~~~~~~s(...)} &          90\% &    10,000 \\
15  & \code{~~~~;}          &               &           \\
    & \code{~~~~~~~~Y = e,} &          10\% &         0 \\
    & \code{~~~~~~~~p(...)} &          10\% &           \\
    & \code{~~~~).}         &               &           \\
\end{tabular}
\end{center}
\caption{Two recursive calls and six code paths.}
\label{fig:2_reccalls_4_paths}
\end{figure}

\plan{How we classify recursion type}
We classify recursion types with an algorithm that walks over the structure
of a procedure.
As it traverses the procedure,
it counts the number of recursive calls along each path,
the path's cost, and the number of times the path is executed.
The number of times a path is executed is generated using coverage
profiling, which is described in the next section.
When the algorithm finds a branching structure like an if-then-else or
switch,
it processes each branch independently and then merges the branches'
results at the end of the branching structure.
If several branches have the same number of recursive calls (including zero)
they can be merged.
If several branches have different numbers of recursive calls they are all
added to the result set.
This means that the result of traversing a goal might include data for
several different recursion counts.
Consider the example in Figure~\ref{fig:2_reccalls_4_paths}.
The example code has been annotated with coverage information (in the
third column) and with cost information where it is available (fourth
column).
The conjunction on lines three and four does not contain a recursive call.
The result of processing it is a list containing a single tuple:
\code{[(reccalls: 0, coverage: 60\%, cost: 1,000)]}.
The result for the second switch arm (lines six and seven) is:
\code{[(reccalls: 0, coverage: 20\%, cost: 2,000)]}.
The third conjunction in the same switch (lines nine and ten) contains
a recursive call.
The result of processing it is:
\code{[(reccalls: 1, coverage: 20\%, cost: 0)]}.
When the algorithm is finished processing all the cases in the switch it
adds them together.
When adding tuples, we merge tuples that have the same number of
recursive calls by adding their coverages and averaging their costs, weighted
by coverage (these are per-call costs).
This simplifies multiple code paths with the same number of recursive calls
into a single ``code path''.
The result of processing the switch on lines 2--11 is:

\noindent
\begin{center}
\begin{tabular}{l}
\code{[(reccalls: 0, coverage: 80\%, cost: 1,250),}  \\
\code{~(reccalls: 1, coverage: 20\%, cost: ~~~~0)]}. \\
\end{tabular}
\end{center}

\noindent
In this way, the result of processing a goal represents all the possible
code paths through that goal.
In this case there are three code paths through the switch,
and the result has two entries:
one represents the two base case code paths,
the other represents the single recursive case.
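The merged base case entry can be checked by hand.
Reading the coverage and cost columns of
Figure~\ref{fig:2_reccalls_4_paths} for the first two switch arms,
the combined coverage and the coverage-weighted per-call cost are:
\[
\begin{array}{rcl}
coverage &=& 60\% + 20\% = 80\% \\[4pt]
cost &=& \displaystyle
    \frac{60\% \times 1{,}000 + 20\% \times 2{,}000}{60\% + 20\%}
    = \frac{1{,}000}{0.8} = 1{,}250
\end{array}
\]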

The result of processing the other switch in the example,
lines 12--18, is:

\noindent
\begin{center}
\begin{tabular}{l}
\code{[(reccalls: 0, coverage: 90\%, cost: 10,000),}  \\
\code{~(reccalls: 1, coverage: 10\%, cost: ~~~~~0)]}. \\
\end{tabular}
\end{center}

\noindent
In order to compute the result for the whole procedure, we must compute the
product of these two results;
this computes all the possible paths through the two switches.
We do this by constructing pairs of tuples from the two lists.
Since each list has two entries there are four pairs:

\noindent
\begin{center}
\begin{tabular}{rcl}
\code{[   (rc: 0, cvg: 80\%, cost: ~1,250)} &
    $\times$&
    \code{(rc: 0, cvg: 90\%, cost: 10,000),}
    \\
\code{   ~(rc: 0, cvg: 80\%, cost: ~1,250)} &
    $\times$&
    \code{(rc: 1, cvg: 10\%, cost: ~~~~~0),}
    \\
\code{   ~(rc: 1, cvg: 20\%, cost: ~~~~~0)} &
    $\times$&
    \code{(rc: 0, cvg: 90\%, cost: 10,000),}
    \\
\code{   ~(rc: 1, cvg: 20\%, cost: ~~~~~0)} &
    $\times$&
    \code{(rc: 1, cvg: 10\%, cost: ~~~~~0)]}
    \\
\end{tabular}
\end{center}

\noindent
For each pair we compute a new tuple by adding the number of recursive
calls, multiplying the coverages, and adding the costs.

\noindent
\begin{center}
\begin{tabular}{l}
\code{[   (rc: 0, cvg: 72\%, cost: 11,250),} \\
\code{   ~(rc: 1, cvg: ~8\%, cost: ~1,250),} \\
\code{   ~(rc: 1, cvg: 18\%, cost: 10,000),} \\
\code{   ~(rc: 2, cvg: ~2\%, cost: ~~~~~0)]} \\
\end{tabular}
\end{center}

\noindent
Again, we merge the cases with the same numbers of recursive calls.

\noindent
\begin{center}
\begin{tabular}{l}
\code{[   (rc: 0, cvg: 72\%, cost: 11,250),} \\
\code{   ~(rc: 1, cvg: 26\%, cost: ~7,308),} \\
\code{   ~(rc: 2, cvg: ~2\%, cost: ~~~~~0)]} \\
\end{tabular}
\end{center}
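The arithmetic behind the last two steps can be written out explicitly.
The coverage of each combined path is the product of the coverages of the
pair it was built from,
and the merged entry for one recursive call is again a coverage-weighted
average of the costs:
\[
\begin{array}{rcl}
80\% \times 90\% &=& 72\% \\
80\% \times 10\% &=& ~8\% \\
20\% \times 90\% &=& 18\% \\
20\% \times 10\% &=& ~2\% \\[6pt]
\displaystyle
\frac{8\% \times 1{,}250 + 18\% \times 10{,}000}{8\% + 18\%}
    &=& \displaystyle \frac{1{,}900}{0.26} \approx 7{,}308
\end{array}
\]
The four coverages sum to 100\%, as expected when every path through the
procedure is accounted for.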

\noindent
There are six paths through this procedure,
and two recursive calls.
The number of tuples needed is linear in the number of path types;
in this case we represent all six paths using three tuples.
This allows us to conveniently handle procedures of different forms as
rather simple recursion types, such as the ``simple recursion'' we saw above:
a procedure with two recursive paths with one call each can be handled as
if it has just one recursive path and one base case.

We can determine the type of recursion for any list of recursion path
information.
If there is a single path entry with zero recursive calls
then the procedure is not recursive.
If there is an entry for a path with zero recursive calls
and an entry for a path with one recursive call,
then the procedure is ``simply recursive''.
Finally, if there are two entries, one with zero recursive calls and one with
two recursive calls, then we know that the procedure
uses ``divide and conquer'' recursion.
It is possible to generalise further:
if a procedure has two entries,
one with zero recursive calls and the other with some $N$ recursive calls,
then the recursion pattern is similar to divide and conquer except that the
base in the formulas shown above is $N$ rather than 2.
It has not been necessary to handle these cases.
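These rules can be summarised compactly.
As a sketch only, let $R$ be the set of recursive call counts that appear in
a procedure's merged result list
($R$ and $type$ are notation introduced here purely for illustration;
the category names are those used in Table~\ref{tab:recursion_types},
and the last case is our reading of the ``Multiple recursive paths''
category):
\[
type(R) = \left\{
\begin{array}{ll}
\mbox{not recursive}            & \mbox{if}~ R = \{0\} \\
\mbox{simple recursion}         & \mbox{if}~ R = \{0, 1\} \\
\mbox{divide and conquer}       & \mbox{if}~ R = \{0, 2\} \\
\mbox{multiple recursive paths} & \mbox{if}~ 0 \in R
    ~\mbox{and}~ R ~\mbox{has several non-zero members} \\
\end{array}
\right.
\]
Under this reading, the procedure in
Figure~\ref{fig:2_reccalls_4_paths} has $R = \{0, 1, 2\}$
and so falls into the last category.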

\begin{table}
\begin{center}
\begin{tabular}{ll|rrr}
\multicolumn{2}{c|}{\textbf{Recursion Type}} &
\textbf{No.\ of SCCs} &
\textbf{Percent} &
\textbf{Total cost} \\
\hline
\Lbr{2}{Not recursive}              & 292,893 & 78.07\% & 2,320,270,385 \\
\Lbr{2}{Simple recursion}           &  48,458 & 12.92\% &   402,430,967 \\

\Lbr{2}{Mutual recursion: totals}
                                    &  19,293 &  5.14\% &   198,326,577 \\
~~~~&
 \Lbr{1}{2 procs}                   &   4,066 &  1.08\% &    27,099,504 \\
&\Lbr{1}{3 procs}                   &   9,393 &  2.50\% &    14,846,076 \\
&\Lbr{1}{4 procs}                   &   1,092 &  0.29\% &    12,542,308 \\
&\Lbr{1}{5 procs}                   &   1,035 &  0.28\% &     3,863,295 \\
&\Lbr{1}{6+ procs}                  &   3,707 &  0.99\% &   139,975,394 \\

\Lbr{2}{Unknown}                    &  11,803 &  3.15\% &     8,838,655 \\
\Lbr{2}{Divide and conquer}         &   1,917 &  0.51\% &     5,337,293 \\

\Lbr{2}{Multiple recursive paths: totals}
                                    &   1,089 &  0.28\% &     4,678,467 \\
&\Lbr{1}{rec-branches: 1, 2}        &      44 &  0.01\% &       580,994 \\
&\Lbr{1}{rec-branches: 2, 3}        &     281 &  0.07\% &       189,377 \\
&\Lbr{1}{rec-branches: 2, 3, 4}     &     564 &  0.15\% &     3,902,188 \\
&\Lbr{1}{other}                     &     200 &  0.05\% &         5,908 \\

%\Lbr{2}{Unknown (built-in \& foreign language code)}
%						            &  10,623 &  2.83\% &           364 \\
%\Lbr{2}{Unknown (error)}            &   1,180 &  0.32\% &     8,838,291 \\
\end{tabular}
\end{center}
\caption{Survey of recursion types in an execution of the Mercury compiler}
\label{tab:recursion_types}
\end{table}

\plan{Recursion type survey}
There are many possible types of recursion,
and we wanted to limit our development effort to just those recursion types
that would occur often enough to be important.
Therefore,
we ran our analysis across all the SCCs in the Mercury compiler,
the largest open source Mercury program,
to determine which types of recursion are most common.
Table~\ref{tab:recursion_types} summarises the results.
We can see that most SCCs are not recursive,
and the next biggest group is the simply recursive SCCs,
accounting for nearly 13\% of the profile's SCCs.
If our analysis finds an SCC with more than one procedure,
it counts the number of procedures and marks the whole SCC as
mutually recursive and does no further analysis.
It may be possible to perform further analysis on mutually recursive SCCs,
but we have not found it important to do this yet.
Mutual recursion as a whole accounts for a larger proportion of SCCs
than divide and conquer.
The ``Multiple recursive paths'' recursion types refer to cases where there
are multiple recursive paths through the SCC's procedure with different
numbers of recursive calls plus a base case.
The table row labelled ``Unknown'' refers to cases that our algorithm
could not or does not handle.
This can include builtin code, foreign language code and
procedures that may backtrack because they are either \dnondet or \dmulti.
Note that some procedures that we cannot parallelise,
such as \dsemidet procedures, are still interesting:
such a procedure may be involved in the production or consumption of a
variable that is an argument to the procedure;
understanding this procedure may provide information needed to
decide if a parallelisation in the procedure's caller is profitable or not.

The profiler represents each procedure in the program's source code as a \PS
structure (Section~\ref{sec:backgnd_deep}).
Each procedure may be used multiple times and therefore have multiple \PD
structures appearing in different SCCs.
Each SCC may have a different recursion type depending on how it was called.
For example, calling \code{length/2} on an empty list will execute only the base
case and therefore this \emph{use} of \code{length/2} is non-recursive.

Some of the ``Multiple recursive paths'' recursion type cases are due to the
implementation of the \code{map} ADT in Mercury's standard library,
which uses 2-3-4 trees;
hence almost all of the cases with 2, 3 or 4 recursive calls on a path
are 2-3-4 tree traversals.
These traversals also account for some of the other multiple recursive path
cases,
such as those with 2 or 3 recursive calls on a path.
In many cases these are the same code running on a tree without any 4-nodes.

\begin{figure}
\begin{center}
\subfigure[One loop]{%
\label{fig:mutrec1}
\includegraphics[scale=0.75]{pics/mutrec1}
}
%
\hspace{0.09\textwidth}
%
\subfigure[Two loops]{%
\label{fig:mutrec2}
\includegraphics[scale=0.75]{pics/mutrec2}
}
%
\hspace{0.09\textwidth}
%
\subfigure[Three loops]{%
\label{fig:mutrec3}
\includegraphics[scale=0.75]{pics/mutrec3}
}
\end{center}
\caption{Mutual recursion}
\label{fig:mutrec}
\end{figure}

\plan{Mutual recursion}
It may be possible to handle simple cases of mutual recursion by a process
of hypothetical inlining.
Consider the call graph in Figure~\ref{fig:mutrec1}.
In this example $f$, $g$ and $h$ form an SCC:
$f$ calls $g$, which calls $h$, which calls $f$.
These calls are recursive; they create a loop between all three procedures.
Prior to doing the recursion path analysis we could inline each of these
procedures' representations inside one another as follows:
inline $h$ into $g$ and then inline $g$ into $f$.
This creates a new pseudo-procedure that is equivalent to the loop between the
three procedures.
We can then run the recursion path analysis on this procedure and apply the
results to the three original procedures.
We have not yet needed to implement call site cost analysis for mutually
recursive procedures;
therefore this discussion is a thought experiment and not part of our
analysis tool.

We believe that we can handle some other forms of mutually recursive code.
One example of this is the graph in Figure~\ref{fig:mutrec2},
which has two loops, one mutually recursive loop and one loop within $g$.
We cannot inline $g$ into $f$ because $g$ has a recursive call whose
entry point would disappear after inlining.
But we can duplicate $f$, creating a new $f'$, and rewrite $g$'s call
to $f$ as a call to $f'$.
This allows us to inline $f'$ into $g$ without an issue, and $f$'s call
to $g$ is no longer recursive.

The graph in Figure~\ref{fig:mutrec3} is more complicated;
it has three loops.
In this case we cannot inline $g$ into $f$ because of the loop within $g$,
and we cannot inline $f$ into $g$ because of the loop within $f$.
Simple duplication of either $f$ or $g$ does not help us.

\plan{What is unimplemented}
Our current implementation handles non-recursive and simply recursive
loops within a single procedure.
We have shown how to calculate the cost of recursive calls in divide and
conquer code.
However, generating efficient parallel divide and conquer code is slightly
more complicated,
as granularity control is needed to prevent embarrassingly parallel
workloads (Section~\ref{sec:rts_work_stealing2}).
This is less of a concern with simply recursive code, as the non-recursive call
that the recursive call is parallelised against will usually have a
non-trivial cost at all levels of the recursion.

\section{Deep coverage information}
\label{sec:overlap_coverage}

\status{This section is ready for proofreading by Peter S. or Micheal R.}

\plan{Explain the shallow coverage information.}
In Section~\ref{sec:backgnd_deep} we introduced the deep profiler,
a profiler that can gather \emph{ancestor context} specific profiling data.
We also introduced coverage profiling in Section~\ref{sec:backgnd_coverage},
which can gather the \emph{coverage} of any program point.
Before we started this project the coverage data gathered was not ancestor
context specific.
Coverage data was collected and stored in the profiler's \PS
structures where it was not associated with a specific ancestor context of
a procedure.
Throughout this section we will use the term \emph{deep} to describe data
that is ancestor context specific,
and we will use \emph{static} to describe data that is not ancestor
context specific.
We want to use deep profiling data for automatic parallelisation,
which requires using analyses that in turn require coverage data such as
the calculation of recursive call sites' costs
(Section~\ref{sec:overlap_reccalls}),
and variable use time analysis (Section~\ref{sec:backgnd_var_use_analysis}).
Therefore we have changed the way coverage data is collected and stored;
this section describes our changes.

\begin{figure}
\begin{center}
\begin{tabular}{r|rr|r|l}
\Cbr{\textbf{Line}} &
\multicolumn{2}{c|}{\textbf{Coverage}} & \Cbr{\textbf{Per-call cost}} &
    \C{\textbf{Code}} \\
&
\C{\textbf{Static}} & \Cbr{\textbf{Deep}} &
& \\
\hline
 1 & 7,851 & 2,689 & 3,898,634 & \code{foldl3(P, Xs0, !Acc1, ...) :-} \\
 2 &     - &     - &         - & \code{~~~~(} \\
 3 & 2,142 &     1 &         0 & \code{~~~~~~~~Xs0 = []} \\
 4 &     - &     - &         - & \code{~~~~;} \\
 5 & 5,709 & 2,688 &         0 & \code{~~~~~~~~Xs0 = [X $|$ Xs],} \\
 6 & 5,709 & 2,688 &     1,449 & \code{~~~~~~~~P(X, !Acc1, ...),} \\
 7 & 5,709 & 2,688 & 3,897,182 & \code{~~~~~~~~foldl3(P, Xs, !Acc1, ...)} \\
 8 &     - &     - &         - & \code{~~~~).} \\
%  1 & 7,851 & 3,358 & 2,689 & 3,898,634 & \code{foldl3(P, Xs0, !Acc1, ...) :-} \\
%  2 &     - &     - &     - &         - & \code{~~~~(} \\
%  3 & 2,142 &     0 &     1 &         0 & \code{~~~~~~~~Xs0 = []} \\
%  4 &     - &     - &     - &         - & \code{~~~~;} \\
%  5 & 5,709 &     0 & 2,688 &         0 & \code{~~~~~~~~Xs0 = [X $|$ Xs],} \\
%  6 & 5,709 & 1,259 & 2,688 &     1,449 & \code{~~~~~~~~P(X, !Acc1, ...),} \\
%  7 & 5,709 & 2,097 & 2,688 & 3,897,182 & \code{~~~~~~~~foldl3(P, Xs, !Acc1, ...)} \\
%  8 &     - &     - &     - &         - & \code{~~~~).} \\
\end{tabular}
\end{center}
\caption{Static and deep coverage data for \foldlthree}
\label{fig:static_and_deep_coverage}
\end{figure}

\plan{Show a problem, use an example.}
Our automatic parallelisation tool uses deep profiling data throughout.
However, using deep profiling data and static coverage data creates
inconsistencies that can result in erroneous results.
Figure~\ref{fig:static_and_deep_coverage} shows \listfoldlthree from
Mercury's standard library on the right hand side.
On the immediate left (the fourth column) of each line of code,
we have shown deep profiling data from a profiling run of the Mercury
compiler.
The first cell in this column shows the per-call cost of the
call site in the parent SCC (the total cost of \foldlthree's call graph).
The other cells in this column show the per-call costs of \foldlthree's call
sites;
the recursive call's cost is the cost of the first recursive call
(the call on the top-most recursion level)
as calculated by the algorithms in the previous section.
This recursion level was chosen so that the sum of the costs of calls in
the body of the procedure is approximately equal to the cost of the procedure
from its parent call site.
The second and third columns in this table report static and deep coverage
data respectively.
The deep profiling and coverage data was collected from a single SCC of
\foldlthree\footnote{
    We do not mean that \foldlthree has more than one SCC, and we only
    picked one;
    we mean that \foldlthree is called many times during the compiler's
    execution, each time this creates a new SCC and we have picked one of
    these.}
whose ancestor context is part of the compiler's mode checker;
the last procedure in the call chain, excluding \foldlthree, is
\code{modecheck\_to\_fixpoint/8}.

Using the static coverage data we can see that \foldlthree is called 7,851 times
(line one)
throughout the execution of the compiler;
this includes recursive calls.
There are 5,709 directly recursive calls (the coverage on line seven).
Using these figures we can calculate that the average deepest recursion is
2.66 levels.
% The static cost data provided by the profiler includes the average per-call
% cost for the procedure as a whole on line one,
% and the average per-call cost for the call to P on line six.
% We calculated the average cost of the recursive call (line seven) using the
% methods in the previous section for the depth of 1.66,
% which is the first level down from the top of \foldlthree's call tree
% (the first recursive call).
We can conclude that on average across the whole program,
\foldlthree does not make very deep recursions
(it is called on short lists).
This information is correct when generalised across the program,
but it is incorrect in specific ancestor contexts:
if we tried to analyse this loop in a specific context it would cause
problems.
% \paul{XXX: Why say this?}
% Additionally, coverage data is necessary to accurately compute the cost of
% the recursive call.
% and that the cost of the higher order call is cheap.
For example,
if we use this information with the deep profiling information in the fourth
column we would calculate an incorrect value of 4,464csc for the recursive
call's cost at this top-most level of the call tree.
This is obviously incorrect:
if we add it to the cost of the higher order call, 1,449csc
then the sum is 5,913csc,
which is a far cry from the measured cost of the call to this SCC
(3,898,634csc).
It is easy to understand how this discrepancy could affect a later analysis
such as the variable use time analysis.
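The static figures quoted above can be reproduced from
Figure~\ref{fig:static_and_deep_coverage}.
We assume here, for illustration, that the average recursion depth of a
simply recursive procedure is the ratio of its recursive calls to the calls
it receives from outside its SCC;
this is a reading that is consistent with the quoted depth,
not a restatement of the formulas in the previous section.
\[
\begin{array}{rcl}
7{,}851 - 5{,}709 &=& 2{,}142
    \quad \mbox{(calls from outside the SCC)} \\[4pt]
\displaystyle \frac{5{,}709}{2{,}142} &\approx& 2.665
    \quad \mbox{(average recursion depth)} \\[8pt]
4{,}464 + 1{,}449 &=& 5{,}913
    \quad \mbox{(estimated cost of the procedure body)} \\[4pt]
\displaystyle \frac{3{,}898{,}634}{5{,}913} &\approx& 659
    \quad \mbox{(measured cost over estimated cost)}
\end{array}
\]
The 2,142 calls from outside the SCC match the static coverage of the base
case on line three,
and the estimate is too small by a factor of roughly 660.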

If instead we use coverage information specific to this ancestor context,
then we see that the base case is executed only once and the recursive case
is executed 2,688 times.
This means that the recursion is much deeper,
with a depth of 2,689 levels.
Using this value,
we calculate that the cost of the recursive call at the top-most level of
the recursion is
3,897,182csc.
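The deep figures are consistent with one another,
as a short check using only the values in
Figure~\ref{fig:static_and_deep_coverage} shows:
\[
\begin{array}{rcl}
2{,}688 + 1 &=& 2{,}689
    \quad \mbox{(recursive case plus base case executions)} \\[4pt]
3{,}897{,}182 + 1{,}449 &=& 3{,}898{,}631
    \quad \mbox{(recursive call plus higher order call)}
\end{array}
\]
The second sum is within 3csc of the measured cost of the call to this SCC
(3,898,634csc).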

\plan{Explain how we gather deep coverage info, and why it fixes the
problem.}
We said above that coverage data was stored in \PS structures where it was
not associated with an ancestor context.
The data was represented using two arrays and an integer describing the
arrays' lengths.
Each corresponding pair of slots in the arrays referred to a single coverage
point in the procedure.
The first array gives static information such as the type of coverage point and
its location in the compiler's representation of the procedure.
The second array contains the current value of the coverage point,
which is incremented each time execution reaches the program point.
Each procedure in the program is associated with a single \PS structure
and any number of \PD structures.
Each \PD structure represents a use of a procedure in the program's
call graph (modulo recursive calls).
We moved the array containing the coverage points' execution
counts into the \PD structure.
Since there are more \PD structures than \PS structures in a program's
profile,
this will make the profile larger, consuming more memory and potentially
affecting performance.
It is important to minimise a profiler's impact on performance.
Poor performance both affects the user's experience and
increases the risk of distorting the program's profile.
However, we do not have to worry about distortion of time measured in
call sequence counts.
Time measured in seconds may be distorted slightly, but this is minimal.

\plan{Optimisations \& profiling}
Normally during profiling we disable optimisations that can transform a
program in ways that would make it hard for a programmer to recognise their
program's profile.
For example, inlining one procedure into another might cause the programmer
to see one of their procedures making calls that they
did not write.
However, automatic parallelisation, like other optimisations,
is used to speed up a program;
a programmer will typically use other optimisations in conjunction with
automatic parallelism in order to achieve greater speedups.
The compiler performs parallelisation after most other optimisations,
and therefore it operates on a version of the program that has already had
these optimisations applied.
So that we can generate feedback information that can easily be applied to
the optimised program,
the feedback tool must also operate on an optimised version of the program.
Therefore we must enable optimisations when compiling the program for
profiling for automatic parallelism.

\begin{table}
\begin{center}
\begin{tabular}{l|r|rrr|r|r}
\Cbr{\textbf{Profiling type}} &
\Cbr{\textbf{Time}} &
\C{\textbf{.text}} &
\C{\textbf{.(ro)data}} &
\Cbr{\textbf{.bss}} &
\Cbr{\textbf{Heap growth}} &
\C{\textbf{Profile size}} \\
\hline
\hline
\multicolumn{7}{c}{Without optimisations} \\
\hline
no coverage     & 55.4s & 30,585K & 32,040K & 5,646K & 419,360K & 35,066K \\
static coverage & 56.0s & 34,097K & 33,547K & 6,267K & 419,380K & 36,151K \\
deep coverage   & 57.6s & 34,327K & 33,306K & 5,646K & 476,880K & 42,896K \\
\hline
\hline
\multicolumn{7}{c}{With optimisations} \\
\hline
no profiling    & 10.6s & 12,003K &  3,203K & 5,645K & 186,752K & - \\
no coverage     & 35.9s & 29,599K & 31,218K & 5,646K & 319,184K & 23,465K \\
static coverage & 37.1s & 33,391K & 32,856K & 6,302K & 319,188K & 24,640K \\
deep coverage   & 37.6s & 33,628K & 32,630K & 5,646K & 370,288K & 30,901K \\
\end{tabular}
\end{center}
\caption{Coverage profiling overheads}
\label{tab:coverage_prof_overheads}
\end{table}

\plan{Benchmark coverage profiling.}
Table~\ref{tab:coverage_prof_overheads} shows the overheads of profiling
the Mercury compiler, including coverage profiling.
We compiled the Mercury compiler with seven different sets of options for
profiling.
The two row groups of the table show results with optimisations both
disabled and enabled.
The rows of the table give results for no profiling support\footnote{
    The non-profiling version of the compiler was not compiled or tested with
    optimisations disabled
    as it ignores the \samp{--profile-optimised} flag when profiling is
    disabled.},
profiling support without coverage profiling,
static coverage profiling support,
and deep coverage profiling support.
The second column in the table shows
the user time, averaged over 20 executions, for compiling the eight largest
modules of the compiler itself.
The next three columns show the sizes of different sections in the
executable file:
the size of the executable's \samp{.text} section (the compiled executable
code);
the size of the \samp{.data} and \samp{.rodata} sections (static data);
the size of the \samp{.bss} section (static data that is initially zero).
The sixth column shows
the \emph{heap growth}, the amount by which the program break\footnote{
    The program break marks the end of the heap area.
    See the \code{sbrk(2)} Unix system call.}
moved during the execution.
The final column shows the size of the resulting profiling file.
All sizes are shown in kibibytes (KiB; 1,024 bytes).

We can see that enabling profiling slows the program down by at least a
factor of three,
or a factor of five if optimisations are disabled.
The difference between the results with and without optimisations is
unsurprising.
It is well established that simple optimisations such as inlining can make a
significant difference to performance.
As optimisations make the program simpler, the call graph also becomes
simpler.
For example, inlining reduces the number of procedures in the program,
which means that fewer profiling data structures are needed to represent
them both in the program's memory and in the data file.
This is why the results show that the optimised programs use less memory
and generate smaller profiling data files.
Optimisations can also improve performance in another way:
inlining reduces the number of procedures which reduces the amount of
instrumentation placed in the code by the profiler.
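The slowdown factors quoted above follow from the ``Time'' column of
Table~\ref{tab:coverage_prof_overheads}.
We take the optimised non-profiling build (10.6s) as the baseline in both
cases, since it is the only non-profiling row in the table;
that choice of baseline is our assumption:
\[
\begin{array}{rcl}
\displaystyle \frac{35.9}{10.6} &\approx& 3.4
    \quad \mbox{(profiling without coverage, with optimisations)} \\[8pt]
\displaystyle \frac{55.4}{10.6} &\approx& 5.2
    \quad \mbox{(profiling without coverage, without optimisations)}
\end{array}
\]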

Enabling static coverage profiling does not significantly impact the heap
usage.
The \PS structures and their coverage data are stored in the program's
static data,
the \samp{.data}, \samp{.rodata} and \samp{.bss} sections of the executable;
in particular, the coverage point values themselves are in the \samp{.bss}
section.
Coverage profiling also creates a small increase in the size of the
executable code, as the code now contains some additional instrumentation.
Likewise this instrumentation affects performance very slightly.
The effect is well within our goal of minimising coverage profiling
overheads.
Associating coverage data with \PD structures rather than \PS structures
significantly affects heap usage as \PD structures and the coverage points
therein are all stored on the heap.
This additional amount of heap usage (56MB or 50MB,
without and with optimisations respectively)
is acceptable.
Conversely we see that the amount of statically allocated memory decreases
when using deep coverage profiling.
The size of the \samp{.text} section and the program's execution time
increase only slightly when using deep coverage information.
This additional cost of collecting deep coverage data is very small compared
to the benefit that this data provides,
especially when optimisations are enabled during a profiling build.
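The costs of moving from static to deep coverage can be read off
Table~\ref{tab:coverage_prof_overheads} by subtracting the
``static coverage'' row from the ``deep coverage'' row in each group.
The following is plain arithmetic on the table's values,
with 1MB taken to be 1,024K:
\[
\begin{array}{lrcl}
\mbox{Heap growth, unoptimised:}
    & 476{,}880 - 419{,}380 &=& 57{,}500\mbox{K} \approx 56\mbox{MB} \\
\mbox{Heap growth, optimised:}
    & 370{,}288 - 319{,}188 &=& 51{,}100\mbox{K} \approx 50\mbox{MB} \\[4pt]
\mbox{Time, unoptimised:}
    & 57.6 - 56.0 &=& 1.6\mbox{s} \approx 2.9\% \\
\mbox{Time, optimised:}
    & 37.6 - 37.1 &=& 0.5\mbox{s} \approx 1.3\% \\[4pt]
\mbox{\samp{.text}, unoptimised:}
    & 34{,}327 - 34{,}097 &=& 230\mbox{K} \\
\mbox{\samp{.text}, optimised:}
    & 33{,}628 - 33{,}391 &=& 237\mbox{K}
\end{array}
\]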


\section{Calculating the overlap between dependent conjuncts}
\label{sec:overlap_overlap_alg}

\status{Draft finished, This is ready for review by Zoltan.}

The previous two sections provide the methods necessary for
estimating a candidate parallelisation's performance,
which is the topic of this section.
As stated earlier,
dependencies between conjuncts affect the amount of parallelism
available.
Our goal is to determine if a dependent parallel conjunction exposes enough
parallelism to be profitable.
We show this as the amount that the boxes overlap
in diagrams such as Figure~\ref{fig:overlap_compare};
this \emph{overlap} specifically represents the amount of elapsed time that can be
saved by parallel execution.
Estimating this overlap
in the parallel executions of two dependent conjuncts
requires knowing, for each of the variables they share,
when that variable is produced by the first conjunct and
when it is first consumed by the second conjunct.
The two algorithms for calculating estimates of when variables are produced and
consumed
(variable use times)
are described in Section~\ref{sec:backgnd_var_use_analysis}.
The implementations of both algorithms have been updated since my honours
project;
the algorithms now make use of the deep coverage information described in
the previous section.

Suppose a candidate parallel conjunction has two conjuncts $p$ and $q$,
and their execution times in the original, sequential conjunction ($p, q$),
are ${SeqTime}_p$ and ${SeqTime}_q$, which are both provided by the profiler.
The parallel execution time of $p$ in $p \& q$, namely $ParTime_p$,
is the same as its sequential execution time
if we ignore overheads,
which we do for now but will come back to later.
The parallel execution time of $q$, namely $ParTime_q$, may be
different from its sequential execution time, as $q$ may block on futures
that $p$ has not yet signalled.
In Section~\ref{sec:backgnd_priorautopar} we showed how to calculate
$ParTime_q$ in cases where there is a single shared variable between $p$ and
$q$.
Then we use $ParTime_p$ and $ParTime_q$ to calculate the speedup due to
parallelism.
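
As a rough guide,
and assuming the usual definition of speedup as sequential time divided by
parallel time
(the symbol $Speedup$ is introduced here only for illustration),
this calculation can be sketched, ignoring overheads, as:

\begin{displaymath}
Speedup = \frac{SeqTime_p + SeqTime_q}{\max(ParTime_p,~ParTime_q)}
\end{displaymath}

\noindent
A value above one suggests that parallel execution saves elapsed time.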

\begin{algorithm}[tbp]
\begin{algorithmic}[1]
\Procedure{overlap\_simple}{$SeqTime_q$, $VarUsesList_q$, $ProdTimeMap_p$}
\State $CurSeqTime \gets 0$
\State $CurParTime \gets 0$
\State sort $VarUsesList_q$ on $ConsTime_{q, i}$
\For{$(Var_i, ConsTime_{q, i}) \in VarUsesList_q$}
    \State $Duration_{q, i} \gets ConsTime_{q, i} - CurSeqTime$
    \State $CurSeqTime \gets CurSeqTime + Duration_{q, i}$
    \State $ParWantTime_{q, i} \gets CurParTime + Duration_{q, i}$
    \State $CurParTime \gets$ max($ParWantTime_{q, i}, ProdTimeMap_{p}[Var_i]$)
\EndFor
\State $DurationRest_q \gets SeqTime_q - CurSeqTime$
%\State $SeqTime_q \gets CurSeqTime_q + DurationRest_q$
\State $ParTime_q \gets CurParTime + DurationRest_q$
\State \Return $ParTime_q$
\EndProcedure
\end{algorithmic}
\caption{Dependent parallel conjunction algorithm, for exactly two conjuncts}
\label{alg:dep_par_conj_overlap_simple}
\end{algorithm}

When there are multiple shared variables we must use a different algorithm
such as the one shown in Algorithm~\ref{alg:dep_par_conj_overlap_simple},
which is a part of the full algorithm, shown in
Algorithm~\ref{alg:dep_par_conj_overlap_complete}.
It calculates $ParTime_q$ from its arguments $SeqTime_q$, $VarUsesList_q$ and
$ProdTimeMap_p$.
$VarUsesList_q$ is a list of tuples,
with each tuple being a shared variable and the time at which $q$ would
consume the variable during sequential execution.
$ProdTimeMap_p$ is a map (sometimes called a dictionary)
that maps each shared variable to the time at which $p$ would produce the
variable during parallel execution.

\picfigure{overlap2-swap}{Overlap with multiple variables}

The algorithm works by dividing ${SeqTime}_q$ into chunks
and processing them in order;
it keeps track of the sequential and parallel execution times of the
chunks so far.
The end of each chunk, except the last, is defined by $q$'s consumption of a
shared variable;
the last chunk ends at the end of $q$'s execution.
This way the $i$\textsuperscript{th} chunk represents the time that $q$ spends
after consuming the previous shared variable (or after starting its execution)
and before it is likely to consume the $i$\textsuperscript{th} shared variable;
these chunks' durations are $Duration_{q, i}$ (line 6).
The last chunk represents the time after
consuming the last shared variable but before $q$ finishes:
$q$ may use this time to produce any variables that are not shared with $p$
but are consumed by code after the parallel conjunction.
We call this chunk's duration $DurationRest_q$ (line 10).

Figure~\ref{fig:overlap2-swap} shows the overlap of a parallel conjunction
($p~\&~q$) with two shared variables \var{A} and \var{B}.
The figure is to scale and the sequential execution times of $p$ and $q$ are
5 and 4 respectively.
$q$ is broken up into the chunks $qA$, $qB$ and $qR$.
The algorithm keeps track of the sequential and parallel execution times of $q$
up to the consumption of the current shared variable.
During sequential execution,
each chunk can execute immediately after the previous chunk,
since the values of the shared variables are all available when $q$ starts.
During parallel execution,
$p$ is producing the shared variables while $q$ is running.
If $q$ tries to use a variable (by calling \wait on the variable's future)
and the variable has not been produced yet then $q$ will block.
We can see this for \var{A}:
$ParWantTime_{q, 0}$ is 2 but $ProdTimeMap_p[\text{\var{A}}]$ is 4,
so during the first execution of
Algorithm~\ref{alg:dep_par_conj_overlap_simple}'s loop,
line 9 sets $CurParTime$ to 4,
which is when $q$ may resume execution.
It is also possible that $p$ will produce the variable before it is
needed, as is the case for \var{B}.
In cases such as this, $q$ will not be suspended.
In \var{B}'s case,
when the loop starts its second iteration $CurSeqTime$ is 2,
$CurParTime$ is 4, $ConsTime_{q, 1}$ is 3.5 and
$ProdTimeMap_p[\text{\var{B}}]$ is 1;
the algorithm calculates the duration of this second chunk as 1.5,
and $ParWantTime_{q, 1}$ as 5.5.
Line 9 sets $CurParTime$ to 5.5 which is the value of $ParWantTime_{q, 1}$
as \var{B}'s value was already available.
After the loop terminates the algorithm calculates the length of the final
chunk ($DurationRest_q$) as 0.5,
and adds it to $CurParTime$ to get the parallel execution time of $q$: 6.
Since $p$ finishes at time 5,
this is also the parallel execution time of the conjunction.
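
The trace below restates this example step by step.
It uses only the values quoted above for Figure~\ref{fig:overlap2-swap},
and assumes $ParTime_p = SeqTime_p = 5$ since overheads are ignored here.

\begin{displaymath}
\begin{array}{lrrrrr}
\text{Chunk} & Duration & CurSeqTime & ParWantTime & ProdTimeMap_p & CurParTime \\
\hline
qA & 2   & 2   & 2   & 4 & 4   \\
qB & 1.5 & 3.5 & 5.5 & 1 & 5.5 \\
qR & 0.5 & 4   &     &   & 6
\end{array}
\end{displaymath}

\noindent
The conjunction therefore finishes at $\max(5, 6) = 6$,
compared with $5 + 4 = 9$ for sequential execution,
so the estimated overlap is 3 time units.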

\picfigure{overlap3}{Overlap of more than two conjuncts}

When there are more than two conjuncts in the parallel conjunction our
algorithm requires an outer loop.
An example of such a parallel conjunction is shown in Figure~\ref{fig:overlap3}.
We can see that the third conjunct's execution depends on the second's,
which depends upon the first's.
Dependencies can only exist in the left to right direction:
Mercury requires all dependencies to be left to right;
it reorders conjuncts to remove right to left dependencies and reports
errors for cyclic dependencies.
The new algorithm therefore iterates over the conjuncts from left to right,
processing each conjunct with a loop similar to the one in
Algorithm~\ref{alg:dep_par_conj_overlap_simple}.

\begin{algorithm}[tbp]
\begin{algorithmic}[1]
\Procedure{find\_par\_time}{$Conjs$}
    \State $ProdTimeMap \gets empty$
    \State $TotalParTime \gets 0$
    \For{$Conj_i \in Conjs$}
        \State $CurSeqTime_i \gets 0$
        \State $CurParTime_i \gets 0$
        \State $VarUsesList_i \gets$ get\_variable\_uses($Conj_i$)
        \State sort $VarUsesList_i$ by time
        \For{$(Var_{i, j}, Time_{i, j}) \in VarUsesList_i$}
            \State $Duration_{i, j} \gets Time_{i, j} - CurSeqTime_i$
            \State $CurSeqTime_i \gets CurSeqTime_i + Duration_{i, j}$
            \If{$Conj_i$ produces $Var_{i, j}$}
                \State $CurParTime_i \gets CurParTime_i + Duration_{i, j}$
                \State $ProdTimeMap$[$Var_{i, j}$]~$ \gets CurParTime_i$
            \Else
                \Comment $Conj_i$ must consume $Var_{i, j}$
                \State $ParWantTime_{i, j} \gets CurParTime_i + Duration_{i, j}$
                \State $CurParTime_i \gets$
                    max($ParWantTime_{i, j}, ProdTimeMap$[$Var_{i, j}$])
            \EndIf
        \EndFor
        \State $DurationRest_i \gets SeqTime_i - CurSeqTime_i$
        \State $CurParTime_i \gets CurParTime_i + DurationRest_i$
        \State $TotalParTime \gets$ max($TotalParTime, CurParTime_i$)
    \EndFor
    \State \Return $TotalParTime$
\EndProcedure
\end{algorithmic}
\caption{Dependent parallel conjunction algorithm}
\label{alg:dep_par_conj_overlap_middle}
%\vspace{-2\baselineskip}
\end{algorithm}

Algorithm~\ref{alg:dep_par_conj_overlap_middle} shows a more complete
version of our algorithm;
this version handles multiple conjuncts and variables,
but does not account for overheads.
% assuming an unlimited number of CPUs.
The input of the algorithm is $Conjs$,
the conjuncts themselves.
The algorithm returns the parallel execution time of the conjunction as a
whole.
The algorithm processes the conjuncts in an outer loop on lines 4--20.
Within each iteration of this loop,
it processes all but the last chunk of the current conjunct
in the inner loop on lines 9--17.
This inner loop is based on the previous algorithm;
the difference is that it must now also act on variables being produced by
the current conjunct.
$VarUsesList$ now contains both variable consumptions and productions,
so the inner loop creates chunks from variable productions as well.
The if-then-else on lines 12--17 handles the different producing and consuming
cases.
When the current conjunct
produces a variable, we simply record when this happens in $ProdTimeMap$,
which is shared by all conjuncts for variable productions.
The map must be built in this way, at the same time as we iterate over the
conjuncts' chunks,
so that it reflects how a delay in the execution of one task will usually
affect when variables may be consumed by another task.
We can see this in Figure~\ref{fig:overlap3}:
$r$ blocks for longer on \var{B} than it normally would because $q$ blocks
on \var{A}.
Note that $Var_{i, j}$
will always be in $ProdTimeMap$ when we look for it,
because it must have been produced by an earlier conjunct
and therefore processed by an earlier iteration of the loop over the
conjuncts.
The only new code outside the inner loop is the use of the $TotalParTime$
variable,
the total execution time of the parallel conjunction;
it is calculated as the maximum of all the conjuncts' parallel execution
times.
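
To illustrate how such a delay propagates,
consider a small hypothetical conjunction $p~\&~q~\&~r$;
the numbers below are invented for illustration and are not taken from
Figure~\ref{fig:overlap3}.
Suppose the sequential execution times of $p$, $q$ and $r$ are 4, 3 and 2,
that $p$ produces \var{A} at time 3,
that $q$ consumes \var{A} at time 1 and produces \var{B} at time 2,
and that $r$ consumes \var{B} at time 1.
Tracing Algorithm~\ref{alg:dep_par_conj_overlap_middle} by hand gives:

\begin{displaymath}
\begin{array}{llrrr}
\text{Conjunct} & \text{Event} & Duration_{i, j} & ParWantTime_{i, j} & CurParTime_i \\
\hline
p & \text{produce \var{A}} & 3 &   & 3 \\
p & \text{rest}            & 1 &   & 4 \\
q & \text{consume \var{A}} & 1 & 1 & \max(1, 3) = 3 \\
q & \text{produce \var{B}} & 1 &   & 4 \\
q & \text{rest}            & 1 &   & 5 \\
r & \text{consume \var{B}} & 1 & 1 & \max(1, 4) = 4 \\
r & \text{rest}            & 1 &   & 5
\end{array}
\end{displaymath}

\noindent
$TotalParTime$ is therefore $\max(4, 5, 5) = 5$.
Had $q$ not blocked on \var{A},
it would have produced \var{B} at time 2 rather than 4,
and $r$ would have finished at time 3.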

%% COSTS
%The version of this algorithm we have actually implemented is
%a bit longer than the one in algorithm \ref{alg:dep_par_conj_overlap_middle},
%because it also accounts for several forms of overhead:
%
%\begin{itemize}
%\item
%Creating a spark and adding it to a work queue has a cost.
%Every conjunct but the last conjunct incurs this cost
%to create the spark for the rest of the conjunction.
%\item
%It takes some time to take a spark off a spark queue,
%create or reuse a context for it, and start its execution.
%Every parallel conjunct that is not the first incurs this delay
%before it starts running.
%\item
%The signal and wait operations on shared variables' futures have a cost.
%\item
%It takes some time to wake up a context that was waiting on a future.
%\item
%It takes time for each conjunct to synchronise on the barrier
%when it has finished its job.
%\end{itemize}
%
%\noindent
%We can account for every one of these overheads
%by adding the estimated cost of the relevant operation to $CurParTime$
%at the correct point in the algorithm.

In many cases,
the conjunction given to Algorithm~\ref{alg:dep_par_conj_overlap_middle}
will contain a recursive call.
In these cases,
the algorithm uses the recursive call's cost at its average recursion depth
in the sequential execution data gathered by the profiler.
This is naive because it assumes that the recursive call
calls the \emph{original, sequential} version of the procedure.
However, the call is recursive, and so the parallelised procedure calls itself:
the \emph{transformed parallel} procedure, whose cost at its average recursion
depth is going to be different from the sequential version's.
When the recursive call calls the parallelised version,
%we can expect a similar saving
there may be a similar saving 
(absolute time, not ratio)
on \emph{every} recursive invocation,
provided that there are enough free CPUs.
How this affects the expected speedup of the top level call
depends on the structure of the recursion.

It should be possible to estimate the parallel execution time of the top level
call into the recursive procedure,
including the parallelism created at each level of the recursion,
provided that
the recursion pattern is one that is understood by the algorithms in
Section~\ref{sec:overlap_reccalls}.
Before we implemented this it was more practical to improve the efficiency of
recursive code
(Chapter~\ref{chap:loop_control}).
We have not yet returned to this problem;
see Section~\ref{sec:conc_further_work}.
Nevertheless,
our current approach handles non-recursive cases correctly,
which are the majority (78\%) of all cases;
it handles a further 13\% of cases (single recursion) reasonably well
(Section~\ref{sec:overlap_reccalls}).
Note that even better results for singly recursive procedures can be
achieved because of the work in Chapter~\ref{chap:loop_control}.

So far, we have assumed an unlimited number of CPUs,
which of course is unrealistic.
If the machine has e.g.\ four CPUs,
then the prediction of any speedup higher than four is obviously invalid.
Less obviously,
even a predicted overall speedup of less than four may depend
on more than four conjuncts executing all at once at \emph{some} point.
We have not found this to be a problem yet.
If and when we do,
we intend to extend our algorithm to keep track
of the number of active conjuncts in all time periods.
Then if a chunk of a conjunct wants to run in a time period
when all CPUs are predicted to be already busy,
we assume that the start of that chunk is delayed until a CPU becomes free.

% \paul{XXX: Not implemented.  I think I vaguly remember this}
% The limited number of CPUs also means that
% there is a limit to how much parallelism we actually \emph{want}.
% The spawning off of every conjunct incurs overhead,
% but these overheads do not buy us anything if all CPUs are already busy.
% % If the machine has e.g.\ four CPUs,
% % then we do not actually want to spawn off
% % hundreds of iterations for parallel execution,
% % since parallel execution actually has several forms of overhead:
% That is why our system supports \emph{throttling}.
% If a conjunction being parallelised contains a recursive call,
% then the compiler can be asked to replace the original sequential conjunction
% not with the parallel form of the conjunction,
% but with an if-then-else.
% The condition of this if-then-else
% will test at runtime
% whether spawning off a new job is a good idea or not.
% If it is, we execute the parallelised conjunction, but
% if it is not, we execute the original sequential conjunction.
% The condition is obviously a heuristic.
% If the heuristic allows the list of runnable jobs to become empty,
% then we will not have any work to give to a CPU
% that finishes its task and becomes available.
% On the other hand,
% if the heuristic allows the list of runnable jobs to become too long,
% then we incur the overheads of spawning off some jobs unnecessarily.
% Currently, on machines with $N$ CPUs,
% we prefer to have a total of $M$ running and runnable jobs where $M > N$,
% so our heuristic stops spawning attempts
% if and only if the queue already has $M$ entries.
% Our current system by default sets $M$ to be $32$ for $N = 4$,
% though users can easily override this.

%The overheads of parallel execution can also affect conjunctions
%that do not contain recursive calls:
%a conjunction that looks worth parallelising if you ignore overheads
%may look not worth parallelising if you take them into account.
%This is why our system actually uses
%a version of algorithm~\ref{alg:dep_par_conj_overlap_middle}
%that accounts for overheads.

% We use \code{TotalParTime} to keep track of the ending time
% of the parallel conjunct that ends last.
% We also remember, in \code{FirstConjTime},
% the time at which the first conjunct finishes.
% The reason we do this is because
% our runtime system requires that
% when the parallel conjunction finishes,
% execution must continue in the context
% that entered the parallel conjunction in the first place.
% In our implementation, this context will execute the first conjunct.
% If the last conjunct to finish is the first conjunct,
% it can continue on without delay;
% if the last conjunct to finish is some other conjunct,
% then we need to free its context,
% and switch to executing the original context,
% which became idle when the first conjunct finished.
% The last two lines reflect this cost.

\begin{table}
\begin{center}
\begin{tabular}{l|rlr}
 & \C{\textbf{Cost}}
 & \multicolumn{2}{c}{\textbf{Local use of \code{Acc1}}} \\
\hline
\code{M}  &   1,625,050 &             & none \\
\code{F}  &           3 & production  & 3 \\
\mapfoldl &   1,625,054 & consumption & 1,625,051 \\
% Note: The cost of the recursive call assumes that there is one
% recursive case and one base case remaining in the recursion.
\end{tabular}
\end{center}
\caption{Profiling data for \mapfoldl}
\label{tab:prof_data_map_foldl}
\end{table}

% \begin{figure}[tb]
% \begin{verbatim}
% map_foldl_par(_, _, [], Acc, Acc).
% map_foldl_par(M, F, [X | Xs], Acc0, Acc) :-
%     (
%         M(X, Y),
%         F(Y, Acc0, Acc1)
%     ) &
%     map_foldl_par(M, Xs, Acc1, Acc).
% \end{verbatim}
% \caption{Parallel \mapfoldl}
% % the recursive call is less dependent
% % on the conjunction of the first two calls.
% \label{fig:map_foldl_par}
% \end{figure}

To see how the algorithm works on realistic data,
consider the \mapfoldl example from Figure~\ref{fig:map_foldl}.
Table~\ref{tab:prof_data_map_foldl} gives
the costs,
rounded to integers,
of the calls in the recursive clause of \mapfoldl
when used in a Mandelbrot image generator
(as in Section~\ref{sec:overlap_perf}).
Each call to \code{M} draws a row,
while \code{F} appends the new row
onto the cord of the rows already drawn.
The cord is an instance of a data structure that prescribes a sequence
(like a list) but whose append operation runs in constant time.
The table also shows when \code{F} produces \var{Acc1}
and when the recursive call consumes \var{Acc1}.
The costs were collected from a real execution using Mercury's deep profiler
and then rounded to make mental arithmetic easier.

Figure~\ref{fig:map_foldl_par} shows the best parallelisation of
\mapfoldl.
When evaluating the speedup for this parallelisation,
the production time for \var{Acc1} in the first conjunct
(\code{M(X, Y), F(Y, Acc0, Acc1)})
is
$1{,}625{,}050 + 3 = 1{,}625{,}053$, and
the consumption time for \var{Acc1} in the recursive call,
\code{map\_foldl(M, F, Xs, Acc1, Acc)},
is $1{,}625{,}051$.
In this example we are assuming that the recursive call calls a sequential
version of the code and that the recursive case will be executed,
and in turn the recursive case calls the base case;
we revisit this assumption later.
The first iteration of
Algorithm~\ref{alg:dep_par_conj_overlap_middle}'s outer loop processes the
first conjunct,
breaking it into two chunks separated by the production of \var{Acc1},
which is added to $ProdTimeMap$.
During the second iteration the recursive call is also broken into two
chunks, which are separated by the consumption of \var{Acc1}.
However, inside the recursive call \var{Acc1} is called \var{Acc0},
and so it is the consumption of \var{Acc0} within the recursive call that
is actually considered.
This chunk is a long chunk (1,625,051~csc).
$ParWantTime$ will be 1,625,051 (above) and $ProdTimeMap[\text{\var{Acc1}}]$
will be 1,625,053 (above).
The execution of the recursive call is likely to block
%(these values are estimates)
but only very briefly:
$CurParTime_i$ will be set to 1,625,053 and the second chunk's duration will
be very small, only $1{,}625{,}054 - 1{,}625{,}051 = 3$.
This is added to $CurParTime_i$ to determine the parallel execution time of
the recursive call (1,625,056),
which is also the larger of the two conjuncts' total parallel
execution times,
making it the parallel execution time of the conjunction as a whole.
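
Using the figures above,
and assuming (as this example does) that overheads are ignored and that
speedup is sequential time divided by parallel time,
the predicted speedup can be checked by hand;
the names $SeqTime$ and $ParTime$ here stand for the times of the whole
conjunction:

\begin{displaymath}
\begin{array}{rcl}
SeqTime & = & 1{,}625{,}050 + 3 + 1{,}625{,}054 = 3{,}250{,}107 \\
ParTime & = & \max(1{,}625{,}053,~1{,}625{,}056) = 1{,}625{,}056 \\
SeqTime / ParTime & \approx & 2.0
\end{array}
\end{displaymath}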

If there are many conjuncts in the parallel conjunction or if the parallel
conjunction contains a recursive call,
we can create more parallelism than the machine can handle.
If the machine has e.g.\ four CPUs,
then we do not actually want to spawn off
hundreds of iterations for parallel execution,
since parallel execution has several forms of overhead.
We classify each form of overhead into one of two groups, costs and delays.
Costs are time spent \emph{doing} something,
e.g.\ the current context must do something such as spawn off another
computation or read a future.
Delays represent time spent \emph{waiting} for something;
during this time the context is suspended (or being suspended or being woken
up).

\begin{description}
\item[$SparkCost$]
is the cost of creating a spark and adding it to the local spark stack.
In a parallel conjunction,
every conjunct that is not the last conjunct incurs this cost
to create the spark for the rest of the conjunction.
\item[$SparkDelay$]
is the estimated length of time between the creation of a spark
and the beginning of its execution on another engine.
Every parallel conjunct that is not the first incurs this delay
before it starts running.
\item[$SignalCost$]
is the cost of signalling a future.
\item[$WaitCost$]
is the cost of waiting on a future.
\item[$ContextWakeupDelay$]
is the estimated time that it takes for a context to resume execution
after being placed on the runnable queue,
assuming that the queue is empty and there is an idle engine.
This can occur in two places:
at the end of a parallel conjunction the original context may need to
be resumed to continue its execution,
and a context that is blocked on a future will need to be resumed
once that future is signalled.
\item[$BarrierCost$]
is the cost of executing the operation
that synchronises all the conjuncts at the barrier
at the end of the conjunction.
\end{description}

%Because of these overheads, our system uses \emph{throttling}.
%If a conjunction being parallelised contains a recursive call,
%then the compiler will replace the original sequential conjunction
%not with the parallel form of the conjunction,
%but with an if-then-else.
%The condition of this if-then-else
%will test at runtime
%whether spawning off a new job is a good idea or not.
%If it is, we execute the parallelised conjunction,
%if it is not, we execute the original sequential conjunction.
%The condition is obviously a heuristic.
%If the heuristic allows the list of runnable jobs to become empty,
%then we will not have any work to give to a CPU
%that finishes its task and becomes available.
%On the other hand,
%if the heuristic allows the list of runnable jobs to become too long,
%then we incur the overheads of spawning off some jobs unnecessarily.
%Currently, on machines with $N$ CPUs,
%we prefer to have a total of $M$ running and runnable jobs where $M > N$,
%so our heuristic stops spawning attempts
%if and only if the queue already has $M$ entries.
%Our current system by default sets $M$ to be $32$ for $N = 4$,
%though users can easily override this.

Our runtime system is now quite efficient, so these
costs are kept low (Chapter~\ref{chap:rts}).
This does not eliminate the overheads, and cannot ever hope to;
instead we wish to parallelise only those conjunctions that are still
profitable despite overheads.
This is why our system actually uses
Algorithm~\ref{alg:dep_par_conj_overlap_complete},
a version of Algorithm~\ref{alg:dep_par_conj_overlap_middle}
that accounts for overheads.

\begin{algorithm}[tbp]
\begin{algorithmic}[1]
\Procedure{find\_par\_time}{$Conjs$, $SeqTimes$}
    \State $N \gets$ length($Conjs$)
    \State $ProdTimeMap \gets$ empty
    \State $FirstConjTime \gets 0$
    \State $TotalParTime \gets 0$
    \For{$i \gets 0$ to $N-1$}
        \State $Conj_i \gets Conjs[i]$
        \State $CurSeqTime_i \gets 0$
        \State $CurParTime_i \gets (SparkCost + SparkDelay) \times i$
        \State $VarUsesList_i \gets$ get\_variable\_uses($Conj_i$)
        \State sort $VarUsesList_i$ by time
        \If{$i \neq N - 1$}
            \State $CurParTime_i \gets CurParTime_i + SparkCost$
        \EndIf
        \For{$(Var_{i, j}, Time_{i, j}) \in VarUsesList_i$}
            \State $Duration_{i, j} \gets Time_{i, j} - CurSeqTime_i$
            \State $CurSeqTime_i \gets CurSeqTime_i + Duration_{i, j}$
            \If{$Conj_i$ produces $Var_{i, j}$}
                \State $CurParTime_i \gets
                    CurParTime_i + Duration_{i, j} + SignalCost$
                \State $ProdTimeMap$[$Var_{i, j}$]$ \gets CurParTime_i$
            \Else
                \Comment $Conj_i$ must consume $Var_{i, j}$
                \State $ParWantTime_{i, j} \gets CurParTime_i + Duration_{i, j}$
                \State $CurParTime_i \gets$
                    max($ParWantTime_{i, j}, ProdTimeMap$[$Var_{i, j}$]$) + WaitCost$
                \If{$ParWantTime_{i, j} < ProdTimeMap$[$Var_{i, j}$]}
                    \State $CurParTime_i \gets CurParTime_i + ContextWakeupDelay$
                \EndIf
            \EndIf
        \EndFor
        \State $DurationRest_i \gets SeqTime_i - CurSeqTime_i$
        \State $CurParTime_i \gets CurParTime_i + DurationRest_i + BarrierCost$
        \If{$i = 0$}
            \State $FirstConjTime \gets CurParTime_i$
        \EndIf
        \State $TotalParTime \gets$ max($TotalParTime, CurParTime_i$)
    \EndFor
    \If{$TotalParTime > FirstConjTime$}
        \State $TotalParTime \gets TotalParTime + ContextWakeupDelay$
    \EndIf
    \State \Return $TotalParTime$
\EndProcedure
\end{algorithmic}
\caption{Dependent parallel conjunction algorithm with overheads}
\label{alg:dep_par_conj_overlap_complete}
\end{algorithm}

Algorithm~\ref{alg:dep_par_conj_overlap_complete} handles each of the
overheads listed above.
The first two overheads that this algorithm accounts for are
$SparkCost$ and $SparkDelay$ on lines 9 and 12--13.
Each conjunct is started by a spark created by the previous conjunct,
except the first which is executed directly and therefore has no delay.
In a conjunction $G_0~\&~G_1~\&~G_2$,
the first conjunct ($G_0$) is executed directly and spends
$SparkCost$ time creating the spark for $G_1~\&~G_2$.
This spark is not executed until a further $SparkDelay$ time has passed.
Therefore on line 9, when we process $G_1$ and $i = 1$ we add the time
$SparkCost + SparkDelay$ to the parallel execution time so far.
$G_1$ will then create the spark for $G_2$, which costs $G_1$ a further
$SparkCost$ time (lines 12--13).
The third conjunct ($G_2$) must wait for the first conjunct to spark the
second and for the second to spark the third,
each of which takes $SparkCost$ followed by $SparkDelay$,
so line 9 adds $2\times(SparkDelay + SparkCost)$ to the current parallel
execution time.
$G_2$ does not need to create any other sparks, so the cost on lines 12--13
is not applied.
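
Following the description above,
the value of $CurParTime_i$ for each conjunct of $G_0~\&~G_1~\&~G_2$,
after lines 9 and 12--13 but before any chunk is processed,
can be written out as:

\begin{displaymath}
\begin{array}{rcl}
CurParTime_0 & = & SparkCost \\
CurParTime_1 & = & (SparkCost + SparkDelay) + SparkCost \\
CurParTime_2 & = & 2 \times (SparkCost + SparkDelay)
\end{array}
\end{displaymath}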

The next overhead is on line 18, the cost of signalling a future.
It is applied each time a future is signalled and the result of applying it
is factored into $ProdTimeMap$ along with all other costs,
so that its indirect effects on later conjuncts are accounted for.
A related pair of overheads is accounted for on lines 22--24:
$WaitCost$ is the cost of a \wait operation and is paid any time any context
waits on a future.
$ContextWakeupDelay$ is also used here;
if a context is likely to suspend
because of a future then it will take some time between getting signalled and
waking up.
Line 26 of the algorithm accounts for the $BarrierCost$ overhead,
which is paid at the end of every parallel conjunct.
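
As a compact restatement of lines 22--24
(a sketch of what the pseudocode already says, not an additional rule),
the value of $CurParTime_i$ after $Conj_i$ consumes $Var_{i, j}$ is:

\begin{displaymath}
\left\{
\begin{array}{ll}
ProdTimeMap[Var_{i, j}] + WaitCost + ContextWakeupDelay
    & \text{if } ParWantTime_{i, j} < ProdTimeMap[Var_{i, j}] \\
ParWantTime_{i, j} + WaitCost
    & \text{otherwise}
\end{array}
\right.
\end{displaymath}

\noindent
The first case is the one in which the consumer is expected to block and
later be woken;
in the second case the value is already available and only the cost of
the \wait operation is paid.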

Throughout Chapter~\ref{chap:rts} we saw the problems that right recursion
can create.
These problems were caused because the leftmost conjunct of a parallel
conjunction will be executed by the context that started the execution of
the conjunction (the original context),
and if the same context does not execute the other conjuncts and those
conjuncts take longer to execute than the leftmost one,
the original context must be suspended.
When this context is resumed,
we must wait $ContextWakeupDelay$ before it can
resume the execution of the code that follows the parallel conjunction.
We account for this overhead on lines 30--31.
This requires knowing the parallel execution time of the first conjunct,
$FirstConjTime$,
which is computed by code on lines 4 and 27--28.

There are two other important pieces of information that we compute,
although neither is shown in the algorithm.
The first is ``CPU utilisation'':
it is the sum of all the chunks' lengths, which include the costs but not the
delays.
The second is the ``dead time'':
it is the sum of the time spent waiting on each future plus the time that
the original context spends waiting at the barrier for the conjunction to
finish.
This represents the amount of time that contexts consume memory without
the contexts being actively used.
We do not currently use either of these metrics to make decisions about
parallelisations,
but they could be used to break ties between alternative parallelisations
with similar parallel speedups.


\section{Choosing how to parallelise a conjunction}
\label{sec:overlap_howto}

\status{Draft finished, This is ready for review by Zoltan.}

A conjunction with more than two conjuncts can be converted into several
different parallel conjunctions.
Converting all the commas into ampersands
(e.g.\ $G_1,~G_2,~G_3$ into $G_1~\&~G_2~\&~G_3$)
yields the most parallelism.
Unfortunately, this will often be \emph{too} much parallelism,
because in practice many conjuncts are unifications
and arithmetic operations whose execution takes very few instructions.
Executing such conjuncts in their own threads
costs far more in overheads than can be gained from their parallel execution.
To see just how big conjunctions can be,
let us consider the quadratic equation example
from Section~\ref{sec:backgnd_mercury},
which we have repeated here.
The quadratic equation is often written as a single conjunct:

\begin{verbatim}
X = (-B + sqrt(pow(B, 2) - 4*A*C)) / (2*A)
\end{verbatim}

\noindent
The compiler decomposes this into 12 conjuncts:

\begin{verbatim}
V_1  = 2,               V_2  = pow(B, V_1),     V_3 = 4,
V_4  = V_3 * A,         V_5  = V_4 * C,         V_6 = V_2 - V_5,
V_7  = sqrt(V_6),       V_8  = -1,              V_9 = V_8 * B,
V_10 = V_9 + V_7,       V_11 = V_1 * A,         X   = V_10 / V_11
\end{verbatim}

\noindent
If the quadratic equation were involved in a parallel conjunction,
we would not want to create 12 separate parallel tasks for it.
Therefore in most cases,
we want to transform sequential conjunctions with $n$ conjuncts into
parallel conjunctions with $k$ conjuncts where $k < n$.
Each conjunct should consist of a contiguous sequence
of one or more of the original sequential conjuncts,
effectively partitioning the original conjuncts into groups.

Deciding how to parallelise something in this way can be thought of as
choosing which of the $n-1$ conjunction operators (\samp{,}) in an $n$-ary
conjunction to turn into parallel conjunction operators (\samp{\&}).
For example, $G_1,~G_2,~G_3$ can be converted
into any of:

\begin{tabular}{l}
$G_1~\&~(G_2,~G_3)$; \\
$(G_1,~G_2)~\&~G_3$; or \\
$G_1~\&~G_2~\&~G_3$.
\end{tabular}

%\noindent
%Our algorithm will also consider
%
%\begin{tabular}{l}
%$(G_1~\&~G_2),~G_3$; \\
%$G_1,~(G_2~\&~G_3)$; \\
%\end{tabular}

\noindent
Our goal is to find the best parallelisation of a conjunction from among
the various possible parallel conjunctions.
The solution space is as large as $2^{n-1}$,
and as the quadratic equation example demonstrates, $n$ can be quite
large even for simple expressions.
The largest conjunction we have seen and tried to parallelise contains about 150
conjuncts.
This is in the Mercury compiler itself.
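
As a quick check of this count,
each of the $n-1$ commas is either kept or replaced by an ampersand,
which gives $2^{n-1}$ combinations
(one of which is the original sequential conjunction).
For the conjunction sizes mentioned in this section:

\begin{displaymath}
\begin{array}{rcl}
n = 3:   & & 2^{2} = 4 \\
n = 12:  & & 2^{11} = 2{,}048 \\
n = 150: & & 2^{149} \approx 7 \times 10^{44}
\end{array}
\end{displaymath}

\noindent
The first row agrees with the three parallel forms of $G_1,~G_2,~G_3$ listed
above plus the sequential original.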

\begin{algorithm}[tbp]
\begin{algorithmic}[1]
\Procedure{find\_best\_partition}{$InitPartition$, $InitTime$, $LaterConjs$}
  \If{empty($LaterConjs$)}
    \State \Return $(InitTime, InitPartition)$
  \Else
    \State $(Head, Tail) \gets$ deconstruct($LaterConjs$)
    \State $Extend \gets$ all\_but\_last($InitPartition$) ++
        [last($InitPartition$) ++ [$Head$]]
    \State $AddNew \gets InitPartition$ ++ [$Head$]
    \State $ExtendTime \gets$ find\_par\_time($Extend$)
    \State $AddNewTime \gets$ find\_par\_time($AddNew$)
    \State $NumEvals \gets NumEvals + 2$
    \If{$ExtendTime < AddNewTime$}
      \State $BestExtendSoln \gets$ find\_best\_partition($Extend$,
        $ExtendTime$, $Tail$)
      \State $(BestExTime, BestExPartSet) \gets BestExtendSoln$
      \If{$NumEvals < PreferLinearEvals$}
        \State $BestAddNewSoln \gets$ find\_best\_partition($AddNew$,
            $AddNewTime$, $Tail$)
        \State $(BestANTime, BestANPartSet) \gets BestAddNewSoln$
        \If{$BestExTime < BestANTime$}
          \State \Return $BestExtendSoln$
        \ElsIf{$BestExTime = BestANTime$}
          \State \Return $(BestExTime,$ choose($BestExPartSet$,
            $BestANPartSet$))
        \Else
          \State \Return $BestAddNewSoln$
        \EndIf
      \Else
        \State \Return $BestExtendSoln$
      \EndIf
    \Else
      \State symmetric with the then case
    \EndIf
  \EndIf
\EndProcedure
\State
\State $NumEvals \gets 0$
\State $BestPar \gets$ find\_best\_partition($[]$, 0, $MiddleGoals$)
\end{algorithmic}
\caption{Search for the best parallelisation}
\label{alg:best_par_search}
%\vspace{-2\baselineskip}
\end{algorithm}

The prerequisite for a profitable parallel conjunction is two or more
expensive conjuncts,
that is, conjuncts whose per-call cost is above a certain threshold
(Section~\ref{sec:overlap_approach}).
In such conjunctions we create a list of goals from the first expensive
conjunct to the last,
which we dub the ``middle goals''.
There are (possibly empty) lists of cheap goals before and after the 
list of middle goals.
Our initial search assumes that
the set of conjuncts in the parallel conjunction we want to create
is exactly the set of conjuncts in the middle.
A post-processing step then removes that assumption.

The initial search space is explored by
Algorithm~\ref{alg:best_par_search}, which processes the middle goals.
% That is a large space to search for the \emph{best} parallelisation,
% and it would be larger still if we allowed code reordering,
% that is, parallel conjuncts consisting of
% a \emph{non}contiguous sequence of the original conjuncts.
The algorithm starts with an empty list as  $InitPartition$,
zero as $InitTime$, and the list of middle conjuncts as $LaterConjs$.
At each step,
$InitPartition$ expresses a partition of an initial sub-sequence
of the middle goals into parallel conjuncts
whose estimated execution time is $InitTime$.
The algorithm considers whether it is better to add the next middle goal
to the last existing parallel conjunct ($Extend$) (line 6),
or to put it into a new parallel conjunct ($AddNew$) (line 7).
The calls to \findpartime (lines 8 and 9) evaluate these two alternatives by
estimating their parallel overlap using
Algorithm~\ref{alg:dep_par_conj_overlap_complete}.
%from section \ref{sec:overlap_overlap_alg}.
If $Extend$ is more profitable than $AddNew$,
then the recursive call on line 12 searches for the best parallel
conjunction beginning with $Extend$;
otherwise similar code is executed on line 26.

It is desirable to explore the full $O(2^n)$ solution space;
however, doing so for large conjunctions is infeasible.
Therefore we compromise by using an algorithm that is
initially complete,
but can switch into a greedy (linear) mode if the solution space is too
large.
Before the algorithm begins we set the global variable $NumEvals$ to zero.
It counts how many overlap calculations have been performed;
if this reaches the threshold ($PreferLinearEvals$), then our algorithm
explores only the more profitable of the two alternatives at every choice
point.
If the threshold has not been reached, then the code on lines 15--22
is executed,
which explores the extensions of the less profitable alternative ($AddNew$)
as well.
Symmetric code, which explores $Extend$ when it is the less profitable
alternative,
belongs on line 26 and the lines omitted after it.
The algorithm returns the better of the two solutions,
or chooses one of the solutions when they are of equal value (line 20).
Currently this choice is arbitrary,
but it could be based on other metrics such as CPU utilisation.
If the threshold has been reached, the less profitable alternative is not
explored (line 24).
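
As a rough illustration of why the switch to greedy mode matters,
we can count the calls to \findpartime made by
Algorithm~\ref{alg:best_par_search} as it is written above,
without any pruning.
The symbols $E_{greedy}$ and $E_{complete}$ are notation introduced only for
this sketch.
Every invocation with a non-empty $LaterConjs$ performs two evaluations.
With $n$ middle goals,
greedy mode makes one recursive call per goal,
whereas complete mode makes two:
\[
E_{greedy}(n) = 2n,
\qquad
E_{complete}(n) = 2 + 2 E_{complete}(n-1) = 2^{n+1} - 2,
\quad E_{complete}(0) = 0 .
\]
For the 12 conjuncts of the quadratic equation this is
$E_{greedy}(12) = 24$ evaluations
against $E_{complete}(12) = 2^{13} - 2 = 8190$.
A search that starts in complete mode and switches to greedy mode
once $NumEvals$ reaches $PreferLinearEvals$ falls between these two figures.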

The actual algorithm in the Mercury system is more complex than
Algorithm~\ref{alg:best_par_search}:
our algorithm will also test each partial parallelisation against the best
solution found so far.
If the expected execution time for the candidate currently being considered
is already greater than the fastest existing complete parallelisation,
we can stop exploring that branch;
it cannot lead to a better solution.
This is a simple branch-and-bound algorithm.

There are some simple ways to improve this algorithm.

\begin{itemize}
\item
Most invocations of \findpartime specify a partition
that is an extension of a partition processed in the recent past.
In such cases, \findpartime should do its task
incrementally, not from scratch.
\item
Sometimes consecutive conjuncts do things that are
obviously a bad idea to do in parallel, such as building a ground term.
The algorithm should treat these as a single conjunct.
% XXX we could make the third item tr only if we need space
\item
The algorithm could take other metrics such as total CPU utilisation,
dead time, or GC pressure into account,
at least by using them to break ties on parallel execution time.
\item
Also, the current implementation does not make an estimate of the
minimum cost of the work that is yet to be scheduled after the current point.
This affects the amount of pruning that the branch-and-bound code
is able to achieve.
\end{itemize}

\noindent
At the completion of the search,
we select one of the equal best parallelisations,
and post-process it to adjust both edges.
Suppose the best parallel form of the middle goals is $P_1~\&~\ldots~\&~P_p$,
where each $P_i$ is a sequential conjunction.
We compare the execution time of $P_1~\&~\ldots~\&~P_p$
with that of $P_1,~(P_2~\&~\ldots~\&~P_p)$.
If the former is slower,
which can happen if $P_1$ produces its outputs at its very end
and the other $P_i$ consume those outputs at their start,
then we conceptually move $P_1$ out of the parallel conjunction
(from the ``middle'' part of the conjunction to the ``before'' part).
We keep doing this for $P_2$, $P_3$ et cetera, until either
we find a goal worth keeping in the parallel conjunction,
or we run out of conjuncts.
We also do the same thing at the other end of the middle part.
This process can shrink the middle part.

In cases where we do not shrink an edge, we can consider expanding that edge.
Normally, we want to keep cheap goals out of parallel conjunctions,
since having more conjuncts tends to mean
more shared variables and thus more synchronisation overhead,
but sometimes this consideration is overruled by others.
Suppose the goals before $P_1~\&~\ldots~\&~P_p$
in the original conjunction were $B_1,~\ldots,~B_b$
and the goals after it $A_1,~\ldots,~A_a$,
and consider $A_1$ after $P_p$.
If $P_p$ finishes before the other parallel conjuncts,
then executing $A_1$ just after $P_p$ in $P_p$'s context
may be effectively free:
the last context could still arrive at the barrier at the same time,
but this way, $A_1$ would have been done by then.
Now consider $B_b$ before $P_1$.
If $P_1$ finishes before the other parallel conjuncts,
\emph{and} if none of the other conjuncts wait for variables produced by $P_1$,
then executing $B_b$ in the same context as $P_1$ can be similarly free.

We loop from $i=b$ down towards $i=1$, and check whether
including $B_i,~\ldots,~B_b$ at the start of $P_1$ is an improvement.
If not, we stop; if it is, we keep going.
We do the same from the other end.
The stopping points of the loops of the contraction and expansion phases
dictate our preferred parallel form of the conjunction, which
(if we shrank the middle at the left edge and expanded it at the right)
will look something like
$B_1,$ $\ldots,$ $B_{b},$ $P_1,$ $\ldots,$ $P_k,$
$(P_{k+1}$ $\&$ $\ldots$ $\&$ $P_{p-1}$ $\&$ $(P_p,$ $A_1,$ $\ldots,$ $A_j)),$
$A_{j+1},$ $\ldots,$ $A_a$.
% $B_1, \ldots, B_{i-1},
% ((B_i, \ldots, B_b, P_1) \& P_2, \ldots \& P_{p-1} \& (P_p, A_1, \ldots, A_j)),
% A_{j+1}, \ldots, A_a$.
If this preferred parallelisation is better than
the original sequential version of the conjunction by at least 1\% (a
configurable threshold),
then we include a recommendation for its conversion to this form
in the feedback file we create for the compiler.

% These two loops are specifically designed
% to allow the inclusion of cheap goals in the parallel conjunction.
% Note that this algorithm always tries to arrange
% \emph{all} the conjuncts in the conjunction,
% not just the conjuncts from the first costly goal to the last.
% Normally, we want to keep cheap goals out of parallel conjunctions,
% since more conjuncts usually means more shared variables,
% which means more synchronisation overhead.
% The reason why we expanding the scope of the parallel conjunction
% is that sometimes this consideration is overruled by others.
% Consider $A_1$ after $P_p$.
% If $P_p$ finishes before the other parallel conjuncts,
% then executing $A_1$ just after it in $P_p$'s context may be effectively free:
% the last context could still arrive at the barrier at the same time,
% but this way, $A_1$ would have completed by then.
% Now consider $B_b$ before $P_1$ where $P_1$ is still in a parallel conjunct.
% If $P_1$ finishes before the other parallel conjuncts,
% \emph{and} if none of the other conjuncts
% wait for variables produced by $P_1$,
% then executing $B_b$ in the same context as $P_1$ can be similarly free.

% \begin{itemize}
% \item
% Currently no tie breaking is done and we have not explored
% using other formulas for the search's objective function.
% \end{itemize}

\section{Pragmatic issues}
\label{sec:overlap_pragmatic}

\status{Draft finished, This is ready for review by Zoltan.}

In this section we discuss several pragmatic issues with our implementation.
We also discuss several directions for further research.

% \subsection{The effects of module boundaries}
% \label{sec:pragmamoduleboundary}

% pushing waits and signals into calles stops at module boundaries

\subsection{Merging parallelisations from different ancestor contexts}
\label{sec:overlap_pragma_merge}

We search the program's
call graph for parallelisation opportunities
(Section~\ref{sec:overlap_approach}).
This means that we may visit the same procedure multiple times from
different ancestor contexts.
In a different context a procedure may have a different optimal
parallelisation,
or no profitable parallelisation.
% In addition to this,
% multiple conjunctions in the procedure may be parallelised.
We must somehow resolve cases where there are multiple candidate
parallelisations for the same procedure.

At the moment, for any procedure and conjunction within that procedure
which our analysis indicates is worth parallelising in any context,
we pick one particular parallelisation (usually there is only one anyway),
and transform the procedure accordingly.
This gets the benefit of parallelisation when it is worthwhile,
but incurs its costs even in contexts when it is not.

\begin{figure}
\begin{center}
\subfigure[$j$ is called from two different ancestor contexts]{
\includegraphics[width=0.8\textwidth]{pics/par_specialisation}
\label{fig:par_specialisation_before}}
\subfigure[$j$'s parallel version is a specialisation, which requires
specialising $i$]{
\includegraphics[width=0.8\textwidth]{pics/par_specialisation2}
\label{fig:par_specialisation_after}}
\end{center}
\caption{Parallel specialisation}
\label{fig:par_specialisation}
\end{figure}

In the future, we plan to use multi-version specialisation.
For every procedure with different parallelisation recommendations
in different ancestor contexts,
we intend to create a specialised version for each recommendation,
leaving the original sequential version.
This will of course require the creation of specialised versions
of its parent, grandparent etc procedures,
until we get to an ancestor procedure that can be used to separate the
specialised version, or versions, from the original versions.
This is shown in Figure~\ref{fig:par_specialisation}.
In the figure, $j$ is parallelised when it is called from the
$main \calls f \calls h \calls i \calls j$ ancestor context,
but not when it is called from the $main \calls f \calls g \calls i \calls j$
context.
Therefore, by specialising both $j$ and $i$ into new versions
$j\_par$ and $i\_par$ respectively, and calling $i\_par$ from $h$ rather than
$i$, we can parallelise $j$ only when it is called from the correct ancestor
context.

Each candidate parallel conjunction sent to the compiler includes a goal
path and procedure;
this points to the procedure and conjunction within the procedure that we
wish to parallelise.
When different conjunctions in the same procedure have parallelisation
advice, we save all this advice into the feedback file.
When the compiler acts on this advice,
it applies the parallelisations deeper within compound goals first.
This ensures that, as we apply advice,
the goal path leading to each
successive conjunction that we parallelise still points to the correct
conjunction.
To ensure that the compiler applies advice correctly,
it attempts to identify the parallel conjunction in the procedure's body
that the parallelisation advice refers to.
It does this by comparing the goal structure, and the calls within the
procedure body to
the goal structure and calls in the candidate conjunction in the feedback
data.
It compares calls using the callee's name and the names of some of the
variables in the argument list.
Not all variables are used, as many are named automatically by other compiler
transformations;
the compiler knows which variables were introduced automatically by such
transformations.
If any conjunction cannot be found,
then the match fails and the compiler skips that piece of advice and issues
a warning.

% \subsection{Searching for parallelism opportunities}
% \label{sec:pragmabestfirst}
%
% Our candidates list actually contains
% both cliques and conjunctions within cliques.
%
% The candidates list should contain cliques, since
% \begin{itemize}
% \item
% the entry points of some child cliques are not in conjunctions
% (e.g.\ they can be switch arms), and
% \item
% we want to delay breaking a clique down into its constituent conjunctions,
% since this way if our traversal stops before getting to a clique,
% then we never have to break it down.
% \end{itemize}
%
% The candidates list should also contain conjunctions,
% since a clique can contain both cheap and expensive conjunctions,
% and we do not want to evaluate the cheap ones
% until we have processed all the more expensive conjunctions
% not just in this clique but in all other cliques;
% again, we expect that this way,
% our traversal will stop before it gets to the cheapest conjunctions.

\subsection{Parallelising children vs ancestors}
\label{sec:overlap_pragma_child_ancestor}

What happens when we decide that a conjunction that should be parallelised
has an ancestor that we decided should also be parallelised?
This can happen both with an ancestor in the same procedure (a compound goal
in a parallel conjunction that also contains a parallel conjunction)
or a call from a parallel conjunction to the current procedure.
In either case our options are:

\begin{enumerate}
\item parallelise neither,
\item parallelise only the ancestor,
\item parallelise only this conjunction, or
\item parallelise both.
\end{enumerate}

The first alternative (parallelise neither) has already been rejected twice,
since we concluded that (2) was better than (1)
when we decided to parallelise the ancestor,
and we concluded that (3) was better than (1)
when we decided to parallelise this conjunction.

Currently our system will parallelise both the ancestor and the current
conjunction.
However, our system will not explore parts of the call graph
(and therefore conjunctions)
when it estimates that there is enough parallelism in the procedure's
ancestors to occupy all the CPUs.
In the future we could choose among the three reasonable alternatives:
we could evaluate the speedup from each of them, and just pick the best.
This is simple to do when both candidates are in the same procedure.
When one of the candidates is in an ancestor call we must also 
take into account
the fact that for each invocation of the ancestor conjunction,
we will invoke the current conjunction many times.
Therefore we will incur both the overheads and the benefits
of parallelising the current conjunction many times.
We will be able to determine the actual number of invocations from the
profile.
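
One way to sketch this accounting,
under simplifying assumptions that are ours rather than part of the
implementation, is as follows.
Suppose that a single invocation of the ancestor conjunction leads to $m$
invocations of the current conjunction,
and that each of these takes time $t_{seq}$ when executed sequentially
and $t_{par}$ (including all parallelisation overheads) when parallelised.
The symbols $m$, $t_{seq}$, $t_{par}$ and $\Delta$ are introduced only for
this illustration.
Parallelising the current conjunction then changes the work attributed to
one invocation of the ancestor conjunction by roughly
\[
\Delta = m \times (t_{seq} - t_{par}),
\]
so a small per-invocation gain or loss is magnified by a factor of $m$.
This estimate ignores the interaction between the two levels of parallelism,
for example the possibility that the ancestor's parallel conjuncts already
occupy the CPUs that the current conjunction would need.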

% This complicates how we might handle specialisation in the future.
% Consider a situation where
% for some ``current'' clique,
% you want to parallelise the ancestor,
% but for some other current clique,
% you do not want to parallelise the same ancestor?

\subsection{Parallelising branched goals}
\label{sec:overlap_pragma_push}

Many programs have code that looks like this:
\begin{verbatim}
( if ... then
    ...,
    expensive_call_1(...),
    ...
else
    ...,
    cheap_call(...),
    ...
),
expensive_call_2(...)
\end{verbatim}
If the condition of the if-then-else succeeds only rarely,
then the average cost of the if-then-else
may be below the threshold of what we consider to be an expensive goal.
We therefore would not consider
parallelising the top-level conjunction
(the conjunction of the if-then-else and \code{expensive\_call\_2});
this is correct as its overheads would probably outweigh its benefits.
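
To illustrate with hypothetical numbers,
suppose the condition succeeds with probability $p$,
and the then and else branches cost $c_{then}$ and $c_{else}$ respectively
(ignoring the cost of the condition itself).
The average cost of the if-then-else is
\[
c_{avg} = p \times c_{then} + (1 - p) \times c_{else} .
\]
With $p = 0.01$, $c_{then} = 10\,000$ and $c_{else} = 10$
call sequence counts,
this gives $c_{avg} = 100 + 9.9 = 109.9$,
which is about 1\% of the cost of the then branch
even though that branch contains an expensive call.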

In these cases we remember the expensive goal within the if-then-else,
so that when we see \code{expensive\_call\_2} we virtually \emph{push}
\code{expensive\_call\_2} into both branches of the if-then-else and test
whether it creates profitable parallelism in the branch next to
\code{expensive\_call\_1}.
If we estimate that this will create profitable parallelism, then when we
give feedback to the compiler we tell it to push
\code{expensive\_call\_2} just as we have.
During execution, parallelism is then used only when the then branch of this
if-then-else is executed.
After the push, the code has the following shape:
\begin{verbatim}
( if ... then
    ...,
    expensive_call_1(...),
    ...,
    expensive_call_2(...)
else
    ...,
    cheap_call(...),
    ...,
    expensive_call_2(...)
)
\end{verbatim}

This transformation is only applied when it does not involve reordering goals.
For example, we do not push \code{expensive\_call\_2} past
\code{other\_call} in the following code:

\begin{verbatim}
( if ... then
    ...,
    ( if ... then
        expensive_call_1(...)
    else
        ...
    ),
    other_call(...)
else
    ...,
    cheap_call(...),
    ...
),
expensive_call_2(...)
\end{verbatim}

\subsection{Garbage collection and estimated parallelism}
\label{sec:overlap_pragma_gc}

We described the garbage collector's effects on parallel performance in
detail in Section~\ref{sec:rts_gc}.
In this section we will describe how it relates to automatic parallelism in
particular.

Throughout this chapter we have measured the costs of computations as call
sequence counts (CSCs);
however, this metric does not take into account memory allocation or
collection.
There is a fixed (and usually small) limit
on the number of instructions that the program can execute
between increments of the call sequence count
(though the limit is program-dependent).
There is no such limit on the collector, which can be a problem.
Since a construction unification does not involve a call,
our profiler considers its cost (in CSCs) to be zero.
Yet if the memory allocation required by a construction
triggers a collection,
then this nominally zero-cost action can actually take as much time
as many thousands of CSCs.

The normal way to view the time taken by a collection
is to simply distribute it among the allocations,
so that one CSC represents the average time taken
by the mutator between two calls
plus the average amortised cost of the collections
triggered by the unifications between those calls.
For a sequential program, this view works very well.
For a parallel program, it works less well,
because the performance of the mutator and the collector scale differently
(Section~\ref{sec:rts_gc}).
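
A simple model shows why the difference in scaling matters.
The model and its symbols are an illustrative assumption of ours,
not a measurement.
Let the sequential runtime be $T = T_m + T_c$,
where $T_m$ is the time spent in the mutator and $T_c$ the time spent in
the collector,
and suppose that parallel execution speeds up the mutator by a factor of $s$
while leaving the collection time unchanged.
The overall speedup is then
\[
\frac{T_m + T_c}{T_m / s + T_c} .
\]
If, for instance, the collector accounted for 25\% of the sequential runtime
and the mutator were sped up by $s = 4$,
the overall speedup would be
$1 / (0.75 / 4 + 0.25) = 1 / 0.4375 \approx 2.29$.
This is an instance of Amdahl's law:
the amortised view hides the part of each CSC that does not benefit from
additional engines.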

We could take allocations into account in the algorithms above in a couple
of different ways.
%We already proposed using other metrics to split ties between alternative
%parallelisations.
We could create a new unit for time that combines call sequence counts
and allocations and use this to determine how much parallelism we can
achieve.
At first glance allocations take time and parallelising them against one
another may be beneficial.
Therefore we may wish to arrange parallel conjunctions so that memory
allocation is done in parallel.
However this may lead to slowdowns when there is too much parallel
allocation.
%allocations exhaust their thread
%ocal memory pools and increase the contention for the shared memory pool.
Memory can be allocated from the shared memory pool or from thread local
pools.
Threads will try to allocate memory from their own local memory pool first.
This amortises the accesses to the shared memory pool across a number of
allocations.
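However, when several threads allocate memory at a high rate,
they exhaust their local pools more often and must return to the shared
pool more often,
which increases contention on the shared memory pool and slows the
computation down.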
High allocation rates also correspond to high memory bandwidth usage.
Parallel computations with high memory bandwidth demands may exhaust the
available bandwidth,
which will reduce the amount that parallelism can improve performance.
Therefore,
if we use memory allocation information when estimating the benefit of
parallelisation we must weigh up these factors carefully.

There is also the consideration that with the
Boehm-Demers-Weiser collector~\citep{boehm:1988:gc},
a collection stops the world,
and the overheads of this stopping scale with the number of CPUs being used.
The overheads of stopping include
not just the direct costs of the interruption,
but also indirect costs,
such as having to refill the cache after the collector trashes it.

\section{Performance results}
\label{sec:overlap_perf}

\status{Draft finished, This is ready for review by Zoltan.}

% \begin{table*}
% \begin{center}
% \begin{tabular}{l|rrrrrrrrrr}
%  ~ & \multicolumn{1}{|c|}{Seq RT} &
%   \multicolumn{1}{|c|}{Par RT} &
%   \multicolumn{2}{|c|}{No Deps} &
%   \multicolumn{2}{|c|}{Na\"ive} &
%   \multicolumn{2}{|c|}{Num Vars} &
%   \multicolumn{2}{|c}{Overlap} \\
% \multicolumn{1}{c|}{Program} & \multicolumn{1}{|c|}{345Time} &
%   \multicolumn{1}{|c|}{Time} &
%   \multicolumn{1}{|c|}{Conjs} & \multicolumn{1}{|c|}{Time} &
%   \multicolumn{1}{|c|}{Conjs} & \multicolumn{1}{|c|}{Time} &
%   \multicolumn{1}{|c|}{Conjs} & \multicolumn{1}{|c|}{Time} &
%   \multicolumn{1}{|c|}{Conjs} & \multicolumn{1}{|c}{Time} \\ \hline
% quicksort acc &   &   & 0 &   & 1 &   &   &   & 0 &   \\
% quicksort app &   &   & 1 &   & 1 &   &   &   & 1 &   \\
% fibs & 1 & a & 2 & 3 & 4 & 5 & 6 & 7 & 8 & 9 \\
% icfp2000 & 1 & a & 2 & 3 & 4 & 5 & 6 & 7 & 8 & 9 \\
% mandelbrot & 1 & a & 2 & 3 & 4 & 5 & 6 & 7 & 8 & 9 \\
% mmc & 1 & a & 2 & 3 & 4 & 5 & 6 & 7 & 8 & 9
% \end{tabular}
% \end{center}
% \caption{Results}
% \label{tab:results_temp}
% \end{table*}

% Report analysis times as well as sequential and parallel execution times
% and CPU usage (as integral if possible, as well as peack CPU usage).

We tested our system on four benchmark programs.
The first three are modified versions of the raytracer and mandelbrot
programs from Chapter~\ref{chap:rts}
(two versions of the raytracer and one of mandelbrot);
the fourth is a matrix multiplication program.
The two versions of the raytracer and the mandelbrot program have only
dependent parallelism available.
%Mandelbrot's modifications make it use dependent AND-parallelism
Mandelbrot uses the \mapfoldl predicate in Figure~\ref{fig:map_foldl}.
\mapfoldl is used to iterate over the rows of pixels in the image.
Raytracer does not use \code{map\_foldl},
but does use a similar code structure to perform a similar task.
This similar structure is not an accident:
\emph{many} predicates use this kind of code structure,
partly because programmers in declarative languages
often use accumulators to make their loops tail recursive.
The second raytracer program renders \emph{chunks} of 16 rows at a time in
each iteration of its render loop.
This is not normally how a programmer would write such a program;
we have done this deliberately to aid a discussion below.
Matrixmult has abundant independent AND-parallelism.
% All three programs do lots of floating point arithmetic,
% and the mandelbrot program does a lot of integer arithmetic as well.
% The Mercury backend we are using always boxes floating point numbers,
% so each floating point operation requires the creation of a new cell on the heap.
% No it does not -pbone.
% This makes matrixmult and raytracer very memory allocation intensive.
% Since the garbage collector accounts for a large fraction of their runtimes,
% Amdahl's law dictates the maximum speedup we can get by speeding up their mutators
% will be correspondingly limited.

We ran all four programs
with one set of input parameters to collect profiling data,
and with a \emph{different} set of input parameters to produce
the timing results in the following table.
All tests were run on the same system as the benchmarks in Chapter~\ref{chap:rts}
(the system is described on page~\pageref{cabsav}).
% taura
% a Dell Optiplex 755 PC with a 2.4~GHz Intel Core 2 Quad Q6600 CPU
% running Linux 2.6.31.
Each test was executed 40 times;
we give the reason for the high number of samples below.

% \begin{table*}[h]
% \begin{center}
% \begin{tabular}{l||r|r|r|r|r|r}
% Program     & Seq   & No autopar   & 1 CPU        & 2 CPUs      & 3 CPUs      &
% 4 CPUs \\
% \hline
% mandelbrot  & 33.4  &  35.3 (0.95) &  35.4 (0.94) & 18.0 (1.85) & 12.2 (2.74) &
 % 9.4 (3.55) \\
% raytracer   & 12.33 & 14.01 (0.88) & 14.77 (0.83) & 9.40 (1.31) & 7.59 (1.62) &
% 6.70 (1.84) \\
% \end{tabular}
% \end{center}
% % \caption{Results}
% % \label{tab:results_temp}
% \end{table*}

% \vspace{-2mm}
% \begin{table*}[h]
% \begin{center}
% \begin{tabular}{|l|l||r|r|r|r|r|}
% %\hhline{|-|-||-|-|-|-|-|}
% \hline
% \multicolumn{1}{|c|}{\textbf{Program}} &
% \multicolumn{1}{ c||}{\textbf{Par}}    &
% \multicolumn{1}{ c|}{\textbf{1 CPU}}   &
% \multicolumn{1}{ c|}{\textbf{2 CPUs}}  &
% \multicolumn{1}{ c|}{\textbf{3 CPUs}}  &
% \multicolumn{1}{ c|}{\textbf{4 CPUs}}  \\
% %\hhline{|-|-||-|-|-|-|-|}
% \hline
% matrixmult & indep & 14.60 (0.75) &  7.55 (1.46) &  6.07 (1.81) &  5.21 (2.11) \\
% seq 11.00  & naive & 14.61 (0.75) &  7.53 (1.46) &  6.75 (1.63) &  5.17 (2.12) \\
% par 14.60  & overlap  & 14.59 (0.75) &  7.57 (1.45) &  5.26 (2.09) &  5.37 (2.05) \\
% %\hhline{|-|-||-|-|-|-|-|}
% \hline
% mandelbrot & indep & 35.27 (0.95) & 35.31 (0.95) & 35.15 (0.95) & 35.31 (0.95) \\
% seq 33.47  & naive & 35.33 (0.95) & 17.87 (1.87) & 12.07 (2.77) &  9.17 (3.65) \\
% par 35.27  & overlap  & 35.16 (0.95) & 17.91 (1.87) & 12.02 (2.78) &  9.15 (3.65) \\
% %\hhline{|-|-||-|-|-|-|-|}
% \hline
% raytracer  & indep & 11.33 (0.87) & 11.37 (0.87) & 11.36 (0.87) & 11.36 (0.87) \\
% seq  9.85  & naive & 11.20 (0.88) &  7.48 (1.32) &  5.91 (1.66) &  5.39 (1.83) \\
% par 11.29  & overlap  & 11.28 (0.87) &  7.56 (1.30) &  5.94 (1.66) &  5.38 (1.83) \\
% %\hhline{|-|-||-|-|-|-|-|}
% \hline
% \end{tabular}
% \end{center}
% % \caption{Results}
% % \label{tab:results_temp}
% \end{table*}
% \vspace{-2mm}

% \vspace{-2mm}
\begin{table}[tb]
\begin{center}
\begin{tabular}{l|rr|rrrr}
\Cbr{\textbf{Version}} &
\multicolumn{2}{c|}{\textbf{Sequential}} &
\multicolumn{4}{c}{\textbf{Parallel w/ $N$ Engines}} \\
\Cbr{} & \C{\textbf{not TS}} & \Cbr{\textbf{TS}} &
\C{\textbf{1}} & \C{\textbf{2}} & \C{\textbf{3}} & \C{\textbf{4}} \\
\hline
\hline
\multicolumn{7}{c}{Raytracer} \\
\hline
indep    & 14.7 & 16.9 & 16.9 (1.00) & 16.9 (1.00) & 17.0 (0.99) & 17.0 (1.00) \\
naive    &    - &    - & 17.5 (0.97) & 13.4 (1.26) & 10.4 (1.62) &  8.9 (1.90) \\
overlap  &    - &    - & 17.4 (0.97) & 13.2 (1.28) & 10.5 (1.61) &  9.0 (1.88) \\
%raytracer  & indep    & 26.2 (0.87) & 26.3 (0.86) & 26.1 (0.87) & 26.2 (0.87) \\
%seq 22.7   & naive    & 25.3 (0.90) & 16.0 (1.42) & 11.2 (2.03) &  9.4 (2.42) \\
%par 26.5   & overlap  & 25.1 (0.90) & 16.0 (1.42) & 11.2 (2.03) &  9.4 (2.42) \\
\hline
\hline
\multicolumn{7}{c}{Raytracer w/ 16-row chunks} \\
\hline
indep    & 14.9 & 17.8 & 17.9 (1.00) & 18.0 (0.99) & 18.0 (0.99) & 18.1 (0.98) \\
naive    &    - &    - & 17.4 (1.02) & 10.2 (1.75) &  7.9 (2.26) &  6.5 (2.73) \\
overlap  &    - &    - & 17.5 (1.02) & 10.2 (1.75) &  7.9 (2.27) &  6.5 (2.73) \\
\hline
\hline
\multicolumn{7}{c}{Right recursive dependent Mandelbrot} \\
\hline
indep    & 15.2 & 15.1 & 15.1 (1.01) & 15.2 (1.01) & 15.2 (1.01) & 15.2 (1.01) \\
naive    &    - &    - & 15.2 (1.01) &  9.2 (1.66) &  5.7 (2.69) &  4.1 (3.76) \\
overlap  &    - &    - & 15.2 (1.01) &  9.8 (1.56) &  5.7 (2.66) &  4.0 (3.85) \\
%mandelbrot & indep    & 35.2 (0.95) & 35.1 (0.95) & 35.2 (0.95) & 35.3 (0.95) \\
%seq 33.4   & naive    & 35.4 (0.94) & 18.0 (1.86) & 12.1 (2.76) &  9.1 (3.67) \\
%par 35.2   & overlap  & 35.6 (0.94) & 17.9 (1.87) & 12.1 (2.76) &  9.1 (3.67) \\
\hline
\hline
\multicolumn{7}{c}{Matrix multiplication} \\
\hline
indep    & 5.10 & 7.69 & 7.69 (1.00) & 3.87 (1.99) & 2.60 (2.96) & 1.97 (3.90) \\
naive    &    - &    - & 7.69 (1.00) & 3.87 (1.98) & 2.60 (2.96) & 1.97 (3.90) \\
overlap  &    - &    - & 7.69 (1.00) & 3.87 (1.99) & 2.60 (2.96) & 1.97 (3.90) \\
%matrixmult & indep    & 14.6 (0.75) &  7.5 (1.47) &  7.0 (1.66) &  5.2 (2.12) \\
%seq 11.0   & naive    & 14.6 (0.75) &  7.6 (1.45) &  5.2 (2.12) &  5.2 (2.12) \\
%par 14.6   & overlap  & 14.6 (0.75) &  7.5 (1.47) &  6.2 (1.83) &  5.2 (2.12) \\
%\hhline{|-|-||-|-|-|-|-|}
\end{tabular}
\end{center}
\caption{Automatic parallelism performance results}
\label{tab:autopar}
\end{table}

Table~\ref{tab:autopar} presents these results.
Each group of three rows reports the results for one benchmark.
We auto-parallelised each program three different ways:
executing expensive goals in parallel
only when they are independent (``indep'');
even if they are dependent, regardless of how dependencies affect the
estimated available parallelism (``naive'');  and
even if they are dependent, but only if they have good overlap (``overlap'').
The second and third columns give the sequential execution times of each
program:
the first of these (``not TS'') is the
runtime of the program when compiled for sequential execution, and
the second (``TS'') is
its runtime when compiled for parallel execution, which requires thread safety,
but without enabling auto-parallelisation.
The difference shows the overhead of support for parallel execution
when it does not have any benefits.
The last four columns give the runtime in seconds
of each of these versions of the program
using one to four Mercury engines,
with speedups compared to the sequential thread safe version.
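
Concretely, writing $T_{TS}$ for the sequential thread safe runtime
and $T_N$ for the runtime with $N$ engines
(notation introduced here only for this explanation),
the figure in parentheses is
\[
\mathit{speedup}_N = \frac{T_{TS}}{T_N} .
\]
For example, for the naive raytracer on four engines,
$16.9 / 8.9 \approx 1.90$.
The cost of thread safety can be read from the two sequential columns in the
same way:
for matrixmult, $7.69 / 5.10 \approx 1.51$,
and for the raytracer, $16.9 / 14.7 \approx 1.15$.
Recomputing a speedup from the rounded times shown in the table
can differ from the tabulated speedup in the last digit,
presumably because the tabulated speedups were computed from unrounded times.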

Compiling for parallelism but not using it
often yields a slowdown, sometimes as high as 50\% (for matrixmult).
(We observe such slowdowns for other programs as well.)
There are two reasons for this.
The first reason is that the runtime system, and in particular the garbage
collector, must be thread safe.
We tested how the garbage collector affected parallel performance
extensively in Section~\ref{sec:rts_gc}.
The results shown in Table~\ref{tab:autopar} were collected using one marker
thread in the garbage collector.
The second reason for a slowdown in thread safe grades comes from how we
compile Mercury code.
Mercury uses four
of the x86\_64's six callee-save registers~\citep{sysv-abi} to implement
four of the Mercury abstract machine registers\footnote{
    We only use four real registers because we may use only 
    callee-save registers,
    and we do not want to over-constrain the C compiler's register
    allocator.}
(Section~\ref{sec:backgnd_merpar}).
The four abstract machine registers that use real registers are:
the stack pointer,
the success instruction pointer (continuation pointer),
and the first two general purpose registers.
The parallel version of the Mercury system
needs to use a real machine register
to point to thread-specific data,
such as each engine's other abstract machine registers
(Section~\ref{sec:backgnd_merpar}).
On x86\_64, this means that only the first abstract general purpose register
is implemented using a real machine register.

The parallelised versions running on one Mercury engine get
only this slowdown,
plus the (small) additional overheads of all the parallel conjunctions
which cannot get any parallelism.
However, when we move to two, three or four engines,
many of the auto-parallelised programs do get speedups.

In mandelbrot and both raytracers,
all the parallelism is dependent,
which is why indep gets no speedup for them:
it refuses to create dependent parallel conjunctions.
The non-chunking raytracer achieves speedups of 1.90 and 1.88 with four Mercury engines.
We have found two reasons for these poor results.
The first reason is that, as we discussed in
Section~\ref{sec:rts_gc},
the raytracer's performance is limited by how much time it spends in the
garbage collector.
The second reason is the ``right recursion problem'' (see below).
For mandelbrot, overlap gets a speedup
that is as good as one can reasonably expect: 3.85 on four engines.
At first glance there appears to be a difference in the performance of the
naive and overlap versions of the mandelbrot program.
However the feedback data given to the compiler is the same in both cases,
so the generated code will be the same.
Any apparent difference is due to the high degree of variability in the
results,
which is why we executed each benchmark 40 times.
This variability does not exist in the indep case,
so we believe it comes from how dependencies in parallel conjunctions are
handled.
The performance is probably affected by
both the implementation of futures
and the ``right recursion problem''
(Section~\ref{sec:rts_original_scheduling_performance}),
for which we introduce a workaround
in Section~\ref{sec:rts_reorder}.
The workaround reorders the conjuncts of a parallel conjunction to
transform right recursion into left recursion.
However, this cannot be done to dependent conjunctions, as it would create
mode-incorrect code.
If the non-chunking raytracer and mandelbrot do not reach the context limit,
the garbage collector needs to scan the large number of stacks created by
right recursion.
This means that parallel versions of these programs put more demands on the
garbage collector than their sequential equivalents.
Additionally,
the more Mercury engines that are used, the more sparks are stolen and
therefore more contexts are created.
Dependent parallel right recursive code will not scale well to large numbers of
processors.
We fix the right recursion problem
in Chapter~\ref{chap:loop_control},
and provide better benchmark results for explicitly-parallel versions of the
non-chunking raytracer, mandelbrot and a dependent version of matrixmult.

The version of the Mercury runtime system that we used in these tests
does not behave exactly as described in
Section~\ref{sec:rts_work_stealing2}.
In some cases the context limit is not used to reduce the amount of memory
consumed by contexts' stacks.
This means that right recursive dependent parallel conjunctions can consume
more memory (reducing performance),
but they can also create more parallel tasks (usually improving
performance).
This does not affect the analysis of how well automatic parallelism works;
it merely makes it difficult to compare these results with results in the
other chapters of the dissertation.

We verified the right recursion problem's impact on garbage collection by
testing the chunking version of raytracer.
This version renders 16 rows of the image on each iteration and therefore
creates fewer, larger contexts.
We can see that chunking raytracer performs much better than the
non-chunking raytracer when either naive or overlap parallelism is used.
However it performs worse without parallelism, including indep
auto-parallelism (which cannot find any candidate parallel conjunctions).
We are not sure why it performs worse without parallelism;
one idea is that the garbage collector is affected somehow by how and where
memory is allocated.
These versions spawn far fewer contexts, thus putting much less load
on the GC.
We know that this speedup is due to the interaction of the right recursion
problem and the GC because mandelbrot,
which creates just as many contexts but has a much lower allocation rate,
already runs close to optimally.
This shows that
program transformations that cause more work to be done in each context
are likely to be optimisations;
this includes the work in Chapter~\ref{chap:loop_control}.
% We thus expect that applying throttling
% (as described in section~\ref{sec:overlap})
% should significantly improve these results.

The parallelism in the main predicate of matrixmult is independent.
Regardless of which parallel cost calculation we used (naive, indep or
overlap),
the same feedback was sent to the compiler.
All versions of matrixmult were parallelised the same way and performed the
same.
This is normal for such a small example with simple parallelism.
(We compare this to a more complicated example of the Mercury compiler
below.)
In an earlier version of our system,
the version that was used in \citet*{bone:2011:overlap},
the naive strategy parallelised matrixmult's main loop
differently:
it included an extra goal inside the parallel conjunct, thereby creating
a dependent parallel conjunction even though the parallelism was
independent.
Therefore in \citet*{bone:2011:overlap},
naive performed worse than either indep or overlap.

% For matrixmult, the bottleneck is almost certainly CPU-memory bandwidth.
% Each step in this program does only one multiply and one add (both integer)
% before creating a new cell on the heap and filling it in.
% On current CPUs, the arithmetic takes much less time than the memory writes,
% and since the new cells are never accessed again, caches do not help,
% which makes it easy to saturate the memory bus.

%The raytracer is very memory-allocation-intensive,
%because it stores intermediate values, such as colours and vectors, in
%records on the heap.
%Each operation requires the creation of a cell on the heap to store the
%result.
%Because of this, memory bandwidth may also be an issue for it,
%but its bigger problem is GC.
%We measured the effects of GC on this program in section \ref{sec:rts_gc}.

Most small programs like these benchmarks
have only one loop that dominates their runtime.
In all four of these benchmarks, and in many others,
the naive and overlap methods will parallelise the same loops,
and usually the same way;
they tend to differ only in how they parallelise code
that executes much less often (typically only once),
and whose effect is lost in the noise.
%The raw timings show a great deal of variability for dependent parallelism:
%Some of this variability remains even after filtering and averaging.
%However, the raw times showed significant variability,
%and this process does not entirely eliminate that variability.

% we have seen two consecutive runs of the same program on the same data
% differ in their runtime by as much as 15\%.
% (One possible cause of this is differences
% in whether the OS puts frequently-communicating engines
% on cores on the same die, or cores on two different dies.)
% As the table shows,
% Some of this variability remains even after filtering and averaging.
% However, the raw times showed significant variability,
% and this process does not entirely eliminate that variability.

\label{page:conjs_in_mcc}
To see the difference between naive and overlap,
we need to look at larger programs.
Our standard large test program is the Mercury compiler, which contains
69 conjunctions with two or more expensive goals
(goals with a per call cost of at least 10,000csc).
Of these, 66 are dependent,
and only 34 have an overlap
that leads to a predicted local speedup of more than 2\%.
Our algorithms can thus prevent
the unproductive parallelisation of $69-34=35$ (51\%) of these conjunctions.
Unfortunately, programs that are large and complex enough
to show a performance effect from this saving
also tend to have large components
that cannot be profitably parallelised with existing techniques,
%and incur the overheads of thread safety in the garbage collector and cannot
%use an extra real CPU register.
which means that (due to Amdahl's law)
our autoparallelisation system cannot yield overall speedups for them yet.
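
Amdahl's law makes this limit concrete.
As a sketch, with symbols introduced here purely for illustration:
if a fraction $p$ of a program's sequential runtime
can be parallelised perfectly over $n$ engines
and the remainder stays sequential,
then the best possible overall speedup is
\[
    S(n) = \frac{1}{(1 - p) + p / n} .
\]
For a hypothetical program with $p = 0.1$ running on four engines,
$S(4) = 1 / (0.9 + 0.025) \approx 1.08$.
This figure is illustrative only;
it is not a measurement of the Mercury compiler,
and it ignores the overheads of parallel execution,
which can easily outweigh so small a gain.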

On the bright side,
our feedback tool can process the profiles of small programs like these
benchmarks in less than a second 
and in only a minute or two even for much larger profiles.
The extra time taken by the Mercury compiler
when it follows the recommendations in feedback files
is so small that it is not noticeable.


\section{Related work}
\label{sec:overlap_related}

\status{Draft finished, This is ready for review by Zoltan.}

% Mercury's strong mode system
% greatly simplifies the parallel execution of logic programs,
% making the comparison of parallel Mercury with parallel Prolog difficult.
% For example, \cite{Hermenegildo1995} defines non-strict
% goal independence such that goals that are non-strictly independent can be
% run in parallel without leading to incorrect results.
% Because Mercury
% statically determines a single goal in a conjunction to bind each variable,
% and because Mercury does not permit variables to be aliased,
% the conditions of non-strict goal independence
% are not necessary for Mercury to guarantee correctness.
% Similarly, other existing work on AND-parallelism in Prolog
% is not closely related to the present work,
% because Mercury sidesteps the
% problems that work seeks to overcome.
% \peter{Is that too hand-wavey and dismissive?}

Mercury's strong mode and determinism systems
greatly simplify the parallel execution of logic programs.
As we discussed in
Sections~\ref{sec:intro_par_logic},~\ref{sec:backgnd_merpar} and~\ref{sec:backgnd_deppar},
they make it easy to implement dependent AND-parallelism efficiently.
The mode system provides complete information allowing us to identify
shared variables at compile time.
%\citet*{DBLP:journals/tcs/GrasH09} and \citet{Hermenegildo1995} describe
%much less complete analyses in Prolog.

% That is what they were \emph{designed} to do.
% The information gathered by semantic analysis in Mercury
% Many problems in the parallel execution of Prolog and Prolog-like languages,
% like testing the independence of goals
% in systems that support only independent AND-parallelism,
% discovering producer-consumer relationships at runtime
% in systems that also support dependent AND-parallelism,
% and having to handle nondeterministic conjuncts,
% disappear completely,
% with the answers to the problem being presented on a silver platter
% Our group designed Mercury specifically to ensure this.

% We know of no work in other logic programming languages such as Prolog that
% is compareable to our overlap analysis.
% This is probably because no other logic programming environment provides the
% information necessary to attempt such an analysis.

Most research in parallel logic programming so far
has focused on trying to solve the problems
of getting parallel execution to \emph{work} well,
with only a small fraction trying to find
when parallel execution would actually be \emph{worthwhile}.
Almost all previous work on automatic parallelisation
has focused on granularity control:
reducing the number of parallel tasks while increasing their size
\citep{lopez96:granularity},
and properly accounting for the overheads
of parallelism itself \citep{shen_98_granularity-control}.
Most of the rest has tried to find opportunities
to exploit independent AND-parallelism
during the execution of otherwise-dependent conjunctions
\citep{DBLP:journals/jlp/MuthukumarBBH99,DBLP:conf/lopstr/CasasCH07}.

Our experience with our feedback tool shows that
for Mercury programs, this is far from enough.
For most programs,
it finds enough conjunctions with two or more expensive conjuncts,
but almost all are dependent,
and, as we mention in Section~\ref{sec:overlap_perf},
many of these have too little overlap to be worth parallelising.
% For example, the Mercury compiler contains
% 50 conjunctions with two or more expensive goals.
% 49 of these are dependent.
% Of these, only 38 of these have any overlap,
% and only for 31 does the overlap
% lead to a predicted local speedup of more than 1\%.

We know of only four attempts to estimate the overlap
between parallel computations,
and two of these are in Mercury.

\citet*{tannier:2007:parallel_mercury} attempted to automatically parallelise
Mercury programs.
\citeauthor{tannier:2007:parallel_mercury}'s
approach was to use the number of shared variables in a parallel
conjunction as a proxy for how dependent the conjunction was.
While two conjuncts are indeed less likely
to have useful parallel overlap if they have more shared variables,
we have found this heuristic too inaccurate to be useful.
Conjunctions with only a single shared variable can vary significantly in
their amount of overlap, as we have shown with the examples in this
chapter.
Analysing overlap properly is important as most parallel conjunctions have
very little overlap,
making their parallelisation wasteful.
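
A deliberately simplified model illustrates why counting shared variables is
not enough.
The symbols below are introduced only for this sketch,
which ignores the overheads of spawning, signalling and waiting,
and assumes that each conjunct has an engine to itself;
it is not the full algorithm used by our analysis.
Suppose conjuncts $p$ and $q$ have sequential costs $C_p$ and $C_q$,
$p$ produces the single shared variable at time $t_{\mathit{prod}}$
after it starts,
and $q$ first consumes that variable at time $t_{\mathit{cons}}$
after it starts.
If both start together,
$q$ runs until $t_{\mathit{cons}}$,
waits if necessary until the variable is available,
and then executes its remaining $C_q - t_{\mathit{cons}}$ units of work,
so the parallel execution time is
\[
    T_{\mathit{par}} =
        \max\left(C_p,\;
            \max(t_{\mathit{prod}}, t_{\mathit{cons}})
                + C_q - t_{\mathit{cons}}\right) ,
\]
compared with a sequential time of $T_{\mathit{seq}} = C_p + C_q$.
With the hypothetical costs $C_p = C_q = 100$,
early production and late consumption
($t_{\mathit{prod}} = 10$, $t_{\mathit{cons}} = 90$)
give $T_{\mathit{par}} = \max(100, 90 + 10) = 100$,
a speedup of 2,
whereas late production and early consumption
($t_{\mathit{prod}} = 90$, $t_{\mathit{cons}} = 10$)
give $T_{\mathit{par}} = \max(100, 90 + 90) = 180$,
a speedup of only about 1.11.
Both conjunctions have exactly one shared variable.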

After
\citeauthor{tannier:2007:parallel_mercury}'s
work,
we attempted to estimate overlap more accurately in
our prior work~\citep{bone:2008:hons}.\footnote{
    This earlier work was done as part of my honours project and contributed
    to the degree of Bachelor of Computer Science Honours at The University of
    Melbourne.}
Compared with the work in this chapter,
our earlier work performed a much simpler analysis of the
parallelism available in dependent parallel conjunctions.
It could handle only conjunctions with two conjuncts and a single shared
variable.
It also did not use the ancestor context specific information provided by the
deep profiler or perform the call graph traversal.
The call graph traversal is important as it allows us to avoid parallelising
a callee when the caller already provides enough parallelism.
Most critically, our earlier work did not attempt to calculate the costs of
recursive calls,
and therefore failed to find any parallelism in any of the test programs we
used,
including the raytracer that we use in this chapter.

Another dependency aware auto-parallelisation effort
was in the context of speculative execution in imperative programs.
Given two successive blocks of instructions,
\citet*{von_Praun:2007:implicit_parallelism_with_ordered_transactions}
% estimates the likely speedup
% from executing the two blocks in parallel
% by using the difference between the addresses of two instructions
decide whether the second block should be executed speculatively
based on the difference between the addresses of two instructions,
one that writes a value to a register and one that reads from that register.
% This is effectively a binary metric.
This only works when instructions take a bounded time to execute;
in the presence of call instructions,
this heuristic will be inaccurate.

The most closely related work to ours is that of
\citet*{Pontelli97automaticcompile-time}.
They generated parallelism annotations for the ACE AND/OR-parallel system.
They recognised the varying amounts of parallelism that may be available in
dependent AND-parallel code.
This system used, much as we do,
estimates of the costs of calls
and of the times at which variables are produced and consumed.
It produced its estimates through static analysis of the program.
This can work for small programs,
where the call trees of the relevant calls can be quite small and regular.
However, in large programs, the call trees of the expensive calls
are almost certain to be both tall and wide,
with a huge gulf between best-case and worst-case behaviour.
\citeauthor{Pontelli97automaticcompile-time}'s analysis is a whole program
analysis,
which can also be troublesome for large programs.
Our whole-program analysis covers only the parts of the call graph that are
deemed costly enough to be worth exploring,
which is another benefit of profiler feedback analysis.
\citeauthor{Pontelli97automaticcompile-time}'s analysis of the variable use
times is incomplete when analysing branching code.
We recognised the impact that diverging code may have and created coverage
profiling so that we could gather information and analyse diverging code
accurately \citep{bone:2008:hons}.
Using profiling data is the only way
for an automatic parallelisation system to find out
what the \emph{typical} cost and variable use times are.
Finally, \citeauthor{Pontelli97automaticcompile-time}'s overlap calculation is
pairwise:
it considers the overlap of only two conjuncts at a time.
We have found that one must consider the overlap of the whole parallel
conjunction, as a delay in the production of a variable in one conjunct can
affect the variable production times of other conjuncts.
This is also why we attempt to perform a complete search for the best
parallelisation.
Changes anywhere within a conjunction's execution can affect other parts of
the conjunction.
For example,
the addition of an extra conjunct at the end of the conjunction can create a
new shared variable which increases the costs of earlier conjuncts.

% There is a risk that the program could have changed between the
% profiling build and the parallelised build,
% this makes it more difficult for the compiler to apply the profiling
% advice.
% To reduce this risk the profiling build should be built with the same
% optimizations that the parallelised build will be built with.
% In usual circumstances inlining should be disabled during profiling so
% that a programmer can more easily understand their program's profile.
% Our implementation re-enables inlining in profiling builds if a
% suitable optimization level is selected and
% \code{--profile-for-implicit-parallelism} is passed to the compiler.
% % XXX: These details may be unimportant, especially the name of this
% % compiler option,  But this is (for now) an easy way to describe
% % this.

Our system's predictions of the likely speedup from parallelising a conjunction
are also fallible, since they currently ignore several relevant issues,
including cache effects
and the effects of bottlenecks
such as CPU-memory buses and stop-the-world garbage collection.
However, our system seems to be a sound basis for further refinements like
these.
% However, they come much closer
% to predicting actual overlaps than previous attempts,
% and our system seems to be a sound basis for further refinements.
% \begin{itemize}
% \item
% It is hard to define what a typical workload is,
% and we do not yet implement profile merging.
% \item
% The feedback framework is general purpose
% and can be used for other optimizations.
% \item
% \zoltan{I haven't covered any technical details about the feedback framework.
% I guess there's not much to say.}
% \end{itemize}
In the future, we plan to support parallelisation as a specialisation:
applying a specific parallelisation only when a predicate is called
from a specific parent, grandparent or other ancestor.
% we will look at how best to resolve cases
% where our tool gives different parallelisation advice for the same conjunction
% due to the different behaviour of that conjunction in different contexts.
We also plan to modify our feedback tool
to accept several profiling data files,
with a priority or (weighted) voting scheme to resolve any conflicts.
% between their advice.
There is also potential further work in the handling of loops and recursion.
We will talk about this in Chapter~\ref{chap:conc}
after we improve the efficiency of some loops in the next chapter.

This chapter is an extended and revised version of
\citet{bone:2011:overlap}.


% TODO Items.

% \begin{algorithm}
% \begin{verbatim}
% MaxBefore := 0
% N := num_conjuncts(Conjs)
% for i in 1 to N:
%     if conjunct i in Conjs is below threshold then
%         MaxBefore := i
%     else
%         break
%
% MinAfter := N+1
% for i in N downto 1:
%     if conjunct i in Conjs is below threshold then
%         MinAfter := i
%     else
%         break
%
% BestTime := infinity
% Arrangements := [[[conjunct MaxBefore+1]]]
% # each element in Arrangements is
% #   a list of parallel conjuncts
% # each parallel conjunct consists of
% #   a list of consecutive conjuncts
% for i in MaxBefore+2 to MinAfter-1:
%     NewArrangements := []
%     for Arrangement in Arrangements:
%         ExtendLast := all_but_last(Arrangement)
%             ++ [last(Arrangement) ++ conjunct i]
%         AddNewLast := Arrangement ++ [conjunct i]
%         NewArrangements := NewArrangements ++
%             [ExtendLast, AddNewLast]
%     Arrangements := NewArrangements
%
%     for Before in 0 to MaxBefore:
%         for After in MinAfter to N+1:
%             GoalsBefore := conjuncts 1 .. Before in Conjs
%             GoalsAfter  := conjuncts After .. N in Conjs
%             # GoalsBefore and/or GoalsAfter may be empty
%             ExtraGoalsBefore := conjuncts (Before+1) .. MaxBefore in
%                Conjs
%             ExtraGoalsAfter := conjuncts MinAfter .. (After-1) in
%                Conjs
%
%             for each Arrangement in Arrangements
%                 Arrangement := [ExtraGoalsBefore ++ first(Arrangement)] ++
%                    all_but_first_and_last(Arrangement) ++
%                     [last(Arrangement) ++ ExtraGoalsAfter]
%                 ParConj := par_conj(Arrangement)
%                 OverallGoal :=
%                     seq_conj(GoalsBefore ++ [ParConj] ++ GoalsAfter)
%                 Time := compute_par_exec_time(OverallGoal)
%                 if Time < BestTime:
%                     BestTime := Time
%                     BestGoal := OverallGoal
% \end{verbatim}
% \caption{Search for best parallelisation}
% \label{alg:branch_and_bound_search}
% \end{algorithm}