\section{Experiments}\label{sec:experiments}
In this section, we present our experiments.
First, we describe the experimental setup: the machines, algorithms, datasets and queries used.
Second, we vary the linear-search threshold of the binary search used by our \textsc{WCOJ}
algorithms to determine the best values.
Then, we compare the run-time of a sequentially run Spark binary join with our
reference implementation of the \textsc{LFTJ}.
Next, we compare the \textsc{LFTJ} algorithm with GraphWCOJ, our specialization for
graph pattern matching.
This is followed by an analysis of the scaling behaviour of GraphWCOJ in Spark's local mode on
a single machine.
Finally, we show the speedup our distributed version reaches on four machines with up
to 384 cores and compare its run-time with Spark's binary joins in a distributed setting.
We run our experiments on machines of the type \textit{diamond} of the Scilens cluster of the CWI Database Architecture research
group.
These machines feature 4 Intel Xeon E5-4657Lv2 processors with 12 cores each and 2-way hyperthreading (48 cores / 96 threads).
Each core has 32 KB of first-level cache and 32 KB of second-level cache.
The third-level cache provides 30 MB, shared between the 12 cores of a socket.
The main memory consists of 1 TB of DDR3 RAM.
The machines run a Fedora version 30 Linux system with the 5.0.17-300.fc30.x86\_64 kernel.
We use Spark 2.4.0 with Scala 2.11.12 on Java OpenJDK 1.8.
In the majority of our experiments, we use Spark in its standard configuration with code generation enabled.
We also tune the parameters for driver and executor memory usage (\texttt{spark.driver.memory} and \texttt{spark.executor.memory}) to fit
all necessary data into main memory.
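The sketch below shows how such a session might be configured; the parameter names are the real Spark settings, but the memory sizes and core count are illustrative and not the exact values used in the thesis.
\begin{verbatim}
import org.apache.spark.sql.SparkSession

// Minimal sketch of the local-mode session setup (illustrative values).
// Note: in local mode the driver JVM is already running, so the driver
// memory is normally passed before launch, e.g. via spark-submit's
// --driver-memory flag; it is shown here only to name the parameter.
val spark = SparkSession.builder()
  .appName("wcoj-experiments")
  .master("local[48]")                    // one worker thread per physical core
  .config("spark.driver.memory", "64g")
  .config("spark.executor.memory", "64g")
  .getOrCreate()
\end{verbatim}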
\subsection{Algorithms}
In our experiments, we use four different join algorithms.
Two of them are worst-case optimal joins:
our Leapfrog Triejoin implementation and Graph\textsc{WCOJ}, a Leapfrog Triejoin specialized for
graph pattern matching and developed in this thesis.
\textsc{LFTJ} is only run sequentially, as a baseline for GraphWCOJ.
We compare these two algorithms in \cref{subsec:lftj-vs-graphWCOJ}.
The other two algorithms are Spark's binary joins: \texttt{BroadcastHashjoin} and \texttt{SortMergeJoin}.
We compare them against the sequential version of \textsc{LFTJ} and GraphWCOJ in \cref{subsec:spark-vs-lftj}.
We adjust the \texttt{spark.sql.autoBroadcastJoinThreshold} parameter to control
whether Spark uses a \texttt{BroadcastHashjoin} or a \texttt{SortMergeJoin}.
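A minimal sketch of how this switch can be made on an existing session; the parameter is a standard Spark SQL setting, and the concrete values are only examples.
\begin{verbatim}
// Force SortMergeJoin by disabling broadcasts entirely.
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", "-1")

// Allow BroadcastHashjoin for tables up to ~512 MB (value in bytes).
spark.conf.set("spark.sql.autoBroadcastJoinThreshold",
               (512L * 1024 * 1024).toString)
\end{verbatim}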
\subsection{Datasets}
We run our experiments on multiple datasets from two different use-cases: social networks and product co-purchase.
We explain our choice in the next paragraphs.
\Cref{table:datasets} includes a list of all graph datasets mentioned throughout the thesis.
\begin{table}[]
\centering
\begin{tabular}{llrrl} \toprule
Name & Variant & Vertices & Edges & Source \\ \midrule
\textbf{SNB}          & sf1     &           & 453,032     & \cite{snb}      \\
\textbf{Amazon} & 0302 & 262,111 & 1,234,877 & \cite{snapnets} \\
& 0601 & 403,394 & 3,387,388 & \cite{snapnets} \\
\textbf{Twitter} & sc-d & 81,306 & 1,768,135 & \cite{snapnets} \\
\textbf{LiveJournal} & & 4,847,571 & 68,993,773 & \cite{snapnets} \\
\textbf{Orkut} & & 3,072,441 & 117,185,083 & \cite{snapnets} \\ \bottomrule
\end{tabular}
\caption{A summary of all datasets mentioned in the thesis.}
\label{table:datasets}
\end{table}
The SNB benchmark~\cite{snb} generates data emulating the posts, messages and friendships in a social network.
For our experiments, we only use the friendship relationship (\texttt{person\_knows\_person.csv}), which is undirected.
Only edges with \textit{src} $<$ \textit{dst} exist; we generate the opposing edges before loading the dataset, such
that the edge table becomes truly undirected.
The benchmark comes with an extensively parameterizable graph generation engine
which allows us to experiment with sizes as small as 1GB and up to 1TB for big experiments and different levels of selectivity.
The different sizes are called scale-factor or sf, e.g. SNB-sf1 refers to a Social network benchmark dataset generated with
default parameters and scale-factor 1.
The Amazon co-purchasing network contains edges between products that have been purchased together and hence are closely related to each other~\cite{snapnets}.
This is a directed relationship from the product purchased first to the product purchased second, both directions of an edge can exist if the order in which
products have been purchased varies.
The Snap dataset collection contains multiple Amazon co-purchase datasets, each of them containing a single day of purchases.
We choose the smallest and biggest datasets, from the 2nd of March and the 1st of June 2003, which we call Amazon-0302 and
Amazon-0601.
We pick co-purchase datasets for evaluation because former work often concentrated on social networks and web-crawl-based
graphs~\cite{myria-detailed,ammar2018distributed}, but~\cite{salihoglu2018} points out that the biggest graphs are actually graphs like
the aforementioned Amazon graph, containing purchase information.
To allow comparisons with former work, we run a subset of our experiments on the Twitter social circle network from~\cite{snapnets}.
This dataset includes the follower relationships of one thousand Twitter users; each of these follows 10 to 4,964 other users, and
the relationships between these followed users are included as well.
The LiveJournal and Orkut graphs represent the friendship relationships of medium-sized social networks.
\subsection{Queries}
In this section, we detail the graph patterns used throughout our experiments.
Most of the queries are cyclic because that is the primary use-case for \textsc{WCOJ} in former research~\cite{olddog,myria-detailed}.
\textsc{WCOJ}s also have been successfully applied to selective path queries in~\cite{olddog,longbin}.
We apply filters to most of our queries to make them more realistic, e.g.\ a clique query makes more sense if it is combined with a
smaller-than filter, which requires that the attributes are bound such that \textit{a} is smaller than \textit{b}, which is smaller than \textit{c}.
Otherwise, one gets the same clique in all possible orders, which not only takes much more time but is also most
likely not the result a user would want.
We ensure that filters can be pushed down through or into the join by Spark as well as by the \textsc{WCOJ}, to compare both algorithms on an equal basis.
A complete list of all queries and filters used is shown in~\cref{table:patterns}.
\Cref{fig:all-queries} depicts all graph patterns.
Patterns and filters are combined as follows.
Cliques and the Kite query use smaller-than filters, which require the bindings to increase
in value according to the variable ordering.
All other queries are run with a filter such that each of their bindings must be distinct.
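As an illustration, the following sketch expresses the filtered triangle (3-clique) query in Spark SQL; the edge relation, its column names (\texttt{src}, \texttt{dst}) and the file path are hypothetical, and the less-than predicates enforce $a < b < c$ so that every triangle is reported exactly once.
\begin{verbatim}
// Hypothetical edge relation e(src, dst), loaded from an illustrative path.
val edges = spark.read.parquet("edges.parquet").toDF("src", "dst")
edges.createOrReplaceTempView("e")

val triangles = spark.sql("""
  SELECT e1.src AS a, e1.dst AS b, e2.dst AS c
  FROM   e e1
  JOIN   e e2 ON e1.dst = e2.src
  JOIN   e e3 ON e3.src = e1.src AND e3.dst = e2.dst
  WHERE  e1.src < e1.dst AND e2.src < e2.dst   -- less-than filter: a < b < c
""")
println(triangles.count())
\end{verbatim}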
\begin{table}[]
\centering
\begin{tabular}{@{}lccc@{}}
\toprule
Name & Parameters & Vertices & Edges
\\ \midrule
triangle & NA & $3$ & 3 \\
n-clique & \# vertices & $n$ & $1/2 \times n \times (n - 1)$ \\
n-cycle & \# vertices & $n$ & $n$ \\
n-s-path & \# edges / selectivity & $n$ & $n - 1$ \\
Kite & NA & $4$ & $5$ \\
House & NA & $5$ & $9$ \\
Diamond & NA & $4$ & $4$ \\
\textbf{Filters} & & & \\
distinct & & & \\
less-than & & & \\ \bottomrule
\end{tabular}
\caption{Summary of patterns and filters used.}
\label{table:patterns}
\end{table}
\begin{figure}
\centering
\subfloat[triangle]{\includesvg[width=0.2\textwidth]{svg/triangle}}
\subfloat[4-clique]{\includesvg[width=0.2\textwidth]{svg/4clique}}
\subfloat[5-clique]{\includesvg[width=0.2\textwidth]{svg/5clique}}
\subfloat[4-cycle]{\includesvg[width=0.2\textwidth]{svg/4cycle}}\\
\subfloat[5-cycle]{\includesvg[width=0.2\textwidth]{svg/5cycle}}
\subfloat[Diamond]{\includesvg[width=0.2\textwidth]{svg/diamond}}
\subfloat[Kite]{\includesvg[width=0.2\textwidth]{svg/kite}}
\caption{Queries used in our experiments.}
\label{fig:all-queries}
\end{figure}
For a selective path query, we first select two sets of nodes according to the \textit{selectivity} parameter.
Then we search for all paths of a certain length given by the \textit{edges} parameter, e.g.\ 4-0.1-path finds all
paths of length 4 between two randomly selected, fixed sets of vertices.
The sets of nodes each contain roughly 10\% of all input nodes and are
not guaranteed to be disjoint.
\subsection{Linear search threshold}\label{subsec:linear-search-threshold}
We run \textsc{LFTJ} and GraphWCOJ with different settings for the \textit{linear search threshold}.
As explained in \cref{subsubsec:leapfrog-triejoin}, we use a binary search to implement the \textit{seek}
method of the \textit{TrieIterators};
it is also used for GraphWCOJ.
It is well known that a binary search can be optimized by ending it with a linear search on small
search spaces, because linear memory access patterns are cheaper than random accesses.
The threshold gives the size of the search space from which on a linear search is used instead of
a binary search, e.g.\ a threshold of 40 means that the algorithm switches to a linear search
once the search space is 40 numbers or less.
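The following sketch shows the idea of such a threshold-terminated binary search; the method and variable names are illustrative and do not reflect the exact implementation of the \textit{seek} method in the thesis.
\begin{verbatim}
// Lower-bound search: returns the index of the first element >= key.
// The binary search stops once the remaining range is small enough,
// and a linear scan takes over (sequential accesses are cheaper than
// further random probes).
def seek(values: Array[Int], key: Int, linearSearchThreshold: Int): Int = {
  var lo = 0
  var hi = values.length                 // exclusive upper bound
  while (hi - lo > linearSearchThreshold) {
    val mid = lo + (hi - lo) / 2
    if (values(mid) < key) lo = mid + 1 else hi = mid
  }
  while (lo < hi && values(lo) < key) lo += 1   // linear tail
  lo
}
\end{verbatim}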
We note that \textsc{LFTJ} and GraphWCOJ could behave differently for the same threshold.
This is because Leapfrog Triejoin uses a binary search for both levels of its \textit{TrieIterators},
while GraphWCOJ only uses the binary search for the second levels;
the first level is indexed in a \textsc{CSR}.
In this experiment, we vary the threshold between 1 and 1600 to determine the best value.
These values are chosen such that 1 does not trigger any linear search and that 1600
does not improve the performance anymore (for \textsc{LFTJ}) and triggers no
binary search for GraphWCOJ.
We do so for the 5-clique query on the SNB-1 and Twitter datasets.
The results are shown in~\cref{fig:linear-search-threshold}.
The optimum for \textsc{LFTJ} is around 200, while GraphWCOJ shows the best performance at 1600 and 800 for
SNB-1 and Twitter, respectively.
This means that GraphWCOJ performs best when there are nearly no binary searches.
A threshold of 1600 triggers no binary search in either of the datasets.
\begin{figure}
\centering
\subfloat[SNB-1\label{fig:linear-search-threshold-snb}]{\includesvg[width=0.5\textwidth]{svg/linear-search-threshold-snb}}
\subfloat[Twitter\label{fig:linear-search-threshold-twitter}]{\includesvg[width=0.5\textwidth]{svg/linear-search-threshold-twitter}}
\caption{Runtime of \textsc{WCOJs} with different settings for the linear search thresholds.}
\label{fig:linear-search-threshold}
\end{figure}
Also, we note that the effect the linear search threshold has on the performance is bigger for the Leapfrog Triejoin.
We explain these observations as follows: \textsc{LFTJ} uses searches on both the first
and the second level of the \textit{TrieIterators}, while GraphWCOJ uses them only on the second
level.
Hence, the impact of the threshold is bigger for \textsc{LFTJ}.
The optimal values differ because the data of the two levels is distributed differently.
The first level lists nearly all vertices, while the second level consists of adjacency lists, which are
sparser.
Hence, we assume that the linear searches on the first level are generally longer than
those on the second level;
note that the threshold only gives a maximum length for linear searches, which is not
necessarily a good indicator of the length of the searches actually performed.
We also tried using different threshold values for the two levels in \textsc{LFTJ}.
We chose 200 for the first level and 1600 for the second level because these
are the optimal values according to our experiments.
However, no significant performance gain could be measured.
This is most likely because the runtime is dominated by the searches on the first level.
For simplicity, we do not use two different thresholds for \textsc{LFTJ} in any further experiments.
From this experiment, we conclude that the optimal threshold is 200 for \textsc{LFTJ} and 800
for GraphWCOJ.
We choose 800 for GraphWCOJ because it is on the safe side:
the performance of a binary search degrades less than that of a linear search.
We set these values accordingly in all further experiments.
\subsection{Baseline: \texttt{BroadcastHashjoin} vs \textsc{LFTJ}} \label{subsec:spark-vs-lftj}
In this experiment, we compare the runtime of our sequential Leapfrog Triejoin implementation with the runtime of Spark's \texttt{BroadcastHashjoin}.
Towards this goal, we ran all queries from~\cref{table:patterns} on three of our datasets: Amazon-0302, Amazon-0601
and SNB-sf1.
Our experiment measures the time it takes to perform a count on the cached dataset using \texttt{BroadcastHashjoin} and
\textsc{LFTJ}.
For \texttt{BroadcastHashjoin}, the time to run the whole query is reported.
For the \textsc{LFTJ}, we report the time it takes to run the join, excluding setup time.
Setup time includes sorting and materialization.
This section focuses on comparing the runtimes excluding the setup time, because the final system is meant to cache the data, readily sorted
and formatted as a \textsc{CSR}, and reuse it for multiple queries.
We anticipate that this is necessary to benefit from \textsc{WCOJ}s in general.
We compare against Spark's \texttt{BroadcastHashjoin} instead of \texttt{SortMergeJoin} because even when all data is arranged in a single
partition, for simple sequential processing, Spark schedules its \texttt{SortMergeJoin} to use a shuffle.
A shuffle writes and reads data to and from disk.
Hence, \texttt{SortMergeJoin} is much slower than a \texttt{BroadcastHashjoin}.
We compared the algorithms on the Amazon-0601 dataset for the triangle (8.1 seconds vs 58.9 seconds) and
5-clique pattern (32.9 seconds vs 850.9 seconds).
We assume that Spark can optimize its broadcasts when \texttt{local[1]} is used to start the Spark session because then Spark uses the driver as executor.
We point out that Spark's code generation is a huge advantage for the \texttt{BroadcastHashjoin} compared
to our interpreted \textsc{LFTJ}.
We ran Spark without code generation for comparison on the Amazon-0302 dataset for the triangle
query and 5-clique: with code generation Spark takes 3.1 and 4.2 seconds, without it 14 and 16 seconds.
We show our results in~\cref{fig:spark-vs-lftj}.
Next, we analyze the results.
We are able to beat Spark's \texttt{BroadcastHashjoin} on all datasets and queries except 5-clique and House on Amazon-0601 and
the path queries on SNB-1.
Generally, we see that for n-clique patterns the speedup over Spark decreases for bigger $n$.
This is because many binary joins in an n-clique are semi-joins, which do not increase but decrease the size of intermediary results,
e.g.\ for 5-clique on Amazon-0302 only 3 out of 9 joins lead to a bigger intermediary result.
The cycle query results are highly interesting because we see an increasing speedup for higher $n$ on Amazon-0601 but a
decreasing speedup on Amazon-0302.
The House and 5-clique patterns seem to be quite similar: the House is a 5-clique with two missing
edges.
However, as their result counts indicate, these two edges, together with the fact that we use a less-than filter for the 5-clique and a distinct
vertex filter (the bindings must be distinct but not ordered as in a less-than filter) for the House query, lead to dramatically different
outcomes.
Hence their different timing and speedup behaviour.
The Kite pattern consistently produces the second-highest speedup after the 3-clique,
most likely because a Kite is two triangles back-to-back.
The path query shows very different behaviour on the Amazon and the SNB datasets.
This might be due to the different selectivity; it is extremely high on the co-purchase datasets and rather low on the social network
benchmark.
This difference in selectivity is not surprising given that the SNB network fulfils the small-world property, while the
Amazon dataset relates products purchased together which naturally leads to multiple loosely connected components.
Finally, we observe that all three datasets lead to quite different results, which are most likely not comparable to each other without deeper research
into the characteristics of the datasets themselves.
In particular, it becomes clear that co-purchase datasets and social network datasets must have very different characteristics.
Although SNB-sf1 is much smaller than Amazon-0601, queries on it take a similar or even much longer time,
e.g.\ 5-clique takes 14.21 seconds on the bigger dataset and 12.65 seconds on the smaller one, even though the result set is much
smaller on SNB-sf1;
the 4-cycle takes roughly 8 times longer on the small dataset and has a much bigger result set.
In general, we see a higher speedup on SNB-sf1.
\begin{figure}
\centering
\subfloat[Amazon0302]{
\includesvg[width=0.5\textwidth]{svg/spark-wcoj-amazon}
\includesvg[width=0.5\textwidth]{svg/spark-wcoj-amazon-long}
}\\
\subfloat[Amazon0601]{
\includesvg[width=0.5\textwidth]{svg/spark-wcoj-amazon0601}
\includesvg[width=0.5\textwidth]{svg/spark-wcoj-amazon0601-long}
}\\
\subfloat[SNB-sf-1]{
\includesvg[width=0.5\textwidth]{svg/spark-wcoj-snb}
\includesvg[width=0.5\textwidth]{svg/spark-wcoj-snb-long}
}
\caption{Runtime of a Leapfrog Triejoin and Spark's \texttt{BroadcastHashjoin}}
\label{fig:spark-vs-lftj}
\end{figure}
%\begin{table}
% \centering
%
% \input{generated/seq-table-ama0302}
% \vspace{0.3cm}
%
% \input{generated/seq-table-ama0601}
%
% \vspace{0.3cm}
% \input{generated/seq-table-snb-sf1}
% \caption{Runtimes for \texttt{BroadcastHashjoin} and \texttt{seq}.
% The speedup is calculated between join times and excludes setup.
% From top to bottom for dataset: \texttt{ama-0302}, \texttt{ama-0601} and \texttt{snb-sf1}.
% All times in seconds.
% }
% \label{table:seq-vs-bhj}
%\end{table}
\subsection{\textsc{LFTJ} vs GraphWCOJ} \label{subsec:lftj-vs-graphWCOJ}
In this experiment, we compare sequential runs of the Leapfrog Triejoin and GraphWCOJ
on the Amazon, SNB-1 and Twitter datasets.
We do not show the run-time of Path queries because GraphWCOJ is not able to run
them\footnote{Path queries require us to filter the input relationship before
the join. This has not been implemented for GraphWCOJ due to time constraints.}.
We show the run-time of the queries on different datasets in~\cref{fig:lftj-graphWCOJ}.
We present the performance of the Leapfrog Triejoin, GraphWCOJ and GraphWCOJ without the
materialization optimization.
We first analyze the impact of using the \textsc{CSR} data structure as the basis for the join.
Throughout all datasets, we see that GraphWCOJ is faster than \textsc{LFTJ} for all queries.
The biggest speedups are reached for 5-cliques and the lowest speedups for 4-cycles.
The maximum speedup over all queries and datasets is 11.4 for a 5-clique query on Amazon-0601.
The lowest speedup is 1.2 on a 4-cycle query on SNB-sf-1.
GraphWCOJ shows the biggest speedup for the clique queries, and the speedup increases with the size of the clique.
The effect on the House and Kite queries is generally lower.
The Diamond and 4-cycle queries do not improve much when we use GraphWCOJ.
This can be explained by the fact that the \textsc{CSR} mostly helps by implementing accesses on
the first level of the \textit{TrieIterators} as two array reads, instead of the binary search needed by
the column-based implementation of the original Leapfrog Triejoin.
The denser the query, the more first-level \textit{TrieIterator} accesses are performed.
Hence, the clique, House and Kite queries profit more than the sparse cycle and Diamond queries.
For example, the 5-clique query uses 10 \textit{TrieIterators} with 10 first levels to iterate, while
the 4-cycle and Diamond queries need only 4 \textit{TrieIterators}.
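The sketch below illustrates this first-level access pattern on a \textsc{CSR}; the field and method names are illustrative, not the data structure of our implementation.
\begin{verbatim}
// CSR over a graph with n vertices: `offsets` has n + 1 entries and
// `adjacency` stores all (sorted) neighbour lists back to back.
class CSR(offsets: Array[Int], adjacency: Array[Int]) {
  // First-level access: two array reads yield the slice of `adjacency`
  // holding the neighbours of v, where a column-oriented LFTJ would need
  // a binary search over the source column.
  def neighbourRange(v: Int): (Int, Int) = (offsets(v), offsets(v + 1))

  // Second-level accesses still use a (threshold-terminated) binary
  // search, restricted to the slice returned above.
}
\end{verbatim}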
We find that materialization does not improve the algorithm much.
The strongest speedup can be seen for dense queries, e.g. 5-clique.
However, sparse queries, such as the 4-cycle, take longer with materialization enabled.
To explain these weak results, we used \textit{perf} to monitor cache hits and misses while running
GraphWCOJ with and without materialization on the 5-clique query on the SNB dataset.
On this query and dataset materialization shows its strongest impact.
We cannot measure any significant difference in cache utilization between the two configurations.
The remaining reason why materialization can improve the performance is that it uses binary
intersections starting with the smallest intersection, which leads to fewer \textit{seek} calls.
We measure this effect by counting all \textit{seek} calls for the different queries on the
SNB-1 dataset and show the results in~\cref{table:seek-calls}.
We note a clear correlation between the percentage of saved calls and the run-time difference with materialization:
Kite, 4-clique and 5-clique profit from the optimization and save more than 15\% of all \textit{seek} calls, while
the other queries show no big difference or a slow-down and save less than 10\% of the \textit{seek} calls.
In the next paragraph, we explain these results.
The effect on dense clique queries is highest because they employ intersections between up to
four adjacency lists, while the 4-cycle intersects at most two adjacency lists.
Hence, using binary intersections makes no difference for 4-cycles.
We still save \textit{seek} calls because we defer them for the first-level \textit{TrieIterators} until after building
the intersection.
This query gets slower with materialization because of the overhead of copying values into a new buffer.
\begin{table}
\centering
\begin{tabular}{lrrr}
\toprule
Query & w/o materialization & materialization & $\Delta$ [\%] \\ \midrule
3-clique & 34,739,080 & 33,526,024 & 3.4 \\
4-clique & 118,451,741 & 99,402,372 & 16.1 \\
5-clique & 262,304,687 & 192,296,784 & 26.7 \\
kite & 346,636,041 & 272,840,747 & 21.2 \\
house & 5,985,294,145 & 5,550,487,243 & 7.2 \\
4-cycle & 4,591,408,924 & 4,402,790,869 & 4.1 \\
diamond & 10,230,067,028 & 9,680,437,365 & 5.3 \\
\bottomrule
\end{tabular}
\caption{Count of all \textit{TrieIterator} \textit{seek} calls for different queries on the SNB-1 dataset with and
without materialization and difference in percent.}
\label{table:seek-calls}
\end{table}
\begin{figure}
\subfloat[Amazon-0302]{\includesvg[width=0.3\textwidth]{svg/lftj-graphWCOJ-amazon}}
\subfloat[Amazon-0601]{
\includesvg[width=0.3\textwidth]{svg/lftj-graphWCOJ-amazon0601}
\includesvg[width=0.3\textwidth]{svg/lftj-graphWCOJ-amazon0601-long}
}\\
\subfloat[SNB-1]{
\includesvg[width=0.3\textwidth]{svg/lftj-graphWCOJ-snb}
\includesvg[width=0.3\textwidth]{svg/lftj-graphWCOJ-snb-long}
}
\subfloat[Twitter]{
\includesvg[width=0.3\textwidth]{svg/lftj-graphWCOJ-twitter}
% \includesvg[width=0.3\textwidth]{svg/lftj-graphWCOJ-twitter-long}
}
\caption{
\textsc{WCOJ} run time of \textsc{LFTJ} and GraphWCOJ on different datasets and queries.
Diamond, House and cycle queries are not reported for Twitter because of their high
run-time of over an hour.
}
\label{fig:lftj-graphWCOJ}
\end{figure}
\subsection{Scaling of GraphWCOJ} \label{subsec:scaling-graphWCOJ}
In this section, we aim to analyse and compare the scaling of Graph\textsc{WCOJ} using different
partitioning schemes.
Towards this goal, we run Graph\textsc{WCOJ} on datasets of different sizes, namely Twitter,
LiveJournal and Orkut.
We compare two partitioning schemes: Shares and \textit{work-stealing}.
These are the two most promising schemes identified in \cref{sec:worst-case-optimal-join-parallelization}.
The experiment is performed on 3-clique, 4-clique and 5-clique.
3-clique is the smallest of our queries and therefore the most difficult to scale.
4-clique and 5-clique take much longer than 3-clique.
Hence, the comparison shows how query size influences the scaling.
Also, a bigger query increases the job size for the \textit{work-stealing} partitioning scheme.
We first describe our expectations of the experiment outcome.
We assume that scaling improves with the dataset size.
Hence, we should see the highest speedups for Orkut, then LiveJournal and the lowest speedups for Twitter.
Also, we expect the scaling to improve with the query size.
Both hypotheses are grounded in the fact that more work to distribute often leads to stronger scaling.
Additionally, we believe that \textit{work-stealing} shows better scaling than Shares because it does not duplicate work.
Finally, we have no clear-cut expectations for the scaling behaviour of \textit{work-stealing}.
Theoretically, we could expect linear scaling for it because no work is duplicated, synchronization overhead is minimal and
work balance should be given by the scheme.
However, we measure on a quite complex hardware platform, which complicates the scaling behaviour.
First of all, we work on a machine with 4 sockets.
This can influence scaling positively and negatively.
Positively, because adding more sockets means adding significantly more L3 cache (30 MB shared per socket).
If we do not use all cores on a socket, each used core can use a bigger share of this cache.
Negatively, because each socket is in a different NUMA zone and the graph is not guaranteed to be cached in all
NUMA zones.
Indeed, Spark shares the broadcasts for all tasks on a single executor:
there is only one copy in memory.
Additionally, we run on an Intel processor with hyperthreading.
Hence, we cannot expect linear speedup above 48 workers, because beyond that point two threads share the resources of
one core and cannot be expected to reach the same performance as two cores.
To conclude, we expect sub-linear speedup for Shares and better, but still sub-linear, speedup for \textit{work-stealing}.
Anyhow, super-linear scaling in MapReduce-like systems is not unheard of and could be possible on our machines.
We show the scaling behaviour of Shares and work-stealing on different datasets and for 3 queries in~\cref{fig:graphWCOJ-scaling}.
The run-times and speedups in raw numbers are given in~\cref{app:local-mode-scaling}.
\begin{figure}
\centering
\subfloat[Twitter dataset\label{fig:graphWCOJ-scaling-twitter}]
{\includesvg[width=0.5\textwidth]{svg/graphWCOJ-scaling-twitter}}
\subfloat[LiveJournal dataset\label{fig:graphWCOJ-scaling-livej}]
{\includesvg[width=0.5\textwidth]{svg/graphWCOJ-scaling-livej}}
\newline
\subfloat[Orkut dataset\label{fig:graphWCOJ-scaling-orkut}]
{\includesvg[width=0.5\textwidth]{svg/graphWCOJ-scaling-orkut}}
\subfloat[Twitter 3-clique join run-time only without overheads \label{fig:twitter-3-clique-scaling}]
{\includesvg[width=0.5\textwidth]{svg/twitter-3-clique-scaling}}
\caption{Scaling behaviour of Shares and work-stealing on three different datasets
and two different queries.
The batch size parameter for \textit{work-stealing} is chosen for balance between lock contention and worker skew:
50 for Twitter and 3-clique on LiveJournal, 1 for 5-clique on LiveJournal and 20 on the Orkut dataset.
}
\label{fig:graphWCOJ-scaling}
\end{figure}
We describe our observations per dataset, starting with Twitter.
As expected, both partitioning schemes scale better when we increase the query size.
For 5-clique, \textit{work-stealing} exhibits near-linear scaling up to 48 workers, while
3-clique reaches a speedup of 6.22 for 8 workers.
The highest speedup for 5-clique is 45 on 96 workers; 3-clique reaches its highest speedup
of 13.2 on 64 workers.
Shares lags behind in scaling for both queries and all levels of parallelism.
Its best observed speedup is 21.3 for 5-clique and 96 workers.
We note that the 3-clique query does not scale well because there is not enough work.
Therefore, the run-time is dominated by overheads, e.g.\ the time it takes until Spark starts
the first job and the time it takes Spark to finalize the query.
The overhead, calculated as $(\mathit{queryEnd} - \mathit{queryStart}) - (\mathit{lastTaskCompleted} - \mathit{firstTaskScheduled})$, is
roughly 0.13 seconds for all levels of parallelism, while the whole query runs for 0.19 seconds on
48 cores.
We depict the speedup achieved when we measure only the time spent with the join in \cref{fig:twitter-3-clique-scaling}.
% TODO bad scaling for hyperthreading
The experiment on LiveJournal confirms our hypothesis that bigger datasets lead to better
speedups;
the highest observed speedup is 61.2 for \textit{work-stealing} on 3-clique and
36.81 for Shares on 5-clique, each with 96 workers.
Also, we can confirm that Shares scales better on 5-clique than on 3-clique, with the exception of 32 workers.
However, this is not the case for \textit{work-stealing}:
\textit{work-stealing} shows better speedups on 3-clique than on 5-clique.
Nevertheless, \textit{work-stealing} beats Shares on both queries and all levels of parallelism.
% TODO analyse, higher skew rigth, at least name reason, is skew higher because of 0th vertice?
% TODO effect of dataset size leads to 5-clique scaling the same or worse
Additionally, we see three unexpected scaling behaviours for LiveJournal.
First, super-linear scaling for 3-clique and \textit{work-stealing}.
We hypothesize that this is because, if the 32 processes are distributed over all 4 sockets, they share in total
120 MB of L3 cache, while a single process can use only 30 MB of L3 cache.
%To confirm this we rerun the experiment with 1, 8, 16 and 32 workers while using \textit{taskset} to bind the application
%to the first 8, 16 or 32 cores.
%This rules out the use of more than 1, 2 or 3 sockets respectively.
%In this experiment, we measure speedup of 8.6, 16.6 and 32.9 for 8, 16 and 32 workers.
%This is significantly lower than the speedup measurements without \textit{taskset}.
%We conclude that this confirms our hypothesis and believe that the slight super-linear
%scaling that remains arises from the bigger amount of L1 and L2 cache in the system.
% TODO run with perf counter
% TODO note why do we have locallity in data accesses?
Second, the scaling of 4- and 5-clique is worse than for 3-clique.
Our explanation is that 4- and 5-clique show skew even with \textit{work-stealing}.
% TODO skew relationship
This is because \textit{work-stealing} partitions work along the first variable binding.
Hence, the size of a single job in \textit{work-stealing} is never smaller than the work of finding all bindings for a single
first binding.
That means a single long-running job towards the end of the \textit{work-stealing} queue can
result in one task running longer than the others, which delays the whole query.
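A minimal sketch of such a shared queue over first variable bindings follows; the class and its synchronization are illustrative (the thesis implementation may use a different mechanism), but it shows why a job is never smaller than one first binding and how the batch size coarsens it further.
\begin{verbatim}
import java.util.concurrent.atomic.AtomicInteger

// Workers repeatedly claim `batchSize` consecutive first-level bindings from
// a shared counter; each claimed binding stands for *all* work below it.
class WorkStealingQueue(numFirstBindings: Int, batchSize: Int) {
  private val next = new AtomicInteger(0)

  /** Half-open range [from, until) of first bindings, or None when exhausted. */
  def nextBatch(): Option[(Int, Int)] = {
    val from = next.getAndAdd(batchSize)
    if (from >= numFirstBindings) None
    else Some((from, math.min(from + batchSize, numFirstBindings)))
  }
}
\end{verbatim}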
We measure the time at which each task finishes.
We define skew for \textit{work-stealing} as the time between when the average worker finishes and
when the last worker finishes.
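Written out (our reading of this definition), with $t_i$ denoting the finish time of worker $i$ out of $n$ workers:
\[
  \mathit{skew} = \max_{1 \le i \le n} t_i - \frac{1}{n} \sum_{i=1}^{n} t_i .
\]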
In \cref{table:skew-liveJ} and \cref{table:skew-orkut}, we show the total skew and the percentage of
skew in the whole query time for the LiveJournal and Orkut datasets, respectively.
We note that the residual skew correlates with the scaling behaviour;
the higher the percentage of skew in the whole query time the worse the query scales.
The total skew grows with the level of parallelism.
We explain this as follows.
Let us assume there is at least one job which takes significantly longer than most of the jobs in
the work-stealing queue.
This job is picked up by a worker.
When more executors are added to work on the remaining jobs, the likelihood of this one big
job adding significant skew rises because the other jobs are finished faster.
This experiment shows that the job size is not fine-grained enough for bigger queries and a high level of parallelism.
We see that the skew can rise to nearly half of the total run-time for the 5-clique query
on Orkut.
We address this issue in~\cref{subsubsec:finer-grained-work-stealing}.
\begin{table}
\centering
\input{generated/skew-liveJ.tex}
\caption{
Total skew in seconds and percentage of skew in the total query time displayed for different
queries and levels of parallelism on the LiveJournal dataset.
}\label{table:skew-liveJ}
\end{table}
\begin{table}
\centering
\input{generated/skew-orkut.tex}
\caption{
Total skew in seconds and percentage of skew in the total query time displayed for different
queries and levels of parallelism on the Orkut dataset.
}\label{table:skew-orkut}
\end{table}
Third, Shares exhibits a speedup of only 12.1 for 64 workers, which is
lower than for 32 workers (14.7) and for 96 workers (14.8).
This can be explained by the chosen Shares configuration.
For 32 workers, the best configuration is given by the hypercube of the sizes 4, 4, 2.
For 64 workers, we get the hypercube with 4 workers on each axis.
Hence, although we double the number of workers, we use the new workers only to partition
work along the last axis, in the case of 3-clique along the C attribute axis.
Partitioning work along the last axis leads to a high amount of duplicated work on the first
two axes.
Additionally, with 64 workers at least 12 of these workers do not run on exclusive cores but on cores shared by
two hyperthreads.
In total, we get a lower speedup.
This changes slightly for 96 workers because the optimal hypercube configuration here is 6, 4, 4, which
adds more workers along the first axis.
However, the speedup increases only marginally (14.7 to 14.8) compared to 32 workers, which is quite disappointing given
that the number of threads increased threefold.
One could argue that we should use a different definition of the \textit{best} hypercube configuration.
As we see, it is not necessarily efficient to distribute the computation along the last axis.
%We implemented to version of the configuration finder that considers only the first \textit{i} axes and
%call this partitioning scheme \textit{i-prefixShares}.
%However for time constraints, we do not investigate this issue further and do not include \textit{i-prefixShares}
%in our further experiments.
The Orkut and LiveJournal datasets lead to highly similar scaling results:
super-linear scaling for \textit{work-stealing} up to 48 workers, and
\textit{work-stealing} scaling significantly better than Shares.
Shares exhibits less speedup for 64 workers than for 32 and 96 workers.
\subsection{Distributed work-stealing}\label{subsec:distributed-work-stealing}
We run the distributed version of work-stealing as described in~\cref{subsubsec:distributed-work-stealing}
on the LiveJournal and Orkut datasets for the 3-clique and 5-clique queries.
For this experiment, we use four \textit{diamond} machines as described at the beginning of this section.
Each of these machines has 48 physical cores with hyper-threading.
In total, the Spark cluster has 192 physical cores and 384 virtual cores.
We run the experiments on 16 to 384 of these cores.
Each machine runs one executor, which uses $\frac{1}{4}$ of the total cores used in the run.
The tasks are evenly distributed over all four machines for all levels of parallelism by
the standard behaviour of Spark's standalone-mode scheduler.
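The relevant standalone-mode settings could look roughly as follows; the master URL is hypothetical and the parallelism value is one example level, so this is a sketch of the configuration style rather than our exact submission scripts.
\begin{verbatim}
import org.apache.spark.sql.SparkSession

val p = 96  // total level of parallelism for this run (16 to 384)
val spark = SparkSession.builder()
  .appName("distributed-graphwcoj")
  .master("spark://master-host:7077")   // standalone master, host illustrative
  // In standalone mode, cores.max / executor.cores determines the number of
  // executors; with four machines this yields one executor per machine.
  .config("spark.cores.max", p.toString)
  .config("spark.executor.cores", (p / 4).toString)
  .getOrCreate()
\end{verbatim}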
We also used Spark's \texttt{BroadcastHashjoin} as well as \texttt{SortMergeJoin} implementations
for the 3-clique query on LiveJournal.
We present the total run-time of these joins and the time needed by the distributed GraphWCOJ work-stealing
algorithm in~\cref{table:spark-vs-distributed-work-stealing}.
\begin{table}
\centering
\begin{tabular}{lrrr}
\toprule
Parallelism & Spark BHJ [s] & Spark SMJ [s] & GraphWCOJ [s] \\ \midrule
192 & 428.53 & 423.98 & 4.3 \\
384 & 467.19 & 531.68 & 3.5 \\
\bottomrule
\end{tabular}
\caption{
Total run-time of the 3-clique query for Spark's \texttt{BroadcastHashjoin} (BHJ),
\texttt{SortMergeJoin} (SMJ) and GraphWCOJ on LiveJournal.
Spark uses 3 times as many partitions as cores, as recommended by the Spark documentation:
https://spark.apache.org/docs/latest/tuning.html.}
\label{table:spark-vs-distributed-work-stealing}
\end{table}
We note that the best configuration for Spark is roughly 100 times slower than GraphWCOJ.
We show the speedup of the different queries on the LiveJournal and Orkut dataset in \cref{fig:distributed-scaling}.
The run-times and speedups in raw numbers are given in~\cref{app:distributed-scaling}.
We analyse the results by query and dataset.
\begin{figure}
\subfloat[LiveJournal]{\includesvg[width=0.5\textwidth]{svg/distributed-scaling-livej}}
\subfloat[Orkut]{\includesvg[width=0.5\textwidth]{svg/distributed-scaling-orkut}}
\caption{Speedup for 16 to 384 workers evenly distributed over 4 machines for
3-clique and 5-clique on two datasets.
}
\label{fig:distributed-scaling}
\end{figure}
\begin{table}
\centering
\resizebox{\textwidth}{!}{%
\input{generated/distributed-skew-liveJ.tex}
}
\caption{Total skew and percentage of skew in the total run-time for queries on the LiveJournal dataset.
It is displayed for 3-clique run with a batch size of 1 and a batch size of 40 for the work-stealing
algorithm.
}
\label{table:distributed-skew-livej}
\end{table}
\begin{table}
\centering
\resizebox{\textwidth}{!}{%
\input{generated/distributed-skew-orkut.tex}
}
\caption{Total skew and percentage of skew in the total run-time for queries on the Orkut dataset.
}
\label{table:distributed-skew-orkut}
\end{table}
The 3-clique query scales very differently on both datasets.
We show two plots for the 3-clique query on LiveJournal with different parameters for the work-stealing
batch size.
One with the batch size of 1 and one with a batch size of 40.
We explain this choice later.
On the Orkut dataset, we see a speedup similar to the local version.
It is super-linear up to 96 cores; then the curve flattens out.
This is partly due to the use of hyper-threading and partly due to skew.
Skew can be broken down into two components: intra executor skew and inter executor skew.
Intra executor skew is defined as the different end times of the tasks running and sharing work on
a single executor.
There is no intra executor skew for the 3-clique query;
all tasks on the same executor finish their work within a time-span of at most 5 microseconds.
Inter executor skew is the difference between the duration of the longest-running executor and the average
run-time of all other executors.
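As formulas (our reading of the two definitions), with $\mathit{end}(t)$ the finish time of task $t$, $d(e)$ the run-time of executor $e$ and $E$ the set of executors:
\[
  \mathit{skew}_{\mathit{intra}}(e) = \max_{t \in e} \mathit{end}(t) - \min_{t \in e} \mathit{end}(t) ,
\]
\[
  \mathit{skew}_{\mathit{inter}} = d(e_{\max}) - \frac{1}{|E| - 1} \sum_{e \in E,\ e \neq e_{\max}} d(e) ,
\]
where $e_{\max}$ is the longest-running executor.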
We show the absolute skew in seconds and the percentage of skew in the total query run-time
in~\cref{table:distributed-skew-orkut} for the Orkut dataset.
The total skew decreases with higher levels of parallelism.
This is because the work is equally distributed over the 4 executors, and with more tasks running per executor,
the run-time difference between executors falls when there is no intra executor skew.
Skew accounts for 5\% to 10\% of the total query run-time, with a decreasing trend for higher levels of
parallelism.
Although decreasing, it still contributes to the flattening of the speedup curve for 128 workers and more.
On the LiveJournal dataset for 3-clique, we see lower speedup.
With a batch size of 1, it is sub-linear up to 48 cores and deteriorates afterwards.
Again, there is no intra executor skew.
During our study of inter executor skew, we noticed that the worker which takes longest can differ widely
between repetitions of the experiment with the same configuration.
This is unexpected because in all of them the executors receive the same workload.
We run these experiments with a batch size of 1 for the work-stealing algorithm.
Therefore, we suspect contention on the work-stealing queue to be the reason for the speedup deterioration and
repeat the experiment with a batch size of 40.
The speedup for a batch size of 40 is weaker than with a batch size of 1 up to 48 workers.
With more cores, it is stronger.
This supports our hypothesis that lock contention is a problem at higher levels of parallelism.
The scaling is weaker for fewer than 64 workers because the higher batch size leads to more skew, as we see
in~\cref{table:distributed-skew-livej}.
This experiment shows that choosing the right batch size is not trivial for queries with
short-running work-stealing tasks.
When we choose it too low for the level of parallelism, the work-stealing queue becomes contended, which leads
to highly unpredictable performance.
However, batching raises the skew in the query, which leads to lower performance on fewer
than 48 cores and to suboptimal speedup at higher levels of parallelism due to skew.
We now discuss the 5-clique query, first on the LiveJournal dataset.
As expected we see stronger speedup than for 3-clique as there is more work to share.
The scaling is sub-linear due to intra and inter executor skew.
The intra executor skew is similar to the skew displayed for our local mode experiment in~\cref{table:skew-liveJ}.
The total skew is shown in~\cref{table:distributed-skew-livej}.
We see that the total and relative skew decrease with the number of workers used, which is to be
expected, as explained above.
Nevertheless, it leads to sub-optimal scaling behaviour, and the total value of nearly 3 minutes
for a low number of workers is quite high.
Reducing the intra executor skew is likely to reduce the inter executor skew;
we outline a solution for the intra skew in~\cref{subsubsec:finer-grained-work-stealing}.
The 5-clique query run on the Orkut dataset scales less well than the 3-clique query due
to much higher intra and inter executor skew.
Again the intra executor skew is similar to the skew in the local experiment~(see \cref{table:skew-orkut}).
The total skew is displayed in~\cref{table:distributed-skew-orkut}.
We compare the speedup of the distributed work-stealing version against the speedup of the local
work-stealing version in~\cref{table:local-vs-distributed}.
The performance of the distributed version is strongly dataset and query dependent.
The 3-clique query on Orkut and 5-clique query on LiveJournal scale better than in the
local version;
in particular, for high levels of parallelism.
This is most likely because no hyper-threading is used in the distributed version
up to 192 tasks, while the local version uses hyper-threading for more than 48 tasks.
The 5-clique query on Orkut shows roughly half the speedup up to 48 workers; then the difference
in speedup decreases due to the use of virtual cores in the local version.
The distributed version lags behind for the 3-clique query on LiveJournal, independent of
the chosen batch size.
\begin{table}
\centering
\begin{tabular}{lllrrrrr}
\toprule
Dataset & Query & Version & 16 & 32 & 48 & 64 & 96 \\\midrule
\multirow{4}{*}{LiveJournal} & \multirow{2}{*}{3-clique} & local & 19.6 & 36.7 & 50.0 & 54.7 & 61.3 \\
& & distributed & 13.7 & 27.1 & 34.7 & 24.3 & 41.1 \\
\cline{2-8}
& \multirow{2}{*}{5-clique} & local & 14.3 & 28.3 & 40.3 & 44.1 & 48.7 \\
& & distributed & 17.4 & 29.2& 37.5 & 46.7 & 62.7\\\hline
\multirow{4}{*}{Orkut} & \multirow{2}{*}{3-clique} & local & 18.8 & 36.9 & 54.0 & 58.0 & 69.6 \\
& & distributed & 19.2 & 38.5 & 55.5 & 71.8 & 98.6 \\
\cline{2-8}
& \multirow{2}{*}{5-clique} & local & 14.3 & 28.1 & 35.2 & 31.6 & 29.8 \\
& & distributed & 7.1 & 13.2 & 19.1 & 25.9 & 36.8\\
\bottomrule
\end{tabular}
\caption{Speedup of local and distributed work-stealing version on different queries and datasets.
For 3-clique on LiveJournal with the distributed version, we report the speedup of batch size 1.
}
\label{table:local-vs-distributed}
\end{table}
% comparison against other work
% Dewitt
% Andreas Amler
% Old dog
% LFTJ
% Richard