\section{Introduction} \label{sec:introduction}
Worst-case optimal join (WCOJ) algorithms, e.g. the Leapfrog Triejoin, have in the last few years turned conventional thinking about join processing
on its head, because these multi-way join algorithms have provably lower complexity than classical binary joins,
i.e. join algorithms that join only two tables at a time.
In the areas of data warehousing and OLAP, this finding does not have much impact, though,
since the join patterns most commonly encountered there are primary-key-foreign-key joins (PK-FK joins for short),
where the join forms a tree or snowflake shape and contains no cycles.
The computational complexity of PK-FK joins is by definition linear in the size of the inputs.
In these \textit{conventional} cases, binary joins, e.g. hash joins, are worst-case optimal already.
However, analytical graph queries often use FK-FK joins, which can grow super-linearly in the size of their inputs
and often contain cycles.
For these use-cases, binary joins often exhibit highly suboptimal run-times because they generate a rapidly growing set of
intermediary results, e.g. when navigating a social graph with an out-degree in the hundreds.
Many of these intermediary results are eliminated in later joins, e.g. a join that closes a cycle.
Hence, an algorithm which avoids generating these results in the first place should perform much better and
come closer to the optimal performance given by the output size.
\textsc{WCOJ}s avoid many of these intermediary results and are guaranteed to reach the best possible run time in terms of
the output size~\cite{agm}.
Because of the frequent presence of cyclic join patterns in graph-pattern matching, we believe that worst-case optimal join algorithms
could be a useful addition to (analytical) graph database systems.
Therefore, we aim to integrate a scalable \textsc{WCOJ} algorithm into Spark, which is used by several modern graph engines~\cite{caps,gcore,
graphframe}.
The rest of our introduction is structured as follows.
\Cref{subsec:graph-pattern-matching} defines the term graph pattern matching, describes its translation into datalog and
relational queries, and gives two examples of cyclic graph patterns used in practice.
We aim to give the reader an intuitive understanding of why \textsc{WCOJ}s are superior to binary joins for graph pattern matching
in~\cref{subsec:intuitive-example}.
\Cref{subsec:graphs-on-spark} motivates our choice to use Spark as the base for our thesis.
Next, we state our research questions and contributions in~\cref{subsec:research-questions-and-contributions}.
Finally in~\cref{subsec:thesis-idea}, we outline the main ideas behind the thesis and their connections.
\subsection{Graph pattern matching}\label{subsec:graph-pattern-matching}
Graph pattern matching is the problem of finding all instances of a specific subgraph in a graph.
The subgraph to find is described as a pattern or query.
In this thesis, we use datalog queries to define subgraph queries.
For example, \cref{eqn:triangle} shows the datalog query describing a triangle.
\begin{equation}
\textit{triangle(a, b, c) $\leftarrow$ R(a, b), S(b, c), T(c, a)} \label{eqn:triangle}
\end{equation}
Here we join three atoms $R$, $S$ and $T$ with two attributes each: $(a, b)$, $(b, c)$ and $(c, a)$ respectively.
The task of enumerating all triangles within the three atoms can also be described as finding all possible
bindings for the join variables $a$, $b$ and $c$ within them.
The translation from datalog queries to graph patterns is straightforward:
an attribute or variable refers to a vertex in the graph and an atom to an edge.
A depiction of the subgraph pattern described by~\cref{eqn:triangle} is shown in~\cref{fig:pattern-triangle}.
\begin{figure}
\centering
\includesvg[width=0.2\textwidth]{svg/triangle}
\caption{Depiction of the triangle subgraph query.}
\label{fig:pattern-triangle}
\end{figure}
In relational terms, a graph pattern matching query is an n-ary, conjunctive, self-equijoin on the edge relationship of the graph.
In this thesis, all join queries discussed belong to this subcategory of possible join queries.
Other join queries can be useful to describe more complex graph patterns, e.g. disjunction for two edges of which only one needs to
exist or negation to exclude instances that have too many connections.
Some techniques used in this work can be extended to cover these cases; we mention related literature but do not focus
our efforts on these extensions.
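To make the relational view concrete, the triangle query of~\cref{eqn:triangle} can be written as a self-equijoin of the edge table in Spark's DataFrame API.
The following is a minimal sketch (assuming a two-column edge DataFrame; the column names \texttt{a}, \texttt{b}, \texttt{c} are ours) of the binary-join formulation that a \textsc{WCOJ} would replace:
\begin{verbatim}
import org.apache.spark.sql.DataFrame

// Triangle query as a self-equijoin: alias the edge table three times and
// join on the shared variables, i.e. ((R JOIN S) JOIN T).
def triangles(edges: DataFrame): DataFrame = {
  val r = edges.toDF("a", "b") // alias R(a, b)
  val s = edges.toDF("b", "c") // alias S(b, c)
  val t = edges.toDF("c", "a") // alias T(c, a)
  r.join(s, "b").join(t, Seq("c", "a"))
}
\end{verbatim}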
Graph pattern matching is fundamental to analytical graph analysis workloads~\cite{fraud-detection,flake2002,
bodaghi2018automobile,newman2004detecting}.
We show two graph patterns which are used in practice below and explain the use-cases.
\Cref{fig:pattern-diamond} shows the diamond query, which is used by Twitter to recommend new accounts for its users to follow.
The idea is that if user $a$ follows multiple accounts $c_1, \dots, c_k$ who all follow a person $b$, then it is likely that
$b$ would be interesting for $a$ to follow as well.
In the figure, we see the diamond query for $k = 2$.
This is the diamond query as discussed in most academic papers~\cite{olddog,myria-detailed,mhedhbi2019}, although
Twitter uses $k = 3$ in production~\cite{twitter-diamond}.
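In the datalog notation of~\cref{eqn:triangle}, the diamond pattern for $k = 2$ could be written as follows; this is our own rendering over a single edge relationship $E$, using the variable names from the description above.
\begin{equation}
\textit{diamond(a, b, c$_1$, c$_2$) $\leftarrow$ E(a, c$_1$), E(a, c$_2$), E(c$_1$, b), E(c$_2$, b)} \label{eqn:diamond}
\end{equation}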
\begin{figure}
\centering
\includesvg[width=0.2\textwidth]{svg/diamond}
\caption{The diamond query as used by Twitter. The vertices are users and the edges follower relationships.
In the example, Twitter could recommend $A$ to follow $C$ because $A$ follows $B_1$ and $B_2$, which both follow $C$.
}
\label{fig:pattern-diamond}
\end{figure}
Our second concrete use-case example is the n-cycle.
As explained in~\cite{fraud-detection}, cycles can be used to detect bank fraud.
A typical bank fraud involves a so-called \textit{fraud ring}:
two or more people who combine their legitimate contact information in new ways to craft multiple false identities.
For example, two people can share their real phone numbers and addresses to craft four fake identities; all combinations possible with
two phone numbers and two addresses.
They open accounts under false names with real contact information, use these accounts normally to build trust with the bank and
build up bigger credit lines.
At a certain date, they max out all credit lines and disappear.
The phone numbers are dropped and the actual people living at the addresses deny ever knowing the identities that opened the accounts.
This scheme can be detected using graph pattern matching.
Let us assume we have a graph database in which the customers of the bank, their addresses and their phone numbers are all vertices, and
the relationships of an address or phone number belonging to a customer are edges.
Then, the case described above forms an 8-cycle of 4 persons (fake identities) connected by the shared use of phone numbers and
addresses.
The imagined cycle is shown in~\cref{fig:graph-pattern-example-bank-fraud}.
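In the notation of~\cref{eqn:triangle}, and treating all `belongs to' edges as a single, symmetric edge relationship $E$, this pattern is the 8-cycle below (our own rendering):
\begin{equation}
\textit{cycle$_8$(x$_1$, \dots, x$_8$) $\leftarrow$ E(x$_1$, x$_2$), E(x$_2$, x$_3$), \dots, E(x$_8$, x$_1$)} \label{eqn:eight-cycle}
\end{equation}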
\begin{figure}
\centering
\includesvg[width=0.2\textwidth]{svg/bank-fraud}
\caption{Schematics of a bank fraud ring.
Two fraudsters share their phone numbers and addresses (labelled $P$ and $A$) to create four
fake customers ($C$ vertices).
}
\label{fig:graph-pattern-example-bank-fraud}
\end{figure}
\subsection{Binary joins vs \textsc{WCOJ}s: an intuitive example} \label{subsec:intuitive-example}
We introduce the triangle query and possible binary join plans.
Then we point out the general problem of binary join plans on this query and the idea of how \textsc{WCOJ}s can improve the situation.
Next, we give a concrete example of a database instance to illustrate the aforementioned problem.
We conclude our motivation for using worst-case optimal joins by pointing to multiple papers which show that these joins are highly beneficial
to graph pattern matching queries in practice.
The simplest example of a cyclical join query enumerates all triangles in a graph.
It is shown in~\cref{eqn:triangle} and \cref{fig:pattern-triangle}.
Traditionally, this would be processed by using multiple binary joins:
\begin{equation}
R \bowtie S \bowtie T
\end{equation}
The join above can be solved in 3 different orders: $ (R \bowtie S) \bowtie T$, $ (R \bowtie T) \bowtie S$ and
$ R \bowtie (T \bowtie S)$.
Independent of the chosen order, database instances exist where the intermediary result size is in $\mathcal{O}(N^2)$ with
$N = |R| = |S| = |T|$.
However, it is provable that the output of this query is guaranteed to be in $\mathcal{O}(N^{3/2})$ for any database
instance~\cite{agm,skew-strikes-back}.
Hence, binary joins can materialize huge intermediary results while processing parts of the query,
results which are much bigger than the final output.
The described problem is a fundamental issue with traditional binary join plans~\cite{agm,skew-strikes-back}.
We also call these plans the \textit{join-at-a-time} approach because they process one whole join at a time.
Fortunately, worst-case optimal join algorithms can compute cyclic joins with memory usage linear in their output size
by solving the join \textit{variable-at-a-time}, which avoids materializing big intermediary results~\cite{lftj,nprr}.
In the variable-at-a-time approach, the algorithm finds a binding for the first variable $a$, then one for $b$ and
finally one for $c$.
After this, it emits the tuple as part of the output.
Then it finds further bindings via backtracking until the whole join has been enumerated, i.e. when all bindings for $a$ have been explored.
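The following is an illustrative sketch in plain Scala of this variable-at-a-time evaluation for the triangle query; it is not the Leapfrog Triejoin itself, and the backtracking is implicit in the nested enumeration, but it shows that bindings are extended one variable at a time and only complete bindings are ever emitted.
\begin{verbatim}
// Variable-at-a-time evaluation of triangle(a, b, c) <- R(a, b), S(b, c), T(c, a):
// bind a, then b, then c; emit only complete bindings, backtrack otherwise.
def triangleVariableAtATime(r: Set[(Int, Int)],
                            s: Set[(Int, Int)],
                            t: Set[(Int, Int)]): Seq[(Int, Int, Int)] =
  for {
    a <- (r.map(_._1) intersect t.map(_._2)).toSeq              // a occurs in R and T
    b <- r.collect { case (`a`, b) => b } intersect s.map(_._1) // R(a, b), b occurs in S
    c <- (s.collect { case (`b`, c) => c }
            intersect t.collect { case (cT, `a`) => cT })       // S(b, c) and T(c, a)
  } yield (a, b, c)
\end{verbatim}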
% Example concrete
A simple example graph database instance gives an idea of why a variable-at-a-time approach is beneficial for cyclic queries.
In \cref{fig:edge-rel-example}, we see an edge relationship.
It is repeated three times, labelled with different attributes, to ease the understanding of the following explanation;
however, in a real system's implementation, only one table exists and is used as input by all joins.
\begin{figure}
\centering
\subfloat{
$R$
\begin{tabular}{rr}
\toprule
a & b \\\midrule
1 & 2 \\
2 & 7 \\
2 & 8 \\
2 & 9 \\
2 & 10 \\
3 & 2 \\
4 & 2 \\
5 & 2 \\
6 & 11 \\
11 & 12 \\
12 & 6 \\\bottomrule
\end{tabular}
}
\hspace{0.2\textwidth}
\subfloat{
$S$
\begin{tabular}{rr}
\toprule
b & c \\\midrule
1 & 2 \\
2 & 7 \\
2 & 8 \\
2 & 9 \\
2 & 10 \\
3 & 2 \\
4 & 2 \\
5 & 2 \\
6 & 11 \\
11 & 12 \\
12 & 6 \\\bottomrule
\end{tabular}
}
\hspace{0.2\textwidth}
\subfloat{
$T$
\begin{tabular}{rr}
\toprule
c & a \\\midrule
1 & 2 \\
2 & 7 \\
2 & 8 \\
2 & 9 \\
2 & 10 \\
3 & 2 \\
4 & 2 \\
5 & 2 \\
6 & 11 \\
11 & 12 \\
12 & 6 \\\bottomrule
\end{tabular}
}
\caption{
Three aliases of an edge relationship which contains three triangle bindings, the cyclic permutations of $(6, 11, 12)$,
and one skewed value (2).
}
\label{fig:edge-rel-example}
\end{figure}
A binary join plan which joins $R$ and $S$ via $b$ first produces $16 + 3$ intermediary results:
$4 \times 4$ results for $b = 2$ and one each for $b = 6, 11, 12$.
The next join eliminates all 16 results for $b = 2$ and keeps only the three triangle instances: the cyclic
permutations of $(6, 11, 12)$.
A variable-at-a-time approach finds 4 bindings for $a$, namely $2, 6, 11, 12$:
the intersection of the two columns labelled $a$.
Intersecting the two columns of $b$ values, we notice that $2, 6, 11, 12$ could be possible bindings for $b$.
When we fix an $a$ value, these four possibilities are reduced to the $b$ values which exist for this
$a$ value in the leftmost table.
So once we have fixed a binding for $a$, we find exactly one possible binding for $b$,
except for the binding $a = 2$, for which we cannot find a matching $b$ value.
Finally, we find all three instances of the triangle by completing the three $a, b$ bindings with
the matching $c$ binding;
only one exists for each $a, b$ binding.
We can drastically reduce the workload by formulating the join as the problem of
finding variable bindings using information from all parts of the join, instead of using only one constraint at a time
and building the result join-by-join.
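The following plain-Scala sketch reproduces these numbers on the example instance; the single edge table serves as all three aliases, and \texttt{triangleVariableAtATime} refers to the earlier sketch.
\begin{verbatim}
// The example edge relationship; it is used for all three aliases R, S and T.
val edges = Set(
  (1, 2), (2, 7), (2, 8), (2, 9), (2, 10),
  (3, 2), (4, 2), (5, 2), (6, 11), (11, 12), (12, 6))

// Join-at-a-time: R JOIN S on b materializes 19 tuples (16 of them for the
// skewed value b = 2) before the final join with T throws those 16 away again.
val rJoinS = for ((a, b) <- edges.toSeq; (b2, c) <- edges if b2 == b)
             yield (a, b, c)
println(rJoinS.size)                                  // 19

// Variable-at-a-time (using the earlier sketch): only the 3 triangles appear.
println(triangleVariableAtATime(edges, edges, edges).sorted)
// (6,11,12), (11,12,6), (12,6,11)
\end{verbatim}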
We do not claim that the example above illustrates in full generality why binary join plans are provably worse than
\textsc{WCOJ}s.
Clearly, the example does not show an intermediary result of size $N^2$, as $N = 11$ and the intermediary result holds only 19 tuples.
However, we note that even in such a simple example every possible binary join order produces 16 superfluous intermediary tuples,
whereas every possible variable ordering for a variable-at-a-time approach eliminates the skewed value (2) after finding no binding
for the second variable.
A more general but less concrete example is explained in~\cite{skew-strikes-back}.
% WCOJ's in practice
In practice, worst-case optimal join algorithms have proven highly beneficial for cyclic queries in analytical graph
workloads in optimized, single-machine systems~\cite{lftj,olddog}.
\cite{olddog} compares a system using \textsc{WCOJ}s against multiple general-purpose database
systems using binary joins and several graph pattern matching engines on 15 datasets and 7 queries and
finds that worst-case optimal joins beat all other systems in the vast majority of queries
and datasets, often by orders of magnitude, or are even the only approach to finish within 30 minutes.
Later, worst-case optimal joins have been applied successfully to distributed shared-nothing settings~\cite{myria-detailed,
ammar2018distributed};
we describe these systems in more detail in~\cref{subsec:myria} and~\ref{subsec:wcoj-timely-data-flow}.
\subsection{Graphs on Spark}\label{subsec:graphs-on-spark}
Spark is an attractive target for big graph processing, due to its generality, widespread acceptance in the industry, the ability to use
cloud hardware and its fault tolerance by design.
For example, GraphFrames~\cite{graphframe}, GraphX~\cite{graphx} (a Pregel~\cite{pregel} implementation) and graph query languages
such as \mbox{G-CORE}~\cite{gcore} and \mbox{openCypher} with `Cypher for Apache Spark' or \textsc{CAPS}~\cite{caps} all aim to ease graph
processing on Spark.
The last two technologies translate their graph specific operations to the relational interface of Spark (SparkSQL)
to profit from Spark's relational query optimizer Catalyst~\cite{spark-sql}.
Moreover, they allow the user to formulate graph pattern matching queries naturally.
Hence, we believe that \textsc{WCOJ}s, with their efficiency for analytical graph queries, are a valuable addition to Spark's
built-in join algorithms in general and to these graph-on-Spark systems in particular.
Ideally, they are integrated such that they can be naturally used in the ecosystem of Catalyst.
This would allow easier use of graph pattern matching in SQL-like graph languages such as \textsc{G-CORE} or Cypher.
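For illustration, this is how the triangle pattern is expressed in GraphFrames' motif-finding API today; the motif is translated into a plan of binary joins by Catalyst, which is where a \textsc{WCOJ} operator could, in principle, be plugged in (sketch only; \texttt{vertices} and \texttt{edges} are assumed DataFrames with the usual \texttt{id} and \texttt{src}/\texttt{dst} columns):
\begin{verbatim}
import org.graphframes.GraphFrame

// Build a property graph from vertex and edge DataFrames and search for the
// triangle motif; each result row is one binding of (a, b, c).
val g = GraphFrame(vertices, edges)
val triangleMatches = g.find("(a)-[]->(b); (b)-[]->(c); (c)-[]->(a)")
\end{verbatim}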
\subsection{Research questions and contributions}\label{subsec:research-questions-and-contributions}
We identify two challenging, novel directions for our research.
First, all papers about \textsc{WCOJ}s focus on queries widely used in graph pattern matching, e.g. clique finding or path queries.
As explained above, graph pattern matching uses only self-joins on a single relationship with two attributes,
namely the edge relationship of the graph.
However, all systems use worst-case optimal joins developed for general n-ary joins.
This raises the question of whether and how \textsc{WCOJ}s can be specialized for graph pattern matching.
Second, while the communication costs for worst-case optimal joins in MapReduce-like systems\footnote{
An excellent definition of the term MapReduce-like systems is given in~\cite{shares}.}
are well understood~\cite{shares,shares-skew,shares-proof,shares-skew-proof},
their scalability has not been studied in depth.
Given that the only integration in a MapReduce-like system exhibits a speedup of 8 on 64 nodes over two workers
(an efficiency of 0.125)~\cite{myria-detailed},
we find that designing a scalable, distributed \textsc{WCOJ} for a MapReduce-like system is an unsolved challenge.
It is time to investigate how these algorithms scale in arguably the most widely used, general-purpose big data processing engine: Spark.
To the best of our knowledge, this is also the first time a worst-case optimal join has been integrated with an industrial-strength cluster
computing model.
We detail our research questions below.
\begin{enumerate}
\item Can we gain performance in \textsc{WCOJ}s by specializing them to graph pattern matching?
\begin{enumerate}
\item How much performance can we gain by using a compressed sparse row representation as the backing data structure for \textsc{WCOJ}s?
\item Can we find a more suitable algorithm to build intersections for graph-pattern matching than the complex n-ary approach
proposed originally?
\end{enumerate}
\item How well do \textsc{WCOJ}s scale in Spark when used for graph pattern matching?
\begin{enumerate}
\item How well does a previously proposed, optimal partitioning scheme, named Shares, scale?
We explain Shares in detail in~\cref{subsubsec:shares}.
\item How can scalable work-stealing be integrated into Spark to counter tuple replication and skew?
\end{enumerate}
\end{enumerate}
Towards answering our research questions, we make the following contributions.
\begin{enumerate}
\item We integrate a sequential, general worst-case optimal join into Spark.
This implementation serves as a baseline for our \textsc{WCOJ} optimized to graph pattern matching.
\item We design and implement GraphWCOJ which is a worst-case optimal join specialized to graph pattern matching.
It is backed by a compressed sparse row representation of the graph which reduces its memory footprint and speeds up execution by up
to 11 times over a normal \textsc{LFTJ} because it acts as an index. % NUMBER
Furthermore, we exploit the typically low out-degree of most graphs by specializing the \textsc{LFTJ} for small intersections.
\item We analyse how many tuples Shares replicates for typical graph pattern matching queries.
From this analysis and the fact that Shares is an optimal partitioning scheme, we conclude that replication is inevitable for complex
graph-pattern matching queries.
Therefore, we cache the graph in the memory of all workers.
\item Based on a replicated edge relationship, we design \textit{logical} Shares.
This is an approach where the graph is fully replicated but we use Shares partitioning to divide work between executors.
We measure a speedup of 13 on 64 workers for some queries and beat an existing implementation of (physical) Shares which reaches
a speedup of 8 for the same number of machines.
The results show that Shares is good at dealing with skew but requires too much replicated work to scale well.
\item Therefore, we abandon static partitioning and design a \textsc{WCOJ} that applies work-stealing.
We show that work-stealing can scale linearly on some input queries and beats \textit{logical} Shares at all levels of parallelism
for 3-cliques and 5-cliques on three different datasets.
\item We run experiments on 5 datasets and 6 queries, for up to 382 workers, using Spark's built-in hash join,
a general Leapfrog Triejoin and our specialized GraphWCOJ.
\end{enumerate}
\subsection{Thesis overview} \label{subsec:thesis-idea}
In this section, we outline the main ideas, motivation and decisions taken in this thesis.
We summarize the whole thesis as a graph in~\cref{fig:thesis-overview}.
We see the background that motivates our decisions in the corners and our system in the centre.
The edges show connections between different ideas and components.
We give an overview of the thesis in the next paragraphs.
\begin{figure}
\centering
\includesvg[width=\textwidth]{svg/thesis-overview}
\caption{
Main ideas and components of the thesis.
Background and related work are shown in the corners.
The centre shows the main component of our parallelized worst-case optimal join.
}
\label{fig:thesis-overview}
\end{figure}
As mentioned in~\cref{subsec:graphs-on-spark}, Spark is a good platform for our work because many graph pattern matching systems use
it, like neo4j's Cypher on Apache Spark (CAPS) and LDBC's G-CORE~\cite{caps,gcore}.
In particular, they build on top of Spark's structured query execution offered by Catalyst.
Catalyst is designed to be easily extensible and allows us to introduce new operators, such as a worst-case optimal join,
without modifying the core of Spark.
This is even possible in a way such that the new operators can be used with a native, unchanged installation of Spark.
We describe Catalyst's query compilation process in~\cref{subsubsec:catalyst}.
Our integration is detailed in~\cref{sec:spark-integration}.
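As a minimal sketch of such an extension point, Catalyst lets an application register additional planning strategies on an unmodified Spark installation; the names \texttt{WcojStrategy} and \texttt{WcojExec} below are hypothetical placeholders, not our actual implementation:
\begin{verbatim}
import org.apache.spark.sql.{SparkSession, Strategy}
import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan
import org.apache.spark.sql.execution.SparkPlan

// A Catalyst strategy maps a logical plan to zero or more physical plans.
object WcojStrategy extends Strategy {
  override def apply(plan: LogicalPlan): Seq[SparkPlan] = plan match {
    // Match the logical plan shapes we want to take over, e.g. a chain of
    // equi-joins over the edge relationship, and emit our physical operator:
    // case MultiJoinPattern(edges, variables) => WcojExec(edges, variables) :: Nil
    case _ => Nil // otherwise, fall back to Spark's built-in strategies
  }
}

val spark = SparkSession.builder().appName("graphWCOJ").getOrCreate()
spark.experimental.extraStrategies =
  WcojStrategy +: spark.experimental.extraStrategies
\end{verbatim}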
Normally, Spark achieves parallelism and distribution by partitioning the data over all workers.
When data from different tables needs to be joined, Spark repartitions the data such that each worker can process parts of the join locally.
However, shuffling is an expensive operation in Spark, involving disk writes and reads, which should be avoided if possible.
Moreover, we can show that a communication-optimal partitioning scheme degenerates into a full broadcast for bigger graph pattern queries;
we explain this in detail in~\cref{subsubsec:shares}.
Given this finding, we design our system to build on a replicated and cached edge relationship on each worker.
As a short study of ours reveals, most large graphs described in the literature have edge structures that fit in
main memory (see~\cref{subsec:graph-analysis}),
and our compressed storage format further helps to keep memory usage under control.
This has a few distinct advantages.
First, the broadcast can be done once at system startup.
Then, we can reuse it for any graph pattern matching query;
all of them need to join over the edge relationship many times.
Therefore, we can answer many graph pattern matching queries without shuffling data.
We explain the integration of replicated edge relationships into Spark in~\cref{subsec:spark-integration-graphWCOJ} and introduce
the necessary background in~\cref{subsubsec:broadcast-variables}.
Furthermore, such reuse helps to amortize the non-trivial setup costs for worst-case optimal joins;
they require their input data to be sorted.
Finally, a replicated data structure allows us to use dynamic work-sharing schemes such as work-stealing cheaply, without relocating
data.
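A minimal sketch of this replication step, using Spark's broadcast variables, is shown below; the file name and the plain sorted edge array are placeholders for the compressed edge representation described later:
\begin{verbatim}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("graphWCOJ").getOrCreate()
val sc = spark.sparkContext

// Load and sort the edge relationship once at startup (placeholder format);
// in the real system this would be the CSR structure described later.
val edgeArray: Array[(Int, Int)] = sc.textFile("edges.csv")
  .map(_.split(','))
  .map(parts => (parts(0).toInt, parts(1).toInt))
  .collect()
  .sorted

// Ship the graph to every executor exactly once; all later pattern queries
// read it via edgesBroadcast.value without any per-query shuffle.
val edgesBroadcast = sc.broadcast(edgeArray)
\end{verbatim}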
Normally, partitioning is done completely statically in Spark.
However, this is problematic given that many real-world graphs are highly skewed, e.g. power-law graphs such as many follower graphs
(Facebook, Twitter) or web graphs.
We find that this skew can easily lead to bad load-balancing with the static partitioning of Spark.
We explain how to integrate work-stealing with Spark in~\cref{subsec:work-stealing}.
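A minimal, hypothetical sketch of the work-stealing idea on top of a replicated graph is given below: all tasks running in the same executor JVM pull batches of first-variable bindings from a shared atomic cursor instead of owning a fixed static range, so tasks that hit skewed vertices do not stall the others.
\begin{verbatim}
import java.util.concurrent.atomic.AtomicInteger

// One instance per executor JVM (e.g. held in a singleton); tasks share it.
class StealableRange(totalBindings: Int, batchSize: Int = 64) {
  private val cursor = new AtomicInteger(0)

  // Next half-open batch [start, end) of first-variable positions,
  // or None when the range is exhausted.
  def nextBatch(): Option[(Int, Int)] = {
    val start = cursor.getAndAdd(batchSize)
    if (start >= totalBindings) None
    else Some((start, math.min(start + batchSize, totalBindings)))
  }
}

// Each Spark task loops: steal a batch, run the join restricted to those
// first-variable bindings, and repeat until nextBatch() returns None.
\end{verbatim}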
Using a fully replicated edge relationship, and potentially work-stealing, makes it necessary to build data partitioning into
the worst-case optimal join operators.
This is because Spark would normally have an operator work on all local data and achieve parallelism by physically partitioning the
data over multiple workers.
We call partitioning built into our operators \textit{logical} partitioning.
The concept and its implementation are described in~\cref{sec:worst-case-optimal-join-parallelization}.
We choose to use the Leapfrog Triejoin~\cite{lftj} as a basis for our system;
this choice is motivated in~\cref{subsec:worst-case-optimal-join-algorithm}.
This join requires its input relationships to be presented in a sorted data structure which is searchable for upper bounds in
$\mathcal{O} (\log N)$.
Furthermore, it mainly uses intersections to compute the join.
The algorithm is explained in detail in~\cref{subsubsec:leapfrog-triejoin}.
We specialize the Leapfrog Triejoin to graph pattern matching by introducing a compressed sparse row representation~(\textsc{CSR}, see
\cref{subsec:csr-background}) as the backing data structure for the input relationships.
\textsc{CSR} can compress the graph's edge relationship by a factor of nearly 2.
Additionally, we show that backing the \textsc{WCOJ} with a \textsc{CSR} speeds up its execution because this representation
acts like an index.
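A minimal sketch of such a \textsc{CSR} layout is given below, assuming vertices are densely numbered $0$ to $n-1$ and the edge list is sorted by source and then destination; because the adjacency list of a vertex is a contiguous, sorted slice, the structure doubles as an index.
\begin{verbatim}
// offsets(v) to offsets(v + 1) delimits the sorted adjacency list of vertex v
// inside the flat targets array.
case class CSR(offsets: Array[Int], targets: Array[Int]) {
  def successors(v: Int): Array[Int] =
    java.util.Arrays.copyOfRange(targets, offsets(v), offsets(v + 1))
}

object CSR {
  // Assumes `edges` is sorted by (source, destination).
  def fromSortedEdges(numVertices: Int, edges: Array[(Int, Int)]): CSR = {
    val offsets = new Array[Int](numVertices + 1)
    edges.foreach { case (src, _) => offsets(src + 1) += 1 } // out-degree counts
    for (v <- 1 to numVertices) offsets(v) += offsets(v - 1) // prefix sums
    CSR(offsets, edges.map(_._2))                            // targets in edge order
  }
}
\end{verbatim}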
Another graph-specific optimization we apply to the \textsc{LFTJ} is to exchange the intersection-building algorithm for one
that is specialized for small intersections.
This is motivated by the fact that real-world graphs normally have small average out-degrees;
hence, the intersection of multiple adjacency lists is predictably small.
We discuss both specializations to the Leapfrog Triejoin algorithm in~\cref{sec:graphwcoj}.
% Also introduce cached edge table, ref related work for why communication does not work
% and goals for explanation of why we think caching is helpful
% use mcsherry, other guy here? --> read them
% These academic systems are not very usable nor used, nor is LogicBlox on the market as a database system.
% For all practical senses and purposes, there are no %systems available that implement WCOJs. Apache Spark is currently the most popular
% analytical data processing system. It does not implement WCOJs yet and has %multiple popular graph processing APIs or subsystems, among
% which GraphFrames, CAPS (neo4j's Cypher on Apache Spark) and the recent LDBC effort to implement the G-%CORE query language on Apache
% Spark. All of these APIs could potentially benefit greatly from a WCOJ algorithm.
%Spark offers a well optimized Relational interface [SparkSQL] [Catalsyst]
%Relational interfaces rely on JOINS which have different characteristics different for graphs than for traditional star or snowflake schemes.
% - they are cyclic for important graph algorithms (cluster, ?subgraphing?)
% - large intermediary results which make the queries really expensive for the CPU as well as the in memory
% - work is done for nothing because most of the intermediary results are filtered out later
% - they are highly selective (paths starting from a specific node)
% - allowing to safe work when all "filters" are applied simultaneously
% - allowing for big jumps on sorted keys with a seek operation, naturally, applied by LFTJ
% - therefore, they are a prime area of application for a new class of join algorithms with worst-case guarantees, which guarantee that no big intermediary results build up
%because the evaluate multiple joins as once only materializing results that fulfil all join filters.
% - Furthermore, they are naturally suited for highly selective queries because of using an O(log(N)) seek a method to jump over "uninteresting" parts of a sorted result.