-
Notifications
You must be signed in to change notification settings - Fork 0
/
main.tex
126 lines (90 loc) · 4.51 KB
/
main.tex
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
\documentclass[11pt]{article}
\input{preamble.tex}
\title{
}
\author{}
\date{}
\begin{document}
\thispagestyle{empty}
\begin{center}
Vrije Universiteit Amsterdam
\vspace{1mm}
\includegraphics[height=25mm]{figures/vu-griffioen.pdf}
\vspace{1cm}
{\Large Master Thesis}
\vspace*{1.5cm}
\rule{.9\linewidth}{.6pt}\\[0.4cm]
{\Large \bfseries Fast, scalable worst-case optimal joins for graph-pattern matching on in-memory graphs in Spark\par}\vspace{0.4cm}
\rule{.9\linewidth}{.6pt}\\[1.5cm]
\vspace*{2mm}
{\Large
\begin{tabular}{l}
{\bf Author:} ~~Per Fuchs ~~~~ (2614613)
\end{tabular}
}
\vspace*{2cm}
\begin{tabular}{ll}
{\it 1st supervisor:} & Prof.dr P.A. Peter Boncz (CWI) \\
{\it daily supervisor:} & Prof.dr P.A. Peter Boncz \\
{\it 2nd reader:} & Dr. Bogdan Ghit (Databricks)
\end{tabular}
\vspace*{2.5cm}
\textit{A thesis submitted in fulfillment of the requirements for\\ Parallel and Distributed Systems Master of Science degree in
Computer Science}
% \vspace*{1.8cm}
\today % Date
\end{center}
\newpage
\maketitle
\abstract{
\setlength{\parindent}{0em}
\setlength{\parskip}{0.7em}
Graph pattern matching is a challenge for data processing systems like Spark because the size of the intermediary
result required by finding the matches with standard binary join operators grow over linear with regards to the
inputs.
Recently popularized worst-case optimal join algorithms, \textsc{WCOJ}s, seem a better match to tackle this challenge
because they do not materialize the aforementioned large intermediary results.
We investigate two major open questions regarding \textsc{WCOJ}s in graph use cases.
First, we develop a \textsc{WCOJ} specialized to graph-pattern matching, where the joins are self-joins on an edge table and compare its performance with a generic \textsc{WCOJ}. Second, we propose a novel method for distributed execution of such WCOJs. We show that the previously proposed methods to distribute WCOJs that would fit Spark, do not scale well to large graph patterns (five vertices and more). Based on this result, we propose to keep the edge relationship cached in compressed form on all workers but distribute the computation using a logical partitioning.
Our approach trades off memory usage for better scalability.
This is reasonable as most graphs today fit in main memory~\cite{snap}.
In all, our work provides the first scalable Spark implementation of a \textsc{WCOJ}, very suited for graph use.
}
\newpage
\setlength{\parindent}{0em}
\setlength{\parskip}{0.7em}
\section*{Acknowledgements}
First, I want to thank my supervisors Peter Boncz and Bogdan Ghit for their support in this thesis.
Both read multiple drafts, provided feedback and new ideas.
Special thanks to Peter for the long discussions in the beginning to find a topic that connects so many of my favourite topics:
distribution, algorithms from theory to practice and graphs.
Furthermore, for giving me the big chance to present my thesis at the LDBC workshop after Sigmod in front of
multiple interested companies and two authors of important related \textsc{WCOJ} papers.
Bogdan always kept a positive attitude to the project and kept me motivated with friendly comments and encouraging thoughts.
Additionally, he asked many critical questions about angles I didn't give enough attention.
The Spark integration of this work has been inspired by the IndexedDataframes~\cite{uta2019low} work of Alex Uta and the supervisors
of this thesis.
We used a similar design to extend the DataSet interface with an additional method to trigger a worst-case optimal join computation.
Richard Gankema, a PhD student at CWI, offered valuable hints towards the optimization of the Leapfrog Triejoin.
Bogdan Ghita and Matheus Nerone, my office companions, helped during the proposal writing by being partners for discussions and later
by teaching me to play table tennis.
Finally, I'd like to thank my friends and in particular, my parents who helped me to stay focused on my thesis through a difficult private
situation in the last months.
\newpage
\tableofcontents
\newpage
\input{introduction.tex}
\input{background.tex}
\input{partitionings.tex}
\input{graphWCOJ.tex}
\input{implementation.tex}
\input{spark-integration.tex}
\input{experiments.tex}
\input{related-work.tex}
\input{conclusions.tex}
\clearpage
\addcontentsline{toc}{section}{References}
\printbibliography
\clearpage
\input{appendix.tex}
\end{document}