-
Notifications
You must be signed in to change notification settings - Fork 312
Example Apps
The best way to get familiar with GraphChi is to run and study the example applications. You should start with the simple applications, Pagerank and Connected Components prior to looking at matrix factorization or triangle counting.
To run the examples, you need an input graph (network). Currently !GraphChi can read graphs in the standard [Edge-List-Format] or [Adjacency-List-Format], but it is fairly easy to write your own parsers. Remember that GraphChi can run even very big graphs on a normal PC or laptop!
A good set of graphs of different sizes are available below:
A great collection of graphs is available at ( http://law.di.unimi.it/datasets.php ), but notice that they are in the compressed format of WebGraph Framework ( http://webgraph.dsi.unimi.it/ ).
- Large Twitter Graph.
All example programs automatically convert the input graph to GraphChi's internal shards (see Introduction-To-GraphChi). That is, you just need to compile and run the application. This conversion needs to be done only once for each application, and the system will detect whether preprocessing is needed. Also, if the value type stored in edges is of the same size, different applications can use the same shards.
What you will learn:
- Basic structure of GraphChi programs
- How to "communicate" via edges
- Flavor of asynchronous computation
Source: example_apps/pagerank.cpp
To build and run: make example_apps/pagerank bin/example_apps/pagerank file GRAPH-NAME
Above, replace GRAPH with the input graph. If the graph has not been preprocessed, the program will ask for the format of the graph (edgelist or adjacency-list).
You may adjust the number of iterations by specifying parameter "niters" as follows:
> bin/example_apps/pagerank file GRAPH-NAME niters 10
The application will print the ids of the top 20 vertices with the highest PageRank score.
GraphChi will write the values of the edges in a binary file, which is easy to handle in other programs. The name of the file containing vertex values is GRAPH-NAME.4B.vout
. Here "4B" refers to the vertex-value being a 4-byte type (float).
You can also analyze the vertex-data programmatically, in memory-efficient manner, by using VertexAggregators.
While this Pagerank example is simple, it is not the fastest way to compute Pagerank with GraphChi. Particularly, it unnecessarily loads the values for out-edges, while the value is never read, it is just written. There is another version of Pagerank available, example_apps/pagerank_functional.cpp, which uses an experimental alternative "functional" API, which is roughly 50% more efficient for Pagerank.
What you will learn:
- Dealing with undirected edges
- Understanding asynchronous computation
- Selective (dynamic) scheduling
Source: example_apps/connectedcomponents.cpp
To build and run:
> make example_apps/connectedcomponents
> bin/example_apps/connectedcomponents file GRAPH-NAME
The program will produce output GRAPHNAME_components.txt
, which on each line has component id, num-of-vertices. As an interesting analysis, you can plot a histogram of the sizes of the connected components.
What you will learn:
- How to exchange information via edges in both directions
Source: example_apps/communitydetection.cpp
To build and run:
> make example_apps/communitydetection
> bin/example_apps/communitydetection file GRAPH-NAME
Same output as connected components, but instead of components, it lists community-ids and their sizes.
What you will learn:
- How to integrate linear algebra package (Eigen)
- How to deal with a bipartite graph
- Special graph input / output
- Two strategies of reading neighbor vertex value (via edges, and in-memory)
There are two versions of the program. The first version, edge-version, writes a vertex's value (a latent factor vector) to its adjacent edges; the vertices-in-memory version uses a separate array to store vertices in-memory. See the code for more details.
You need the Eigen matrix library for compiling the matrix factorization programs. It is available here, as a headers-only installation.
Source: example_apps/als_edgefactors.cpp
To build and run:
> make example_apps/matrix_factorization/als_edgefactors
> bin/example_apps/matrix_factorization/als_edgefactors file MATRIX-NAME
Source: example_apps/als_vertices_inmem.cpp
> make example_apps/matrix_factorization/als_vertices_inmem
> bin/example_apps/matrix_factorization/als_vertices_inmem file MATRIX-NAME
For this program, you need to provide your input in the Matrix Market format: http://math.nist.gov/MatrixMarket/
For example, if you are running ALS for the Netflix movie rating dataset, the matrix is a sparse matrix containing the ratings that users have given each movie.
The matrix factors are output in the Matrix Market format with names
- GRAPH-NAME_U.mm
- GRAPH-NAME_V.mm
What you will learn:
- A special design pattern, which requires working outside the basic abstraction
- Writing a special preprocessor that orders the vertices by their degree
Source: https://github.com/GraphChi/graphchi-cpp/blob/master/example_apps/trianglecounting.cpp
This application starts with a base-graph, and streams additional edges to the graph from a streaming graph file (which must be in the EdgeListFormat), while simultaneously computing pagerank. The source code contains also an example of how to use an HTTP-based admin dashboard for !GraphChi applications. More information is added later, and be warned that the code in the example application is not finished and contains a lot of demonstration code.