In modern data ecosystems, tracking data transformations is critical for ensuring integrity and compliance. Graph processing systems are powerful for analyzing connected data, but tracking their changes is complex. Provenance, which traces data lineage, helps manage this complexity by tracking graph transformations while enabling debugging and performance analysis.
ProvX introduces a reference design for integrating provenance into graph processing systems, providing fine-grained control over operations. It features a provenance query API for customizable data capture and a metrics system for monitoring performance. We demonstrate that users can control both the data captured and the overhead of provenance tracking.
| Directory | Description |
|---|---|
| `lib` | ProvX reference design implemented on top of Apache Spark's GraphX. |
| `metarunner` | Parameterisable Magpie configuration generator for running experiments on differently sized Apache Spark clusters. |
| `results` | Jupyter notebooks to plot results. |
| `run` | Experiment working directory and results location. |
- Scala 2.13 (with JDK 11), used to compile the ProvX library for GraphX.
- Apache Spark (3.3.2 with support for Scala 2.13) installation with Hadoop 3.2.4.
- SLURM cluster to run Apache Spark (via Magpie scripts).
- Go 1.23 for compiling the `metarunner` program.
- Python 3.10 for plotting the experiment results (`results` directory).
- `just` command runner.
- Install the appropriate Python, Go and Scala versions.
- Compile `lib` by running `just build` in the `lib` directory. This will produce the JAR file at `./lib/target/scala-2.13/provxlib-assembly-0.1.0-SNAPSHOT.jar`.
- Compile `metarunner` by running `just build` in the `metarunner` directory. Make sure to adjust the compilation flags depending on the target architecture of your SLURM cluster machines.
- `rsync` this repo (with the compiled library and metarunner) to your SLURM cluster.
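The build and copy steps above can be sketched as one small script, run from the repository root. This is only an illustration: the `REMOTE` login and `REMOTE_DIR` target path are hypothetical placeholders, the helper names `run` and `build_and_sync` are made up for this sketch, and the `rsync` options shown (`-a` archive mode, `-z` compression) are one reasonable choice rather than something the repository prescribes.

```shell
#!/usr/bin/env bash
# Hypothetical values: replace with your own cluster login and target path.
REMOTE="${REMOTE:-user@slurm-login.example.org}"
REMOTE_DIR="${REMOTE_DIR:-provx}"

# Run a command, or only print it when DRY_RUN=1.
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# Build the ProvX library and the metarunner, then copy the repo to the cluster.
# Each build runs in a subshell so the working directory is restored afterwards.
build_and_sync() {
    (cd lib && run just build) &&
    (cd metarunner && run just build) &&
    run rsync -az ./ "${REMOTE}:${REMOTE_DIR}/"
}

# Usage (from the repository root):
#   DRY_RUN=1 build_and_sync   # print the commands without running them
#   build_and_sync             # build both components and copy the repo
```

Setting `DRY_RUN=1` first is a cheap way to check the remote target before anything is copied.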
- Download and extract Apache Spark and Hadoop
$ wget https://archive.apache.org/dist/spark/spark-3.3.2/spark-3.3.2-bin-hadoop3-scala2.13.tgz
$ tar xvzf spark-3.3.2-bin-hadoop3-scala2.13.tgz
$ wget https://archive.apache.org/dist/hadoop/common/hadoop-3.2.4/hadoop-3.2.4.tar.gz
$ tar xvzf hadoop-3.2.4.tar.gz
- Clone Magpie repo
$ git clone git@github.com:LLNL/magpie.git
- Patch Apache Spark installation
$ cd spark-3.3.2-bin-hadoop3-scala2.13
$ patch -p1 < <magpie-repo-root>/patches/spark/spark-3.3.2-bin-hadoop3-alternate.patch
🔥 Make sure you followed all steps in the previous section! 🔥
- Update the experiment configurations in `metarunner/configs` to point to the correct JAR location for the ProvX library.
- Adjust `metarunner/templates/magpie.sbatch-srun-provx-with-yarn-and-hdfs` to suit your SLURM cluster's configuration.
- Generate the launch scripts by running the metarunner: `./metarunner -configDir ./configs -outputDir ./scripts`. In the `scripts` directory you will find the `sbatch` job scripts for setting up the Apache Spark cluster on your SLURM cluster. You can submit them individually via, e.g., `sbatch magpie.sbatch-srun-provx-es01-baseline-06.sh`, depending on which experiment you want to reproduce.
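Once a job has been submitted, the file it writes to is named by the `#SBATCH --output=` line of the job script. A minimal helper for reading that line could look like the sketch below; the function name `sbatch_output_path` is hypothetical, and the sketch assumes the directive appears at the start of a line, with or without double quotes around the value.

```shell
#!/usr/bin/env bash
# Print the value of the first "#SBATCH --output=" directive in a job script.
sbatch_output_path() {
    local script="$1"
    # Strip the directive prefix, keep the first match, drop surrounding quotes.
    sed -n 's/^#SBATCH --output=//p' "$script" | head -n 1 | tr -d '"'
}

# Usage:
#   sbatch_output_path ./scripts/magpie.sbatch-srun-provx-es01-baseline-06.sh
```

If the printed path contains a pattern such as `%j`, SLURM replaces it with the job ID when the job starts, so the actual file name on disk will differ accordingly.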
- SSH into the Apache Spark headnode. To find its hostname, check the submitted job file for the line starting with `#SBATCH --output=` to determine which output file the job is writing to; this file will contain the hostname of the Spark headnode.
- Change into this repo's `run` directory on the Spark headnode.
- Update the `justfile` depending on your setup.
- Run the following commands to set up the environment variables, download and copy the datasets from Graphalytics onto Hadoop, and do a dry run to determine whether everything is set up properly:
$ just env
$ source ./env
$ just setup
$ just dry
- Finally, the experiment can be executed using `just bench "<experiment description>"`. The results can be found in the location configured in the `justfile` in the `run` directory.