This code provides connectivity between Apache Accumulo and Apache Spark. Its goals are to:
- Provide a native Spark interface for connecting to Accumulo
- Minimize data transfer between Spark and Accumulo
- Enable machine learning with Accumulo as the datastore
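The snippets below reference an `options` dictionary and a `schema`. A minimal setup might look like the following sketch; the option keys shown here are assumptions for illustration, not the connector's authoritative property names (see the notebooks for the supported keys).

```python
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, DoubleType

spark = SparkSession.builder.getOrCreate()

# Connection properties for the connector; these keys are
# illustrative assumptions, not the documented option names.
options = {
    "instance": "accumulo",          # Accumulo instance name (assumed key)
    "zookeepers": "localhost:2181",  # ZooKeeper hosts (assumed key)
    "user": "root",                  # Accumulo user (assumed key)
    "password": "secret",            # Accumulo password (assumed key)
    "table": "sample_table",         # table to read/write (assumed key)
}

# Schema describing the columns to retrieve from Accumulo.
schema = StructType([
    StructField("key", StringType()),
    StructField("value", DoubleType()),
])
```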
```python
# Read from Accumulo
df = (spark
      .read
      .format("com.microsoft.accumulo")
      .options(**options)  # define Accumulo properties
      .schema(schema)      # define schema for data retrieval
      .load())
```
```python
# Write to Accumulo
(df
 .write
 .format("com.microsoft.accumulo")
 .options(**options)
 .save())
```
See the PySpark notebook for a more detailed example.
See the Scala benchmark notebook for details on our evaluation.
The connector offers:
- Native Spark Datasource V2 API
- Row serialization using Avro
- Server-side filter pushdown (see the sketch after this list)
- Expressive filter language using JUEL
- Server-side ML inference pushdown using MLeap
- Support for Spark ML pipelines
- Minimal Java runtime footprint
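As a sketch of how filter pushdown surfaces in user code: an ordinary DataFrame filter is translated by the connector into a server-side filter evaluated in the Accumulo iterator, so only matching rows cross the network. The column names are the hypothetical ones from the schema sketch above.

```python
# A plain DataFrame filter; the connector can push this predicate
# down to the Accumulo tablet servers for server-side evaluation.
filtered = df.filter(df["value"] > 0.5)
filtered.show()
```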
The connector is composed of two components:
- The Datasource component provides the interface used on the Spark side
- The Iterator component provides server-side functionality on the Accumulo side
The components can be built and tested with Maven (version 3.3.9 or higher) using Java 8:

```bash
mvn clean install
```
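To build without running the test suite, Maven's standard skip flag applies:

```bash
mvn clean install -DskipTests
```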
Alternatively, the JARs are published to the Maven Central Repository.
The following steps are needed to deploy the connector:
- Deploy the iterator JAR to the Accumulo lib folder on all nodes and restart the cluster:
```bash
# use locally built shaded jar in connector/iterator/target folder
# or
# use maven to download the iterator from the central repository
mvn dependency:get -Dartifact=com.microsoft.masc:microsoft-accumulo-spark-iterator:[VERSION]
```
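For the locally built JAR, deployment amounts to copying it into the Accumulo lib folder on each node. A sketch, assuming Accumulo is installed under /opt/accumulo (adjust paths and the JAR name for your installation):

```bash
# Copy the shaded iterator JAR into the Accumulo lib folder
# (hypothetical installation path).
cp connector/iterator/target/microsoft-accumulo-spark-iterator-*.jar /opt/accumulo/lib/
```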
- Add the Datasource JAR to Spark:
```bash
# use locally built shaded jar in connector/datasource/target folder
# or
# pull in the package from the Maven Central repository
com.microsoft.masc:microsoft-accumulo-spark-datasource:[VERSION]
```
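One common way to pull the package from Maven Central is Spark's standard `--packages` flag, for example when launching PySpark (replace [VERSION] with a released version):

```bash
# Resolve the datasource package from Maven Central at startup.
pyspark --packages com.microsoft.masc:microsoft-accumulo-spark-datasource:[VERSION]
```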
While the iterator JAR runs on Accumulo tablet servers under JDK 1.8 or higher, the Spark Datasource component is only compatible with JDK 1.8 (not higher) due to Spark's Java support.