This project houses the experimental client for Spark Connect for Apache Spark, written in Rust.

The Spark Connect client for Rust is highly experimental and should not be used in any production setting. It is currently a proof of concept to identify the methods of interacting with a Spark cluster from Rust.
The `spark-connect-rs` crate aims to provide an entrypoint to Spark Connect and a similar DataFrame API.
```bash
docker compose up --build -d
```
```rust
use spark_connect_rs::{SparkSession, SparkSessionBuilder};
use spark_connect_rs::functions as F;

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Connect to a local Spark Connect server
    let spark: SparkSession = SparkSessionBuilder::remote("sc://127.0.0.1:15002/")
        .build()
        .await?;

    // Query a JSON file directly with Spark SQL
    let df = spark
        .sql("SELECT * FROM json.`/opt/spark/examples/src/main/resources/employees.json`")
        .await?;

    df.filter("salary >= 3500")
        .select(F::col("name"))
        .show(Some(5), None, None)
        .await?;

    // +-------------+
    // | show_string |
    // +-------------+
    // | +------+    |
    // | |name  |    |
    // | +------+    |
    // | |Andy  |    |
    // | |Justin|    |
    // | |Berta |    |
    // | +------+    |
    // |             |
    // +-------------+

    Ok(())
}
```
```bash
git clone https://github.com/sjrusso8/spark-connect-rs.git
git submodule update --init --recursive

docker compose up --build -d

cargo build && cargo test
```
The following section outlines some of the larger functionality that is not yet working with this Spark Connect implementation.
- TLS authentication & Databricks compatibility
- StreamingQueryManager
- Window and Pivot functions
- UDFs or any type of functionality that takes a closure (foreach, foreachBatch, etc.)
The Spark `SparkSession` type object and its implemented traits.

The Spark `DataFrame` type object and its implemented traits.
Spark Connect should respect the format as long as your cluster supports the specified type and has the required jars.
| DataFrameWriter | API | Comment |
| --------------- | --- | ------- |
| format          |     |         |
| option          |     |         |
| options         |     |         |
| mode            |     |         |
| bucketBy        |     |         |
| sortBy          |     |         |
| partitionBy     |     |         |
| save            |     |         |
| saveAsTable     |     |         |
| insertInto      |     |         |
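As a sketch of how these builder methods might chain together, the snippet below writes a DataFrame out as parquet. The exact signatures (`write`, a string-based `mode`, an async `save`) are assumptions for illustration rather than the crate's confirmed API.

```rust
use spark_connect_rs::{SparkSession, SparkSessionBuilder};

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    let spark: SparkSession = SparkSessionBuilder::remote("sc://127.0.0.1:15002/")
        .build()
        .await?;

    let df = spark.sql("SELECT * FROM range(100)").await?;

    // Hypothetical chaining of the methods in the table above;
    // names mirror the table, signatures are assumed.
    df.write()
        .format("parquet")
        .option("compression", "snappy")
        .mode("overwrite")
        .save("/tmp/spark-connect-rs/range")
        .await?;

    Ok(())
}
```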
Start a streaming job and return a `StreamingQuery` object to handle the stream operations.
| DataStreamWriter | API | Comment |
| ---------------- | --- | ------- |
| format           |     |         |
| foreach          |     |         |
| foreachBatch     |     |         |
| option           |     |         |
| options          |     |         |
| outputMode       |     | Uses an Enum for `OutputMode` |
| partitionBy      |     |         |
| queryName        |     |         |
| trigger          |     | Uses an Enum for `TriggerMode` |
| start            |     |         |
| toTable          |     |         |
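The sketch below shows how a streaming write might be started against Spark's built-in `rate` source. The `readStream`/`writeStream` entrypoints, the enum module path, and the enum variants are assumptions about the crate layout; only the method names come from the table above.

```rust
use spark_connect_rs::{SparkSession, SparkSessionBuilder};
// Assumed module path for the enums mentioned in the table above.
use spark_connect_rs::streaming::{OutputMode, TriggerMode};

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    let spark: SparkSession = SparkSessionBuilder::remote("sc://127.0.0.1:15002/")
        .build()
        .await?;

    // Read the built-in rate source, which emits rows on a timer.
    let df = spark
        .readStream()
        .format("rate")
        .option("rowsPerSecond", "5")
        .load()
        .await?;

    // Start the stream; outputMode and trigger take enums per the table
    // (variant names here are assumed, not confirmed).
    let _query = df
        .writeStream()
        .format("console")
        .queryName("rate_to_console")
        .outputMode(OutputMode::Append)
        .trigger(TriggerMode::ProcessingTime("5 seconds".to_string()))
        .start()
        .await?;

    Ok(())
}
```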
A handle to a query that is executing continuously in the background as new data arrives.
| StreamingQuery      | API | Comment |
| ------------------- | --- | ------- |
| awaitTermination    |     |         |
| exception           |     |         |
| explain             |     |         |
| processAllAvailable |     |         |
| stop                |     |         |
| id                  |     |         |
| isActive            |     |         |
| lastProgress        |     |         |
| name                |     |         |
| recentProgress      |     |         |
| runId               |     |         |
| status              |     |         |
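Continuing the streaming sketch, the handle returned by `start` might be driven like this. Method names follow the table's camelCase, though the crate may expose snake_case equivalents; which of these are async is also an assumption.

```rust
use spark_connect_rs::{SparkSession, SparkSessionBuilder};

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    let spark: SparkSession = SparkSessionBuilder::remote("sc://127.0.0.1:15002/")
        .build()
        .await?;

    // Start a small rate-source stream into an in-memory sink (API shape assumed).
    let query = spark
        .readStream()
        .format("rate")
        .load()
        .await?
        .writeStream()
        .format("memory")
        .queryName("rate_sink")
        .start()
        .await?;

    // Inspect and control the running query with the handles from the table.
    println!("id: {:?} active: {:?}", query.id(), query.isActive());
    query.processAllAvailable().await?; // block until pending data is processed
    query.stop().await?;

    Ok(())
}
```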
The Spark `Column` type object and its implemented traits.
Only a few of the functions are covered by unit tests.
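To illustrate how `Column` expressions and the `functions` module might compose, here is a variation on the earlier example. Only `F::col` appears in this README; `alias`, `gt`, `lit`, and passing a `Column` to `filter` are assumptions modeled on the PySpark API.

```rust
use spark_connect_rs::{SparkSession, SparkSessionBuilder};
use spark_connect_rs::functions as F;

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    let spark: SparkSession = SparkSessionBuilder::remote("sc://127.0.0.1:15002/")
        .build()
        .await?;

    let df = spark
        .sql("SELECT * FROM json.`/opt/spark/examples/src/main/resources/employees.json`")
        .await?;

    // `gt`, `alias`, and Column-based filters are assumed to mirror PySpark.
    df.filter(F::col("salary").gt(F::lit(3500)))
        .select(F::col("name").alias("employee"))
        .show(Some(5), None, None)
        .await?;

    Ok(())
}
```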
Spark schema objects have not yet been translated into Rust objects.
Create Spark literal types from these Rust types. For example, `lit(1_i64)` would be a `LongType()` in the schema. An array can be created like `lit([1_i16, 2_i16, 3_i16])`, which would result in an `ArrayType(Short)` since all the values of the slice can be translated into a literal type.
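A short sketch of the literal conversions described above. The type mappings (`LongType`, `ArrayType(Short)`) come from the text; that `lit` lives in the `functions` module, and that `select` accepts a `Vec` of columns with `alias`, are assumptions.

```rust
use spark_connect_rs::{SparkSession, SparkSessionBuilder};
use spark_connect_rs::functions as F; // assumed home of `lit`

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    let spark: SparkSession = SparkSessionBuilder::remote("sc://127.0.0.1:15002/")
        .build()
        .await?;

    let df = spark.sql("SELECT 1").await?;

    // lit(1_i64) -> LongType; lit([1_i16, 2_i16, 3_i16]) -> ArrayType(Short),
    // per the mappings described above. `alias` and the Vec overload are assumed.
    df.select(vec![
        F::lit(1_i64).alias("a_long"),
        F::lit([1_i16, 2_i16, 3_i16]).alias("a_short_array"),
    ])
    .show(Some(1), None, None)
    .await?;

    Ok(())
}
```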