spark-rapids-examples

This is the RAPIDS Accelerator for Apache Spark examples repo. RAPIDS Accelerator for Apache Spark accelerates Spark applications with no code changes. The latest version of RAPIDS Accelerator is available for download. This repo contains examples and applications that showcase the performance and benefits of using RAPIDS Accelerator in data processing and machine learning pipelines. The examples fall into five broad categories, among them SQL/DF, XGBoost, ML/DL, and UDF.

For more information on each example, see its category.
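As a rough illustration of what "no code changes" means in practice, the sketch below builds a `spark-submit` command that enables the accelerator purely through configuration, leaving the application itself untouched. The helper name `rapids_submit_command`, the jar path, and the application file are hypothetical placeholders. Cluster-specific GPU resource settings are omitted and would still be needed on a real cluster.

```python
import shlex


def rapids_submit_command(app, rapids_jar, extra_conf=None):
    """Build a spark-submit argument list that turns on the RAPIDS plugin.

    `app` and `rapids_jar` are placeholder paths chosen by the caller.
    The application passed in `app` is not modified in any way.
    """
    conf = {
        # Loads the RAPIDS Accelerator SQL plugin.
        "spark.plugins": "com.nvidia.spark.SQLPlugin",
        # Enables GPU execution of supported SQL/DataFrame operations.
        "spark.rapids.sql.enabled": "true",
    }
    conf.update(extra_conf or {})

    cmd = ["spark-submit", "--jars", rapids_jar]
    for key, value in sorted(conf.items()):
        cmd += ["--conf", f"{key}={value}"]
    cmd.append(app)
    return cmd


if __name__ == "__main__":
    # Hypothetical file names, for illustration only.
    print(shlex.join(rapids_submit_command("my_app.py", "rapids-4-spark.jar")))
```

Setting `spark.rapids.sql.enabled` to `false` through `extra_conf` gives a CPU run of the same application, which is a convenient way to compare results.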

Here is the list of notebooks in this repo:

|   | Category | Notebook Name | Description |
|---|----------|---------------|-------------|
| 1 | SQL/DF | Microbenchmark | Spark SQL operations such as expand, hash aggregate, windowing, and cross joins, with up to 20x performance benefits |
| 2 | SQL/DF | Customer Churn | Data federation for modeling customer churn with sample telco customer data |
| 3 | XGBoost | Agaricus (Scala) | Uses an XGBoost classifier to build a model that accurately differentiates between edible and poisonous mushrooms with the agaricus dataset |
| 4 | XGBoost | Mortgage (Scala) | End-to-end ETL + XGBoost example that predicts mortgage default with Fannie Mae Single-Family Loan Performance Data |
| 5 | XGBoost | Taxi (Scala) | End-to-end ETL + XGBoost example that predicts taxi trip fare amount with the NYC taxi trips dataset |
| 6 | ML/DL | PCA | Spark-Rapids-ML based PCA example that trains and transforms with a synthetic dataset |
| 7 | ML/DL | DL Inference | Several notebooks demonstrating distributed model inference on Spark using `predict_batch_udf` across various frameworks: PyTorch, HuggingFace, vLLM, and TensorFlow |

Here is the list of Apache Spark applications (Scala and PySpark) in this repo that can be built to run on GPUs with RAPIDS Accelerator:

|   | Category | Application Name | Description |
|---|----------|------------------|-------------|
| 1 | XGBoost | Agaricus (Scala) | Uses an XGBoost classifier to build a model that accurately differentiates between edible and poisonous mushrooms with the agaricus dataset |
| 2 | XGBoost | Mortgage (Scala) | End-to-end ETL + XGBoost example that predicts mortgage default with Fannie Mae Single-Family Loan Performance Data |
| 3 | XGBoost | Taxi (Scala) | End-to-end ETL + XGBoost example that predicts taxi trip fare amount with the NYC taxi trips dataset |
| 4 | ML/DL | PCA | Spark-Rapids-ML based PCA example that trains and transforms with a synthetic dataset |
| 5 | UDF | URL Decode | Decodes URL-encoded strings using the Java APIs of RAPIDS cudf |
| 6 | UDF | URL Encode | URL-encodes strings using the Java APIs of RAPIDS cudf |
| 7 | UDF | CosineSimilarity | Computes the cosine similarity between two float vectors using native code |
| 8 | UDF | StringWordCount | Implements a Hive simple UDF using native code to count words in strings |
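
To make the four UDF entries above concrete, here is a plain-Python reference for what each one computes per row. It is only a CPU sketch of the semantics, not the repo's cudf or native implementations. The function names are hypothetical, and the percent-encoding rules are those of Python's `urllib.parse`, which may differ from the repo's UDFs in details such as how spaces and reserved characters are handled. Splitting words on whitespace is also an assumption.

```python
import math
from urllib.parse import quote, unquote


def url_encode(s: str) -> str:
    # Percent-encode every reserved character (safe="" leaves none unescaped).
    return quote(s, safe="")


def url_decode(s: str) -> str:
    # Reverse percent-encoding, e.g. "%20" becomes a space.
    return unquote(s)


def cosine_similarity(a, b) -> float:
    # dot(a, b) / (|a| * |b|) for two equal-length float vectors.
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return float("nan")  # undefined for a zero vector
    return dot / (norm_a * norm_b)


def string_word_count(s: str) -> int:
    # Count whitespace-separated tokens; runs of whitespace count once.
    return len(s.split())
```

A reference like this can be used to spot-check the output of the GPU UDFs on a small sample of rows.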
