This repository complements my blog post on stateful streaming in Spark. It contains a working example that uses the same data structures discussed in the post.
This example assumes you already have Spark set up locally or remotely.
Steps to make this code run:
- Clone the code
- Set a checkpoint directory inside the application.conf file under the "checkpoint-directory" key:

```hocon
spark {
  spark-master-url = "local[*]"
  checkpoint-directory = ""
  timeout-in-minutes = 5
}
```
- This application consumes data from a socket stream (the simplest way to get the example working). For that to work, you need to pass two arguments to the program: a host and a port.
If you use IntelliJ, you can set these under "Program Arguments" in the run configuration. Otherwise, pass them as arguments to spark-submit, or supply them through your favorite IDE.
- Start netcat listening on the same port you pass to the program, e.g. `nc -lk 9999`.
- Take the data from resources/sample-data.txt and send it via netcat.
- Set breakpoints as you desire and watch the app run!
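The `spark` block in application.conf follows the Typesafe Config (HOCON) convention. A minimal sketch of loading it into a settings class might look like the following; the `SparkSettings` name and `fromConfig` helper are assumptions for illustration, not code from the repository:

```scala
import com.typesafe.config.{Config, ConfigFactory}

// Hypothetical settings holder; field names mirror the keys in application.conf.
final case class SparkSettings(masterUrl: String, checkpointDir: String, timeoutMinutes: Long)

object SparkSettings {
  // Reads the "spark" block from an already-parsed Config.
  def fromConfig(root: Config): SparkSettings = {
    val c = root.getConfig("spark")
    SparkSettings(
      masterUrl      = c.getString("spark-master-url"),
      checkpointDir  = c.getString("checkpoint-directory"),
      timeoutMinutes = c.getLong("timeout-in-minutes")
    )
  }

  // Loads application.conf from the classpath (Typesafe Config's default lookup).
  def load(): SparkSettings = fromConfig(ConfigFactory.load())
}
```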
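The host and port arguments could be validated with a small helper before wiring up the socket stream. This is a hypothetical sketch; the repository may parse them differently:

```scala
// Hypothetical argument parsing for the two required program arguments.
object Args {
  // Returns (host, port), failing fast with a usage message on bad input.
  def parse(args: Array[String]): (String, Int) = {
    require(args.length == 2, "usage: <host> <port>")
    val port = args(1).toInt
    require(port > 0 && port <= 65535, s"invalid port: $port")
    (args(0), port)
  }
}
```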
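To see what the breakpoints will land on, it helps to keep the per-key update logic in mind. The sketch below is a pure-Scala model of the kind of timeout-driven state transition Spark performs in `mapWithState` / `mapGroupsWithState` (matching the `timeout-in-minutes = 5` setting above); all names here (`UserSession`, `updateState`) are illustrative, not taken from the repository:

```scala
import scala.concurrent.duration._

// Illustrative per-key session state: accumulated events plus last activity time.
final case class UserSession(events: List[String], lastSeenMs: Long)

object StateUpdate {
  val Timeout: FiniteDuration = 5.minutes // mirrors timeout-in-minutes = 5

  // Returns the new state for a key, or None when an idle session is evicted.
  def updateState(current: Option[UserSession],
                  event: Option[String],
                  nowMs: Long): Option[UserSession] =
    (current, event) match {
      // No new event and the session has been idle past the timeout: evict it.
      case (Some(s), None) if nowMs - s.lastSeenMs >= Timeout.toMillis => None
      // New event on an existing session: append and refresh the timestamp.
      case (Some(s), Some(e)) => Some(UserSession(e :: s.events, nowMs))
      // First event for this key: create a fresh session.
      case (None, Some(e)) => Some(UserSession(List(e), nowMs))
      // No event, no timeout: keep the current state unchanged.
      case (other, None) => other
    }
}
```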