# Clickstream

Sample project that processes clickstream data using Kafka and Apache Spark.

## Setup

Install Scala, Kafka, and Apache Spark using Homebrew.
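
On macOS this is typically a single command (formula names as published by Homebrew; `brew info <formula>` shows where each one lands):

```sh
# Install the JVM toolchain and the streaming stack.
brew install scala kafka apache-spark
```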

```sh
# Point the environment at the installed JDK and Scala.
export JAVA_HOME="$(/usr/libexec/java_home)"
export PATH="$JAVA_HOME/bin:$PATH"

export SCALA_HOME="/usr/local/Cellar/scala/2.12.4"  # find your version with `brew info scala`
export PATH="$SCALA_HOME/bin:$PATH"
```

Make sure that `/usr/local/bin` is also added to your `$PATH`.

Use pyenv or similar to manage your Python versions and virtual environments. After creating and activating a virtual environment, install the dependencies with `pip install -r requirements.txt`.
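
One possible setup, assuming pyenv with the pyenv-virtualenv plugin (the Python version here is only an example; any Python 3 the project supports should do):

```sh
# Create and activate an isolated environment, then install dependencies.
# Assumes the pyenv-virtualenv plugin; the version is an example.
pyenv install 3.6.5
pyenv virtualenv 3.6.5 clickstream
pyenv activate clickstream
pip install -r requirements.txt
```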

To use production data, copy the CSV file to `data/production.csv`.

## Quickstart

See the `make` commands in the `Makefile` for running the services locally.
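
The target names below come from the walkthrough; what they wrap is an assumption based on a standard Homebrew Kafka install (configs under `/usr/local/etc/kafka/`), so check the `Makefile` for the real commands:

```sh
# Likely equivalents of the first few targets (assumption, not verified):
zookeeper-server-start /usr/local/etc/kafka/zookeeper.properties   # make zookeeper
kafka-server-start /usr/local/etc/kafka/server.properties          # make kafka
kafka-topics --create --zookeeper localhost:2181 \
  --replication-factor 1 --partitions 1 --topic clickstream        # make create_topic
```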

### iTerm walkthrough

1. Start Zookeeper: `make zookeeper`
2. Start Kafka: `make kafka`
3. In a new tab, create the `clickstream` topic with `make create_topic` (unless it already exists).
4. Start the simple Spark stream that monitors the `clickstream` topic and prints the messages to the command line: `make spark_read`
5. In a new tab, stream some sample data to Kafka: `make sample_data`
6. The sample data should appear in the simple stream in the previous tab (you can also check the topic directly with the console consumer sketched after this list).
7. Make sure that your production data (a really big CSV) is available at `data/production.csv`.
8. Start importing production data with `make production_data`.
9. Start the categories stream with `make spark_categories`.
10. The category counts should appear, computed over a sliding window with a 10-second interval.
11. The output of the previous stream should also be written to the file system in the `output` directory.
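
If a step doesn't behave as expected, Kafka's console consumer is handy for inspecting the topic directly (the broker address assumes the default local setup):

```sh
# Peek at the clickstream topic to confirm that messages are arriving.
kafka-console-consumer --bootstrap-server localhost:9092 \
  --topic clickstream --from-beginning
```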

## TODO

- Remove the noisy log4j logs. Could not find a way to get `spark-submit` to pick up a custom `log4j.properties`; one commonly suggested approach is sketched below.
- Continue setting up Docker, likely by creating a new Docker image with Spark and Python 3 on it.
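
For the log4j item, a commonly suggested approach (untested here; the entry point name is a placeholder) is to ship the properties file with the job and point both the driver and the executors at it:

```sh
# Ship log4j.properties alongside the job and tell both JVMs to load it.
# `clickstream.py` is a placeholder for the actual entry point.
spark-submit \
  --files log4j.properties \
  --conf "spark.driver.extraJavaOptions=-Dlog4j.configuration=file:log4j.properties" \
  --conf "spark.executor.extraJavaOptions=-Dlog4j.configuration=file:log4j.properties" \
  clickstream.py
```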