
Spark at Scale

This demo simulates a stream of email metadata. Data flows from Akka -> Kafka -> Spark Streaming -> Cassandra.

Kafka Setup

See the Kafka Setup Instructions in the KAFKA_SETUP.md file
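For reference, the topic used later in this README (emails, with enough partitions to cover the write parallelism) would be created along these lines with the Kafka tooling of this era; the ZooKeeper address, partition count, and replication factor here are illustrative, and KAFKA_SETUP.md remains the authoritative reference:

kafka-topics.sh --create --zookeeper localhost:2181 --replication-factor 1 --partitions 100 --topic emails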

Set up the keyspace and table

Note: You can change the replication factor (RF) and compaction settings in this CQL script if needed.

cqlsh -f /path_to_SparkAtScale/LoadMovieData/conf/email_db.cql
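The schema itself lives in email_db.cql. As a sketch of where those settings appear (the keyspace and table names here are illustrative; check the script for the real ones), the relevant CQL looks like:

CREATE KEYSPACE IF NOT EXISTS email_db WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 1};

ALTER TABLE email_db.emails WITH compaction = {'class': 'SizeTieredCompactionStrategy'};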

Set Up the Akka Feeder

Build the feeder fat jar

sbt feeder/assembly

Edit kafkaHost and kafkaTopic in dev.conf if needed. kafkaHost should match the setting in kafka/conf/server.properties, and kafkaTopic should match the topic name used when creating the topic.
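For example, assuming the keys are spelled as above, the relevant dev.conf entries would look something like this (the broker address and topic name are illustrative):

kafkaHost = "10.200.185.103:9092"
kafkaTopic = "emails"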

Run the feeder

Parameters:

  1. Number of feeders to start

  2. Time interval (ms) between messages sent by each feeder (one feeder sending a message every 100 ms equates to 10 messages/sec)

  3. Feeder name

Note: You will want to update the kafkaHost param in dev.conf to match the settings in kafka/conf/server.properties.

java -Xmx5g -Dconfig.file=dev.conf -jar feeder/target/scala-2.10/feeder-assembly-0.1.jar 1 100 emailFeeder 1>feeder-out.log 2>&1 &
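Here a single feeder sends one message every 100 ms (roughly 10 messages/sec) under the name emailFeeder, with stdout and stderr redirected to feeder-out.log.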

Run Spark Streaming

Build the streaming jar

sbt streaming/assembly

Note: You will want to reference the correct Spark version; for example, when running against Spark 1.4, use 1.4.1 instead of 1.5.0.
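The version is pinned in the build definition. A sketch of the kind of stanza to edit, assuming the standard Spark artifact coordinates (the exact file and layout in this repo may differ):

// illustrative sbt stanza; pin the Spark artifacts to your cluster's version
val sparkVersion = "1.5.0" // use "1.4.1" against a Spark 1.4 cluster
libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-streaming" % sparkVersion % "provided",
  "org.apache.spark" %% "spark-streaming-kafka" % sparkVersion
)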

Parameters:

  1. kafka broker: Ex. 10.200.185.103:9092

  2. debug flag (limited use): Ex. true or false

  3. checkpoint directory name: Ex. cfs://[optional-ip-address]/emails_checkpoint, dsefs://[optional-ip-address]/emails_checkpoint

  4. spark.streaming.kafka.maxRatePerPartition: maximum rate (records per second) at which each Kafka partition will be read

  5. batch interval (ms)

  6. auto.offset.reset: Ex. smallest or largest

  7. topic name

  8. kafka stream type: Ex. direct or receiver

  9. number of receivers to create (controls read parallelism) (receiver approach: typically this should be the number of nodes in the cluster)

  10. processing parallelism (controls write parallelism) (receiver approach: you'll want to match the number of partitions used when creating the topic)

  11. group.id that identifies the consumer processes (receiver approach: you'll want to match whatever was used when creating the topic)

  12. zookeeper connect string (e.g. localhost:2181) (receiver approach: this should match the ZooKeeper address used when creating the topic)

Running on a server in the foreground

dse spark-submit --driver-memory 2G --class sparkAtScale.StreamingDirectEmails \
  streaming/target/scala-2.10/streaming-assembly-0.1.jar \
  <kafka-broker-ip>:9092 true dsefs://[optional-ip-address]/emails_checkpoint \
  50000 5000 smallest emails direct 1 100 test-consumer-group localhost:2181
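Conceptually, StreamingDirectEmails wires those parameters into the standard Spark 1.x direct-stream pattern. A minimal sketch assuming the spark-streaming-kafka and spark-cassandra-connector APIs; the keyspace, table, and message format below are illustrative, not the repo's actual code:

import kafka.serializer.StringDecoder
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Milliseconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils
import com.datastax.spark.connector.streaming._

object DirectEmailsSketch {
  def main(args: Array[String]): Unit = {
    // Positional arguments as listed above (debug flag and trailing params elided)
    val Array(brokers, _, checkpointDir, maxRate, batchMs, offsetReset, topic, _*) = args

    val conf = new SparkConf()
      .setAppName("DirectEmailsSketch")
      .set("spark.streaming.kafka.maxRatePerPartition", maxRate) // parameter 4

    val ssc = new StreamingContext(conf, Milliseconds(batchMs.toLong)) // parameter 5
    ssc.checkpoint(checkpointDir)                                      // parameter 3

    val kafkaParams = Map(
      "metadata.broker.list" -> brokers,     // parameter 1
      "auto.offset.reset"    -> offsetReset) // parameter 6

    // Direct approach: one Kafka partition maps to one Spark partition
    val emails = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
      ssc, kafkaParams, Set(topic)) // parameter 7

    // Parse each message and write it to Cassandra; the "::"-delimited format
    // and the email_db.emails target are illustrative (the real schema is in email_db.cql)
    emails.map { case (_, line) =>
      val f = line.split("::")
      (f(0), f(1), f(2))
    }.saveToCassandra("email_db", "emails")

    ssc.start()
    ssc.awaitTermination()
  }
}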
