This demo simulates a stream of email metadata. Data flows from Akka -> Kafka -> Spark Streaming -> Cassandra.
See the Kafka Setup Instructions in the KAFKA_SETUP.md file
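If the topic does not exist yet, topic creation typically looks like the sketch below. This assumes a Kafka 0.8.x-style install with ZooKeeper on localhost and the topic name emails used later in this demo; the partition and replication counts are illustrative, and KAFKA_SETUP.md remains the authoritative reference.

# create the topic the feeder and streaming job will use (counts are examples only)
kafka-topics.sh --create --zookeeper localhost:2181 --replication-factor 1 --partitions 100 --topic emails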
Note: You can change the replication factor (RF) and compaction settings in this CQL script if needed.
cqlsh -f /path_to_SparkAtScale/LoadMovieData/conf/email_db.cql
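As a sketch of what such a change could look like after loading the script, assuming it creates a keyspace named email_db and a table named email_msg (both names are assumptions; check email_db.cql for the real ones):

# bump the keyspace replication factor (keyspace name is an assumption)
cqlsh -e "ALTER KEYSPACE email_db WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 3};"
# switch the table to leveled compaction (table name is an assumption)
cqlsh -e "ALTER TABLE email_db.email_msg WITH compaction = {'class': 'LeveledCompactionStrategy'};"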
sbt feeder/assembly
Edit kafkaHost and kafkaTopic if needed: kafkaHost should match the setting in kafka/conf/server.properties, and kafkaTopic should match the name used when creating the topic.
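As a minimal sketch of that edit, the key names kafkaHost and kafkaTopic are taken from the note above, but verify their exact paths in the feeder's config before relying on this. Appending works here because in Typesafe config the last occurrence of a key wins:

cat >> dev.conf <<'EOF'
kafkaHost = "localhost:9092"  # match kafka/conf/server.properties
kafkaTopic = "emails"         # match the topic created during Kafka setup
EOF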
Parameters:
- Number of feeders to start
- Time interval (ms) between requests sent by each feeder (1 feeder sending a message every 100 ms equates to 10 messages/sec)
- Feeder name
Note: You will want to update the kafkaHost param in dev.conf to match the settings in kafka/conf/server.properties.
java -Xmx5g -Dconfig.file=dev.conf -jar feeder/target/scala-2.10/feeder-assembly-0.1.jar 1 100 emailFeeder 1>feeder-out.log 2>&1 &
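For reference, the trailing arguments in that command map onto the parameter list above, and the redirected log can be followed while the feeder runs:

# 1           -> number of feeders to start
# 100         -> interval (ms) between messages (~10 messages/sec per feeder)
# emailFeeder -> feeder name
tail -f feeder-out.log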
sbt streaming/assembly
Note: You will want to reference the correct Spark version in the build; for example, when running against Spark 1.4, use 1.4.1 instead of 1.5.0.
Parameters (mapped onto the sample submit command below):
- kafka broker: Ex. 10.200.185.103:9092
- debug flag (limited use): Ex. true or false
- checkpoint directory name: Ex. cfs://[optional-ip-address]/emails_checkpoint or dsefs://[optional-ip-address]/emails_checkpoint
- spark.streaming.kafka.maxRatePerPartition: maximum rate (records per second) at which each Kafka partition is read
- batch interval (ms)
- auto.offset.reset: Ex. smallest or largest
- topic name
- kafka stream type: Ex. direct or receiver
- number of receivers to create (controls read parallelism; receiver approach: typically the number of nodes in the cluster)
- processing parallelism (controls write parallelism; receiver approach: should match the number of partitions used when creating the topic)
- group.id identifying the consumer processes (receiver approach: receivers sharing a group.id divide the topic's partitions among themselves)
- zookeeper connect string (e.g. localhost:2181; receiver approach: should point at the ZooKeeper used when creating the topic)
dse spark-submit --driver-memory 2G --class sparkAtScale.StreamingDirectEmails streaming/target/scala-2.10/streaming-assembly-0.1.jar <kafka-broker-ip>:9092 true dsefs://[optional-ip-address]/emails_checkpoint 50000 5000 smallest emails direct 1 100 test-consumer-group localhost:2181
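For clarity, the positional arguments in that sample command line up with the parameter list above as follows:

# <kafka-broker-ip>:9092                          -> kafka broker
# true                                            -> debug flag
# dsefs://[optional-ip-address]/emails_checkpoint -> checkpoint directory name
# 50000                                           -> spark.streaming.kafka.maxRatePerPartition
# 5000                                            -> batch interval (ms)
# smallest                                        -> auto.offset.reset
# emails                                          -> topic name
# direct                                          -> kafka stream type
# 1                                               -> number of receivers
# 100                                             -> processing parallelism
# test-consumer-group                             -> group.id
# localhost:2181                                  -> zookeeper connect string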