- Spark
- Scala (only)
- UCI Data Repository
- For all computations use the Spark API; do not collect the data to the driver in order to process it with standard Scala collections
-
Play with the example Spark project:
- build a jar (sbt package, or sbt assembly if you need a jar that bundles your dependencies)
- deploy the jar as a Spark job (spark-submit, local master)
-
```bash
vagrant up
vagrant ssh
cp -r /project/example /tmp   # building in the shared folder is painfully slow
cd /tmp/example
./sbt assembly
spark-submit --class ml.lsdp.example.Demo --master 'local[*]' \
  target/scala-2.11/spark-example-assembly-0.0.1.jar
```
- you can look at the Spark UI at http://localhost:4040
- explain the arguments: --master 'local[*]' and --class; what other values can you use?
- find out how to turn off Spark INFO logging (see the hint after this list)
- Do you need to create a project, build it and run it with spark-submit every time? (spark-shell, play with it)
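A hint for the logging item, assuming an interactive spark-shell session (where a SparkSession named spark already exists); the same call works on the SparkSession you create in your own job:

```scala
// In spark-shell a SparkSession called `spark` is created for you.
// Lowering the log level hides Spark's INFO chatter for the rest of the session.
spark.sparkContext.setLogLevel("WARN")
```

Adjusting Spark's log4j configuration file is the more permanent alternative.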
-
Select a dataset from the UCI repository that has:
- numeric attributes
- categorical attributes
- class attribute
- at least 100000 instances
-
Create a new Spark project (you can use the rapid project start, or copy the example); a minimal build.sbt sketch follows below
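If you start from scratch instead of copying the example, a minimal build.sbt could look like the sketch below; the Scala version matches the example assembly path (scala-2.11), while the Spark version and project name are assumptions you should align with the example project:

```scala
// build.sbt -- minimal sketch, versions are assumptions
name := "uci-spark"
version := "0.0.1"
scalaVersion := "2.11.12"

val sparkVersion = "2.4.8"  // hypothetical, use the same version as the example project
libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-sql"   % sparkVersion % "provided",
  "org.apache.spark" %% "spark-mllib" % sparkVersion % "provided"
)
```

The "provided" scope keeps Spark itself out of the assembly jar, since spark-submit supplies it at runtime.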
-
Create a case class that represents your data instance (a single row). Name the attributes appropriately; a hypothetical sketch follows below
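A hypothetical sketch of such a case class; the field names are illustrative and should be replaced with the columns of the dataset you actually picked:

```scala
// One row of the (hypothetical) dataset: numeric, categorical and class attributes.
case class Instance(
  age: Double,           // numeric attribute
  hoursPerWeek: Double,  // numeric attribute
  occupation: String,    // categorical attribute
  education: String,     // categorical attribute
  label: String          // class attribute
)
```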
-
Load your data into Spark as a Dataset (a sketch follows the list):
- spark.read
- provide a schema or use schema inference
- convert the DataFrame to a Dataset using .as[]
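A minimal sketch of the loading step, assuming a spark-shell style session (a SparkSession named spark exists), the hypothetical Instance case class from above, and a made-up CSV path; for .as[Instance] to work, the column names in the file must match the case class fields:

```scala
import spark.implicits._  // encoders needed by .as[Instance]

val data = spark.read
  .option("header", "true")
  .option("inferSchema", "true")   // or .schema(...) to provide the schema explicitly
  .csv("/tmp/my-dataset.csv")      // hypothetical path to the downloaded UCI file
  .as[Instance]                    // DataFrame -> Dataset[Instance]

data.show(5)
```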
-
Summarize your data using two methods (a sketch of the custom one follows the list):
- built-in
- implement your own summarization logic. A minimal set of attributes:
  - numerical:
    - min
    - max
    - mean
    - 95th percentile
  - categorical:
    - count of unique values
    - counts for each value
    - most common value
    - least common value
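The built-in summary is essentially data.describe() or data.summary(). A hedged sketch of the custom part, reusing the Dataset data and the hypothetical age and occupation columns from the loading step:

```scala
import org.apache.spark.sql.functions._

// Numerical attribute (hypothetical `age` column): min, max, mean.
data.agg(min("age"), max("age"), avg("age")).show()

// 95th percentile via the approximate-quantile helper (relative error 0.0 = exact).
val Array(p95Age) = data.stat.approxQuantile("age", Array(0.95), 0.0)

// Categorical attribute (hypothetical `occupation` column).
val valueCounts  = data.groupBy("occupation").count()        // counts for each value
val uniqueValues = valueCounts.count()                        // number of distinct values
val mostCommon   = valueCounts.orderBy(desc("count")).first()
val leastCommon  = valueCounts.orderBy(asc("count")).first()

println(s"p95(age)=$p95Age, unique=$uniqueValues, most=$mostCommon, least=$leastCommon")
```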
Train a classifier on your data (a sketch follows the list):
- split data into train and test
- select a classifier
- train it
- evaluate the classifier in two ways:
  - use built-in evaluation metrics
  - implement your own evaluation metrics:
    - accuracy
    - F1
    - precision
    - recall
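A hedged end-to-end sketch, assuming the Dataset data and the hypothetical columns from before, a binary class attribute, and spark.ml's Pipeline API (the built-in metric comes from MulticlassClassificationEvaluator); adapt the feature list and the classifier to your own dataset:

```scala
import org.apache.spark.ml.{Pipeline, PipelineStage}
import org.apache.spark.ml.classification.DecisionTreeClassifier
import org.apache.spark.ml.evaluation.MulticlassClassificationEvaluator
import org.apache.spark.ml.feature.{StringIndexer, VectorAssembler}

// Hypothetical 80/20 split with a fixed seed, so runs are reproducible.
val Array(train, test) = data.randomSplit(Array(0.8, 0.2), seed = 42L)

// Index the string class column and one categorical feature, then assemble the feature vector.
val labelIndexer = new StringIndexer().setInputCol("label").setOutputCol("labelIdx")
val occIndexer   = new StringIndexer().setInputCol("occupation").setOutputCol("occupationIdx")
val assembler = new VectorAssembler()
  .setInputCols(Array("age", "hoursPerWeek", "occupationIdx"))
  .setOutputCol("features")

val classifier = new DecisionTreeClassifier()
  .setLabelCol("labelIdx")
  .setFeaturesCol("features")

val model = new Pipeline()
  .setStages(Array[PipelineStage](labelIndexer, occIndexer, assembler, classifier))
  .fit(train)

val predictions = model.transform(test)

// Built-in metric.
val builtInAccuracy = new MulticlassClassificationEvaluator()
  .setLabelCol("labelIdx")
  .setPredictionCol("prediction")
  .setMetricName("accuracy")
  .evaluate(predictions)

// Own metrics from confusion-matrix counts (binary case, label index 1.0 taken as positive).
val tp = predictions.filter("labelIdx = 1.0 AND prediction = 1.0").count().toDouble
val fp = predictions.filter("labelIdx = 0.0 AND prediction = 1.0").count().toDouble
val fn = predictions.filter("labelIdx = 1.0 AND prediction = 0.0").count().toDouble
val tn = predictions.filter("labelIdx = 0.0 AND prediction = 0.0").count().toDouble

val accuracy  = (tp + tn) / (tp + tn + fp + fn)
val precision = tp / (tp + fp)
val recall    = tp / (tp + fn)
val f1        = 2 * precision * recall / (precision + recall)

println(s"built-in accuracy = $builtInAccuracy")
println(s"own: accuracy=$accuracy precision=$precision recall=$recall f1=$f1")
```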
Prepare a short markdown file with the results:
- data statistics
- classifier metrics
- execution times:
  - think about how to measure execution times properly (JIT warm-up, standard deviation across repeated runs, etc.); a timing-helper sketch follows below
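A minimal sketch of such a timing helper: repeat the measurement so a single JIT/cache warm-up run does not dominate the result, and remember that Spark is lazy, so measure an action (count, write, collect), not just the transformation chain. The measured action below is a hypothetical example:

```scala
// Run the block several times and report mean and standard deviation in milliseconds.
def timed[T](repetitions: Int)(block: => T): (Double, Double) = {
  val timesMs = (1 to repetitions).map { _ =>
    val start = System.nanoTime()
    block                                // forces evaluation of the measured work
    (System.nanoTime() - start) / 1e6    // elapsed time in milliseconds
  }
  val mean = timesMs.sum / timesMs.size
  val sd   = math.sqrt(timesMs.map(t => (t - mean) * (t - mean)).sum / timesMs.size)
  (mean, sd)
}

val (meanMs, sdMs) = timed(5) { data.count() }   // hypothetical measured action
println(f"count: $meanMs%.1f ms +/- $sdMs%.1f ms")
```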