- Spark
- Scala (only)
- UCI Data Repository
- For all computations use the Spark API; do not collect the data to the driver in order to process it with standard Scala collections
-
Play with the example Spark project:
- build a jar (sbt package, or sbt assembly if you need a jar that bundles your dependencies)
- deploy the jar as a Spark job (spark-submit, local master)
-
```bash
vagrant up
vagrant ssh
cp -r /project/example /tmp   # building in the shared folder is painfully slow
cd /tmp/example
./sbt assembly
spark-submit --class ml.lsdp.example.Demo --master 'local[*]' \
  target/scala-2.11/spark-example-assembly-0.0.1.jar
```
- you can look at the Spark UI at http://localhost:4040
- explain the arguments: --master 'local[*]' and --class; what other values can you use?
- find out how to turn off Spark INFO logging (see the hint after this list)
- Do you need to create a project, build it and run it with spark-submit every time? (spark-shell, play with it)
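A hint for the logging item, assuming an interactive spark-shell session (where a SparkSession named spark already exists); the same call works on the SparkSession you create in your own job:

```scala
// In spark-shell a SparkSession called `spark` is created for you.
// Lowering the log level hides Spark's INFO chatter for the rest of the session.
spark.sparkContext.setLogLevel("WARN")
```

Adjusting Spark's log4j configuration file is the more permanent alternative.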
-
Select a dataset from the UCI repository that has:
- numeric attributes
- categorical attributes
- class attribute
- at least 100000 instances
-
Create a new Spark project (you can use the rapid project start, or copy the example); a minimal build.sbt sketch follows below
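If you start from scratch instead of copying the example, a minimal build.sbt could look like the sketch below; the Scala version matches the example assembly path (scala-2.11), while the Spark version and project name are assumptions you should align with the example project:

```scala
// build.sbt -- minimal sketch, versions are assumptions
name := "uci-spark"
version := "0.0.1"
scalaVersion := "2.11.12"

val sparkVersion = "2.4.8"  // hypothetical, use the same version as the example project
libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-sql"   % sparkVersion % "provided",
  "org.apache.spark" %% "spark-mllib" % sparkVersion % "provided"
)
```

The "provided" scope keeps Spark itself out of the assembly jar, since spark-submit supplies it at runtime.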
-
Create a case class that represents your data instance (a single row). Name the attributes appropriately; a hypothetical sketch follows below
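A hypothetical sketch of such a case class; the field names are illustrative and should be replaced with the columns of the dataset you actually picked:

```scala
// One row of the (hypothetical) dataset: numeric, categorical and class attributes.
case class Instance(
  age: Double,           // numeric attribute
  hoursPerWeek: Double,  // numeric attribute
  occupation: String,    // categorical attribute
  education: String,     // categorical attribute
  label: String          // class attribute
)
```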
-
Load your data into Spark as a Dataset (a sketch follows the list):
- spark.read
- provide a schema or use schema inference
- convert the DataFrame to a Dataset using .as[]
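A minimal sketch of the loading step, assuming a spark-shell style session (a SparkSession named spark exists), the hypothetical Instance case class from above, and a made-up CSV path; for .as[Instance] to work, the column names in the file must match the case class fields:

```scala
import spark.implicits._  // encoders needed by .as[Instance]

val data = spark.read
  .option("header", "true")
  .option("inferSchema", "true")   // or .schema(...) to provide the schema explicitly
  .csv("/tmp/my-dataset.csv")      // hypothetical path to the downloaded UCI file
  .as[Instance]                    // DataFrame -> Dataset[Instance]

data.show(5)
```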
-
Summarize your data using two methods (a sketch of the custom one follows the list):
- built-in
- implement your own summarization logic. A minimal set of attributes:
  - numerical:
    - min
    - max
    - mean
    - 95th percentile
  - categorical:
    - count of unique values
    - counts for each value
    - most common value
    - least common value
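The built-in summary is essentially data.describe() or data.summary(). A hedged sketch of the custom part, reusing the Dataset data and the hypothetical age and occupation columns from the loading step:

```scala
import org.apache.spark.sql.functions._

// Numerical attribute (hypothetical `age` column): min, max, mean.
data.agg(min("age"), max("age"), avg("age")).show()

// 95th percentile via the approximate-quantile helper (relative error 0.0 = exact).
val Array(p95Age) = data.stat.approxQuantile("age", Array(0.95), 0.0)

// Categorical attribute (hypothetical `occupation` column).
val valueCounts  = data.groupBy("occupation").count()        // counts for each value
val uniqueValues = valueCounts.count()                        // number of distinct values
val mostCommon   = valueCounts.orderBy(desc("count")).first()
val leastCommon  = valueCounts.orderBy(asc("count")).first()

println(s"p95(age)=$p95Age, unique=$uniqueValues, most=$mostCommon, least=$leastCommon")
```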
Train a classifier on your data (a sketch follows the list):
- split data into train and test
- select a classifier
- train it
- evaluate the classifier in two ways:
  - use built-in evaluation metrics
  - implement your own evaluation metrics:
    - accuracy
    - F1
    - precision
    - recall
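A hedged end-to-end sketch, assuming the Dataset data and the hypothetical columns from before, a binary class attribute, and spark.ml's Pipeline API (the built-in metric comes from MulticlassClassificationEvaluator); adapt the feature list and the classifier to your own dataset:

```scala
import org.apache.spark.ml.{Pipeline, PipelineStage}
import org.apache.spark.ml.classification.DecisionTreeClassifier
import org.apache.spark.ml.evaluation.MulticlassClassificationEvaluator
import org.apache.spark.ml.feature.{StringIndexer, VectorAssembler}

// Hypothetical 80/20 split with a fixed seed, so runs are reproducible.
val Array(train, test) = data.randomSplit(Array(0.8, 0.2), seed = 42L)

// Index the string class column and one categorical feature, then assemble the feature vector.
val labelIndexer = new StringIndexer().setInputCol("label").setOutputCol("labelIdx")
val occIndexer   = new StringIndexer().setInputCol("occupation").setOutputCol("occupationIdx")
val assembler = new VectorAssembler()
  .setInputCols(Array("age", "hoursPerWeek", "occupationIdx"))
  .setOutputCol("features")

val classifier = new DecisionTreeClassifier()
  .setLabelCol("labelIdx")
  .setFeaturesCol("features")

val model = new Pipeline()
  .setStages(Array[PipelineStage](labelIndexer, occIndexer, assembler, classifier))
  .fit(train)

val predictions = model.transform(test)

// Built-in metric.
val builtInAccuracy = new MulticlassClassificationEvaluator()
  .setLabelCol("labelIdx")
  .setPredictionCol("prediction")
  .setMetricName("accuracy")
  .evaluate(predictions)

// Own metrics from confusion-matrix counts (binary case, label index 1.0 taken as positive).
val tp = predictions.filter("labelIdx = 1.0 AND prediction = 1.0").count().toDouble
val fp = predictions.filter("labelIdx = 0.0 AND prediction = 1.0").count().toDouble
val fn = predictions.filter("labelIdx = 1.0 AND prediction = 0.0").count().toDouble
val tn = predictions.filter("labelIdx = 0.0 AND prediction = 0.0").count().toDouble

val accuracy  = (tp + tn) / (tp + tn + fp + fn)
val precision = tp / (tp + fp)
val recall    = tp / (tp + fn)
val f1        = 2 * precision * recall / (precision + recall)

println(s"built-in accuracy = $builtInAccuracy")
println(s"own: accuracy=$accuracy precision=$precision recall=$recall f1=$f1")
```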
Prepare a short markdown file with the results:
- data statistics
- classifier metrics
- execution times:
  - think about how to measure execution times properly (JIT warm-up, standard deviation across repeated runs, etc.); a timing-helper sketch follows below
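A minimal sketch of such a timing helper: repeat the measurement so a single JIT/cache warm-up run does not dominate the result, and remember that Spark is lazy, so measure an action (count, write, collect), not just the transformation chain. The measured action below is a hypothetical example:

```scala
// Run the block several times and report mean and standard deviation in milliseconds.
def timed[T](repetitions: Int)(block: => T): (Double, Double) = {
  val timesMs = (1 to repetitions).map { _ =>
    val start = System.nanoTime()
    block                                // forces evaluation of the measured work
    (System.nanoTime() - start) / 1e6    // elapsed time in milliseconds
  }
  val mean = timesMs.sum / timesMs.size
  val sd   = math.sqrt(timesMs.map(t => (t - mean) * (t - mean)).sum / timesMs.size)
  (mean, sd)
}

val (meanMs, sdMs) = timed(5) { data.count() }   // hypothetical measured action
println(f"count: $meanMs%.1f ms +/- $sdMs%.1f ms")
```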