LSDP - Task list 3

Spark deployment

Tech: Spark, Scala, sbt, Vagrant

Tasks

  • Data source: UCI Machine Learning Repository
  • For all computations use the Spark API; do not collect the data to the driver just to process it with plain Scala collections
  1. Play with the example Spark project:

    • build a jar (sbt package, or sbt assembly if you need a jar that bundles your dependencies)
    • deploy the jar as a Spark job (spark-submit, local master)
    •   vagrant up
        vagrant ssh
        cp -r /project/example /tmp   # building in the shared folder is painfully slow
        cd /tmp/example
        ./sbt assembly
        spark-submit --class ml.lsdp.example.Demo --master 'local[*]' target/scala-2.11/spark-example-assembly-0.0.1.jar
    • you can inspect the Spark UI at http://localhost:4040 while the job is running
    • explain the arguments local[*] and --class; what other values can you use?
    • find out how to turn off Spark's INFO logging
    • Do you need to create a project, build it and run it with spark-submit each time? (Try spark-shell; see the sketch below)
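
    For the last two points, a minimal interactive sketch (start spark-shell --master 'local[*]' inside the VM, then, at the Scala prompt):

        // silence Spark's INFO messages for the current session
        // (alternatively, copy conf/log4j.properties.template to conf/log4j.properties and lower the root log level)
        spark.sparkContext.setLogLevel("WARN")

        // quick sanity check that works without packaging a jar
        spark.range(1, 1000000).filter(_ % 7 == 0).count()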
  2. Select a dataset from the UCI repository that has:

    • numeric attributes
    • categorical attributes
    • class attribute
    • at least 100000 instances
  3. Create a new Spark project (you can use the rapid project start, or copy the example); a possible build.sbt is sketched below
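
    A minimal build.sbt sketch for a standalone project (the project name and Spark version are assumptions; align them with the example project):

        name := "lsdp-task3"              // hypothetical project name
        version := "0.0.1"
        scalaVersion := "2.11.12"         // matches the scala-2.11 artifact built above

        // "provided": spark-submit supplies Spark on the classpath at run time
        libraryDependencies ++= Seq(
          "org.apache.spark" %% "spark-core"  % "2.4.5" % "provided",
          "org.apache.spark" %% "spark-sql"   % "2.4.5" % "provided",
          "org.apache.spark" %% "spark-mllib" % "2.4.5" % "provided"
        )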

  4. Create a case class that represents a single data instance (one row). Name the attributes appropriately; a sketch follows below
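
    A sketch with hypothetical attribute names (adapt them to the dataset you picked):

        // hypothetical columns: two numeric attributes, two categorical attributes, and the class attribute
        case class Instance(
          age: Double,
          hoursPerWeek: Double,
          occupation: String,
          education: String,
          label: String
        )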

  5. Load your data into Spark as a Dataset (a loading sketch follows the list below)

    • spark.read
    • provide an explicit schema or use schema inference
    • convert the DataFrame to a Dataset using .as[]
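
    A minimal loading sketch, assuming the Instance case class above and a headerless CSV file (the path and column names are placeholders):

        import org.apache.spark.sql.SparkSession

        val spark = SparkSession.builder()
          .appName("lsdp-task3")
          .master("local[*]")
          .getOrCreate()
        import spark.implicits._           // provides the Encoder needed by .as[Instance]

        val data = spark.read
          .option("header", "false")       // many UCI files ship without a header row
          .option("inferSchema", "true")   // or pass an explicit StructType via .schema(...)
          .csv("/path/to/dataset.csv")     // placeholder path
          .toDF("age", "hoursPerWeek", "occupation", "education", "label")
          .as[Instance]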
  6. Summarize your data using two methods:

    • built-in (e.g. Dataset.describe)

    • implement your own summarization logic (a sketch follows this list). A minimal set of statistics:

      • numerical:

        • min
        • max
        • mean
        • 95th percentile
      • categorical:

        • count of unique values
        • counts for each value
        • most common value
        • least common value
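
      A sketch of the hand-rolled summaries, assuming the data Dataset from the loading sketch (one numeric and one categorical column shown; repeat per attribute):

        import org.apache.spark.sql.functions._

        // numeric: min, max, mean, approximate 95th percentile
        data.agg(
          min(col("age")), max(col("age")), avg(col("age")),
          expr("percentile_approx(age, 0.95)")
        ).show()

        // categorical: unique values, per-value counts, most/least common value
        val counts = data.groupBy(col("occupation")).count()
        println(counts.count())                 // number of unique values
        counts.orderBy(desc("count")).show()    // per-value counts, most common first
        counts.orderBy(asc("count")).show(1)    // least common value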
  7. Train a classifier on your data:

    • split the data into train and test sets

    • select a classifier

    • train it

    • evaluate the classifier in two ways:

      • use built-in evaluation metrics (e.g. MulticlassClassificationEvaluator)

      • implement your own evaluation metrics (a sketch follows this list):

        • accuracy
        • F1
        • precision
        • recall
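
      A sketch under the same hypothetical schema, using a decision tree as one possible choice (any spark.ml classifier is wired up the same way):

        import org.apache.spark.ml.Pipeline
        import org.apache.spark.ml.classification.DecisionTreeClassifier
        import org.apache.spark.ml.feature.{StringIndexer, VectorAssembler}

        val Array(train, test) = data.randomSplit(Array(0.8, 0.2), seed = 42)

        // minimal feature preparation: index the class and categorical columns,
        // then assemble everything into a single feature vector
        val labelIndexer = new StringIndexer().setInputCol("label").setOutputCol("labelIdx")
        val occIndexer   = new StringIndexer().setInputCol("occupation").setOutputCol("occupationIdx")
        val assembler    = new VectorAssembler()
          .setInputCols(Array("age", "hoursPerWeek", "occupationIdx"))
          .setOutputCol("features")
        val classifier   = new DecisionTreeClassifier()
          .setLabelCol("labelIdx")
          .setFeaturesCol("features")

        val model = new Pipeline()
          .setStages(Array(labelIndexer, occIndexer, assembler, classifier))
          .fit(train)
        val predictions = model.transform(test)

        // hand-rolled metrics, assuming a binary problem with the positive class indexed as 1.0
        val tp = predictions.filter("labelIdx = 1.0 AND prediction = 1.0").count().toDouble
        val fp = predictions.filter("labelIdx = 0.0 AND prediction = 1.0").count().toDouble
        val fn = predictions.filter("labelIdx = 1.0 AND prediction = 0.0").count().toDouble
        val tn = predictions.filter("labelIdx = 0.0 AND prediction = 0.0").count().toDouble
        val accuracy  = (tp + tn) / (tp + tn + fp + fn)
        val precision = tp / (tp + fp)
        val recall    = tp / (tp + fn)
        val f1        = 2 * precision * recall / (precision + recall)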
  8. Prepare a short markdown file with your results:

    • data statistics
    • classifier metrics
    • execution times:
      • think about how to measure execution times properly (JVM warm-up / JIT, standard deviation over repeated runs, etc.); see the sketch below
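
      A minimal timing sketch: one unmeasured warm-up run (to let the JIT and any caching kick in), then repeated measurements reported as mean and standard deviation:

        // times a Spark action: warm-up run first, then `runs` measured repetitions
        def time[A](runs: Int)(block: => A): (Double, Double) = {
          block                                      // warm-up, not measured
          val samples = (1 to runs).map { _ =>
            val start = System.nanoTime()
            block
            (System.nanoTime() - start) / 1e9        // seconds
          }
          val mean = samples.sum / runs
          val sd   = math.sqrt(samples.map(t => (t - mean) * (t - mean)).sum / runs)
          (mean, sd)
        }

        val (meanSec, sdSec) = time(runs = 5) { data.count() }
        println(f"count: $meanSec%.3f s (sd $sdSec%.3f s)")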