L4 - 2019

Scope

  1. pySpark
  2. Linear regression
  3. Binary classification
  4. Multi-class classification

Tasks

Implement the third stage of the data processing architecture: machine learning. Detailed information about the components can be found in the tasks below.

For each task make sure that tox passes.

Please do not modify the provided tox manifest!

  1. Preparation

    Copy the appropriate source code from the previous task into this repository so that you can access the gathered data.

  2. Start questions

    • Guide
    • Why do we need to install Java? How does pySpark work?
    • Do we need to use Java 8?
    • Can we connect to an external cluster from Python code?
    • Can we deploy our Python code to a Spark cluster?
    • How can we observe Spark job progress? (Spark HTTP UI)
    • logistic regression vs linear regression
    • multi-class vs multi-label
    • Spark:
      • What is an RDD?
      • What is a DataFrame?
      • What is a Dataset?
      • How does Spark generally work? (master, worker)
      • the Spark stack (SQL, ML, GraphX, etc.)
      • shuffling
      • reduceByKey vs groupByKey (see the sketch after this task list)
  3. Spark installation

    Update the Dockerfile and docker-compose:

    • download and install Spark
    • run the code using the pyspark command (a smoke-test sketch follows the task list)
  4. MongoDB connection

    Read data from MongoDB using Spark SQL and the appropriate connector (see the sketch after this task list).

    • you need to add the connector JAR to the Spark runtime; use the --packages flag of pySpark
    • MongoDB needs to be a separate service in docker-compose; use the appropriate directives to connect the containers
    • use the DataFrame API
  5. Data split

    Divide the data into training and test sets using some criterion (date of acquisition, first n items).

    • implement the data split manually (do not use a ready-made function); a sketch follows the task list
    • use a Bernoulli trial to decide which set each data point goes to
    • how do you select the seed for the random number generator in each trial? (map over partitions and derive the seed from the partition index; why?)
  6. Regression

    Create a regression pipeline (see the sketch after this task list):

    • select a tweet attribute that we want to be our dependent variable (retweets? hearts?)
    • map the data so that it contains a dependent-variable column and a feature-vector column
    • create an ML pipeline
    • evaluate your regressor using RMSE on the training and test sets
  7. Binary classification

    Create a binary classification pipeline (see the sketch after this task list):

    • select a tweet attribute that we want to be our class (has comments? has retweets?)
    • map the data so that it contains a class column and a feature-vector column
    • create an ML pipeline
    • evaluate your classifier by computing the F1 metric
  8. Multi-class classification

    Create a multi-class classification pipeline (see the sketch after this task list):

    • select a tweet attribute that we want to be our class (multi-class: device? retweets discretized into buckets?)
    • map the data so that it contains a class column and a feature-vector column
    • create an ML pipeline
    • evaluate your classifier using MulticlassClassificationEvaluator
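
For the reduceByKey vs groupByKey question in task 2, a minimal local sketch on toy data: both produce the same sums, but reduceByKey combines values on each partition before the shuffle, while groupByKey ships every pair across the network first.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[*]").appName("reduce-vs-group").getOrCreate()
sc = spark.sparkContext

pairs = sc.parallelize([("a", 1), ("b", 2), ("a", 3), ("b", 4)], numSlices=2)

# reduceByKey combines values on each partition first (map-side combine),
# so only partial sums are shuffled.
sums_reduce = pairs.reduceByKey(lambda x, y: x + y).collect()

# groupByKey shuffles every (key, value) pair and only then groups them,
# which moves more data and can exhaust memory for skewed keys.
sums_group = pairs.groupByKey().mapValues(sum).collect()

print(sorted(sums_reduce))  # [('a', 4), ('b', 6)]
print(sorted(sums_group))   # [('a', 4), ('b', 6)]

spark.stop()
```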
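
For task 3, a hypothetical smoke-test script (the file name is an assumption) to confirm that the Spark installation works from Python; the Dockerfile and docker-compose changes themselves depend on the Spark version you choose.

```python
# smoke_test.py -- hypothetical file name; run with `spark-submit smoke_test.py`
# inside the container, or paste the lines into the `pyspark` shell.
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[*]").appName("smoke-test").getOrCreate()

df = spark.range(10)      # a tiny DataFrame with a single `id` column
print(df.count())         # 10
print(spark.version)      # the installed Spark version

spark.stop()
```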
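
For task 4, a sketch assuming the official mongo-spark-connector; the package coordinates, compose service name, database, and collection are placeholders, and option names differ between connector versions.

```python
from pyspark.sql import SparkSession

# Assumes pyspark/spark-submit was started with something like:
#   --packages org.mongodb.spark:mongo-spark-connector_2.11:2.4.1
# (match the connector coordinates to your Spark and Scala versions).
spark = SparkSession.builder.appName("mongo-read").getOrCreate()

# DataFrame API: the connector samples documents to infer the schema.
# `mongo` is the docker-compose service name; `tweets.posts` is a placeholder db.collection.
df = (
    spark.read.format("mongo")
    .option("uri", "mongodb://mongo:27017/tweets.posts")
    .load()
)

df.printSchema()
df.show(5)
```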
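
For task 5, a sketch of the manual Bernoulli split: the random generator is seeded from the partition index inside mapPartitionsWithIndex, so every partition gets a different but reproducible random stream even though the same function object is shipped to all workers.

```python
import random

def bernoulli_split(rdd, train_fraction=0.8, base_seed=42):
    """Split an RDD into (train, test) with an independent Bernoulli trial per record."""

    def tag_partition(index, records):
        # Seed derived from the partition index: reproducible, and different
        # partitions draw different random streams.
        rng = random.Random(base_seed + index)
        for record in records:
            yield (rng.random() < train_fraction, record)

    tagged = rdd.mapPartitionsWithIndex(tag_partition).cache()
    train = tagged.filter(lambda kv: kv[0]).map(lambda kv: kv[1])
    test = tagged.filter(lambda kv: not kv[0]).map(lambda kv: kv[1])
    return train, test

# Usage (assuming `df` is the DataFrame read from MongoDB in the previous task):
# train_rdd, test_rdd = bernoulli_split(df.rdd)
# train_df = spark.createDataFrame(train_rdd, df.schema)
# test_df = spark.createDataFrame(test_rdd, df.schema)
```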
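
For task 6, a sketch of a regression pipeline; the column names (`retweets`, `hearts`, `followers`) are placeholders, and `train_df` / `test_df` are the DataFrames produced by the split above.

```python
from pyspark.ml import Pipeline
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.regression import LinearRegression
from pyspark.ml.evaluation import RegressionEvaluator

# Placeholder columns -- adapt to whatever the gathered tweets actually contain.
assembler = VectorAssembler(inputCols=["hearts", "followers"], outputCol="features")
lr = LinearRegression(featuresCol="features", labelCol="retweets")
pipeline = Pipeline(stages=[assembler, lr])

model = pipeline.fit(train_df)

evaluator = RegressionEvaluator(labelCol="retweets", predictionCol="prediction",
                                metricName="rmse")
print("train RMSE:", evaluator.evaluate(model.transform(train_df)))
print("test  RMSE:", evaluator.evaluate(model.transform(test_df)))
```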
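
For task 7, the same pipeline shape with a logistic regression classifier; the "has retweets" label is one possible choice, and F1 is computed with MulticlassClassificationEvaluator, which also handles the binary case.

```python
from pyspark.ml import Pipeline
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.evaluation import MulticlassClassificationEvaluator
from pyspark.sql import functions as F

# Hypothetical label: 1.0 if the tweet has any retweets, else 0.0.
labeled_train = train_df.withColumn("label", (F.col("retweets") > 0).cast("double"))
labeled_test = test_df.withColumn("label", (F.col("retweets") > 0).cast("double"))

assembler = VectorAssembler(inputCols=["hearts", "followers"], outputCol="features")
clf = LogisticRegression(featuresCol="features", labelCol="label")
model = Pipeline(stages=[assembler, clf]).fit(labeled_train)

evaluator = MulticlassClassificationEvaluator(labelCol="label", predictionCol="prediction",
                                              metricName="f1")
print("test F1:", evaluator.evaluate(model.transform(labeled_test)))
```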
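
For task 8, only the label preparation and the classifier change; this sketch assumes the device attribute is a string column that StringIndexer turns into a numeric class, and uses one of the metrics offered by MulticlassClassificationEvaluator.

```python
from pyspark.ml import Pipeline
from pyspark.ml.feature import StringIndexer, VectorAssembler
from pyspark.ml.classification import RandomForestClassifier
from pyspark.ml.evaluation import MulticlassClassificationEvaluator

# Hypothetical setup: the `device` string column becomes the class label.
indexer = StringIndexer(inputCol="device", outputCol="label", handleInvalid="keep")
assembler = VectorAssembler(inputCols=["hearts", "followers", "retweets"],
                            outputCol="features")
clf = RandomForestClassifier(featuresCol="features", labelCol="label")

model = Pipeline(stages=[indexer, assembler, clf]).fit(train_df)

evaluator = MulticlassClassificationEvaluator(labelCol="label", predictionCol="prediction",
                                              metricName="accuracy")
print("test accuracy:", evaluator.evaluate(model.transform(test_df)))
```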
