- pySpark
- Linear regression
- Binary classification
- Multi-class classification
Implement the third stage of the data processing architecture: machine learning. Detailed information about the components can be found in the tasks below.
For each task, make sure that tox passes.
Please do not modify the provided tox manifest!
-
Preparation
Copy the appropriate source code from the previous task into this repository so that you can access the gathered data.
-
Start questions
- Guide:
  - why do we need to install Java, and how does pySpark work?
- do we need to use Java 8
- can we connect to an external cluster from python code
- can we deploy our python code to Spark cluster
  - how can we observe the progress of Spark jobs (Spark HTTP UI)
- logistic regression vs linear regression
- multi-class vs multi-label
- Spark:
- what is RDD
- what is DataFrame
- what is DataSet
- how Spark generally works (master, worker)
- Spark stack (SQL, ML, GraphX etc.)
- shuffling
  - reduceByKey vs groupByKey (see the sketch after this list)
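As a warm-up for the shuffling questions, here is a toy illustration of reduceByKey vs groupByKey; the word list and application name are made up for the example.

```python
# Toy word count comparing reduceByKey and groupByKey; the data is invented for illustration.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("shuffle-demo").getOrCreate()
sc = spark.sparkContext

pairs = sc.parallelize(["spark", "ml", "spark", "rdd", "ml", "spark"]).map(lambda w: (w, 1))

# reduceByKey combines values within each partition before the shuffle (map-side combine),
# so only partial sums travel across the network.
counts_reduce = pairs.reduceByKey(lambda a, b: a + b).collect()

# groupByKey shuffles every (key, value) pair first and aggregates afterwards,
# which moves more data for the same result.
counts_group = pairs.groupByKey().mapValues(sum).collect()

print(sorted(counts_reduce), sorted(counts_group))
```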
-
Spark installation
Update Dockerfile and docker-compose:
- download and install Spark
- run code using the pySpark command
-
MongoDB connection
Read data from MongoDB using Spark SQL and an appropriate connector (a minimal sketch follows this list).
- you need to add the connector JAR to the Spark runtime; use the --packages flag of pySpark
- MongoDB needs to be a separate service in docker-compose; use the appropriate directives so the services can reach each other
- use the DataFrame API
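For orientation, a minimal sketch of the connection through the DataFrame API, assuming the mongo-spark-connector 3.x package added via --packages and a docker-compose service named mongodb exposing a twitter.tweets collection; these names and versions are placeholders, not requirements.

```python
# Sketch only, assuming mongo-spark-connector 3.x added to the runtime, e.g.:
#   pyspark --packages org.mongodb.spark:mongo-spark-connector_2.12:3.0.1
# The host "mongodb", database "twitter" and collection "tweets" are placeholders.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("ml-stage")
    .config("spark.mongodb.input.uri", "mongodb://mongodb:27017/twitter.tweets")
    .getOrCreate()
)

# DataFrame API: the connector infers the schema from the collection.
tweets = spark.read.format("mongo").load()
tweets.printSchema()
```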
-
Data split
Divide the data into training and testing sets using some criterion (date of acquisition, first n items); see the sketch after this list.
- implement the data split manually (do not use a ready-made function)
- use a Bernoulli trial to decide which set each data point should go to
- how do you select the seed for the random number generator in each trial? (map over partitions and calculate the seed from the partition index; why?)
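One possible shape of such a manual split, assuming the tweets DataFrame and spark session from the connection sketch above; the 0.8 train fraction and the base seed are arbitrary choices for illustration.

```python
# Manual Bernoulli split sketch; TRAIN_FRACTION and BASE_SEED are illustrative values.
import random

TRAIN_FRACTION = 0.8
BASE_SEED = 42

def split_partition(partition_index, rows):
    # Seed each partition's RNG from the partition index so that parallel workers
    # do not all draw the same pseudo-random sequence.
    rng = random.Random(BASE_SEED + partition_index)
    for row in rows:
        yield row, rng.random() < TRAIN_FRACTION

tagged = tweets.rdd.mapPartitionsWithIndex(split_partition).cache()
train_df = spark.createDataFrame(tagged.filter(lambda t: t[1]).map(lambda t: t[0]), tweets.schema)
test_df = spark.createDataFrame(tagged.filter(lambda t: not t[1]).map(lambda t: t[0]), tweets.schema)
```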
-
Regression
Create a regression pipeline:
- select a tweet attribute that we want to be our dependent variable (retweets? hearts?)
- map data to contain a dependent variable and features vector columns
- create ML pipeline
- evaluate your regressor using RMSE on the train and test sets (see the sketch after this list)
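One way the pipeline could look, assuming retweets as the dependent variable, a few hypothetical numeric feature columns, and the train_df/test_df DataFrames from the data split sketch.

```python
# Regression pipeline sketch; the feature columns and the "retweets" label are assumptions.
from pyspark.ml import Pipeline
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.regression import LinearRegression
from pyspark.ml.evaluation import RegressionEvaluator

assembler = VectorAssembler(
    inputCols=["followers", "friends", "tweet_length"],  # hypothetical numeric columns
    outputCol="features",
)
lr = LinearRegression(featuresCol="features", labelCol="retweets")
pipeline = Pipeline(stages=[assembler, lr])

model = pipeline.fit(train_df)
evaluator = RegressionEvaluator(labelCol="retweets", predictionCol="prediction", metricName="rmse")
print("train RMSE:", evaluator.evaluate(model.transform(train_df)))
print("test RMSE:", evaluator.evaluate(model.transform(test_df)))
```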
-
Binary classification
Create a binary classification pipeline:
- select a tweet attribute that we want to be our class (has comments? has retweets?)
- map data to contain class and features vector columns
- create ML pipeline
- evaluate your classifier by computing the F1 metric (see the sketch after this list)
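A possible sketch, assuming "has retweets" as the binary class and the same hypothetical feature columns as in the regression example; F1 is computed with MulticlassClassificationEvaluator, which also covers the binary case.

```python
# Binary classification sketch; the "has retweets" label and feature columns are assumptions.
from pyspark.ml import Pipeline
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.evaluation import MulticlassClassificationEvaluator
from pyspark.sql import functions as F

def with_label(df):
    # Hypothetical label: 1.0 if the tweet has any retweets, 0.0 otherwise.
    return df.withColumn("label", (F.col("retweets") > 0).cast("double"))

assembler = VectorAssembler(inputCols=["followers", "friends", "tweet_length"], outputCol="features")
clf = LogisticRegression(featuresCol="features", labelCol="label")
pipeline = Pipeline(stages=[assembler, clf])

model = pipeline.fit(with_label(train_df))
evaluator = MulticlassClassificationEvaluator(labelCol="label", predictionCol="prediction", metricName="f1")
print("test F1:", evaluator.evaluate(model.transform(with_label(test_df))))
```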
-
Multi-class classification
Create a multi-class classification pipeline:
- select a tweet attribute that we want to be our class (multi-class, device? discretize retweets into buckets?)
- map data to contain class and features vector columns
- create ML pipeline
- evaluate your classifier using MulticlassClassificationEvaluator (see the sketch after this list)
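A possible sketch, assuming the retweet count is discretized into three buckets to form the class; the bucket boundaries and feature columns are placeholders.

```python
# Multi-class classification sketch; bucket boundaries and feature columns are assumptions.
from pyspark.ml import Pipeline
from pyspark.ml.feature import Bucketizer, VectorAssembler
from pyspark.ml.classification import RandomForestClassifier
from pyspark.ml.evaluation import MulticlassClassificationEvaluator

# Hypothetical classes: 0 retweets, 1-10 retweets, more than 10 retweets.
bucketizer = Bucketizer(splits=[0.0, 1.0, 11.0, float("inf")],
                        inputCol="retweets", outputCol="label")
assembler = VectorAssembler(inputCols=["followers", "friends", "tweet_length"], outputCol="features")
clf = RandomForestClassifier(featuresCol="features", labelCol="label")
pipeline = Pipeline(stages=[bucketizer, assembler, clf])

model = pipeline.fit(train_df)
evaluator = MulticlassClassificationEvaluator(labelCol="label", predictionCol="prediction",
                                              metricName="f1")
print("test F1:", evaluator.evaluate(model.transform(test_df)))
```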