- pySpark
- Linear regression
- Binary classification
- Multi-class classification
Implement the third stage of the data processing architecture: machine learning. Detailed information about the components can be found in the tasks below.
For each task, make sure that tox passes.
Please do not modify the provided tox manifest!
-
Preparation
Copy the appropriate source code from the previous task into this repository so that you can access the gathered data.
-
Start questions
- Guide:
  - why do we need to install Java, and how does pySpark work?
- do we need to use Java 8
- can we connect to an external cluster from python code
- can we deploy our python code to Spark cluster
  - how can we observe the progress of Spark jobs (Spark HTTP UI)
- logistic regression vs linear regression
- multi-class vs multi-label
- Spark:
- what is RDD
- what is DataFrame
- what is DataSet
- how Spark generally works (master, worker)
- Spark stack (SQL, ML, GraphX etc.)
- shuffling
  - reduceByKey vs groupByKey (see the sketch after this list)
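As a warm-up for the shuffling questions, here is a toy illustration of reduceByKey vs groupByKey; the word list and application name are made up for the example.

```python
# Toy word count comparing reduceByKey and groupByKey; the data is invented for illustration.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("shuffle-demo").getOrCreate()
sc = spark.sparkContext

pairs = sc.parallelize(["spark", "ml", "spark", "rdd", "ml", "spark"]).map(lambda w: (w, 1))

# reduceByKey combines values within each partition before the shuffle (map-side combine),
# so only partial sums travel across the network.
counts_reduce = pairs.reduceByKey(lambda a, b: a + b).collect()

# groupByKey shuffles every (key, value) pair first and aggregates afterwards,
# which moves more data for the same result.
counts_group = pairs.groupByKey().mapValues(sum).collect()

print(sorted(counts_reduce), sorted(counts_group))
```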
-
Spark installation
Update Dockerfile and docker-compose:
- download and install Spark
- run code using the pySpark command
-
MongoDB connection
Read data from MongoDB using Spark SQL and an appropriate connector (a minimal sketch follows this list).
- you need to add the connector JAR to the Spark runtime; use the --packages flag of pySpark
- MongoDB needs to be a separate service in docker-compose; use the appropriate directives so the services can reach each other
- use the DataFrame API
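For orientation, a minimal sketch of the connection through the DataFrame API, assuming the mongo-spark-connector 3.x package added via --packages and a docker-compose service named mongodb exposing a twitter.tweets collection; these names and versions are placeholders, not requirements.

```python
# Sketch only, assuming mongo-spark-connector 3.x added to the runtime, e.g.:
#   pyspark --packages org.mongodb.spark:mongo-spark-connector_2.12:3.0.1
# The host "mongodb", database "twitter" and collection "tweets" are placeholders.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("ml-stage")
    .config("spark.mongodb.input.uri", "mongodb://mongodb:27017/twitter.tweets")
    .getOrCreate()
)

# DataFrame API: the connector infers the schema from the collection.
tweets = spark.read.format("mongo").load()
tweets.printSchema()
```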
-
Data split
Divide the data into training and testing sets using some criterion (date of acquisition, first n items); see the sketch after this list.
- implement the data split manually (do not use a ready-made function)
- use a Bernoulli trial to decide which set each data point should go to
- how do you select the seed for the random number generator in each trial? (map over partitions and calculate the seed from the partition index; why?)
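One possible shape of such a manual split, assuming the tweets DataFrame and spark session from the connection sketch above; the 0.8 train fraction and the base seed are arbitrary choices for illustration.

```python
# Manual Bernoulli split sketch; TRAIN_FRACTION and BASE_SEED are illustrative values.
import random

TRAIN_FRACTION = 0.8
BASE_SEED = 42

def split_partition(partition_index, rows):
    # Seed each partition's RNG from the partition index so that parallel workers
    # do not all draw the same pseudo-random sequence.
    rng = random.Random(BASE_SEED + partition_index)
    for row in rows:
        yield row, rng.random() < TRAIN_FRACTION

tagged = tweets.rdd.mapPartitionsWithIndex(split_partition).cache()
train_df = spark.createDataFrame(tagged.filter(lambda t: t[1]).map(lambda t: t[0]), tweets.schema)
test_df = spark.createDataFrame(tagged.filter(lambda t: not t[1]).map(lambda t: t[0]), tweets.schema)
```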
-
Regression
Create a regression pipeline:
- select a tweet attribute that we want to be our dependent variable (retweets? hearts?)
- map data to contain a dependent variable and features vector columns
- create ML pipeline
- evaluate your regressor using RMSE on the train and test sets (see the sketch after this list)
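One way the pipeline could look, assuming retweets as the dependent variable, a few hypothetical numeric feature columns, and the train_df/test_df DataFrames from the data split sketch.

```python
# Regression pipeline sketch; the feature columns and the "retweets" label are assumptions.
from pyspark.ml import Pipeline
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.regression import LinearRegression
from pyspark.ml.evaluation import RegressionEvaluator

assembler = VectorAssembler(
    inputCols=["followers", "friends", "tweet_length"],  # hypothetical numeric columns
    outputCol="features",
)
lr = LinearRegression(featuresCol="features", labelCol="retweets")
pipeline = Pipeline(stages=[assembler, lr])

model = pipeline.fit(train_df)
evaluator = RegressionEvaluator(labelCol="retweets", predictionCol="prediction", metricName="rmse")
print("train RMSE:", evaluator.evaluate(model.transform(train_df)))
print("test RMSE:", evaluator.evaluate(model.transform(test_df)))
```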
-
Binary classification
Create a binary classification pipeline:
- select a tweet attribute that we want to be our class (has comments? has retweets?)
- map data to contain class and features vector columns
- create ML pipeline
- evaluate your classifier by computing the F1 metric (see the sketch after this list)
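A possible sketch, assuming "has retweets" as the binary class and the same hypothetical feature columns as in the regression example; F1 is computed with MulticlassClassificationEvaluator, which also covers the binary case.

```python
# Binary classification sketch; the "has retweets" label and feature columns are assumptions.
from pyspark.ml import Pipeline
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.evaluation import MulticlassClassificationEvaluator
from pyspark.sql import functions as F

def with_label(df):
    # Hypothetical label: 1.0 if the tweet has any retweets, 0.0 otherwise.
    return df.withColumn("label", (F.col("retweets") > 0).cast("double"))

assembler = VectorAssembler(inputCols=["followers", "friends", "tweet_length"], outputCol="features")
clf = LogisticRegression(featuresCol="features", labelCol="label")
pipeline = Pipeline(stages=[assembler, clf])

model = pipeline.fit(with_label(train_df))
evaluator = MulticlassClassificationEvaluator(labelCol="label", predictionCol="prediction", metricName="f1")
print("test F1:", evaluator.evaluate(model.transform(with_label(test_df))))
```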
-
Multi-class classification
Create a multi-class classification pipeline:
- select a tweet attribute that we want to be our class (multi-class, device? discretize retweets into buckets?)
- map data to contain class and features vector columns
- create ML pipeline
- evaluate your classifier using MulticlassClassificationEvaluator (see the sketch after this list)
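A possible sketch, assuming the retweet count is discretized into three buckets to form the class; the bucket boundaries and feature columns are placeholders.

```python
# Multi-class classification sketch; bucket boundaries and feature columns are assumptions.
from pyspark.ml import Pipeline
from pyspark.ml.feature import Bucketizer, VectorAssembler
from pyspark.ml.classification import RandomForestClassifier
from pyspark.ml.evaluation import MulticlassClassificationEvaluator

# Hypothetical classes: 0 retweets, 1-10 retweets, more than 10 retweets.
bucketizer = Bucketizer(splits=[0.0, 1.0, 11.0, float("inf")],
                        inputCol="retweets", outputCol="label")
assembler = VectorAssembler(inputCols=["followers", "friends", "tweet_length"], outputCol="features")
clf = RandomForestClassifier(featuresCol="features", labelCol="label")
pipeline = Pipeline(stages=[bucketizer, assembler, clf])

model = pipeline.fit(train_df)
evaluator = MulticlassClassificationEvaluator(labelCol="label", predictionCol="prediction",
                                              metricName="f1")
print("test F1:", evaluator.evaluate(model.transform(test_df)))
```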