Ydot19/taming-pyspark

Taming PySpark

Covers the initial learning of running PySpark jobs locally and in a cluster.

Environment Variables

BASE_DATA_PATH= # data directory relative to root project directory (ex. BASE_DATA_PATH=taming_pyspark/data)
MOVIE_LENS= 
FRIENDS=
TEMP_1800S=
WORD_COUNT=
CONSUMER_SPENDING=
HEROES=
  • Based on the folder names in the data.zip file
  • See config.py

Navigating the taming_pyspark directory

  • Each section below describes what one folder focuses on

Objective:

  • Location for functions shared across multiple modules

Objective: Count unique words in the word_count/Book.txt file in data.zip

| Files | Topics | Additional Notes |
| --- | --- | --- |
| word_count_rdd.py | RDD; flatMap | Uses the data in the word_count folder of data.zip; word counts include special characters |
| word_count_df.py | DataFrames; RDD to DataFrames; sorting | Uses the data in the word_count folder of data.zip; word counts include special characters |
| word_count_regex_rdd.py | RDD; regular expressions; sorting; re module | Normalizes words and removes special characters |
| word_count_regex_df.py | DataFrames; RDD to DataFrames; regular expressions; sorting; re module | Normalizes words and removes special characters |

Objective:

  • Use the customer-purchases/customer_orders.csv file in data.zip
  • Find the total amount spent per customer ID
  • Sort customers by total spend

| Files | Topics | Additional Notes |
| --- | --- | --- |
| spending_df.py | DataFrames; aggregate functions | |
| spending_rdd.py | RDDs; reduceByKey | |

Objective:

  • Use the friends-dataset/fakefriends.csv file in data.zip
  • A pseudo friends/peers dataset
  • Use PySpark to determine the number of friends/peers by age

| Files | Topics | Additional Notes |
| --- | --- | --- |
| fake_friends_df.py | DataFrames; aggregate functions | |
| fake_friends_rdd.py | RDDs; reduceByKey; text-to-CSV parsing | |

Objective:

  • Uses the heroes data in data.zip (see the HEROES environment variable)
  • Determines the most popular hero based on how many unique heroes they have had an encounter with

| Files | Topics | Additional Notes |
| --- | --- | --- |
| most_popular_hero.py | udf (user-defined functions); DataFrames; adding columns (withColumn); RDD to DataFrame | |

Objective:

  • Uses the data found in the ml-100k/ folder in data.zip
  • Find the most-watched movies
  • Count and group movies by rating
  • Recommend movies based on previous users' ratings using a least-squares algorithm

| Files | Topics | Additional Notes |
| --- | --- | --- |
| ratings_counter.py | RDD; text-to-CSV parsing; ordered dictionary; countByValue | |
| most_watched.py | DataFrames; udf (user-defined functions); udf with multiple inputs; passing a dictionary to a UDF | |
| recommendation.py | DataFrames; complex aggregations; class-based PySpark runs | |

Objective:

  • Introduction to PySpark Structured Streaming
  • Log ingestion for HTTP requests
  • Uses the streams/logs.txt file in data.zip as the sample logs
  • Run the stream, then drop a copy of logs.txt into the logs directory that the script creates

| Files | Topics | Additional Notes |
| --- | --- | --- |
| stream_logs.py | Structured Streaming | |

Objective:

  • Shows the minimum temperature observed at each weather station by default

| Files | Topics | Additional Notes |
| --- | --- | --- |
| min_max_temp_df.py | DataFrames; filter | |
| min_max_temp_rdd.py | RDDs; filter; reduceByKey | |

About

Learning PySpark