Covers the initial learning of running PySpark jobs locally and in a cluster
BASE_DATA_PATH= # data directory relative to root project directory (ex. BASE_DATA_PATH=taming_pyspark/data)
MOVIE_LENS=
FRIENDS=
TEMP_1800S=
WORD_COUNT=
CONSUMER_SPENDING=
HEROES=
- Based on the folder names in the data.zip file
- See config.py
- Each section below describes what the corresponding folder focuses on
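The repo's `config.py` is not shown here, but a minimal sketch of how the environment variables above might be resolved into data paths could look like this (the `data_path` helper name and the default value are illustrative, not taken from the actual repo):

```python
import os

# Base data directory, overridable via the BASE_DATA_PATH env var
# (the default shown is the example value from this README).
BASE_DATA_PATH = os.environ.get("BASE_DATA_PATH", "taming_pyspark/data")

def data_path(folder_env_var: str) -> str:
    """Join BASE_DATA_PATH with the folder named by an env var such as WORD_COUNT."""
    return os.path.join(BASE_DATA_PATH, os.environ.get(folder_env_var, ""))
```

For example, with `WORD_COUNT=word_count` set, `data_path("WORD_COUNT")` resolves to `taming_pyspark/data/word_count`.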
Objective:
- Location for helper functions used in multiple places
Objective: Count unique words in the word_count/Book.txt file in data.zip
Files | Topic | Additional Notes |
---|---|---|
word_count_rdd.py | - RDD - flatMap | - Data in the word_count folder in data.zip - Word count includes special characters |
word_count_df.py | - Dataframes - RDD to Dataframes - Sorting | - Data in the word_count folder in data.zip - Word count includes special characters |
word_count_regex_rdd.py | - RDD - Regular Expressions - Sorting - re module | - Normalizes words and removes special characters |
word_count_regex_df.py | - Dataframes - RDD to Dataframes - Regular Expressions - Sorting - re module | - Normalizes words and removes special characters |
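A minimal sketch of the regex-based RDD variant might look like the following. The normalization pattern is an assumption (the exact regex in `word_count_regex_rdd.py` may differ), and the input path is illustrative:

```python
import re

WORD_RE = re.compile(r"[a-z0-9']+")

def normalize_words(text: str):
    # Lowercase the line and keep only runs of letters/digits/apostrophes,
    # which strips punctuation and other special characters.
    return WORD_RE.findall(text.lower())

def main():
    # Requires a local Spark installation; the input path is illustrative.
    from pyspark import SparkContext
    sc = SparkContext(appName="WordCountRegex")
    lines = sc.textFile("word_count/Book.txt")
    counts = (lines.flatMap(normalize_words)      # one record per word
                   .map(lambda word: (word, 1))
                   .reduceByKey(lambda a, b: a + b)
                   .sortBy(lambda kv: kv[1], ascending=False))
    for word, count in counts.take(20):
        print(word, count)
    sc.stop()

if __name__ == "__main__":
    main()
```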
Objective:
- Use the customer-purchases/customer_orders.csv file in data.zip
- Find the total amount spent per customer ID
- Sort customers by total spend
Files | Topic | Additional Notes |
---|---|---|
spending_df.py | - Dataframes - Aggregate Functions | |
spending_rdd.py | - RDDs - reduceByKey | |
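The RDD version of this aggregation can be sketched as follows. The CSV column layout (`customer_id,item_id,amount`) and the input path are assumptions:

```python
def parse_order(line: str):
    # customer_orders.csv rows are assumed to be: customer_id,item_id,amount
    customer_id, _item_id, amount = line.split(",")
    return int(customer_id), float(amount)

def main():
    from pyspark import SparkContext
    sc = SparkContext(appName="CustomerSpending")
    orders = sc.textFile("customer-purchases/customer_orders.csv")
    totals = (orders.map(parse_order)
                    .reduceByKey(lambda a, b: a + b)  # sum spend per customer
                    .sortBy(lambda kv: kv[1]))        # sort by total spend
    for customer_id, total in totals.collect():
        print(f"{customer_id}: {total:.2f}")
    sc.stop()

if __name__ == "__main__":
    main()
```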
Objective:
- Use the friends-dataset/fakefriends.csv file in data.zip
- A pseudo friends/peers dataset
- Use PySpark to determine the number of friends/peers by age
Files | Topic | Additional Notes |
---|---|---|
fake_friends_df.py | - Dataframes - Aggregate Functions | |
fake_friends_rdd.py | - RDDs - reduceByKey - Text to CSV Parsing | |
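A common version of this exercise computes the average friend count per age; whether the repo's scripts average or simply count is an assumption here, as is the CSV layout (`id,name,age,num_friends`):

```python
def parse_friend_row(line: str):
    # fakefriends.csv rows are assumed to be: id,name,age,num_friends
    fields = line.split(",")
    return int(fields[2]), int(fields[3])

def main():
    from pyspark import SparkContext
    sc = SparkContext(appName="FriendsByAge")
    rows = sc.textFile("friends-dataset/fakefriends.csv").map(parse_friend_row)
    # Average friend count per age: carry (total, count) pairs through reduceByKey.
    averages = (rows.mapValues(lambda n: (n, 1))
                    .reduceByKey(lambda a, b: (a[0] + b[0], a[1] + b[1]))
                    .mapValues(lambda t: t[0] / t[1]))
    for age, avg in sorted(averages.collect()):
        print(age, round(avg, 2))
    sc.stop()

if __name__ == "__main__":
    main()
```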
Objective:
- Use the friends-dataset/fakefriends.csv file in data.zip
- Determines the most popular hero based on how many unique heroes they have had an encounter with
Files | Topic | Additional Notes |
---|---|---|
most_popular_hero.py | - udf (user-defined functions) - Dataframes - Adding Columns (withColumns) - RDD to DF | |
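One way to sketch this is with a space-separated adjacency list (a hero ID followed by the IDs of heroes it shares appearances with) loaded as an RDD and converted to a DataFrame. The data format and input path are assumptions, and this sketch uses built-in aggregations rather than the udf approach the table mentions:

```python
def count_co_appearances(line: str):
    # Assumed format: hero_id followed by the IDs of co-appearing heroes.
    fields = line.split()
    return int(fields[0]), len(fields) - 1

def main():
    from pyspark.sql import SparkSession, functions as F
    spark = SparkSession.builder.appName("MostPopularHero").getOrCreate()
    # A hero's list can span multiple lines, so sum the counts per ID.
    rdd = spark.sparkContext.textFile("heroes/graph.txt").map(count_co_appearances)
    df = rdd.toDF(["hero_id", "connections"])
    (df.groupBy("hero_id")
       .agg(F.sum("connections").alias("total_connections"))
       .orderBy(F.desc("total_connections"))
       .show(1))
    spark.stop()

if __name__ == "__main__":
    main()
```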
Objective:
- Uses the data found in the ml-100k/ folder in data.zip
- Find the most-watched movies
- Count and group movies by rating
- Recommend movies based on previous users' ratings using a least-squares algorithm
Files | Topic | Additional Notes |
---|---|---|
ratings_counter.py | - RDD - Text to CSV Parsing - Ordered Dictionary - countByValue | |
most_watched.py | - DataFrames - udf (user-defined functions) - udf with multiple inputs - Passing a Dictionary to a udf | |
recommendation.py | - Dataframes - Complex Aggregations - Class-Based PySpark Runs | |
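The `countByValue` + ordered-dictionary pattern from `ratings_counter.py` can be sketched like this, assuming the tab-separated `u.data` layout (`user_id`, `movie_id`, `rating`, `timestamp`) that ships with ml-100k:

```python
import collections

def parse_rating(line: str) -> int:
    # ml-100k/u.data rows are tab-separated: user_id, movie_id, rating, timestamp
    return int(line.split("\t")[2])

def main():
    from pyspark import SparkContext
    sc = SparkContext(appName="RatingsCounter")
    ratings = sc.textFile("ml-100k/u.data").map(parse_rating)
    counts = ratings.countByValue()  # action: returns a dict on the driver
    # Sort by rating value before printing, via an OrderedDict.
    for rating, count in collections.OrderedDict(sorted(counts.items())).items():
        print(rating, count)
    sc.stop()

if __name__ == "__main__":
    main()
```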
Objective:
- Introduction to PySpark Structured Streaming
- Log Ingestion for HTTP Requests
- Uses the streams/logs.txt file in data.zip as the sample logs
- Run the stream, then add a copy of logs.txt to the logs directory that is created when the script runs
Files | Topic | Additional Notes |
---|---|---|
stream_logs.py | - Structured Streaming |
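A minimal Structured Streaming sketch of this workflow could watch the logs directory and count HTTP status codes per micro-batch. The Common Log Format layout and the regex are assumptions about what `stream_logs.py` actually parses:

```python
import re

# Status code sits right after the quoted request, e.g. "GET / HTTP/1.1" 200
STATUS_RE = re.compile(r'"\S+ \S+ \S+" (\d{3})')

def extract_status(line: str) -> str:
    # Pull the HTTP status code out of an (assumed) Common Log Format line.
    match = STATUS_RE.search(line)
    return match.group(1) if match else ""

def main():
    from pyspark.sql import SparkSession, functions as F
    spark = SparkSession.builder.appName("StreamLogs").getOrCreate()
    # Watch the logs/ directory; drop a copy of logs.txt into it to trigger a batch.
    lines = spark.readStream.text("logs")
    counts = (lines.select(F.regexp_extract("value", r'"\S+ \S+ \S+" (\d{3})', 1)
                            .alias("status"))
                   .groupBy("status").count())
    query = counts.writeStream.outputMode("complete").format("console").start()
    query.awaitTermination()

if __name__ == "__main__":
    main()
```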
Objective:
- Shows the minimum temperature observed at each weather station by default
Files | Topic | Additional Notes |
---|---|---|
min_max_temp_df.py | - DataFrames - filter | |
min_max_temp_rdd.py | - RDDs - filter - reduceByKey | |
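The RDD version can be sketched with a filter on the observation type followed by `reduceByKey`. The column layout (`station_id,date,entry_type,value,...`), the `TMIN` entry type, and the input path are assumptions:

```python
def parse_observation(line: str):
    # Rows are assumed to be: station_id,date,entry_type,value,...
    fields = line.split(",")
    return fields[0], fields[2], float(fields[3])

def main():
    from pyspark import SparkContext
    sc = SparkContext(appName="MinTemperatures")
    observations = sc.textFile("temp-1800s/1800.csv").map(parse_observation)
    min_temps = (observations.filter(lambda row: row[1] == "TMIN")  # keep min readings
                             .map(lambda row: (row[0], row[2]))
                             .reduceByKey(min))                      # min per station
    for station, temp in min_temps.collect():
        print(station, temp)
    sc.stop()

if __name__ == "__main__":
    main()
```

Swapping the filter to `TMAX` and `reduceByKey(max)` would give the maximum-temperature variant.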