Taming Pyspark

Covers the initial learning of running PySpark jobs locally and in a cluster.

Environment Variables

```
BASE_DATA_PATH= # data directory relative to the root project directory (ex. BASE_DATA_PATH=taming_pyspark/data)
MOVIE_LENS=
FRIENDS=
TEMP_1800S=
WORD_COUNT=
CONSUMER_SPENDING=
HEROES=
```

  • Based on the folder names in the data.zip file
  • See config.py
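
For orientation, here is a hypothetical sketch of how these variables could be combined into dataset paths; the real config.py may use different names and logic.

```python
# Hypothetical sketch only -- the actual config.py in this repo may differ.
import os
from pathlib import Path

BASE_DATA_PATH = Path(os.environ["BASE_DATA_PATH"])          # e.g. taming_pyspark/data
WORD_COUNT_DIR = BASE_DATA_PATH / os.environ["WORD_COUNT"]   # folder name from data.zip
MOVIE_LENS_DIR = BASE_DATA_PATH / os.environ["MOVIE_LENS"]
```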

Navigating taming_pyspark directory

  • This section describes what each folder focuses on

Objective:

  • Location for functions used in multiple locations

Objective: Count unique words in the word_count/Book.txt file in data.zip

| Files | Topic | Additional Notes |
| --- | --- | --- |
| word_count_rdd.py | RDD, flatMap | Uses the word_count data in data.zip; word count includes special characters |
| word_count_df.py | DataFrames, RDD to DataFrames, Sorting | Uses the word_count data in data.zip; word count includes special characters |
| word_count_regex_rdd.py | RDD, Regular Expressions, Sorting, re module | Normalizes words and removes special characters |
| word_count_regex_df.py | DataFrames, RDD to DataFrames, Regular Expressions, Sorting, re module | Normalizes words and removes special characters |
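
A minimal sketch of the regex word count described above (in the spirit of word_count_regex_rdd.py); the file path is an assumption based on the example BASE_DATA_PATH, not the repo's exact code.

```python
import re
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("WordCountRegex").getOrCreate()
sc = spark.sparkContext

lines = sc.textFile("taming_pyspark/data/word_count/Book.txt")  # assumed path

# Normalize to lowercase and split on non-word characters to drop special characters.
words = lines.flatMap(lambda line: re.compile(r"\W+", re.UNICODE).split(line.lower()))
counts = words.filter(lambda w: w).map(lambda w: (w, 1)).reduceByKey(lambda a, b: a + b)

# Sort by count descending and print the most frequent words.
for word, count in counts.sortBy(lambda kv: kv[1], ascending=False).take(20):
    print(f"{word}: {count}")

spark.stop()
```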

Objective:

  • Use the customer-purchases/customer_orders.csv file in data.zip
  • Find the total amount spent per customer id
  • Sort customers by their total spend
| Files | Topic | Additional Notes |
| --- | --- | --- |
| spending_df.py | DataFrames, Aggregate Functions | |
| spending_rdd.py | RDDs, reduceByKey | |
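
An illustrative reduceByKey version of this objective (spending_rdd.py-style); the CSV layout (customer_id, item_id, amount) and the path are assumptions.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("CustomerSpending").getOrCreate()
sc = spark.sparkContext

def parse_line(line):
    # Assumed layout: customer_id, item_id, amount
    customer_id, _item_id, amount = line.split(",")
    return int(customer_id), float(amount)

orders = sc.textFile("taming_pyspark/data/customer-purchases/customer_orders.csv")  # assumed path
totals = orders.map(parse_line).reduceByKey(lambda a, b: a + b)

# Sort by total spend, as the objective above describes.
for customer_id, total in totals.sortBy(lambda kv: kv[1]).collect():
    print(f"{customer_id}: {total:.2f}")

spark.stop()
```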

Objective:

  • Use the friends-dataset/fakefriends.csv file in data.zip
  • Pseudo friends/peers dataset
  • Use PySpark to determine the number of friends/peers by age
| Files | Topic | Additional Notes |
| --- | --- | --- |
| fake_friends_df.py | DataFrames, Aggregate Functions | |
| fake_friends_rdd.py | RDDs, reduceByKey, Text to CSV Parsing | |
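
A hedged DataFrame sketch of the friends-by-age objective (fake_friends_df.py-style); the column layout (id, name, age, friends) is an assumption about fakefriends.csv.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("FakeFriends").getOrCreate()

people = (
    spark.read
    .option("inferSchema", "true")
    .csv("taming_pyspark/data/friends-dataset/fakefriends.csv")  # assumed path
    .toDF("id", "name", "age", "friends")                        # assumed columns
)

# Aggregate the friend counts by age and sort by age.
friends_by_age = (
    people.groupBy("age")
    .agg(F.sum("friends").alias("total_friends"))
    .orderBy("age")
)
friends_by_age.show()

spark.stop()
```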

Objective:

  • Uses the heroes data in data.zip
  • Determines the most popular hero based on how many distinct heroes they have encountered
| Files | Topic | Additional Notes |
| --- | --- | --- |
| most_popular_hero.py | udf (user-defined functions), DataFrames, Adding Columns (withColumns), RDD to DF | |
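
A rough sketch of the udf + withColumn pattern this script is built around; the file path and the space-separated layout (first ID on each line is the hero, the rest are co-appearances) are assumptions, not details taken from the repo.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.types import IntegerType

spark = SparkSession.builder.appName("MostPopularHero").getOrCreate()

lines = spark.read.text("taming_pyspark/data/heroes/graph.txt")  # assumed path

@F.udf(returnType=IntegerType())
def connection_count(value):
    # Every ID after the first on a line counts as one encounter for that hero.
    return len(value.split()) - 1

connections = (
    lines
    .withColumn("hero_id", F.split(F.col("value"), " ")[0].cast("int"))
    .withColumn("connections", connection_count(F.col("value")))
    .groupBy("hero_id")
    .agg(F.sum("connections").alias("connections"))
)

# The most popular hero is the one with the most connections.
connections.orderBy(F.desc("connections")).show(1)

spark.stop()
```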

Objective:

  • Uses the data found in the ml-100k/ folder in data.zip
  • Find the most watched movies
  • Count and group movies by rating
  • Recommend movies based on previous users' ratings using a least-squares algorithm
| Files | Topic | Additional Notes |
| --- | --- | --- |
| ratings_counter.py | RDD, Text to CSV Parsing, Ordered Dictionary, countByValue | |
| most_watched.py | DataFrames, udf (user-defined functions), udf with multiple inputs, Passing Dictionary to UDF | |
| recommendation.py | DataFrames, Complex Aggregations, Class-Based PySpark Runs | |
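
A minimal countByValue sketch in the spirit of ratings_counter.py; the u.data file name and the tab-separated (user, movie, rating, timestamp) layout assume the standard ml-100k distribution.

```python
from collections import OrderedDict
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("RatingsCounter").getOrCreate()
sc = spark.sparkContext

lines = sc.textFile("taming_pyspark/data/ml-100k/u.data")  # assumed path
ratings = lines.map(lambda line: line.split("\t")[2])      # rating is the third field

# countByValue returns a plain dict of rating -> occurrences on the driver.
counts = ratings.countByValue()
for rating, count in OrderedDict(sorted(counts.items())).items():
    print(f"{rating}: {count}")

spark.stop()
```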

Objective:

  • Introduction to PySpark Structured Streaming
  • Log Ingestion for HTTP Requests
  • Uses the streams/logs.txt file in data.zip to be the sample logs
  • Run the stream, then drop a copy of logs.txt into the logs directory that the script creates
| Files | Topic | Additional Notes |
| --- | --- | --- |
| stream_logs.py | Structured Streaming | |
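
A hedged Structured Streaming sketch of this workflow: watch a logs directory and count HTTP status codes as new log files arrive. The status-code regex and the directory name are assumptions, not the repo's exact code.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("StreamLogs").getOrCreate()

# Each new copy of logs.txt dropped into logs/ is picked up as a micro-batch.
access_lines = spark.readStream.text("logs")

status_counts = (
    access_lines
    .select(F.regexp_extract("value", r"\s(\d{3})\s", 1).alias("status"))
    .groupBy("status")
    .count()
)

query = (
    status_counts.writeStream
    .outputMode("complete")
    .format("console")
    .start()
)
query.awaitTermination()
```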

Objective:

  • Shows the minimum temperature observed at each weather station by default
| Files | Topic | Additional Notes |
| --- | --- | --- |
| min_max_temp_df.py | DataFrames, filter | |
| min_max_temp_rdd.py | RDDs, filter, reduceByKey | |
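
An illustrative RDD version of the minimum-temperature job (min_max_temp_rdd.py-style); the file path and the CSV layout (station_id, date, entry_type, temperature, ...) are assumptions.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("MinTemperatures").getOrCreate()
sc = spark.sparkContext

def parse_line(line):
    # Assumed layout: station_id, date, entry_type, temperature, ...
    fields = line.split(",")
    return fields[0], fields[2], float(fields[3])

lines = sc.textFile("taming_pyspark/data/temp-1800s/1800.csv")  # assumed path
readings = lines.map(parse_line)

# Keep only minimum-temperature entries, then reduce to the lowest value per station.
min_temps = (
    readings
    .filter(lambda r: r[1] == "TMIN")
    .map(lambda r: (r[0], r[2]))
    .reduceByKey(min)
)

for station, temp in min_temps.collect():
    print(f"{station}: {temp:.2f}")

spark.stop()
```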