Ydot19/taming-pyspark

Taming PySpark

Covers the initial learning of running PySpark jobs locally and in a cluster.

Environment Variables

BASE_DATA_PATH= # data directory relative to root project directory (ex. BASE_DATA_PATH=taming_pyspark/data)
MOVIE_LENS= 
FRIENDS=
TEMP_1800S=
WORD_COUNT=
CONSUMER_SPENDING=
HEROES=
  • Based on the folder names in the data.zip file
  • See config.py

Navigating the taming_pyspark directory

  • Each section below describes what one folder focuses on

Objective:

  • Location for functions shared across multiple modules

Objective: Count unique words in the word_count/Book.txt file in data.zip

| Files | Topics | Additional Notes |
| --- | --- | --- |
| word_count_rdd.py | RDD; flatMap | Uses the data in the word_count folder of data.zip; word counts include special characters |
| word_count_df.py | DataFrames; RDD to DataFrames; sorting | Uses the data in the word_count folder of data.zip; word counts include special characters |
| word_count_regex_rdd.py | RDD; regular expressions; sorting; re module | Normalizes words and removes special characters |
| word_count_regex_df.py | DataFrames; RDD to DataFrames; regular expressions; sorting; re module | Normalizes words and removes special characters |

Objective:

  • Use the customer-purchases/customer_orders.csv file in data.zip
  • Find the total amount spent per customer ID
  • Sort customers by total spend

| Files | Topics | Additional Notes |
| --- | --- | --- |
| spending_df.py | DataFrames; aggregate functions | |
| spending_rdd.py | RDDs; reduceByKey | |

Objective:

  • Use the friends-dataset/fakefriends.csv file in data.zip
  • A pseudo friends/peers dataset
  • Use PySpark to determine the number of friends/peers by age

| Files | Topics | Additional Notes |
| --- | --- | --- |
| fake_friends_df.py | DataFrames; aggregate functions | |
| fake_friends_rdd.py | RDDs; reduceByKey; text-to-CSV parsing | |

Objective:

  • Uses the heroes data in data.zip (see the HEROES environment variable)
  • Determines the most popular hero based on how many unique heroes they have had an encounter with

| Files | Topics | Additional Notes |
| --- | --- | --- |
| most_popular_hero.py | udf (user-defined functions); DataFrames; adding columns (withColumn); RDD to DataFrame | |

Objective:

  • Uses the data found in the ml-100k/ folder in data.zip
  • Find the most-watched movies
  • Count and group movies by rating
  • Recommend movies based on previous users' ratings using a least-squares algorithm

| Files | Topics | Additional Notes |
| --- | --- | --- |
| ratings_counter.py | RDD; text-to-CSV parsing; ordered dictionary; countByValue | |
| most_watched.py | DataFrames; udf (user-defined functions); udf with multiple inputs; passing a dictionary to a UDF | |
| recommendation.py | DataFrames; complex aggregations; class-based PySpark runs | |

Objective:

  • Introduction to PySpark Structured Streaming
  • Log ingestion for HTTP requests
  • Uses the streams/logs.txt file in data.zip as the sample logs
  • Run the stream, then drop a copy of logs.txt into the logs directory that the script creates

| Files | Topics | Additional Notes |
| --- | --- | --- |
| stream_logs.py | Structured Streaming | |

Objective:

  • Shows the minimum temperature observed at each weather station by default

| Files | Topics | Additional Notes |
| --- | --- | --- |
| min_max_temp_df.py | DataFrames; filter | |
| min_max_temp_rdd.py | RDDs; filter; reduceByKey | |

About

Learning PySpark