@author: Danial Shabbir
Medium Article: https://medium.com/@danial.shabbir77/a-complete-data-engineering-workflow-data-modelling-and-advanced-analytics-using-python-duckdb-af1f63d1f4b8
Reach out to me via linkedin: https://www.linkedin.com/in/danial-shabbir/
Reach out to me via gmail: danial.shabbir77@gmail.com
The New York City Taxi and Limousine Commission (TLC) collect extensive data concerning taxi trips within the city. This data provides a rich resource for analysis and modeling, aiding in various business and urban planning decisions. In this article, we delve into leveraging Jupyter Notebook to dissect and organize this data into meaningful Dimension and Fact tables for analytics purposes.
Dataset Download Link: https://www.nyc.gov/site/tlc/about/tlc-trip-record-data.page
Note: For the purpose of this article we will be using dataset from year 2023 (Yellow Taxi)
The provided dataset contains detailed records of taxi trips from the year 2023 trips. It captures pivotal information such as pick-up and drop-off dates/times, trip distances, fare details, passenger counts, and more. It's important to note that this data was collected and supplied to the TLC by technology providers under authorized programs. Therefore, while being valuable, the accuracy and completeness of the dataset might not be guaranteed.
Below image contains all the data points we will getting and their descriptions from the dataset.
The analysis starts with loading the data into a Pandas DataFrame within the Jupyter Notebook environment. This initial step involves crucial data preprocessing tasks such as:
- Checking data types and ensuring appropriate formats for date-time columns.
- Removing duplicate entries to maintain data integrity.
- Generating a unique Trip ID for each trip record for identification purposes. (also known as surrogate key)
Once, the data is fully cleaned. We creates dimension tables for different entities
- datetime_dim: This table captures datetime-related attributes such as pickup and drop-off hours, days, months, years, and weekdays.
- passenger_count_dim: Contains passenger count information.
- trip_distance_dim: Holds details about trip distances.
- rate_code_dim: Provides insights into different rate codes and their corresponding names.
- location_dim: A look-up table for location which maps locationID for pickup/dropff into location_name
- pickup_location_dim: Stores data about pickup locations.
- dropoff_location_dim: Holds data regarding drop-off locations.
- payment_type_dim: Contains information about various payment types.
The Fact table, the heart of a dimensional model, is constructed by merging various Dimension tables. It encapsulates essential information such as trip details, fare amounts, payment details, dropoff_location_dim, pickup_location_dim, passenger details and more.
We will write some analytical queries using duckdb to query data from our data model. This will present and help us identify different usecases.
import pandas as pd
import duckdb
#load data from parquet file
df =pd.read_parquet('./nyc_taxi_trip_data/2023/2023.parquet')
display(df.head())
print(f'Dataset Contains: {df.shape[0]} rows, {df.shape[1]} columns')
VendorID | tpep_pickup_datetime | tpep_dropoff_datetime | passenger_count | trip_distance | RatecodeID | store_and_fwd_flag | PULocationID | DOLocationID | payment_type | fare_amount | extra | mta_tax | tip_amount | tolls_amount | improvement_surcharge | total_amount | congestion_surcharge | airport_fee | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2 | 2023-01-01 00:32:10 | 2023-01-01 00:40:36 | 1.0 | 0.97 | 1.0 | N | 161 | 141 | 2 | 9.3 | 1.00 | 0.5 | 0.00 | 0.0 | 1.0 | 14.30 | 2.5 | 0.00 |
1 | 2 | 2023-01-01 00:55:08 | 2023-01-01 01:01:27 | 1.0 | 1.10 | 1.0 | N | 43 | 237 | 1 | 7.9 | 1.00 | 0.5 | 4.00 | 0.0 | 1.0 | 16.90 | 2.5 | 0.00 |
2 | 2 | 2023-01-01 00:25:04 | 2023-01-01 00:37:49 | 1.0 | 2.51 | 1.0 | N | 48 | 238 | 1 | 14.9 | 1.00 | 0.5 | 15.00 | 0.0 | 1.0 | 34.90 | 2.5 | 0.00 |
3 | 1 | 2023-01-01 00:03:48 | 2023-01-01 00:13:25 | 0.0 | 1.90 | 1.0 | N | 138 | 7 | 1 | 12.1 | 7.25 | 0.5 | 0.00 | 0.0 | 1.0 | 20.85 | 0.0 | 1.25 |
4 | 2 | 2023-01-01 00:10:29 | 2023-01-01 00:21:19 | 1.0 | 1.43 | 1.0 | N | 107 | 79 | 1 | 11.4 | 1.00 | 0.5 | 3.28 | 0.0 | 1.0 | 19.68 | 2.5 | 0.00 |
Dataset Contains: 3066766 rows, 19 columns
CSV files store data in rows with values separated by delimiters, while JSON uses key-value pairs for structured data. Parquet, however, arranges data by columns, allowing more efficient storage and retrieval of column-specific information. It employs advanced compression techniques and encoding methods, like Run-Length and Dictionary Encoding, resulting in smaller file sizes and faster query performance. This optimized structure makes Parquet well-suited for analytical workloads, big data processing, and data warehousing, as it significantly reduces storage needs and enhances processing speeds compared to CSV and JSON formats.
personal note: I will always prefer working with parquet format for my analytical projects, even if the source file csv/json, i will always try to convert it into parquet format.
We will perform a basic preprocessing step
- drop duplicates
- add a surrogate key
Each record in dataset represents a trip. we can apply complex hashing (md5, sha256) of concatenating multiple columns to represent a unique record for each trip or we can just assign the monotically increasing id. Both can work, i will be applying monotonically increasing id. Pandas provide this out of the box when we load in the dataframe. it is called index
.
just for the reference, if some of the folks wants to take hashing route. one can apply the logic as below. please note it's gonna be a little slow
import hashlib
def genMd5Hash(*args) -> str:
string = ''.join([str(arg) for arg in args])
""" Function to generate md5 when given a string """
return hashlib.md5(string.encode())
df['trip_id'] = df.apply(genMd5Hash,axis=1)
# Just adding for sequence. the trip_id columns would be located at the end, so i will filtered out all the cols except `trip_id`
cols = list(filter(lambda x: x != 'trip_id', df.columns))
# combining array in this way ensures `trip_id` is always on the first columns, just for readability purposes
cols = ['trip_id'] + cols
df[cols].head()
df = df.drop_duplicates()
# It will insert `trip_id` to the first location in columns
df.insert(0,'trip_id',df.index)
df.head()
trip_id | VendorID | tpep_pickup_datetime | tpep_dropoff_datetime | passenger_count | trip_distance | RatecodeID | store_and_fwd_flag | PULocationID | DOLocationID | payment_type | fare_amount | extra | mta_tax | tip_amount | tolls_amount | improvement_surcharge | total_amount | congestion_surcharge | airport_fee | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0 | 2 | 2023-01-01 00:32:10 | 2023-01-01 00:40:36 | 1.0 | 0.97 | 1.0 | N | 161 | 141 | 2 | 9.3 | 1.00 | 0.5 | 0.00 | 0.0 | 1.0 | 14.30 | 2.5 | 0.00 |
1 | 1 | 2 | 2023-01-01 00:55:08 | 2023-01-01 01:01:27 | 1.0 | 1.10 | 1.0 | N | 43 | 237 | 1 | 7.9 | 1.00 | 0.5 | 4.00 | 0.0 | 1.0 | 16.90 | 2.5 | 0.00 |
2 | 2 | 2 | 2023-01-01 00:25:04 | 2023-01-01 00:37:49 | 1.0 | 2.51 | 1.0 | N | 48 | 238 | 1 | 14.9 | 1.00 | 0.5 | 15.00 | 0.0 | 1.0 | 34.90 | 2.5 | 0.00 |
3 | 3 | 1 | 2023-01-01 00:03:48 | 2023-01-01 00:13:25 | 0.0 | 1.90 | 1.0 | N | 138 | 7 | 1 | 12.1 | 7.25 | 0.5 | 0.00 | 0.0 | 1.0 | 20.85 | 0.0 | 1.25 |
4 | 4 | 2 | 2023-01-01 00:10:29 | 2023-01-01 00:21:19 | 1.0 | 1.43 | 1.0 | N | 107 | 79 | 1 | 11.4 | 1.00 | 0.5 | 3.28 | 0.0 | 1.0 | 19.68 | 2.5 | 0.00 |
Now preprocessing of the data is completed. The next step would be to create data model.
Some definations of data model
and dimnensions
in a layman term
- Data Model
- A data model is like a blueprint or a plan that defines how data is organized, structured, and related within a database or system.
Why do we need it?
It helps in understanding the different types of data, their relationships, and how they can be accessed or retrieved.
AnalogyImagine you're building a house. Before construction begins, you have architectural plans that detail where rooms will be located, how they connect, and what materials will be used. Similarly, a data model outlines how data will be stored, what types of data are needed, and how different pieces of data relate to each other.
- Dimension
- A dimension in a model refers to a way of organizing or categorizing data to help understand and analyze information more effectively. It's like grouping or labeling data based on specific attributes or characteristics.
Why do we need it?
Dimensions provide context and structure to data, making it easier to study and gain insights. They help in organizing information in a meaningful way, allowing users to ask questions and perform analyses that can lead to better decision-making.
AnalogyIn a sales model, time can be a dimension. You might break down time into years, months, and days to analyze sales performance over different time periods. Similarly, in a customer model, age or location can be dimensions to categorize customers based on their age groups or geographic regions.
Now lets build the datetime dimension as depicted in the picture. As you can see, i have broken down the time to minute, hour, day, weekday, month, year for each pickup and dropoff. This is helpful for analysing the information. An analyst is now able to gain visibility upto minute level analytics of trips. This is also called Granularity
of data.
Granularity in simple terms means The level of detail at which the attributes and characteristics (columns) of data are defined
so the more we dive deeper the granularity of data becomes (individual level info) and the more we aggregate data the granularity becomes higher (department level info). High Granularity can be related to as seeing a whole picture (bigger view) and low level granularity means seeing individual elements of the whole picture. e.g seeing an individual salary would be lower granular and seeing the department's average salary would be high granular level of detail.
datetime_dim = df[['tpep_pickup_datetime','tpep_dropoff_datetime']].reset_index(drop=True)
# PK
datetime_dim['datetime_id'] = datetime_dim.index
# pickup datetime
datetime_dim['pick_hour'] = datetime_dim['tpep_pickup_datetime'].dt.hour
datetime_dim['pick_day'] = datetime_dim['tpep_pickup_datetime'].dt.day_name()
datetime_dim['pick_month'] = datetime_dim['tpep_pickup_datetime'].dt.month
datetime_dim['pick_year'] = datetime_dim['tpep_pickup_datetime'].dt.year
datetime_dim['pick_weekday'] = datetime_dim['tpep_pickup_datetime'].dt.weekday
# drop off datetime
datetime_dim['drop_hour'] = datetime_dim['tpep_dropoff_datetime'].dt.hour
datetime_dim['drop_day'] = datetime_dim['tpep_dropoff_datetime'].dt.day_name()
datetime_dim['drop_month'] = datetime_dim['tpep_dropoff_datetime'].dt.month
datetime_dim['drop_year'] = datetime_dim['tpep_dropoff_datetime'].dt.year
datetime_dim['drop_weekday'] = datetime_dim['tpep_dropoff_datetime'].dt.weekday
# giving proper ordering to columns for readability purpose
cols_seq = ['datetime_id', 'tpep_pickup_datetime', 'pick_hour', 'pick_day', 'pick_month', 'pick_year', 'pick_weekday',
'tpep_dropoff_datetime', 'drop_hour', 'drop_day', 'drop_month', 'drop_year', 'drop_weekday']
datetime_dim = datetime_dim[cols_seq]
display(datetime_dim)
datetime_id | tpep_pickup_datetime | pick_hour | pick_day | pick_month | pick_year | pick_weekday | tpep_dropoff_datetime | drop_hour | drop_day | drop_month | drop_year | drop_weekday | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0 | 2023-01-01 00:32:10 | 0 | Sunday | 1 | 2023 | 6 | 2023-01-01 00:40:36 | 0 | Sunday | 1 | 2023 | 6 |
1 | 1 | 2023-01-01 00:55:08 | 0 | Sunday | 1 | 2023 | 6 | 2023-01-01 01:01:27 | 1 | Sunday | 1 | 2023 | 6 |
2 | 2 | 2023-01-01 00:25:04 | 0 | Sunday | 1 | 2023 | 6 | 2023-01-01 00:37:49 | 0 | Sunday | 1 | 2023 | 6 |
3 | 3 | 2023-01-01 00:03:48 | 0 | Sunday | 1 | 2023 | 6 | 2023-01-01 00:13:25 | 0 | Sunday | 1 | 2023 | 6 |
4 | 4 | 2023-01-01 00:10:29 | 0 | Sunday | 1 | 2023 | 6 | 2023-01-01 00:21:19 | 0 | Sunday | 1 | 2023 | 6 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
3066761 | 3066761 | 2023-01-31 23:58:34 | 23 | Tuesday | 1 | 2023 | 1 | 2023-02-01 00:12:33 | 0 | Wednesday | 2 | 2023 | 2 |
3066762 | 3066762 | 2023-01-31 23:31:09 | 23 | Tuesday | 1 | 2023 | 1 | 2023-01-31 23:50:36 | 23 | Tuesday | 1 | 2023 | 1 |
3066763 | 3066763 | 2023-01-31 23:01:05 | 23 | Tuesday | 1 | 2023 | 1 | 2023-01-31 23:25:36 | 23 | Tuesday | 1 | 2023 | 1 |
3066764 | 3066764 | 2023-01-31 23:40:00 | 23 | Tuesday | 1 | 2023 | 1 | 2023-01-31 23:53:00 | 23 | Tuesday | 1 | 2023 | 1 |
3066765 | 3066765 | 2023-01-31 23:07:32 | 23 | Tuesday | 1 | 2023 | 1 | 2023-01-31 23:21:56 | 23 | Tuesday | 1 | 2023 | 1 |
3066766 rows × 13 columns
passenger_count_dim = df[['passenger_count']].reset_index(drop=True)
passenger_count_dim['passenger_count_id'] = passenger_count_dim.index
# giving proper ordering to columns for readability purpose
cols_seq = ['passenger_count_id','passenger_count']
passenger_count_dim = passenger_count_dim[cols_seq]
display(passenger_count_dim)
passenger_count_id | passenger_count | |
---|---|---|
0 | 0 | 1.0 |
1 | 1 | 1.0 |
2 | 2 | 1.0 |
3 | 3 | 0.0 |
4 | 4 | 1.0 |
... | ... | ... |
3066761 | 3066761 | NaN |
3066762 | 3066762 | NaN |
3066763 | 3066763 | NaN |
3066764 | 3066764 | NaN |
3066765 | 3066765 | NaN |
3066766 rows × 2 columns
trip_distance_dim = df[['trip_distance']].reset_index(drop=True)
trip_distance_dim['trip_distance_id'] = trip_distance_dim.index
# giving proper ordering to columns for readability purpose
cols_seq = ['trip_distance_id', 'trip_distance']
trip_distance_dim = trip_distance_dim
display(trip_distance_dim)
trip_distance | trip_distance_id | |
---|---|---|
0 | 0.97 | 0 |
1 | 1.10 | 1 |
2 | 2.51 | 2 |
3 | 1.90 | 3 |
4 | 1.43 | 4 |
... | ... | ... |
3066761 | 3.05 | 3066761 |
3066762 | 5.80 | 3066762 |
3066763 | 4.67 | 3066763 |
3066764 | 3.15 | 3066764 |
3066765 | 2.85 | 3066765 |
3066766 rows × 2 columns
if we look in the data dictionary the information about rate code id are provided e.g 1 = 'Standard rate' etc. We can add this meta info in our dimension table to make it more helpful for our analysts.
# Mapping of rate code
rate_code_type = {
1:"Standard rate",
2:"JFK",
3:"Newark",
4:"Nassau or Westchester",
5:"Negotiated fare",
6:"Group ride"
}
rate_code_dim = df[['RatecodeID']].drop_duplicates().reset_index(drop=True)
rate_code_dim['rate_code_id'] = rate_code_dim.index
# map `id` to the name of rate type
rate_code_dim['rate_code_name'] = rate_code_dim['RatecodeID'].map(rate_code_type)
# giving proper ordering to columns for readability purpose
cols_seq = ['rate_code_id','RatecodeID','rate_code_name']
rate_code_dim = rate_code_dim[cols_seq]
display(rate_code_dim)
rate_code_id | RatecodeID | rate_code_name | |
---|---|---|---|
0 | 0 | 1.0 | Standard rate |
1 | 1 | 2.0 | JFK |
2 | 2 | 4.0 | Nassau or Westchester |
3 | 3 | 99.0 | NaN |
4 | 4 | 5.0 | Negotiated fare |
5 | 5 | 3.0 | Newark |
6 | 6 | 6.0 | Group ride |
7 | 7 | NaN | NaN |
We have PULocationID
and DOLocationID
column which doesn't mean anything to us unless we have a location name. Therefore, we will be using a lookup table which has mappings for each of location_id and their zone/location. If we refer to the above diagram, the lookup table which i am referring to is as follows
# loading lookup table
lookup_location = pd.read_csv('./nyc_taxi_trip_data/taxi+_zone_lookup.csv')[['LocationID','Zone']].drop_duplicates().reset_index(drop=True)
location_dim = lookup_location.rename(columns={'LocationID':'location_id','Zone':'location'})
location_dim.head()
location_id | location | |
---|---|---|
0 | 1 | Newark Airport |
1 | 2 | Jamaica Bay |
2 | 3 | Allerton/Pelham Gardens |
3 | 4 | Alphabet City |
4 | 5 | Arden Heights |
# preparing dimension table
pickup_location_dim = df[['PULocationID']].reset_index(drop=True)
pickup_location_dim['pickup_location_id'] = pickup_location_dim.index
cols_select = ['pickup_location_id','location']
pickup_location_dim = pickup_location_dim.merge(location_dim, left_on='PULocationID', right_on='location_id')[cols_select]
# giving proper ordering to columns for readability purpose
cols_seq = ['pickup_location_id','location']
pickup_location_dim = pickup_location_dim[cols_seq]
#rename `location` to `pickup_location` for good readability
pickup_location_dim = pickup_location_dim.rename(columns={'location':'pickup_location'})
display(pickup_location_dim)
pickup_location_id | pickup_location | |
---|---|---|
0 | 0 | Midtown Center |
1 | 5 | Midtown Center |
2 | 23 | Midtown Center |
3 | 24 | Midtown Center |
4 | 33 | Midtown Center |
... | ... | ... |
3066761 | 2865086 | City Island |
3066762 | 2687592 | Jamaica Bay |
3066763 | 2687593 | Jamaica Bay |
3066764 | 3033499 | Country Club |
3066765 | 3035801 | Country Club |
3066766 rows × 2 columns
# preparing dimension table
dropoff_location_dim = df[['DOLocationID']].reset_index(drop=True)
dropoff_location_dim['dropoff_location_id'] = dropoff_location_dim.index
cols_select = ['dropoff_location_id','location']
dropoff_location_dim = dropoff_location_dim.merge(location_dim, left_on='DOLocationID', right_on='location_id')[cols_select]
# giving proper ordering to columns for readability purpose
cols_seq = ['dropoff_location_id','location']
dropoff_location_dim = dropoff_location_dim[cols_seq]
#rename `location` to `pickup_location` for good readability
dropoff_location_dim = dropoff_location_dim.rename(columns={'location':'dropff_location'})
display(dropoff_location_dim)
dropoff_location_id | dropff_location | |
---|---|---|
0 | 0 | Lenox Hill West |
1 | 45 | Lenox Hill West |
2 | 120 | Lenox Hill West |
3 | 161 | Lenox Hill West |
4 | 197 | Lenox Hill West |
... | ... | ... |
3066761 | 1921746 | Rossville/Woodrow |
3066762 | 1947689 | Rossville/Woodrow |
3066763 | 2146389 | Rossville/Woodrow |
3066764 | 2780992 | Rossville/Woodrow |
3066765 | 2934582 | Governor's Island/Ellis Island/Liberty Island |
3066766 rows × 2 columns
Again, if we look in the data dictionary the information about payment_type is provided e.g 1 = 'Credit Card' etc. This would be helpful for analysts when making complex join on data for analytical purposes
payment_type_name = {
1:"Credit card",
2:"Cash",
3:"No charge",
4:"Dispute",
5:"Unknown",
6:"Voided trip"
}
payment_type_dim = df[['payment_type']].drop_duplicates().reset_index(drop=True)
payment_type_dim['payment_type_id'] = payment_type_dim.index
payment_type_dim['payment_type_name'] = payment_type_dim['payment_type'].map(payment_type_name)
# giving proper ordering to columns for readability purpose
cols_seq = ['payment_type_id','payment_type','payment_type_name']
payment_type_dim = payment_type_dim[cols_seq]
display(payment_type_dim)
payment_type_id | payment_type | payment_type_name | |
---|---|---|---|
0 | 0 | 2 | Cash |
1 | 1 | 1 | Credit card |
2 | 2 | 4 | Dispute |
3 | 3 | 3 | No charge |
4 | 4 | 0 | NaN |
The most important table, where all of the data gets connected. This table doesn't hold descriptive information about the items themselves or the customers—instead, it links to other tables (like dimensions) that contain this additional information. So, while the fact table holds the numeric facts or measurements, it references other tables to provide the context or details about those facts. think of a fact table as the central hub of a data warehouse or database. It's where all the key metrics and measurements are stored, organized, and connected.
fact_table = df.merge(passenger_count_dim, left_on='trip_id', right_on='passenger_count_id', how='left') \
.merge(trip_distance_dim, left_on='trip_id', right_on='trip_distance_id', how='left') \
.merge(rate_code_dim, left_on='RatecodeID',right_on='RatecodeID',how='left') \
.merge(pickup_location_dim, left_on='trip_id', right_on='pickup_location_id',how='left') \
.merge(dropoff_location_dim, left_on='trip_id', right_on='dropoff_location_id',how='left') \
.merge(datetime_dim, left_on='trip_id', right_on='datetime_id',how='left')\
.merge(payment_type_dim, left_on='payment_type', right_on='payment_type',how='left')
cols_required = ['trip_id','VendorID', 'datetime_id', 'passenger_count_id',
'trip_distance_id', 'rate_code_id', 'store_and_fwd_flag', 'pickup_location_id', 'dropoff_location_id',
'payment_type_id', 'fare_amount', 'extra', 'mta_tax', 'tip_amount', 'tolls_amount',
'improvement_surcharge', 'total_amount']
fact_table = fact_table[cols_required]
display(fact_table.head())
trip_id | VendorID | datetime_id | passenger_count_id | trip_distance_id | rate_code_id | store_and_fwd_flag | pickup_location_id | dropoff_location_id | payment_type_id | fare_amount | extra | mta_tax | tip_amount | tolls_amount | improvement_surcharge | total_amount | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0 | 2 | 0 | 0 | 0 | 0 | N | 0 | 0 | 0 | 9.3 | 1.00 | 0.5 | 0.00 | 0.0 | 1.0 | 14.30 |
1 | 1 | 2 | 1 | 1 | 1 | 0 | N | 1 | 1 | 1 | 7.9 | 1.00 | 0.5 | 4.00 | 0.0 | 1.0 | 16.90 |
2 | 2 | 2 | 2 | 2 | 2 | 0 | N | 2 | 2 | 1 | 14.9 | 1.00 | 0.5 | 15.00 | 0.0 | 1.0 | 34.90 |
3 | 3 | 1 | 3 | 3 | 3 | 0 | N | 3 | 3 | 1 | 12.1 | 7.25 | 0.5 | 0.00 | 0.0 | 1.0 | 20.85 |
4 | 4 | 2 | 4 | 4 | 4 | 0 | N | 4 | 4 | 1 | 11.4 | 1.00 | 0.5 | 3.28 | 0.0 | 1.0 | 19.68 |
as we can see the table contains foreign keys of the dimension table and it represent a consolidated table which can link different tables to get different information of the trips.
What good a data model is when we can't apply analytics to it. I have prepared some questions in advance that we can answer. I will be using DuckDB for this exercise but in future articles (i hope :v ) i will be utilising clickhouse.
- What is the average trip distance?
- What is the most common payment type?
- Average trip distance for trips with a tip_amount greater than $10, excluding trips with a payment type of other than 'Cash'.
- How many trips were made by each VendorID?
- Which vendor has the highest average fare_amount?
- What is the average number of passengers for trips that occurred on Sundays?
- What are the peak hours for taxi trips based on the pick-up datetime?
- identify zones with the highest average total_amount and the lowest average trip_distance.
- Which pickup location has the highest tip_amount amount?
- How many trips were made for each rate code type?
- What is the average total_amount for each rate code type?
The pandas variables which holds information about dimensions
- datetime_dim: This table captures datetime-related attributes such as pickup and drop-off hours, days, months, years, and weekdays.
- passenger_count_dim: Contains passenger count information.
- trip_distance_dim: Holds details about trip distances.
- rate_code_dim: Provides insights into different rate codes and their corresponding names.
- location_dim: A look-up table for location which maps locationID for pickup/dropff into location_name
- pickup_location_dim: Stores data about pickup locations.
- dropoff_location_dim: Holds data regarding drop-off locations.
- payment_type_dim: Contains information about various payment types.
- fact_table: Info about all relation to dimensions
import duckdb
def run(q):
out = duckdb.query(q)
display(out)
return out
# What is the average trip distance?
q = """
SELECT round(avg(trip_distance),2) as average_distance
FROM trip_distance_dim
"""
print('What is the average trip distance? ')
run(q)
# What is the most common payment type?
TOP_K = 3
q = f"""
WITH PAYMENT_NAME_AGG_COUNTS AS (
SELECT payment_type, payment_type_name , count(*) as count_
FROM payment_type_dim
JOIN fact_table ON fact_table.payment_type_id = payment_type_dim.payment_type_id
GROUP BY payment_type_dim.payment_type, payment_type_dim.payment_type_name
), PAYMENT_RANK AS (
SELECT *, ROW_NUMBER() OVER(ORDER BY count_ DESC) as rnk
FROM PAYMENT_NAME_AGG_COUNTS
) SELECT payment_type_name, count_ FROM PAYMENT_RANK WHERE rnk <= {TOP_K}
"""
# """
print('What is the most common payment type? ')
run(q)
# - Average trip distance for trips with a tip_amount greater than $10, excluding trips with a payment type of other than 'Cash'.
q = """
WITH TRIP_DISTANCE_BASE AS (
SELECT round(avg(trip_distance_dim.trip_distance)) as 'average_trip_distance(miles)'
FROM fact_table
JOIN trip_distance_dim ON fact_table.trip_id = trip_distance_dim.trip_distance_id
JOIN payment_type_dim ON fact_table.payment_type_id = payment_type_dim.payment_type_id
WHERE fact_table.tip_amount > 10 and lower(payment_type_dim.payment_type_name) = 'cash'
) select * from TRIP_DISTANCE_BASE
"""
print('What is the Average trip distance for trips with a tip_amount greater than $10, excluding trips with a payment type of "Cash" ')
run(q)
What is the average trip distance?
┌──────────────────┐
│ average_distance │
│ double │
├──────────────────┤
│ 3.85 │
└──────────────────┘
What is the most common payment type?
┌───────────────────┬─────────┐
│ payment_type_name │ count_ │
│ varchar │ int64 │
├───────────────────┼─────────┤
│ Credit card │ 2411462 │
│ Cash │ 532241 │
│ NULL │ 71743 │
└───────────────────┴─────────┘
What is the Average trip distance for trips with a tip_amount greater than $10, excluding trips with a payment type of "Cash"
┌──────────────────────────────┐
│ average_trip_distance(miles) │
│ double │
├──────────────────────────────┤
│ 15.0 │
└──────────────────────────────┘
┌──────────────────────────────┐
│ average_trip_distance(miles) │
│ double │
├──────────────────────────────┤
│ 15.0 │
└──────────────────────────────┘
- How many trips were made by each VendorID?
- Which vendor has the highest average fare_amount?
# How many trips were made by each VendorID?
q = """
SELECT VendorID, count(*) as total_trips
FROM fact_table
GROUP BY VendorID
ORDER BY count(*) DESC
"""
print("How many trips were made by each VendorID?")
run(q)
# Which vendor has the highest average fare_amount?
q = """
WITH VENDOR_FARE_AMOUNT AS (
SELECT VendorID, round(avg(fare_amount),2) as avg_fare_amount
FROM fact_table
GROUP BY VendorID
), VENDOR_MAX_AMOUNT_RNK AS (
SELECT VendorID, avg_fare_amount, row_number() OVER(ORDER BY avg_fare_amount DESC) as RNK
FROM VENDOR_FARE_AMOUNT
) SELECT VendorID, avg_fare_amount FROM VENDOR_MAX_AMOUNT_RNK WHERE RNK = 1
"""
print("Which vendor has the highest average fare_amount?")
run(q)
How many trips were made by each VendorID?
┌──────────┬─────────────┐
│ VendorID │ total_trips │
│ int64 │ int64 │
├──────────┼─────────────┤
│ 2 │ 2239399 │
│ 1 │ 827367 │
└──────────┴─────────────┘
Which vendor has the highest average fare_amount?
┌──────────┬─────────────────┐
│ VendorID │ avg_fare_amount │
│ int64 │ double │
├──────────┼─────────────────┤
│ 2 │ 18.71 │
└──────────┴─────────────────┘
┌──────────┬─────────────────┐
│ VendorID │ avg_fare_amount │
│ int64 │ double │
├──────────┼─────────────────┤
│ 2 │ 18.71 │
└──────────┴─────────────────┘
- What is the average number of passengers for trips that occurred on Sundays?
- What are the peak hours for taxi trips based on the pick-up datetime?
#- What is the average number of passengers for trips that occurred on Sundays?
q = """
SELECT round(avg(passenger_count_dim.passenger_count),2) as average_passenger_count
FROM fact_table
JOIN passenger_count_dim ON passenger_count_dim.passenger_count_id = fact_table.passenger_count_id
JOIN datetime_dim ON datetime_dim.datetime_id = fact_table.datetime_id
WHERE lower(datetime_dim.pick_day) = 'sunday' or datetime_dim.drop_day = 'sunday'
"""
print("What is the average number of passengers for trips that occurred on Sundays?")
run(q)
#- What are the peak hours for taxi trips based on the pick-up datetime?
print("What are the peak hours for taxi trips based on the pick-up datetime?")
q = """
WITH TRIPS_BY_HOUR_COUNT AS (
SELECT datetime_dim.pick_hour, count(*) as total_trips
FROM fact_table
JOIN datetime_dim ON datetime_dim.datetime_id = fact_table.datetime_id
GROUP BY datetime_dim.pick_hour
), TRIPS_BY_HOUR_RANK AS (
SELECT pick_hour, total_trips, DENSE_RANK() OVER(ORDER BY total_trips) as rnk
FROM TRIPS_BY_HOUR_COUNT
) SELECT pick_hour,total_trips FROM TRIPS_BY_HOUR_RANK WHERE rnk = 1
"""
run(q)
What is the average number of passengers for trips that occurred on Sundays?
┌─────────────────────────┐
│ average_passenger_count │
│ double │
├─────────────────────────┤
│ 1.45 │
└─────────────────────────┘
What are the peak hours for taxi trips based on the pick-up datetime?
┌───────────┬─────────────┐
│ pick_hour │ total_trips │
│ int32 │ int64 │
├───────────┼─────────────┤
│ 4 │ 17835 │
└───────────┴─────────────┘
┌───────────┬─────────────┐
│ pick_hour │ total_trips │
│ int32 │ int64 │
├───────────┼─────────────┤
│ 4 │ 17835 │
└───────────┴─────────────┘
- identify zones with the highest average total_amount and the lowest average trip_distance.
- Which pickup location has the highest tip_amount amount?
# - identify zones with the highest average total_amount and the lowest average trip_distance.
# - Which pickup location has the highest tip_amount amount?
q = """
WITH BASE AS (
SELECT
pickup_location_dim.pickup_location as zone,
round(avg(total_amount),2) as avg_total_amount,
round(avg(trip_distance_dim.trip_distance_id),2) as avg_trip_distance
FROM fact_table
join trip_distance_dim ON fact_table.trip_id = trip_distance_dim.trip_distance_id
JOIN pickup_location_dim ON pickup_location_dim.pickup_location_id = fact_table.pickup_location_id
WHERE pickup_location_dim.pickup_location IS NOT NULL
GROUP BY pickup_location_dim.pickup_location
) SELECT zone, avg_total_amount as avg_highest_trip_amount, avg_trip_distance as avg_lowest_trip_distance
FROM BASE ORDER BY avg_total_amount DESC, avg_trip_distance ASC LIMIT 1
"""
print("identify zones with the highest average total_amount and the lowest average trip_distance.")
run(q)
q = """
WITH BASE AS(
SELECT
pickup_location_dim.pickup_location as zone,
fact_table.tip_amount,
ROW_NUMBER() OVER(ORDER BY fact_table.tip_amount DESC) as rnk
FROM fact_table
JOIN pickup_location_dim ON pickup_location_dim.pickup_location_id = fact_table.pickup_location_id
) SELECT zone, tip_amount as highest_tip_amount FROM BASE WHERE rnk = 1
"""
print("Which pickup location has the highest tip_amount amount?")
run(q)
identify zones with the highest average total_amount and the lowest average trip_distance.
┌────────────────┬─────────────────────────┬──────────────────────────┐
│ zone │ avg_highest_trip_amount │ avg_lowest_trip_distance │
│ varchar │ double │ double │
├────────────────┼─────────────────────────┼──────────────────────────┤
│ Newark Airport │ 104.38 │ 1212392.68 │
└────────────────┴─────────────────────────┴──────────────────────────┘
Which pickup location has the highest tip_amount amount?
FloatProgress(value=0.0, layout=Layout(width='auto'), style=ProgressStyle(bar_color='black'))
┌───────────────────────┬────────────────────┐
│ zone │ highest_tip_amount │
│ varchar │ double │
├───────────────────────┼────────────────────┤
│ Upper East Side North │ 380.8 │
└───────────────────────┴────────────────────┘
┌───────────────────────┬────────────────────┐
│ zone │ highest_tip_amount │
│ varchar │ double │
├───────────────────────┼────────────────────┤
│ Upper East Side North │ 380.8 │
└───────────────────────┴────────────────────┘
- How many trips were made for each rate code type?
- What is the average total_amount for each rate code type?
# How many trips were made for each rate code type?
q = """
SELECT rate_code_dim.rate_code_name, count(*) as trip_count
FROM fact_table
JOIN rate_code_dim ON rate_code_dim.RatecodeID = fact_table.rate_code_id
GROUP BY rate_code_dim.rate_code_name
ORDER BY trip_count DESC
"""
print("How many trips were made for each rate code type?")
run(q)
q = """
SELECT rate_code_dim.rate_code_name, round(avg(fact_table.total_amount),2) as avg_total_amount
FROM fact_table
JOIN rate_code_dim ON rate_code_dim.RatecodeID = fact_table.rate_code_id
GROUP BY rate_code_dim.rate_code_name
ORDER BY avg_total_amount DESC
"""
print("What is the average total_amount for each rate code type?")
run(q)
How many trips were made for each rate code type?
┌───────────────────────┬────────────┐
│ rate_code_name │ trip_count │
│ varchar │ int64 │
├───────────────────────┼────────────┤
│ Standard rate │ 114239 │
│ Nassau or Westchester │ 15043 │
│ Newark │ 13106 │
│ Negotiated fare │ 8958 │
│ JFK │ 4366 │
│ Group ride │ 6 │
└───────────────────────┴────────────┘
What is the average total_amount for each rate code type?
┌───────────────────────┬──────────────────┐
│ rate_code_name │ avg_total_amount │
│ varchar │ double │
├───────────────────────┼──────────────────┤
│ Group ride │ 132.78 │
│ JFK │ 119.95 │
│ Negotiated fare │ 107.4 │
│ Standard rate │ 89.09 │
│ Nassau or Westchester │ 81.39 │
│ Newark │ 37.76 │
└───────────────────────┴──────────────────┘
┌───────────────────────┬──────────────────┐
│ rate_code_name │ avg_total_amount │
│ varchar │ double │
├───────────────────────┼──────────────────┤
│ Group ride │ 132.78 │
│ JFK │ 119.95 │
│ Negotiated fare │ 107.4 │
│ Standard rate │ 89.09 │
│ Nassau or Westchester │ 81.39 │
│ Newark │ 37.76 │
└───────────────────────┴──────────────────┘