Merge pull request #17 from ttan06/tim_branch_test
Adding Tests and Additional Documentation
ttan06 authored Mar 13, 2024
2 parents 594c62b + bf4c0a2 commit c03e7f4
Showing 13 changed files with 496 additions and 33 deletions.
30 changes: 30 additions & 0 deletions data/README.md
@@ -0,0 +1,30 @@
# Data

## MLB Schedule

For more details and to refresh the schedule for a future season, visit the notebook 'Schedule Scraping.ipynb' in the notebooks section of this repository.

The MLB schedule dataset used by the app is 'final_mlb_schedule.csv', which contains nearly 3,000 rows. Each row holds the details of one game: the teams, date, time, and coordinates of the stadium.

## Cost Matrix

The cost matrix is a csv 'cost_df.csv', which is created by running the following:

```commandline
python -m data.create_cost_matrix
```

### Description
It has 900 rows, one per ordered pair of the 30 teams, each with a from and to location, the cost of the trip, and the airport cities at both ends.
The intended usage is to look up a route by its from and to locations and read the "fare" column, which is based on cost and time.
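
As a rough illustration of that lookup, the sketch below loads a few rows into a dictionary keyed by the from and to teams and sums the fares along a route. It assumes the `Team1`, `Team2`, and `fare` column names that the creation script produces; the sample rows and fares are made up, and `load_fares` and `route_cost` are hypothetical helpers, not functions in this repository.

```python
import csv
import io

# Illustrative rows only: these fares are made up, not taken from cost_df.csv.
SAMPLE_CSV = """Team1,Team2,fare
Seattle Mariners,Baltimore Orioles,250.0
Baltimore Orioles,Seattle Mariners,250.0
Seattle Mariners,Seattle Mariners,0.0
"""


def load_fares(handle):
    """Map (from_team, to_team) -> fare from a CSV with Team1, Team2 and fare columns."""
    return {
        (row["Team1"], row["Team2"]): float(row["fare"])
        for row in csv.DictReader(handle)
    }


def route_cost(route, fares):
    """Sum the fare of each consecutive leg in an ordered list of teams."""
    return sum(fares[(start, end)] for start, end in zip(route, route[1:]))


fares = load_fares(io.StringIO(SAMPLE_CSV))
total = route_cost(
    ["Seattle Mariners", "Baltimore Orioles", "Seattle Mariners"], fares
)
```

To read the real file, pass `open('data/cost_df.csv', newline='')` instead of the in-memory sample.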

### Acquisition
The script uses the file 'team_airport_key.csv', which is a mapping of the team stadium to the closest major airport.

It also uses the file 'Consumer_Airfare_Report__Table_6_-_Contiguous_State_City-Pair_Markets_That_Average_At_Least_10_Passengers_Per_Day_20240309.csv'

This file is obtained from the [Consumer Airfare Report](https://data.transportation.gov/Aviation/Consumer-Airfare-Report-Table-6-Contiguous-State-C/yj5y-b2ir/data). Since this downloaded csv is quite large, it is not included in the repository and must be downloaded prior to running the 'create_cost_matrix.py' file.

### Other Details
Since not every pair of teams is connected by a flight in the dataset, we fill in the missing cost data with a driving estimate. We use the average cost of a gallon of gas in 2023, listed as $3.29 when retrieved in March 2024 [(Source)](https://www.finder.com/economics/gas-prices#:~:text=National%20average%3A%20The%20current%20national%20average%20cost,for%20gas%20is%20%243.23%20%28Feb.%2022%2C%202024%29%20%281%29).
We also use the average fuel economy, which was 36 mpg [(Source)](https://www.caranddriver.com/research/a31518112/what-is-good-gas-milage/).
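
The arithmetic behind the driving estimate can be sketched as follows. The two constants are the figures quoted above; `driving_cost` and `fill_fare` are hypothetical helper names used only for illustration, and `None` stands in for a missing airfare.

```python
AVG_MPG = 36            # average miles per gallon quoted above
COST_PER_GALLON = 3.29  # dollars per gallon quoted above
COST_PER_MILE = COST_PER_GALLON / AVG_MPG  # roughly 9 cents per mile


def driving_cost(miles):
    """Estimated fuel cost in dollars for a drive of the given length."""
    return miles * COST_PER_MILE


def fill_fare(airfare, miles):
    """Use the airfare when one exists, otherwise fall back to the driving estimate."""
    return driving_cost(miles) if airfare is None else airfare
```

Under these figures a 360-mile drive uses 10 gallons, so the estimate is about $32.90.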
47 changes: 32 additions & 15 deletions data/create_cost_matrix.py
@@ -1,15 +1,32 @@
"""
Script that creates a cost matrix from the Consumer Airfare Report data
Download the base dataset as a csv from:
https://data.transportation.gov/Aviation/Consumer-Airfare-Report-Table-6-Contiguous-State-C/yj5y-b2ir/data
Also uses team airport key.
Run using 'python -m data.create_cost_matrix' in a terminal from the repo root
Requires pandas, numpy and makeRoute
"""
import pandas as pd
import numpy as np
# pylint: disable=import-error
from makeRoute.distance import dist

df = pd.read_csv('data/Consumer_Airfare_Report__Table_6_-_Contiguous_State_City-Pair_Markets_That_Average_At_Least_10_Passengers_Per_Day_20240309.csv')
# file name may change depending on local file
df = pd.read_csv('data'
'/Consumer_Airfare_Report__Table_6_-_Contiguous_State_City-Pair_Markets_That_'
'Average_At_Least_10_Passengers_Per_Day_20240309.csv')
team_airport_key = pd.read_csv('data/team_airport_key.csv')

cities = df[['city1', 'city2','fare', 'Year', 'quarter']]
cities = cities.loc[df['Year'] == 2023]
cities = cities.loc[df['quarter'] == 3]

df_filtered = cities[cities['city1'].isin(team_airport_key['AirportCity']) & cities['city2'].isin(team_airport_key['AirportCity'])]
df_filtered = cities[cities['city1'].isin(team_airport_key['AirportCity'])
& cities['city2'].isin(team_airport_key['AirportCity'])]

teams1 = []
teams2 = []
@@ -19,18 +36,21 @@
teams2.append(team2)

df_team_base = pd.DataFrame({'Team1':teams1, 'Team2':teams2})
df_team_base = df_team_base.merge(team_airport_key, left_on=['Team1'], right_on=['Team'], how='left')
df_team_base = df_team_base.merge(team_airport_key, left_on=['Team1']
, right_on=['Team'], how='left')
df_team_base['AirportCity1'] = df_team_base['AirportCity']
df_team_base= df_team_base.drop(columns = ['Team', 'AirportCity'])
df_team_base = df_team_base.merge(team_airport_key, left_on=['Team2'], right_on=['Team'], how='left')
df_team_base = df_team_base.merge(team_airport_key, left_on=['Team2']
, right_on=['Team'], how='left')
df_team_base['AirportCity2'] = df_team_base['AirportCity']
df_team_base= df_team_base.drop(columns = ['Team', 'AirportCity'])
df_filtered = df_filtered.drop(columns = ['Year', 'quarter'])
df_reverse = pd.DataFrame({'city1': df_filtered['city2'], 'city2': df_filtered['city1'], 'fare':df_filtered['fare']})
df_reverse = pd.DataFrame({'city1': df_filtered['city2'], 'city2': df_filtered['city1'],
'fare':df_filtered['fare']})
df_full = pd.concat([df_filtered, df_reverse])
df_full = df_full.sort_values(by = ['city1', 'city2']).reset_index().drop(columns = ['index'])
df_flights = df_team_base.merge(df_full, left_on = ['AirportCity1', 'AirportCity2'], right_on = ['city1', 'city2'], how = 'left')
#df_flights.loc[df_flights['AirportCity1'] == df_flights['AirportCity2'], ['fare']] = 0
df_flights = df_team_base.merge(df_full, left_on = ['AirportCity1', 'AirportCity2'],
right_on = ['city1', 'city2'], how = 'left')
df_flights['city1'] = df_flights['AirportCity1']
df_flights['city2'] = df_flights['AirportCity2']
df_travel = df_flights.drop(columns = ['AirportCity1', 'AirportCity2'])
@@ -53,20 +73,17 @@
dists.append(distance)
df_travel['dist'] = dists
df_travel = df_travel.rename(columns={'fare': 'airfare'})
avg_mpg = 36
cost_per_gallon = 3.29
cost_per_mile = cost_per_gallon / avg_mpg

df_travel['car_fare'] = df_travel['dist'] * cost_per_mile
AVG_MPG = 36
COST_PER_GALLON = 3.29
COST_PER_MILE = COST_PER_GALLON / AVG_MPG

df_travel['car_fare'] = df_travel['dist'] * COST_PER_MILE
df_travel['fare'] = df_travel['airfare']
df_travel['fare'] = df_travel['fare'].fillna(df_travel['car_fare'])
df_travel['min_fare'] = df_travel[['car_fare','fare']].min(axis=1)

# only take the minimum fare where distance is less than 125 miles, i.e. less than 2.5 hours of drive time
df_travel['fare'] = np.where(df_travel['dist'] < 125, df_travel['min_fare'], df_travel['fare'])

#df_flights.loc[df_flights['AirportCity1'] == df_flights['AirportCity2'], ['fare']] = 0
#print(df_travel.loc[(df_travel['Team1']=='Baltimore Orioles') & (df_travel['Team2']=='Seattle Mariners')].reset_index()['fare'][0])
#print(len(df_travel.loc[(df_travel['dists']<125) & (df_travel['dists']>0)]))

df_travel.to_csv('data/cost_df.csv')
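
The fare logic at the end of this script can be restated for a single row in plain Python, which may make the short-trip rule easier to follow. This is a sketch rather than the script's code: `final_fare` is a hypothetical name, and `None` stands in for the NaN that pandas uses for a missing airfare.

```python
SHORT_TRIP_MILES = 125  # threshold used in the script's np.where condition


def final_fare(airfare, car_fare, dist):
    """Row-wise restatement: fill missing airfare, then prefer the cheaper option on short trips."""
    # Mirrors fillna: use the driving estimate when no airfare exists.
    fare = car_fare if airfare is None else airfare
    if dist < SHORT_TRIP_MILES:
        # Mirrors min_fare: on short trips take whichever is cheaper.
        return min(car_fare, fare)
    # Longer trips keep the airfare (or the filled-in driving estimate).
    return fare
```

For example, a 50-mile hop with a $200 airfare and a $5 driving estimate resolves to $5, while a 400-mile trip with the same airfare keeps the $200.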
29 changes: 26 additions & 3 deletions docs/Component_Spec.md
@@ -2,24 +2,42 @@
### Software Components

#### Input Handler/UI
* User interface that handles user inputs
* Takes in parameters from UI that are inputted by user
* User interface that handles user inputs. This is built with Dash.
* Takes in parameters from UI that are inputted by user.
* Dates - Strings converted to datetime objects - desired date range.
* Teams - Strings - desired teams to visit (up to 6)
* Optimization - String - how to optimize the schedule
* Applies inputted parameters to schedule/route to check legality (e.g. home games for teams exist during given time period)
* Output parameters to route generator to check legality and passes along user inputs
* Sends the above parameters to the next stage.

#### Route Creator
* Program that generates routes based on user-provided and intrinsic parameters
* Takes in inputs from input handler and from data sets and sources
* Uses python-tsp package to create routes ordered by optimal cost or distance, depending on user input.
* Dates - Strings converted to datetime objects - desired date range.
* Teams - Strings - desired teams to visit (up to 6)
* Optimization - String - how to optimize the schedule
* Schedule - Pandas DataFrame - the MLB schedule where game details are acquired from.
* Cost Matrix - Pandas DataFrame - cost it takes to travel from one place to another.
* Uses custom package to create routes ordered by optimal cost, time or distance, depending on user input.
* Checks legality of package-created routes
* Outputs several legal schedules and routes to route visualizer and UI
* Route - list - list of teams, ordered by visit
* Schedule - Pandas DataFrame - MLB schedule, but one game per each team in the route, optimized by user choice.
* Metrics - strings/float - The metrics (total cost, distance, time) from the optimized schedule.

#### Route Visualizer
* Program that visualizes routes on a map within the UI
* Takes in different legal routes in the form of directed graph from the route creator
* Route - list - list of teams, ordered by visit
* Schedule - Pandas DataFrame - MLB schedule, but one game per each team in the route, optimized by user choice.
* Metrics - strings/float - The metrics (total cost, distance, time) from the optimized schedule.
* Displays route on map using coordinates and edges
* Has toggle that can change route, and both map and table are updated
* Outputs visualized map and table displaying each step of the schedule or route to UI
* Map - Dash Graph Object - map of the US displaying the traveling route
* Schedule Table - Dash data table - tables of the inputted schedule
* Metrics Table - Dash data table - tables of the metrics from the optimized schedule.
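
One way to picture the hand-off between the route creator and the route visualizer is a small container type. The sketch below is hypothetical: `RouteResult`, `validate_teams`, and `MAX_TEAMS` are illustrative names, not part of the codebase, and the schedule is left untyped where the spec calls for a Pandas DataFrame.

```python
from dataclasses import dataclass, field
from typing import Any

MAX_TEAMS = 6  # "up to 6" teams, as stated in the input handler section


@dataclass
class RouteResult:
    """Hypothetical container for what the route creator passes to the visualizer."""
    route: list[str]   # teams, ordered by visit
    schedule: Any      # a Pandas DataFrame in the spec; left untyped in this sketch
    metrics: dict[str, float] = field(default_factory=dict)  # total cost, distance, time


def validate_teams(teams):
    """Reject selections above the limit the spec states."""
    if len(teams) > MAX_TEAMS:
        raise ValueError("Too many selections")
    return list(teams)
```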

### Interactions
#### Use Case: Generate Schedules
@@ -37,3 +55,8 @@
* The user hovers over a point on the map on the UI, the UI tracks the movement and retrieves the information from that point from the Route Visualizer and displays the data to the user.

![](images/View%20Routes%20and%20Costs%20on%20Map.png)

#### Route Creation Pipeline
* This is a more detailed view of how the routes are created in the route creator.

![](images/schedule_builder_pipeline.png)
4 changes: 4 additions & 0 deletions docs/Functional_Spec.md
@@ -34,3 +34,7 @@ The objective of the user interaction is to view the outputted schedules and rou

The expected interactions are that after the user generates schedules, they can view and interact with the map portion of the UI and hover over cities and routes on the map. The system will display the map and other information depending on what the user chooses to view on the UI.

#### Use Case: Understanding Distance and Cost of Traveling in the US
The objective of this user interaction is to be able to see how much it costs to travel in the US and how far it may be.

The expected interactions are for the user to view the table on the bottom left of the dashboard that summarizes the trip.
29 changes: 17 additions & 12 deletions docs/Milestones.md
@@ -1,17 +1,22 @@
## Milestones

Each section and subsection is listed in order of priority.

### Data Processing
* Scrape MLB schedule into table
* Join home city coordinates to each game entry
* Create distance matrix with average flight costs instead of distance
1. Scrape MLB schedule into table
2. Join home city coordinates to each game entry
3. Create distance matrix with average flight costs instead of distance
### User Interface
* Decide on package or interface program
* Create method for reading in user UI inputs and store as parameters
* Method to display map on UI
* Find way to host or create instructions on how to run UI
1. Decide on package or interface program
2. Create method for reading in user UI inputs and store as parameters
3. Method to display map on UI
4. Find way to host or create instructions on how to run UI
### Schedule Creation
* Method to check if user inputted parameters are legal
* Method to create several potential routes ordered by optimality (lowest cost or distance) using python-tsp
* Method to align routes with MLB schedule
1. Method to check if user inputted parameters are legal
2. Method to create potential routes
3. Method to align routes with MLB schedule
4. Method to order routes by user-chosen optimality
### Map Creation
* Method that takes in schedule and converts to graph
* Method to visualize graph onto a US Map (Think about how to handle Toronto and international games)
1. Method that takes in schedule and converts to graph
2. Method to visualize graph onto a US Map (Think about how to handle Toronto and international games)
3. Method to obtain additional information by hovering over graph (cost, coordinates)
Binary file added docs/images/Coverage_Test.png
Binary file added docs/images/schedule_builder_pipeline.png
11 changes: 11 additions & 0 deletions makeRoute/README.md
@@ -0,0 +1,11 @@
## Packages

### Distance

This file contains the functions used to calculate the distance between two points and build a distance matrix between several locations.
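
For readers unfamiliar with this kind of calculation, here is a minimal sketch of a great-circle distance and a pairwise matrix built from it. It assumes the haversine formula and a mean Earth radius of 3958.8 miles; the package's actual formula is not shown in full in this commit, and `haversine_miles` and `distance_matrix` are hypothetical names.

```python
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_MILES = 3958.8  # assumed mean Earth radius


def haversine_miles(x, y):
    """Great-circle distance in miles between two (latitude, longitude) tuples in degrees."""
    if len(x) != 2 or len(y) != 2:
        raise TypeError('Incorrect coordinate shape')
    lat1, lon1 = radians(x[0]), radians(x[1])
    lat2, lon2 = radians(y[0]), radians(y[1])
    a = (sin((lat2 - lat1) / 2) ** 2
         + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2)
    # Clamp to 1.0 so rounding error cannot push asin out of its domain.
    return 2 * EARTH_RADIUS_MILES * asin(min(1.0, sqrt(a)))


def distance_matrix(points):
    """Return an n x n nested list of pairwise distances for a list of coordinate tuples."""
    return [[haversine_miles(p, q) for q in points] for p in points]
```

The matrix is symmetric with zeros on the diagonal, so an optimized version only needs to compute the entries above the diagonal.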

### Search

This file contains the functions used to create a schedule based on the desired teams and dates, optimized by cost, time or distance.

![](../docs/images/schedule_builder_pipeline.png)
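
The core idea of an exhaustive search can be sketched in a few lines: try every visiting order and keep the cheapest. This sketch deliberately ignores game dates, which the real search also has to respect. `cheapest_route` is a hypothetical name, and the `fares` dictionary keyed by (from, to) pairs is an assumed input format.

```python
from itertools import permutations


def cheapest_route(teams, fares):
    """Try every visiting order and return (route, cost) with the lowest total fare."""
    if len(teams) > 6:
        raise ValueError('Too many selections')
    best_route, best_cost = None, float('inf')
    for route in permutations(teams):
        # Total fare of the consecutive legs in this ordering.
        cost = sum(fares[(a, b)] for a, b in zip(route, route[1:]))
        if cost < best_cost:
            best_route, best_cost = list(route), cost
    return best_route, best_cost


# Made-up symmetric fares between three placeholder teams.
legs = {('A', 'B'): 10, ('B', 'C'): 20, ('A', 'C'): 50}
sample_fares = {**legs, **{(b, a): cost for (a, b), cost in legs.items()}}
best = cheapest_route(['A', 'B', 'C'], sample_fares)
```

Ties are resolved in favor of the first ordering found, because a later route only replaces the best one when it is strictly cheaper.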
9 changes: 8 additions & 1 deletion makeRoute/distance.py
@@ -21,7 +21,8 @@ def dist(x, y):
:param y: tuple - coordinates of second location
:return: float - distance in miles bewteen two points
"""

if len(x) != 2 or len(y) != 2:
raise TypeError('Incorrect coordinate shape')
lat1 = radians(x[0])
lon1 = radians(x[1])
lat2 = radians(y[0])
@@ -50,6 +51,12 @@ def dist_matrix(lat_long_df):
:return: numpy array - shape(n,n) where n is number of rows in lat_long_df.
each entry is the distance between the row team and column team
"""
if 'Latitude' not in lat_long_df.columns or 'Longitude' not in lat_long_df.columns:
raise ValueError('No Latitude or Longitude column')
if 'home team' not in lat_long_df.columns:
raise ValueError('No home team column')
if len(lat_long_df) < 1:
raise ValueError('Need at least 1 row of data')
lat_long = lat_long_df[['Latitude', 'Longitude']]
distances = pdist(lat_long.values, metric=dist)
points = lat_long_df['home team']
25 changes: 23 additions & 2 deletions makeRoute/exhaustiveSearch.py
@@ -2,6 +2,8 @@
This module implements various functions to build a route of games for a user to travel through.
Functions:
home_game_exists(schedule, team): Function that takes in a team and a schedule and
finds if the team has a game in that schedule
reduce_schedule(schedule, teams, start_date, end_date): function that creates a subset
of the MLB schedule
find_all_routes(teams): Function that provides all route combinations for teams.
@@ -24,6 +26,20 @@
import pandas as pd
from .distance import dist_matrix

def home_game_exists(schedule, team):
    """
    Function that takes in a team and a schedule and finds if the team has a game in
    that schedule
    :param schedule: pandas data frame - contains every game for the MLB season
        with teams, location, etc.
    :param team: string - team name
    :return: boolean - returns true if home team has a game in the schedule
    """
    for home_team in schedule["home team"]:
        if home_team == team:
            return True
    return False

def reduce_schedule(schedule, teams, start_date, end_date):
"""
Function that takes in teams and a date range to create a subset of the MLB schedule
@@ -35,18 +51,23 @@ def reduce_schedule(schedule, teams, start_date, end_date):
    :return: sched_subset: data frame - filtered subset of MLB schedule for
        specific teams and between dates
    """
    if (pd.to_datetime(end_date) - pd.to_datetime(start_date)).days + 1 < len(teams):
        raise ValueError('More teams than days')
    sched_subset = schedule.loc[(schedule['date'] >= start_date) & (schedule['date'] <= end_date)]
    for team_ in teams:
        if home_game_exists(sched_subset, team_) is False:
            raise ValueError(team_ + ' do not have a home game in this time frame')
    sched_subset = sched_subset.loc[schedule['home team'].isin(teams)]
    return sched_subset

def find_all_routes(teams):
"""
Function that provides all route combinations for teams. Raises an error if there are more
than 7 teams, to reduce runtime
than 6 teams, to reduce runtime
:param teams: list - list of teams
:return: list: list of routes, where a route is a list of teams in a specific order
"""
if len(teams) > 7:
if len(teams) > 6:
raise ValueError('Too many selections')
routes = [list(route) for route in permutations(teams)]
return routes
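
Because `find_all_routes` enumerates every ordering of the selected teams, the number of routes grows factorially, which is presumably why the limit matters for runtime. The short sketch below only counts orderings; `count_routes` is a hypothetical helper.

```python
from itertools import permutations
from math import factorial


def count_routes(n_teams):
    """Number of distinct visiting orders for n_teams teams."""
    return factorial(n_teams)


# Route counts for 1 through 7 teams.
sizes = {n: count_routes(n) for n in range(1, 8)}

# Cross-check the formula against an explicit enumeration for a small case.
four_team_routes = list(permutations(range(4)))
```

Six teams give 720 orderings and seven give 5,040, so lowering the limit from 7 to 6 cuts the worst case by a factor of seven.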
29 changes: 29 additions & 0 deletions tests/README.md
@@ -0,0 +1,29 @@
# Baseball Stadium Travels

## Testing

This directory contains the test files for the baseball stadium travels project.


### Unit Tests
To run the unit tests for search and distance, run the following:

```commandline
python -m tests.test_search
python -m tests.test_distance
```
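
For orientation, a unit test in this style might look like the sketch below. It does not import the project; instead `check_date_range` is a hypothetical stand-in that mirrors the "More teams than days" check in `reduce_schedule`, using `datetime.date` rather than pandas timestamps.

```python
import unittest
from datetime import date


def check_date_range(start_date, end_date, teams):
    """Stand-in for the first validation in reduce_schedule: at least one day per team."""
    if (end_date - start_date).days + 1 < len(teams):
        raise ValueError('More teams than days')


class TestDateRange(unittest.TestCase):
    def test_too_few_days(self):
        # Two days cannot cover three teams.
        with self.assertRaises(ValueError):
            check_date_range(date(2024, 4, 1), date(2024, 4, 2), ['A', 'B', 'C'])

    def test_exactly_enough_days(self):
        # Three days for three teams passes without raising.
        check_date_range(date(2024, 4, 1), date(2024, 4, 3), ['A', 'B', 'C'])


# Run the cases programmatically so the outcome can be inspected.
suite = unittest.defaultTestLoader.loadTestsFromTestCase(TestDateRange)
result = unittest.TextTestRunner(verbosity=0).run(suite)
```

In a test module meant to be run with `python -m`, the last two lines would usually be replaced by a call to `unittest.main()` under an `if __name__ == '__main__':` guard.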

### Coverage

To run the coverage tests, run:

```commandline
pip install coverage
python -m coverage run -m unittest discover
python -m coverage report
```

#### Last Results

![](../docs/images/Coverage_Test.png)
