Bus cost contd #1010

csuyat-dot · 2024-02-02T19:17:00Z

Continuing work related to issue #897

Per meeting with Hunter and Zack, propulsion type and bus size categories are needed to get an apples-to-apples comparison of bus costs.
Cleaned FTA_bus_grant_analysis notebook. Found there were rows that did not capture the propulsion type correctly. Modified lists and dictionaries as needed to ensure propulsion types were extracted from all rows.

Double checked tircp_bus_analysis data and confirmed all propulsion types were extracted correctly.

Made lots of changes to cost_per_bus_analysis notebook. Adjusted agg func to calculate cpb AFTER aggregating by propulsion type. Re-ran totals and charts with updated propulsion type data. Cleaned up notebook by consolidating lines and removing preview lines. Edited the summary cell and added text to charts.
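
To illustrate why the order matters for cost per bus (cpb): dividing after aggregating gives total cost over total buses for each propulsion type, while averaging per-row ratios weights a 2-bus grant the same as an 8-bus grant. A minimal sketch with made-up numbers; the column names `prop_type`, `total_cost`, and `bus_count` are assumptions and may not match the notebook.

```python
import pandas as pd

# Made-up rows for illustration only; not real grant data.
df = pd.DataFrame({
    "prop_type": ["BEB", "BEB", "CNG"],
    "total_cost": [1_000_000, 9_000_000, 2_000_000],
    "bus_count": [2, 8, 4],
})

# Aggregate first, then divide: total cost / total buses per propulsion type.
agg = (
    df.groupby("prop_type")
    .agg(total_cost=("total_cost", "sum"), bus_count=("bus_count", "sum"))
    .reset_index()
)
agg["cpb"] = agg["total_cost"] / agg["bus_count"]

# Divide first, then average: every row counts equally regardless of bus count.
row_cpb = df.assign(cpb=df["total_cost"] / df["bus_count"])
mean_of_ratios = row_cpb.groupby("prop_type")["cpb"].mean()
```

With these numbers the two approaches disagree for BEB (1,000,000 vs 812,500 per bus) and agree for CNG, which has only one row.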

Next, reading in/cleaning dgs_usage_report_bus_analysis notebook. Will add to cost per bus notebook when done.

github-actions · 2024-02-02T19:21:50Z

nbviewer URLs for impacted notebooks:

tiffanychu90 · 2024-02-08T19:49:34Z

Next cleaning steps related to this PR:

Notebooks right now have mixed data cleaning + exploratory (look at values) + charts. Start separating these out.
Move data cleaning into scripts (.py) instead of notebooks (.ipynb) now that you know most of what you want to clean. You will still be able to look at df.some_col.value_counts() this way while keeping your code cleaner. Any additional things you want to clean, you'll discover within a notebook, but add it in the function for data cleaning.
An example of a script to do the initial importing of a csv + cleaning + saving out as parquet.
I used cost_per_bus_analysis.ipynb as an example, but the concepts should be applied to all notebooks that have had some data cleaning already

import pandas as pd

def clean_fta():
    # placeholder file name, swap in the real path
    df = pd.read_csv("some_file.csv")

    drop_cols = ["col1", "col2"]
    # define as dictionary rather than straight df.columns = new_column names, which relies
    # on the order being the same. a dictionary looks for an old name and replaces with new name.
    rename_cols = {
        "old_name": "new_name",
        "old_name2": "new_name2",
    }

    df1 = df[df.bus_count > 0].drop(columns=drop_cols).rename(columns=rename_cols)

    # named the transformed df as df1, and you can
    # compare what happens when you return df vs df1
    # when you have more steps of data cleaning, you might transform df1 some more

    return df1

def clean_tircp():
    # placeholder file name, swap in the real path
    df = pd.read_csv("file.csv")

    drop_cols = ["col1"]  # placeholder: your list of columns to drop
    rename_cols = {"old_name": "new_name"}  # placeholder: your rename dictionary

    # see how this line is now very similar to the one in the earlier function
    df1 = df.drop(columns=drop_cols).rename(columns=rename_cols)
    # return the cleaned df1, not the untouched df
    return df1

def merge_fta_tircp(fta, tircp):
    # i don't see a merging step, i see a concatenation, but are you
    # sure that each row lines up perfectly? what if one df is sorted alphabetically and the other isn't?
    # it's safer to merge on a column or list of columns
    df = pd.merge(
        fta,
        tircp,
        on="agency_name",
        how="left",  # or "inner", depending on which rows you want to keep
    )

    return df
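
One way to check whether rows line up as expected after a merge is to pass `validate` and `indicator=True` to `pd.merge`. A sketch with made-up agencies; the `agency_name` key follows the example above, while the other column names and values are hypothetical.

```python
import pandas as pd

# Made-up data for illustration only.
fta = pd.DataFrame({
    "agency_name": ["Agency A", "Agency B", "Agency C"],
    "fta_bus_count": [10, 5, 8],
})
# Different sort order, and one agency is missing.
tircp = pd.DataFrame({
    "agency_name": ["Agency C", "Agency A"],
    "tircp_bus_count": [3, 7],
})

merged = pd.merge(
    fta,
    tircp,
    on="agency_name",
    how="left",
    validate="1:1",   # raises MergeError if agency_name repeats on either side
    indicator=True,   # adds a _merge column: both / left_only / right_only
)

# Agencies in fta with no match in tircp.
unmatched = merged[merged["_merge"] == "left_only"]["agency_name"].tolist()
```

Here the merge matches on the key regardless of sort order, and `unmatched` lists the agencies that a row-by-row concatenation would have silently misaligned.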

tiffanychu90 · 2024-02-08T19:53:12Z

Good job on functions for making charts.

An equal sign in the function signature means that's the default value. In this case, the default x column is agency_name, and the default df is zscore_bus. Do you want to define the default df though? What if you call the df something else later, and now the function errors?

import matplotlib.pyplot as plt

# if you take out the default df being defined, it would look like this
def make_chart(y_col, title, data, x_col="agency_name"):
    data.sort_values(by=y_col, ascending=False).head(10).plot(
        x=x_col, y=y_col, kind="bar", color="skyblue"
    )
    plt.title(title)
    plt.xlabel(x_col)
    plt.ylabel(y_col)
    
    plt.ticklabel_format(style='plain', axis='y')
    plt.show()
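
A sketch of why a default df can surprise you: Python evaluates default values once, when the function is defined, so the default keeps pointing at whatever `zscore_bus` was at that moment. The helper `top_agency` and the data below are hypothetical, used only to show the behavior without plotting.

```python
import pandas as pd

# Made-up data for illustration only.
zscore_bus = pd.DataFrame({"agency_name": ["A", "B"], "cpb": [100, 200]})

# The default is bound when `def` runs, to the zscore_bus object that exists right now.
def top_agency(y_col, data=zscore_bus, x_col="agency_name"):
    return data.sort_values(by=y_col, ascending=False)[x_col].iloc[0]

# Later, the name zscore_bus is reassigned to a new df...
zscore_bus = pd.DataFrame({"agency_name": ["C", "D"], "cpb": [900, 50]})

stale = top_agency("cpb")                       # still uses the old df
explicit = top_agency("cpb", data=zscore_bus)   # uses the df you pass in
```

If `zscore_bus` did not exist at all when the function was defined, the `def` line itself would raise a `NameError`, which is another reason to require `data` as an argument.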

csuyat-dot added 13 commits January 29, 2024 19:06

cleared up prop type col on manual check df

26f987d

read in cleaned fta data back into cpb notebook, also adjusting the c…

5fe4484

…harts

editing charts using aggregate df

b17f09c

confirmed TIRCP prop_type values are good. laid out game plan for cha…

2531bd8

…rts on cpb notebook

creating more charts for zeb and non-zeb only buses

e624e6f

adjusted distribution chart by removing outliers for zeb only dist

1d57ff1

cleaned up summary and reorganized charts

a02a8b0

recalculated cost per bus for bus_agg function. reran charts

38a7fab

consolidating and reorganizing cells

6d67726

almost done organizing cpb notebook. also started to marge of dgs data

6ef3c7d

merged both DFs on common columns names

c4d73fc

copied some code from FTA notebook. working on total cost column cond…

776c6c9

…itional on other columns

finalizing cost per bus notebook first. then continue to work on dgs …

969598c

…reports and will work on adding dgs data into cpb notebook after

csuyat-dot merged commit 7d700c1 into main Feb 2, 2024
3 checks passed

csuyat-dot deleted the bus_cost_contd branch February 2, 2024 19:22
