Bus cost contd #1010

Merged
merged 13 commits into from
Feb 2, 2024

csuyat-dot
Contributor

Continuing work related to issue #897

Per meeting with Hunter and Zack, propulsion type and bus size categories are needed to get an apples-to-apples comparison of bus costs.
Cleaned FTA_bus_grant_analysis notebook. Found rows where the propulsion type was not captured correctly. Modified lists and dictionaries as needed to ensure propulsion types were extracted from all rows.

Double checked tircp_bus_analysis data and confirmed all propulsion types were extracted correctly.
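For reference, a minimal sketch of the keyword-dictionary approach described above, including a check for rows that no keyword matched. The keyword lists, the `project_description` column, and the `extract_propulsion` helper are all hypothetical; the actual lists and dictionaries live in the notebooks and are not shown in this PR.

```python
import pandas as pd

# Hypothetical keyword lists; the real notebook's lists are not shown in this PR.
PROPULSION_KEYWORDS = {
    "BEB": ["battery electric", "electric bus", "beb"],
    "FCEB": ["fuel cell", "hydrogen"],
    "CNG": ["cng", "compressed natural gas"],
    "diesel": ["diesel"],
}


def extract_propulsion(description: str) -> str:
    """Return the first propulsion type whose keyword appears in the text."""
    text = str(description).lower()
    for prop_type, keywords in PROPULSION_KEYWORDS.items():
        if any(keyword in text for keyword in keywords):
            return prop_type
    return "not specified"


# Made-up rows for illustration only.
df = pd.DataFrame(
    {
        "project_description": [
            "Purchase 10 battery electric buses",
            "Replace CNG fleet vehicles",
            "Hydrogen fuel cell bus pilot",
            "Bus stop improvements",
        ]
    }
)
df["prop_type"] = df["project_description"].apply(extract_propulsion)

# Rows that fell through every keyword list are the ones to inspect by hand.
unmatched = df[df["prop_type"] == "not specified"]
print(df["prop_type"].value_counts())
print(unmatched)
```

Looking at `unmatched` after each change to the keyword lists is one way to confirm that every row ends up with a propulsion type, or is knowingly left as "not specified".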

Made lots of changes to cost_per_bus_analysis notebook. Adjusted aggfunc to calculate cost per bus (cpb) AFTER aggregating by propulsion type. Re-ran totals and charts with updated propulsion type data. Cleaned up notebook by consolidating lines and removing preview lines. Edited the summary cell and added text to charts.
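To illustrate why the order matters, here is a small sketch comparing cpb computed per row and then averaged against cpb computed after summing within each propulsion type. The column names (`prop_type`, `total_cost`, `bus_count`) and the numbers are made up for illustration and are not taken from the notebook.

```python
import pandas as pd

# Made-up numbers for illustration only.
df = pd.DataFrame(
    {
        "prop_type": ["BEB", "BEB", "CNG"],
        "total_cost": [1_000_000, 9_000_000, 2_000_000],
        "bus_count": [1, 20, 4],
    }
)

# Option A: per-row cost per bus, then average within each propulsion type.
# A one-bus project counts as much as a twenty-bus project.
df["cpb_row"] = df["total_cost"] / df["bus_count"]
mean_of_ratios = df.groupby("prop_type")["cpb_row"].mean()

# Option B: sum cost and bus counts within each type first, then divide.
# Each bus carries equal weight in the result.
agg = df.groupby("prop_type").agg(
    total_cost=("total_cost", "sum"),
    bus_count=("bus_count", "sum"),
)
agg["cpb"] = agg["total_cost"] / agg["bus_count"]

print(mean_of_ratios)
print(agg)
```

With these numbers the two options disagree for BEB (725,000 for Option A versus roughly 476,190 for Option B) and agree for CNG, which has a single row.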

Next, reading in/cleaning dgs_usage_report_bus_analysis notebook. Will add to cost per bus notebook when done.

@csuyat-dot csuyat-dot merged commit 7d700c1 into main Feb 2, 2024
3 checks passed
@csuyat-dot csuyat-dot deleted the bus_cost_contd branch February 2, 2024 19:22
@tiffanychu90
Member

Next cleaning steps related to this PR:

  • Notebooks right now have mixed data cleaning + exploratory (look at values) + charts. Start separating these out.
  • Move data cleaning into scripts (.py) instead of notebooks (.ipynb) now that you know most of what you want to clean. You will still be able to look at df.some_col.value_counts() this way while keeping your code cleaner. Any additional things you want to clean, you'll discover within a notebook, but add it in the function for data cleaning.
  • An example of a script to do the initial importing of a csv + cleaning + saving out as parquet.
  • I used cost_per_bus_analysis.ipynb as an example, but the concepts should be applied to all notebooks that have had some data cleaning already
import pandas as pd

def clean_fta():
   df = pd.read_csv("some_file.csv")

   drop_cols = ["col1", "col2"]
   # define as dictionary rather than straight df.columns = new_column names, which relies
   # on the order being the same. a dictionary looks for an old name and replaces with new name.
   rename_cols = {
      "old_name": "new_name",
      "old_name2": "new_name2",
   }

   df1 = df[df.bus_count > 0].drop(columns = drop_cols).rename(columns = rename_cols)

   # named the transformed df as df1, and you can 
   # compare what happens when you return df vs df1
   # when you have more steps of data cleaning, you might transform df1 some more

   return df1

def clean_tircp():
   df = pd.read_csv("file.csv")

   drop_cols = some_list
   rename_cols = some_dict

   # see how this line is now very similar to the one in the earlier function
   df1 = df.drop(columns = drop_cols).rename(columns = rename_cols)
   # return the cleaned df1, not the raw df
   return df1

def merge_fta_tircp(fta, tircp):
   # i don't see a merging step, i see a concatenation, but are you 
   # sure that each row lines up perfectly? what if one df is sorted alphabetically and the other isn't?
   # it's safer to merge on a column or list of columns
   df = pd.merge(
      fta,
      tircp,
      on = "agency_name",
      how = "left"  # or "inner", depending on which rows you want to keep
   )

   return df
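A small sketch of the concern raised in `merge_fta_tircp`, assuming the concatenation in the notebook was side by side (`axis=1`), which is not confirmed in this PR. The agency names and counts are made up. `validate` and `indicator` are optional `pd.merge` arguments that help catch duplicate keys and unmatched rows.

```python
import pandas as pd

# Made-up data; note that tircp is in a different order and is missing Agency B.
fta = pd.DataFrame(
    {
        "agency_name": ["Agency A", "Agency B", "Agency C"],
        "fta_bus_count": [10, 5, 8],
    }
)
tircp = pd.DataFrame(
    {
        "agency_name": ["Agency C", "Agency A"],
        "tircp_bus_count": [3, 7],
    }
)

# Side-by-side concatenation pairs rows by index position, not by agency,
# so Agency A's FTA row ends up next to Agency C's TIRCP row.
side_by_side = pd.concat([fta, tircp], axis=1)

# Merging on the key column lines rows up by agency regardless of sort order.
merged = pd.merge(
    fta,
    tircp,
    on="agency_name",
    how="left",
    validate="one_to_one",  # raises if agency_name is duplicated on either side
    indicator=True,  # adds a _merge column: both / left_only / right_only
)

print(side_by_side)
print(merged)
```

Checking `merged["_merge"].value_counts()` afterwards shows how many agencies matched and how many appear in only one dataset.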

@tiffanychu90
Member

Good job on functions for making charts.

  • An equals sign in the function signature sets a default value. In this case, the default x column is agency_name, and the default df is zscore_bus. Do you want to define a default df though? What if you call the df something else later, and now the function errors?
# if you take out the default df being defined, it would look like this
import matplotlib.pyplot as plt

def make_chart(y_col, title, data, x_col="agency_name"):
    data.sort_values(by=y_col, ascending=False).head(10).plot(
        x=x_col, y=y_col, kind="bar", color="skyblue"
    )
    plt.title(title)
    plt.xlabel(x_col)
    plt.ylabel(y_col)
    
    plt.ticklabel_format(style='plain', axis='y')
    plt.show()
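A short sketch of what can go wrong with a default dataframe. Python evaluates default arguments once, when the `def` statement runs, so the default stays tied to whatever `zscore_bus` referred to at that moment (and the `def` raises a `NameError` if that name does not exist yet). The `top_agency_default` and `top_agency` helpers and the data below are hypothetical stand-ins for the chart function, used only so the behavior can be checked without plotting.

```python
import pandas as pd

zscore_bus = pd.DataFrame({"agency_name": ["A", "B"], "cpb": [2, 1]})


# The default is bound to whatever `zscore_bus` refers to when `def` runs.
def top_agency_default(y_col, data=zscore_bus, x_col="agency_name"):
    return data.sort_values(by=y_col, ascending=False)[x_col].iloc[0]


# No default: the caller always states which dataframe to use.
def top_agency(y_col, data, x_col="agency_name"):
    return data.sort_values(by=y_col, ascending=False)[x_col].iloc[0]


# Later in the notebook, the name is reassigned to a new dataframe.
zscore_bus = pd.DataFrame({"agency_name": ["C", "D"], "cpb": [5, 9]})

stale = top_agency_default("cpb")  # silently uses the old dataframe
fresh = top_agency("cpb", zscore_bus)  # uses the dataframe passed in

print(stale, fresh)
```

Here `stale` comes from the original dataframe even though `zscore_bus` now points to new data, which is the kind of quiet mismatch that requiring `data` as an argument avoids.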
