This dataset was downloaded from DataSF as a csv file named Assessor_Historical_Secured_Property_Tax_Rolls.csv
. It contains many columns but only the following are relevant for this project: "Closed Roll Year", "Assessor Neighborhood", and "Assessed Land Value". Assessed land value is treated as the home price since this is the 'valuation of a residence for tax purposes'.
Note: This csv file is not uploaded because Github limits file sizes to 100 MB.
Two Python (version 3.6) modules are used in processing the data:
- Pandas
- datetime
# Import Dependencies
import pandas as pd
import datetime
The csv file is loaded into a dataframe.
csv_path = "Assessor_Historical_Secured_Property_Tax_Rolls.csv"
home_df = pd.read_csv(csv_path, encoding='utf-8')
First, only the relevant columns were retained in the dataframe.
# Create a new dataframe that contains only the relevant columns
home_df2 = home_df[["Closed Roll Year", "Assessor Neighborhood", "Assessed Land Value"]]
Second, the data was grouped based on year and neighbourhood.
# Group the data by neighbourhood
home_gb = home_df2.groupby(["Closed Roll Year", "Assessor Neighborhood"])
home_gb_mean = home_gb.mean()
home_gb2 = home_gb_mean.reset_index() # index (year) is treated as a column
"Closed Roll Year" is a column containing floats. The values need to be converted to integers (to get rid of the decimal place) and then to the datetime format.
# Convert Closed Roll Year values to integers
home_gb2["Closed Roll Year"].astype(int)
# Convert integers to year and put the data into a new column
home_gb2["Closed Roll Year 2"] = pd.to_datetime(home_gb2["Closed Roll Year"].astype(int),
format = '%Y', errors = 'coerce').dt.year
# Drop the original Closed Roll Year column and replace with the new one.
home_gb2 = home_gb2.drop(["Closed Roll Year"], axis = 1)
home_gb2 = home_gb2.rename(columns = {"Closed Roll Year 2": "Closed Roll Year"})
The first five rows of the resulting dataframe looks like this:
Assessor Neighborhood | Assessed Land Value | Closed Roll Year | |
---|---|---|---|
0 | Alamo Square | 356275.88 | 2007 |
1 | Anza Vista | 249759.91 | 2007 |
2 | Balboa Terrace | 308412.95 | 2007 |
3 | Bayview | 181096.93 | 2007 |
4 | Bayview Heights | 128157.54 | 2007 |
It is saved as a csv file in the Data folder, ready for analyses.
# Save file as a csv file
home_gb2.to_csv("combine_updated.csv")