ssaeed85/dsc-ph2-KingsCountyRealEstate

img

Machine Learning in the county of Kings

Business Problem


Our chosen stakeholder is the real-estate agency Keller Williams, which is looking to expand into King County, Washington. The agency wants an analytically supported strategy based on inferential and predictive analysis of the data available on the King County website. We approached the business question by first defining our recommended strategy and then formulating the question around it. Accordingly, we formulated three questions to answer with our data analysis and based our recommendations on the answers.

We defined our strategy based on the volume metrics of the data and determined that the best way forward is to target sellers and buyers of the houses in highest demand. We chose this strategy because we expect it to maximize the agency's future potential revenue, on the premise that a higher quantity of sales generates more revenue than a smaller number of higher-value sales.

Given our recommended strategy, the business question we formulated is: What types of houses are in most demand, and where are they located?

Data


The data we used originally came from the King County website and describes a year's worth of sales, from May 2014 to May 2015. It contains a good mix of categorical and numerical data. We focused on variables corresponding to features that determine the demand for a given house.

Boundary and mapping data were sourced from https://gis-kingcounty.opendata, a repository of publicly available datasets.

Methodology


We used the Python libraries pandas and numpy to work with the data, sklearn as our primary modeling tool, and folium for map visualization.

Our approach to data preparation was systematic. We removed outliers that would skew our model and dropped columns that didn't have enough data to incorporate into our model or didn't speak to our business problem. In the end, our model leveraged:

  • ZIP Code used by the United States Postal Service
  • Year when house was built
  • Square footage of living space in the home
  • The square footage of interior housing living space for the nearest 15 neighbors
  • Number of bedrooms and bathrooms
  • Quality of view from the property
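
The preparation step described above can be sketched roughly as follows. This is a minimal illustration in plain Python, not the code from our notebook: the column names (`price`, `sqft_living`, `sparse_col`), the 50% completeness threshold, and the 1.5 x IQR outlier rule are all assumptions chosen for the example.

```python
import statistics

# Hypothetical sale prices; the last one is an obvious outlier.
prices = [300_000, 350_000, 400_000, 450_000, 500_000,
          520_000, 550_000, 610_000, 650_000, 7_000_000]

# Hypothetical rows; "sparse_col" stands in for a column with little data.
rows = [{"price": p, "sqft_living": 1800, "sparse_col": None} for p in prices]
rows[0]["sparse_col"] = 1  # only one row out of ten has a value


def drop_sparse_columns(rows, min_fraction=0.5):
    """Drop columns where fewer than min_fraction of the rows have a value."""
    keep = [
        col for col in rows[0].keys()
        if sum(r[col] is not None for r in rows) / len(rows) >= min_fraction
    ]
    return [{col: r[col] for col in keep} for r in rows]


def drop_price_outliers(rows, k=1.5):
    """Keep rows whose price is within k * IQR of the quartiles (assumed rule)."""
    q1, _, q3 = statistics.quantiles([r["price"] for r in rows], n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [r for r in rows if low <= r["price"] <= high]


cleaned = drop_price_outliers(drop_sparse_columns(rows))
print(len(rows), "rows before cleaning,", len(cleaned), "after")
```

On a pandas DataFrame the same two steps would typically be a column filter followed by a boolean mask on the price column.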

img

To speak directly to our original business question,

What types of houses are in most demand and where are they located?

Location as a feature is key. Looking at the data in conjunction with zip code boundary information, we can highlight areas around King County that are in demand.

img

Visit the interactive map - A clone of the repo would be required

A few zip codes have a high volume of listings, but the areas around Green Lake really stand out considering the size of these districts. Most of these zip codes have over 500 listings in that year alone.
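
Counting listings per zip code is the calculation behind this map. A minimal sketch, assuming one zip code entry per sale; the sample values and the helper names `sales_per_zipcode` and `high_volume` are hypothetical:

```python
from collections import Counter

# Hypothetical data: one zip code per recorded sale.
sale_zipcodes = ["98103", "98115", "98103", "98117", "98103", "98115", "98004"]


def sales_per_zipcode(zipcodes):
    """Return (zipcode, count) pairs, sorted from most to fewest sales."""
    return Counter(zipcodes).most_common()


def high_volume(zipcodes, threshold):
    """Return the zip codes with more than `threshold` sales."""
    return [z for z, n in sales_per_zipcode(zipcodes) if n > threshold]


counts = sales_per_zipcode(sale_zipcodes)
busy = high_volume(sale_zipcodes, threshold=1)
print(counts[0], busy)
```

On the full dataset, a threshold of 500 would correspond to the "over 500 listings" cut mentioned above. With pandas, `value_counts()` on the zip code column gives the same tally.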

By looking at how properties are priced, we can deduce the market demand in a region. Focusing on Green Lake, we see that, aside from being in demand, the homes surrounding it are generally in the mid to high price range.

img

Visit the interactive map - A clone of the repo would be required

In fact, per our dataset, the median home price in King County is $450,000. In comparison, the median price of properties in the areas surrounding Green Lake is around $550,000.

In-Demand Features

We identified the most prevalent profile in our sales data as a 3-bedroom house with 2.5 bathrooms, no view, and around 1,800 sqft of living space, built in the last 20 years. To identify the most common features of sold homes, we took the highest counts for view, bedrooms, bathrooms, year built, condition, and zip code of the home. For the square footage of living space and the square footage of interior living space for the nearest 15 neighbors, we used the median values.
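
As an illustration of that procedure, the sketch below takes the most frequent value for count-like features and the median for square footage. The sample homes and field names are hypothetical, and only a subset of the features is shown.

```python
from statistics import median, mode

# Hypothetical sold homes; field names are assumed for the example.
homes = [
    {"bedrooms": 3, "bathrooms": 2.5, "view": 0, "sqft_living": 1800},
    {"bedrooms": 3, "bathrooms": 2.5, "view": 0, "sqft_living": 1650},
    {"bedrooms": 4, "bathrooms": 2.5, "view": 0, "sqft_living": 2400},
    {"bedrooms": 3, "bathrooms": 1.0, "view": 2, "sqft_living": 1900},
    {"bedrooms": 2, "bathrooms": 1.0, "view": 0, "sqft_living": 1200},
]

MOST_COMMON_FEATURES = ["bedrooms", "bathrooms", "view"]  # use highest count
MEDIAN_FEATURES = ["sqft_living"]                         # use median value


def typical_home(homes):
    """Build a profile from the mode of count features and the median of sizes."""
    profile = {f: mode([h[f] for h in homes]) for f in MOST_COMMON_FEATURES}
    profile.update({f: median([h[f] for h in homes]) for f in MEDIAN_FEATURES})
    return profile


profile = typical_home(homes)
print(profile)
```

The median is used for square footage because it is less sensitive to a few very large homes than the mean.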

The Model

When tested, our model achieves an R-squared value of 0.823, meaning it accounts for 82.3% of the variance in prices. In addition, our test root mean squared error was $85,474.40, which we use to set the range around our predicted price. Entering the key features into our model, we predict that a home with these features in the designated zip code will sell for between $610,569 and $696,043. We advise the real estate company to target transactions in this price range to maximize its potential revenue.
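
For readers who want to check the arithmetic, here is a small sketch of the two metrics and of the price range. The reported range is one RMSE wide, which is consistent with a point prediction of about $653,306 plus or minus half the RMSE. That point prediction is inferred from the reported bounds and is an assumption, as are the function names and the toy numbers.

```python
import math


def r_squared(actual, predicted):
    """Share of the variance in `actual` explained by `predicted`."""
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1 - ss_res / ss_tot


def rmse(actual, predicted):
    """Root mean squared error, in the same units as the prices."""
    squared = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared) / len(squared))


def range_around(prediction, rmse_value):
    """Range one RMSE wide, centered on the prediction (assumed convention)."""
    half = rmse_value / 2
    return prediction - half, prediction + half


# Toy example of the metrics.
toy_r2 = r_squared([100, 200, 300], [110, 190, 300])

# Reported RMSE with the inferred (assumed) point prediction.
low, high = range_around(653_306, 85_474.40)
print(round(toy_r2, 2), round(low), round(high))
```

In the project itself these metrics would normally come from sklearn, for example `r2_score` and `mean_squared_error` in `sklearn.metrics`.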

img

From the plot above we can see that our residuals are slightly heteroscedastic; that is, their spread isn't completely uniform. In addition, the distribution on the right has a tall peak, which suggests that the tails of our data are fatter than those of a normal distribution.
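
Both observations can also be checked numerically. The sketch below is one possible approach, not what our notebook does: it compares the residual spread between the lower and upper halves of the predictions, and computes excess kurtosis, where a positive value indicates a taller peak and fatter tails than a normal distribution. The function names and sample values are hypothetical.

```python
from statistics import mean, pstdev


def spread_ratio(predicted, residuals):
    """Residual std dev for the upper half of predictions divided by the lower half.

    A ratio far from 1 hints at heteroscedasticity.
    """
    pairs = sorted(zip(predicted, residuals))
    half = len(pairs) // 2
    lower = [r for _, r in pairs[:half]]
    upper = [r for _, r in pairs[half:]]
    return pstdev(upper) / pstdev(lower)


def excess_kurtosis(values):
    """Fourth standardized moment minus 3 (0 for a normal distribution)."""
    m, s = mean(values), pstdev(values)
    return mean([((v - m) / s) ** 4 for v in values]) - 3


# Hypothetical residuals whose spread grows with the predicted price.
predicted = [1, 2, 3, 4, 5, 6, 7, 8]
residuals = [1, -1, 1, -1, 3, -3, 3, -3]
ratio = spread_ratio(predicted, residuals)

# Hypothetical residuals with most values at the center and two extremes.
peaked = [0] * 8 + [-10, 10]
kurt = excess_kurtosis(peaked)
print(ratio, kurt)
```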

img

The tails of the QQ plot above suggest our data may still have extreme values. Further cleaning to remove outliers would likely improve our model.
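
For context, a QQ plot pairs each sorted, standardized residual with the quantile a normal distribution would produce at the same rank. A minimal sketch of how those points can be computed, assuming the common (i + 0.5) / n plotting positions; the helper name `qq_points` and the sample residuals are hypothetical:

```python
from statistics import NormalDist, mean, pstdev


def qq_points(residuals):
    """Return (theoretical quantile, standardized residual) pairs, sorted by rank."""
    m, s = mean(residuals), pstdev(residuals)
    ordered = sorted((r - m) / s for r in residuals)
    n = len(ordered)
    standard_normal = NormalDist()
    theoretical = [standard_normal.inv_cdf((i + 0.5) / n) for i in range(n)]
    return list(zip(theoretical, ordered))


# Hypothetical residuals, symmetric around zero.
points = qq_points([-3, -1, 0, 1, 3])
for theoretical, observed in points:
    print(f"{theoretical:6.2f} {observed:6.2f}")
```

Points that drift away from the diagonal at either end correspond to the extreme values mentioned above.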

img

The graph shows how close some of our model's predictions come to the actual sale prices in our target price range.

Conclusions


In conclusion, the answers to the questions above translate into actionable recommendations. We recommend targeting sellers in the zip codes that have sold the most homes. Similarly, we recommend targeting sellers whose homes have the in-demand features, such as the number of bedrooms. Finally, we recommend entering these values into our model to predict the price a house will ultimately sell for, and thereby future potential revenue.

Next steps


As next steps, gathering more first-party data would let us analyze the potential costs of our recommended strategy and provide a more holistic view than our current model offers. With more historical data, we could upgrade our model to include time series analysis, which should yield more accurate predictions and illustrate both sales and price trends throughout the year; this would be invaluable to Keller Williams Realty moving forward. Furthermore, we could use the upgraded model to determine whether the housing market is over- or under-inflated at any given time. This would be a great predictive tool for deciding whether to gear marketing efforts toward sellers or buyers.

Repository Structure


├── Workspace  
│     ├── Nimeshi
│     │   ├── Notes.md
│     │   └── binningworkflow.ipynb
│     ├── Saad
│     │   ├── FoliumChoropleth.ipynb
│     │   ├── FoliumMarkers.ipynb
│     │   ├── Notebook.ipynb
│     │   ├── Notebook_obsolete.pynb
│     │   └── Notes.md
│     └── Zach
│          ├── Final Jupyter Notebook Copy.ipynb
│          ├── Jupyter_Final copy.ipynb
│          ├── Models 6-8.ipynb
│          ├── Refined Analysis.ipynb
│          ├── Scrapwork.ipynb
│          ├── Scrapwork2.ipynb
│          └── Notes.md
│
├── data
│     ├── column_names.md
│     ├── kc_house_data.csv
│     └── Zipcodes_for_King_County_and_Surrounding_Area___zipcode_area.geojson
├── images
├── maps
│     ├── PropertiesPentileDisplay.html
│     └── choropeth_zip_salecounts.html
├── README.md
├── Slides_Price_Prediction_and_Analysis_in_KingCounty.pdf
└── Price_Prediction_and_Analysis_in_KingCounty.ipynb

Authors:


Our model and analysis can be found in our GitHub repo: Price Prediction and Analysis in KingCounty

Citations:


Images and logo of King County are properties of kingcounty.gov
Zipcode geoJSON: gis-kingcounty.opendata

Folium References:
python-visualization.github
towardsdatascience.com/visualizing-data-at-the-zip-code-level-with-folium
towardsdatascience.com/how-to-step-up-your-folium-choropleth-map-skills
towardsdatascience.com/folium-and-choropleth-map-from-zero-to-pro
write-geojson-into-a-geojson-file-with-python
