This repository explores data from LianJia, a second-hand housing trading platform whose listings cover house information such as area size, facing direction, elevator ratio and so on. Overall, the work in this repo can be summarized in the following aspects:
- Collect the house information using a Python crawler.
- Preprocess the data (clean, discretize, match, normalize, etc).
- Conduct feature engineering to analyse the data.
- Apply various classic machine learning algorithms to predict house price.
- Construct a BiLSTM model to parse the raw descriptive text of each house and combine it with an MLP.
The documentation of our work is available here: [report].
- scrapy (web crawling)
- numpy and pandas (data preprocessing)
- scikit-learn (ML-algorithms)
- matplotlib and seaborn (data visualization)
- tensorflow (BiLSTM text parsing model)
We use scrapy to crawl raw data from LianJia. See the directory /code/Crawler_code/ for the code. XPath is used to parse the HTML and extract the data.
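As a rough illustration of XPath-based extraction, the sketch below pulls a few fields out of a small well-formed snippet. It is not the crawler in /code/Crawler_code/: it uses the standard library's `xml.etree.ElementTree` (which supports only a subset of XPath) instead of scrapy selectors, and the page layout, class names and the `extract_listings` helper are all hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified listing page; the real LianJia markup differs.
SAMPLE_PAGE = """
<html><body>
  <ul>
    <li class="house">
      <span class="title">Sunny two-bedroom flat</span>
      <span class="area">89.5</span>
      <span class="price">520</span>
    </li>
    <li class="house">
      <span class="title">Compact studio</span>
      <span class="area">42.0</span>
      <span class="price">260</span>
    </li>
  </ul>
</body></html>
"""


def extract_listings(page):
    """Return one dict per <li class="house"> element found in the page."""
    root = ET.fromstring(page)
    listings = []
    # ".//li[@class='house']" selects every matching <li> at any depth.
    for item in root.findall(".//li[@class='house']"):
        listings.append({
            "title": item.findtext("span[@class='title']"),
            "area": float(item.findtext("span[@class='area']")),
            "price": float(item.findtext("span[@class='price']")),
        })
    return listings


listings = extract_listings(SAMPLE_PAGE)
```

In a scrapy spider the same idea would typically be expressed with `response.xpath(...)` inside the `parse` callback.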
Below is a visualization of all house prices:
We use the Python libraries pandas (via its DataFrame class) and re to preprocess the raw data. See /preprocess/preprocess.py for the code.
You can find the preprocessed data in /data/csv_after_process/, where intergral_data.csv is the cleaned dataset used for visualization,
while final_traning_data.csv is used for training models.
See /doc/report.pdf for more preprocessing details.
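To make the cleaning and normalization steps more concrete, here is a minimal sketch using only `re` and plain lists. The raw format (`"89.5平米"`) and the helper names `parse_area` and `min_max_normalize` are assumptions for illustration; the actual logic lives in /preprocess/preprocess.py and may differ.

```python
import re


def parse_area(raw):
    """Extract the first number from a string such as '89.5平米'; None if absent."""
    match = re.search(r"\d+(?:\.\d+)?", raw)
    return float(match.group()) if match else None


def min_max_normalize(values):
    """Rescale values linearly to the range [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # A constant column carries no information; map it to zeros.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


# Assumed raw strings, as they might appear in crawled data.
raw_areas = ["89.5平米", "42平米", "120.0平米"]
areas = [parse_area(s) for s in raw_areas]
normalized = min_max_normalize(areas)
```

With pandas, a function like `parse_area` could be applied to a column through `Series.map`.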
See the directory /code/Data_Visualizer.
We analyzed feature correlation and feature distribution separately, and found 22 features that can play a role in the prediction models.
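One simple way to screen features by correlation is to rank them by the absolute Pearson coefficient against the price. The sketch below assumes Pearson correlation (the README does not say which measure was used), and the feature names and toy values are made up.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def rank_features(features, target):
    """Sort features by the absolute value of their correlation with target."""
    scores = {name: pearson(col, target) for name, col in features.items()}
    return sorted(scores.items(), key=lambda kv: abs(kv[1]), reverse=True)


# Toy data: "area" tracks the price linearly, "floor" does not.
features = {"area": [40, 60, 80, 100], "floor": [3, 1, 4, 2]}
price = [200, 300, 400, 500]
ranking = rank_features(features, price)
```

With a pandas DataFrame, `DataFrame.corr()` computes the same coefficients for all column pairs at once.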
Model | R2 | Mean Error | Time (s) |
---|---|---|---|
Random Forest | 0.916 | 53.442 | 1.695 |
Bagging Predictors | 0.913 | 54.143 | 1.693 |
Gradient Boosting | 0.907 | 69.277 | 3.626 |
MLP | 0.885 | 70.762 | 16.559 |
Bayesian Interpolation | 0.795 | 108.579 | 0.157 |
Elastic Net | 0.657 | 151.809 | 0.076 |
AdaBoost | 0.611 | 213.527 | 2.027 |
SVR | 0.248 | 172.069 | 128.438 |
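For reference, the sketch below shows how an R2 score and a mean error could be computed from predictions. It assumes that "Mean Error" means mean absolute error, which the table does not confirm, and the numbers are toy values rather than results from this repo.

```python
def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot


def mean_absolute_error(y_true, y_pred):
    """Average absolute difference between targets and predictions."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


# Toy prices and predictions, not taken from the dataset.
y_true = [200.0, 300.0, 400.0, 500.0]
y_pred = [210.0, 290.0, 410.0, 490.0]
r2 = r2_score(y_true, y_pred)
mae = mean_absolute_error(y_true, y_pred)
```

scikit-learn provides equivalent functions as `sklearn.metrics.r2_score` and `sklearn.metrics.mean_absolute_error`.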
The R2 and MSE plot of the above models:
The overall model structure:
Training loss and R2 plots for different embedding sizes:
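A BiLSTM text encoder combined with an MLP over numeric features could be wired up roughly as follows with the Keras API in TensorFlow. This is only a sketch under assumptions: the layer sizes, vocabulary size, sequence length, the use of 22 numeric inputs and the function name `build_text_price_model` are illustrative choices, not the architecture documented in the report.

```python
import numpy as np
import tensorflow as tf


def build_text_price_model(vocab_size=5000, max_len=40, embed_dim=64,
                           lstm_units=32, n_numeric=22):
    """Two-input regression model: token ids plus numeric features -> price."""
    layers = tf.keras.layers
    text_in = layers.Input(shape=(max_len,), dtype="int32", name="text_tokens")
    numeric_in = layers.Input(shape=(n_numeric,), name="numeric_features")

    # Text branch: embed the token ids, then encode them with a BiLSTM.
    x = layers.Embedding(vocab_size, embed_dim)(text_in)
    x = layers.Bidirectional(layers.LSTM(lstm_units))(x)

    # MLP head over the text encoding concatenated with the numeric features.
    merged = layers.Concatenate()([x, numeric_in])
    h = layers.Dense(64, activation="relu")(merged)
    h = layers.Dense(32, activation="relu")(h)
    out = layers.Dense(1, name="price")(h)

    model = tf.keras.Model(inputs=[text_in, numeric_in], outputs=out)
    model.compile(optimizer="adam", loss="mse")
    return model
```

Varying `embed_dim` in such a model is one way to run an embedding-size comparison.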
Here is an extended description.