The project covers the following:
- Exploratory data analysis
- Regression model
- ANOVA
- Model diagnosis
  - Check constant variance
  - Check influential points (delete influential points)
  - Check normality
  - Check variance inflation factor (VIF)
- Model selection (Step-BIC)
Dataset: https://www.kaggle.com/datasets/hellbuoy/car-price-prediction
model = smf.ols("price~symboling+fueltype+aspiration+doornumber+carbody+drivewheel+enginelocation+wheelbase+carlength+carwidth+carheight+curbweight+enginetype+cylindernumber+enginesize+fuelsystem+boreratio+stroke+compressionratio+horsepower+peakrpm+citympg+highwaympg", data=train_data).fit()
The Type II and Type III ANOVA tables give the same result: both show that [carbody, carwidth, curbweight, cylindernumber, enginelocation, enginesize, enginetype, fueltype, peakrpm, stroke] are significant predictors of the fitted model.
The Type I ANOVA table shows additional significant predictors: [aspiration, carlength, doornumber, drivewheel, fuelsystem, horsepower, symboling, wheelbase].
Model:
In summary, after the VIF process, the predictors that do not violate the multicollinearity check are:
['aspiration',
'doornumber',
'drivewheel',
'enginelocation',
'wheelbase',
'carlength',
'carwidth',
'carheight',
'enginesize',
'boreratio',
'stroke',
'compressionratio',
'horsepower',
'peakrpm',
'highwaympg']
After removing the influential points, the normality issue is fixed and the model remains free of heteroscedasticity.
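The diagnosis steps can be sketched as follows, assuming Cook's distance with a 4/n cutoff to flag influential points, a Shapiro-Wilk test for normality of residuals, and a Breusch-Pagan test for constant variance. These specific tests, the cutoff, and the synthetic `demo` data are assumptions. The source does not state which checks were used.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
from scipy import stats
from statsmodels.stats.diagnostic import het_breuschpagan

# Synthetic data with one planted influential observation (hypothetical values).
rng = np.random.default_rng(2)
n = 150
demo = pd.DataFrame({"enginesize": rng.normal(130, 30, n)})
demo["price"] = 5000 + 100 * demo["enginesize"] + rng.normal(0, 1500, n)
demo.loc[0, ["enginesize", "price"]] = [300.0, 5000.0]

def fit(data):
    return smf.ols("price ~ enginesize", data=data).fit()

def diagnostics(model):
    """Return p-values for normality and constant-variance checks."""
    shapiro_p = stats.shapiro(model.resid)[1]
    bp_p = het_breuschpagan(model.resid, model.model.exog)[1]
    return {"shapiro_p": shapiro_p, "breusch_pagan_p": bp_p}

model_before = fit(demo)

# Flag points whose Cook's distance exceeds 4/n (rule of thumb, assumption).
cooks_d = model_before.get_influence().cooks_distance[0]
influential = cooks_d > 4 / len(demo)

# Refit on the data with the flagged rows removed.
cleaned = demo.loc[~influential]
model_after = fit(cleaned)

print(diagnostics(model_before))
print(diagnostics(model_after))
```

P-values above the chosen significance level after refitting would be consistent with normal residuals and constant variance.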
We then use stepwise selection with BIC (Step-BIC) on the updated dataset to find the best model.
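statsmodels has no built-in stepwise routine, so a selection loop has to be written by hand. The sketch below is a forward-only variant that adds the term giving the lowest BIC until no addition improves it. The function `forward_step_bic`, the forward-only direction, and the synthetic data are assumptions. The project's Step-BIC may also consider removing terms.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

def forward_step_bic(data, response, candidates):
    """Greedy forward selection minimizing BIC (hypothetical helper)."""
    selected = []
    remaining = list(candidates)
    best_bic = smf.ols(f"{response} ~ 1", data=data).fit().bic
    while remaining:
        # Score every model that adds one more candidate term.
        scores = []
        for term in remaining:
            formula = f"{response} ~ " + " + ".join(selected + [term])
            scores.append((smf.ols(formula, data=data).fit().bic, term))
        bic, term = min(scores)
        if bic >= best_bic:
            break  # no candidate lowers the BIC any further
        best_bic = bic
        selected.append(term)
        remaining.remove(term)
    return selected, best_bic

# Synthetic data: price depends on enginesize and carwidth only.
rng = np.random.default_rng(3)
n = 200
demo = pd.DataFrame({
    "enginesize": rng.normal(130, 40, n),
    "carwidth": rng.normal(66, 2, n),
    "noise": rng.normal(0, 1, n),
})
demo["price"] = (
    100 * demo["enginesize"] + 500 * demo["carwidth"] + rng.normal(0, 2000, n)
)

selected, best_bic = forward_step_bic(
    demo, "price", ["enginesize", "carwidth", "noise"]
)
print("price ~ " + " + ".join(selected))
```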
Final Selected Model:
price ~ enginesize + cylindernumber + enginetype + horsepower + carwidth + stroke + compressionratio + enginelocation + curbweight + peakrpm + carlength + doornumber