This project is based on 'Data science job' and builds a salary prediction model.
- Salary data has been taken from glassdoor.com
- Built 3 models: Linear Regression, Lasso Regression, and Random Forest.
- Optimized them using GridSearchCV to reach the best model.
- Job Title
- Salary Estimate
- Job Description
- Rating
- Company Name
- Location
- Company Headquarters
- Company Size
- Company Founded Date
- Company Age
- Type of Ownership
- Industry
- Sector
- Revenue
- Competitors
- Split the data into 80% training data and 20% test data.
- Built 3 models to select the best one from.
- Evaluated them using Mean Absolute Error (MAE). Chose MAE because it is relatively easy to interpret and outliers weren’t particularly bad for this type of model.
- Multiple Linear Regression – Sets the baseline for the model.
- Lasso Regression – Because the many categorical variables produce sparse data, a regularized regression like Lasso should be effective.
- Random Forest – Given the sparsity of the data, this should be a good fit.
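The workflow described above (80/20 split, three candidate models, tuning with GridSearchCV, comparison by MAE) could look roughly like the sketch below. It is only an illustration under stated assumptions: the data is synthetic rather than the Glassdoor dataset, and the hyperparameter grids, `cv=3`, and `random_state` values are made up for the example, not taken from this project.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import Lasso, LinearRegression
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in data (assumption: NOT the real Glassdoor features).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = 100 + 20 * X[:, 0] - 10 * X[:, 1] + rng.normal(scale=5, size=200)

# 80% of the rows for training, 20% held out for testing.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Each candidate model paired with an illustrative (assumed) parameter grid.
candidates = {
    "linear": (LinearRegression(), {"fit_intercept": [True, False]}),
    "lasso": (Lasso(max_iter=10_000), {"alpha": [0.01, 0.1, 1.0]}),
    "random_forest": (
        RandomForestRegressor(random_state=42),
        {"n_estimators": [50, 100], "max_depth": [None, 5]},
    ),
}

test_mae = {}
for name, (model, grid) in candidates.items():
    # Cross-validated grid search on the training set, scored by (negated) MAE.
    search = GridSearchCV(model, grid, scoring="neg_mean_absolute_error", cv=3)
    search.fit(X_train, y_train)
    # The refit best estimator is then scored on the held-out test set.
    test_mae[name] = mean_absolute_error(y_test, search.predict(X_test))

for name, mae in test_mae.items():
    print(f"{name}: MAE = {mae:.2f}")
```

The numbers printed here depend entirely on the synthetic data and say nothing about the project's actual results.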
The Random Forest model outperformed the other approaches on the test and validation sets.
- Random Forest: MAE = 11.2
- Linear Regression: MAE = 18.8
- Lasso Regression: MAE = 19.6
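To make the MAE figures easier to read: MAE is the average absolute difference between predicted and actual values, in the same units as the target. A minimal hand-rolled sketch follows; the sample numbers are invented for illustration and are not taken from this project's data.

```python
def mean_absolute_error(y_true, y_pred):
    """Average of |actual - predicted| over all samples."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


# Hypothetical actual vs. predicted values (made up for this example).
actual = [100.0, 120.0, 80.0, 150.0]
predicted = [110.0, 115.0, 90.0, 130.0]

# Absolute errors are 10, 5, 10, 20, so the mean is 45 / 4.
mae = mean_absolute_error(actual, predicted)
print(mae)
```

A lower MAE means the model's predictions are, on average, closer to the true values.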