This project aims to predict the number of calories burned during exercise using various machine learning models. It uses data from two CSV files: 'calories.csv' and 'exercise.csv'.
- Dependencies
- Data Preparation
- Exploratory Data Analysis
- Feature Engineering
- Model Training and Evaluation
This project requires the following Python libraries:
- numpy
- pandas
- matplotlib
- seaborn
- scikit-learn
- xgboost
You can install these dependencies using pip:

    pip install numpy pandas matplotlib seaborn scikit-learn xgboost
The project starts by loading two datasets:
- 'calories.csv': Contains information about calories burned
- 'exercise.csv': Contains exercise-related features
The two datasets are merged into a single table, and any column that appears in both is kept only once.
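The merge step can be sketched as follows. This is a minimal illustration, not the script's actual code: the small in-memory tables stand in for the two CSV files, and the column names (including the shared 'User_ID' key) and the helper name merge_datasets are assumptions. The script itself may combine the tables differently, for example by concatenating them and dropping the repeated column.

```python
import pandas as pd

# Hypothetical stand-ins for 'exercise.csv' and 'calories.csv'.
# The column names, including the shared 'User_ID' key, are assumed.
exercise = pd.DataFrame({
    "User_ID": [1, 2, 3],
    "Gender": ["male", "female", "male"],
    "Age": [25, 32, 41],
    "Duration": [10.0, 25.0, 18.0],
})
calories = pd.DataFrame({
    "User_ID": [1, 2, 3],
    "Calories": [55.0, 160.0, 98.0],
})


def merge_datasets(exercise: pd.DataFrame, calories: pd.DataFrame,
                   key: str = "User_ID") -> pd.DataFrame:
    """Join the two tables on a shared key so the key column appears only once."""
    return exercise.merge(calories, on=key, how="inner")


df = merge_datasets(exercise, calories)
print(df.columns.tolist())
```

With the real files, the two frames would come from pd.read_csv('exercise.csv') and pd.read_csv('calories.csv') instead of the inline tables.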
Several visualizations are created to understand the data:
- A scatter plot of Height vs Weight
- Scatter plots of Age, Height, Weight, and Duration vs Calories
- Distribution plots for numerical features
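As a rough sketch of the feature-versus-calories scatter plots, the example below draws one panel per feature with matplotlib. The sample rows are made up for illustration, the function name plot_features_vs_calories is hypothetical, and the script's own plots (which may use seaborn) can look different.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt
import pandas as pd

# Made-up sample rows; the real data comes from the merged CSV files.
df = pd.DataFrame({
    "Age": [25, 32, 41, 29, 53],
    "Height": [180, 165, 170, 175, 160],
    "Weight": [80, 60, 68, 74, 58],
    "Duration": [10, 25, 18, 30, 12],
    "Calories": [55, 160, 98, 190, 70],
})


def plot_features_vs_calories(df, features=("Age", "Height", "Weight", "Duration"),
                              target="Calories"):
    """Draw one scatter panel per feature against the target column."""
    fig, axes = plt.subplots(2, 2, figsize=(10, 8))
    for ax, feature in zip(axes.flat, features):
        ax.scatter(df[feature], df[target], s=10)
        ax.set_xlabel(feature)
        ax.set_ylabel(target)
    fig.tight_layout()
    return fig


fig = plot_features_vs_calories(df)
```

Calling fig.savefig with a file name writes the figure to disk; in an interactive session, plt.show() displays it instead.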
The features are then prepared for modeling:
- The 'Gender' column is encoded as an integer (male: 0, female: 1)
- A correlation heatmap is created to identify highly correlated features
- 'Weight' and 'Duration' columns are removed to avoid multicollinearity
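These steps could look roughly like the sketch below. The sample rows are invented, and the exact column set is an assumption; only the gender mapping and the two dropped columns come from the description above.

```python
import pandas as pd

# Invented sample rows; the real data comes from the merged CSV files.
df = pd.DataFrame({
    "Gender": ["male", "female", "female", "male"],
    "Age": [25, 32, 41, 29],
    "Height": [180, 165, 170, 175],
    "Weight": [80, 60, 68, 74],
    "Duration": [10, 25, 18, 30],
    "Calories": [55, 160, 98, 190],
})

# Encode 'Gender' as an integer: male -> 0, female -> 1.
encoded = df.copy()
encoded["Gender"] = encoded["Gender"].map({"male": 0, "female": 1})

# Pairwise Pearson correlations; this matrix is what a heatmap would display,
# for example with seaborn.heatmap(corr, annot=True).
corr = encoded.corr()

# Drop the two columns named in the list above.
features = encoded.drop(columns=["Weight", "Duration"])
print(features.columns.tolist())
```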
The data is split into training and validation sets (90% training, 10% validation).
Features are standardized using StandardScaler.
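A minimal sketch of the split and scaling, using random data in place of the real features. The random_state value and the array shapes are assumptions. Fitting the scaler on the training set only and reusing it for the validation set is a common practice that keeps validation statistics out of training; the script may or may not do it this way.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Random placeholder data standing in for the prepared features and target.
rng = np.random.default_rng(0)
X = rng.normal(loc=50.0, scale=10.0, size=(100, 3))
y = rng.normal(loc=90.0, scale=20.0, size=100)

# 90% training, 10% validation; random_state is an arbitrary choice.
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.1, random_state=22
)

# Learn the mean and standard deviation from the training set,
# then apply the same transformation to the validation set.
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_val_scaled = scaler.transform(X_val)

print(X_train_scaled.shape, X_val_scaled.shape)
```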
Five different models are trained and evaluated:
- Linear Regression
- XGBoost Regressor
- Lasso Regression
- Random Forest Regressor
- Ridge Regression
For each model, the Mean Absolute Error (MAE) is calculated for both training and validation sets. A validation MAE much higher than the training MAE suggests that the model is overfitting.
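The training and evaluation loop might be sketched as below. The synthetic data, the default hyperparameters, and the random_state values are assumptions, not the script's settings. XGBoost is imported defensively so the sketch still runs where it is not installed.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import Lasso, LinearRegression, Ridge
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

try:
    from xgboost import XGBRegressor
except ImportError:  # fall back to the scikit-learn models only
    XGBRegressor = None

# Synthetic, nearly linear data standing in for the scaled features.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = X @ np.array([3.0, -2.0, 1.0]) + rng.normal(scale=0.1, size=200)
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.1, random_state=22
)

# All models use default hyperparameters here (an assumption).
models = {
    "Linear Regression": LinearRegression(),
    "Lasso Regression": Lasso(),
    "Random Forest Regressor": RandomForestRegressor(random_state=0),
    "Ridge Regression": Ridge(),
}
if XGBRegressor is not None:
    models["XGBoost Regressor"] = XGBRegressor()

# Fit each model and record (training MAE, validation MAE).
results = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    train_mae = mean_absolute_error(y_train, model.predict(X_train))
    val_mae = mean_absolute_error(y_val, model.predict(X_val))
    results[name] = (train_mae, val_mae)
    print(f"{name}: train MAE = {train_mae:.3f}, validation MAE = {val_mae:.3f}")
```

Keeping the models in a dictionary makes it easy to add or remove one without touching the evaluation loop.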
To run this project:
- Ensure all dependencies are installed
- Place 'calories.csv' and 'exercise.csv' in the same directory as the script
- Run the Python script
The script will output visualizations and model performance metrics.