This is a fictional project for studying purposes. The business context and the insights are not real.
Production prediction is one of the company's core problems. The provided dataset describes a set of nearby wells located in the United States and their 12-month cumulative production. The company needs a production prediction model as one of the tools that support its decisions, so the company's data scientist has to build a model from scratch and show the manager that it performs well on unseen data.
- Machine Learning Regression Model: A machine learning regression model was built from the dataset provided by the company, to be used for future predictions. The notebook used to create the model is available here.
- Streamlit App for Production Prediction: The model is deployed on Streamlit Cloud and can be used through the Streamlit app. The app is available here.
Attribute | Description |
---|---|
treatment company | The treatment company who provides treatment service. |
azimuth | Well drilling direction. |
md (ft) | Measured depth. |
tvd (ft) | True vertical depth. |
date on production | First production date. |
operator | The well operator who performs drilling service. |
footage lateral length | Horizontal well section. |
well spacing | Distance to the closest nearby well. |
porpoise deviation | The maximum distance (in ft) that a well deviated from its horizontal. |
porpoise count | How many times the deviations (porpoises) occurred. |
shale footage | How much shale (in ft) encountered in a horizontal well. |
acoustic impedance | The impedance of a reservoir rock (ft/s * g/cc). |
log permeability | The property of rocks that is an indication of the ability for fluids (gas or liquid) to flow through rocks. |
porosity | The percentage of void space in a rock. |
poisson ratio | Measures the ratio of lateral strain to axial strain at linearly elastic region. |
water saturation | The ratio of water volume to pore volume. |
toc | Total Organic Carbon, indicates the organic richness (hydrocarbon generative potential) of a reservoir rock. |
vcl | The amount of clay minerals in a reservoir rock. |
p-velocity | The velocity of P-waves (compressional waves) through a reservoir rock (ft/s). |
s-velocity | The velocity of S-waves (shear waves) through a reservoir rock (ft/s). |
youngs modulus | The ratio of the applied stress to the fractional extension (or shortening) of the reservoir rock parallel to the tension (or compression) (giga pascals). |
isip | Instantaneous shut-in pressure (ISIP): the pressure that remains once the pumps are quickly stopped, the fluids stop moving, and the friction pressures disappear. |
breakdown pressure | The pressure at which a hydraulic fracture is created/initiated/induced. |
pump rate | The volume of liquid that travels through the pump in a given time. |
total number of stages | Total stages used to fracture the horizontal section of the well. |
proppant volume | The amount of proppant in pounds used in the completion of a well (lbs). |
proppant fluid ratio | The ratio of proppant volume/fluid volume (lbs/gallon). |
production | The 12 months cumulative gas production (mmcf). |
- Understand the Business problem.
- Clean the dataset by removing outliers, NA values and unnecessary features.
- Explore the data to create hypotheses, think about a few insights and validate them.
- Prepare the data for the modeling algorithms by encoding variables, splitting it into train and test datasets, and performing other necessary operations.
- Create the models using machine learning algorithms.
- Evaluate the created models to find the one that best fits the problem.
- Tune the model to achieve a better performance.
- Deploy the model in production so that it is available to other people.
- Find possible improvements to be explored in the future.
I1: Wells with a greater number of stages produce more.
True: This relationship doesn't apply for all values of total number of stages, but it tends to be true.
I2: Wells that started producing longer ago produce less.
True: Newer wells produce more.
I3: Wells that are farther from the others produce more.
False: The production doesn't increase according to the distance from other wells.
I4: Wells in which more proppant was used produce more.
True: More proppant is associated with greater production.
I5: Wells in which the rocks have higher values of porosity produce more.
False: More porosity does not mean more production.
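One simple way to get a first check on hypotheses like these is to measure the linear correlation between a feature and production. The sketch below is only an illustration, not the analysis from the notebook: the `pearson_r` helper and the toy values are assumptions made for the example, and a correlation alone would miss relationships that hold only over part of a feature's range (as noted for I1).

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


# Synthetic toy values for illustration only, NOT taken from the project dataset.
proppant_volume = [1.0, 2.0, 3.0, 4.0, 5.0]
production = [2.1, 3.9, 6.2, 8.1, 9.8]

r = pearson_r(proppant_volume, production)
print(f"correlation between proppant volume and production: {r:.3f}")
```

A value close to 1 or -1 suggests a strong linear relationship, while a value near 0 suggests the hypothesis needs a closer look, for example with binned averages or a scatter plot.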
The final result of this project is a regression model. Seven candidate models were created: Linear Regression, Lasso, SVM, Random Forest, XGBoost, LightGBM and CatBoost.
Boruta (a feature selection algorithm) was used to select features, and 11 features were selected for the final model. The models were evaluated with three metrics: Mean Absolute Error (MAE), Mean Absolute Percentage Error (MAPE) and Root Mean Squared Error (RMSE). The initial performance of each model is shown in the table below.
Model Name | MAE | MAPE | RMSE |
---|---|---|---|
CatBoost | 502.93 | 0.2817 | 781.34 |
LightGBM | 522.03 | 0.2936 | 806.55 |
XGBoost | 535.10 | 0.3094 | 813.48 |
Random Forest | 564.38 | 0.3281 | 852.23 |
SVM | 648.01 | 0.4468 | 931.77 |
Linear Regression | 679.33 | | 1012.51 |
Lasso | 1018.08 | 0.4259 | 1396.98 |
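For reference, the three metrics can be written out in a few lines of plain Python. This is a minimal sketch of the standard definitions, not the code from the notebook; the function names and sample values are assumptions made for the example. MAPE is returned as a fraction, which matches the scale used in the tables (0.28 means 28%).

```python
import math


def mae(y_true, y_pred):
    """Mean Absolute Error: average size of the errors, in target units."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def mape(y_true, y_pred):
    """Mean Absolute Percentage Error as a fraction.

    Undefined when any true value is 0 (division by zero).
    """
    return sum(abs((t - p) / t) for t, p in zip(y_true, y_pred)) / len(y_true)


def rmse(y_true, y_pred):
    """Root Mean Squared Error: penalizes large errors more than MAE does."""
    return math.sqrt(
        sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)
    )


# Illustrative values only, not taken from the project dataset.
y_true = [100.0, 200.0, 400.0]
y_pred = [110.0, 180.0, 400.0]

print(mae(y_true, y_pred), mape(y_true, y_pred), rmse(y_true, y_pred))
```

Because RMSE squares the errors before averaging, it is never smaller than MAE on the same predictions, which is consistent with the RMSE column being larger than the MAE column for every model.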
To decide on the final model, cross-validation was carried out to evaluate the algorithms in a more robust way. The results, reported as mean +/- standard deviation, are shown in the table below.
Model Name | MAE | MAPE | RMSE |
---|---|---|---|
Linear Regression | 687.8 +/- 49.40 | 0.49 +/- 0.04 | 974.12 +/- 90.88 |
Lasso | 1023.65 +/- 61.45 | 0.89 +/- 0.06 | 1348.19 +/- 96.97 |
SVM | 651.62 +/- 28.27 | 0.51 +/- 0.06 | 897.34 +/- 60.87 |
Random Forest | 521.82 +/- 26.99 | 0.36 +/- 0.02 | 768.7 +/- 74.63 |
XGBoost | 526.78 +/- 14.36 | 0.35 +/- 0.02 | 773.11 +/- 52.73 |
LightGBM | 525.71 +/- 31.97 | 0.34 +/- 0.02 | 767.4 +/- 58.25 |
CatBoost | 490.18 +/- 16.5 | 0.32 +/- 0.02 | 724.79 +/- 54.17 |
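The "mean +/- standard deviation" format above comes from scoring the model once per fold and then aggregating the fold scores. The sketch below shows that idea with k-fold cross-validation in plain Python. It is an assumption-laden illustration rather than the project's code: the number of folds, the helper names (`k_fold_indices`, `cross_validate`, `mean_baseline`), the use of the population standard deviation, and the trivial baseline model that stands in for a real regressor are all hypothetical.

```python
import random
import statistics


def k_fold_indices(n_samples, k, seed=0):
    """Shuffle the sample indices and deal them into k disjoint folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]


def cross_validate(xs, ys, fit_predict, metric, k=5, seed=0):
    """Score fit_predict on each held-out fold; return (mean, std) of scores."""
    folds = k_fold_indices(len(xs), k, seed)
    scores = []
    for i, test_idx in enumerate(folds):
        # Train on every fold except the one currently held out.
        train_idx = [j for f, fold in enumerate(folds) if f != i for j in fold]
        preds = fit_predict(
            [xs[j] for j in train_idx],
            [ys[j] for j in train_idx],
            [xs[j] for j in test_idx],
        )
        scores.append(metric([ys[j] for j in test_idx], preds))
    return statistics.mean(scores), statistics.pstdev(scores)


def mean_baseline(x_train, y_train, x_test):
    """Placeholder model: always predicts the mean of the training targets."""
    mean_target = statistics.mean(y_train)
    return [mean_target for _ in x_test]


def mae(y_true, y_pred):
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


# Synthetic data for illustration only.
xs = list(range(20))
ys = [2.0 * x + 1.0 for x in xs]

mean_score, std_score = cross_validate(xs, ys, mean_baseline, mae, k=5)
print(f"MAE: {mean_score:.2f} +/- {std_score:.2f}")
```

In practice `mean_baseline` would be replaced by a function that fits one of the candidate models on the training folds and predicts on the held-out fold.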
As the table shows, CatBoost had the lowest error on all three metrics and was chosen for deployment. After the final model was chosen, random search hyperparameter optimization was used to improve its performance. The evaluation metrics of the tuned model are shown in the table below.
Model Name | MAE | MAPE | RMSE |
---|---|---|---|
CatBoost Tuned | 485.66 +/- 23.01 | 0.32 +/- 0.02 | 714.4 +/- 64.6 |
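Random search samples hyperparameter combinations at random from a predefined space and keeps the combination with the best score. The sketch below illustrates the idea in plain Python. The search space values and the `toy_objective` function are assumptions made for the example, not the settings used in the notebook; in a real run the objective would train CatBoost with the sampled parameters and return a cross-validated error.

```python
import random


def random_search(objective, space, n_iter=20, seed=0):
    """Sample n_iter random combinations and keep the lowest-scoring one."""
    rng = random.Random(seed)
    best_params, best_score = None, float("inf")
    for _ in range(n_iter):
        # Draw one value per hyperparameter, independently and uniformly.
        params = {name: rng.choice(values) for name, values in space.items()}
        score = objective(params)
        if score < best_score:
            best_params, best_score = params, score
    return best_params, best_score


# Hypothetical search space; the candidate values are illustrative assumptions.
space = {
    "depth": [4, 6, 8, 10],
    "learning_rate": [0.01, 0.03, 0.1],
    "iterations": [500, 1000, 1500],
}


def toy_objective(params):
    """Stand-in for a cross-validated error; lower is better."""
    return abs(params["depth"] - 6) + 10 * abs(params["learning_rate"] - 0.03)


best_params, best_score = random_search(toy_objective, space, n_iter=200)
print(best_params, best_score)
```

Unlike a grid search, the number of evaluations is fixed by `n_iter` rather than by the size of the space, which keeps the cost predictable when each evaluation involves training a model.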
Although the dataset has many features, it is small and has a significant number of missing values. The model showed a larger error than expected, a problem that more data could help reduce. With the app, other people can easily make predictions by setting the feature values and pressing the prediction button.
- Find a better way to replace missing values.
- Find the best way of dealing with the outliers.
- Search for models that could perform better with this small dataset.
- Try a dimensionality reduction algorithm to improve the model's predictive capability.
- Improve the Streamlit app by adding more functions.