This repository contains the source code of my Final Master's degree project in Decision Systems Engineering, titled *Wind Power Forecasting using Machine Learning techniques*, completed at Rey Juan Carlos University. It is based on the Data Science challenge posed by the Compagnie nationale du Rhône (CNR).
For further information, you can read the master's thesis here.
This application is intended to be a flexible and configurable tool for easily building and analyzing models for this forecasting problem. It is built on Kedro in order to apply software engineering best practices to data and machine-learning pipelines. MLflow tracking is used to record and query experiments (code, data, config, and results).
The packages needed to re-create the conda environment are listed in `./requirements.txt`.
The main pipelines implemented are:
- Prepare data for EDA (`eda`). Transforms raw data into a proper format for Exploratory Data Analysis.
- Data engineering (`de`). Gets the data ready to be consumed by the machine learning algorithms.
- Feature engineering (`fe`). Allows exploring and adding new features to the data sets.
- Modeling (`mdl`). Trains the selected algorithm from among the following: MARS, KNN, RF, SVM. It also optimizes the model hyperparameters and makes predictions on the test set.
There are two additional pipelines:
- CNR pipeline. It contains several subpipelines to produce the predictions and the submission file for the CNR Data Science Challenge.
- Neural Networks. In progress ...
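The short names in parentheses are the ones passed to `kedro run --pipeline`. As a rough, purely illustrative sketch of how such names map to Kedro `Pipeline` objects (the real registration code and nodes of this project are different and live in the source package):

```python
from kedro.pipeline import Pipeline, node

# Hypothetical single-node pipelines, only to illustrate how the names used
# with `kedro run --pipeline <name>` are mapped to Pipeline objects.
# The data set names below are also illustrative.
def _passthrough(data):
    return data

def _single_node_pipeline(input_name: str, output_name: str) -> Pipeline:
    return Pipeline([node(_passthrough, inputs=input_name, outputs=output_name)])

def register_pipelines() -> dict:
    return {
        "eda": _single_node_pipeline("raw_data", "eda_data"),
        "de": _single_node_pipeline("raw_data", "master_table"),
        "fe": _single_node_pipeline("master_table", "model_input"),
        "mdl": _single_node_pipeline("model_input", "predictions"),
    }
```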
Every pipeline has configuration files consisting of a `parameters.yml` and a `catalog.yml` file. The former contains all the parameters required for the pipeline run. The latter is the project-shareable Data Catalog: a registry of all the data sources available for use by the project, which manages the loading and saving of data. Both configuration files are located under `conf/base`.
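To give an idea of what a catalog entry represents, a `catalog.yml` entry is roughly equivalent to registering a data set in a `DataCatalog` programmatically. The sketch below is only an illustration: the data set name and file path are assumptions, not entries taken from this project's catalogs, and the exact import path depends on the Kedro version.

```python
from kedro.io import DataCatalog
from kedro.extras.datasets.pandas import CSVDataSet  # import path may differ between Kedro versions

# Hypothetical example of a named data set that Kedro can load and save.
# The name and file path are illustrative, not real entries of this project.
catalog = DataCatalog(
    {"X_train_raw": CSVDataSet(filepath="data/01_raw/X_train_raw.csv")}
)

df = catalog.load("X_train_raw")  # read the CSV into a pandas DataFrame
catalog.save("X_train_raw", df)   # write it back using the same definition
```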
As a Kedro application, the CLI can be used to run the pipelines, among many other options you can check in the Kedro documentation. These are some basic command examples for running the main pipelines of this project, choosing the wind farm (`wf`) and the algorithm (`alg`) used to build the model:
- Prepare data for EDA: `kedro run --pipeline eda --params wf:WF1`
- Data engineering: `kedro run --pipeline de --params wf:WF1`
- Feature engineering: `kedro run --pipeline fe --params wf:WF1,max_k_bests:3`
- Modeling: `kedro run --pipeline mdl --params wf:WF1,alg:KNN`
You can override any parameter value defined in the parameter configuration files, as well as the data set used as the first input, as long as it is defined in any of the existing data catalogs.
Important: it is necessary to put the raw data in `data/01_raw/`. Raw data is available here (free registration for the challenge is required).
Using the `kedro-viz` plugin (it needs to be installed) and running `kedro viz`, you can visualize the data and machine-learning pipelines. For instance, this is the visualization of the data engineering pipeline:
- MLflow tracking UI: `kedro mlflow ui`. It serves the tracking tool as a web application on localhost (port 5000 by default). To use the MLflow UI you need to install the `kedro-mlflow` plugin.
- Jupyter notebook: `kedro jupyter notebook`. It launches Jupyter Notebook with all the Kedro context variables loaded, so you can easily access pipelines, data catalogs, parameters and many other useful objects from your notebook (see the example below).
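As an illustration (a minimal sketch, not code from this repository; the data set name below is hypothetical), this is the kind of access the injected Kedro variables give you inside the notebook:

```python
# Inside a notebook started with `kedro jupyter notebook`, Kedro injects
# variables such as `catalog` and `context` at startup, so no imports are
# needed. The data set name below is hypothetical; use `catalog.list()` to
# see the names actually defined in this project's data catalogs.

catalog.list()                    # all data sets registered in the catalogs
df = catalog.load("X_train_raw")  # load one of them (hypothetical name)

context.params                    # all parameters from parameters.yml
context.params["wf"]              # e.g. the selected wind farm
```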