📈 Weather forecasts as a point multivariate time series forecasting problem with Seq2Seq neural networks.
The goal is to predict temperature, wind speed, etc. as accurately as possible for hours and days ahead using LSTM, BiLSTM, TCN, Transformer, Spacetimeformer and NBEATSx, along with some data analysis tools.
Note: work is being performed in the MBelniak fork.
There are 3 different datasets used in this project. Depending on the experiment settings, the neural network uses the first dataset only, the first two, or all three.
- Synop reports from ground stations https://danepubliczne.imgw.pl/
  - Multiple parameters are fetched and used, see src/synop/consts.py
  - Wind velocity, direction and gusts are fetched from https://danepubliczne.imgw.pl/datastore for higher time and value resolution
- GFS 0.25° archive forecasts from https://rda.ucar.edu
  - Multiple parameters are used, see src/wind_forecast/config/train_parameters/CommonGFSConfig.json
- Maximum Reflectivity images (CMAX) https://danepubliczne.imgw.pl/datastore
The process of obtaining GFS archive data is described in the gfs-archive-0-25 module. Synop data is fetched in src/synop/fetch_synop_data.py. CMAX data is fetched in src/radar/fetch_radar_CMAX.py and processed in radar/preprocess_cmax.py.
- PyTorch for creating models
- PyTorch Lightning for the training regime
- Weights & Biases for logging and plotting results
- Hydra for configuration
- Optuna for tuning
- NumPy, pandas, scikit-learn, matplotlib, seaborn as tooling
All models work in Seq2Seq fashion, with configurable time window and forecast horizon.
- LSTM - encoder-decoder architecture with stacked LSTMs, as described in Sequence to Sequence Learning with Neural Networks
- BiLSTM - same as LSTM, but with bidirectional encoder
- TCN - encoder-decoder architecture as described in Temporal Convolutional Networks for the Advance Prediction of ENSO. There are also an encoder-only variant and a variant with additional attention layers
- Transformer - model based on Attention is all you need
- Spacetimeformer - model based on Long-Range Transformers for Dynamic Spatiotemporal Forecasting
- NBEATSx - model based on Neural basis expansion analysis with exogenous variables: Forecasting electricity prices with NBEATSx
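The Seq2Seq framing with a configurable time window and forecast horizon can be illustrated with a small sliding-window split. This is only a sketch in plain Python: `make_windows` and the toy values are hypothetical and not taken from this repository, whose data modules may build samples differently.

```python
from typing import List, Sequence, Tuple

Row = Sequence[float]  # one time step: several weather variables


def make_windows(
    series: Sequence[Row], window: int, horizon: int
) -> List[Tuple[List[Row], List[Row]]]:
    """Split a multivariate series into (input window, forecast target) pairs."""
    if window <= 0 or horizon <= 0:
        raise ValueError("window and horizon must be positive")
    pairs = []
    # The last valid start index leaves room for both the window and the horizon.
    for start in range(len(series) - window - horizon + 1):
        past = list(series[start:start + window])
        future = list(series[start + window:start + window + horizon])
        pairs.append((past, future))
    return pairs


# Toy data: 6 hourly steps of (temperature, wind speed); the values are made up.
toy_series = [
    (10.0, 3.0), (11.0, 3.5), (12.0, 4.0),
    (12.5, 4.5), (12.0, 5.0), (11.0, 4.0),
]
pairs = make_windows(toy_series, window=3, horizon=2)
print(len(pairs))  # 2
```

Each pair holds `window` past steps as encoder input and the next `horizon` steps as the decoder target, so changing either value changes how many samples a series yields.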
There are several scopes in which an experiment can be configured. For tips on configuring a run from the command line, see scripts. Also, see how to create a predefined config via config files in src/wind_forecast/config/experiment or src/wind_forecast/config/optim.
- config.experiment
  - training regime - number of epochs, skip training, save checkpoint etc.,
  - model-specific config - model hyperparameters, dropout etc.,
  - problem-specific config - time window length, horizon length, target parameter, target location etc.,
  - datasets config - val/test split, synop file, weather parameters to use, date range
- config.optim
  - lr, lr scheduler, optimizer, loss
- config.lightning
  - deterministic training, gpus
- config.tune - tuning config; set of params to check
There are multiple configurations (yaml files) already prepared in src/wind_forecast/config/experiment, but they all use Sequence2SequenceWithCMAXDataModule, which requires CMAX files (reflectivity images). If you don't use CMAX files, it is better to use Sequence2SequenceDataModule together with use_cmax_data: False and load_cmax_data: False. Sequence2SequenceWithCMAXDataModule is used in my experiments to keep the datasets identical across all experiments in my thesis.
Obtaining datasets is described in synop readme, GFS readme and CMAX readme.
Prepared synop data (csv file) should be placed in the src/data/synop directory. There are already some files ready. Prepared GFS and CMAX datasets should be placed in a pkl directory inside the directories pointed to by the GFS_DATASET_DIR and CMAX_DATASET_DIR environment variables.
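Before starting a run, it can help to check that both dataset variables resolve to a usable location. The sketch below assumes the layout `<variable value>/pkl`, which is one reading of the description above; `resolve_pkl_dir` is a hypothetical helper, not a function from this repository.

```python
import os
from pathlib import Path


def resolve_pkl_dir(env_var: str) -> Path:
    """Return <value of env_var>/pkl, failing early if the variable is unset."""
    base = os.environ.get(env_var)
    if not base:
        raise RuntimeError(f"Environment variable {env_var} is not set")
    return Path(base) / "pkl"


# Report where each dataset is expected and whether that directory exists.
for name in ("GFS_DATASET_DIR", "CMAX_DATASET_DIR"):
    try:
        pkl_dir = resolve_pkl_dir(name)
        print(f"{name}: {pkl_dir} (exists: {pkl_dir.is_dir()})")
    except RuntimeError as error:
        print(error)
```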
First, create the conda environment:
conda env create -f environment.yml
Then, install the dependencies:
pip install -r requirements.txt
To run an experiment, in the src directory:
python -m wind_forecast.main experiment=<experiment_yml_file> [options...]
# e.g.
python -m wind_forecast.main experiment=transformer experiment.batch_size=32 lightning.gpus=0
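Overrides such as `experiment.batch_size=32` address nested configuration keys by their dotted path. The sketch below illustrates that idea on a plain dictionary. It is not Hydra's parser, `apply_overrides` is a hypothetical helper, and the starting values are made up. Selecting a config file with `experiment=<name>` works differently and is not covered here.

```python
from typing import Any, Dict, List


def apply_overrides(config: Dict[str, Any], overrides: List[str]) -> Dict[str, Any]:
    """Apply 'a.b.c=value' style overrides to a nested dict (illustration only)."""
    for item in overrides:
        dotted_key, _, raw_value = item.partition("=")
        *parents, leaf = dotted_key.split(".")
        node = config
        for name in parents:
            node = node.setdefault(name, {})
        # Keep integers as integers; everything else stays a string here.
        try:
            value: Any = int(raw_value)
        except ValueError:
            value = raw_value
        node[leaf] = value
    return config


# Hypothetical starting values; the real defaults live in the project's config files.
config = {"experiment": {"batch_size": 64}, "lightning": {"gpus": 1}}
apply_overrides(config, ["experiment.batch_size=32", "lightning.gpus=0"])
print(config)
```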
The RUN_MODE variable in the .env file switches the run mode. Leave it unset to run a basic full training.
RUN_MODE=debug # Disables W&B logging and loads only a small part of datasets in order to start and perform the training process faster
RUN_MODE=tune # Performs the tuning process. See [tune](https://github.com/MBelniak/WindForecast/tree/master/src/wind_forecast/config/tune) for example tune configs.
RUN_MODE=tune_debug # Combines the two modes above
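The relationship between the run modes can be summarised as three flags derived from RUN_MODE. This is a sketch of the behaviour described above, not the project's actual code; `RunFlags` and `flags_from_run_mode` are hypothetical names.

```python
import os
from typing import Mapping, NamedTuple


class RunFlags(NamedTuple):
    tune: bool           # run the tuning process instead of a single training
    disable_wandb: bool  # debug modes disable W&B logging
    small_dataset: bool  # debug modes load only a small part of the datasets


def flags_from_run_mode(env: Mapping[str, str] = os.environ) -> RunFlags:
    """Map RUN_MODE (unset, debug, tune, tune_debug) to behaviour flags."""
    mode = env.get("RUN_MODE", "").strip().lower()
    if mode not in ("", "debug", "tune", "tune_debug"):
        raise ValueError(f"Unknown RUN_MODE: {mode!r}")
    debug = mode in ("debug", "tune_debug")
    tune = mode in ("tune", "tune_debug")
    return RunFlags(tune=tune, disable_wandb=debug, small_dataset=debug)


print(flags_from_run_mode({"RUN_MODE": "tune_debug"}))
```

An unset RUN_MODE yields all flags set to False, which corresponds to a basic full training.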
Add the following to .env to enable logging to W&B:
RESULTS_DIR=<relative to repo root, target dir for logs, checkpoints etc.>
WANDB_ENTITY=<your w&b username>
WANDB_PROJECT=<your w&b project name>
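A quick way to catch a missing entry is to parse the .env contents and list the keys that are absent or empty. The parser below is a deliberately simple sketch: it handles only plain `KEY=value` lines, without quoting or inline comments, and is not how the project itself loads .env. `parse_env_text` and `missing_keys` are hypothetical helpers.

```python
from typing import Dict, List

REQUIRED_KEYS = ("RESULTS_DIR", "WANDB_ENTITY", "WANDB_PROJECT")


def parse_env_text(text: str) -> Dict[str, str]:
    """Parse simple KEY=value lines; blank lines and '#' comments are skipped."""
    values: Dict[str, str] = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip()
    return values


def missing_keys(values: Dict[str, str]) -> List[str]:
    """Return the required keys that are absent or empty."""
    return [key for key in REQUIRED_KEYS if not values.get(key)]


# Example contents with one entry left out on purpose.
example = "RESULTS_DIR=results\nWANDB_ENTITY=some-user\n"
print(missing_keys(parse_env_text(example)))  # ['WANDB_PROJECT']
```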
The dataloader, val_dataloader 0, does not have many workers which may be a bottleneck. Consider increasing the value of the `num_workers` argument` (try 8 which is the number of cpus on this machine) in the `DataLoader` init to improve performance
By default, data is not loaded in parallel due to problems on my Windows machine. You can try speeding it up by setting experiment.num_workers to the number of cores on your machine, or to a smaller number if there are CUDA errors.
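One way to pick a value is to start from the core count reported by the operating system and lower it if problems appear. The sketch below uses only the standard library; `pick_num_workers` is a hypothetical helper, and the printed command reuses the `experiment=transformer` example from above.

```python
import os
from typing import Optional


def pick_num_workers(requested: Optional[int] = None) -> int:
    """Clamp a requested worker count to the range [0, number of CPU cores]."""
    cores = os.cpu_count() or 1  # os.cpu_count() may return None
    if requested is None:
        return cores
    return max(0, min(requested, cores))


workers = pick_num_workers()
print(
    "python -m wind_forecast.main experiment=transformer "
    f"experiment.num_workers={workers}"
)
```

Passing a smaller explicit value, for example `pick_num_workers(2)`, is a reasonable fallback if the full core count triggers CUDA errors.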