Author: Ivan Bongiorni, Data Scientist. LinkedIn.
The goal of this project is the implementation of multiple configurations of a Recurrent Convolutional Seq2seq neural network for the imputation of time series data. Three implementations are provided:
- A Recurrent Convolutional seq2seq model.
- A GAN (Generative Adversarial Network) based on the same architecture above, where an Imputer is trained to fool an adversarial Network that tries to distinguish real and fake (imputed) time series.
- A partially adversarial model, in which both Loss structures of previous models are combined in one: an Imputer model must reduce true error Loss, while trying to fool a Discriminator at the same time.
Models are Implemented in TensorFlow 2 and trained on the Wikipedia Web Traffic Time Series Forecasting dataset.
config.yaml
: configuration parameters for data preprocessing, training and testing.
Pipelines:
main_processing.py
: starts data preprocessing pipeline. Its outcomes are ready-to-train datasets saved in .npy (numpy
) format in/data_processed/
folder.main_train.py
: starts training pipeline. Trained model is saved in/saved_models/
folder, with the 'model_name
' provided inconfig.yaml
.
Scripts:
tools.py
: contains more technical functions that are iterated during preprocessing pipeline.model.py
: implementation of models' architectures.train.py
: contains functions for all training configurations.deterioration.py
: the script contains the function that calls an artificial deterioration of training data, in order to check imputation performance.
Notebooks and explanations:
how_it_works.md
: contains explanation of Deep Learning models in greater detail.nan_exploration.ipynb
: contains a study of the distribution of NaN's in the raw dataset, that lead to the development of the deterioration function.data_scaling_exploration.ipynb
: contains visualizations of the scaling function I employed in data preprocessing phase.imputation_visual_check.ipynb
: visualization of a models performance. The notebook loads the trained model specified inparams['model_name']
and check its performance on Validation and Test data.performance_comparison.ipynb
: shows the performances of three trained models on Test data, compared.
Folders:
data_raw/
: it is supposed to contain the raw Wikipedia Web Traffic Time Series Forecasting dataset, as it is downloaded (and unzipped) from Kaggle.data_processed/
: it contains the outcome of preprocesing pipeline, launched frommain_processing.py
. Observations will be stored in three sub-directories forTraining/
,Validation/
andTest/
.saved_models/
: where models are saved at the end of training pipepine. Model names can be changed inconfig.yaml
. In case a GAN is trained and config parametersave_discriminator
is set toTrue
, the Discriminator model will be saved as[model_name]_discriminator.h5
.
langdetect==1.0.8
numpy==1.18.3
pandas==1.0.3
scikit-learn==0.22.2.post1
scipy==1.4.1
tensorflow==2.1.0
- Luo, Y., Cai, X., Zhang, Y., & Xu, J. (2018). Multivariate time series imputation with generative adversarial networks. In Advances in Neural Information Processing Systems (pp. 1596-1607).
- Yoon, J., Jordon, J., & Van Der Schaar, M. (2018). Gain: Missing data imputation using generative adversarial nets. arXiv preprint arXiv:1806.02920.
- Guo, Z., Wan, Y., & Ye, H. (2019). A data imputation method for multivariate time series based on generative adversarial network. Neurocomputing, 360, 185-197.
- Liu, Y., Yu, R., Zheng, S., Zhan, E., & Yue, Y. (2019). NAOMI: Non-autoregressive multiresolution sequence imputation. In Advances in Neural Information Processing Systems (pp. 11238-11248).
- Luo, Y., Zhang, Y., Cai, X., & Yuan, X. (2019, August). E2GAN: End-to-End Generative Adversarial Network for Multivariate Time Series Imputation. In Proceedings of the 28th International Joint Conference on Artificial Intelligence (pp. 3094-3100). AAAI Press.
- Suo, Q., Yao, L., Xun, G., Sun, J., & Zhang, A. (2019, June). Recurrent Imputation for Multivariate Time Series with Missing Values. In 2019 IEEE International Conference on Healthcare Informatics (ICHI) (pp. 1-3). IEEE.
- Tang, X., Yao, H., Sun, Y., Aggarwal, C. C., Mitra, P., & Wang, S. (2020). Joint Modeling of Local and Global Temporal Dynamics for Multivariate Time Series Forecasting with Missing Values. In AAAI (pp. 5956-5963).
- Zhang, J., Mu, X., Fang, J., & Yang, Y. (2019). Time Series Imputation via Integration of Revealed Information Based on the Residual Shortcut Connection. IEEE Access, 7, 102397-102405.
- Fortuin, V., Baranchuk, D., Rätsch, G., & Mandt, S. (2020, June). GP-VAE: Deep Probabilistic Time Series Imputation. In International Conference on Artificial Intelligence and Statistics (pp. 1651-1661).
- Huang, T., Chakraborty, P., & Sharma, A. (2020). Deep convolutional generative adversarial networks for traffic data imputation encoding time series as images. arXiv preprint arXiv:2005.04188.
- Huang, Y., Tang, Y., VanZwieten, J., & Liu, J. (2020). Reliable machine prognostic health management in the presence of missing data. Concurrency and Computation: Practice and Experience, e5762.
- Jun, E., Mulyadi, A. W., Choi, J., & Suk, H. I. (2020). Uncertainty-Gated Stochastic Sequential Model for EHR Mortality Prediction. arXiv preprint arXiv:2003.00655.
- Qi, M., Qin, J., Wu, Y., & Yang, Y. (2020). Imitative Non-Autoregressive Modeling for Trajectory Forecasting and Imputation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (pp. 12736-12745).
- Wang, Y., Menkovski, V., Wang, H., Du, X., & Pechenizkiy, M. (2020). Causal Discovery from Incomplete Data: A Deep Learning Approach. arXiv preprint arXiv:2001.05343.
- Yi, J., Lee, J., Kim, K. J., Hwang, S. J., & Yang, E. (2019). Why Not to Use Zero Imputation? Correcting Sparsity Bias in Training Neural Networks. arXiv preprint arXiv:1906.00150.
- Yoon, S., & Sull, S. (2020). GAMIN: Generative Adversarial Multiple Imputation Network for Highly Missing Data. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (pp. 8456-8464).
I trained this model on a fairly powerful machine: a System76 Adder WS laptop with 64 GB of RAM and NVidia RTX 2070 GPU.