In this project I replicate parts of Campbell et al. (2020), relating cholera risk to essential climate variables (ECVs) using a random forest classifier.
To keep the amount of raw data manageable, I focus on the years 2010 to 2018 and only on the ECVs that Campbell et al. (2020) identify as most predictive, i.e. sea surface salinity, chlorophyll-a concentration and land surface temperature.
Contents
- Download the data
- Preprocess the data
- Exploratory data analysis and validation
- Create train and test set
- Modeling
- Open questions
- Requirements
- References
First, outbreaks and ECVs need to be downloaded:
- Outbreaks:
download_outbreaks.py
- Sea surface salinity:
download_sea_surface_salinity.sh
- Chlorophyll-a concentration:
download_chlorophyll_a_concentration.sh
- Land surface temperature:
download_and_preprocess_land_surface_temperature.py
The land surface temperature data are already preprocessed during download: their daily temporal resolution makes the raw data extremely large.
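As a rough illustration of why this step shrinks the data, the sketch below collapses daily values to one mean per district and month. It is not the logic of download_and_preprocess_land_surface_temperature.py; the column names (`date`, `district`, `lst`), the synthetic values and the choice of a plain monthly mean are all assumptions.

```python
import numpy as np
import pandas as pd

# Synthetic daily land surface temperature (Kelvin) for one hypothetical district.
days = pd.date_range("2010-01-01", "2010-03-31", freq="D")
rng = np.random.default_rng(0)
daily = pd.DataFrame({
    "date": days,
    "district": "district_a",
    "lst": 295.0 + rng.normal(0.0, 2.0, size=len(days)),
})

# Collapse the daily values to one mean per district and calendar month
# (assumption: a simple arithmetic mean is the desired monthly aggregate).
monthly = (
    daily
    .assign(month=daily["date"].dt.to_period("M"))
    .groupby(["district", "month"], as_index=False)["lst"]
    .mean()
)

print(len(daily), "daily rows ->", len(monthly), "monthly rows")
```

Here 90 daily rows become 3 monthly rows, which is the kind of reduction that makes the later steps tractable.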
Second, downloaded outbreaks and ECVs need to be preprocessed:
- Outbreaks:
extract_and_clean_cholera_outbreaks.py
and preprocess_cholera_outbreaks.ipynb
(I'm still working on this step: extracting data correctly from PDF files is tricky, so the target variable is still faulty and the modeling does not yet work properly.)
- ECVs:
preprocess_essential_climate_variables.ipynb
Third, preprocessed cholera outbreaks and ECVs need to be explored and validated:
exploratory_data_analysis_and_validation.ipynb
Fourth, validated cholera outbreaks and ECVs need to be merged to create a train and a test set on district and month level:
create_train_and_test_set.ipynb
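A minimal sketch of such a merge at district and month level is shown below. The table layouts, the column names (`district`, `month`, `sss`, `outbreak`) and the time-based split are assumptions made for illustration; the notebook may label and split the data differently.

```python
import pandas as pd

# Hypothetical inputs: one row per recorded outbreak, and monthly ECV means
# per district (months as "YYYY-MM" strings, which sort chronologically).
outbreaks = pd.DataFrame({
    "district": ["a", "b"],
    "month": ["2010-02", "2010-03"],
})
ecvs = pd.DataFrame({
    "district": ["a", "a", "a", "b", "b", "b"],
    "month": ["2010-01", "2010-02", "2010-03"] * 2,
    "sss": [34.1, 33.8, 33.5, 35.0, 34.9, 34.7],
})

# Left-join the outbreaks onto the ECV grid; district-months without a
# matching outbreak record become non-outbreak rows (label 0).
merged = ecvs.merge(
    outbreaks.assign(outbreak=1), on=["district", "month"], how="left"
)
merged["outbreak"] = merged["outbreak"].fillna(0).astype(int)

# Time-based split (an assumption): earlier months train, later months test.
cutoff = "2010-03"
train = merged[merged["month"] < cutoff]
test = merged[merged["month"] >= cutoff]
```

Splitting by time rather than at random keeps future months out of the training set, which seems a natural choice for a forecasting-style task.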
Finally, a random forest classifier is trained to predict cholera outbreaks based on a range of ECV features:
modeling.ipynb
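For orientation, the following sketch trains a random forest with scikit-learn on synthetic data. The feature names, the synthetic label, the hyperparameters and the use of balanced accuracy are assumptions, not the settings used in modeling.ipynb or in Campbell et al. (2020).

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import balanced_accuracy_score

rng = np.random.default_rng(42)
n = 400

# Synthetic stand-ins for three ECV features (not real data).
X = rng.normal(size=(n, 3))
# Synthetic, imbalanced label driven mainly by the first feature.
y = (X[:, 0] + 0.5 * rng.normal(size=n) > 1.0).astype(int)

# Simple ordered split: first 300 rows train, last 100 rows test.
X_train, X_test = X[:300], X[300:]
y_train, y_test = y[:300], y[300:]

# class_weight="balanced" reweights the rarer outbreak class.
model = RandomForestClassifier(
    n_estimators=200, class_weight="balanced", random_state=0
)
model.fit(X_train, y_train)

pred = model.predict(X_test)
score = balanced_accuracy_score(y_test, pred)
importances = dict(zip(["sss", "chlor_a", "lst"], model.feature_importances_))
print(f"balanced accuracy: {score:.2f}")
print(importances)
```

Because outbreaks are rare compared with non-outbreak district-months, a metric that accounts for class imbalance is likely more informative than plain accuracy.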
- Which local CRS is best to use (e.g. https://epsg.io/7755)?
- How exactly is the buffering done?
- How exactly are the areal means of terrestrial and oceanic variables computed?
- How is the number of non-outbreak data points of 8504 calculated? Intuitively I would calculate 9 years x 12 months x 40 coastal districts = 4320. I'm clearly missing something here.
- Which of the lag variables are used in the final model, i.e. actual lag values, rate of change and/or binary features indicating the rate of change's direction?
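To make the three candidate lag features from the last question concrete, here is a small sketch for a one-month lag of sea surface salinity. The column names and the definition of the rate of change as a simple month-to-month difference are assumptions; the paper may define these features differently.

```python
import pandas as pd

# Hypothetical monthly sea surface salinity per district.
df = pd.DataFrame({
    "district": ["a"] * 4 + ["b"] * 4,
    "month": ["2010-01", "2010-02", "2010-03", "2010-04"] * 2,
    "sss": [34.0, 33.5, 33.5, 34.2, 35.0, 35.4, 35.1, 35.1],
}).sort_values(["district", "month"])

# 1) Actual lag value: last month's value within the same district.
df["sss_lag1"] = df.groupby("district")["sss"].shift(1)

# 2) Rate of change (assumed here: difference to the previous month).
df["sss_change1"] = df["sss"] - df["sss_lag1"]

# 3) Binary direction of the change: 1 if increasing, else 0.
#    Missing lags compare as False, so they are dropped below.
df["sss_increasing1"] = (df["sss_change1"] > 0).astype(int)

# The first month of each district has no lag and is removed.
features = df.dropna(subset=["sss_lag1"])
```

Shifting within each district group prevents the last month of one district from leaking into the first month of the next.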
Assuming conda is installed, run the following commands in your terminal to run the scripts and notebooks in this repository:
- create environment:
conda env create --file=cholera_risk_modeling.yaml
- activate environment:
conda activate cholera_risk_modeling
- you might need to do the following in addition:
pip install ghostscript
Run each script and notebook from within its own directory.
Campbell AM, Racault M-F, Goult S, Laurenson A. Cholera Risk: A Machine Learning Approach Applied to Essential Climate Variables. International Journal of Environmental Research and Public Health. 2020; 17(24):9378. https://doi.org/10.3390/ijerph17249378
National Centre for Disease Control, Directorate General of Health Services. Integrated Disease Surveillance Programme. Available online: http://idsp.nic.in/ (accessed on 28 February 2021).
University of California, Berkeley. Global Administrative Areas. Digital Geospatial Data. 2020. Available online: http://www.gadm.org (accessed on 28 February 2021).
Boutin, J.; Vergely, J.-L.; Reul, N.; Catany, R.; Koehler, J.; Martin, A.; Rouffi, F.; Arias, M.; Chakroun, M.; Corato, G.; Estella-Perez, V.; Guimbard, S.; Hasson, A.; Josey, S.; Khvorostyanov, D.; Kolodziejczyk, N.; Mignot, J.; Olivier, L.; Reverdin, G.; Stammer, D.; Supply, A.; Thouvenin-Masson, C.; Turiel, A.; Vialard, J.; Cipollini, P.; Donlon, C. (2021): ESA Sea Surface Salinity Climate Change Initiative (Sea_Surface_Salinity_cci): Weekly sea surface salinity product, v03.21, for 2010 to 2020. NERC EDS Centre for Environmental Data Analysis, 8 October 2021. https://catalogue.ceda.ac.uk/uuid/fad2e982a59d44788eda09e3c67ed7d5
Sathyendranath, S.; Jackson, T.; Brockmann, C.; Brotas, V.; Calton, B.; Chuprin, A.; Clements, O.; Cipollini, P.; Danne, O.; Dingle, J.; Donlon, C.; Grant, M.; Groom, S.; Krasemann, H.; Lavender, S.; Mazeran, C.; Mélin, F.; Müller, D.; Steinmetz, F.; Valente, A.; Zühlke, M.; Feldman, G.; Franz, B.; Frouin, R.; Werdell, J.; Platt, T. (2021): ESA Ocean Colour Climate Change Initiative (Ocean_Colour_cci): Global chlorophyll-a data products gridded on a geographic projection, Version 5.0. NERC EDS Centre for Environmental Data Analysis, 12 May 2021. https://catalogue.ceda.ac.uk/uuid/e9f82908fd9c48138b31e5cfaa6d692b
Ghent, D.; Veal, K.; Perry, M. (2022): ESA Land Surface Temperature Climate Change Initiative (LST_cci): Multisensor Infra-Red (IR) Low Earth Orbit (LEO) land surface temperature (LST) time series level 3 supercollated (L3S) global product (1995-2020), version 2.00. NERC EDS Centre for Environmental Data Analysis, 25 February 2022. doi:10.5285/ef8ce37b6af24469a2a4bdc31d3db27d. http://dx.doi.org/10.5285/ef8ce37b6af24469a2a4bdc31d3db27d
Data on essential climate variables are available in the ESA Climate Change Initiative's Open Data Portal.