This is a code and analysis repository for the paper Cell phone mobility data and manifold learning: Insights into population behavior during the COVID-19 pandemic. Cell-phone mobility data offers a modern measurement instrument to investigate human mobility and behavior at an unprecedented scale. We investigate aggregated and anonymized mobility data (SafeGraph COVID mobility data) which measures how populations at the census-block-group geographic scale stayed at home in California, Georgia, Texas, and Washington in the beginning of the COVID-19 pandemic. Using manifold learning techniques, we find patterns of mobility behavior that align with stay-at-home orders, correlate with socioeconomic factors, cluster geographically, reveal subpopulations that likely migrated out of urban areas, and, importantly, link to COVID-19 case counts. The analysis and approach provide policy makers a framework for interpreting mobility data and behavior to inform actions aimed at curbing the spread of COVID-19.
To use the code from this repository, you need a Linux machine, or a Windows machine with WSL or Cygwin configured to run Linux commands. A conda installation of Python 3 and an installation of R are also required. A Jupyter Notebook installation is needed for the correct execution of the make commands (see below). The Python dependencies are specified in requirements.txt, and for R the following packages need to be installed on the system: mclust (version 5.4.6), dplyr, tidycensus, data.table, clinfun, sf, jsonlite. The demo only requires mclust (version 5.4.6). The code was developed and tested on an Ubuntu 18.04 computer with 16 GB RAM and an Intel(R) Core(TM) i7-6820HQ CPU @ 2.70GHz processor.
First, we clone the repository:
git clone https://github.com/InstituteforDiseaseModeling/covid-mobility-and-behavior.git
The Make utility automates downloading the data and creating the environment. Typing make in the terminal will show a description of the available commands.
First, we need to create a virtual environment and install the project requirements (the commands below should be executed from the root directory of the project). The following commands will create a conda virtual environment for the project and activate it:
make create_environment
source activate covid-mobility-and-behavior
After that, we install the required packages and create a Jupyter kernel for the project (make sure R is installed on the system):
make requirements
Note that the above command installs a new Jupyter kernel for the created virtual environment. This can be avoided by commenting out the respective lines in the Makefile.
Now we can download the data. The following command downloads the necessary external data, e.g. shapefiles (it can take up to 20 minutes to run, depending on internet speed):
make data
Now we should be able to run the demo notebook from the /demo folder. The installation process is expected to take up to 10 minutes (20 minutes for slow connections).
Note that the raw SafeGraph data is not publicly accessible and cannot be downloaded automatically. Access has to be requested through the SafeGraph COVID data consortium. The CBG-level mobility data should be placed in data/raw. While the results of our analysis can be viewed in the /notebooks directory, the code will not run correctly without the raw SafeGraph data.
Since we cannot share the SafeGraph data directly, we provide a demo dataset to showcase our method.
Synthetic time series are generated for each CBG in Washington State. The time series are based on four basis functions: two different sine waves and two exponentials (one rising and one falling).
One county was selected for each of the four basis functions. Time series are generated for each CBG within these counties by multiplying the appropriate basis function by a random number, so all CBGs within a single county are the same function multiplied by a scalar. Noise is added to the synthetic time series.
Synthetic time series for the remaining CBGs are generated from a combination of two basis functions. Each county is assigned a pair of basis functions, and the time series for each CBG is the product of one basis function plus a random weight and the other basis function plus another random weight, so these time series are essentially products of two basis functions (see the sketch below). The basis functions by county are shown in the map below; we expect the output of our method to look similar to this map.
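For illustration, here is a minimal Python sketch of this construction. The actual generator is the R script /demo/make-synthetic-wa.R; the series length, weight ranges, and noise level below are assumptions for illustration only, not the values used in the demo.

```python
import numpy as np

rng = np.random.default_rng(0)
n_days = 120                      # length of each synthetic time series (assumed)
t = np.linspace(0, 1, n_days)

# Four basis functions: two sine waves and two exponentials (rising and falling)
basis = [
    np.sin(2 * np.pi * t),
    np.sin(4 * np.pi * t),
    np.exp(2 * t),
    np.exp(-2 * t),
]

def single_basis_cbg(k, noise_sd=0.05):
    """CBG in one of the four 'pure' counties: basis function times a random scalar, plus noise."""
    return rng.uniform(0.5, 1.5) * basis[k] + rng.normal(0, noise_sd, n_days)

def mixed_basis_cbg(k1, k2, noise_sd=0.05):
    """CBG in a remaining county: product of two randomly weighted basis functions, plus noise."""
    return (basis[k1] + rng.uniform(0, 1)) * (basis[k2] + rng.uniform(0, 1)) + rng.normal(0, noise_sd, n_days)

# Example: 10 CBGs from a 'pure' county and 10 from a 'mixed' county
pure_county = np.vstack([single_basis_cbg(0) for _ in range(10)])
mixed_county = np.vstack([mixed_basis_cbg(1, 3) for _ in range(10)])
print(pure_county.shape, mixed_county.shape)   # (10, 120) (10, 120)
```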
The script /demo/make-synthetic-wa.R generates the synthetic dataset. The demo analysis can be run using the /demo/Demo-Main-Analysis.ipynb notebook. The demo dataset is automatically downloaded and saved to the data/demo directory by the installation command make data. Alternatively, the demo dataset can be downloaded from here and placed in data/demo manually.
Below is the expected output of the demo analysis:
The demo analysis takes about 10 minutes to run on a computer with 16 GB RAM and Intel(R) Core(TM) i7-6820HQ CPU @ 2.70GHz processor.
The raw SafeGraph data is not publicly accessible, and we cannot share it directly. Access has to be requested through the SafeGraph COVID data consortium. The CBG-level mobility data should be placed in data/raw. After that, our analysis can be reproduced by running the notebooks from the /notebooks and /censuscode directories.
├── LICENSE
├── Makefile <- Makefile with commands like `make data`, `make create_environment`, `make requirements`
├── README.md
├── requirements.txt <- Requirements file for reproducing the Python analysis environment
├── setup.py <- Installation script for the local package
│
├── demo
| ├── obj <- Directory to save computed helper objects
| |
│ ├── Demo-Main-Analysis.ipynb <- Main demo notebook with the dimensionality reduction + clustering pipeline applied to synthetic demo data
│ └── make-synthetic-wa.R <- Script to generate demo data: synthetic mobility dynamics in Washington state
│
├── data
│ ├── external <- Data from external sources, e.g. shapefiles for plotting maps (from census.gov)
│ ├── interim <- Intermediate data files
│ ├── processed <- Final data sets -- final clustering labels and final low-dimensional coordinates for every state
│ └── raw <- Raw data -- this is where SafeGraph mobility data should be placed
│
├── notebooks <- Jupyter notebooks with the analysis code and the code to generate figures
| ├── obj <- Directory to save computed helper objects
| |
│ ├── Main-Analysis-Figure2.ipynb <- Main notebook with the dimensionality reduction + clustering pipeline applied to all 4 states, produces Figure 2
│ ├── Schematic-Figure1.ipynb <- Generates panels for the pipeline description in Figure 1 of the paper
│ ├── Zoomed-Maps-Figure3.ipynb <- Generates zoomed-in maps for Figure 3 of the paper
| ├── Diffusion-Maps.ipynb <- Diffusion Maps code (Supplement, Figure S3)
│ └── income-population-KS.ipynb <- Analysis of income and population density in identified clusters, Kolmogorov-Smirnov test for response speed distributions
│
├── censuscode <- Source code for interpretation analysis
│ ├── get-acs2018data.R <- Script to download ACS data (requires inserting API key to access ACS data)
│ └── make-census-plots.R <- Script to interpret the clusters by correlating them with socioeconomic data, produces Figures 4, 5, and 6 of the paper
|
├── reports <- Final figures
│ └── figures
│
└── src <- Source code
|
├── data <- Scripts to download data (only downloads the demo data and publicly available data such as shapefiles; SafeGraph data access should be requested from SafeGraph)
│
├── config.py <- Configurations defining data paths and color palettes
├── core_pipeline.py <- Source code for applying the pipeline of nonlinear dimensionality reduction + GMM clustering
├── dimensionality_reduction.py <- Functions for dimensionality reduction methods and their visualization
└── utils.py <- Helper functions
Project based on the cookiecutter data science project template. #cookiecutterdatascience