This github repo replicates the materials in openICPSR repo Version 1.0, accessible here.
Repository of code associated with the paper "Using Neural Networks to Predict Micro-Spatial Economic Growth", which is accepted at the journal American Economic Review: Insights.
This project constructs and analyzes neural networks that predict Census outcomes of population and income using daytime satellite imagery. The code pipeline to recreate or adapt our work is organized into three phases: (1) raw data extraction and processing, (2) model training and validation, and (3) predictions and analysis. The output of phase (1) is a set of .tfrecord files containing the train, validation, and test data. The output of phase (2) is a set of trained CNNs. The output of phase (3) is a set of predictions of our outcome variables, along with tables and graphs reflecting our analysis of them.
Phases (1) and (2) will be of interest to those who want to extend our basic approach to other sources of imagery and/or outcome variables, as well as those who want to understand our approach in (possibly excruciating) detail. Phase (3) will be of interest to those who want to use our predictions for another application. Each of these phases is described in more detail below. Those interested primarily in using our predicted values are encouraged to skip to the section entitled "Phase (3) - Predictions" below. We have "checkpointed" the output of each phase and saved the result to GoogleDrive. See details below for links to folders on GoogleDrive.
This GitHub repo replicates the content in our openICPSR repo, but we may update it later with new applications or improvements to the code pipeline. Interested parties are encouraged to refer to this github repo for any future updates.
Satellite imagery data were processed and extracted from Google Earth Engine (GEE, Gorelick et al. 2017) using the scripts in code/extract_imagery of this repository. Extracts constructed through GEE are available in the linked Google Drive repositories (below). These are intermediate datasets constructed from the raw GEE datasets "LANDSAT/LE07/C01/T1_SR" and "NOAA/DMSP-OLS/NIGHTTIME_LIGHTS"; we only interact with these raw data through the GEE platform. To access these raw data, or repeat our imagery extraction process, researchers must create a free academic account with Google Earth Engine.
Census data and geographic shapefiles from the Decennial Censuses and 5-year ACS products were extracted from the IPUMS National Historical Geographic Information System (Manson et al. 2020). The raw csv data tables used are included in data>labels>source_files. Users may extract these data freely after registering for an IPUMS account.
Census employment data was collected from the LODES program (Census, 2020). Raw zip files of these csv tables are included in data>labels>source_files>lodes. These data can also be accessed freely here.
Phase 2 and the imagery processing stages in Phase 1 and Phase 3 were run on a Linux computer using Python 3.6–3.8, pip and venv >= 19.0, tensorflow >= 2.0, numpy >= 1.2.
Other packages needed (see setup.sh): tables, pydrive, pandas, tqdm
To start, run . ./setup.sh. This script will create a virtual environment for Python, install dependencies, and set the project root directory. Please ensure that the environment variable CNN_PROJECT_ROOT is set to the root directory of this repository before continuing. That is, echo $CNN_PROJECT_ROOT should return something like /home/username/this_repo.
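The check described above can also be done from Python before launching any of the scripts. The following is a minimal sketch, not part of the repository: the helper name project_root is hypothetical, and it only assumes that CNN_PROJECT_ROOT should point at an existing directory.

```python
import os
from pathlib import Path


def project_root() -> Path:
    """Return the repository root from CNN_PROJECT_ROOT, or raise a clear error.

    Hypothetical helper for illustration; the repository's own scripts may
    read the variable differently.
    """
    root = os.environ.get("CNN_PROJECT_ROOT")
    if not root:
        raise RuntimeError("CNN_PROJECT_ROOT is not set; run `. ./setup.sh` first.")
    path = Path(root)
    if not path.is_dir():
        raise RuntimeError(f"CNN_PROJECT_ROOT points to a missing directory: {path}")
    return path
```

Calling project_root() at the top of a script fails early with a readable message instead of a later "file not found" error.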
Manipulation of Census source data in Phase 1 and construction of tables/figures in the manuscript was conducted on a Windows computer using Stata 16.
Extraction of imagery from Google Earth Engine in Phase 1 was done in a Python conda environment on a Windows computer. The file code>extract_imagery>gee_conda_requirements.txt describes the setup for this environment.
Spatial computations in Phase 1 are conducted using ArcGIS Pro and the associated Python API arcpy, which was set up in a Python 2.7 environment on a Windows computer. The file arpy_requirements.txt in code>generate_image_labels>python lists the necessary Python setup for these computations.
Neural Networks and related code were implemented using TensorFlow along with PyTables, NumPy, Pandas, and PyDrive.
The final results of the project that are included in the manuscript can be replicated in less than two hours on a modern desktop computer by following the instructions in phase 3.
The models created in this project are not feasible to run on a standard desktop machine. They were trained using a specialized University of California computing cluster built to perform large scale machine learning tasks. This cluster included dozens of high speed GPUs and several terabytes of disk space. We would estimate that re-training of all of our models would take 1 month or more if run on a modern super-computer level cluster.
The data extraction and cleaning steps can be run on a standard desktop computer but are also time intensive. We estimate that this phase (1) of the project would take approximately 3 weeks to run on a standard desktop computer.
To the extent possible, we have included intermediate datasets ("generated_files") in our repository to allow for independent execution, verification, and modification of each stage of our analysis without starting the full data extraction and model training process from scratch. Some of these files are too large to store in this repository and are stored on GoogleDrive. Links have been provided in the documentation below.
In this phase, we extract the raw imagery and census data used to train our models. Cleaned and processed data ready for training (i.e. the output of this phase) is available in the data folder of this GoogleDrive. This phase is divided into the following steps:
This sub-phase creates the raw image exports and downloads them to a local machine for training and cleaning.
General order of operations: export_*.py -> download_data.py
- Create raw data export from Google Earth Engine: This is performed by the scripts code/extract_imagery/export_*.py. These programs define the extract, and the resulting data are written (as many small TFRecord files) to a folder in Google Drive.
- Download Data: We next download the data produced by step (1) and prepare it for training. The script code/extract_imagery/download_data.py downloads the raw data from Google Drive (using google_drive_utils.py), discards images that do not meet our urbanization threshold, and converts the remaining images into a large HDF5 file, which is easier to store locally than many small TFRecord files. We also assign each image an identifier in this stage that can be used to match it with its label(s). Users will need to set the root_dir_id global variable in this file to align with the output folder from step (1) above. Users may also need to modify the "mode" argument to process a new data set. See the important note below regarding this step. To run the script, use python download_data.py [large,small,mw], where large builds the "large" national imagery, small builds the small imagery, and mw builds the midwest (high-resolution) data. This code takes a considerable amount of time (several days) to run in its entirety, as it requires downloading a large amount of data.
Important Note: This phase requires interacting with GoogleDrive's Python API. The script google_drive_utils.py helps automate this somewhat. To run the code, you will need to follow the instructions of PyDrive (here) to set up a Google API project and create a client_secrets.json file. This file should be placed in the scripts directory. The first time you run the code, you will be asked to authenticate via a command line prompt. Copy and paste the URL from the command line into a browser and follow the instructions to authorize the PyDrive API. Subsequent runs of this code will use the cached authentication. Unfortunately, we have not found a good way to make this process less cumbersome.
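For readers unfamiliar with PyDrive, the authenticate-once-then-cache flow described in the note typically looks like the sketch below. This is an illustration under stated assumptions, not the contents of google_drive_utils.py: the function name get_drive and the credentials file name are hypothetical, and client_secrets.json is assumed to be in the working directory.

```python
def get_drive(credentials_path="credentials.json"):
    """Authenticate with Google Drive via PyDrive, caching the token on disk.

    Hypothetical helper; `credentials_path` is an illustrative file name.
    """
    # Imported lazily so this module can be loaded where PyDrive is not installed.
    from pydrive.auth import GoogleAuth
    from pydrive.drive import GoogleDrive

    gauth = GoogleAuth()  # reads client_secrets.json from the working directory
    gauth.LoadCredentialsFile(credentials_path)  # no-op on the very first run
    if gauth.credentials is None:
        # First run: prints a URL to open in a browser and asks for the code.
        gauth.CommandLineAuth()
    elif gauth.access_token_expired:
        gauth.Refresh()
    else:
        gauth.Authorize()
    gauth.SaveCredentialsFile(credentials_path)  # cache for subsequent runs
    return GoogleDrive(gauth)
```

Only the first call requires the browser step; later calls reuse or refresh the saved credentials.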
This sub-phase creates labels and baseline features for the raw data, merges these with the raw images downloaded in the previous step, and formats the data for training using the TensorFlow data pipeline described here.
General order of operations: construct_labels.do -> prep_data_levels.py -> prep_data_diffs.py -> prep_data_testing.py -> shard_data.py
- Construct Label File: The script download_data.py will produce a file (/output/valid_imgs.txt) listing all images that meet our urbanization threshold and are not otherwise invalid. This file is keyed by (lat,lng) or, equivalently, the img_id variable. This file can be used as input to the labeling scripts.
- Construct Ground Truth Labels: The script code/generate_image_labels/generate_image_labels.do conducts and documents how Census data are cleaned and interpolated into ground truth image labels. This script calls three subsequent Stata scripts and indicates the order in which to run the associated Python (arcpy) script that computes intersections between image boundaries and Census block boundaries.
- Prepare Training Data: Next, we process the HDF5 file produced by download_data.py into a form suitable for use in TensorFlow. In this phase, we also match each image with its ground truth label (i.e. the outcome to be predicted), partition the data into train, validation, and test sets, and strip off the overlap that Google Earth Engine adds (i.e. the KernelSize parameter in GEE). This is performed in prep_data_levels.py and prep_data_diffs.py for levels and diffs models respectively. The script prep_data_testing.py prepares data for final prediction. This is done separately because we use a slightly different format for prediction data than for training models. Finally, to improve processing speed in TensorFlow, we split the large TFRecord files produced by these scripts into small shards that can be loaded more efficiently. This is performed in shard_data.py.
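The sharding idea in the last step can be illustrated with a small, dependency-free sketch. It is not the logic of shard_data.py: the function name, the file naming pattern, and the length-prefixed binary format are assumptions made for illustration, and in a TensorFlow pipeline each shard would instead be written with tf.io.TFRecordWriter.

```python
from pathlib import Path
from typing import Iterable, List


def shard_records(records: Iterable[bytes], out_dir: Path, num_shards: int,
                  prefix: str = "train") -> List[Path]:
    """Distribute serialized records round-robin across `num_shards` files.

    Hypothetical sketch: plain binary files stand in for TFRecord shards.
    """
    out_dir.mkdir(parents=True, exist_ok=True)
    paths = [out_dir / f"{prefix}-{i:05d}-of-{num_shards:05d}.bin"
             for i in range(num_shards)]
    handles = [p.open("wb") for p in paths]
    try:
        for index, record in enumerate(records):
            handle = handles[index % num_shards]
            # Length-prefix each record so a reader can split them again.
            handle.write(len(record).to_bytes(8, "little"))
            handle.write(record)
    finally:
        for handle in handles:
            handle.close()
    return paths
```

Many small shards of similar size let a data loader read several files in parallel and shuffle at the file level, which is the motivation given above for splitting the large files.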
The output of this phase is made available in the data folder here: GoogleDrive. Users who wish to use our existing data but experiment with new model architectures may download this data and uncompress it (tar -xvf ...) to the data sub-folder of this repository.
This phase defines model architectures, trains models, and performs hyperparameter tuning. Model architectures are defined in models.py and training code is contained in train_level_model.py and train_diff_model.py. We have provided the run.sh script, which includes detailed descriptions of the parameters and command line arguments used by the training scripts. See the "instructions" section below for more details on running training code.
General order of operations: train_level_model.py -> train_diff_model.py
Trained models are available in the weights folder here: GoogleDrive. These trained weights may be used for transfer learning (e.g. using our weights to generate features from different imagery). Models are named according to the following convention: block_[small,large]_national_[level,diff]_base[_feature]_[inc,pop,inc_pop][_all], where small/large indicates the imagery size (i.e. the 40x40 vs. 80x80 imagery), level/diff indicates models for levels vs. diffs, the presence of _feature indicates models trained with initial conditions (i.e. auxiliary features), inc/pop/inc_pop indicates the outcome variable (inc_pop trains a model for income per capita), and the presence of _all indicates models trained on the entire set of images in 2000/2010 (i.e. those used to produce the out-of-period results in Table 2).
- Download the data from the data folder here: GoogleDrive.
- Move into the data/ directory of this repository and un-compress with tar -xvf ...
- Run the script run_training.sh. There are several different run configurations listed in run_training.sh which can reproduce the various aspects of the paper (e.g. RGB only models or models with nightlights). Inspect run_training.sh for more detail.
- Run tensorboard by running tensorboard --logdir='out_dir/logs' in a terminal to monitor the training process and validation results.
General order of operations: make_predictions_level.py -> make_predictions_diff.py
This phase uses trained models to generate predictions for each image in our data set and finally produces the results in our draft based on these predictions.
- Download the trained models from the weights folder here: GoogleDrive, or run the training scripts as described in the previous phase. Each model is represented by a directory in the folder linked here. Copy the models to the /weights folder of this repository.
- Download the prediction data (if you haven't already), which is the testing_dataset in the data folder here: GoogleDrive, and unzip it to the /data folder of this repository.
- Run the script run_predictions.sh. There are several different run configurations listed in run_predictions.sh which can reproduce the various aspects of the paper. Inspect run_predictions.sh for more detail. The final hyperparameters we selected (via grid search as described in the paper) are hard coded in the calls here. Predictions are written as CSV files keyed by img_id to the /data/predictions folder of this repository (although this can be configured to write somewhere else as needed). If you encounter errors complaining that a model could not be found, please ensure the environment variable CNN_PROJECT_ROOT is set correctly (i.e. to the root directory of this repository).
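Because the prediction files are plain CSVs keyed by img_id, they can be loaded with the standard library alone. The sketch below is an illustration rather than part of the pipeline: load_predictions is a hypothetical helper, and apart from img_id it assumes nothing about column names other than that the remaining columns hold numeric values.

```python
import csv
from typing import Dict, TextIO


def load_predictions(handle: TextIO, key: str = "img_id") -> Dict[str, Dict[str, float]]:
    """Read a predictions CSV into a dict keyed by image identifier.

    Hypothetical helper; every column other than `key` is parsed as a float.
    """
    predictions: Dict[str, Dict[str, float]] = {}
    for row in csv.DictReader(handle):
        img_id = row.pop(key)
        # Skip blank or missing cells rather than failing on them.
        predictions[img_id] = {
            name: float(value)
            for name, value in row.items()
            if value not in ("", None)
        }
    return predictions
```

A typical call would be `with open(path, newline="") as f: preds = load_predictions(f)`, where path points at one of the files under the predictions folder.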
Predictions and geographic shapefiles at the level of our large images can be accessed in this repository under data>applications. Also shared is a version which has been geographically crosswalked to 2010 Census Blocks. The unique identifier for images is img_id, and for Census Blocks it is gisjoin (as defined by NHGIS here). Each file contains predictions of income and population generated by our out-of-sample model (as described in section 3.4). The variable inc_0_feature, for example, refers to the predicted log income in 2000, using the model including initial conditions. The variable dpop_9_19, conversely, is our prediction of the log change in population from 2009 to 2019, in the model excluding initial conditions. Note that difference predictions are not crosswalked to the Block version of these data.
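Since the predictions are expressed in logs, users who need levels or growth rates will have to transform them. The sketch below shows the arithmetic under the assumption that natural logarithms are used; the function names are hypothetical. Note that exponentiating a predicted log value gives a simple back-transformation, not a bias-corrected prediction of the level.

```python
import math


def log_level_to_level(log_value: float) -> float:
    """Back-transform a log-level prediction (assumes natural logs)."""
    return math.exp(log_value)


def log_change_to_growth(dlog: float) -> float:
    """Convert a log change into a proportional growth rate.

    For example, a log change equal to log(1.1) corresponds to 10% growth.
    """
    return math.exp(dlog) - 1.0
```

For small values the log change and the growth rate are nearly equal, but they diverge as the change grows, so the conversion matters for fast-growing areas.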
- The Stata do-file /code/produce_results/produce results.do uses the predictions generated in the last step to construct and directly export each of the tables, figures, and maps included in our manuscript. This code relies on the esttab package, which can be installed by entering "ssc install estout" in the Stata console.
- The maps in our manuscript are constructed in an ArcGIS Pro workspace which has been included in code/produce_results/produce_maps/Satellite_CNN_Visuals. The contained files create a replica of this workspace and show the layers producing the manuscript maps.
- LANDSAT daytime satellite imagery: Approximately 8TB as raw TFRecord extract from Google Earth Engine. 368GB when processed into tar files (in "data" folder on drive).
- DMSP-OLS nighttime light imagery: Approximately 29GB as raw csv extract from Google Earth Engine. 36MB when processed into labelled image files (i.e. data>nightlights>largeimg_dmsplabelled_merged_00_10.dta).
- Census data on population, income, and demographics: Approximately 11GB raw.
- LODES data on resident area employment in 2004: Approximately 87MB raw.
Note: Census and LODES data are processed together to create the joint file data>labels>generated_files>block_labels_cw.dta, which is approximately 3GB.
Folder data link (368GB; 9 files):
Contains processed satellite imagery in TFRecord format used directly as input to model training. Folder contents are as follows:
- large_block_all_national.tar (49GB) - Large national imagery in all years (levels)
- small_block_diff_mw.tar (50GB) - Small midwest imagery for 2010-2000 diffs
- small_block_all_mw.tar (50GB) - Small midwest imagery in all years (levels)
- small_block_diff_national.tar (33GB) - Small national imagery for 2010-2000 diffs
- small_block_all_national.tar (36GB) - Small national imagery for all years (levels)
- large_block_diff_national.tar (48GB) - Large national imagery for 2010-2000 diffs
- testing_dataset.zip (80GB) - Dataset used to test models (e.g. the "test" column in tables)
Note: we will not be storing raw google earth export files long-term, as these files occupy over 8TB in total and can be reproduced using the code in Phase 1 and an academic Google Earth Engine account. Anyone interested in these TFRecord files may reach out and we will happily share the files we still have.
Folder weights link (141GB; 18,511 files):
Contains trained model weights stored in TensorFlow format. Each model is represented as a directory and named according to the following convention:
block_[IMG_SIZE]_[REGION]_[YEAR]_[TYPE]_[FEATURES]_[RESOLUTION]_[OUTCOME]_[EPOCHS]_[ALL]
Where parameters are defined as follows:
- IMG_SIZE: image size (small or large)
- REGION: national or mw
- YEAR: level or diff
- TYPE: base (baseline models with all bands), RGB (RGB only), nl (all bands + nightlights)
- FEATURES: if present (_feature), indicates the model includes auxiliary/baseline features
- RESOLUTION: if present (_high), indicates the model was trained on high-resolution imagery
- OUTCOME: pop or inc
- EPOCHS: integer number of epochs for which the model was trained
- ALL: if present, indicates the model was trained using both train and validation data
For example, block_small_national_level_base_feature_pop_200_all indicates a model trained on all (i.e. train and validation) small national imagery, using all spectral bands (no nightlights) and auxiliary features, to predict population in levels over 200 epochs of training.
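When working with many of these directories programmatically, it can help to parse the name back into its components. The sketch below is one way to do this with a regular expression built from the convention above; the helper parse_model_name is hypothetical and not part of the repository. It also accepts inc_pop as an outcome, since that value appears in the naming convention given for the training phase.

```python
import re
from typing import Dict, Optional

# Pattern assembled from the naming convention; optional parts are bracketed there.
MODEL_NAME = re.compile(
    r"^block_(?P<img_size>small|large)"
    r"_(?P<region>national|mw)"
    r"_(?P<year>level|diff)"
    r"_(?P<type>base|RGB|nl)"
    r"(?P<feature>_feature)?"
    r"(?P<high>_high)?"
    r"_(?P<outcome>inc_pop|inc|pop)"
    r"_(?P<epochs>\d+)"
    r"(?P<all>_all)?$"
)


def parse_model_name(name: str) -> Optional[Dict[str, object]]:
    """Split a model directory name into its components, or return None."""
    match = MODEL_NAME.match(name)
    if match is None:
        return None
    parts = match.groupdict()
    return {
        "img_size": parts["img_size"],
        "region": parts["region"],
        "year": parts["year"],
        "type": parts["type"],
        "feature": parts["feature"] is not None,
        "high_resolution": parts["high"] is not None,
        "outcome": parts["outcome"],
        "epochs": int(parts["epochs"]),
        "all": parts["all"] is not None,
    }
```

For instance, parse_model_name("block_small_mw_diff_RGB_feature_high_pop_100") reports an RGB-only, high-resolution midwest diff model with auxiliary features, trained for 100 epochs.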
Folder contents are as follows. The approximate size of each directory is 1-5 GB.
- block_small_national_level_base_pop_200_all
- block_small_national_level_base_inc_200_all
- block_small_national_level_base_feature_pop_200_all
- block_small_national_diff_base_pop_100_all
- block_small_national_level_base_feature_inc_200_all
- block_small_national_diff_base_feature_inc_100_all
- block_large_national_level_base_feature_pop_200_all
- block_small_national_diff_base_inc_100_all
- block_small_national_diff_base_feature_pop_100_all
- block_large_national_level_base_feature_inc_200_all
- block_large_national_diff_base_feature_pop_100_all
- block_large_national_diff_base_pop_100_all
- block_large_national_diff_base_inc_100_all
- block_large_national_diff_base_feature_inc_100_all
- block_large_national_level_base_pop_200_all
- block_large_national_level_base_inc_200_all
- block_small_national_diff_base_feature_pop_100
- block_small_national_diff_base_feature_inc_100
- block_small_national_diff_base_pop_100
- block_small_national_diff_base_inc_100
- block_large_national_diff_base_pop_100
- block_large_national_diff_base_inc_100
- block_large_national_diff_base_feature_pop_100
- block_large_national_diff_base_feature_inc_100
- block_small_national_level_base_feature_pop_200
- block_small_national_level_base_pop_200
- block_small_national_level_base_inc_200
- block_small_national_level_base_feature_inc_200
- block_large_national_level_base_inc_200
- block_large_national_level_base_pop_200
- block_large_national_level_base_feature_pop_200
- block_large_national_level_base_feature_inc_200
- block_small_national_level_base_inc_pop_200
- block_small_national_level_base_feature_inc_pop_200
- block_small_national_diff_base_inc_pop_100
- block_small_national_diff_base_feature_inc_pop_100
- block_large_national_diff_base_inc_pop_100
- block_large_national_level_base_inc_pop_200
- block_large_national_level_base_feature_inc_pop_200
- block_large_national_diff_base_feature_inc_pop_100
- block_small_national_diff_RGB_pop_100
- block_small_national_diff_RGB_feature_pop_100
- block_small_national_diff_RGB_inc_100
- block_small_national_diff_RGB_feature_inc_100
- block_small_mw_diff_RGB_pop_100
- block_small_mw_diff_RGB_inc_100
- block_small_mw_diff_RGB_high_pop_100
- block_small_mw_diff_RGB_high_inc_100
- block_small_mw_diff_RGB_feature_pop_100
- block_small_mw_diff_RGB_feature_high_pop_100
- block_small_mw_diff_RGB_feature_inc_100
- block_small_mw_diff_RGB_feature_high_inc_100
- block_large_national_diff_RGB_pop_100
- block_large_national_diff_RGB_feature_pop_100
- block_large_national_diff_RGB_inc_100
- block_large_national_diff_RGB_feature_inc_100
- block_large_national_diff_nl_pop_100
- block_large_national_diff_nl_inc_100
- block_large_national_diff_nl_feature_inc_100
- block_large_national_diff_nl_feature_pop_100
- block_small_national_level_RGB_pop_200
- block_small_national_level_RGB_inc_200
- block_small_national_level_RGB_feature_pop_200
- block_small_national_level_RGB_feature_inc_200
- block_large_national_level_RGB_feature_pop_200
- block_large_national_level_RGB_feature_inc_200
- block_large_national_level_RGB_inc_200
- block_large_national_level_RGB_pop_200
- block_large_national_level_nl_pop_200
- block_large_national_level_nl_inc_200
- block_large_national_level_nl_feature_pop_200
- block_large_national_level_nl_feature_inc_200
- block_small_mw_level_RGB_feature_inc_200
- block_small_mw_level_RGB_inc_200
- block_small_mw_level_RGB_feature_high_pop_200
- block_small_mw_level_RGB_high_inc_200
- block_small_mw_level_RGB_pop_200
- block_small_mw_level_RGB_feature_pop_200
- block_small_mw_level_RGB_feature_high_inc_200
- block_small_mw_level_RGB_high_pop_200
Gorelick, N., M. Hancher, M. Dixon, S. Ilyushchenko, D. Thau, and R. Moore (2017). Google earth engine: Planetary-scale geospatial analysis for everyone. Remote Sensing of Environment.
Manson, S., J. Schroeder, D. Van Riper, T. Kugler, and S. Ruggles (2020). IPUMS National Historical Geographic Information System: Version 15.0 [dataset]. Minneapolis, MN: IPUMS. http://doi.org/10.18128/D050.V15.0.
U.S. Census Bureau. (2020). LEHD Origin-Destination Employment Statistics Data (2004) [dataset]. Washington, DC: U.S. Census Bureau, Longitudinal-Employer Household Dynamics Program, Version 7.
United States Geological Survey. "Landsat 7 Surface Reflectance Tier 1 (2000-2019)." Google Earth Engine. LANDSAT/LE07/C01/T1_SR. https://developers.google.com/earth-engine/datasets/catalog/LANDSAT_LE07_C01_T1_SR#description.
National Oceanic and Atmospheric Administration. "DMSP OLS: Nighttime Lights Time Series Version 4 (2000-2010)." Google Earth Engine. NOAA/DMSP-OLS/NIGHTTIME_LIGHTS. https://developers.google.com/earth-engine/datasets/catalog/NOAA_DMSP-OLS_NIGHTTIME_LIGHTS#description.
The data are licensed under a Creative Commons/CC-BY-NC license. See License.txt for details.