Repository of code associated with the paper "Using Neural Networks to Predict Micro-Spatial Economic Growth", which is forthcoming at the journal AER: Insights.
This code was run on a Linux computer using Python 3.6–3.8, pip and venv >= 19.0, tensorflow >= 2.0, numpy >= 1.2
Other packages needed (see `run.sh`): pytables, pydrive, pandas, tqdm
Our code pipeline is divided into three phases: (1) raw data extraction and processing, (2) model training and validation, and (3) prediction and analysis. The output of phase (1) is a set of `.tfrecord` files containing the train, validation, and test data. The output of phase (2) is a set of trained CNNs. The output of phase (3) is a set of predictions of our outcome variables.
Phases (1) and (2) will be of interest to those who want to extend our basic approach to other sources of imagery and/or outcome variables, as well as to those who want to understand our approach in (possibly excruciating) detail. Phase (3) will be of interest to those who want to use our predictions for another application. Each of these phases is described in more detail below. Those interested primarily in using our predicted values are encouraged to skip to the section entitled "Phase (3) - Predictions" below.
In this phase, we extract the raw imagery and census data used to train our models. Cleaned and processed data ready for training (i.e., the output of this phase) is available on GoogleDrive. This phase is divided into the following steps:
General order of operations: `extract_gee_data.py` -> `download_data.py` -> `prep_data_levels.py` -> `prep_data_diffs.py` -> `shard_data.py`
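The order of operations above can be sketched as a sequence of commands. This is a sketch only: the working directory and the exact arguments are assumptions — see each script and `run.sh` for the real invocations.

```shell
cd scripts

# (1) Define the Google Earth Engine export; output lands in Google Drive.
python extract_gee_data.py

# (2) Download the raw TFRecords from Google Drive and build the HDF5 file.
#     Run once per imagery size.
python download_data.py large
python download_data.py small

# (3) Match images to labels, split train/val/test, strip the GEE overlap.
python prep_data_levels.py
python prep_data_diffs.py

# (4) Split the large TFRecord files into small shards for faster loading.
python shard_data.py
```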
- Create raw data export from GoogleEarthEngine: This is performed by the file `scripts/extract_gee_data.py`. The file defines the extract, and the resulting data is written (as many small TFRecord files) to a folder in Google Drive.
- Download Data: We next download the data produced by step (1) and prepare it for training. The script `download_data.py` downloads the raw data from Google Drive (using `google_drive_utils.py`), discards images that do not meet our urbanization threshold, and converts them into a large `HDF5` file, which is easier to store locally than many small `tfrecord` files. We also assign each image an identifier in this stage that can be used to match it with its label(s). Users will need to set the `root_dir_id` global variable in this file to align with the output folder from step (1) above. Users may also need to modify the "mode" argument to process a new data set. See the "important note" below regarding this step. To run the script, use `python download_data.py [large,small]`, where `large` builds the "large" national imagery and `small` builds the small imagery.
- Construct Label File: The script `download_data.py` will produce a file (`../output/valid_imgs.txt`) of all images that meet our urbanization threshold and are not otherwise invalid. This file is keyed by `(lat,lng)` or, equivalently, the `img_id` variable.
- Prepare Training Data: Next, we process the HDF5 file produced in the previous stage into a form suitable for use in TensorFlow. In this phase, we also match each image with its ground-truth label (i.e., the outcome to be predicted), partition the data into train, validation, and test sets, and strip off the overlap that Google Earth Engine adds (i.e., the KernelSize parameter in GEE). This is performed in `prep_data_levels.py` and `prep_data_diffs.py` for levels and diffs models, respectively. Finally, to improve processing speed in TensorFlow, we split the large TFRecord files produced by these scripts into small shards that can be loaded more efficiently. This is performed in `shard_data.py`.
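As an illustration of the overlap-stripping step, the sketch below crops the border that GEE's KernelSize export option adds around each tile. The function name and the assumption that the overlap is `kernel_size // 2` pixels per side are ours, not the repository's — check `prep_data_levels.py` for the actual logic.

```python
import numpy as np

def strip_gee_overlap(img: np.ndarray, kernel_size: int) -> np.ndarray:
    """Crop the border overlap that a Google Earth Engine export adds
    when the kernelSize parameter is set. Assumes the overlap is
    kernel_size // 2 pixels on each side of an (H, W, C) image."""
    pad = kernel_size // 2
    return img[pad:-pad, pad:-pad, :]

# e.g., a 48x48 export with kernelSize=8 would yield a 40x40 "small" image
tile = np.zeros((48, 48, 3))
small = strip_gee_overlap(tile, 8)  # shape (40, 40, 3)
```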
Important Note: This phase requires interacting with GoogleDrive's Python API, which is a somewhat convoluted process. The script `google_drive_utils.py` helps automate this somewhat. To run the code, you will need to follow the instructions of PyDrive (here) to set up a Google project and create a `client_secrets.json` file. This file should be placed in the `scripts` directory. The first time you run the code, you will be asked to authenticate via a command-line prompt. Copy and paste the URL from the command line into a browser and follow the instructions to authorize the PyDrive API. Unfortunately, we have not found a good way to make this process less cumbersome.
The output of this phase is made available here: GoogleDrive. Users who wish to use our existing data but experiment with new model architectures may download this data and uncompress it (`tar -xvf ...`) into the `data` sub-folder of this repository.
This phase defines model architectures, trains models, and performs hyperparameter tuning. Model architecture and training code is contained in `train_level_model.py` and `train_diff_model.py`. We have provided the `run.sh` script, which includes detailed descriptions of the parameters and command-line arguments used by the training scripts.
General order of operations: `train_level_model.py` -> `train_diff_model.py`
Trained models are available here: GoogleDrive. These trained weights may be used for transfer learning (e.g., using our weights to generate features from different imagery). Models are named according to the following convention: `block_[small,large]_national_[level,diff]_base[_feature]_[inc,pop,inc_pop][_all]`, where `small`/`large` indicates the imagery size (i.e., the 40x40 vs. 80x80 imagery), `level`/`diff` indicates models for levels vs. diffs, the presence of `_feature` indicates models trained with initial conditions (i.e., auxiliary features), `inc`/`pop`/`inc_pop` indicates the outcome variable (using `inc_pop` trains a model for income per capita), and the presence of `_all` indicates models trained on the entire set of images in 2000/2010 (used to produce the out-of-period results in Table 2).
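To make the convention concrete, here is a small helper (hypothetical — not part of the repository) that unpacks a model name into its components:

```python
import re

def parse_model_name(name: str) -> dict:
    """Parse a trained-model name such as
    block_small_national_level_base_feature_inc_pop_all
    into its components, per the naming convention above."""
    pattern = (r"block_(?P<size>small|large)_national_"
               r"(?P<kind>level|diff)_base"
               r"(?P<feature>_feature)?"
               r"_(?P<outcome>inc_pop|inc|pop)"  # inc_pop must be tried first
               r"(?P<all>_all)?$")
    m = re.match(pattern, name)
    if m is None:
        raise ValueError(f"unrecognized model name: {name}")
    return {
        "imagery": "40x40" if m.group("size") == "small" else "80x80",
        "model": m.group("kind"),
        "initial_conditions": m.group("feature") is not None,
        "outcome": m.group("outcome"),
        "all_images": m.group("all") is not None,
    }
```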
- Download data: GoogleDrive
- Move into the `data/` directory and un-compress: `tar -xvf ...`
- Move into the `weights/` directory and un-compress: `tar -xvf ...`
- Run the script `run.sh`. The script will create a virtual environment, install dependencies, and run the training scripts. There are several different run configurations listed in `run.sh` which can reproduce the various aspects of the paper (e.g., RGB-only models or models with nightlights). Inspect `run.sh` for more detail.
- Run TensorBoard via `tensorboard --logdir='out_dir/logs'` in a terminal to monitor the training process and validation results.
Predictions and geographic shapefiles at the level of our large images can be accessed here. Also shared is a version which has been geographically crosswalked to 2010 Census Blocks. The unique identifier for images is `img_id`, and for Census Blocks is `gisjoin` (as defined by NHGIS here). Within each file are predictions of income and population generated by our out-of-sample model (as described in section 3.4). The variable `inc_0_feature`, for example, refers to the predicted log income in 2000, using the model including initial conditions. The variable `dpop_9_19`, conversely, is our prediction of the log change in population from 2009 to 2019, from the model excluding initial conditions. Note that difference predictions are not crosswalked to the Block version of these data.
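Joining the image-level predictions to Census Blocks might look like the following sketch. It is illustrative only: the toy values and the in-memory frames are ours, while the column names `img_id`, `gisjoin`, `inc_0_feature`, and `dpop_9_19` follow the conventions described above.

```python
import pandas as pd

# Toy stand-ins for the shared prediction and crosswalk files.
preds = pd.DataFrame({
    "img_id": ["img_001", "img_002"],
    "inc_0_feature": [10.2, 9.8],  # predicted log income, 2000, with initial conditions
    "dpop_9_19": [0.05, -0.01],    # predicted log population change, 2009-2019
})
crosswalk = pd.DataFrame({
    "img_id": ["img_001", "img_002"],
    "gisjoin": ["G0100010", "G0100020"],  # NHGIS Census Block identifiers
})

# Attach a Block identifier to each image-level prediction.
blocks = preds.merge(crosswalk, on="img_id", how="left")
print(blocks[["gisjoin", "inc_0_feature"]])
```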
`predict_level.py` contains the code for generating predictions for all images from 2000 to 2019. We have provided the `run.sh` script, which includes detailed descriptions of the parameters and command-line arguments used by the prediction scripts. Note that currently these scripts can only generate predictions for the national base model, for `large`/`small` imagery and the `inc`/`pop`/`inc_pop` outcomes, with or without initial conditions.