Source code for:
J. Walchessen, A. Lenzi, and M. Kuusela. Neural Likelihood Surfaces for Spatial Processes with Computationally Intensive or Intractable Likelihoods. arXiv preprint arXiv:2305.04634 [stat.ME], 2023.
Contact Julia Walchessen at jwalches@andrew.cmu.edu with any questions.
This document describes the code for each case study in "Neural Likelihood Surfaces for Spatial Processes with Computationally Intensive or Intractable Likelihoods." It lists each folder in the repository and its contents; to see what a given folder pertains to, scroll down to the corresponding entry. The package environment for this project is specified in requirements.txt.
This folder contains code to 1. generate the data used to train the neural network (generate_nn_data folder), 2. train the neural network (train_nn folder), and 3. calibrate the neural network (calibration folder), as well as the trained networks themselves (models folder).
The script generate_nn_data.R simulates the training and validation data for the neural network. The user needs to specify the parameter space (its boundaries), how many parameters to sample from this parameter space, and how many spatial field realizations to simulate for each sampled parameter. The script produces one json file per sampled parameter. Each json file is a list of (spatial field realization, parameter) pairs, where the parameter is either the one that generated the field or a permuted one, together with the class label of each pair.
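As a rough illustration (not the repository's R code), the following Python sketch shows the pairing scheme: each simulated field is paired once with the parameter that generated it (class 1) and once with a mismatched parameter (class 0). The exponential covariance, the parameter ranges, and the small sample sizes are assumptions made for the example.

```python
# Hypothetical Python illustration of the pairing scheme in generate_nn_data.R.
# The covariance family, parameter ranges, and sample sizes are assumptions.
import numpy as np

rng = np.random.default_rng(0)
n_side, n_params, n_reps = 25, 10, 4          # 25-by-25 grid, small counts for illustration

# Grid coordinates on the unit square and pairwise distances
xs, ys = np.meshgrid(np.arange(n_side), np.arange(n_side))
coords = np.column_stack([xs.ravel(), ys.ravel()]) / n_side
dists = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=-1)

def simulate_gp(length_scale, variance):
    """One mean-zero Gaussian process realization with exponential covariance."""
    cov = variance * np.exp(-dists / length_scale)
    return rng.multivariate_normal(np.zeros(n_side**2), cov).reshape(n_side, n_side)

# Parameters sampled uniformly from an assumed parameter space
params = rng.uniform([0.05, 0.5], [2.0, 2.0], size=(n_params, 2))

fields, paired_params, classes = [], [], []
for theta in params:
    for _ in range(n_reps):
        y = simulate_gp(*theta)
        # class 1: the field is paired with the parameter that generated it
        fields.append(y); paired_params.append(theta); classes.append(1)
        # class 0: the same field is paired with a permuted (mismatched) parameter
        fields.append(y); paired_params.append(params[rng.integers(n_params)]); classes.append(0)

fields, paired_params, classes = np.stack(fields), np.stack(paired_params), np.array(classes)
print(fields.shape, paired_params.shape, classes.shape)   # (80, 25, 25) (80, 2) (80,)
```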
The script gaussian_process_data_shaping.py processes the json files produced by generate_nn_data.R into three numpy matrices: 1. the spatial field realizations, 2. their corresponding paired parameters, and 3. the classes to which the (spatial field, parameter) pairs belong.
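A minimal sketch of this shaping step, assuming each json file is a list of records with keys "field", "parameter", and "class" (the real key names and file locations may differ):

```python
# Collect the per-parameter json files into three numpy matrices.
# The json key names and file locations below are assumptions.
import glob, json, os
import numpy as np

fields, params, classes = [], [], []
for path in sorted(glob.glob("generate_nn_data/data/*.json")):       # assumed location
    with open(path) as f:
        records = json.load(f)                                       # list of paired records
    for rec in records:
        fields.append(np.asarray(rec["field"], dtype=np.float32).reshape(25, 25))
        params.append(np.asarray(rec["parameter"], dtype=np.float32))
        classes.append(int(rec["class"]))

os.makedirs("train", exist_ok=True)
np.save("train/spatial_fields.npy", np.stack(fields))
np.save("train/parameters.npy", np.stack(params))
np.save("train/classes.npy", np.asarray(classes))
```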
The script gp_nn.py constructs the neural network with a specific architecture and trains it with the chosen tuning parameters (learning rate schedule and batch size) on a single GPU. The training data comes from the folders models/25_by_25/version_x/data/train and models/25_by_25/version_x/data/validation. To keep this repository small, no data is stored in these two folders.
The script gp_nn_with_distributed_training.py constructs the same neural network and trains it with the chosen tuning parameters (learning rate schedule and batch size) on multiple GPUs. The training data comes from the folders models/25_by_25/version_x/data/train and models/25_by_25/version_x/data/validation. To keep this repository small, no data is stored in these two folders.
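The sketch below (not the paper's exact architecture) shows the general shape of such a two-input classifier in Keras: a convolutional branch for the 25-by-25 field, a dense branch for the parameter, a learning rate schedule, and a MirroredStrategy scope for multi-GPU training. The layer sizes, hyperparameters, and file names are assumptions.

```python
# Minimal Keras sketch of a two-input classifier (field + parameter) with a learning
# rate schedule; wrapping construction in MirroredStrategy enables multi-GPU training.
# Architecture, hyperparameters, and .npy file names are assumptions, not the repo's exact values.
import numpy as np
import tensorflow as tf

strategy = tf.distribute.MirroredStrategy()          # uses all visible GPUs (or falls back to one device)
with strategy.scope():
    field_in = tf.keras.Input(shape=(25, 25, 1))
    param_in = tf.keras.Input(shape=(2,))
    x = tf.keras.layers.Conv2D(32, 3, activation="relu")(field_in)
    x = tf.keras.layers.MaxPooling2D()(x)
    x = tf.keras.layers.Conv2D(64, 3, activation="relu")(x)
    x = tf.keras.layers.Flatten()(x)
    p = tf.keras.layers.Dense(32, activation="relu")(param_in)
    z = tf.keras.layers.Concatenate()([x, p])
    z = tf.keras.layers.Dense(64, activation="relu")(z)
    out = tf.keras.layers.Dense(1, activation="sigmoid")(z)       # class probability
    model = tf.keras.Model([field_in, param_in], out)

    schedule = tf.keras.optimizers.schedules.ExponentialDecay(1e-3, decay_steps=1000, decay_rate=0.9)
    model.compile(optimizer=tf.keras.optimizers.Adam(schedule),
                  loss="binary_crossentropy", metrics=["accuracy"])

# Assumed file names, mirroring the shaping step above
fields = np.load("train/spatial_fields.npy")[..., None]
params = np.load("train/parameters.npy")
classes = np.load("train/classes.npy")
model.fit([fields, params], classes, batch_size=64, epochs=10, validation_split=0.1)
```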
This folder stores the models produced by gp_nn.py/gp_nn_with_distributed_training.py as well as diagnostics for these models. Our final model is stored in the folder 25_by_25/final_version. Within this folder, the architecture (gp_25_by_25_final_version_nn.json) and the weights (gp_25_by_25_final_version_nn_weights.h5) are stored in the folder model. The folders accuracy and loss contain visualizations of the training and validation accuracy and loss. The folder data contains the training and validation data used to fit the neural network, in the form of numpy matrices of parameters, spatial fields, and classes. Note that the models stored in this folder are uncalibrated.
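For reference, the stored architecture and weights can be reloaded with Keras roughly as follows (the path prefix relative to the repository root is an assumption):

```python
# Reload the final uncalibrated model from its architecture json and weight file.
# The path prefix is an assumption; adjust it to where the models folder lives.
import tensorflow as tf

model_dir = "models/25_by_25/final_version/model"
with open(f"{model_dir}/gp_25_by_25_final_version_nn.json") as f:
    model = tf.keras.models.model_from_json(f.read())
model.load_weights(f"{model_dir}/gp_25_by_25_final_version_nn_weights.h5")
model.summary()
```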
This folder contains scripts to 1. generate training data for calibration and 2. calibrate the model via Platt scaling (a form of logistic regression).
To train the logistic regression model, we need training data consisting of pairs of the uncalibrated class probability (i.e., the output of the uncalibrated neural network for a given spatial field y and parameter θ) and the corresponding true class label.
This script produces the sampled spatial fields y and the sampled parameters θ (steps 1 and 2) and saves the data as lists in json files. This script is similar to the script generate_nn_data.R in the folder generate_nn_data.
This python script is for processing the json files produced from produce_training_and_test_data_for_calibration_part_2.R into numpy matrices for calibration. Specifically, three numpy matrices (parameters, spatial fields, and classes) will be saved in the folder calibration/data/25_by_25/final_version/train or calibration/data/25_by_25/final_version/test. This script is similar to the script gaussian_process_data_shaping.py in the folder generate_nn_data.
This script is for obtaining classifier outputs for the training (and test) data for calibration. Calibration requires the classifier output and the true class label. The classifier outputs will be saved as numpy matrices in the folder calibration/data/25_by_25/final_version/train (or calibration/data/25_by_25/final_version/test).
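A sketch of this step, assuming the calibration data was shaped into spatial_fields.npy and parameters.npy as above (these file names, and the saved output name, are assumptions):

```python
# Run the uncalibrated classifier on the calibration data and save its outputs.
# File names are assumptions; the model is loaded as in the earlier sketch.
import numpy as np
import tensorflow as tf

model_dir = "models/25_by_25/final_version/model"
with open(f"{model_dir}/gp_25_by_25_final_version_nn.json") as f:
    model = tf.keras.models.model_from_json(f.read())
model.load_weights(f"{model_dir}/gp_25_by_25_final_version_nn_weights.h5")

calib_dir = "calibration/data/25_by_25/final_version/train"
fields = np.load(f"{calib_dir}/spatial_fields.npy")[..., None]
params = np.load(f"{calib_dir}/parameters.npy")
outputs = model.predict([fields, params], verbose=0)             # uncalibrated class probabilities
np.save(f"{calib_dir}/classifier_outputs.npy", outputs)
```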
This script loads the data from the folder calibration/data/25_by_25/final_version/train and uses the data to train a logistic regression model which is then saved in the folder model/25_by_25/final_version/logistic_regression_model_with_logit_transformation.pkl. The test data will be used in the folder evaluate_nn to produce reliability diagrams which illustrate how effective calibration is in achieving calibrated class probabilities.
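A minimal sketch of Platt scaling with a logit transformation, consistent with the saved file name logistic_regression_model_with_logit_transformation.pkl; the .npy file names and the clipping constant are assumptions:

```python
# Fit a logistic regression of the true class on the logit of the uncalibrated
# class probability (Platt scaling). File names and the clipping constant are assumptions.
import pickle
import numpy as np
from sklearn.linear_model import LogisticRegression

calib_dir = "calibration/data/25_by_25/final_version/train"
probs = np.load(f"{calib_dir}/classifier_outputs.npy").ravel()
classes = np.load(f"{calib_dir}/classes.npy").ravel()

eps = 1e-6                                         # avoid infinite logits at 0 or 1
probs = np.clip(probs, eps, 1 - eps)
logits = np.log(probs / (1 - probs)).reshape(-1, 1)

calibrator = LogisticRegression().fit(logits, classes)

with open("model/25_by_25/final_version/logistic_regression_model_with_logit_transformation.pkl", "wb") as f:
    pickle.dump(calibrator, f)

# Calibrated probability for new uncalibrated outputs p:
#   calibrator.predict_proba(np.log(p / (1 - p)).reshape(-1, 1))[:, 1]
```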
To evaluate our method in the Gaussian process case, we create an evaluation data set described in Section 4.1 of our paper. This evaluation data set consists of 200 spatial field realizations per parameter on a grid over the parameter space.
This R script generates evaluation data for assessing the neural likelihood surfaces, parameter estimates, and confidence regions in the single realization case. The evaluation data consists of spatial fields, the parameters that generated them, and the exact log likelihood field (over the parameter space) for each spatial field. The evaluation data comprises n single realizations for each of m parameters, where the m parameters come from a grid over the parameter space. The evaluation data is saved as json files (the number of json files equals the number of parameters on the grid).
This R script generates evaluation data for assessing the neural likelihood parameter estimates in the multiple realization case. The evaluation data consists of spatial fields, the parameters that generated them, and the exact log likelihood field (over the parameter space) for each spatial field. The evaluation data comprises n realizations of 5 replications of the spatial field for each of m parameters, where the m parameters come from a grid over the parameter space. The evaluation data is saved as json files (the number of json files equals the number of parameters on the grid).
This python script processes the json files produced by either generate_evaluation_data_for_single_realization_case.R or generate_evaluation_data_for_multi_realization_case.R into numpy matrices (images, parameters, log likelihood fields) which are saved in the folders evaluate_nn/generate_data/data/25_by_25/single/reps/200 or evaluate_nn/generate_data/data/25_by_25/multi/5/reps/200.
This Python script takes the log likelihood fields for the evaluation data stored in the folder evaluate_nn/generate_data/data/25_by_25/single/reps/200 or evaluate_nn/generate_data/data/25_by_25/multi/5/reps/200 and produces parameter estimates from these log likelihood fields. The parameter estimates for the evaluation data are stored in the folders produce_exact_likelihood_estimates/data/25_by_25/single/reps/200 and produce_exact_likelihood_estimates/data/25_by_25/multi/5/reps/200.
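Conceptually, the point estimate for each field is the parameter grid point that maximizes its log likelihood field, roughly as in the sketch below (array names and shapes are assumptions):

```python
# Turn likelihood fields evaluated on a parameter grid into point estimates by taking
# the argmax over the grid. Array names and shapes are assumptions.
import numpy as np

log_lik_fields = np.load("log_likelihood_fields.npy")     # assumed shape (n_fields, n_grid)
param_grid = np.load("parameter_grid.npy")                # assumed shape (n_grid, 2), same ordering

flat = log_lik_fields.reshape(len(log_lik_fields), -1)
estimates = param_grid[np.argmax(flat, axis=1)]           # one parameter vector per field
np.save("exact_likelihood_estimates.npy", estimates)
```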
This folder contains scripts to generate neural likelihood surfaces for the evaluation data in the folder evaluate_nn/generate_data/data/25_by_25/single/reps/200 or evaluate_nn/generate_data/data/25_by_25/multi/5/reps/200. To evaluate the effect of calibration, we produce both uncalibrated and calibrated neural likelihood surfaces for both the single and multiple realization cases; hence there are four different scripts for these four cases. Once produced, the neural likelihood surfaces are stored in the folder evaluate_nn/produce_neural_likelihood_surfaces/data/25_by_25/final_version.
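A sketch of how one such surface could be produced from the classifier, following the likelihood-ratio construction described in the paper: evaluate the classifier output ψ at the observed field paired with every grid parameter and convert it to ψ/(1 − ψ); the calibrated version would first map ψ through the saved logistic regression model. Function and variable names are assumptions.

```python
# Evaluate the classifier at one observed field paired with every parameter on the grid
# and convert the class probability psi to the ratio psi / (1 - psi), which gives the
# neural likelihood up to a constant. Names and the clipping constant are assumptions.
import numpy as np

def neural_likelihood_surface(model, field, param_grid):
    """field: (25, 25) array; param_grid: (n_grid, 2) array of parameter values."""
    n_grid = len(param_grid)
    fields = np.repeat(field[None, :, :, None], n_grid, axis=0)   # same field for every theta
    psi = model.predict([fields, param_grid], verbose=0).ravel()
    psi = np.clip(psi, 1e-12, 1 - 1e-12)
    return psi / (1 - psi)
```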
This Python script takes the neural likelihood fields for the evaluation data stored in the folder evaluate_nn/produce_neural_likelihood_surfaces/data/25_by_25/final_version and produces parameter estimates for both the single and multiple realization cases. Since calibration does not affect the point estimates, we use the uncalibrated neural likelihood surfaces to produce them. The parameter estimates for the evaluation data are stored in the folder evaluate_nn/produce_neural_likelihood_surfaces/data/25_by_25/final_version.
This is a timing study of the average time to evaluate an exact or neural likelihood surface on the same fixed grid over the parameter space. There are two scripts--one for timing the exact likelihood surface and one for timing the neural likelihood surface. Both the neural and exact likelihood surfaces are computed using the full resources of my laptop, which has an Intel Core i7-10875H processor with eight cores, each with two threads, and an NVIDIA GeForce RTX 2080 Super. To use the full resources of the laptop, parallel computing is utilized. The times to produce each of the 50 fields are stored in the folder timing_studies/data/25_by_25.
This folder contains subfolders for producing the different visualizations that appear in our paper: reliability diagrams to understand the effect of calibration, as well as neural and exact likelihood surfaces, approximate confidence regions, and point estimates.
The notebook produce_reliability_diagram.ipynb produces a reliability diagram (empirical class probability as a function of predicted class probability) before and after calibration. The closer the function is to the identity after calibration, the better the calibration. The figure configuration comes from https://github.com/hollance/reliability-diagrams, but how we compute predicted and empirical class probabilities is different. The reliability diagrams are stored in the folder visualizations/produce_reliability_diagrams/diagrams/25_by_25/final_version.
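The core computation behind such a diagram is binning, roughly as sketched below (the repository's binning details may differ):

```python
# Bin predicted class probabilities and compare the mean prediction in each bin with
# the empirical fraction of class-1 examples in that bin. Binning details are assumptions.
import numpy as np

def reliability_curve(pred_probs, true_classes, n_bins=10):
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    bin_idx = np.digitize(pred_probs, edges[1:-1])         # bin index 0..n_bins-1 for each prediction
    predicted, empirical = [], []
    for b in range(n_bins):
        mask = bin_idx == b
        if mask.any():
            predicted.append(pred_probs[mask].mean())      # mean predicted probability in the bin
            empirical.append(true_classes[mask].mean())    # empirical class-1 frequency in the bin
    return np.array(predicted), np.array(empirical)
```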
This folder contains jupyter notebooks for visualizing the exact, uncalibrated, and calibrated neural likelihood surfaces for the evaluation data in the single realization case.
This folder contains jupyter notebooks for visualizing the exact, uncalibrated, and calibrated neural likelihood surfaces and the corresponding 95% approximate confidence regions for the evaluation data in the single realization case. There is also a jupyter notebook which plots the surfaces and 95% approximate confidence regions for the exact likelihood and the uncalibrated and calibrated neural likelihoods side by side. These visualizations appear in our paper.
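One standard way to form such an approximate region from a log likelihood surface on the grid is the likelihood-ratio cutoff sketched below; whether this matches the repository's exact construction is an assumption (with two parameters, the chi-squared quantile has 2 degrees of freedom):

```python
# Keep grid points theta with 2 * (max logL - logL(theta)) <= chi^2 quantile.
# Using this asymptotic cutoff for the repo's regions is an assumption.
import numpy as np
from scipy.stats import chi2

def confidence_region(log_lik_surface, level=0.95, df=2):
    """log_lik_surface: 1-D array of log likelihood values over the parameter grid.
    Returns a boolean mask marking grid points inside the approximate confidence region."""
    cutoff = chi2.ppf(level, df)
    return 2.0 * (log_lik_surface.max() - log_lik_surface) <= cutoff
```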
This folder contains a jupyter notebook for visualizing, side by side, the empirical coverage and confidence region area of the exact likelihood and the uncalibrated and calibrated neural likelihoods. These two visualizations (empirical coverage and confidence region area) appear in our paper.
This folder contains jupyter notebooks for visualizing the exact and neural likelihood point estimates across the parameter space.
The script generate_nn_data.R simulates the training and validation data for the neural network. The user needs to specify the parameter space (its boundaries), how many parameters to sample from this parameter space, and how many spatial field realizations to simulate for each sampled parameter. The script produces one json file per sampled parameter. Each json file is a list of (spatial field realization, parameter) pairs, where the parameter is either the one that generated the field or a permuted one, together with the class label of each pair.
The script brown_resnick_data_shaping.py processes the json files produced by generate_nn_data.R into three numpy matrices: 1. the spatial field realizations, 2. their corresponding paired parameters, and 3. the classes to which the (spatial field, parameter) pairs belong.
The script br_nn.py constructs the neural network with a specific architecture and trains it with the chosen tuning parameters (learning rate schedule and batch size) on a single GPU. The training data comes from the folders models/25_by_25/version_x/data/train and models/25_by_25/version_x/data/validation. To keep this repository small, no data is stored in these two folders.
The script br_nn_with_distributed_training.py constructs the same neural network and trains it with the chosen tuning parameters (learning rate schedule and batch size) on multiple GPUs. The training data comes from the folders models/25_by_25/version_x/data/train and models/25_by_25/version_x/data/validation. To keep this repository small, no data is stored in these two folders.
This folder contains scripts to 1. generate training data for calibration and 2. calibrate the model via Platt scaling (a form of logistic regression).
To train the logistic regression model, we need training data consisting of pairs of the uncalibrated class probability (i.e., the output of the uncalibrated neural network for a given spatial field y and parameter θ) and the corresponding true class label.
This script produces the sampled spatial fields y and the sampled parameters θ (steps 1 and 2) and saves the data as lists in json files. This script is similar to the script generate_nn_data.R in the folder generate_nn_data.
This python script is for processing the json files produced from produce_training_and_test_data_for_calibration_part_2.R into numpy matrices for calibration. Specifically, three numpy matrices (parameters, spatial fields, and classes) will be saved in the folder calibration/data/25_by_25/final_version/train or calibration/data/25_by_25/final_version/test. This script is similar to the script brown_resnick_data_shaping.py in the folder generate_nn_data.
This script is for obtaining classifier outputs for the training (and test) data for calibration. Calibration requires the classifier output and the true class label. The classifier outputs will be saved as numpy matrices in the folder calibration/data/25_by_25/final_version/train (or calibration/data/25_by_25/final_version/test).
This script loads the data from the folder calibration/data/25_by_25/final_version/train and uses the data to train a logistic regression model which is then saved in the folder model/25_by_25/final_version/logistic_regression_model_with_logit_transformation.pkl. The test data will be used in the folder evaluate_nn to produce reliability diagrams which illustrate how effective calibration is in achieving calibrated class probabilities.
To evaluate our method in the Brown-Resnick case, we create an evaluation data set described in Section 4.1 of our paper. This evaluation data set consists of 200 spatial field realizations per parameter on a grid over the parameter space.
This R script generates evaluation data for assessing the neural likelihood surfaces, parameter estimates, and confidence regions in the single realization case. The evaluation data consists of spatial fields and the parameters that generated them. The evaluation data comprises n single realizations for each of m parameters, where the m parameters come from a grid over the parameter space. The evaluation data is saved as json files (the number of json files equals the number of parameters on the grid).
This R script generates evaluation data for assessing the neural likelihood parameter estimates in the multiple realization case. The evaluation data consists of spatial fields and the parameters that generated them. The evaluation data comprises n realizations of 5 replications of the spatial field for each of m parameters, where the m parameters come from a grid over the parameter space. The evaluation data is saved as json files (the number of json files equals the number of parameters on the grid).
This python script processes the json files produced by either generate_evaluation_data_for_single_realization_case.R or generate_evaluation_data_for_multi_realization_case.R into numpy matrices (images and parameters) which are saved in the folders evaluate_nn/generate_data/data/25_by_25/single/reps/200 or evaluate_nn/generate_data/data/25_by_25/multi/5/reps/200.
This script produces the pairwise likelihood surfaces for the evaluation data in the single realization case. The distance constraint (the maximum distance between pairs of locations whose contributions are included in the pairwise likelihood) can be changed by the user.
This script produces the pairwise likelihood surfaces for the evaluation data in the multiple realization case. The distance constraint (the maximum distance between pairs of locations whose contributions are included in the pairwise likelihood) can be changed by the user.
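The role of the distance constraint can be seen in the generic sketch below: only pairs of locations within dmax contribute to the pairwise log likelihood. The Brown-Resnick bivariate log density itself is left as a placeholder here, and the actual scripts in the repository are written in R.

```python
# Generic pairwise log likelihood with a distance constraint: only pairs of locations
# within dmax of each other contribute. The bivariate log density is a placeholder.
import numpy as np

def pairwise_log_likelihood(field, coords, theta, dmax, bivariate_log_density):
    """field: (n,) values at the locations; coords: (n, 2) locations; theta: parameters."""
    n = len(coords)
    total = 0.0
    for i in range(n):
        for j in range(i + 1, n):
            h = np.linalg.norm(coords[i] - coords[j])
            if h <= dmax:                                  # the distance constraint
                total += bivariate_log_density(field[i], field[j], h, theta)
    return total
```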
This script processes the numpy matrices produced by the scripts produce_pairwise_likelihood_surfaces_for_evaluation_data_in_the_single_realization_case.R and produce_pairwise_likelihood_surfaces_for_evaluation_data_in_the_multi_realization_case.R into one numpy matrix per case containing all the pairwise likelihood surfaces, and stores these matrices in the folders produce_pairwise_likelihood_surfaces/data/25_by_25/dist_value/multi/5/reps/200 and produce_pairwise_likelihood_surfaces/data/25_by_25/dist_value/single/reps/200, where dist_value is the value of the distance constraint.
This Python script takes the pairwise likelihood fields for the evaluation data stored in the folder evaluate_nn/pairwise_likelihood_surfaces/data/25_by_25 and produces parameter estimates from these pairwise likelihood surfaces. The parameter estimates for the evaluation data are stored in the folder produce_pairwise_likelihood_estimates/data/25_by_25. Note that this is done per distance constraint.
This folder contains scripts to generate neural likelihood surfaces for the evaluation data in the folder evaluate_nn/generate_data/data/25_by_25/single/reps/200 or evaluate_nn/generate_data/data/25_by_25/multi/5/reps/200. To evaluate the effect of calibration, we produce both uncalibrated and calibrated neural likelihood surfaces for both the single and multiple realization cases; hence there are four different scripts for these four cases. Once produced, the neural likelihood surfaces are stored in the folder evaluate_nn/produce_neural_likelihood_surfaces/data/25_by_25/final_version.
This Python script takes the neural likelihood fields for the evaluation data stored in the folder evaluate_nn/produce_neural_likelihood_surfaces/data/25_by_25/final_version and produces parameter estimates for both the single and multiple realization cases. Since calibration does not affect the point estimates, we use the uncalibrated neural likelihood surfaces to produce them. The parameter estimates for the evaluation data are stored in the folder evaluate_nn/produce_neural_likelihood_surfaces/data/25_by_25/final_version.
This is a timing study of the average time to evaluate a pairwise or neural likelihood surface on the same fixed grid over the parameter space. There are two scripts--one for timing the pairwise likelihood surface and one for timing the neural likelihood surface. Both the neural and pairwise likelihood surfaces are computed using the full resources of my laptop, which has an Intel Core i7-10875H processor with eight cores, each with two threads, and an NVIDIA GeForce RTX 2080 Super. To use the full resources of the laptop, parallel computing is utilized. The times to produce each of the 50 fields are stored in the folder timing_studies/data/25_by_25. Note that the time to produce the pairwise likelihood surface varies depending on the distance constraint.
This folder contains subfolders for producing the different visualizations that appear in our paper: reliability diagrams to understand the effect of calibration, as well as neural and pairwise likelihood surfaces, approximate confidence regions, and point estimates.
The notebook produce_reliability_diagram.ipynb produces a reliability diagram (empirical class probability as a function of predicted class probability) before and after calibration. The closer the function is to the identity after calibration, the better the calibration. The figure configuration comes from https://github.com/hollance/reliability-diagrams, but how we compute predicted and empirical class probabilities is different. The reliability diagrams are stored in the folder visualizations/produce_reliability_diagrams/diagrams/25_by_25/final_version.
This folder contains jupyter notebooks for visualizing the unadjusted and adjusted pairwise likelihood surfaces and the uncalibrated and calibrated neural likelihood surfaces for the evaluation data in the single realization case.
This folder contains jupyter notebooks for visualizing the unadjusted and adjusted pairwise likelihood surfaces, the uncalibrated and calibrated neural likelihood surfaces, and the corresponding 95% approximate confidence regions for the evaluation data in the single realization case. There is also a jupyter notebook which plots the surfaces and 95% approximate confidence regions for the pairwise likelihood and the uncalibrated and calibrated neural likelihoods side by side. These visualizations appear in our paper.
This folder contains a jupyter notebook for visualizing, side by side, the empirical coverage and confidence region area of the unadjusted and adjusted pairwise likelihoods and the uncalibrated and calibrated neural likelihoods. These two visualizations (empirical coverage and confidence region area) appear in our paper.
This folder contains jupyter notebooks for visualizing the pairwise and neural likelihood point estimates across the parameter space.