
DeepStan Evaluation: the Stan++ to (Num)Pyro compiler

The fork of Stanc3 that compiles programs to (Num)Pyro is available at https://github.com/deepppl/stanc3.

Getting Started

You need to install the following dependencies:

  • opam: the OCaml package manager
  • bazel: required by tensorflow-probability

Stanc requires version 4.07.0 of OCaml, which can be installed and selected with:

opam switch create 4.07.0
opam switch 4.07.0

Then simply run the following command to install all the dependencies, including the compiler.

make init

Dockerfile

We also provide a Dockerfile to set up an environment with all the dependencies. Build the image with (you might need to increase the available memory in Docker's preferences):

make docker-build

Run with:

make docker-run

You can also follow the instructions in the Dockerfile to install everything locally.


Experiments

To evaluate our compilation scheme from Stan to Pyro and NumPyro, we consider the following questions:

  • RQ1: Can we compile and run all Stan models?
  • RQ2: What is the impact of the compilation on accuracy?
  • RQ3: What is the impact of the compilation on speed?

Then, to evaluate the DeepStan extensions, we consider two additional questions:

  • RQ4: Are explicit variational guides useful?
  • RQ5: For deep probabilistic models, how does our extended Stan compare to hand-written Pyro code?

We assume that the current working directory is evaluation.

The code is organized by research question in the evaluation/rq* directories. Each directory contains a makefile with an eval rule that launches the experiments.

make eval

⚠️ Some experiments take hours to complete. Running this command directly in the evaluation directory launches all the experiments and takes a few days.

Each makefile also contains a scaled rule that runs a faster, lighter version of a subset of the experiments: not all backends and compilation schemes are executed, the number of chains for the inference is reduced, and some examples are skipped. All the test scripts are exercised, however.

make scaled

On a MacBook Pro (2020) with a 2.3 GHz Quad-Core Intel Core i7 and 32GB of memory, we measured the following durations:

  • RQ1: 26m52s make -C rq1 scaled
  • RQ2: 20m33s make -C rq2-3 scaled_accuracy
  • RQ3: 16m11s make -C rq2-3 scaled_speed
  • RQ4: 2m41s make -C rq4 scaled
  • RQ5: 2m23s make -C rq5 scaled
  • Total: 92m53s make scaled

We detail below how to run the tests individually and how to interpret the results.

RQ1

To compile all the examples of example-models from https://github.com/stan-dev/example-models, use the bash script test_example-models.sh. The script expects two arguments: the backend (pyro or numpyro) and the compilation mode (generative, comprehensive, or mixed).

cd rq1
./test_example-models.sh pyro comprehensive
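To cover all six backend/mode combinations in one go, a small driver along the following lines can be used (a sketch: it only shells out to the script above and assumes it is run from the rq1 directory):

# Sketch: run test_example-models.sh for every backend/mode combination.
import subprocess

for backend in ["pyro", "numpyro"]:
    for mode in ["generative", "comprehensive", "mixed"]:
        subprocess.run(["./test_example-models.sh", backend, mode], check=False)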

Each invocation generates a file named after the backend and mode (logs/pyro-comprehensive.csv for the command above) containing the name of each compiled example and its exit code: 0 means success, 1 means a semantic error raised by stanc3, and 2 means a compilation error due to the new backend. A summary of the results is printed to standard output and appended to the file logs/summay.log.
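To tally the exit codes yourself, a script along these lines works (a sketch: the two-column layout, example name then exit code, is assumed from the description above):

import csv
from collections import Counter

# Count exit codes in the generated log (assumed columns: example name, exit code).
with open("logs/pyro-comprehensive.csv") as f:
    codes = Counter(row[-1].strip() for row in csv.reader(f))

print("success:", codes["0"])
print("stanc3 semantic errors:", codes["1"])
print("backend compilation errors:", codes["2"])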

To test the compilation and inference, we use the models and data of PosteriorDB, available in the directory posteriordb. The Python script test_posteriordb.py compiles and executes one iteration of the inference on all the examples of posteriordb. The script is parameterized by the backend and the compilation scheme. For example, to run it with the numpyro backend and the comprehensive compilation scheme:

python test_posteriordb.py --backend numpyro --mode comprehensive

This will generate a csv file logs/YYMMDD_HHMM_numpyro_comprehensive.csv containing the exit code of each experiment and append a summary to logs/summay.log.

The summary of all the experiments can be displayed with:

cat logs/summay.log

RQ2-RQ3

To compare the accuracy of our backends with Stan, you can use the test_accuracy.py script.

cd rq2-3
python test_accuracy.py --help

usage: test_accuracy.py [-h] --backend BACKEND [--mode MODE] [--test]
                        [--iterations ITERATIONS] [--warmups WARMUPS]
                        [--chains CHAINS] [--thin THIN]

Run accuracy experiment on PosteriorDB models.

optional arguments:
  -h, --help            show this help message and exit
  --backend BACKEND     inference backend (pyro, numpyro, or stan)
  --mode MODE           compilation mode (generative, comprehensive, mixed)
  --test                Run test experiment (iterations = 100, warmups
                        = 100, chains = 1, thin = 1)
  --posteriors POSTERIORS [POSTERIORS ...]
                        select the examples to execute
  --iterations ITERATIONS
                        number of iterations
  --warmups WARMUPS     warmups steps
  --chains CHAINS       number of chains
  --thin THIN           thinning factor

For instance, to test the NumPyro backend with the comprehensive translation, using the PosteriorDB configurations on all examples that have a reference, the command is:

python test_accuracy.py --backend numpyro --mode comprehensive

This will generate a csv file status_numpyro_comprehensive_YYMMDD_HHMMSS.csv containing a summary of the experiments.
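For a quick look at the generated summary, something like the following can be used (a sketch: the column names are not documented here, so it only prints whatever the file contains):

import glob
import pandas as pd

# Open the most recent accuracy summary produced by test_accuracy.py.
latest = sorted(glob.glob("status_numpyro_comprehensive_*.csv"))[-1]
print(pd.read_csv(latest).head())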

To run the reference Stan implementation:

python test_accuracy.py --backend stan

To compare the speed of our backends with Stan, you can use the test_speed.py script.

python test_speed.py --help

usage: test_speed.py [-h] --backend BACKEND [--mode MODE] [--runs RUNS]
                     [--test] [--iterations ITERATIONS] [--warmups WARMUPS]
                     [--chains CHAINS] [--thin THIN]

Run experiments on PosteriorDB models.

optional arguments:
  -h, --help            show this help message and exit
  --backend BACKEND     inference backend (pyro, numpyro, or stan)
  --mode MODE           compilation mode (generative, comprehensive, mixed)
  --runs RUNS           number of runs
  --test                Run test experiment (iterations = 10, warmups =
                        10, chains = 1, thin = 1)
  --posteriors POSTERIORS [POSTERIORS ...]
                        select the examples to execute
  --iterations ITERATIONS
                        number of iterations
  --warmups WARMUPS     warmups steps
  --chains CHAINS       number of chains
  --thin THIN           thinning factor

For instance, to launch 5 runs with the NumPyro backend and the comprehensive translation, using the PosteriorDB configurations (except for the seed, which is picked at random for each run), the command is:

python test_speed.py --backend numpyro --mode comprehensive --runs 5

This will generate 5 csv files (one per run) named duration_numpyro_comprehensive_i_YYMMDD_HHMMSS.csv containing a summary of the experiments.
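To aggregate the per-run files, a short pandas script such as the following can help (a sketch: it assumes the file-name pattern above and summarizes whatever numeric columns test_speed.py emits):

import glob
import pandas as pd

# Concatenate the duration files of all runs and summarize them.
files = sorted(glob.glob("duration_numpyro_comprehensive_*.csv"))
runs = pd.concat(pd.read_csv(f) for f in files)
print(runs.describe())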

⚠️ Experiments with the pyro backend take a very long time (e.g., >60h for one example).

⚠️ A keyboard interrupt stops only the current example.

Both test_accuracy.py and test_speed.py accept a --test option that runs a test version of the experiments, checking the scripts on the entire set of examples with very few iterations.

⚠️ Accuracy and speed results computed with this option are not meaningful.

python test_accuracy.py --backend numpyro --mode comprehensive --test
python test_speed.py --backend numpyro --mode comprehensive --test

The --posteriors option can be used to select a subset of examples to execute for both test_accuracy.py and test_speed.py. Each example must be one of the posteriors with reference draws. E.g.,

python test_accuracy.py --backend numpyro --posteriors nes1976-nes earnings-earn_height
python test_speed.py --backend numpyro --posteriors nes1976-nes earnings-earn_height

The script results_analysis.py can be used to analyze the results. The --logdir option specifies the directory containing the log files (default ./logs).

python results_analysis.py

The --nopyro option can be used to ignore the Pyro results (e.g., after running make scaled, which does not run the Pyro experiments).
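For example, to analyze the logs produced by make scaled:

python results_analysis.py --nopyro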

RQ4

The script multimodal.py regenerates the plots of Figure 10 in pdf format to compare Stan NUTS, Stan ADVI, DeepStan NUTS, and DeepStan VI with explicit guides.

cd rq4
python multimodal.py

This will generate the files deepstan-vs-deepstansvi.pdf, deepstansvi-vs-stanadvi.pdf, and stan-vs-deepstansvi.pdf. The Stan code is in the two files multimodal_model.stan and multimodal_guide_model.stan.

RQ5

The last experiments are on deep probabilistic models and compare our Stan extension with hand-written Pyro code. The variational autoencoder (VAE) example is in vae_model.stan and the hand-written Pyro version with the comparison code is in vae.py.

cd rq5
python vae.py

The script executes both versions and prints the precision, recall, and F1 score of each to standard output.

The MLP in DeepStan is in mlp_model.py and the hand-written Pyro version with the comparison code is in mlp.py.

python mlp.py

This script prints the result of the comparison to standard output and produces a file pyro-vs-deepstan.pdf showing the accuracy distributions of the sampled MLPs.
