DISCS: A Benchmark for Discrete Sampling: paper
First, follow the guideline in https://github.com/google/jax#installation to install Jax.
Then navigate to the root of the project folder ./discs/ and run
pip install -e .
If you wish to run the experiments on Xmanager, please follow the steps described in https://github.com/google-deepmind/xmanager to set up Xmanager.
To run a sampling experiment, we need to set up three main components: 1) the model we want to sample from (the target distribution), 2) the sampler we want to use, and 3) the MCMC experimental configuration (number of chains, chain length, etc.). In the DISCS package, these three components are structured as follows:
- Model configs, which are defined under ./discs/models/configs/. For each model, its config contains the shape and the number of categories of the sample, as well as additional values that set up the model parameters.
- Sampler configs, which are defined under ./discs/samplers/configs/. For each sampler, its config contains the values required to set up the sampler.
- Experiment configs, which are defined under ./discs/experiment/configs/. Note that the experimental setup varies depending on the type of model we wish to run. For all experiments, we first load the common general experiment configs defined at ./discs/common/configs.py and then update the values depending on the task type. ./discs/experiment/configs/lm_experiment.py contains the experiment setup for the text_infilling problem. ./discs/experiment/configs/co_experiment.py contains the common configs of the combinatorial optimization problems; depending on the problem type, additional configs for each graph type are defined in their own folders under ./discs/experiment/configs/.
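The "load the common configs, then update by task type" pattern described above can be sketched in plain Python. This is only an illustration under stated assumptions: the field names, the values, and the helper build_experiment_config are hypothetical and are not taken from ./discs/common/configs.py or the task-specific config files.

```python
import copy

# Hypothetical stand-in for the common experiment config
# (the real values live in ./discs/common/configs.py).
COMMON_EXPERIMENT_CONFIG = {
    "num_chains": 4,
    "chain_length": 1000,
    "save_samples": False,
}

# Hypothetical task-specific overrides, one entry per task type.
TASK_OVERRIDES = {
    "text_infilling": {"chain_length": 50, "save_samples": True},
    "combinatorial_optimization": {"chain_length": 500},
}


def build_experiment_config(task_type=None):
    """Start from the common config, then apply task-specific updates."""
    config = copy.deepcopy(COMMON_EXPERIMENT_CONFIG)
    config.update(TASK_OVERRIDES.get(task_type, {}))
    return config


print(build_experiment_config("text_infilling"))
```

Copying the common config before updating it keeps the shared defaults untouched, so building the config for one task cannot leak values into another.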
Note: When adding new samplers, models or experiments, the above configs need to be updated. To learn how to add your sampler or model to the package, refer to the explanations provided in ./discs/samplers/ and ./discs/models/.
Under the ./discs/samplers/ directory, you can see the list of all the samplers with their corresponding configuration under ./discs/samplers/configs/.
List of the samplers:
- randomwalk
- hammingball
- blockgibbs
- gwg
- path_auxiliary
- dlmc (dlmcf: solver=euler_forward)
- dmala
Under the ./discs/models/ directory, you can see the list of all the models with their corresponding configuration under ./discs/models/configs/.
List of the models:
- Classical Models
  - bernoulli
  - categorical
  - ising
  - potts
  - fhmm
- Combinatorial optimization problems
  - maxcut
    - ba, er, optsicom
  - maxclique
    - rb, twitter
  - mis
    - er_density, ertest, satlib
  - normcut
    - nets: INCEPTION, ALEXNET, MNIST, RESNET, VGG
- Energy based models
  - rbm
  - resnet
- text_infilling (language model)
Note: To run energy-based models, data_path must be set in the model config; for combinatorial optimization problems, data_root must be set. For the text infilling model, the additional path bert_model should also be set. Further information on the data and how to access it can be found in the data section below.
Below we provide an example of how to run a sampling experiment for different tasks by passing the name of sampler and the model.
To run an experiment locally, under the root folder ./discs/, run:
model=bernoulli sampler=randomwalk ./discs/experiment/run_sampling_local.sh
For combinatorial optimization problems you further need to set the graph type:
model=maxcut graph_type=ba sampler=path_auxiliary ./discs/experiment/run_sampling_local.sh
Note that the experiments above use the default config values of the sampler, the model and the experiment. To define your own experiment setup, you can modify the corresponding config values.
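The two local commands above pass their settings as environment variables placed before the script name, so they apply only to that single invocation. The same launch can be sketched from Python as follows. The helpers build_run_env and run_sampling are hypothetical and not part of DISCS; only the script path and the variable names model, sampler and graph_type come from the commands above.

```python
import os
import subprocess

# Script path taken from the commands above; it is relative to the project root.
RUN_SCRIPT = "./discs/experiment/run_sampling_local.sh"


def build_run_env(model, sampler, graph_type=None, base_env=None):
    """Return a copy of the environment with the run settings added."""
    env = dict(os.environ if base_env is None else base_env)
    env["model"] = model
    env["sampler"] = sampler
    # graph_type is only needed for combinatorial optimization problems.
    if graph_type is not None:
        env["graph_type"] = graph_type
    return env


def run_sampling(model, sampler, graph_type=None):
    """Launch the local sampling script; call this from the project root."""
    env = build_run_env(model, sampler, graph_type)
    return subprocess.run([RUN_SCRIPT], env=env, check=True)
```

For example, run_sampling("maxcut", "path_auxiliary", graph_type="ba") corresponds to the second command above.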
Under ./discs/run_configs/ you can find predefined experiment configs for all model types. These are used to study the performance of different samplers and the effect of different model, sampler and experiment config values. To define your own experiment config, please see the section below.
To run an experiment on Xmanager, under the root folder ./discs/, run:
config=./discs/run_configs/co/maxclique/rb_sampler_sweep.py ./discs/run_xmanager.sh
The example above will run all the samplers on all the maxclique problems with graph type rb.
To define your own Xmanager script that sweeps over any of the experiment, sampler or model configs, follow the structure below.
from ml_collections import config_dict


def get_config():
    """Get config."""
    config = config_dict.ConfigDict(
        dict(
            model='categorical',
            # Default sampler.
            sampler='path_auxiliary',
            sweep=[
                {
                    'config.experiment.chain_length': [100000, 200000],
                    'model_config.num_categories': [4, 8],
                    'sampler_config.name': [
                        'dmala',
                        'path_auxiliary',
                        'gwg',
                    ],
                    'sampler_config.balancing_fn_type': [
                        'SQRT',
                        'RATIO',
                        'MAX',
                        'MIN',
                    ],
                },
            ],
        )
    )
    return config
In the above example, sampler_config.name is used to sweep over samplers. Since all of them are based on locally balanced functions, sampler_config.balancing_fn_type sweeps over the balancing function types. config.experiment.${any experiment config you want to sweep over} is used to sweep over experiment configs, which is the chain length in the above example. model_config.${any model config you want to sweep over} is used to sweep over any model-related config values.
Depending on the type of model being sampled, different metrics are calculated and the results are stored in different forms.
- For the classical models, the effective sample size (ESS) is computed over the chains after the burn-in phase. The ESS is normalized by running time and by number of energy evaluations, and stored as a csv file.
- For combinatorial optimization, the objective function is evaluated throughout chain generation and stored in a pickle file. Note that for the normcut problem, the best sample is also stored in the pickle file for further post-processing to get the edge cut ratio and balanceness.
- For the text_infilling task, the generated sentences and their evaluated metrics, including bleu, self-bleu and unique-ngrams, are stored in a pickle file.
- For energy based models, images of selected samples are additionally saved during chain generation.
To get the plots and arrange the results, two main scripts, run_plot_results.sh and run_co_through_time.sh, are used; they can be found at ./discs/plot_results/.
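As a rough illustration of the ESS metric and its two normalizations, the sketch below estimates ESS for a single scalar chain from its autocorrelations and divides it by running time and by the number of energy evaluations. This is a generic textbook-style estimator, not the DISCS implementation; the truncation rule (stop at the first non-positive autocorrelation), the function names, and the output field names are assumptions, and the estimator used in the paper may differ.

```python
def effective_sample_size(chain):
    """Estimate ESS of a scalar chain as n / (1 + 2 * sum of autocorrelations)."""
    n = len(chain)
    mean = sum(chain) / n
    centered = [x - mean for x in chain]
    variance = sum(c * c for c in centered) / n
    if variance == 0.0:
        raise ValueError("ESS is undefined for a constant chain")
    tau = 1.0  # integrated autocorrelation time
    for lag in range(1, n):
        autocov = sum(centered[t] * centered[t + lag] for t in range(n - lag)) / n
        rho = autocov / variance
        if rho <= 0.0:  # assumed truncation rule: stop at first non-positive lag
            break
        tau += 2.0 * rho
    return n / tau


def normalize_ess(ess, running_time_seconds, num_energy_evaluations):
    """Express ESS per second of running time and per energy evaluation."""
    return {
        "ess_per_second": ess / running_time_seconds,
        "ess_per_energy_eval": ess / num_energy_evaluations,
    }


# A chain that stays in one state for a long time mixes poorly, so its ESS
# is far below its length of 100.
sticky_chain = [0.0] * 50 + [1.0] * 50
ess = effective_sample_size(sticky_chain)
print(ess, normalize_ess(ess, running_time_seconds=2.0, num_energy_evaluations=200))
```

Normalizing by running time and by energy evaluations puts samplers with different per-step costs on a comparable scale.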
Note: For a detailed explanation of the metrics used and how they are calculated, please refer to the DISCS paper. To reproduce the tables and figures in the paper, please refer to the explanations provided in ./discs/plot_results/ and ./discs/models/.
The data used in this package could be found here. The data contains the following components:
- Graph data of combinatorial optimization problems for different graph types under /DISCS-DATA/sco/.
- Model parameters for energy-based models found at DISCS-DATA/BINARY_EBM and DISCS-DATA/RBM_DATA. The binary EBM is trained on the MNIST, Omniglot, and Caltech datasets, and the binary and categorical RBMs are trained on the MNIST and Fashion-MNIST datasets. For the language model, you could download the model parameters from here.
- Text infilling data generated from WT103 and TBC found at /DISCS-DATA/text_infilling_data/.
For more details on how to plug in your sampler, model and evaluator, please check the explanations under the ./discs/samplers, ./discs/models and ./discs/evaluator folders.
You can simply run pytest under the root folder to test everything.
We welcome pull requests; please check CONTRIBUTING.md for more details.
This package is licensed under the Apache License, Version 2.0.
This is not an officially supported Google product.