This repository contains code for the paper Multi-Fidelity Active Learning with GFlowNets.
In the last decades, the capacity to generate large amounts of data in science and engineering applications has been growing steadily. Meanwhile, the progress in machine learning has turned it into a suitable tool to process and utilise the available data. Nonetheless, many relevant scientific and engineering problems present challenges where current machine learning methods cannot yet efficiently leverage the available data and resources. For example, in scientific discovery, we are often faced with the problem of exploring very large, high-dimensional spaces, where querying a high fidelity, black-box objective function is very expensive. Progress in machine learning methods that can efficiently tackle such problems would help accelerate currently crucial areas such as drug and materials discovery. In this paper, we propose the use of GFlowNets for multi-fidelity active learning, where multiple approximations of the black-box function are available at lower fidelity and cost. GFlowNets are recently proposed methods for amortised probabilistic inference that have proven efficient for exploring large, high-dimensional spaces and can hence be practical in the multi-fidelity setting too. Here, we describe our algorithm for multi-fidelity active learning with GFlowNets and evaluate its performance in both well-studied synthetic tasks and practically relevant applications of molecular discovery. Our results show that multi-fidelity active learning with GFlowNets can efficiently leverage the availability of multiple oracles with different costs and fidelities to accelerate scientific discovery and engineering design.
We evaluate the proposed MF-GFN approach in both synthetic tasks (Branin, Hartmann) and benchmark tasks of practical relevance, such as DNA aptamer generation, antimicrobial peptide design and molecular modelling (see Section 4 of the paper). Through comparisons with previously proposed methods, as well as with variants of our method designed to understand the contributions of different components, we conclude that multi-fidelity active learning with GFlowNets not only outperforms its single-fidelity active learning counterpart in terms of cost effectiveness and diversity of sampled candidates, but also offers an advantage over other multi-fidelity methods due to its ability to learn a stochastic policy that jointly samples objects and the fidelity of the oracle used to evaluate them.
- Install the GFlowNet codebase compatible with this implementation from alexhernandezgarcia/gflownet.
python -m pip install --upgrade https://github.com/alexhernandezgarcia/gflownet/archive/cont_mf.zip
- Clone this repository and install other dependencies by running
pip install -r requirements.txt
where this repository is cloned.
- Set up the AMP oracles (optional; required only if you wish to run experiments with the antimicrobial peptide environment). Install the clamp-common-eval library from MJ10/clamp-gen-data by cloning that repository and then running the following where it is cloned:
pip install -r requirements.txt && pip install -e .
- Rename the log directory and data directory arguments (if necessary) in
config/user/anonymous.yaml
The project uses Hydra for configuration and Weights and Biases for logging.
To reproduce the results, use the configuration files with the default settings for the synthetic and benchmark tasks, which are provided in config/. Run
python activelearning.py --config_name=<config-filename>
In the configuration names below, the options sf and mf for the <fid> placeholder correspond to the single-fidelity and multi-fidelity variants, respectively.
- <fid>_branin: Branin
- <fid>_hartmann: Hartmann
- <fid>_amp: Antimicrobial Peptides
- <fid>_aptamers: DNA
- <fid>_mols_ea: Molecules with the objective of maximizing electron affinity
- <fid>_mols_ip: Molecules with the objective of maximizing (negative) ionization potential
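As a small illustration of the naming scheme above, the following sketch composes a configuration name from a fidelity variant and a task, and builds the corresponding run command. The helper names `config_name` and `run_command` are hypothetical and not part of this repository; the script name and flag are copied from the command shown above.

```python
import shlex

# Values taken from the list of configuration names above.
FIDELITY_VARIANTS = ("sf", "mf")
TASKS = ("branin", "hartmann", "amp", "aptamers", "mols_ea", "mols_ip")


def config_name(fid, task):
    """Return the configuration name <fid>_<task> (hypothetical helper)."""
    if fid not in FIDELITY_VARIANTS:
        raise ValueError(f"unknown fidelity variant: {fid!r}")
    if task not in TASKS:
        raise ValueError(f"unknown task: {task!r}")
    return f"{fid}_{task}"


def run_command(fid, task):
    """Build the argument list for the run command shown in this README."""
    return ["python", "activelearning.py", f"--config_name={config_name(fid, task)}"]


# Print the command line for the multi-fidelity Branin experiment.
print(shlex.join(run_command("mf", "branin")))
```

The resulting list can be passed to `subprocess.run` from the directory where this repository is cloned.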
Below we list other configuration options. See the config files in ./config for all configurable parameters. Note that any config field can be overridden from the command line, although some configurations are not supported.
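Command-line overrides follow Hydra's `key=value` syntax: `group=option` selects a file from a config group, and a dotted key such as `a.b=value` sets a nested field. The sketch below appends such overrides to the run command. The `user` group is inferred from the path config/user/anonymous.yaml mentioned above; `some.nested.field` is a purely hypothetical key used for illustration, and `with_overrides` is not a function of this repository.

```python
import shlex


def with_overrides(config, overrides):
    """Build the run command with Hydra-style key=value overrides appended."""
    args = ["python", "activelearning.py", f"--config_name={config}"]
    # Each override becomes one positional argument of the form key=value.
    args += [f"{key}={value}" for key, value in overrides.items()]
    return args


# `user=anonymous` selects config/user/anonymous.yaml;
# `some.nested.field` is a hypothetical field name, not a real option.
command = with_overrides("mf_branin", {"user": "anonymous", "some.nested.field": 10})
print(shlex.join(command))
```

Check the files in ./config for the actual group and field names before using an override.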
Environment Options
- grid (for the synthetic tasks, Branin and Hartmann)
- aptamers (for DNA)
- amp (for antimicrobial peptides)
- mols (for molecules represented as SELFIES strings)
Oracle Options
The oracles are prefixed by the environment (branin, hartmann, amp, aptamers, mols_ea, mols_ip) and indexed by increasing level of fidelity.
Sampler Options
- gflownet (GFlowNet with the flow matching objective)
- trajectorybalance (GFlowNet with the trajectory balance objective, recommended for reproducing results)
- ppo (RL baseline, the proximal policy optimization algorithm)
Proxy Options
Options sf and mf for the <fid> placeholder correspond to the single-fidelity and multi-fidelity variants, respectively.
- <fid>_gp (exact Gaussian process for regression)
- <fid>_svgp (stochastic variational Gaussian process for regression)
- <fid>_dkl (deep kernel regressor with one of the model options below as backbone and an index kernel for learning the fidelity parameter)
- mf_dkl_linear (multi-fidelity deep kernel regressor with one of the model options below as backbone and a linear downsampling kernel for learning the fidelity parameter)
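To give an intuition for how an index kernel can model the fidelity, here is a minimal sketch in plain Python. It is not the code used in this repository. It assumes the common construction in which the covariance between two (features, fidelity) pairs is the product of a kernel over the features and an index kernel B = W Wᵀ + diag(v) over the fidelity indices. The RBF feature kernel, the function names and the numeric values are illustrative assumptions; in a deep kernel model, the features would be the output of the backbone network.

```python
import math


def rbf(x1, x2, lengthscale=1.0):
    """RBF kernel between two feature vectors (assumed feature kernel)."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x1, x2))
    return math.exp(-0.5 * squared_distance / lengthscale**2)


def index_kernel(m1, m2, factors, variances):
    """Entry (m1, m2) of B = W W^T + diag(v), with W = factors and v = variances."""
    low_rank = sum(a * b for a, b in zip(factors[m1], factors[m2]))
    return low_rank + (variances[m1] if m1 == m2 else 0.0)


def mf_kernel(x1, m1, x2, m2, factors, variances):
    """Product kernel over (features, fidelity index) pairs."""
    return rbf(x1, x2) * index_kernel(m1, m2, factors, variances)


# Illustrative values for two fidelities; in practice W and v are learned.
W = [[1.0], [0.5]]
v = [0.1, 0.2]
x = [0.3, 0.7]
print(mf_kernel(x, 0, x, 1, W, v))  # covariance of the same point across fidelities
```

Learning W and v lets the model estimate how strongly observations at one fidelity inform predictions at another.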
Model Options
- mlp
- transformer
- mlm_cnn (masked language model based on a CNN)
- mlm_transformer (masked language model based on a transformer)
- regressive (based on the DNN-MFBO implementation)
- mf_mlp (MLP with an additional layer over the concatenated representation of the features and the fidelity)
Acquisition Function Options
Single- and multi-fidelity variants of the Max-value Entropy Search (MES) acquisition function have been implemented. The suffix (<gp>, <svgp> or <dkl>) indicates the regressor with which the implementation is compatible.
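For readers unfamiliar with MES, the sketch below computes the standard single-fidelity approximation for a Gaussian posterior: given the predictive mean and standard deviation at a candidate and a set of sampled maximum values y*, it averages γφ(γ)/(2Φ(γ)) − log Φ(γ) with γ = (y* − μ)/σ. This is a self-contained illustration and not the implementation in this repository; the function names are assumptions, and the multi-fidelity variant, which also accounts for the fidelity and cost of each oracle, is not shown.

```python
import math


def normal_pdf(z):
    """Density of the standard normal distribution."""
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)


def normal_cdf(z):
    """Cumulative distribution function of the standard normal distribution."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def mes(mu, sigma, max_samples):
    """Single-fidelity MES value at one candidate (illustrative sketch).

    mu, sigma: posterior predictive mean and standard deviation at the candidate.
    max_samples: sampled values of the function maximum y*.
    """
    total = 0.0
    for y_star in max_samples:
        gamma = (y_star - mu) / sigma
        # Clip the CDF away from zero to keep the logarithm finite.
        cdf = max(normal_cdf(gamma), 1e-12)
        total += gamma * normal_pdf(gamma) / (2.0 * cdf) - math.log(cdf)
    return total / len(max_samples)


# A candidate whose mean is close to the sampled maxima is more informative.
print(mes(mu=0.0, sigma=1.0, max_samples=[0.0]))
print(mes(mu=0.0, sigma=1.0, max_samples=[1.0]))
```

In an active learning loop, candidates with higher values of this quantity are preferred for evaluation by the oracle.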
If you use any part of this code for your own work, please cite
@misc{hernandezgarcia2023multifidelity,
title={Multi-Fidelity Active Learning with GFlowNets},
author={Alex Hernandez-Garcia and Nikita Saxena and Moksh Jain and Cheng-Hao Liu and Yoshua Bengio},
year={2023},
eprint={2306.11715},
archivePrefix={arXiv},
primaryClass={cs.LG}
}