Authors: Montana N. Carlozo, Ning Wang, Alexander W. Dowling, and Edward J. Maginn
genFF_public is a repository used to calibrate a transferable force field (FF) for one- and two-carbon single-bonded refrigerants composed of C, F, and H, given experimental data. The key feature of this work is the use of machine learning (ML) tools, namely Gaussian processes (GPs) and estimability analysis techniques, to systematically design atom type schemes for transferable FFs and optimize their LJ parameters. This work compares four atom typing schemes designed and optimized with ML against GAFF.
Note: For all files in this repository, AT-4 (main text) corresponds to AT-1 (repository files). Similarly, AT-6a corresponds to AT-2, AT-6b corresponds to AT-6, and AT-8 corresponds to AT-8.
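The naming correspondence in the note above can be kept in a small lookup table when moving between the manuscript and the repository files. The following is only an illustrative sketch; the names `MAIN_TEXT_TO_REPO` and `repo_name` are hypothetical and are not part of this repository.

```python
# Hypothetical helper (not part of genFF_public): maps the atom typing
# scheme names used in the main text to the names used in repository files,
# exactly as listed in the note above.
MAIN_TEXT_TO_REPO = {
    "AT-4": "AT-1",
    "AT-6a": "AT-2",
    "AT-6b": "AT-6",
    "AT-8": "AT-8",
}


def repo_name(main_text_name: str) -> str:
    """Return the repository name for a scheme named as in the main text."""
    return MAIN_TEXT_TO_REPO[main_text_name]


if __name__ == "__main__":
    for paper, repo in MAIN_TEXT_TO_REPO.items():
        print(f"{paper} (main text) -> {repo} (repository files)")
```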
Please cite as:
Montana N. Carlozo, Ning Wang, Alexander W. Dowling, Edward J. Maginn “Machine Learning to Optimize Transferable Hydrofluorocarbon Refrigerant Force Fields”, 2025
The repository is organized as follows:
genFF_public/ is the top level directory.
It contains the following:
- .gitignore prevents large files from the signac workflow and plots from being tracked by git and prevents tracking of other unimportant files.
- hfcs-fffit.yaml is the conda environment for use with this work.
- gen-gp-vle.py is the code used to create the GP models in this work from the data in our previous study.
- AT-results.xlsx is an Excel file containing the numerical results of the eigen-decomposition of the FIM and the estimability analysis, as well as the final LJ parameters for each atom type scheme.
- init_gaff_ms.py is the initialization file for molecular simulations with the GAFF LJ parameters.
- init_opt_at.py is the initialization file for atom type optimization.
- init_optff_ms.py is the initialization file for molecular simulations for the validation of LJ parameters for atom type schemes.
- molecule_exp_unc_data.csv is a csv of the weights and uncertainties used in atom type (AT) optimization.
- param-comp.xlsx is an Excel file comparing the different LJ parameters used for each AT scheme for each molecule.
- post_analysis_ms.py is the script used to gather validation data, including the data for Table 4. It also generates Figures 3, 4, 5, and 6 and the files h-p-vap.pdf and vle.pdf.
- post_analysis_opt.py is the script used to optimize the transferable FF parameters. Generates the data for Table 7.
- rcc_opt_at_analysis.py is the script used to perform the estimability analysis and eigen-decomposition of the FIM. Generates the data for Table 6.
Directories gaff_ff_ms/, opt_at_params/, and opt_ff_ms/ are initially created via init_gaff_ms.py, init_opt_at.py, and init_optff_ms.py in the top directory through signac.
Each contains the following files/subdirectories:
- project_gaff_ms.py, project_opt_at.py, or project_optff_ms.py; The script for running the workflow using signac.
- templates/ contains the templates required to run this workflow in signac on the CRC.
- workspace/ will appear after running the corresponding init_*.py script and stores all raw results generated during the workflow. This directory is not tracked by git due to its size. The workspace/ folder for this study can be downloaded from Google Drive (see section 'Workflow Files and Results').
- signac_project_document.json will also appear to track the status of jobs in the signac workflow.
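As a rough illustration of what ends up in workspace/, the sketch below walks a signac workspace with the standard library only and reads each job's state point. It assumes the usual signac layout, in which every job directory holds a `signac_statepoint.json` file; the function name `read_state_points` is hypothetical, and the state point keys used by this workflow are not assumed here.

```python
import json
from pathlib import Path


def read_state_points(workspace):
    """Map each job directory name to the state point stored inside it.

    Assumes the standard signac layout: workspace/<job_id>/signac_statepoint.json.
    Directories without that file are skipped.
    """
    state_points = {}
    for job_dir in sorted(Path(workspace).iterdir()):
        if not job_dir.is_dir():
            continue
        sp_file = job_dir / "signac_statepoint.json"
        if sp_file.is_file():
            state_points[job_dir.name] = json.loads(sp_file.read_text())
    return state_points
```

For example, `read_state_points("opt_at_params/workspace")` would return one entry per job once the workspace exists. For real analysis, the signac API and the helpers in fffit/fffit/signac.py are the more appropriate tools.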
Directory csv/ contains data used to train the GP models.
It contains the following files:
- rXX-density.csv; The MD density data.
- rXX-vle.csv; The GEMC data which is used to train the GP models.
Directory example_mcf_files/ contains sample .mcf files for all models and HFCs evaluated in this work.
It contains the following files/subdirectories:
- AT-Y/RXX-species1.mcf are the sample .mcf files for each FF model and refrigerant. The files list the intramolecular parameters and partial charges for each molecule and AT scheme.
Directory fffit/fffit is a package which contains some critical functions for running the workflow.
It contains the following files/subdirectories:
- tests/ contains the tests for the functions in fffit/fffit.
- __init__.py initializes the package.
- models.py contains functions related to building GP models.
- pareto.py contains functions related to locating pareto-optimal parameter sets.
- plot.py contains some functions for plotting.
- signac.py contains functions related to parsing data from signac workspaces.
- utils.py contains utility functions necessary for this package.
Directory molec_gp_data/ contains the GPs and training/testing data for each refrigerant.
It contains the following files/subdirectories:
- RXX-vlegp are the subdirectories for each refrigerant.
- RXX-vlegp/sim_PROP_y_train.csv are the output training data for each property.
- RXX-vlegp/sim_PROP_y_test.csv are the output testing data for each property.
- RXX-vlegp/x_train.csv are the input training data for all properties.
- RXX-vlegp/x_test.csv are the input testing data for all properties.
- RXX-vlegp/vle-gps.pkl are the pickled GP models for each property.
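The file layout listed above can be expressed as a small path helper. This is a sketch under stated assumptions: `gp_data_paths` is a hypothetical name, and the refrigerant label and property name passed in are placeholders for the RXX and PROP parts of the file names, which are not enumerated here.

```python
from pathlib import Path


def gp_data_paths(refrigerant, prop, root="molec_gp_data"):
    """Build the expected file paths for one refrigerant and one property.

    `refrigerant` fills the RXX part and `prop` fills the PROP part of the
    names listed above; valid values are not assumed by this sketch.
    """
    base = Path(root) / f"{refrigerant}-vlegp"
    return {
        "x_train": base / "x_train.csv",
        "x_test": base / "x_test.csv",
        "y_train": base / f"sim_{prop}_y_train.csv",
        "y_test": base / f"sim_{prop}_y_test.csv",
        "gps": base / "vle-gps.pkl",
    }
```

The CSV files could then be read with `pandas.read_csv`, and the pickled models with `pickle.load` from inside the hfcs-fffit environment, since unpickling needs the packages that defined the models.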
The pymser/ directory is a clone of the pymser repository. Refer to their GitHub Page for more information.
The utils/ directory consists of the functions and files required for generalized FF optimization.
It contains the following files/subdirectories:
- molec_class_files/rXX.py are files containing class objects with the experimental data and relevant information for each refrigerant.
- __init__.py initializes the package.
- analyze_ms.py contains all functions necessary to analyze the molecular simulation data.
- atom_type.py contains classes for each data-informed atom type studied in this work.
- opt_atom_types.py contains classes and functions relevant for optimizing transferable FF parameters.
Running the analysis will cause results directories to appear in genFF_public/ with relevant human readable data and plots. Subdirectories further categorize the results by transferable FF.
- Results/ shows data where we analyze the results from transferable FF parameter optimization.
- Results_MS/ shows data where we analyze the results of the molecular simulations used to validate our transferable FFs and compare them with GAFF.
- Results_gp/ shows data where we analyze the best results based on how efficiently the GP predicted SSE was optimized.
We note that this repository is based on the branch public in the dowlinglab/generalizedFF repository, which is private.
All workflow iterations were performed inside genFF_public/gaff_ff_ms/, genFF_public/opt_at_params/, or genFF_public/opt_ff_ms/, where it exists.
Each iteration was managed with signac-flow. Inside gaff_ff_ms/, opt_at_params/, or opt_ff_ms/ you will find all the necessary files to
run the workflow. Note that you may not reproduce our simulation results exactly due to differences in software versions, random seeds, etc.
All of the scripts for running the workflow are provided in this repository. post_analysis_ms.py, post_analysis_opt.py, and rcc_opt_at_analysis.py are the scripts used to perform data analysis.
All scripts required to generate the primary figures in the
manuscript and SI are reported under genFF_public/post_analysis_ms.py. When running analysis scripts, these figures are saved under Results_MS_/AT-1268/RXX.
It contains the following files/subdirectories:
- mapd_props.png is SI Figure S5.
- h_p_vap.pdf has property predictions for enthalpy of vaporization and vapor pressure for all FFs and molecules.
- vle.pdf has property predictions for the VLE curve for all FFs and molecules.
To run this software, you must have access to all packages in the hfcs-fffit environment (hfcs-fffit.yaml) which can be installed using the instructions in the next section.
This package has a number of requirements that can be installed in
different ways. We recommend using a conda environment to manage
most of the installation and dependencies. However, some items will
need to be installed from source or pip.
Running the simulations will also require an installation of pymser.
This can be installed separately (see the pymser installation
instructions).
An example of the procedure is provided below:
    # Install pip/conda available dependencies
    # with a new conda environment named hfcs-fffit
    conda env create -f hfcs-fffit.yaml
    conda activate hfcs-fffit
    pip install pymser
NOTE: We use signac and signac-flow
to manage the setup and execution of the workflow. These
instructions assume a working knowledge of that software.
WARNING: Running these scripts will overwrite your local copy of our data (Results/*) with the data from your workflow runs.
To run LJ parameter optimization, follow these steps:
- Use init_opt_at.py to initialize files for atom type optimization. Change init_opt_at.py as necessary
    cd genFF_public
    python init_opt_at.py
- Do the following in the opt_at_params directory:
- Generate Pareto sets for the first repeats
    python project_opt_at.py submit -o gen_pareto_sets -f obj_choice [val] atom_type [val]
- Run the optimization algorithm with repeats
    python project_opt_at.py submit -o run_obj_alg -f obj_choice [val] atom_type [val]
- Run the post analysis algorithm
    cd genFF_public
    python post_analysis_opt.py
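The two submission commands above differ only in the operation name, so they can be generated from the filter values. The sketch below only builds the argument lists and does not run anything; `opt_at_commands` is a hypothetical helper, and the `[val]` placeholders must be supplied by the user because valid values are not listed here.

```python
import shlex


def opt_at_commands(obj_choice, atom_type):
    """Return the two signac-flow submit commands as argument lists.

    `obj_choice` and `atom_type` replace the [val] placeholders in the
    instructions above; this sketch does not validate them.
    """
    base = ["python", "project_opt_at.py", "submit", "-o"]
    filt = ["-f", "obj_choice", str(obj_choice), "atom_type", str(atom_type)]
    return [
        base + ["gen_pareto_sets"] + filt,  # Pareto sets for the first repeats
        base + ["run_obj_alg"] + filt,      # optimization algorithm with repeats
    ]


if __name__ == "__main__":
    # EXAMPLE_OBJ and EXAMPLE_AT are placeholders, not real option values.
    for cmd in opt_at_commands("EXAMPLE_OBJ", "EXAMPLE_AT"):
        print(shlex.join(cmd))
```

To execute a command, each list could be passed to `subprocess.run(cmd, cwd="opt_at_params", check=True)` from the top-level directory.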
To run vapor-liquid-equilibrium iterations for the optimized FFs, follow these steps:
- Use init_optff_ms.py to initialize files for simulation use
    cd genFF_public
    python init_optff_ms.py
- Do the following in the opt_ff_ms directory:
- Check status a few times throughout the process
    python project_optff_ms.py status
- Create force fields
    python project_optff_ms.py run -o create_forcefield
- Calculate vapor/liquid box size
    python project_optff_ms.py run -o calc_boxes
- Run simulation and check for overlap
    python project_optff_ms.py submit -o NVT_liqbox --bundle=12 --parallel
    python project_optff_ms.py run -o extract_final_NVT_config
    python project_optff_ms.py submit -o NPT_liqbox --bundle=12 --parallel
    python project_optff_ms.py run -o extract_final_NPT_config
    python project_optff_ms.py submit -o run_gemc --bundle=12 --parallel
    python project_optff_ms.py run -o check_prod_overlap
- Calculate VLE Properties
    python project_optff_ms.py run -o calculate_props
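The order of operations in these steps matters, since each one consumes the output of the previous one. As a hedged sketch, the sequence can be written down as data and turned into command lines; `VLE_STEPS` and `vle_commands` are hypothetical names, and the code only builds the commands listed above without executing them.

```python
import shlex

# (signac-flow mode, operation name, extra flags), in the order given above.
CLUSTER_FLAGS = ["--bundle=12", "--parallel"]
VLE_STEPS = [
    ("run", "create_forcefield", []),
    ("run", "calc_boxes", []),
    ("submit", "NVT_liqbox", CLUSTER_FLAGS),
    ("run", "extract_final_NVT_config", []),
    ("submit", "NPT_liqbox", CLUSTER_FLAGS),
    ("run", "extract_final_NPT_config", []),
    ("submit", "run_gemc", CLUSTER_FLAGS),
    ("run", "check_prod_overlap", []),
    ("run", "calculate_props", []),
]


def vle_commands(project_script="project_optff_ms.py"):
    """Return the workflow commands, in order, as argument lists."""
    return [
        ["python", project_script, mode, "-o", operation, *flags]
        for mode, operation, flags in VLE_STEPS
    ]


if __name__ == "__main__":
    for cmd in vle_commands():
        print(shlex.join(cmd))
```

Passing `project_script="project_gaff_ms.py"` yields the same sequence for the GAFF workflow. Submitted operations run asynchronously on the cluster, so `status` should confirm they have finished before the next `run` step is started.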
To run vapor-liquid-equilibrium iterations for GAFF, follow these steps:
- Use init_gaff_ms.py to initialize files for simulation use
    cd genFF_public
    python init_gaff_ms.py
- Do the following in the gaff_ff_ms directory:
- Check status a few times throughout the process
    python project_gaff_ms.py status
- Create force fields
    python project_gaff_ms.py run -o create_forcefield
- Calculate vapor/liquid box size
    python project_gaff_ms.py run -o calc_boxes
- Run simulation and check for overlap
    python project_gaff_ms.py submit -o NVT_liqbox --bundle=12 --parallel
    python project_gaff_ms.py run -o extract_final_NVT_config
    python project_gaff_ms.py submit -o NPT_liqbox --bundle=12 --parallel
    python project_gaff_ms.py run -o extract_final_NPT_config
    python project_gaff_ms.py submit -o run_gemc --bundle=12 --parallel
    python project_gaff_ms.py run -o check_prod_overlap
- Calculate VLE Properties
    python project_gaff_ms.py run -o calculate_props
When both the GAFF and OptFF molecular simulations are finished:
- cd to genFF_public/ (top level directory)
- Extract VLE properties, save them to the Results_MS directory, and create a pdf of plots
    python post_analysis_ms.py
The instructions outlined above appear to be system-dependent. In some cases, users encounter the following error:
    ImportError: /lib64/libstdc++.so.6: version `GLIBCXX_3.4.29' not found
If you observe this, please try the following in the terminal:
    export LD_LIBRARY_PATH=$CONDA_PREFIX/lib:$LD_LIBRARY_PATH
This should fix the problem. It is not an optimal solution and is something we would like to address. We found that related projects have similar issues. If you are aware of a robust solution to this issue, please let us know by raising an issue or sending an email!
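Before launching a long workflow, it can help to confirm that the workaround is active in the current shell. The following is a minimal sketch, assuming a POSIX-style `LD_LIBRARY_PATH`; the function `conda_lib_on_path` is hypothetical and only inspects environment variables, so it cannot tell whether the library in that directory actually provides the required GLIBCXX version.

```python
import os


def conda_lib_on_path(env=None):
    """Return True if $CONDA_PREFIX/lib is listed in LD_LIBRARY_PATH."""
    env = os.environ if env is None else env
    prefix = env.get("CONDA_PREFIX")
    if not prefix:
        return False  # no active conda environment detected
    conda_lib = os.path.join(prefix, "lib")
    entries = env.get("LD_LIBRARY_PATH", "").split(os.pathsep)
    return conda_lib in entries


if __name__ == "__main__":
    if conda_lib_on_path():
        print("LD_LIBRARY_PATH already includes $CONDA_PREFIX/lib")
    else:
        print("Consider: export LD_LIBRARY_PATH=$CONDA_PREFIX/lib:$LD_LIBRARY_PATH")
```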
This research is based upon work supported by the National Science Foundation under award number ERC-2330175 for the Engineering Research Center EARTH as well as grants EFRI 2029354 and CBET-1917474. Computing resources were provided by the Center for Research Computing (CRC) at the University of Notre Dame. MC acknowledges support from the Graduate Assistance in Areas of National Need fellowship from the Department of Education, grant number P200A210048.
Please contact Montana Carlozo (mcarlozo@nd.edu) or Dr. Edward Maginn (ed@nd.edu) with any questions, suggestions, or issues.
This section lists software versions for the most important packages.
cassandra==1.3.1
foyer==0.12.1
gpflow==2.9.2
matplotlib==3.10.1
mosdef_cassandra==0.4.0
numdifftools==0.9.41
numpy==1.26.4
packmol==20.16.1
pandas==2.2.3
panedr==0.8.0
pymser==1.0.21
python==3.12.10
scipy==1.15.2
signac==2.3.0
signac-flow==0.29.0
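Since results may differ across software versions, a quick comparison of the installed packages against the pins above can be useful. This sketch uses only `importlib.metadata` from the standard library. It assumes the listed names match the installed distribution names, and it omits cassandra, packmol, and python, which are not Python distributions, as well as mosdef_cassandra, whose distribution name is not verified here. `PINNED` and `find_mismatches` are hypothetical names.

```python
from importlib import metadata

# Pins copied from the list above (Python packages only).
PINNED = {
    "foyer": "0.12.1",
    "gpflow": "2.9.2",
    "matplotlib": "3.10.1",
    "numdifftools": "0.9.41",
    "numpy": "1.26.4",
    "pandas": "2.2.3",
    "panedr": "0.8.0",
    "pymser": "1.0.21",
    "scipy": "1.15.2",
    "signac": "2.3.0",
    "signac-flow": "0.29.0",
}


def find_mismatches(pins, get_version=metadata.version):
    """Return {name: (pinned, installed)} for packages that differ.

    `installed` is None when the package is not found.
    """
    mismatches = {}
    for name, wanted in pins.items():
        try:
            found = get_version(name)
        except metadata.PackageNotFoundError:
            found = None
        if found != wanted:
            mismatches[name] = (wanted, found)
    return mismatches


if __name__ == "__main__":
    for name, (wanted, found) in find_mismatches(PINNED).items():
        print(f"{name}: pinned {wanted}, installed {found}")
```

A mismatch is not necessarily a problem; it only flags where an environment differs from the one used for this study.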