Dataset | Task | #Samples | #Tasks | Domain | Split | Metric | Date | Benchmark | Source |
---|---|---|---|---|---|---|---|---|---|
QM7 | PP | 7,165 | 1 | Molecule | Stratified | MAE | 2012* | MoleculeNet | GDB-13 |
QM7-b | PP | 7k | 14 | Molecule | Random | MAE | 2014 | MoleculeNet | GDB-13, info, info |
QM8 | PP | 22k | 12 | Molecule | Random | MAE | 2014 | MoleculeNet | GDB-17, info |
QM9 | PP,GM | 130-134k | 12-19 | Molecule | Random | MAE | 2012 | MoleculeNet | GDB-17, info, PyG |
PDBbind | PP | 5k-13k | 1 | Protein-ligand | Time | RMSE | 2004* | MoleculeNet | PDB, info |
ANI-1 | PP | 22M | 1 | Molecules | - | MAE, ROC AUC | 2017 | - | GDB-11, github, data-git |
Alchemy | PP | 120k | 12 | Molecule | Stratified^ | MAE | 2019 | Alchemy Contest | GDB MedChem, link, link |
Matbench | PP | 1k-100k | 13 | Molecules | StratifiedKFold^ | MAE, ROC-AUC | 2019 | Matbench | Materials Project, data |
Atom3D | PP | Varied | 8 | Mol., RNA, Prot. | Varied | Various | 2021 | Atom3D | github |
Jarvis | PP | 1k-800k | Many | Molecules | Random^ | MAE | 2020* | Nist-Jarvis | Documentation, Paper |
Open MatSci ML Toolkit | PP, GM | 1.5M | Varied | Crystal Structures | Stratified/Random | MAE | 2023* | Open MatSci ML Toolkit | Paper, github |
Therapeutic Data Commons | PP | Varied | Varied | Molecules, Proteins | Stratified^ | Varied | 2022 | TDC | Documentation, github |
TorchProtein | PP,SP | Varied | Varied | Proteins | Stratified^ | Varied | 2022 | TorchProtein | |
TorchDrug | PP,GM | Varied | Varied | Molecules | Stratified^ | Varied | 2022 | TorchDrug | Paper, github |
OC20 | PP,MD | 560k-133M | 3 | Materials | Extrapolation^ | MAE, EwT | 2020* | OCP | Paper, github |
OC22 | PP,MD | 50k-10M | 3 | Materials | Extrapolation^ | MAE, EwT | 2022* | OCP | Paper, github |
ODAC23 | PP, MD | <40M | 3 | Catalyst, MOF | Extrapolation^ | MAE, EwT | 2023* | | Paper, github |
QM7-X | PP,MD | 4.2M | 42 | Molecule | Extrapolation | MAE | 2022 | - | GDB-13 |
MD17 | MD | 150k-1M | 10 | Molecule | Extrapolation | MAE | 2017* | | info, PyG, HF, git |
ISO17 | MD | 645k | 1 | Molecule | Extrapolation | MAE | 2016 | - | QM9 |
GEOM | GM | 37M | 1 | Molecule | Random | MAE, RMSD | 2021* | GEOM | Paper, github |
ProteinNet | SP | 35k-105k | 7 | Proteins | Stratified | Varied | 2019 | - | Paper, github |
Molecule3D | SP | 3.9M | 4 | Molecules | Random | MAE, RMSE, validity | 2021 | - | PubChemQC |
SPICE | SP | 1.1M | 6 | Molecules | Random | MAE | 2023 | - | Paper, github |
MatBench Discovery | SP | 150k-250k | 2 | Crystal Structures | Varied^ | F1, Accuracy, MAE | 2023 | MatBench Discovery | Paper, github |

This table lists benchmark datasets for Geometric GNNs applied to 3D atomic systems. Each dataset is categorized by the application task detailed in Section 7.1 of our hitchhiker's guide. The remaining columns give, from left to right: the number of samples per dataset, the number of properties that can be predicted, the input data domain, the dataset split method, the metric used to measure performance, the date, the benchmark link, and the source. The symbol * in the Date column signifies that the dataset has been updated recently. The symbol ^ (shown next to the split method) means that there is an active leaderboard to submit test predictions and compare one's results.
PP = Property Prediction
MD = Molecular Dynamics
GM = Generative Modelling
SP = Structure Prediction
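
The two metrics that recur most often in the Metric column, MAE and EwT (energy within threshold), can be illustrated with a minimal pure-Python sketch. The function names are hypothetical and not taken from any of the listed benchmarks. The 0.02 eV default threshold is an assumption based on common OC20 usage, so check each benchmark's own documentation for the exact definition it applies.

```python
from typing import Sequence


def mean_absolute_error(pred: Sequence[float], target: Sequence[float]) -> float:
    """MAE: average absolute difference between predictions and targets."""
    if len(pred) != len(target) or len(pred) == 0:
        raise ValueError("pred and target must be non-empty and of equal length")
    return sum(abs(p - t) for p, t in zip(pred, target)) / len(pred)


def energy_within_threshold(
    pred: Sequence[float],
    target: Sequence[float],
    threshold: float = 0.02,  # assumed default (eV); not specified in the table
) -> float:
    """EwT: fraction of predictions whose absolute error is below `threshold`."""
    if len(pred) != len(target) or len(pred) == 0:
        raise ValueError("pred and target must be non-empty and of equal length")
    hits = sum(1 for p, t in zip(pred, target) if abs(p - t) < threshold)
    return hits / len(pred)


if __name__ == "__main__":
    # Toy energies (illustrative values only, not from any dataset above).
    predicted = [0.00, 0.50, 1.00]
    reference = [0.01, 0.50, 1.50]
    print("MAE:", mean_absolute_error(predicted, reference))
    print("EwT:", energy_within_threshold(predicted, reference))
```

Lower is better for MAE, while higher is better for EwT, so the two are not directly comparable across rows of the table.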