datasets.md

| Dataset | Task | #Samples | #Tasks | Domain | Split | Metric | Date | Benchmark | Source |
|---|---|---|---|---|---|---|---|---|---|
| QM7 | PP | 7,165 | 1 | Molecule | Stratified | MAE | 2012* | MoleculeNet | GDB-13 |
| QM7-b | PP | 7k | 14 | Molecule | Random | MAE | 2014 | MoleculeNet | GDB-13, info, info |
| QM8 | PP | 22k | 12 | Molecule | Random | MAE | 2014 | MoleculeNet | GDB-17, info |
| QM9 | PP, GM | 130-134k | 12-19 | Molecule | Random | MAE | 2012 | MoleculeNet | GDB-17, info, PyG |
| PDBbind | PP | 5k-13k | 1 | Protein-ligand | Time | RMSE | 2004* | MoleculeNet | PDB, info |
| ANI-1 | PP | 22M | 1 | Molecules | - | MAE, ROC-AUC | 2017 | - | GDB-11, github, data-git |
| Alchemy | PP | 120k | 12 | Molecule | Stratified^ | MAE | 2019 | Alchemy Contest | GDB MedChem, link, link |
| Matbench | PP | 1k-100k | 13 | Molecules | StratifiedKFold^ | MAE, ROC-AUC | 2019 | Matbench | Materials Project, data |
| Atom3D | PP | Varied | 8 | Mol., RNA, Prot. | Varied | Various | 2021 | Atom3D | github |
| Jarvis | PP | 1k-800k | Many | Molecules | Random^ | MAE | 2020* | Nist-Jarvis | Docu, Paper |
| Open MatSci ML Toolkit | PP, GM | 1.5M | Varied | Crystal Structures | Stratified/Random | MAE | 2023* | Open MatSci ML Toolkit | Paper, github |
| Therapeutic Data Commons | PP | Varied | Varied | Molecules, Proteins | Stratified^ | Varied | 2022 | TDC | Documentation, github |
| TorchProtein | PP, SP | Varied | Varied | Proteins | Stratified^ | Varied | 2022 | TorchProtein | |
| TorchDrug | PP, GM | Varied | Varied | Molecules | Stratified^ | Varied | 2022 | TorchDrug | Paper, github |
| OC20 | PP, MD | 560k-133M | 3 | Materials | Extrapolation^ | MAE, EwT | 2020* | OCP | Paper, github |
| OC22 | PP, MD | 50k-10M | 3 | Materials | Extrapolation^ | MAE, EwT | 2022* | OCP | Paper, github |
| ODAC23 | PP, MD | <40M | 3 | Catalyst, MOF | Extrapolation^ | MAE, EwT | 2023* | | Paper, github |
| QM7-X | PP, MD | 4.2M | 42 | Molecule | Extrapolation | MAE | 2022 | - | GDB-13 |
| MD17 | MD | 150k-1M | 10 | Molecule | Extrapolation | MAE | 2017* | | info, PyG, HF, git |
| ISO17 | MD | 645k | 1 | Molecule | Extrapolation | MAE | 2016 | - | QM9 |
| GEOM | GM | 37M | 1 | Molecule | Random | MAE, RMSD | 2021* | GEOM | Paper, github |
| ProteinNet | SP | 35k-105k | 7 | Proteins | Stratified | Varied | 2019 | - | Paper, github |
| Molecule3D | SP | 3.9M | 4 | Molecules | Random | MAE, RMSE, validity | 2021 | - | PubChemQC |
| SPICE | SP | 1.1M | 6 | Molecules | Random | MAE | 2023 | - | Paper, github |
| MatBench Discovery | SP | 150k-250k | 2 | Crystal Structures | Varied^ | F1, Accuracy, MAE | 2023 | MatBench Discovery | Paper, github |

This table lists benchmark datasets for Geometric GNNs applied to 3D atomic systems. We categorize each dataset by the application task detailed in Section 7.1 of our hitchhiker's guide. The columns give, from left to right: the dataset name, the task, the number of samples, the number of properties that can be predicted (#Tasks), the input data domain, the dataset split method, the metric used to measure performance, the date, the benchmark link, and the source. The symbol * in the Date column signifies that the dataset has been updated recently. The symbol ^ in the Split column means that there is an active leaderboard to submit test predictions and compare one's results.

PP = Property Prediction
MD = Molecular Dynamics
GM = Generative Modelling
SP = Structure Prediction
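
The task abbreviations and the `*` / `^` markers make the table easy to query programmatically. The following is a minimal sketch, not part of the original repository: the `DatasetEntry` class, the `parse_row` and `by_task` helpers, and the choice of five sample rows are all hypothetical and only illustrate how the conventions above could be decoded.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DatasetEntry:
    """Hypothetical record for one row of the table (illustration only)."""
    name: str
    tasks: frozenset          # task abbreviations, e.g. {"PP", "MD"}
    domain: str
    has_leaderboard: bool     # "^" marker in the Split column
    recently_updated: bool    # "*" marker in the Date column


def parse_row(name, task, domain, split, date):
    """Decode the table's marker conventions for a single row."""
    return DatasetEntry(
        name=name,
        tasks=frozenset(t.strip() for t in task.split(",")),
        domain=domain,
        has_leaderboard=split.endswith("^"),
        recently_updated=date.endswith("*"),
    )


# A few rows copied from the table: (Dataset, Task, Domain, Split, Date).
ROWS = [
    ("QM9", "PP,GM", "Molecule", "Random", "2012"),
    ("OC20", "PP,MD", "Materials", "Extrapolation^", "2020*"),
    ("MD17", "MD", "Molecule", "Extrapolation", "2017*"),
    ("GEOM", "GM", "Molecule", "Random", "2021*"),
    ("SPICE", "SP", "Molecules", "Random", "2023"),
]

ENTRIES = [parse_row(*row) for row in ROWS]


def by_task(entries, abbrev):
    """Return the names of datasets tagged with the given task abbreviation."""
    return [e.name for e in entries if abbrev in e.tasks]


print(by_task(ENTRIES, "MD"))  # ['OC20', 'MD17']
```

Storing the tasks as a set rather than the raw string means that a multi-task entry such as `PP,MD` matches a query for either abbreviation.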