Name	Name	Last commit message	Last commit date
parent directory ..
model_configures	model_configures
README.md	README.md
analysis.py	analysis.py
classification_inference.py	classification_inference.py
classification_train.py	classification_train.py
hyper.py	hyper.py
regression_inference.py	regression_inference.py
regression_train.py	regression_train.py
utils.py	utils.py

Molecular Property Prediction on a New CSV Dataset

The scripts in this directory are helpful for quickly prototyping a GNN-based model for molecular property prediction on a new CSV dataset.

The command line interface has been tested against the MoleculeNet benchmark. For more details, see here.

Data Preparation

For training, we assume that the molecular properties are recorded in a CSV file, where one column holds the SMILES strings of molecules and some other column(s) hold one or multiple molecular properties.

For inference, there are two options:

A .csv/.csv.gz file, where one column holds the SMILES strings of molecules
A .txt file, where each line corresponds to the SMILES string of a molecule

Data Analysis (Optional)

We can do a quick analysis to molecules in a CSV file with

python analysis.py -c X -sc Y

where X specifies the path to the CSV file and Y specifies the header for the SMILES column in the CSV file.

Other optional arguments include:

Number of Processes: -np processes [default=1]
- Specifies the number of processes to use for computing
Result Path: -p path [default=analysis_results]
- Specifies the path to save the analysis results

The analysis results will be saved to the following files in the result path above:

valid_canonical_smiles.txt: Canonical SMILES for valid molecules
summary.txt: A file of the analysis summarized, including
- Number/percentage of valid molecules
- Average number of atoms/bonds/rings per molecule
- Number of molecules for a particular atom/bond descriptor value to appear

Modeling

Currently we use a default setting as follows:

Construct molecular graphs for each molecule, where nodes are atoms and edges are bonds
Perform early stopping and save the model that achieves the best validation performance
(Optional) Automatically perform hyperparameter search using Bayesian optimization

Regression

Training

To train a model for predicting real-valued properties, we can use

python regression_train.py -c X -sc Y

where X specifies the path to the CSV file and Y specifies the header for the SMILES column in the CSV file.

Other optional arguments include:

Model: -mo model [default=GCN]
- Specifies the model to use.
- By default we use GCN for GCN followed by weighted sum and max pooling, other options include:
  - GAT for GAT followed by weighted sum and max pooling
  - Weave for Weave
  - MPNN for MPNN
  - AttentiveFP for AttentiveFP
  - gin_supervised_contextpred for GIN pre-trained with supervised learning and context prediction
  - gin_supervised_infomax for GIN pre-trained with supervised learning and mutual information maximization
  - gin_supervised_edgepred for GIN pre-trained with supervised learning and edge prediction
  - gin_supervised_masking for GIN pre-trained with supervised learning and attribute masking
  - NF for Neural Fingerprint
Atom Featurizer Type: -a feaurizer [default=canonical]
- Comes into effect only when the model is not a pre-trained GIN
- Specifies the initial featurization for atoms.
- By default we use canonical for CanonicalAtomFeaturizer. Alternatively, we can use attentivefp for the atom featurizer used in AttentiveFP.
Bond Featurizer Type: -b feaurizer [default=canonical]
- Comes into effect only when the model used is one of Weave, MPNN and AttentiveFP
- Specifies the initial featurization for bonds.
- By default we use canonical for CanonicalBondFeaturizer. Alternatively, we can use attentivefp for the bond featurizer used in AttentiveFP.
Number of Hyperparameter Search Trials: -ne num_trials [default=None]
- Specifies the number of trials for hyperparameter search. If not specified, we use the setting specified in model_configuration/{model_name}.json
Task: -t task1,task2,task3,...
- Specifies the headers for task columns in the CSV file. If not specified, we assume all columns are molecular properties except for the SMILES column.
Take the logarithm of the labels -lv [default=False]
- Whether to take logarithm of the labels for modeling
Split: -s split [default=scaffold_smiles]
- Specifies the split for the dataset
- By default we use 'scaffold_smiles' for scaffold split based on rdkit.Chem.Scaffolds.MurckoScaffold.MurckoScaffoldSmiles, alternatively we can use 'random' for random split or 'scaffold_decompose' for scaffold split based on rdkit.Chem.AllChem.MurckoDecompose.
Split Ratio: -sr a,b,c [default=0.8,0.1,0.1]
- Specifies the proportion of the dataset to be used for training, validation and test.
Evaluation Metric: -me metric [default=r2]
- Specifies the evaluation metric.
- By default we use Pearson correlation coefficient. Alternatively, we can use mae for mean absolute error, and rmse for root mean square error.
Num Epochs: -n number [default=1000]
- Specifies the maximum number of epochs for training. Early stopping will be performed based on validation metric.
Num of Processes for Data Loading: -nw number [default=1]
- Specifies the number of processes to use for data loading. A larger number might yield a faster speed.
Print Every: -pe number [default=20]
- The training progress will be printed every number minibatches.
Result Path: -p path [default=regression_results]
- Specifies the path to save training results.

Once the training is completed, we can find the following files in the Result Path mentioned above:

The trained model model.pth. If hyperparameter search is performed, this will be the best model found.
The evaluation result eval.txt. If hyperparameter search is performed, this will be the result for the best hyperparameters found.
The experiment configuration configure.json. If hyperparameter search is performed, this will be the best hyperparameters found, otherwise it will be just the setting specified in model_configuration/{model_name}.json.

Inference

To use the model trained above for prediction on new molecules

python regression_inference.py -f X

where X specifies the path to a .csv/.txt file of SMILES strings

Other optional arguments include:

SMILES Column: -sc column [default=None]
- Specifies the column of SMILES strings in the input .csv file. Can be omitted if the input file is a .txt file or the .csv file only has one column of SMILES strings
Train Result Path: -tp path [default=regression_results]
- Path to the training results saved, which will be used for loading the trained model and related configurations
Inference Result Path: -ip path [default=regression_inference_results]
- Path to the inference results, which will be used to save:
  - prediction.csv: A file of predicted properties associated with SMILES strings
Task: -t task1,task2,task3,...
- Task names for saving model predictions in the CSV file to output, which should be the same as the ones used for training. If not specified, we will simply use task1, task2, ...
Num of Processes for Data Loading: -nw number [default=1]
- Specifies the number of processes to use for data loading. A larger number might yield a faster speed.

(Multi-label) Binary Classification

Training

To train a model for predicting binary labels, we can use

python classification_train.py -c X -sc Y

where X specifies the path to the CSV file and Y specifies the header for the SMILES column in the CSV file.

Other optional arguments include:

Model: -mo model [default=GCN]
- Specifies the model to use.
- By default we use GCN for GCN followed by weighted sum and max pooling, other options include:
  - GAT for GAT followed by weighted sum and max pooling
  - Weave for Weave
  - MPNN for MPNN
  - AttentiveFP for AttentiveFP
  - gin_supervised_contextpred for GIN pre-trained with supervised learning and context prediction
  - gin_supervised_infomax for GIN pre-trained with supervised learning and mutual information maximization
  - gin_supervised_edgepred for GIN pre-trained with supervised learning and edge prediction
  - gin_supervised_masking for GIN pre-trained with supervised learning and attribute masking
  - NF for Neural Fingerprint
Atom Featurizer Type: -a feaurizer [default=canonical]
- Comes into effect only when the model is not a pre-trained GIN
- Specifies the initial featurization for atoms.
- By default we use canonical for CanonicalAtomFeaturizer. Alternatively, we can use attentivefp for the atom featurizer used in AttentiveFP.
Bond Featurizer Type: -b feaurizer [default=canonical]
- Comes into effect only when the model used is one of Weave, MPNN and AttentiveFP
- Specifies the initial featurization for bonds.
- By default we use canonical for CanonicalBondFeaturizer. Alternatively, we can use attentivefp for the bond featurizer used in AttentiveFP.
Number of Hyperparameter Search Trials: -ne num_trials [default=None]
- Specifies the number of trials for hyperparameter search. If not specified, we use the setting specified in model_configuration/{model_name}.json
Task: -t task1,task2,task3,...
- Specifies the headers for task columns in the CSV file. If not specified, we assume all columns are molecular properties except for the SMILES column.
Split: -s split [default=scaffold_smiles]
- Specifies the split for the dataset
- By default we use 'scaffold_smiles' for scaffold split based on rdkit.Chem.Scaffolds.MurckoScaffold.MurckoScaffoldSmiles, alternatively we can use 'random' for random split or 'scaffold_decompose' for scaffold split based on rdkit.Chem.AllChem.MurckoDecompose.
Split Ratio: -sr a,b,c [default=0.8,0.1,0.1]
- Specifies the proportion of the dataset to be used for training, validation and test.
Evaluation Metric: -me metric [default=roc_auc_score]
- Specifies the evaluation metric.
- By default we use ROC AUC score.
Num Epochs: -n number [default=1000]
- Specifies the maximum number of epochs for training. Early stopping will be performed based on validation metric.
Num of Processes for Data Loading: -nw number [default=1]
- Specifies the number of processes to use for data loading. A larger number might yield a faster speed.
Print Every: -pe number [default=20]
- The training progress will be printed every number minibatches.
Result Path: -p path [default=classification_results]
- Specifies the path to save training results.

Once the training is completed, we can find the following files in the Result Path mentioned above:

The trained model model.pth. If hyperparameter search is performed, this will be the best model found.
The evaluation result eval.txt. If hyperparameter search is performed, this will be the result for the best hyperparameters found.
The experiment configuration configure.json. If hyperparameter search is performed, this will be the best hyperparameters found, otherwise it will be just the setting specified in model_configuration/{model_name}.json.

Inference

To use the model trained above for prediction on new molecules

python classification_inference.py -f X

where X specifies the path to a .csv/.txt file of SMILES strings

Other optional arguments include:

SMILES Column: -sc column [default=None]
- Specifies the column of SMILES strings in the input .csv file. Can be omitted if the input file is a .txt file or the .csv file only has one column of SMILES strings
Train Result Path: -tp path [default=classification_results]
- Path to the training results saved, which will be used for loading the trained model and related configurations
Inference Result Path: -ip path [default=classification_inference_results]
- Path to the inference results, which will be used to save:
  - prediction.csv: A file of predicted properties associated with SMILES strings
Task: -t task1,task2,task3,... [default=None]
- Task names for saving model predictions in the CSV file to output, which should be the same as the ones used for training. If not specified, we will simply use task1, task2, ...
Soft Classification: -s [default=False]
- By default we will perform hard classification with binary labels. This flag allows performing soft classification instead.
Num of Processes for Data Loading: -nw number [default=1]
- Specifies the number of processes to use for data loading. A larger number might yield a faster speed.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

csv_data_configuration

csv_data_configuration

README.md

Molecular Property Prediction on a New CSV Dataset

Data Preparation

Data Analysis (Optional)

Modeling

Regression

Training

Inference

(Multi-label) Binary Classification

Training

Inference

Files

csv_data_configuration

Directory actions

More options

Directory actions

More options

Latest commit

History

csv_data_configuration

Folders and files

parent directory

README.md

Molecular Property Prediction on a New CSV Dataset

Data Preparation

Data Analysis (Optional)

Modeling

Regression

Training

Inference

(Multi-label) Binary Classification

Training

Inference