- First, hyperparameter tuning was performed to find the optimal settings for feature selection on compound structure-target binding datasets. The analysis was conducted separately on two sets, each containing 100 sampled datasets: i) datasets with molecular descriptors as features, and ii) datasets with MACCS fingerprints as features. The results were stored in
compound_target_0.25_binary_feature_select_tuning/.
Result files of molecular descriptors start with 'descriptor_all'. Result files of MACCS fingerprints start with 'fingerprint_maccs'. The following types of results were generated:
- number of selected features (files that end with 'select_features_number_summary.tsv')
- training performance (computed by 10-fold cross-validation, analysis repeated 20 times) by selected features (files that contain 'select_features_training_performance_summary'). Three types of metrics were used to evaluate model performance:
- AUROC (files that end with 'auc.tsv')
- Balanced accuracy (files that end with 'bac.tsv')
- F1 score (files that end with 'f1.tsv')
- testing performance by selected features (files that contain 'select_features_testing_performance_summary')
- testing performance by all features (files that contain 'all_features_summary')
- optimal hyperparameter setting (files that end with 'optimal_hyperparameters.txt'). The optimal hyperparameter setting is the one that achieves the maximum training AUROC.
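The selection of the optimal setting can be sketched as follows; this is a minimal illustration assuming a scikit-learn SelectKBest + random forest pipeline with made-up candidate values, not the actual pipeline or its hyperparameter grid:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score
from sklearn.pipeline import Pipeline

def tune(X, y, candidate_k=(10, 25, 50, 100)):
    # score each candidate setting by mean training AUROC
    # (10-fold cross-validation repeated 20 times) and keep the maximum
    cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=20, random_state=0)
    best_k, best_auroc = None, -np.inf
    for k in candidate_k:
        pipe = Pipeline([
            ('select', SelectKBest(f_classif, k=k)),       # illustrative selector
            ('clf', RandomForestClassifier(random_state=0)),
        ])
        auroc = cross_val_score(pipe, X, y, cv=cv, scoring='roc_auc').mean()
        if auroc > best_auroc:
            best_k, best_auroc = k, auroc
    return best_k, best_auroc
```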
- Second, the feature selection pipeline was implemented with the optimal hyperparameter settings on all compound structure-target binding datasets. The analysis was conducted separately on two sets: i) datasets with molecular descriptors as features and ii) datasets with MACCS fingerprints as features. The results were stored in
compound_target_0.25_binary_feature_select_implementation/.
Result files of molecular descriptors start with 'descriptor_all'. Result files of MACCS fingerprints start with 'fingerprint_maccs'. The following types of results were generated:
- number of selected features (files that end with 'select_features_number_summary.tsv')
- training performance (computed by 10-fold cross-validation, analysis repeated 20 times) by selected features (files that contain 'select_features_training_performance_summary'). Three types of metrics were used to evaluate model performance:
- AUROC (files that end with 'auc.tsv')
- Balanced accuracy (files that end with 'bac.tsv')
- F1 score (files that end with 'f1.tsv')
- testing performance by selected features (files that contain 'select_features_testing_performance_summary')
- testing performance by all features (files that contain 'all_features_summary')
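The training performance evaluation above can be sketched as follows; a minimal illustration assuming a scikit-learn random forest applied to the selected-feature matrix (the actual classifier is supplied by the pipeline):

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RepeatedStratifiedKFold, cross_validate

def training_performance(X_selected, y):
    # 10-fold cross-validation repeated 20 times, scored with AUROC,
    # balanced accuracy, and F1 (the three metrics in the summary files)
    cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=20, random_state=0)
    scores = cross_validate(
        RandomForestClassifier(random_state=0),  # illustrative classifier
        X_selected, y, cv=cv,
        scoring={'auc': 'roc_auc', 'bac': 'balanced_accuracy', 'f1': 'f1'},
    )
    # mean of each metric over the 200 folds (10 folds x 20 repeats)
    return {m: scores[f'test_{m}'].mean() for m in ('auc', 'bac', 'f1')}
```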
- Meanwhile, feature selection with L1 regularization was implemented on all compound structure-target binding datasets as a comparison against our feature selection pipeline. The analysis was conducted separately on two sets: i) datasets with molecular descriptors as features and ii) datasets with MACCS fingerprints as features. The results were stored in
compound_target_0.25_binary_regularization_implementation/.
Result files of molecular descriptors start with 'descriptor_all'. Result files of MACCS fingerprints start with 'fingerprint_maccs'. Two classification methods were adopted to build models upon the L1-selected features: i) lasso regression and ii) random forest. Result files of lasso regression contain 'lasso'; result files of random forest contain 'randomforest'. The following types of results were generated:
- number of selected features (files that end with 'select_features_number_summary.tsv')
- training performance (computed by 10-fold cross-validation, analysis repeated 20 times) by selected features (files that contain 'select_features_training_performance_summary'). Three types of metrics were used to evaluate model performance:
- AUROC (files that end with 'auc.tsv')
- Balanced accuracy (files that end with 'bac.tsv')
- F1 score (files that end with 'f1.tsv')
- testing performance by selected features (files that contain 'select_features_testing_performance_summary')
- testing performance by all features (files that contain 'all_features_summary')
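A minimal sketch of this L1 baseline, assuming scikit-learn and approximating 'lasso regression' on the binary labels with an L1-penalized logistic regression (an assumption; parameter values are illustrative):

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import LogisticRegression

def l1_baseline(X_train, y_train, classifier='randomforest'):
    # keep features whose L1-penalized coefficients are non-zero
    l1_model = LogisticRegression(penalty='l1', solver='liblinear', C=1.0)
    selector = SelectFromModel(l1_model).fit(X_train, y_train)
    X_selected = selector.transform(X_train)
    # refit either the L1 model itself ('lasso') or a random forest
    if classifier == 'lasso':
        model = LogisticRegression(penalty='l1', solver='liblinear', C=1.0)
    else:
        model = RandomForestClassifier(random_state=0)
    model.fit(X_selected, y_train)
    return selector, model
```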
- Next, basic analysis was conducted on subsets of the feature selection results. The analysis was conducted separately on two sets: i) datasets with molecular descriptors as features (results were stored in
compound_target_0.25_binary_feature_select_implementation/descriptor_all_analysis/
) and ii) datasets with MACCS fingerprints as features (results were stored in
compound_target_0.25_binary_feature_select_implementation/fingerprint_maccs_analysis/
). An AUROC threshold of 0.85 was adopted to select a subset of feature selection results for further analysis. The following types of results were generated:
- basic summary statistics of the feature selection results selected for analysis (files that end with 'feature_selection_statistics.txt'). Summary statistics include the average number of selected features, the average testing AUROC of the selected models, and a comparison against the performance of generic models without feature selection by the Wilcoxon test.
- predicted target binding profile of all OFFSIDES drugs using our feature selection model (files that end with 'prediction_select_features.tsv')
- predicted target binding profile of all OFFSIDES drugs using generic model without feature selection (files that end with 'prediction_all_features.tsv')
- selected structure features for each compound structure-target binding dataset by our pipeline (files that end with 'structure.tsv')
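The comparison against generic models can be sketched as follows; a minimal illustration assuming SciPy, with the two AUROC arrays standing in for the per-dataset testing AUROCs read from the summary files above:

```python
import numpy as np
from scipy.stats import wilcoxon

def compare_to_generic(auc_selected, auc_all_features):
    # paired Wilcoxon signed-rank test of testing AUROC: models built on
    # selected features vs. generic models built on all features
    auc_selected = np.asarray(auc_selected)
    auc_all_features = np.asarray(auc_all_features)
    stat, p_value = wilcoxon(auc_selected, auc_all_features)
    return {
        'mean_auc_selected': auc_selected.mean(),
        'mean_auc_all_features': auc_all_features.mean(),
        'wilcoxon_p_value': p_value,
    }
```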
- Last, feature similarity analysis was conducted on a subset of feature selection results. The analysis was only conducted for datasets with molecular descriptors as features, as our feature selection pipeline generally performs better on these datasets. A testing AUROC of 0.85 was adopted as the threshold to select the subset of feature selection results for further analysis. For each pair of compound structure-target binding datasets, the Jaccard similarity of the selected structure features was computed. The result was stored in
compound_target_0.25_binary_feature_select_implementation/descriptor_all_compare/.
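The pairwise similarity computation can be sketched as follows; a minimal illustration in which the dictionary of selected-feature sets is an assumed stand-in for the selected structure features reported above:

```python
from itertools import combinations

def jaccard(a, b):
    # Jaccard similarity of two feature sets: |intersection| / |union|
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if (a | b) else 0.0

def pairwise_feature_similarity(selected_features):
    """selected_features maps dataset name -> set of selected structure features."""
    return {
        (d1, d2): jaccard(f1, f2)
        for (d1, f1), (d2, f2) in combinations(selected_features.items(), 2)
    }
```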
- In addition, the predictive structure features identified by the feature selection pipeline were used to build classifiers for predicting the target binding profile of compounds. The prediction was made for two structure datasets: i) MACCS fingerprints of 8,541 compounds screened by the Tox21 project and ii) MACCS fingerprints of 708,409 compounds from DSSTox. The prediction results of Tox21 compounds were stored in
compound_target_0.25_binary_feature_select_implementation/fingerprint_maccs_tox21_pred/.
Two types of prediction results were generated:
- prediction of binary target binding label (file that ends with 'label.tsv')
- predicted probability of target binding (file that ends with 'probability.tsv')
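Generating the two outputs can be sketched as follows; a minimal illustration assuming a fitted scikit-learn classifier for one target, with 'model', 'X_maccs', and 'compound_ids' as hypothetical names:

```python
import pandas as pd

def predict_target_binding(model, X_maccs, compound_ids):
    # binary label -> 'label.tsv'; probability of the positive class -> 'probability.tsv'
    labels = pd.Series(model.predict(X_maccs), index=compound_ids, name='label')
    probs = pd.Series(model.predict_proba(X_maccs)[:, 1], index=compound_ids,
                      name='probability')
    return labels, probs
```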
Then, compound target-Tox21 assay outcome datasets were generated using the predicted probability of target binding and the assay screening results from Tox21. The generated datasets were stored in
compound_target_tox21_data/fingerprint_maccs_probability/.
The folder contains the generated full data for 15 Tox21 assay outcomes, as well as training/testing data produced by an 80%/20% split.
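Assembling one such dataset can be sketched as follows; a minimal illustration assuming pandas/scikit-learn, with the input file names as hypothetical placeholders:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# predicted target binding probabilities (compounds x targets) and one Tox21 assay outcome
probs = pd.read_csv('predicted_probability.tsv', sep='\t', index_col=0)
assay = pd.read_csv('tox21_assay_outcome.tsv', sep='\t', index_col=0)

full = probs.join(assay['outcome'], how='inner')     # full data for one assay
train, test = train_test_split(full, test_size=0.2,  # 80%/20% split
                               random_state=0, stratify=full['outcome'])
```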
- The feature selection pipeline was implemented on all compound target-adverse event datasets. The analysis was conducted separately on each of the 815 datasets. An optimal model that achieves the maximum training AUROC was selected for each dataset. Meanwhile, generic prediction models using molecular descriptors as features, as well as generic models without feature selection, were implemented as a comparison against our feature selection pipeline. The results were stored in
compound_target_all_adverse_event_feature_select_implementation/.
The following types of results were generated:
- testing AUROC of the 3 types of adverse event prediction models, along with 95% confidence intervals (computed by bootstrapping)
- selected target features for each dataset
- all target features for each dataset
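The bootstrapped confidence interval can be sketched as follows; a minimal illustration in which the number of resamples is an assumption:

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def bootstrap_auroc_ci(y_true, y_score, n_boot=1000, seed=0):
    # resample the test set with replacement, recompute AUROC each time,
    # and take the 2.5th/97.5th percentiles as the 95% confidence interval
    rng = np.random.default_rng(seed)
    y_true, y_score = np.asarray(y_true), np.asarray(y_score)
    aurocs = []
    for _ in range(n_boot):
        idx = rng.integers(0, len(y_true), len(y_true))
        if len(np.unique(y_true[idx])) < 2:   # AUROC needs both classes present
            continue
        aurocs.append(roc_auc_score(y_true[idx], y_score[idx]))
    lower, upper = np.percentile(aurocs, [2.5, 97.5])
    return roc_auc_score(y_true, y_score), (lower, upper)
```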