COSMO: COrrection of Sample Mislabeling by Omics
Multi-omics Enabled Sample Mislabeling Correction
- Download COSMO: `git clone https://github.com/bzhanglab/cosmo`
- Install Docker (>=19.03).
- Install Nextflow. More information can be found on the Nextflow "Get started" page.
All other tools used by COSMO have been dockerized and will be installed automatically the first time COSMO is run on a computer.
nextflow run bzhanglab/COSMO --help
N E X T F L O W ~ version 21.02.0-edge
Launching `cosmo.nf` [high_noyce] - revision: 46dc5c6c96
=========================================
COSMO => COrrection of Sample Mislabeling by Omics
=========================================
Usage:
nextflow run bzhanglab/COSMO
Arguments:
--d1_file Dataset with quantification data at gene level.
--d2_file Dataset with quantification data at gene level.
--cli_file Sample annotation data.
--cli_attribute Sample attribute(s) for prediction. Multiple attributes.
must be separated by ",".
--outdir Output folder.
--help Print help message.
The formats for both datasets (`--d1_file`, `--d2_file`) are the same. An example quantification dataset input (`--d1_file` or `--d2_file`) is shown below. The first column is the gene ID and all the other columns are the expression of proteins at gene level in different samples.
| | Testing_1 | Testing_2 | Testing_3 | Testing_4 | Testing_5 | Testing_6 | Testing_7 | Testing_8 | Testing_9 | Testing_10 |
|---|---|---|---|---|---|---|---|---|---|---|
| A1BG | 1.5963 | 2.8484 | 2.1092 | 2.7922 | 2.4444 | 3.9907 | 3.6792 | 3.7321 | 3.6123 | 3.1739 |
| A2M | 5.9429 | 5.0089 | 6.0823 | 6.0093 | 6.4553 | 6.0097 | 6.014 | 6.9721 | 4.4766 | 6.481 |
| AAAS | 1.9337 | 2.951 | 3.5984 | 2.0419 | 2.1217 | 0.9662 | 1.0086 | NA | 2.4936 | 2.2399 |
| AACS | 1.7549 | NA | 2.3948 | NA | 0.9946 | 2.5969 | NA | NA | 1.6488 | NA |
| AAGAB | NA | NA | 0.9982 | NA | 1.0282 | 1.6296 | NA | NA | 1.8141 | NA |
| AAK1 | 1.0459 | 2.5435 | 1.7449 | NA | 1.0653 | 0.9855 | 2.0395 | 1.1588 | NA | NA |
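As a quick sanity check before running the pipeline, a file in this layout can be parsed with a few lines of Python. The sketch below is not part of COSMO: the function name `read_quant_table` is hypothetical, and it assumes a tab-separated file in which missing values are written as `NA`, as in the example above.

```python
import csv
import io
import math


def read_quant_table(handle):
    """Parse a gene-by-sample table; return (sample_names, {gene: [values]})."""
    reader = csv.reader(handle, delimiter="\t")
    header = next(reader)
    samples = header[1:]  # the first header cell belongs to the gene ID column
    table = {}
    for row in reader:
        if not row:
            continue  # skip blank lines
        gene, cells = row[0], row[1:]
        if len(cells) != len(samples):
            raise ValueError(
                f"{gene}: expected {len(samples)} values, got {len(cells)}"
            )
        # Missing measurements are written as NA; map them to NaN.
        table[gene] = [math.nan if c == "NA" else float(c) for c in cells]
    return samples, table


# A small excerpt of the example table (tab-separated layout is an assumption).
example = (
    "\tTesting_1\tTesting_2\tTesting_3\n"
    "A1BG\t1.5963\t2.8484\t2.1092\n"
    "AACS\t1.7549\tNA\t2.3948\n"
)
samples, table = read_quant_table(io.StringIO(example))
missing = {g: sum(math.isnan(v) for v in vals) for g, vals in table.items()}
print(samples)
print(missing)
```

Counting missing values per gene, as done at the end, gives a rough idea of how much imputation the preprocessing step will have to perform.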
The input for parameter `--cli_file` is the sample annotation file and an example is shown below:
sample | age | gender | stage | colon_rectum | msi | tumor_normal |
---|---|---|---|---|---|---|
Testing_1 | 47 | Female | High | Colon | MSI-Low/MSS | Tumor |
Testing_2 | 68 | Female | High | Rectum | MSI-Low/MSS | Tumor |
Testing_3 | 52 | Male | Low | Colon | MSI-Low/MSS | Tumor |
Testing_4 | 54 | Female | Low | Colon | MSI-High | Tumor |
Testing_5 | 72 | Male | High | Colon | MSI-Low/MSS | Tumor |
Testing_6 | 61 | Male | High | Colon | MSI-Low/MSS | Tumor |
Testing_7 | 58 | Female | High | Colon | MSI-High | Tumor |
Testing_8 | 73 | Male | Low | Colon | MSI-Low/MSS | Tumor |
Testing_9 | 68 | Male | Low | Colon | MSI-Low/MSS | Tumor |
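Before launching a run, it can help to confirm that the attributes passed to `--cli_attribute` exist in the annotation file and that every annotated sample has a column in the quantification data. The following is a minimal sketch under stated assumptions, not COSMO's own validation: `check_annotation` is a hypothetical helper, the file is assumed to be tab-separated, and the sample column is assumed to be named `sample` as in the example above. The attribute `grade` is made up to show a failing case.

```python
import csv
import io


def check_annotation(cli_handle, data_samples, attributes):
    """Return (missing_attributes, samples_without_data) for an annotation table."""
    reader = csv.DictReader(cli_handle, delimiter="\t")
    rows = list(reader)
    columns = reader.fieldnames or []
    # Attributes requested for prediction that are not columns of the file.
    missing_attributes = [a for a in attributes if a not in columns]
    # Annotated samples that have no matching column in the quantification data.
    known = set(data_samples)
    samples_without_data = [r["sample"] for r in rows if r["sample"] not in known]
    return missing_attributes, samples_without_data


cli_tsv = (
    "sample\tage\tgender\tmsi\n"
    "Testing_1\t47\tFemale\tMSI-Low/MSS\n"
    "Testing_2\t68\tFemale\tMSI-Low/MSS\n"
)
missing_attributes, samples_without_data = check_annotation(
    io.StringIO(cli_tsv),
    data_samples=["Testing_1"],
    attributes="gender,msi,grade".split(","),  # "grade" is a hypothetical attribute
)
print(missing_attributes)
print(samples_without_data)
```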
Below is an example of running COSMO:
nextflow run bzhanglab/COSMO --d1_file example_data/test_pro.tsv \
--d2_file example_data/test_rna.tsv \
--cli_file example_data/test_cli.tsv \
--cli_attribute "gender,msi" \
--outdir ./results
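When COSMO is launched from a script rather than typed by hand, the same command can be assembled programmatically. The sketch below only builds the argument list from the options shown in the help message. `build_cosmo_command` is a hypothetical helper and is not provided by COSMO or Nextflow.

```python
import shlex


def build_cosmo_command(d1_file, d2_file, cli_file, attributes, outdir):
    """Assemble the argument list for `nextflow run bzhanglab/COSMO`."""
    return [
        "nextflow", "run", "bzhanglab/COSMO",
        "--d1_file", d1_file,
        "--d2_file", d2_file,
        "--cli_file", cli_file,
        # Multiple attributes are joined with "," as the help message requires.
        "--cli_attribute", ",".join(attributes),
        "--outdir", outdir,
    ]


cmd = build_cosmo_command(
    "example_data/test_pro.tsv",
    "example_data/test_rna.tsv",
    "example_data/test_cli.tsv",
    ["gender", "msi"],
    "./results",
)
print(shlex.join(cmd))
```

To actually start the pipeline, the list can be passed to `subprocess.run(cmd, check=True)` on a machine where Nextflow and Docker are installed.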
The data needed to run the above example can be found in the `example_data` folder.
Below are the folders and files generated by COSMO.
data_use
This directory contains all the input data files:
- `test_pro.tsv`
- `test_rna.tsv`
- `test_cli.tsv`
results/method1
Output files from method_1:
- `genes.tsv`: Chromosome annotation of genes.
- `cleaned_data1.tsv`: Preprocessed data from the first dataset, `d1_file` (missing values imputed, if any).
- `cleaned_data2.tsv`: Preprocessed data from the second dataset, `d2_file` (missing values imputed, if any).
- `sample_correlation.csv`: Pearson correlation between samples from the first dataset (rows) and the second dataset (columns).
- `sample_correlation.png`: Heatmap image of `sample_correlation.csv`.
- `pairwise_matching.tsv`: Matching generated by the stable marriage algorithm based on sample correlation. Every row indicates one matched pair: a sample from the first dataset (`d1_label`) and a sample from the second dataset (`d2_label`). The column `d1rank` is the preferential rank of the d1 sample matched to the d2 sample; `d2rank` is the preferential rank of the d2 sample matched to the d1 sample.

  d1 | d1_label | d2 | d2_label | d1rank | d2rank | distance | correlation |
  ---|---|---|---|---|---|---|---|
  1 | Testing_1 | 1 | Testing_1 | 1 | 1 | 2 | 0.60889 |
  2 | Testing_2 | 2 | Testing_2 | 1 | 1 | 2 | 0.59604 |
  3 | Testing_3 | 3 | Testing_3 | 1 | 1 | 2 | 0.64042 |
  4 | Testing_4 | 4 | Testing_4 | 1 | 1 | 2 | 0.76045 |
  5 | Testing_5 | 5 | Testing_5 | 1 | 2 | 3 | 0.66900 |
  6 | Testing_6 | 7 | Testing_7 | 1 | 1 | 2 | 0.77152 |
  7 | Testing_7 | 6 | Testing_6 | 1 | 1 | 2 | 0.70996 |
  8 | Testing_8 | 8 | Testing_8 | 1 | 1 | 2 | 0.69767 |
  9 | Testing_9 | 9 | Testing_9 | 1 | 1 | 2 | 0.75336 |

- `clinical_attributes_pred.tsv`: Classification results of every sample for both datasets, using the method of winning team 1. Column `gender_prob` is the annotated binary label, `d1gender_prob` is the predicted probability for the sample from the first dataset, and `d2gender_prob` is the predicted probability for the sample from the second dataset. More columns will be generated if there are more clinical attributes.

  sample | gender | gender_prob | d1gender | d1gender_prob | d2gender | d2gender_prob | pred_gender |
  ---|---|---|---|---|---|---|---|
  Testing_1 | Female | 0 | Female | 0.01724 | Female | 0.32446 | 0.17085 |
  Testing_2 | Female | 0 | Female | 0.00930 | Female | 0.17867 | 0.09398 |
  Testing_3 | Male | 1 | Male | 0.97656 | Male | 0.78810 | 0.88233 |
  Testing_4 | Female | 0 | Female | 0.00489 | Female | 0.25205 | 0.12847 |
  Testing_5 | Male | 1 | Male | 0.99710 | Male | 0.58199 | 0.78955 |
  Testing_6 | Male | 1 | Male | 0.99831 | Female | 0.41568 | 0.70699 |
  Testing_7 | Female | 0 | Female | 0.02782 | Male | 0.57772 | 0.30277 |
  Testing_8 | Male | 1 | Male | 0.99377 | Male | 0.76312 | 0.87844 |
  Testing_9 | Male | 1 | Male | 0.99856 | Male | 0.69589 | 0.84722 |

- `errors.tsv`: Count of the different types of mislabeling errors.
- `final.tsv`: Table of corrected labels. Any inconsistency of id within a row indicates the presence of a mislabeling error. The table is generated using only the classification results of method_1. The interpretation is the same as for `cosmo_final_result.tsv` in `results/final`.
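The two example tables above can be inspected with a short script. The sketch below is illustrative and is not COSMO's code: `mismatched_pairs` and `mean_probability` are hypothetical helpers, the files are assumed to be tab-separated, and the inline data is reduced to the columns used here. The second helper reflects an observation about the example rows only: there, `pred_gender` agrees with the mean of `d1gender_prob` and `d2gender_prob` to the displayed precision. The README does not state this as a rule.

```python
import csv
import io


def mismatched_pairs(handle):
    """Return (d1, d2) index pairs where a d1 sample was matched to a different d2 sample."""
    reader = csv.DictReader(handle, delimiter="\t")
    return [(int(r["d1"]), int(r["d2"])) for r in reader if r["d1"] != r["d2"]]


def mean_probability(d1_prob, d2_prob):
    """Average the predicted probabilities from the two datasets."""
    return (d1_prob + d2_prob) / 2


# Rows 5-7 of the pairwise_matching.tsv example, reduced to three columns.
matching_tsv = (
    "d1\td2\tcorrelation\n"
    "5\t5\t0.66900\n"
    "6\t7\t0.77152\n"
    "7\t6\t0.70996\n"
)
pairs = mismatched_pairs(io.StringIO(matching_tsv))
print(pairs)

# Testing_1 from the clinical_attributes_pred.tsv example: pred_gender is 0.17085.
combined = mean_probability(0.01724, 0.32446)
print(abs(combined - 0.17085) < 1e-5)
```

In the example, the only cross-matched samples are Testing_6 and Testing_7, which are also the two samples whose `d2gender` prediction disagrees with the annotated gender.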
results/method2
- `test_ModelA_results.csv`: Classification results of every sample of the first dataset, using the method of winning team 2.
- `test_ModelB_results.csv`: Classification results of every sample of the second dataset, using the method of winning team 2.
results/final
- `cosmo_final_result.tsv`: Table of corrected labels, generated using the integrated classification results of both method_1 and method_2. Each sample is assigned a number as a unique id. A row with a consistent id across all columns indicates that all the data belong to the same patient and there is no mislabeling.

  sample | Clinical | Data1 | Data2 |
  ---|---|---|---|
  Testing_1 | 1 | 1 | 1 |
  Testing_2 | 2 | 2 | 2 |
  Testing_3 | 3 | 3 | 3 |
  Testing_4 | 4 | 4 | 4 |
  Testing_5 | 5 | 5 | 5 |

  A row with different ids indicates a mislabeling error. Below is an example of a swapping error, in which samples Testing_6 and Testing_7 are swapped in Data2.

  sample | Clinical | Data1 | Data2 |
  ---|---|---|---|
  Testing_6 | 6 | 6 | 7 |
  Testing_7 | 7 | 7 | 6 |

  The same id occurring twice indicates a duplication error. Below is an example of a duplicate of sample Testing_8 in Data1.

  sample | Clinical | Data1 | Data2 |
  ---|---|---|---|
  Testing_8 | 8 | 8 | 8 |
  Testing_9 | 9 | 8 | 9 |

  Shifting errors are represented by a continuous run of shifted ids. The table below shows an example of a shifting error, in which samples Testing_10, Testing_11 and Testing_12 are shifted consecutively in Data2.

  sample | Clinical | Data1 | Data2 |
  ---|---|---|---|
  Testing_10 | 10 | 10 | 10 |
  Testing_11 | 11 | 11 | 10 |
  Testing_12 | 12 | 12 | 11 |
  Testing_13 | 13 | 13 | 12 |
  Testing_14 | 14 | 14 | 14 |
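The reading rules above can be applied mechanically to the final table. The following is a minimal sketch, assuming the rows have already been loaded as dictionaries with the four columns shown and that the `Clinical` id is the reference for each row. The helper names are hypothetical and this is not how COSMO itself classifies errors.

```python
from collections import Counter


def inconsistent_rows(rows):
    """Samples whose Clinical, Data1 and Data2 ids are not all equal."""
    return [
        r["sample"]
        for r in rows
        if len({r["Clinical"], r["Data1"], r["Data2"]}) > 1
    ]


def duplicated_ids(rows, column):
    """Ids that appear more than once in one data column."""
    counts = Counter(r[column] for r in rows)
    return sorted(i for i, n in counts.items() if n > 1)


def swapped_pairs(rows, column):
    """Pairs of samples that carry each other's id in the given column."""
    by_id = {r["Clinical"]: r for r in rows}
    pairs = []
    for r in rows:
        other = by_id.get(r[column])
        # Report each pair once, from the row with the smaller Clinical id.
        if (
            other is not None
            and r["Clinical"] < other["Clinical"]
            and other[column] == r["Clinical"]
        ):
            pairs.append((r["sample"], other["sample"]))
    return pairs


# Rows taken from the swapping and duplication examples above.
rows = [
    {"sample": "Testing_5", "Clinical": 5, "Data1": 5, "Data2": 5},
    {"sample": "Testing_6", "Clinical": 6, "Data1": 6, "Data2": 7},
    {"sample": "Testing_7", "Clinical": 7, "Data1": 7, "Data2": 6},
    {"sample": "Testing_8", "Clinical": 8, "Data1": 8, "Data2": 8},
    {"sample": "Testing_9", "Clinical": 9, "Data1": 8, "Data2": 9},
]
flagged = inconsistent_rows(rows)
dups_data1 = duplicated_ids(rows, "Data1")
swaps_data2 = swapped_pairs(rows, "Data2")
print(flagged, dups_data1, swaps_data2)
```

Shifting errors are not detected by this sketch; they would show up as a run of consecutive flagged rows in which each id in one column equals the `Clinical` id of the previous row.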
The datasets used in the publication of COSMO are available at cosmo_datasets.