Models for the prediction and prioritization of ASD-associated candidate lncRNAs. This documentation is part of the supplementary information release for our paper "Prediction and prioritization of autism-associated long non-coding RNAs using gene expression and sequence features" (J. Wang and L. Wang, 2020).
- Python 3
- numpy
- pandas
- Bio
- regex
- sklearn
- pickle
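Before running the models, it may help to confirm that the modules listed above can be imported. The sketch below is not part of the repository; the helper name `missing_modules` is hypothetical. Note that `pickle` ships with Python itself, `Bio` is the import name of the Biopython package, and `sklearn` is the import name of scikit-learn.

```python
from importlib.util import find_spec

# Import names exactly as listed above; pickle is part of the standard library.
REQUIRED_MODULES = ["numpy", "pandas", "Bio", "regex", "sklearn", "pickle"]


def missing_modules(names=REQUIRED_MODULES):
    """Return the import names that cannot be found in this environment."""
    return [name for name in names if find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_modules()
    if missing:
        print("Missing modules:", ", ".join(missing))
    else:
        print("All listed modules are importable.")
```

Any name reported as missing would need to be installed before `Model_prediction.py` is run.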
Step 1, download the LncAutism repository into a directory named LncAutism;
Step 2, put your input .fa file into the LncAutism directory;
Step 3, open a terminal;
Step 4, go to the LncAutism directory and run:
python Model_prediction.py -i <your input .fa filename> -o <your output .csv filename>
For example,
python Model_prediction.py -i example_lncRNAs.fa -o example_result.csv
# Once finished, the example_result.csv file will be available in the current working directory
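If several .fa files need to be scored, the command above can be wrapped in a small driver. This is only a sketch: `build_command` and `run_batch` are hypothetical helpers, the output file name is an assumption (the input name with a `.csv` suffix), and the input files are assumed to already sit inside the LncAutism directory. Only the `-i` and `-o` options shown above are used.

```python
import subprocess
import sys
from pathlib import Path


def build_command(input_fa, output_csv):
    """Build the argument list for one Model_prediction.py run."""
    return [sys.executable, "Model_prediction.py",
            "-i", str(input_fa), "-o", str(output_csv)]


def run_batch(fasta_names, repo_dir="LncAutism"):
    """Run the prediction script once per .fa file located in repo_dir."""
    for name in fasta_names:
        # Assumption: name the output after the input, e.g. a.fa -> a.csv
        output_csv = Path(name).with_suffix(".csv").name
        subprocess.run(build_command(name, output_csv), cwd=repo_dir, check=True)


# Example (not executed here):
# run_batch(["example_lncRNAs.fa"])
```

With `check=True`, a failing run raises `subprocess.CalledProcessError` instead of silently continuing to the next file.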
A .fa file storing the ENSG ID and transcript sequence of each candidate lncRNA, for example:
>ENSG00000082929
CTGTTTCAACCATATCCTTTCAAACCAGATCAGTGAGGTCATGACCAGAAAACAAGCCCTGCCAGCCTCCTACCTCAAATCTAATTAATTATAATTTTCTTCCTTATGACAACCCACACAAAAGACAGAGATAAGAAAAACAAGGACTTCCTGGGAGGCTGTGGATCAATTACCAATGGACACCCAGAAGCAAATTCACAAGACTCACAATTCAAAGAACCAATTTTTTACAATTTTTTTTTTCCTGTCAGTTGAATTTGGGAAGGAAGGAACACGCAAAAATTTTTACCTTCTTCTTTCAATTGGACACTATGGACGGAAATCCAGGAGAGCTGACCTTGGAACTGCAGACACTGCAGATAAAACAGAGCCAGAATGCTTTGCTGCCAGCTGGACCTTTGACCCAAACCCCAGTGTGACTGTCTCCGGTGCTCACTCAACTGCAGTGCATCAATGAAGGAAAAAAGAACTGAGCATTGGCAAAAAGCTGAGGATGACAAGCTTAGGGGATGAAAGGCTGCCTTTTCCCCCCCTTCTGAGCGTTTCTGACAGCTCCCAGCTGGGAGAAACCAACATGACGAAAAGACAAAGAATACTGGAGAAGAGAGAGGGTGGGCAGAGCCACAACCTCATCCTCCCAGTGGTTCCTCTTGAGTTTGATTTGACAAGATGCCTTCCCACCCAGGTAACCCCGTGGGAACGTGCCAGTACCTCACCGCACGACCTCACTGAGTCCTCACAACAAATCCAGGCTGCAGATTTTTTTCCCCACTTGGAGATGACAAATTGAAATGCAGGAAGGTTAAAGAGTTTGCCTGAGGTTGCTTAGATAATAAAAGAATCTGGATTAGAACCCAGACCTACCCAGCTAAA
>ENSG00000083622
TGAAAACTTCCTGAGGCCTCCTCAGAAGCAGATGCTGCTATGCTTCCCGTACAGCCTGAAGAACCAAACATTTCTATACATTTATGAGACCCAACTCCAAAAGCATACTGGAATGGAATTAAAGACCGAACTACAGAAGCTAAAGAGAGCTGATCAAACTAAAAAAAAAAGTTAAAGGTGAGGAGGCCAAGGCTCAGAAGAGTCACTTGCCCATGATCACTGCTCTTATAGCGCCAACTCAGAGCTCAGGCAATGCTCATGTTCTTTCCACTGTGCCTCATCGCTCATGTCTGACTCTTCTAGAAATGTGGGCAAATCCCCTGCCTTCTGTGGGTCTCAGATTTCCATAAAAAATAAAATCAATGGATCAACTTAA
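The input format above can be checked before running the models. The following is a minimal sketch using only the standard library, not the parser used by the repository. It assumes that each header is a plain ENSG ID (`ENSG` followed by 11 digits, no version suffix) and that sequences contain only A, C, G, T or N; both are assumptions based on the example, and the function names are hypothetical.

```python
import re

# Assumption: headers look like the example above, e.g. ENSG00000082929
ENSG_PATTERN = re.compile(r"^ENSG\d{11}$")


def read_fasta(lines):
    """Parse FASTA lines into a dict mapping header ID to sequence."""
    records = {}
    current_id = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if line.startswith(">"):
            parts = line[1:].split()
            current_id = parts[0] if parts else ""
            records[current_id] = ""
        elif current_id is not None:
            # Sequences may be wrapped over several lines; join them.
            records[current_id] += line.upper()
    return records


def check_records(records):
    """Return a list of human-readable problems found in the records."""
    problems = []
    for gene_id, seq in records.items():
        if not ENSG_PATTERN.match(gene_id):
            problems.append(f"{gene_id}: header is not an ENSG ID")
        if not seq:
            problems.append(f"{gene_id}: empty sequence")
        elif set(seq) - set("ACGTN"):
            problems.append(f"{gene_id}: unexpected characters in sequence")
    return problems


# Usage with a file on disk:
# with open("example_lncRNAs.fa") as handle:
#     print(check_records(read_fasta(handle)))
```

An empty list from `check_records` means that no problems were detected under these assumptions.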
A .csv file storing the output of each model, i.e. the predicted probability that each candidate lncRNA is associated with ASD, for example:
Gene_ID | LR | SVM | RF |
---|---|---|---|
ENSG00000082929 | 0.7471047902 | 0.4217984245 | 0.3652055055 |
ENSG00000083622 | 0.8809675003 | 0.2803936188 | 0.3649550888 |
ENSG00000093100 | 0.9180220311 | 0.0201953225 | 0.3539223047 |
ENSG00000099869 | 0.7679051837 | 0.0879626703 | 0.3205380732 |
ENSG00000103472 | 0.929418295 | 0.6192257137 | 0.3921816345 |
Note: each probability is the average of the values from the models trained using tenfold cross-validation.
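One simple way to prioritize candidates from the output file is to sort them by a combined score. The sketch below ranks genes by the unweighted mean of the three model probabilities. That ranking rule is an assumption made for illustration, not the prioritization method of the paper, and the code assumes a comma-separated file with the header `Gene_ID,LR,SVM,RF`. The function name `rank_candidates` is hypothetical.

```python
import csv


def rank_candidates(handle, model_columns=("LR", "SVM", "RF")):
    """Return (gene_id, mean_probability) pairs, highest mean first."""
    ranked = []
    for row in csv.DictReader(handle):
        scores = [float(row[col]) for col in model_columns]
        ranked.append((row["Gene_ID"], sum(scores) / len(scores)))
    # Assumption: an unweighted mean across models is used as the ranking score.
    return sorted(ranked, key=lambda item: item[1], reverse=True)


# Usage with the output file from the example run:
# with open("example_result.csv", newline="") as handle:
#     for gene_id, mean_prob in rank_candidates(handle):
#         print(gene_id, round(mean_prob, 4))
```

For the five example rows above, this rule would place ENSG00000103472 first, since it has the highest LR, SVM and RF values in the table.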