LncAutism

Models for predicting and prioritizing candidate lncRNAs associated with autism spectrum disorder (ASD). This documentation is part of the supplementary information release for our paper "Prediction and prioritization of autism-associated long non-coding RNAs using gene expression and sequence features" (J. Wang and L. Wang, 2020).

Requirements

  • Python 3
  • numpy
  • pandas
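  • Bio (provided by the Biopython package)
  • regex
  • sklearn (provided by the scikit-learn package)
  • pickle (part of the Python standard library)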

Usage

Step 1, download the LncAutism repository into a directory named LncAutism;

Step 2, put your input .fa file into the LncAutism directory;

Step 3, open a terminal;

Step 4, go to the LncAutism directory and run:

python Model_prediction.py -i <your input .fa filename> -o <your output .csv filename>

For example,

python Model_prediction.py -i example_lncRNAs.fa -o example_result.csv

# Once finished, example_result.csv will be available in the current working directory
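
When several input files need to be scored, the command above can be repeated from a small driver script. The following is a minimal sketch, not part of the repository: it uses only the -i and -o options shown above, and the helper names (build_command, output_name, run_batch) and the "_result.csv" naming rule are assumptions made for illustration.

```python
import subprocess
import sys
from pathlib import Path


def build_command(input_fa, output_csv):
    """Return the documented command line as an argument list."""
    # Only the -i and -o options from the usage section are used here.
    return [sys.executable, "Model_prediction.py",
            "-i", str(input_fa), "-o", str(output_csv)]


def output_name(input_fa):
    """Hypothetical naming rule: foo.fa -> foo_result.csv."""
    return Path(input_fa).stem + "_result.csv"


def run_batch(repo_dir, fasta_names):
    """Run the script once per input file from inside the repository directory."""
    for name in fasta_names:
        cmd = build_command(name, output_name(name))
        # check=True raises CalledProcessError if the script exits with an error.
        subprocess.run(cmd, cwd=repo_dir, check=True)
```

For example, run_batch("LncAutism", ["example_lncRNAs.fa"]) would run the script once and ask it to write example_lncRNAs_result.csv, assuming the input file has already been placed in the LncAutism directory.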

Input file

A .fa file storing the ENSG ID and transcript sequence of each candidate lncRNA, for example:

>ENSG00000082929
CTGTTTCAACCATATCCTTTCAAACCAGATCAGTGAGGTCATGACCAGAAAACAAGCCCTGCCAGCCTCCTACCTCAAATCTAATTAATTATAATTTTCTTCCTTATGACAACCCACACAAAAGACAGAGATAAGAAAAACAAGGACTTCCTGGGAGGCTGTGGATCAATTACCAATGGACACCCAGAAGCAAATTCACAAGACTCACAATTCAAAGAACCAATTTTTTACAATTTTTTTTTTCCTGTCAGTTGAATTTGGGAAGGAAGGAACACGCAAAAATTTTTACCTTCTTCTTTCAATTGGACACTATGGACGGAAATCCAGGAGAGCTGACCTTGGAACTGCAGACACTGCAGATAAAACAGAGCCAGAATGCTTTGCTGCCAGCTGGACCTTTGACCCAAACCCCAGTGTGACTGTCTCCGGTGCTCACTCAACTGCAGTGCATCAATGAAGGAAAAAAGAACTGAGCATTGGCAAAAAGCTGAGGATGACAAGCTTAGGGGATGAAAGGCTGCCTTTTCCCCCCCTTCTGAGCGTTTCTGACAGCTCCCAGCTGGGAGAAACCAACATGACGAAAAGACAAAGAATACTGGAGAAGAGAGAGGGTGGGCAGAGCCACAACCTCATCCTCCCAGTGGTTCCTCTTGAGTTTGATTTGACAAGATGCCTTCCCACCCAGGTAACCCCGTGGGAACGTGCCAGTACCTCACCGCACGACCTCACTGAGTCCTCACAACAAATCCAGGCTGCAGATTTTTTTCCCCACTTGGAGATGACAAATTGAAATGCAGGAAGGTTAAAGAGTTTGCCTGAGGTTGCTTAGATAATAAAAGAATCTGGATTAGAACCCAGACCTACCCAGCTAAA
>ENSG00000083622
TGAAAACTTCCTGAGGCCTCCTCAGAAGCAGATGCTGCTATGCTTCCCGTACAGCCTGAAGAACCAAACATTTCTATACATTTATGAGACCCAACTCCAAAAGCATACTGGAATGGAATTAAAGACCGAACTACAGAAGCTAAAGAGAGCTGATCAAACTAAAAAAAAAAGTTAAAGGTGAGGAGGCCAAGGCTCAGAAGAGTCACTTGCCCATGATCACTGCTCTTATAGCGCCAACTCAGAGCTCAGGCAATGCTCATGTTCTTTCCACTGTGCCTCATCGCTCATGTCTGACTCTTCTAGAAATGTGGGCAAATCCCCTGCCTTCTGTGGGTCTCAGATTTCCATAAAAAATAAAATCAATGGATCAACTTAA
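
Before running the models, it can help to check that an input file follows the layout shown above. The sketch below is an illustration only and is not taken from Model_prediction.py. It assumes that IDs have the form ENSG followed by 11 digits, as in the examples, and that sequences contain only A, C, G, T, or N; the README does not state which characters the script actually accepts. The function names are hypothetical.

```python
import re

# Assumption: IDs look like ENSG00000082929 (no version suffix).
ENSG_PATTERN = re.compile(r"^ENSG\d{11}$")
# Assumption: only these nucleotide letters are expected.
ALLOWED_BASES = set("ACGTN")


def read_fasta(lines):
    """Parse FASTA text lines into a list of (record_id, sequence) pairs."""
    records = []
    current_id = None
    chunks = []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if current_id is not None:
                records.append((current_id, "".join(chunks)))
            header = line[1:].split()
            current_id = header[0] if header else ""
            chunks = []
        elif current_id is not None:
            # Sequences may be wrapped over several lines.
            chunks.append(line)
        else:
            raise ValueError("sequence line found before any '>' header")
    if current_id is not None:
        records.append((current_id, "".join(chunks)))
    return records


def find_problems(records):
    """Return human-readable messages for records that break the assumed layout."""
    problems = []
    for record_id, sequence in records:
        if not ENSG_PATTERN.match(record_id):
            problems.append(f"{record_id!r}: ID is not of the form ENSG + 11 digits")
        if not sequence:
            problems.append(f"{record_id!r}: empty sequence")
        elif set(sequence.upper()) - ALLOWED_BASES:
            problems.append(f"{record_id!r}: unexpected characters in sequence")
    return problems
```

A file can be checked with `with open("example_lncRNAs.fa") as handle: print(find_problems(read_fasta(handle)))`; an empty list means no problems were found under these assumptions.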

Output file

A .csv file storing the output of each model, which is the predicted probability that each candidate lncRNA is associated with ASD, for example:

Gene_ID LR SVM RF
ENSG00000082929 0.7471047902 0.4217984245 0.3652055055
ENSG00000083622 0.8809675003 0.2803936188 0.3649550888
ENSG00000093100 0.9180220311 0.0201953225 0.3539223047
ENSG00000099869 0.7679051837 0.0879626703 0.3205380732
ENSG00000103472 0.929418295 0.6192257137 0.3921816345

Note: each probability is the average over the models trained during tenfold cross-validation.
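
The output file can be loaded for downstream sorting or filtering. The sketch below is one possible way to do that, under stated assumptions: the file is comma-separated with the header columns Gene_ID, LR, SVM, and RF, and candidates are ordered by the mean of the three model probabilities. That ordering is an illustrative choice, not the prioritization procedure described in the paper, and the function name is hypothetical.

```python
import csv
import io

# Assumption: these are the model columns written to the output file.
MODEL_COLUMNS = ("LR", "SVM", "RF")


def rank_by_mean(csv_text):
    """Return (gene_id, mean_probability) pairs, highest mean first."""
    reader = csv.DictReader(io.StringIO(csv_text))
    ranked = []
    for row in reader:
        scores = [float(row[name]) for name in MODEL_COLUMNS]
        ranked.append((row["Gene_ID"], sum(scores) / len(scores)))
    # Illustrative ordering only: sort by the mean across the three models.
    ranked.sort(key=lambda item: item[1], reverse=True)
    return ranked
```

To use it on a real result, read the file first, for example `with open("example_result.csv", newline="") as handle: ranking = rank_by_mean(handle.read())`.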