A collection of utility scripts, data analysis pipeline, and documentation for RNA-Seq with non-model organisms.
Software used as part of this repo:
- Trinity (trinityRNAseq) / bowtie / RSEM
- TRIMMOMATIC
- Bowtie2
- EnTAP
- R (Bioconductor packages)
TODO Add scripts for the RNA QC pipeline and a README.
- Quality trim reads DONE
- Strip PolyA tails DONE
- Assemble
Make small test data set
TODO Add scripts for using EnTAP to annotate reads, USEARCH to cluster proteomes, and scripts to select nucleotide sequences representative of the clustered proteome (the refined assembly).
Files include an R Notebook file outlining the steps used to create the background for GoSeq from EnTAP annotations, create the named vectors that GoSeq will use, and run the enrichment analysis (GoSeq_Walkthrough.RMD). Also included are data files needed to run the script, and the output of running the script with those data files. The file GoTermsMap.py is used to create the one-to-one gene id to GO term map from EnTAP annotations.
Includes an R Notebook file (should be opened in RStudio) explaining the principles and formula for scaling transcripts per million from one species' transcriptome to another's in order to make the gene expression of the two species more fairly comparable. Also includes TODO data files to use to test the code and examine the results of scaling.
This procedure uses scaled TPM of two species for shared single-copy orthologues to find the minimum set of genes that are sufficient to distinguish one species from the other. Files include an R Notebook file with the steps involved and code to examine the results of removing biased genes, a standalone R script meant to be run on a highly parallel compute cluster, and a small set of test data.