Pipeline and scripts used for raw microarray and RNAseq data analysis in "Tânia Barata, Vítor Vieira, Rúben Rodrigues, Ricardo Pires das Neves, Miguel Rocha, Reconstruction of tissue-specific genome-scale metabolic models for human cancer stem cells, Computers in Biology and Medicine, Volume 142, 2022, 105177, ISSN 0010-4825, https://doi.org/10.1016/j.compbiomed.2021.105177"
The RNAseq pipeline is written in Bash and uses Docker containers. Requirements: a Linux system and Podman. Ensembl annotation and Ensembl genome reference files are recommended.
- Example annotation file: Homo_sapiens.GRCh38.99.gtf.gz
- Example genome reference fasta: Homo_sapiens.GRCh38.dna.primary_assembly.fa.gz
- Fill in Studies_RNAseq.txt, a tab-delimited file in the data directory, with the information for your studies. Columns:
- Study is the study identifier
- SampleId is the sample identifier
- Reads has '1' or '2' to distinguish between forward and reverse reads; single-end studies have 'Unpaired'
- Link is the link to the fastq.gz file.
- All other columns should be filled with 'NA' when there are no values.
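As a quick sanity check before downloading, the table can be validated with a few lines of awk. The sketch below is not part of the pipeline: the function name `check_studies` is hypothetical, and it assumes the first line of the file is a header that contains the column name Reads. Columns are looked up by name, so their order does not matter.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of the pipeline): validates the Reads column
# of a tab-delimited studies table.
# Assumption: the first line is a header that contains the column name "Reads".
check_studies() {
  awk -F'\t' '
    NR == 1 {
      for (i = 1; i <= NF; i++) col[$i] = i      # map column name to index
      if (!("Reads" in col)) {
        print "header has no Reads column"; bad = 2; exit
      }
      next
    }
    {
      r = $(col["Reads"])
      if (r != "1" && r != "2" && r != "Unpaired") {
        printf "line %d: unexpected Reads value \"%s\"\n", NR, r
        bad = 1
      }
    }
    END { exit (bad + 0) }
  ' "$1"
}
```

It could be called as `check_studies data/Studies_RNAseq.txt` from the base folder; it prints one message per offending line and returns a non-zero status if any is found.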
- Move to the RNAseq scripts folder:
cd scr/bash
- Edit base folder path and URLs of genome and annotation files in scr/bash/Edit
- Download the genome reference and annotation files with:
./Dirs.sh
- Download files of a study with:
./DownloadFiles.sh <Study>
- Check whether the downloads have finished:
ps -e | grep <jobId>
To get the job IDs of the downloads:
cd data/<Study>/rawData
and do
cat PIDs
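The `ps` check above can be repeated for every recorded download in one go. The following is a minimal sketch, not part of the pipeline: `downloads_running` is a hypothetical name, and it assumes the PIDs file holds one process ID per line, which is not stated here.

```shell
#!/usr/bin/env bash
# Hypothetical helper: prints the PIDs from a file that are still running.
# Returns 0 if at least one is alive, 1 if none are.
# Assumption: the file holds one numeric process ID per line.
downloads_running() {
  local pid_file=$1 pid alive=1
  # The "|| [ -n ... ]" part also handles a last line without a newline.
  while read -r pid || [ -n "$pid" ]; do
    [ -n "$pid" ] || continue
    # kill -0 sends no signal; it only tests whether the process exists
    # and can be signalled by the current user.
    if kill -0 "$pid" 2>/dev/null; then
      echo "still running: $pid"
      alive=0
    fi
  done < "$pid_file"
  return "$alive"
}
```

Run from data/<Study>/rawData, `downloads_running PIDs` lists the downloads that are still active; when it prints nothing, all of them have ended.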
- After all downloads finish, evaluate raw read quality with:
./GetQCfiles.sh <Study>
- After this, manually check the FastQC results and decide which contaminants or overrepresented sequences should be removed from each sample. Add them to the file Seq2RemoveFile in folder data//trimmedData so that Trimmomatic removes those sequences. If no file is provided, Trimmomatic runs without excluding those sequences. Example of Seq2RemoveFile content:
seqname ACTTTTTTTTTTTTTTTTTTT
- To set specific Trimmomatic parameters for individual samples, for example when the reads of a study need to be trimmed, add a file named TrimParams to directory data/trimmedData. Otherwise, the default parameters are used.
- To run the rest of the analysis:
./RNAseqAnalysis.sh <Study>
Results are in folders inside directory data/
The microarray pipeline runs on Windows with R.
- The file with the studies info is Studies_Microarrays.xlsx
- Run the script scr/R/MicroarrayNormalize.R
- Paths are hardcoded at the beginning of the script