aditipate/CompBioMiniProject

Computational Biology Mini Project

Project Background

Human herpesvirus 5 is also known as Human cytomegalovirus and is typically abbreviated as HCMV. From Wikipedia: Although they may be found throughout the body, HCMV infections are frequently associated with the salivary glands. HCMV infection is typically unnoticed in healthy people, but can be life-threatening for the immunocompromised, such as HIV-infected persons, organ transplant recipients, or newborn infants. Congenital cytomegalovirus infection can lead to significant morbidity and even death. After infection, HCMV remains latent within the body throughout life and can be reactivated at any time. Eventually, it may cause mucoepidermoid carcinoma and possibly other malignancies such as prostate cancer.

The Comp Bio Mini Project aims to compare the HCMV transcriptomes from 2 patient donors at 2 and 6 days post-infection, and to assess the genetic diversity of the virus by comparing the patient samples to other publicly available strains in order to find the most similar ones.

A Python wrapper was developed to automate the execution of various Bioinformatics software tools.

Software Requirements:

  • Linux/Unix
  • Python3
    • os
    • Pandas
    • Biopython
      • Entrez
      • SeqIO
  • Kallisto
  • Bowtie2
  • SPAdes
  • BLAST+

Run The Wrapper:

Clone Repository:

git clone https://github.com/aditipate/CompBioMiniProject

Move Into Project Directory:

cd CompBioMiniProject/

Run Wrapper:

python3 compbio_wrapper.py

Folders and Scripts:

testdata: folder containing files for the first 10000 lines of each SRR paired-end read

compbio_wrapper.py: python wrapper, calls other python scripts, creates various output files, writes significant output results to miniProject.log

getTestData.py: retrieves the transcriptomes of the two patient donors from SRA and converts them to paired-end fastq files
Note: getTestData.py is not called by the python wrapper because the full-length transcriptomes take a long time to process. Instead, the repository provides a testdata folder of shortened paired-end fastq files, which the wrapper uses to run the pipeline. To run with the full input reads, uncomment #getData.getTranscriptome(SRRs) in getData.py and move or empty the testdata folder.
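
The shortened files in testdata hold the first 10000 lines of each fastq file, which is 2500 reads at 4 lines per record. A minimal sketch of how such a file could be produced is shown below; the function name write_head and its arguments are hypothetical and are not taken from the repository's scripts.

```python
from itertools import islice


def write_head(src, dest, n_lines=10000):
    """Copy the first n_lines lines of src to dest; return the number of lines written.

    Hypothetical helper, not part of the repository.
    """
    if n_lines % 4 != 0:
        # A FASTQ record is 4 lines; a non-multiple would cut a record in half.
        raise ValueError("n_lines should be a multiple of 4")
    written = 0
    with open(src) as fin, open(dest, "w") as fout:
        for line in islice(fin, n_lines):
            fout.write(line)
            written += 1
    return written
```

If the source file is shorter than n_lines, the whole file is copied and the smaller line count is returned.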

kallisto.py: builds a transcriptome index for HCMV (NCBI accession EF999921), quantifies TPM in each sample using Kallisto, creates kallisto output table
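
As a rough illustration of the two Kallisto steps, the sketch below builds the command lines for kallisto index and kallisto quant. The file names, the bootstrap count, and the thread count are assumptions chosen for the example; the actual values used by kallisto.py are not stated in this README.

```python
import subprocess


def kallisto_index_cmd(cds_fasta, index_path):
    """Command that builds a Kallisto index from a FASTA file of transcripts."""
    return ["kallisto", "index", "-i", index_path, cds_fasta]


def kallisto_quant_cmd(index_path, out_dir, reads_1, reads_2,
                       bootstraps=30, threads=2):
    """Command that quantifies one paired-end sample against the index.

    bootstraps and threads are illustrative defaults, not the repository's.
    """
    return ["kallisto", "quant", "-i", index_path, "-o", out_dir,
            "-b", str(bootstraps), "-t", str(threads), reads_1, reads_2]


def run(cmd):
    """Run a command and raise CalledProcessError if the tool fails."""
    subprocess.run(cmd, check=True)
```

For example, run(kallisto_index_cmd("EF999921_CDS.fasta", "HCMV_index.idx")) would build the index, assuming Kallisto is on the PATH and that a CDS FASTA with that hypothetical name exists.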

sleuth.R: reads in kallisto output table to create a sleuth object, performs the likelihood ratio test for differential expression between conditions

sleuth.py: writes significant sleuth output to miniProject.log
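
The filtering step can be sketched as follows. This assumes the sleuth results were written as a tab-delimited table with a header that includes the columns target_id, test_stat, pval, and qval; the real file layout produced by sleuth.R may differ, and the function name is hypothetical.

```python
import csv


def significant_rows(handle, fdr=0.05, delimiter="\t"):
    """Return (target_id, test_stat, pval, qval) for rows with qval < fdr.

    handle is any open text file or file-like object. Column names and the
    delimiter are assumptions about the table layout.
    """
    keep = []
    for row in csv.DictReader(handle, delimiter=delimiter):
        try:
            qval = float(row["qval"])
        except (KeyError, TypeError, ValueError):
            continue  # skip rows with a missing or non-numeric qval, e.g. "NA"
        if qval < fdr:
            keep.append((row["target_id"], row["test_stat"],
                         row["pval"], row["qval"]))
    return keep
```

Values are returned as the original strings so they can be written to the log without reformatting.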

bowtie2.py: creates an index for HCMV (NCBI accession EF999921), saves the reads that map to the HCMV index for use in assembly
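
A sketch of the two Bowtie2 calls is given below. It assumes the mapped read pairs are saved with Bowtie2's --al-conc option, which writes pairs that aligned concordantly; whether bowtie2.py uses this option is not stated here, and all file names are hypothetical.

```python
def bowtie2_build_cmd(reference_fasta, index_prefix):
    """Command that builds a Bowtie2 index from the reference genome FASTA."""
    return ["bowtie2-build", reference_fasta, index_prefix]


def bowtie2_map_cmd(index_prefix, reads_1, reads_2, sam_out, mapped_pattern):
    """Command that maps one paired-end sample and keeps the pairs that align.

    mapped_pattern should contain '%', which Bowtie2 replaces with 1 and 2
    to name the two output fastq files (an assumption about how the mapped
    reads are saved, not the repository's confirmed method).
    """
    return ["bowtie2", "--quiet", "-x", index_prefix,
            "-1", reads_1, "-2", reads_2,
            "-S", sam_out, "--al-conc", mapped_pattern]
```

Each returned list can be passed to subprocess.run(cmd, check=True) once Bowtie2 is installed.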

spades.py: uses the reads retained from the Bowtie2 mapping to produce a single assembly via SPAdes
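
Producing one assembly from several samples can be done by passing each sample to SPAdes as its own paired-end library. The sketch below builds such a command; the --only-assembler flag, the thread count, and the file names are assumptions for illustration and may not match what the repository's spades.py does.

```python
def spades_cmd(read_pairs, out_dir, threads=2):
    """Build one SPAdes command that assembles all paired-end libraries together.

    read_pairs is a list of (forward_fastq, reverse_fastq) tuples, one per
    sample. SPAdes numbers paired-end libraries as --pe1-1/--pe1-2,
    --pe2-1/--pe2-2, and so on.
    """
    cmd = ["spades.py", "--only-assembler", "-t", str(threads)]
    for i, (forward, reverse) in enumerate(read_pairs, start=1):
        cmd += [f"--pe{i}-1", forward, f"--pe{i}-2", reverse]
    cmd += ["-o", out_dir]
    return cmd
```

Note that the SPAdes executable is itself named spades.py, so a wrapper script with the same name needs to be kept in a location that does not shadow it on the PATH.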

contigs.py: from the SPAdes assembly, calculates the number of contigs longer than 1000 bp, calculates the length of the assembly, and finds the longest contig
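
The three contig statistics can be sketched in plain Python as below. Two assumptions are made here: the assembly length is taken as the total length of the contigs above the threshold, and a small hand-written FASTA reader stands in for Biopython's SeqIO so the example has no dependencies. The function names are hypothetical.

```python
def read_fasta(lines):
    """Yield (name, sequence) pairs from an iterable of FASTA lines."""
    name, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            name, chunks = line[1:], []
        elif name is not None:
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)


def contig_stats(records, min_len=1000):
    """Return (count, total_length, longest) for contigs longer than min_len.

    longest is a (name, sequence) tuple, or None if no contig passes the filter.
    """
    long_contigs = [(n, s) for n, s in records if len(s) > min_len]
    total = sum(len(s) for _, s in long_contigs)
    longest = max(long_contigs, key=lambda rec: len(rec[1]), default=None)
    return len(long_contigs), total, longest
```

A typical call would be contig_stats(read_fasta(open("contigs.fasta"))), where contigs.fasta is the file SPAdes writes to its output folder.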

blast.py: uses the longest contig as BLAST+ input to query the nr nucleotide database, limited to members of the Betaherpesvirinae subfamily
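
One way to restrict the search to a subfamily is to build a local nucleotide database from its sequences and run blastn against it. The sketch below assumes that approach; the database name, file names, and output columns are illustrative choices and are not confirmed by this README.

```python
def makeblastdb_cmd(subfamily_fasta, db_name):
    """Command that builds a local nucleotide BLAST database from a FASTA file."""
    return ["makeblastdb", "-in", subfamily_fasta, "-out", db_name,
            "-title", db_name, "-dbtype", "nucl"]


def blastn_cmd(query_fasta, db_name, out_csv, max_targets=10):
    """Command that queries the local database and writes comma-separated hits.

    The column list after '10' (CSV output) is an assumed selection.
    """
    columns = "sacc pident length qstart qend sstart send bitscore evalue stitle"
    return ["blastn", "-query", query_fasta, "-db", db_name,
            "-out", out_csv, "-outfmt", f"10 {columns}",
            "-max_target_seqs", str(max_targets)]
```

Passing the format string as a single list element keeps it as one argument, so no shell quoting is needed when the list is given to subprocess.run.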

Output:

miniProject.log: contains the significant results from running the pipeline, including the number of CDS in the HCMV genome, the significant (FDR < 0.05) sleuth results, the number of reads in each transcriptome before and after the Bowtie2 mapping, the number of contigs with a length > 1000, the length of the assembly, and the top 10 BLAST hits
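
The read counts before and after mapping can be obtained by counting fastq records, since each read occupies 4 lines. A minimal sketch, assuming uncompressed fastq files and using a hypothetical function name:

```python
def count_reads(fastq_path):
    """Return the number of reads in an uncompressed FASTQ file (4 lines per read)."""
    with open(fastq_path) as handle:
        n_lines = sum(1 for _ in handle)
    return n_lines // 4
```

For paired-end data, counting one of the two files gives the number of read pairs.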

miniProject_Aditi_Patel: contains any significant files generated from running the pipeline
