This repository has been archived by the owner on Mar 13, 2020. It is now read-only.
Roberto Rossini edited this page May 29, 2019 · 7 revisions

Welcome to the lferriphilum wiki

This repository is a journal that tracks the analysis steps Roberto Rossini followed to reproduce the paper "Multi-omics Reveals the Lifestyle of the Acidophilic, Mineral-Oxidizing Model Species Leptospirillum ferriphilumT" by Stephan Christel et al. (2018), doi:10.1128/AEM.02091-17, as part of the Genome Analysis course at Uppsala University (Bioinformatics Programme 2018/2019).

Background

Leptospirillum ferriphilum is a gram-negative prokaryote that plays an important role in acidic, metal-rich environments, where it is one of the main organisms responsible for iron oxidation. Until about two years ago, no complete genome was available for this organism. In 2017, Stephan Christel et al. sequenced the genome of Leptospirillum ferriphilum using PacBio SMRT long-read sequencing in an attempt to produce a high-quality genome assembly that other studies could use as a reference. Transcript and protein levels were also measured to explore how the metabolism of Leptospirillum ferriphilum differs between two growth conditions: continuous culture with ferrous iron and bioleaching culture with chalcopyrite (CuFeS2).

Research questions and objectives

  • Produce a high-quality genome assembly that can be used as a reference in other studies
  • Annotate the genome to study genes involved in environmental adaptation and stress response
  • Study how different environmental conditions affect gene expression by comparing RNA-Seq expression levels with mass spectrometry protein-level data

Data

The data used in this analysis consist of:

  • Raw DNA read data (PacBio SMRT)
  • Raw RNA read data (HiSeq2500)

For more information about the data, head over to the data section of the wiki.

Analysis outline

The analysis workflow can be summarized in four steps:

  • Genome Assembly
  • Genome Annotation
  • Transcriptome Assembly
  • Differential expression analysis

For more details, have a look at the analysis section of the wiki.

Software

For a complete list of software and settings, refer to the software section of the wiki.

Compute-intensive tasks were run on the Rackham computing cluster (UPPMAX). Data visualization and simple computing tasks were carried out on a consumer laptop running Manjaro-KDE 18.0.4 Illyria (Linux kernel 5.1.4).

Genome Assembly and Annotation

  • FastQC: raw reads quality check
  • kraken2: identification and removal of contaminant reads
  • Canu: read pre-processing (read quality check and trimming) and genome assembly
  • QUAST, Gepard: evaluation of assembly quality
  • Prokka, eggNOG: genome annotation

Transcriptome assembly

  • FastQC: raw reads quality check
  • kraken2: identification and removal of contaminant reads
  • BBDuk: read trimming and adapter removal
  • HISAT2: read mapping
  • samtools: SAM to BAM conversion, BAM sorting and merging
  • Trinity: de novo transcriptome assembly
  • TransDecoder: identification of candidate coding regions
  • HMMER: searching protein databases using profile HMMs
  • Diamond: faster alternative to BLAST, used to query protein databases
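As an illustration of the samtools step listed above (SAM to BAM conversion, sorting and merging), the following minimal sketch builds the corresponding command lines in Python without running them. The file names and helper functions are hypothetical and chosen only for this example; the settings actually used in the analysis are documented in the software section of the wiki.

```python
from typing import List


def sam_to_bam(sam: str, bam: str) -> List[str]:
    # `samtools view -b` emits BAM; `-o` names the output file.
    return ["samtools", "view", "-b", "-o", bam, sam]


def sort_bam(bam: str, sorted_bam: str) -> List[str]:
    # `samtools sort` sorts by coordinate by default.
    return ["samtools", "sort", "-o", sorted_bam, bam]


def merge_bams(merged_bam: str, sorted_bams: List[str]) -> List[str]:
    # `samtools merge` takes the output file first, then the inputs.
    return ["samtools", "merge", merged_bam, *sorted_bams]


def build_pipeline(sam_files: List[str], merged_bam: str) -> List[List[str]]:
    """Return the convert, sort and merge commands for a set of SAM files."""
    commands: List[List[str]] = []
    sorted_bams: List[str] = []
    for sam in sam_files:
        stem = sam[:-4] if sam.endswith(".sam") else sam
        bam = stem + ".bam"
        sorted_bam = stem + ".sorted.bam"
        commands.append(sam_to_bam(sam, bam))
        commands.append(sort_bam(bam, sorted_bam))
        sorted_bams.append(sorted_bam)
    commands.append(merge_bams(merged_bam, sorted_bams))
    return commands


if __name__ == "__main__":
    # Hypothetical input files, one per sequenced sample.
    for cmd in build_pipeline(["sample1.sam", "sample2.sam"], "merged.bam"):
        print(" ".join(cmd))
```

Each command is returned as an argument list, so it could be passed to `subprocess.run` on a machine where samtools is installed.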

RNA-Seq Differential Expression Analysis

  • Salmon: Transcript quantification
  • edgeR: Differential expression analysis
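To give a rough idea of what a differential expression comparison measures, here is a toy sketch that normalizes read counts to counts per million (CPM) and computes a log2 fold change between two conditions. It is a deliberate simplification and not edgeR's method: edgeR fits negative binomial models and performs statistical testing across replicates. The gene names, counts and pseudocount below are made up for illustration.

```python
import math
from typing import Dict


def counts_per_million(counts: Dict[str, int]) -> Dict[str, float]:
    """Scale raw counts so that each library sums to one million."""
    total = sum(counts.values())
    return {gene: c * 1_000_000 / total for gene, c in counts.items()}


def log2_fold_change(
    cpm_a: Dict[str, float],
    cpm_b: Dict[str, float],
    pseudocount: float = 1.0,
) -> Dict[str, float]:
    """log2(B / A) per gene; the pseudocount avoids division by zero."""
    return {
        gene: math.log2((cpm_b[gene] + pseudocount) / (cpm_a[gene] + pseudocount))
        for gene in cpm_a
    }


# Hypothetical counts for three genes in the two growth conditions.
continuous = {"geneA": 100, "geneB": 300, "geneC": 600}
bioleaching = {"geneA": 400, "geneB": 300, "geneC": 300}

lfc = log2_fold_change(
    counts_per_million(continuous),
    counts_per_million(bioleaching),
)

if __name__ == "__main__":
    for gene, value in lfc.items():
        print(f"{gene}\t{value:+.2f}")
```

A positive value means higher relative expression in the second condition (here, the hypothetical bioleaching sample), and a negative value means lower. A single pair of samples cannot establish significance, which is why tools such as edgeR rely on replicates and a statistical model.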