Skip to content

max-baer/Trichuris_Incognita

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

35 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

Genetic makeup of Trichuris incognita, a novel Trichuris species naturally infecting humans and displaying signs of resistance to drug treatment

Author: Max Bär, max.baer[at]swisstph.ch

Contents:

This repository contains all code to reproduce the results and figures from the publication Genetic makeup of Trichuris incognita, a novel Trichuris species naturally infecting humans and displaying signs of resistance to drug treatment. 4 pipelines are presented, for the de-novo hybrid genome assembly, for the gene prediction and functional annotation of the newly assembly genome, for the assembly and phylogenetic inference of 747 individual mitogenomes of T. incognita. Nextflow was the main language used to construct the pipelines and each code snippet is elaborated on in the respective README.md file in the sub-directories. All scripts were run on the SciCORE computing cluster at the University of Basel with most modules pre-installed. Many scripts were adapted from Stephen Doyle's project on ancient and modern Trichuris genomes (Population genomics of ancient and modern Trichuris trichiura)

Table of contents

  1. De-novo hybrid genome assembly
  2. Gene prediction and funtional annotation
  3. Mitogenome assembly and phylogeny
  4. Genome wide association study

De-novo hybrid genome assembly

Visual overview

whole_genome_overview

Methods

Nextflow (version 23.04.1 build 5866) was used with Java (version 11.0.3). Nanopore raw reads were channeled into chopper (version 0.6.0) where reads below 5kb and low quality reads were filtered out and adapters trimmed. Trimmed reads were channeled into Kraken2 (version 2.1.1) and reads with a contamination ratio >0.5 were filtered out with a custom R (version 4.3.0) script. Illumina reads were channeled into Kraken2 where reads of impurities were identified and filtered out. Illumina and Nanopore channels were combined to run a hybrid de-novo genome assembly using MaSuRCA (version 4.0.9). The outputted contigs were channeled into RagTag (version 2.1.0) for correction and scaffolding using the Trichuris trichiura (PRJEB535) reference genome. The corrected scaffolds were then assigned to one of three linkage groups from T. Muris using Liftoff (version 1.6.3), as described in literature. The genome was frozen for quality assessment using BUSCO (version 5.1.2 with Metazoa_obd10 database) and QUAST (version 5.0.2).

Gene prediction and funtional annotation

Methods

Nextflow (version 23.04.1 build 5866) was used with Java (version 11.0.3). All available reference whole genomes of Trichuris species and Trichinella spiralis were channeled into the pipeline containing the newly assembled genome of T. incognita to run a de-novo gene prediction on all reference genomes and reduce methodological bias from gene prediction software. Repeat regions were masked using RepeatMasker (version 4.1.4) before channeling the genome to BRAKER (version 3.0.6, Braker2 pipeline using the Metazoa protein database from OrthoDB 11). Gene prediction statistics including exon and intron count, mean length and median length were gathered using a custom bash script.

Mitogenome assembly and phylogeny

Visual overview

mitogenome_overview

Methods

For all sequencing, assembly and annotation pipelines Nextflow (version 23.04.1 build 5866) was used with Java (version 11.0.3). Raw reads were first processed using Trimmomatic (version 0.39) ; for PE reads, adapters were removed and N-bases trimmed. Next reads were channeled to getorganelle (version 1.7.7.0) to reconstruct the mitochondrial genome. Runs which resulted in a complete mitochondrial genome were further processed by flipping in cases where the reverse complement was generated and adjustment to COX1 as starting gene. Reference and newly constructed mitochondrial genomes were annotated and visualized using MitoZ (version 3.6). Coding genes were then extracted and aligned as amino acid and nucleic acid sequences using MAFFT (version 7.490).

BEAST2 (version 2.7.5) was used for all phylogenetic inference with a pure birth process (Yule process). Aligned concatenated nucleic acid sequences from all assembled and reference mitochondrial genomes were treated as homochromous. The JC69 substitution model with gamma-distributed rate heterogeneity was used (JC69 + Γ4) and a strict molecular clock was assumed. Trees and clock models were linked and all model parameters were estimated jointly. A Markov Chain Monte Carlo was run for each analysis. Tracer (version 1.7.2) was used to assess convergence and effective sample size (should be over 200). The percentage of samples discarded as burn-in was at least 10%. For the amino acid inferred species tree, 2 assembled sequences of each subclade clade with a posterior distribution over 0.95 were selected at random. The mito REV substitution model was used and all genes were partitioned separately, allowing for different rates for each gene. Effective sample size (ESS) was at least 900 for each of the inferred parameters. The intraspecies phylogeny was inferred using a TN96 + Γ4 model. Each gene was partitioned separately and codon positions 1 and 2 were partitioned separately to 3.

Genome wide association study

Visual overview

GWAS_overview

Methods

For all sequencing, assembly and annotation pipelines Nextflow (version 23.04.1 build 5866) was used with Java (version 11.0.3). Raw reads were first processed using Trimmomatic (version 0.39). BWA (version 0.7.17) and SAMtools (version 1.14) were used to index the assembled reference genome, and align, sort and filter raw reads. Flagstat and Kraken2 (version 2.1.1) were used as QC tools. The standard GATK (version 4.2.6.1) pipeline was followed for variant calling, base quality score recalibration (BQSR) and genotyping. Duplicates were marked, initial basecalling was conducted for BQSR, hard filtering was done after visualizing plots for QUAL, DP, QD, FS, MQ, MQRankSum, SQR, ReadPosRankSum. gVCF’s of all worms were merged, and a joint genotyping was conducted to generate the final VCF. Using this VCF the population stratification was examined by PCA in PLINK (version 1.9). Next a positive control was ran using the gender data to confirm the pipeline in PLINK and assess the GWAS results for linear and logistical regressions. Finally SNP’s associated to sex were filtered out and a final linear regression was conducted on the two populations of drug sensitive and non-sensitive worms.

About

01-De-novo_hybrid_assembly

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages