Genetic makeup of Trichuris incognita, a novel Trichuris species naturally infecting humans and displaying signs of resistance to drug treatment

Author: Max Bär, max.baer[at]swisstph.ch

This repository contains all code to reproduce the results and figures from the publication "Genetic makeup of Trichuris incognita, a novel Trichuris species naturally infecting humans and displaying signs of resistance to drug treatment". Four pipelines are presented: the de-novo hybrid genome assembly, the gene prediction and functional annotation of the newly assembled genome, the assembly and phylogenetic inference of 747 individual mitogenomes of T. incognita, and the genome-wide association study. Nextflow was the main language used to construct the pipelines, and each code snippet is explained in the README.md file of the respective sub-directory. All scripts were run on the SciCORE computing cluster at the University of Basel, where most modules were pre-installed. Many scripts were adapted from Stephen Doyle's project on ancient and modern Trichuris genomes (Population genomics of ancient and modern Trichuris trichiura).

Table of contents

De-novo hybrid genome assembly

Visual overview

Methods

Nextflow (version 23.04.1 build 5866) was used with Java (version 11.0.3). Nanopore raw reads were channeled into chopper (version 0.6.0), where reads shorter than 5 kb and low-quality reads were filtered out and adapters were trimmed. Trimmed reads were channeled into Kraken2 (version 2.1.1), and reads with a contamination ratio >0.5 were filtered out with a custom R (version 4.3.0) script. Illumina reads were also channeled into Kraken2, where contaminant reads were identified and filtered out. The Illumina and Nanopore channels were then combined to run a hybrid de-novo genome assembly using MaSuRCA (version 4.0.9). The resulting contigs were channeled into RagTag (version 2.1.0) for correction and scaffolding against the Trichuris trichiura (PRJEB535) reference genome. The corrected scaffolds were then assigned to one of three linkage groups from T. muris using Liftoff (version 1.6.3), as described in the literature. The assembly was then frozen and its quality assessed using BUSCO (version 5.1.2 with the metazoa_odb10 database) and QUAST (version 5.0.2).
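The contamination filter described above can be sketched as follows. This is a minimal Python illustration, not the custom R script from the repository. It assumes the "contamination ratio" is the fraction of a read's k-mers that Kraken2 assigned to a contaminant taxon, read from the fifth column (the `taxid:count` list) of Kraken2's per-read output. The function names and the set of contaminant taxids are hypothetical.

```python
def contamination_ratio(kmer_field, contaminant_taxids):
    """Fraction of k-mers assigned to contaminant taxa (assumed definition)."""
    total = 0
    contaminant = 0
    for token in kmer_field.split():
        taxid, _, count = token.partition(":")
        if not count.isdigit():
            continue  # skips the "|:|" separator used for paired-end reads
        n = int(count)
        total += n
        if taxid in contaminant_taxids:
            contaminant += n
    return contaminant / total if total else 0.0


def keep_read_ids(kraken_lines, contaminant_taxids, threshold=0.5):
    """Return IDs of reads whose contamination ratio does not exceed the threshold."""
    kept = []
    for line in kraken_lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 5:
            continue  # not a per-read Kraken2 output line
        read_id, kmer_field = fields[1], fields[4]
        if contamination_ratio(kmer_field, contaminant_taxids) <= threshold:
            kept.append(read_id)
    return kept


# Toy per-read lines in Kraken2's tab-separated output layout.
example_lines = [
    "C\tread1\t9606\t100\t9606:60 0:10",
    "U\tread2\t0\t100\t0:70",
    "C\tread3\t562\t100\t562:30 0:40",
]
kept_ids = keep_read_ids(example_lines, contaminant_taxids={"9606", "562"})
```

With these toy lines, `read1` is dropped (60 of 70 k-mers are contaminant) while `read2` and `read3` are kept. The kept IDs could then be used to subset the FASTQ file before assembly.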

Gene prediction and functional annotation

Methods

Nextflow (version 23.04.1 build 5866) was used with Java (version 11.0.3). All available whole-genome references of Trichuris species and Trichinella spiralis were channeled into the pipeline together with the newly assembled genome of T. incognita. De-novo gene prediction was run on every genome, including the references, to reduce methodological bias introduced by the gene prediction software. Repeat regions were masked using RepeatMasker (version 4.1.4) before channeling each genome to BRAKER (version 3.0.6, BRAKER2 pipeline using the Metazoa protein database from OrthoDB 11). Gene prediction statistics, including exon and intron counts, mean lengths and median lengths, were gathered using a custom bash script.
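The statistics step can be illustrated with a short Python sketch. It is not the custom bash script from the repository. It assumes the gene predictions are available as a GTF/GFF-style file with nine tab-separated columns and with `exon` and `intron` as feature types in column three. The function name is hypothetical.

```python
import statistics
from collections import defaultdict


def feature_length_stats(gtf_lines, features=("exon", "intron")):
    """Count features and summarise their lengths (end - start + 1)."""
    lengths = defaultdict(list)
    for line in gtf_lines:
        if line.startswith("#") or not line.strip():
            continue  # skip comments and blank lines
        cols = line.rstrip("\n").split("\t")
        if len(cols) < 9:
            continue  # not a complete annotation record
        if cols[2] in features:
            lengths[cols[2]].append(int(cols[4]) - int(cols[3]) + 1)
    return {
        feature: {
            "count": len(values),
            "mean": statistics.mean(values),
            "median": statistics.median(values),
        }
        for feature, values in lengths.items()
    }


# Toy annotation records; coordinates are 1-based and inclusive.
example_gtf = [
    "# toy annotation",
    "chr1\tsrc\texon\t1\t100\t.\t+\t.\tg1",
    "chr1\tsrc\tintron\t101\t200\t.\t+\t.\tg1",
    "chr1\tsrc\texon\t201\t300\t.\t+\t.\tg1",
    "chr1\tsrc\texon\t401\t700\t.\t+\t.\tg2",
]
stats = feature_length_stats(example_gtf)
```

Running the same function on each genome's predictions gives directly comparable exon and intron summaries across species.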

Mitogenome assembly and phylogeny

Visual overview

Methods

For all sequencing, assembly and annotation pipelines Nextflow (version 23.04.1 build 5866) was used with Java (version 11.0.3). Raw reads were first processed using Trimmomatic (version 0.39); for paired-end (PE) reads, adapters were removed and N-bases trimmed. Next, reads were channeled to GetOrganelle (version 1.7.7.0) to reconstruct the mitochondrial genome. Runs that resulted in a complete mitochondrial genome were processed further: sequences assembled as the reverse complement were flipped, and all sequences were adjusted to start at the COX1 gene. Reference and newly constructed mitochondrial genomes were annotated and visualized using MitoZ (version 3.6). Coding genes were then extracted and aligned as amino acid and nucleic acid sequences using MAFFT (version 7.490).
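The flipping and start-adjustment step can be sketched in Python as below. This is an illustration only and not the code used in the pipeline. It assumes the position and strand of COX1 in the assembled circular sequence are already known (for example from an annotation), given as 0-based, half-open coordinates. All function names are hypothetical.

```python
_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(seq):
    """Reverse complement of a DNA sequence (A, C, G, T, N only)."""
    return seq.translate(_COMPLEMENT)[::-1]


def rotate(seq, start):
    """Rotate a circular sequence so that it begins at index `start`."""
    start %= len(seq)
    return seq[start:] + seq[:start]


def normalise_mitogenome(seq, cox1_start, cox1_end, cox1_on_reverse):
    """Orient the genome so COX1 is on the forward strand and starts at index 0.

    cox1_start and cox1_end are 0-based, half-open coordinates on `seq`.
    """
    if cox1_on_reverse:
        # After flipping, a feature spanning [start, end) maps to [n - end, n - start).
        new_start = len(seq) - cox1_end
        seq = reverse_complement(seq)
        cox1_start = new_start
    return rotate(seq, cox1_start)


# Toy example: "ATG" stands in for the start of COX1.
forward = normalise_mitogenome("TTTATGCCCGGG", 3, 6, cox1_on_reverse=False)
flipped = normalise_mitogenome("CGGGCATAAACC", 4, 7, cox1_on_reverse=True)
```

Both toy inputs represent the same circular molecule, once rotated and once reverse-complemented, and both are normalised to `ATGCCCGGGTTT`. Bringing every mitogenome to a common start and orientation is what makes the downstream gene extraction and alignment comparable.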

BEAST2 (version 2.7.5) was used for all phylogenetic inference with a pure birth process (Yule process) as tree prior. Aligned, concatenated nucleic acid sequences from all assembled and reference mitochondrial genomes were treated as homochronous. The JC69 substitution model with gamma-distributed rate heterogeneity (JC69 + Γ4) was used and a strict molecular clock was assumed. Trees and clock models were linked and all model parameters were estimated jointly. A Markov chain Monte Carlo (MCMC) analysis was run for each data set. Tracer (version 1.7.2) was used to assess convergence and the effective sample size (ESS), which should exceed 200. At least 10% of samples were discarded as burn-in. For the species tree inferred from amino acid sequences, two assembled sequences from each subclade with a posterior support over 0.95 were selected at random. The mtREV substitution model was used and all genes were partitioned separately, allowing for a different rate for each gene. The ESS was at least 900 for each of the inferred parameters. The intraspecies phylogeny was inferred using a TN93 + Γ4 model. Each gene was partitioned separately, and codon positions 1 and 2 were partitioned separately from position 3.
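For readers unfamiliar with JC69, the sketch below shows the model's closed-form transition probabilities and the corresponding distance correction in Python. It illustrates the substitution model only; it is not part of the BEAST2 analysis and omits the gamma rate categories (Γ4), which average the site likelihood over four rate classes. The function names are hypothetical.

```python
import math


def jc69_probabilities(branch_length):
    """JC69 transition probabilities for a branch length in substitutions per site.

    Returns (p_same, p_diff): the probability that a site keeps its base, and the
    probability that it ends in one specific other base.
    """
    decay = math.exp(-4.0 * branch_length / 3.0)
    return 0.25 + 0.75 * decay, 0.25 - 0.25 * decay


def jc69_distance(p_observed):
    """JC69-corrected distance from the observed proportion of differing sites."""
    if not 0.0 <= p_observed < 0.75:
        raise ValueError("observed proportion must lie in [0, 0.75)")
    return -0.75 * math.log(1.0 - 4.0 * p_observed / 3.0)


p_same, p_diff = jc69_probabilities(0.3)
# The expected proportion of differing sites is 3 * p_diff; correcting it
# recovers the original branch length.
recovered = jc69_distance(3.0 * p_diff)
```

Because JC69 assumes equal base frequencies and a single substitution rate, it has no free substitution parameters. The TN93 model used for the intraspecies phylogeny relaxes both assumptions.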

Genome wide association study

Visual overview

Methods

For all sequencing, assembly and annotation pipelines Nextflow (version 23.04.1 build 5866) was used with Java (version 11.0.3). Raw reads were first processed using Trimmomatic (version 0.39). BWA (version 0.7.17) and SAMtools (version 1.14) were used to index the assembled reference genome and to align, sort and filter the reads. SAMtools flagstat and Kraken2 (version 2.1.1) were used as QC tools. The standard GATK (version 4.2.6.1) pipeline was followed for variant calling, base quality score recalibration (BQSR) and genotyping. Duplicates were marked, an initial round of variant calling was conducted for BQSR, and hard filtering was applied after visualizing plots for QUAL, DP, QD, FS, MQ, MQRankSum, SOR and ReadPosRankSum. The gVCFs of all worms were merged and joint genotyping was conducted to generate the final VCF. Using this VCF, population stratification was examined by PCA in PLINK (version 1.9). Next, a positive control was run in PLINK using the sex data to confirm the pipeline and to assess the GWAS results for linear and logistic regressions. Finally, SNPs associated with sex were filtered out and a final linear regression was conducted on the two populations of drug-sensitive and non-sensitive worms.
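The hard-filtering logic can be sketched in Python as below. The thresholds are an assumption: they are GATK's generic starting points for SNPs, not the values chosen for this study, which were set after inspecting the annotation plots. DP is left out because depth cut-offs depend on the data set. The dictionary and function names are hypothetical.

```python
# Assumed thresholds (GATK's generic SNP starting points), NOT the study's values.
FAIL_IF_BELOW = {
    "QUAL": 30.0,
    "QD": 2.0,
    "MQ": 40.0,
    "MQRankSum": -12.5,
    "ReadPosRankSum": -8.0,
}
FAIL_IF_ABOVE = {"FS": 60.0, "SOR": 3.0}


def failed_filters(annotations):
    """Return the names of the annotations for which a variant fails.

    `annotations` maps annotation names to numeric values. Missing annotations
    are treated as passing in this sketch.
    """
    failed = []
    for name, limit in FAIL_IF_BELOW.items():
        value = annotations.get(name)
        if value is not None and value < limit:
            failed.append(name)
    for name, limit in FAIL_IF_ABOVE.items():
        value = annotations.get(name)
        if value is not None and value > limit:
            failed.append(name)
    return failed


good_variant = {"QUAL": 500.0, "QD": 20.0, "MQ": 60.0, "FS": 1.2,
                "SOR": 0.7, "MQRankSum": 0.1, "ReadPosRankSum": 0.3}
poor_variant = {"QUAL": 25.0, "QD": 1.5, "FS": 80.0}

good_result = failed_filters(good_variant)
poor_result = failed_filters(poor_variant)
```

Here `good_result` is empty and `poor_result` lists `QUAL`, `QD` and `FS`. Reporting which annotations failed, rather than a single pass/fail flag, makes it easier to see which threshold removes most variants when tuning against the plots.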

