index.qmd

![](images/logo.png){width="600"}

NOTE: THIS PROJECT IS UNDER DEVELOPMENT AND MAY NOT FUNCTION AS EXPECTED UNTIL THIS MESSAGE GOES AWAY

## Introduction {#introduction}

**nf-core/pathogensurveillance** is a population genomic pipeline for pathogen diagnosis, variant detection, and biosurveillance. The pipeline accepts the paths to raw reads for one or more organisms (in the form of a CSV file) and creates reports in the form of interactive HTML reports or PDF documents. Significant features include the ability to analyze unidentified eukaryotic and prokaryotic samples, creation of reports for multiple user-defined groupings of samples, automated discovery and downloading of reference assemblies from NCBI RefSeq, and rapid initial identification based on k-mer sketches followed by a more robust core genome phylogeny and SNP-based phylogeny.

## Background - why use PathogenSurveillance? {#background}

**TL;DR:** **unknown FASTQ -\> sample ID + phylogeny + publication-quality figures**

Most existing genomic tools are designed to be used with a reference genome. Unfortunately, this is a luxury in the pathogen diagnostic world, where researchers often are faced with unknown samples. This creates a conundrum: how can you draw a point of comparison without a starting point?

PathogenSurveillance addresses this by **picking a good reference genome for you** through a process of k-mer sketch comparisons. In most cases, this will be the best reference. It is therefore a great option for identification of an unknown sample, or when you would like some empirical data suggesting which references are available in nearby species. Conversely, if you already have a good understanding of your system, you also have the ability to supply your choice of a reference genome.

Pathogensurveillance uses several strategies to address the emerging complexity of working with sequences that could be anything ranging from neamtodes to bacteria or fungi. First, PathogenSurveillance uses **reasonable baseline parameters** that work very well with most species. There are some processes where a one-size-fits-all strategy isn't feasible. In such cases, PathogenSurveillance will change the analysis workflow automatically (e.g. assembling prokaryotic vs eukaryotic core genes). These analysis branchpoints are facilitated by the nextflow framework, and require no additional inputs from the user.

While PathogenSurveillance may be a useful tool for researchers of all levels, it was designed with clinicians in mind who may have limited bioinformatics training. **PathogenSurveillance is very simple to run**. At a minimum, all that needs to be supplied is a .CSV file with a single column specifying the path to your sample's sequencing reads. For more experienced users, there are opportunities to customize the parameters of the pipeline, for example by porviding information on geogrpahic locations, hsots, etc. In particular, the final report is designed for configuration through Quarto and the PSminer widget R package.

**Pathogensurveillance is particularly good for:**

-   unknown sample identification
-   exploratory population analysis using minimal input parameters
-   inexperienced bioinformatics users
-   efficient parallelization of tasks
-   repeated analysis (given caching) where you would like to add new samples to a past run

**Pathogensurveillance is not designed for:**

-   viral sequence
-   non gDNA datasets (RNA-seq, RAD-seq, ChIP-seq, etc.)
-   mixed/impure samples
-   Researchers who want to use a highly configurable pipeline, or those who would like to extensively test the pipeline at each stage

::: {.callout-note}
Pathogensuveillance manages its many software dependencies through containers. The pipeline works with Linux machines and HPC clusters, but Singularity and Docker currently have **compatibility issues with MacOS and Windows computers**. 
:::

## Pipeline summary {#pipelinesummary}

This is a quick breakdown of the processes used by PathogenSurveillance:

-   Download sequences and references if they are not provided locally
-   Quickly obtain several initial sample references (`bbmap`)
-   More accurately select appropriate reference genomes. Genome "sketches" are compared between first-pass references, samples, and any references directly provided by the user (`sourmash`)
-   Genome assembly
    -   Illumina shortreads: (`spades`)
    -   Pacbio or Oxford Nanopore longreads: (`flye`)
-   Genome annotation (`bakta`)
-   Align reads to reference sequences (`bwa`)
-   Variant calling and filtering (`graphtyper`, `vcflib`)
-   Determine relationship between samples and references
    -   Build SNP tree from variant calls (`iqtree`)
    -   For Prokaryotes:
        -   Identify shared orthologs (`pirate`)
        -   Build tree from core genome phylogeny (`iqtree`)
    -   For Eukaryotes:
        -   Identify BUSCO genes (`busco`)
        -   Build tree from BUSCO genes (`read2tree`)
-   Generate interactive html report/pdf file
    -   Sequence and assembly information (`fastqc, multiqc, quast`)
    -   sample identification tables and heatmaps
    -   Phylogenetic trees from genome-wide SNPs and core genes
    -   minimum spanning network

## PathogenSurveillance pipeline chart {#flowchart}

![](images/pipeline_diagram.png){width="700"}