Skip to content

Align Pangenome

Sam Minot edited this page Feb 18, 2025 · 4 revisions

Workflow: align_pangenome.nf

Align a set of metagenomic sequencing data to a binned pangenome in which the genes have been grouped by similar occurrence across genomes (using the build_pangenome.nf workflow).

Those binned gene groups will be used to summarize a batch of metagenomic data, and the similarity of gene abundance across metagenomes will be tested for concordence with the similarity across genomes.

By applying the results of gene binning to the metagenome alignment results using the same gene catalog, the diversity of a collection of organisms can be collapsed into units of those co-occurrent genetic elements.

Analysis Steps:

  1. Align WGS reads to the gene catalog.
  2. Filter duplicate alignments so that each read aligns to <=1 gene from the catalog. Duplicate alignments are resolved using FAMLI.
  3. Summarize the proportion of reads from each sample which aligns to each gene.
  4. Aggregate gene-level information using the gene bins (generated in the Build Pangenome step).
  5. If sample metadata was provided, use corncob to identify gene bins which are differentially abundant between sample groups.

Parameters:

--genes             Path of single deduplicated amino acid FASTA file to be used for alignment
--reads             Folder containing the set of FASTQ files to align against those genes
--reads_suffix      File ending for all reads (default: ${params.reads_suffix})
--paired            Set to true if input folder contains paired-end FASTQ files (default: false)
--samplesheet       Optional file listing the FASTQ inputs to process (headers: sample,R1,R2)

--min_score_reads   Minimum alignment score (default: ${params.min_score_reads})
--min_identity      Minimum percent identity of the amino acid alignment required to retain the alignment
                    (default: ${params.min_identity}, ranges 0-100)
--max_evalue        Maximum E-value threshold used to filter all alignments
                    (default: ${params.max_evalue})
--aligner           Algorithm used for alignment (default: ${params.aligner}, options: diamond, blast)
--query_gencode     Genetic code used for conceptual translation of genome sequences
                    (default: ${params.query_gencode})
--max_overlap       Any alignment which overlaps a higher-scoring alignment by more than this
                    amount will be filtered out (default: ${params.max_overlap}, range: 0-100)
--aln_fmt           Column headings used for alignment outputs (see DIAMOND documentation for details)
                    (default: ${params.aln_fmt})

--gene_bins         Grouping of genes into bins (e.g. gene_bins.csv)
--group_profile     Gene content of genome groups (e.g. group_profile.csv)
--genome_groups     Grouping of genomes by gene content (e.g. genome_groups.csv)
--output            Folder where output files will be written

--metadata          Optional: Metadata table used to compare samples (CSV)
--formula           Optional: Column(s) from metadata table used for comparison

--min_n_reads       Exclude any samples with fewer than this number of reads aligned (default: 0)
--min_n_genes       Exclude any samples with fewer than this number of genes detected (default: 0)

Application Notes:

  1. The genes parameter should point to the centroids.faa.gz output from build_pangenome.nf.
  2. The value of min_identity should ideally match that which was used to construct the pangenome.
  3. The combination of min_n_reads and min_n_genes can be used to filter out samples from the final outputs, with the aim of excluding low-coverage samples.
  4. To identify gene bins which are differentially abundant between groups of samples, point to a metadata CSV which includes the metadata labels for each sample and specify the column which contains the basis of comparison with formula. The best way to make sure that the sample labels in metadata match up with the FASTQ inputs is to include a samplesheet CSV which explicitly states which FASTQ files belong to each sample.
  5. The simplest comparison for groups of samples is to use a column with 0 or 1, in which case those two groups will be compared with 0 as the reference and 1 as the comparitor. If you provide multiple numeric values, it will be evaluated as a linear correlation. If you provide multiple non-numeric values, it will be evaluated as a series of categorical comparisons with each unique value being compared against the reference (which is picked as the one that comes alphanumerically first).

Output Files:

  • Detailed information on the alignment of each sample to the gene catalog:
    • alignments/
  • Output of the statistical analysis for differential abundance between sample groups. Will contain multiple subfolders if categorical comparisons were performed.
    • association/
  • Metadata applied to each sample (CSV)
    • csv/metagenome.obs.csv.gz
  • Abundance of each gene bin across each sample - number of reads per gene per sample
    • csv/metagenome.bins.X.csv.gz
  • Abundance of each gene bin across each sample - proportion of reads per gene per sample
    • csv/metagenome.bins.prop.csv.gz
  • Abundance of each gene across each sample - number of reads per gene per sample
    • csv/metagenome.genes.X.csv.gz
  • Abundance of each gene across each sample - number of reads per gene per sample, normalized to gene length
    • csv/metagenome.genes.depth.csv.gz
  • Coverage of each gene across each sample - proportion of each gene's length which was covered by reads in each sample
    • csv/metagenome.genes.coverage.csv.gz
  • Bin-level abundance information in AnnData format
    • h5ad/metagenome.bins.h5ad
  • Gene-level abundance information in AnnData format
    • h5ad/metagenome.genes.h5ad
  • Detailed information on how many reads were aligned from each sample to each gene, in long format
    • read_alignments.csv.gz
  • Aggregated information at the gene and bin level in MuData format
    • metagenome.h5mu