-
Notifications
You must be signed in to change notification settings - Fork 4
Align Pangenome
Sam Minot edited this page Feb 18, 2025
·
4 revisions
Workflow: align_pangenome.nf
Align a set of metagenomic sequencing data to a binned pangenome in
which the genes have been grouped by similar occurrence across genomes
(using the build_pangenome.nf
workflow).
Those binned gene groups will be used to summarize a batch of metagenomic data, and the similarity of gene abundance across metagenomes will be tested for concordence with the similarity across genomes.
By applying the results of gene binning to the metagenome alignment results using the same gene catalog, the diversity of a collection of organisms can be collapsed into units of those co-occurrent genetic elements.
- Align WGS reads to the gene catalog.
- Filter duplicate alignments so that each read aligns to <=1 gene from the catalog. Duplicate alignments are resolved using FAMLI.
- Summarize the proportion of reads from each sample which aligns to each gene.
- Aggregate gene-level information using the gene bins (generated in the Build Pangenome step).
- If sample metadata was provided, use corncob to identify gene bins which are differentially abundant between sample groups.
--genes Path of single deduplicated amino acid FASTA file to be used for alignment
--reads Folder containing the set of FASTQ files to align against those genes
--reads_suffix File ending for all reads (default: ${params.reads_suffix})
--paired Set to true if input folder contains paired-end FASTQ files (default: false)
--samplesheet Optional file listing the FASTQ inputs to process (headers: sample,R1,R2)
--min_score_reads Minimum alignment score (default: ${params.min_score_reads})
--min_identity Minimum percent identity of the amino acid alignment required to retain the alignment
(default: ${params.min_identity}, ranges 0-100)
--max_evalue Maximum E-value threshold used to filter all alignments
(default: ${params.max_evalue})
--aligner Algorithm used for alignment (default: ${params.aligner}, options: diamond, blast)
--query_gencode Genetic code used for conceptual translation of genome sequences
(default: ${params.query_gencode})
--max_overlap Any alignment which overlaps a higher-scoring alignment by more than this
amount will be filtered out (default: ${params.max_overlap}, range: 0-100)
--aln_fmt Column headings used for alignment outputs (see DIAMOND documentation for details)
(default: ${params.aln_fmt})
--gene_bins Grouping of genes into bins (e.g. gene_bins.csv)
--group_profile Gene content of genome groups (e.g. group_profile.csv)
--genome_groups Grouping of genomes by gene content (e.g. genome_groups.csv)
--output Folder where output files will be written
--metadata Optional: Metadata table used to compare samples (CSV)
--formula Optional: Column(s) from metadata table used for comparison
--min_n_reads Exclude any samples with fewer than this number of reads aligned (default: 0)
--min_n_genes Exclude any samples with fewer than this number of genes detected (default: 0)
- The
genes
parameter should point to thecentroids.faa.gz
output frombuild_pangenome.nf
. - The value of
min_identity
should ideally match that which was used to construct the pangenome. - The combination of
min_n_reads
andmin_n_genes
can be used to filter out samples from the final outputs, with the aim of excluding low-coverage samples. - To identify gene bins which are differentially abundant between groups of samples, point to a
metadata
CSV which includes the metadata labels for each sample and specify the column which contains the basis of comparison withformula
. The best way to make sure that the sample labels inmetadata
match up with the FASTQ inputs is to include asamplesheet
CSV which explicitly states which FASTQ files belong to each sample. - The simplest comparison for groups of samples is to use a column with
0
or1
, in which case those two groups will be compared with0
as the reference and1
as the comparitor. If you provide multiple numeric values, it will be evaluated as a linear correlation. If you provide multiple non-numeric values, it will be evaluated as a series of categorical comparisons with each unique value being compared against the reference (which is picked as the one that comes alphanumerically first).
- Detailed information on the alignment of each sample to the gene catalog:
alignments/
- Output of the statistical analysis for differential abundance between sample groups. Will contain multiple subfolders if categorical comparisons were performed.
association/
- Metadata applied to each sample (CSV)
csv/metagenome.obs.csv.gz
- Abundance of each gene bin across each sample - number of reads per gene per sample
csv/metagenome.bins.X.csv.gz
- Abundance of each gene bin across each sample - proportion of reads per gene per sample
csv/metagenome.bins.prop.csv.gz
- Abundance of each gene across each sample - number of reads per gene per sample
csv/metagenome.genes.X.csv.gz
- Abundance of each gene across each sample - number of reads per gene per sample, normalized to gene length
csv/metagenome.genes.depth.csv.gz
- Coverage of each gene across each sample - proportion of each gene's length which was covered by reads in each sample
csv/metagenome.genes.coverage.csv.gz
- Bin-level abundance information in AnnData format
h5ad/metagenome.bins.h5ad
- Gene-level abundance information in AnnData format
h5ad/metagenome.genes.h5ad
- Detailed information on how many reads were aligned from each sample to each gene, in long format
read_alignments.csv.gz
- Aggregated information at the gene and bin level in MuData format
metagenome.h5mu