Skip to content

10. Population Genetics Analysis of Genes Found in a GCF

Rauf Salamzade edited this page Jul 11, 2023 · 64 revisions

lsaBGC-PopGene.py

To automatically detect interesting evolutionary patterns in BGC instances of a GCF, we have developed the program lsaBGC-PopGene.py which calculates conservation and population genetics type statistics for each homolog group found in the GCF. These stats are based on codon alignments for each homolog group constructed from running MAGUS (a wrapper of MAFFT) for protein sequences followed by PAL2NAL transformation of the corresponding nucleotide coding sequences.

There are multiple output files of potential interest to users by lsaBGC-PopGene.py, but for most, the main results of interest will be the following three:

  1. Codon multi-sequence-alignment FASTAs - can be found in the subdirectory: Codon_Alignments/
  2. Codon multi-sequence-alignment conservation visualizations - can be found in the subdirectory: Codon_MSA_Plots/
  3. Summarized report on conservation/population-genetics statistics for homolog groups - can be found in the main directory as the file Homolog_Group_Information.txt. A data dictionary and discussion around the calculation of certain statistics is described below.

Codon Multi-Sequence-Alignment Conservation Visualizations.

Using the ggplot2 library in R, lsaBGC-PopGene.py automatically constructs a multi-panel image to visualize the conservation exhibited among different genetic instances of a particular homolog group. The x-axis for all panels is the position along the codon-based multiple-sequence-alignment.

Currently, four panels are shown in these visualizations (though updates are planned later):

Panel Title Description
Top panel Site coverage The number of sequences in the codon-alignment which have an allele (no-gap) at a particular position.
Second from the top panel Count of Alleles The number of distinct nucleotides (thus max value of 4) observed at a particular position.
Third from the top panel 1.0 - Major Allele Frequency The inverse of the major-allele-frequency. Note, this can be below 0.5 if there are multiple minor alleles present.
Bottom panel Domains The span of predicted domains along the alignment.

Here is an example of such a plot for some homolog group.

More Information on the Homolog_Group_Information.txt Result File

This is the primary output of lsaBGC-PopGene.py and is a very-wide table where each row corresponds to a homolog group found in the focal GCF of investigation. All the columns and information components are described below in the data dictionary.

Algorithm to Infer Consensus Ordering and Directionality of Homolog Groups

We developed an algorithm to infer the consensus order of homolog groups relative to each other in the GCF, as well as their relative consensus directionality (sense vs. antisense). This algorithm works by first computing how many times a homolog group proceeds another homolog group in the different BGC instances belonging to a GCF. This information, encoded as a dictionary, is iterated through to first construct a consensus path from the most often starting homolog group to the most often ending homolog group for the GCF. Afterwards, homolog groups which were not featured in this consensus path, maybe because they are infrequently found and not core to the GCF, are attempted to be placed in their most appropriate locations. When looking at the Homolog_Group_Information.txt file sorted by the consensus ordering the homolog groups, it is thus important to consider the proportion of samples with the GCF which have a particular homolog group. The core ordering will be influenced more by more prevalent homolog groups. The consensus directionality of homolog groups is simply based on whether most instances of it are forward or reverse relative to a reference gene from some chosen representative BGC.

Homolog Group Beta-RD Statistic

If a user provides genome-wide similarity estimates from MASH (as is the default in the lsaBGC-AutoAnalyze.py workflow), then the median beta-rd statistic is computed per homolog group. The beta-rd statistic measures the pairwise sequence similarity of homolog group instances from a pair of samples normalized by their genome-wide similarity. More on this statistic, that is the core of the lsaBGC-Divergence.py program, please check out that program's page. The max beta-rd is also reported in newer versions to allow for more easy identification of potential HGT events.

Tajima's D and Proportion of Variable Sites in a Homolog Group's Multiple Sequence Alignment

Tajima's D statistic is a classic population genetics method to measure the abundance of rare to common variants observed for a homolog group. If there are more common minor-allele variants than rare ones, this indicates balancing selection; whereas, if rarer variants are more prevalent it suggests high conservation or sweeping selection. This metric was applied in M. tuberculosis by the Pepperell group where they found that gene length can influence the metric and this insight should thus be taken into consideration when interpreting the statistic.

We used and modified code developed by Tom Whalley [and made available on Github at: https://github.com/WhalleyT/tajima]. Namely, we made the calculation of Tajima's D into a single discrete function and adjusted its calculation to match the approach described by Tajima 1989. Additionally, when calculating Tajima's D we only consider segregating sites and pairwise differences between sequences when multiple non-gap alleles are observed. Codon alignments were filtered for sequences which were highly ambiguous (>25% gaps in MSA) and subsequently specific codon sites were filtered if they featured gaps > 10% of remaining sequences or were non-polymorphic.

As an alternate assessment of sequence variation, we also report the proportion of sites across the multiple sequence alignment of a homolog group where multiple alleles exist (major allele < 98%) and the proportion of sites where the major allele is non-dominant (<75%).

Calculation of dN/dS

The rate of non-synonymous mutations relative to the rate of synonymous mutations is a classical statistic used to infer the effect of positive vs. negative selection. If the rate for non-synonymous mutations is higher than the rate for synonymous mutations across different instances of a gene (dN/dS > 1), it could suggest positive selection; whereas, if the reverse is true (dN/dS < 1), it could suggest negative or purifying selection.

We utilize Biopython's codonalign.codonseq module to calculate dN/dS using the method described in Nei and Gojobori 1986. Codon alignments are filtered to reduce gaps in the same way as for the calculation of Tajima's D.

Additionally, to ensure we avoid excessive computation when the number of sequences with a homolog group is large, we have implemented an empirical but non-exhaustive framework in which we calculate the median dN/dS between 1000 randomly selected pairs of samples (number adjustable). We do this random sampling and calculation of dN/dS for 20 iterations and finally report the median across iterations along with absolute median deviation to assess robustness of dN/dS estimates.

In the future, we hope to be able to implement a more accurate inference of dN/dS without compromising computational speed. The gold-standard is to use codeml in PAML; however, the software package does not scale well to multiple sequence alignments with hundreds or thousands of sequences. We are thus considering implementing a similar approach to PAML, wherein we perform ancestral state reconstruction using a codon model for the multiple sequence alignment using FastML and calculating dN/dS by comparing each extant sequence for the homolog group with the ancestral sequence. This approach should scale linearly [O(n)] and avoid pairwise calculations [O(n^2)].

Finally, a recent study by Rahman et al. 2021 found that dN/dS calculations reported in bacterial evolutionary studies likely are an underestimate due to weak selection on synonymous codons; thus, we urge the user takes caution when interpreting results of selection and recommend they follow up with targeted assessment for the influence of selection using phylogenetic-aware, maximum-likelihood approaches such as PAML or Hyphy.

Data Dictionary

IMPORTANT NOTE: Please make sure to pay attention to the column titled single-copy_in_GCF_context in the Overview sheets to inform whether evolutionary statistics can be properly interpreted for a homolog group.

Column Description Population/Clade-Analysis Specific Experimental
gcf id The GCF identifier No
gcf annotation The predicted class distribution of BGCs assigned to the GCF No
homolog group The homolog group identifier (unique identifier of each row) No
annotation Predicted products for genes belonging to homolog group. If annotation performed in lsaBGC-Ready, this is via KOfam + PGAP HMMs. No
MIBiG mapping Mapping to MIBiGv3.1 proteins - only reported if reference BGC in MIBiG is generally deemed as matching to GCF No
hg order index Relational order of homolog group to other homolog groups (see above for algorithm details) No
hg consensus direction Consensus direction of homolog group relative to other genes. Boolean logic: 0 = 'reverse' and 1 = 'forward' No
hg proportion multi-copy genome-wide (formerly "hg median copy count") The percentage of primary genomes (so the genomes for which OrthoFinder was run on) where homolog group is found in multiple copies across the genome. No Yes, if using OrthoFinder Mode "BGC_Only" is used for lsaBGC-Easy.py/lsaBGC-Ready.py (look at lsaBGC-Ready.py help function for more details on "BGC_Only" mode).
single-copy_in_GCF_context Whether the homolog group is found in single-copy for samples in the context of this GCF (so paralogs can exist in the genome). No
median gene length The median gene length of genes belonging to the homolog group. No
is core to bgc True if >=50% of gene instances of homolog group overlap with protocluster core region as defined by antiSMASH. No
num of hg instances Number of homolog group instances identified in BGCs belonging to the GCF. Sometimes multiple genes in a BGC are grouped into the same homolog group - useful to consider! No
samples with hg Number of samples with the homolog group in the relevant GCF. No
proportion of samples with hg Proportion of samples with GCF which have the homolog group. If core should be 1.0. No
ambiguous sites proportion Proportion of sites along codon alignment for homolog group where >10% of samples have gaps. High proportions might indicate caution for interpretation of various evolutionary stats. No
Tajimas D Tajima's D statistic measuring the frequency of polymorphisms which are rare vs. common (can indicate signatures of sweeping selection or balancing selection on the homolog group's evolution - but be cautious) - generally useful as a calculator of sequence diversity existing for a homolog group. Only computed for filtered codon alignments with >=4 sequences and >=21 codon sites and >=3 segregating sites. NA - could signify either high conservation or too few sequences/positions for assessment. No No, but adaptation to account for the presence of minor instances of gaps in alignment is not described in original Tajima's D calculation.
proportion variable sites The proportion of sites in the homolog group's which are variable (major allele < 98%). Only accounts for sites in codon alignment which are largely non-ambiguous (<10% of sequences have a gap). No
proportion nondominant major allele The proportion of sites in the homolog group's where the major allele is non-dominant (major allele < 75%). Only accounts for sites in codon alignment which are largely non-ambiguous (<10% of sequences have a gap). No
median beta rd The median Beta-RD statistic - ratio of homolog-group sequence similarity vs. amino-acid expected similarity (based on single-copy genes from GToTree alignment) across pairs of samples. No
median beta rd The median Beta-RD statistic - ratio of homolog-group sequence similarity vs. amino-acid expected similarity (based on single-copy genes from GToTree alignment) across pairs of samples. No
median dn/ds The median of 20 iterations of calculating the median dN/dS ratio between 32 randomly selected sequences as calculated using cal_dn_ds() from Biopython's codonalign.codonseq module. Only computed for filtered codon alignments with >=4 sequences and >=21 codon sites and >=3 segregating sites. No Note, Biopython's codonalign.codonseq module is in development.
mad dn/ds The median absolute deviation of 20 iterations of calculating the median dN/dS ratio between 32 randomly selected sequences as calculated using cal_dn_ds() from Biopython's codonalign.codonseq module. No
populations with hg The number of populations/clades/groupings where the homolog group is found. Yes
proportion of total populations with hg The proportion of all populations where the homolog group is found. Yes
most_significant_Fisher_exact_pvalues presence absence The lowest P-value after Benjamini-Hochberg correction from performing Fisher's exact test looking at specificity of homolog group to a specific population. Yes Experimental!
median Tajimas D per population Tajima's D is calculated for each population/clade and the median value is reported. Yes
mad Tajimas D per population The median absolute deviation of the median Tajima's D statistic across populations/clade. Yes
most negative population Tajimas D The most negative Tajima's D statistic observed for a single population/clade, has to be less than 0.0. Yes
most positive population Tajimas D The most positive Tajima's D statistic observed for a single population/clade, has to be greater than 0.0. Yes
population entropy For each population, a frequency is calculated representing the number of samples with the homolog group from the population divided by the total number of samples with the homolog group. The reported statistic is a simple alpha-diversity measure using entropy with base 2. Yes
median fst-like estimate The median Fst for the homolog group across all pairs of populations. To calculate Fst, pi_within and pi_between are calculated as the median distance of sample pairs to the consensus sequence of the homolog group where the former regards within-focal population pairs and the latter regards sample pairs from the focal and comparator populations. Fst is then calculated for each population pair as pi_between minus pi_within divided by pi_between. Yes Experimental! Derived off the Fst estimate described by Hudson, Slatkin and Maddison 1989 but modified to find differences relative to a single consensus sequence for computational efficiency.
population proportion of members with hg Percentage of samples for populations with the focal homolog group in this GCF-context. Yes
all domains Domains identified for proteins belonging to homolog group. No

Usage

usage: lsaBGC-PopGene.py [-h] -g GCF_LISTING -m ORTHOFINDER_MATRIX [-i GCF_ID] -o OUTPUT_DIRECTORY [-k SAMPLE_SET]
                         [-s SPECIES_PHYLOGENY] [-u POPULATION_CLASSIFICATION] [-p BGC_PREDICTION_SOFTWARE] [-c CPUS] [-d]
                         [-e] [-t] [-w EXPECTED_SIMILARITIES]

        Program: lsaBGC-PopGene.py
        Author: Rauf Salamzade
        Affiliation: Kalan Lab, UW Madison, Department of Medical Microbiology and Immunology

        This program investigates conservation and population genetic related statistics for each homolog group
        observed in BGCs belonging to a single GCF.


optional arguments:
  -h, --help            show this help message and exit
  -g GCF_LISTING, --gcf_listing GCF_LISTING
                        BGC listings file for a gcf. Tab delimited: 1st column lists sample name while the 2nd column is the path to a BGC prediction in Genbank format.
  -m ORTHOFINDER_MATRIX, --orthofinder_matrix ORTHOFINDER_MATRIX
                        OrthoFinder matrix.
  -i GCF_ID, --gcf_id GCF_ID
                        GCF identifier.
  -o OUTPUT_DIRECTORY, --output_directory OUTPUT_DIRECTORY
                        Path to output directory.
  -k SAMPLE_SET, --sample_set SAMPLE_SET
                        Sample set to keep in analysis. Should be file with one sample id per line.
  -s SPECIES_PHYLOGENY, --species_phylogeny SPECIES_PHYLOGENY
                        The species phylogeny in Newick format. Specifies that dN/dS should be calculated by comparing extant homolog-group sequences to ancestral state reconstruction. Does not currently work!!!
  -u POPULATION_CLASSIFICATION, --population_classification POPULATION_CLASSIFICATION
                        Popualation classifications for each sample. Tab delemited: 1st column lists sample name while the 2nd column is an identifier for the population the sample belongs to.
  -p BGC_PREDICTION_SOFTWARE, --bgc_prediction_software BGC_PREDICTION_SOFTWARE
                        Software used to predict BGCs (Options: antiSMASH, DeepBGC, GECCO).
                        Default is antiSMASH.
  -c CPUS, --cpus CPUS  The number of cpus to use.
  -d, --regular_mafft   Run mafft --linsi and not the MAGUS divide-and-conquer approach which allows for scalability and more efficient computing.
  -e, --each_pop        Run analyses individually for each population as well.
  -t, --filter_for_outliers
                        Filter instances of homolog groups which deviate too much from the median gene length observed for the initial set of proteins.
  -w EXPECTED_SIMILARITIES, --expected_similarities EXPECTED_SIMILARITIES
                        Path to file listing expected similarities between genomes/samples. This is
                        computed most easily by running lsaBGC-Ready.py with '-t' specified, which will estimate
                        sample to sample similarities based on alignment used to create species phylogeny.
Clone this wiki locally