Studying the Phylogeny and Reticulate Evolution of the Wheat Species Complex using Repeated Random Haplotype Sampling (RRHS) (Figure 4)
Author: Daniel Lang
This directory comprises a set of Jupyter notebooks, Snakemake workflows and scripts that were used to perform the phylogenetic analyses underlying the model of reticulate evolution in the wheat species complex.
- Figure 4
- Figure S10
- Table S7 (manuscript version comprises only MST edges - full version as tsv and xlsx)
- Figure S11
- Figure S13
- 1. SNP filtering and export as multiple alignments using IUPAC ambiguity and RRHS for heterozygous sites
- 2. Phylogenetic inference of maximum likelihood trees for the IUPAC alignments
- 3. Phylogenetic inference of maximum likelihood trees for the 1000 RRHS samples
- 4. Inferring consensus topologies for the 1000 RRHS trees using RAxML
- 5. Inferring consensus topologies for the 1000 RRHS trees using ASTRAL
- 6. Inference of a Phylogenetic Consensus Network from RRHS trees
- 7. Analysis of the Phylogenetic Consensus Network and Taxon-Community Clusters
- 8. Plotting Trees as Phylograms and Cloudogram/Densitrees
- 9. Checking dominant genotype and introgressions in durum x dicoccoides RIL lines
1. SNP filtering and export as multiple alignments using IUPAC ambiguity and RRHS for heterozygous sites
A summary of the overall SNP filtering process is also provided as a LibreOffice Calc sheet.
- Unimputed variant calls in VCF split by chromosome (e.g. full_vcfs/chr1A.minocc10.maf1pc.vcf)
- Genotype Metadata
- Variant call data structures in HDF5
- Multiple sequence alignments with heterozygous sites as IUPAC ambiguity symbols (FASTA)
- 3 subgenomes (B, A, D)
- 3 x 7 chromosomes (1B-7B, 1A-7A, 1D-7D)
- 1000 Multiple sequence alignments with heterozygous sites randomly selected by RRHS (FASTA)
- Filtered VCF files (VCF)
- Diverse exploratory plots (PDF)
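The two export modes listed above treat heterozygous sites differently. The following is a minimal sketch of that difference, not the code from the notebooks: it assumes diploid calls given as allele pairs, and the function names `iupac_sequence` and `rrhs_sequence` are hypothetical. In the IUPAC mode a heterozygous site becomes one ambiguity symbol; in the RRHS mode one of the two alleles is drawn at random, and repeating the draw yields the replicate alignments.

```python
import random

# IUPAC ambiguity codes for unordered pairs of distinct nucleotides.
IUPAC = {
    frozenset("AG"): "R", frozenset("CT"): "Y", frozenset("CG"): "S",
    frozenset("AT"): "W", frozenset("GT"): "K", frozenset("AC"): "M",
}

def iupac_sequence(genotypes):
    """Encode diploid calls as one sequence; heterozygous sites become IUPAC symbols."""
    seq = []
    for a, b in genotypes:
        if a is None or b is None:
            seq.append("N")  # missing call
        elif a == b:
            seq.append(a)  # homozygous site
        else:
            seq.append(IUPAC[frozenset((a, b))])
    return "".join(seq)

def rrhs_sequence(genotypes, rng):
    """Draw one haplotype: at each site pick one of the two called alleles at random."""
    seq = []
    for a, b in genotypes:
        if a is None or b is None:
            seq.append("N")
        else:
            seq.append(rng.choice((a, b)))  # homozygous sites are unaffected
    return "".join(seq)

# Toy genotype with two heterozygous sites and one missing call.
calls = [("A", "A"), ("A", "G"), ("C", "T"), (None, None), ("G", "G")]
print(iupac_sequence(calls))  # ARYNG

rng = random.Random(1)  # fixed seed only to make the toy example repeatable
replicates = [rrhs_sequence(calls, rng) for _ in range(3)]
print(replicates)
```

Each replicate differs only at the heterozygous positions, which is what lets the downstream trees reflect the uncertainty introduced by those sites.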
Applied filtering criteria (in order of application):
- Only SNPs
- Missing in at most 10% of the genotypes
- Biallelic
- Alternative allele must be present in >1 genotype
- LD Pruning (scikit-allel, allel.locate_unlinked):
allel.locate_unlinked(gn, size=window_size, step=step_size, threshold=threshold)
- window size = 500bp
- step size = 200bp
- threshold = 0.1
- Polymorphic sites
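The first four criteria can be expressed as a per-site predicate. This is a minimal sketch under stated assumptions, not the notebook code: a site is given as a reference allele, a list of alternative alleles and one call per genotype (a tuple of allele indices, or `None` if missing). The function name `passes_filters` and this data layout are hypothetical; LD pruning and the final polymorphism check are not covered here.

```python
def passes_filters(ref, alts, calls, max_missing=0.10):
    """Return True if a site passes the SNP, missingness, biallelic and alt-count filters."""
    # Only SNPs: every allele is a single nucleotide.
    if len(ref) != 1 or any(len(alt) != 1 for alt in alts):
        return False
    # Missing in at most 10% of the genotypes.
    missing = sum(call is None for call in calls)
    if missing / len(calls) > max_missing:
        return False
    # Biallelic: exactly one alternative allele.
    if len(alts) != 1:
        return False
    # Alternative allele (index 1) must be present in more than one genotype.
    carriers = sum(call is not None and 1 in call for call in calls)
    return carriers > 1

# Toy site: 10 genotypes, one missing call, two carriers of the alternative allele.
calls = [(0, 0), (0, 1), (1, 1)] + [(0, 0)] * 6 + [None]
print(passes_filters("A", ["G"], calls))       # passes all four filters
print(passes_filters("A", ["G", "T"], calls))  # rejected: not biallelic
```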
2. Phylogenetic inference of maximum likelihood trees for the IUPAC alignments
- Multiple sequence alignments with heterozygous sites as IUPAC ambiguity symbols (FASTA)
- 3 subgenomes (B, A, D)
- 3 x 7 chromosomes (1B-7B, 1A-7A, 1D-7D)
- Phylogenetic trees (NEWICK):
iupac/A/RAxML_bestTree.ASC_GTRGAMMA_felsenstein
iupac/A_chromosomes/RAxML_bestTree.chr1A.ASC_GTRGAMMA_felsenstein
- automatically generated configuration files as input for RAxML
- other files reported by RAxML
-m ASC_GTRGAMMA --JC69 --asc-corr=felsenstein
- total size to relate ascertainment bias: 44746258 → present in 95% of the subgenome-specific samples on median
- 12 and 4 threads respectively
#cluster submission
snakemake --snakefile Snakefile.raxml --cluster 'qsub -q QUEUENAME -V -cwd -e log/ -o log/ -pe serial {threads} -l job_mem=2G' -j 100 -w 500
snakemake --snakefile Snakefile.raxml_chr --cluster 'qsub -q QUEUENAME -V -cwd -e log/ -o log/ -pe serial {threads} -l job_mem=2G' -j 100 -w 500
3. Phylogenetic inference of maximum likelihood trees for the 1000 RRHS samples
- 1000 Multiple sequence alignments with heterozygous sites randomly selected by RRHS (FASTA)
- 1000 Phylogenetic trees (NEWICK), e.g. RRHS_RAxML/output/RAxML_bestTree.ASC_GTRGAMMA_felsenstein.A.1
- automatically generated configuration files as input for RAxML
- other files reported by RAxML
-m ASC_GTRGAMMA --JC69 --asc-corr=felsenstein
- total size to relate ascertainment bias: 44746258 → present in 95% of the subgenome-specific samples on median
- 10 threads
- per core memory: 2GB
#cluster submission
snakemake --cluster 'qsub -q QUEUENAME -e {log.err} -o {log.out} -cwd -pe serial {threads} -l job_mem={params.mem}' -j 500 -w 500
4. Inferring consensus topologies for the 1000 RRHS trees using RAxML
- 1000 Phylogenetic trees (NEWICK), e.g. RRHS_RAxML/output/RAxML_bestTree.ASC_GTRGAMMA_felsenstein.A.1
- Consensus topologies with support values (NEWICK; e.g. RAxML_MajorityRuleConsensusTree.ASC_GTRGAMMA_felsenstein.A.RRHS_consensus.MR) with 4 methods:
  - MajorityRuleExtendedConsensusTree
  - MajorityRuleConsensusTree
  - StrictConsensusTree
  - Threshold-75-ConsensusTree
- other files reported by RAxML
-m ASC_GTRGAMMA
- 20 threads
cd RRHS_RAxML/
snakemake --snakefile Snakefile.consensu
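To illustrate how the consensus methods differ, here is a minimal sketch that is independent of RAxML. It makes simplifying assumptions: each tree is represented as a set of rooted clades (frozensets of tip names) rather than unrooted bipartitions, the function names and the method labels `STRICT`, `MR` and `T_75` are hypothetical, and whether the threshold is inclusive is an assumption. The extended majority-rule method, which additionally adds compatible clades below 50% support, is omitted.

```python
from collections import Counter

def clade_support(trees):
    """Fraction of trees that contain each clade."""
    counts = Counter(clade for tree in trees for clade in tree)
    return {clade: n / len(trees) for clade, n in counts.items()}

def consensus(trees, method="MR"):
    """Return {clade: support} for the clades retained by the chosen method."""
    rules = {
        "STRICT": lambda s: s == 1.0,  # present in every tree
        "MR": lambda s: s > 0.5,       # majority rule
        "T_75": lambda s: s >= 0.75,   # 75% threshold (inclusive here by assumption)
    }
    keep = rules[method]
    return {c: s for c, s in clade_support(trees).items() if keep(s)}

# Toy input: five trees over tips a-d, each given as its set of non-trivial clades.
ab, abc, abd = frozenset("ab"), frozenset("abc"), frozenset("abd")
trees = [{ab, abc}] * 3 + [{ab, abd}] * 2

print(consensus(trees, "STRICT"))  # only the clade {a, b}
print(consensus(trees, "MR"))      # {a, b} and {a, b, c} (support 0.6)
print(consensus(trees, "T_75"))    # only {a, b}
```

The support values attached to the retained clades correspond to the fraction of RRHS trees in which each grouping occurs.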
5. Inferring consensus topologies for the 1000 RRHS trees using ASTRAL
Gratefully following the advice and suggestions of ASTRAL developer Siavash Mirarab:
smirarab/ASTRAL#38
- 1000 Phylogenetic trees (NEWICK), e.g. RRHS_RAxML/output/RAxML_bestTree.ASC_GTRGAMMA_felsenstein.A.1
- Consensus topologies with quartet values (NEWICK), e.g. RAxML_bestTree.ASC_GTRGAMMA_felsenstein.A.astral.quartets.nh
-t 1
cd RRHS_RAxML/
snakemake --snakefile Snakefile.ASTRAL-II
6. Inference of a Phylogenetic Consensus Network from RRHS trees
This procedure:
- for each subgenome, infers the minimum spanning tree tip graph for each RRHS tree using the phylogenetic distance as weight → mst_trees
- for each subgenome, merges them into a weighted graph using the inverse of the relative number of mst_trees sharing an edge as a weight → G
- infers the minimum spanning tree graph of G → GM (MST edges)
- merges all subgenome graphs G into a combined graph with annotated MST edges
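The per-subgenome steps of this procedure can be sketched with the standard library alone. This is an illustration under stated assumptions, not the code in the GetNetwork notebooks: each RRHS tree is assumed to be already reduced to a table of pairwise tip distances, and the names `mst_edges` and `consensus_network` are hypothetical. Kruskal's algorithm is used here for the minimum spanning trees; the notebooks may use a different implementation.

```python
from collections import Counter

def mst_edges(weights):
    """Kruskal's algorithm. weights maps frozenset({u, v}) -> weight; returns the MST edge set."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    edges = set()
    # Visit edges by increasing weight; ties are broken by node names for determinism.
    for edge in sorted(weights, key=lambda e: (weights[e], sorted(e))):
        u, v = sorted(edge)
        root_u, root_v = find(u), find(v)
        if root_u != root_v:  # edge joins two components, so it belongs to the MST
            parent[root_u] = root_v
            edges.add(edge)
    return edges

def consensus_network(distance_tables):
    """distance_tables: one {frozenset({tip_a, tip_b}): distance} table per RRHS tree."""
    n = len(distance_tables)
    counts = Counter()
    for distances in distance_tables:
        counts.update(mst_edges(distances))  # one mst_tree per RRHS tree
    # Edge weight = inverse of the relative number of mst_trees sharing the edge.
    G = {edge: n / count for edge, count in counts.items()}
    GM = mst_edges(G)  # minimum spanning tree of the merged graph
    return G, GM

def table(ab, bc, ac):
    """Toy helper: pairwise distances between three tips a, b, c."""
    return {frozenset("ab"): ab, frozenset("bc"): bc, frozenset("ac"): ac}

G, GM = consensus_network([table(1, 2, 3), table(1, 2, 3), table(1, 3, 2)])
print(G)   # a-b in 3/3 trees -> 1.0, b-c in 2/3 -> 1.5, a-c in 1/3 -> 3.0
print(GM)  # MST edges of G: a-b and b-c
```

With this weighting, edges recovered in every RRHS tree receive the lowest weight (1.0), so the minimum spanning tree of G favours the most consistently supported tip-to-tip connections.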
Subsequently, the graph was imported into Cytoscape and community clustering was performed using the Newman-Girvan algorithm implemented in the ClusterMaker2 plugin. The resulting clusters were intersected with taxonomic information and annotated with the AutoAnnotate plugin.
- 1000 Phylogenetic trees (NEWICK), e.g. RRHS_RAxML/output/RAxML_bestTree.ASC_GTRGAMMA_felsenstein.A.1
- Genotype Metadata
- Weighted phylogenetic consensus network for each subgenome (B, A, D): G
- Minimal phylogenetic consensus network for each subgenome (B, A, D): GM
- Combined, annotated, weighted phylogenetic consensus network comprising all subgenomes (B, A, D) → Figure 4A, Figure S10 and Table S4 (manuscript version comprises only MST edges - full version as tsv and xlsx)
The phylogenetic taxon-community clusters were exported from Cytoscape.
- RRHS_RAxML/GetNetwork.B.ipynb
- RRHS_RAxML/GetNetwork.A.ipynb
- RRHS_RAxML/GetNetwork.D.ipynb
- RRHS_RAxML/MergeSubgenomeNetworks.ipynb
- 20 threads
7. Analysis of the Phylogenetic Consensus Network and Taxon-Community Clusters
In-depth statistical analysis of the composition of the Taxon-Community clusters and of the relationships between the 3 bread wheat communities.
- Node annotation RRHS_RAxML/Network.Clusters.nodes.csv
- Country Codes
- RRHS_RAxML/Consensus_network.1000_RAxML.all_genomes.tsv
- Country annotated nodes of the phylogenetic community-taxon clusters
- Country factors used in the analyses as tsv and xlsx (latter was manually curated)
- Several exploratory plots some of which ended up as panels in Figure S13 (in notebook and PDF)
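As a simple illustration of what analysing cluster composition can mean, the sketch below tabulates the share of each taxon within each cluster. It is an assumption-laden example, not the notebook analysis: the column names `cluster` and `taxon`, the toy rows and the function name `cluster_composition` are all hypothetical and do not reflect the actual layout of the node annotation file.

```python
import csv
import io
from collections import Counter, defaultdict

def cluster_composition(rows, cluster_key="cluster", taxon_key="taxon"):
    """Return {cluster: {taxon: fraction of the cluster's nodes}}."""
    counts = defaultdict(Counter)
    for row in rows:
        counts[row[cluster_key]][row[taxon_key]] += 1
    return {
        cluster: {taxon: n / sum(taxa.values()) for taxon, n in taxa.items()}
        for cluster, taxa in counts.items()
    }

# Toy node table; column names and values are made up for this example.
toy_csv = """node,cluster,taxon
g1,C1,T. aestivum
g2,C1,T. aestivum
g3,C1,T. spelta
g4,C2,T. durum
"""
composition = cluster_composition(csv.DictReader(io.StringIO(toy_csv)))
for cluster, shares in composition.items():
    print(cluster, shares)
```

The same tabulation can be repeated with a country column in place of the taxon column to look at geographic composition.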
8. Plotting Trees as Phylograms and Cloudogram/Densitrees
- Tree files in NEWICK from the IUPAC and RRHS RAxML analyses
- Genotype Metadata
- Taxon-Community cluster colors
- Country annotated nodes of the phylogenetic community-taxon clusters
- Several tree plots in the notebook
- Annotated versions of the trees
9. Checking dominant genotype and introgressions in durum x dicoccoides RIL lines
This part of the analysis can be considered something like a blind study, as I was not aware that the genotypes initially labelled as T. turgidum were actually F6 RIL offspring from this study:
As they were identified as hybrids by our approach, they served as an excellent showcase and proof of concept.
- Multiple sequence alignments with heterozygous sites as IUPAC ambiguity symbols (FASTA) per chromosome
- Various plots and tables in the notebook
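One way to think about a per-chromosome check of the dominant genotype is to count, at sites where the two parents differ, how often a RIL carries the allele of each parent. The sketch below illustrates that idea only; it is an assumption about how such a check could look and not the notebook's method. The function name `parental_matches` and the toy sequences are hypothetical.

```python
def parental_matches(ril, parent_a, parent_b):
    """Count informative sites at which the RIL matches exactly one parent."""
    a = b = 0
    for r, pa, pb in zip(ril, parent_a, parent_b):
        if pa == pb or "N" in (r, pa, pb):
            continue  # uninformative or missing site
        if r == pa:
            a += 1
        elif r == pb:
            b += 1
        # IUPAC ambiguity symbols in the RIL match neither parent and are not counted
    return a, b

# Toy per-chromosome alignments as (RIL, parent A, parent B); all sequences are made up.
chromosomes = {
    "chr1A": ("AACGT", "AACGA", "GGTGT"),
    "chr2A": ("GGTTA", "AACTC", "GGTTA"),
}
dominant = {}
for name, (ril, pa, pb) in chromosomes.items():
    a, b = parental_matches(ril, pa, pb)
    dominant[name] = "A" if a > b else "B"
    print(name, a, b, dominant[name])
```

In this toy case chr1A is mostly parent A with one parent B site, while chr2A matches parent B throughout; a chromosome whose dominant parent differs from the rest of the genome would be a candidate introgression.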