Data and scripts for testing SNPPar with either simulated or empirical datasets
The basic command to use SNPPar is:
snppar -g [reference] -t [tree] -s [snp_table] -d [output_directory]
note: '-d' is optional, if left off SNPPar will direct the output to the current folder
To estimate computer usage, add the following before the above command:
/usr/bin/time -lp
Simulated dataset (see also 7. Run simulated data with SNPPar)
note: SNP tables other than those in the sub-folder 'HCMC_L2_r10p' need to be unpacked before using.
snppar -g NC_00962_3_1.gbk -t HCMC_L2_r10p.tre -s HCMC_L2_r10p_r1_alleles.csv -d sim_out_HCMC_L2_r10p/
Empirical dataset (see also 8. Run empirical data with SNPPar)
note: the SNP tables need to be unpacked before using.
snppar -g NC_00962_3_1.gbk -t HCMC_L2_r10p.tre -s HCMC_L2_r10p_alleles_82strains_var_rF_c95.csv -d real_out_HCMC_L2_r10p/
These are self-contained datasets, including reference(s), tree and snp_table(s) required to run SNPPar. The instructions for each are below in 10. Published Datasets.
Holt KE, McAdam P, Thai PVK, Thuong NTT, Ha DTM, Lan NN, Lan NH, Nhu NTQ, Hai HT, Ha VTN, Thwaites G, Edwards DJ, Nath AP, Pham K, Ascher DB, Farrar J, Khor CC, Teo YY, Inouye M, Caws M, Dunstan SJ. Frequent transmission of the Mycobacterium tuberculosis Beijing lineage and positive selection for the EsxW Beijing variant in Vietnam. Nat Genet 2018;50:849-856 doi:10.1038/s41588-018-0117-9
- Vietnam (Ho Chi Minh City) lineage 2 - 'HCMC_L2'
- Global lineage 2 - 'Global_L2'
- Global lineages 1, 2 and 4 - 'Global_L124'
HCMC_L2 and Global_L2 (n=821 and n=940 respectively)
Random samples of 10, 20 and 50 percent of isolates, plus all isolates
Size Labels HCMC_L2 Global_L2
10% '_r10p' 82 94
20% '_r20p' 164 188
50% '_r50p' 410 470
100% '_all' 821 940
Global_L124 (n=2965) Random samples of (n=)100, 500, 1000 and 2000 isolates
size label
100 '_r100'
500 '_r500'
1000 '_r1000'
2000 '_r2000'
Examples of each step using 'HCMC_L2_r10p' are given below. The relevant data will be found in the first folder for the step each is introduced.
Generated from one of three different lists;
'Step2/HCMC_L2.txt', 'Step2/Global_L2.txt', and 'Step2/Global_L124.txt'.
Note: All of the python scripts provided use python 3, often along with the packages biopython and ete3 (same as SNPPar).
python getRandomSet.py -l HCMC_L2.txt -n 82 -o HCMC_L2_r10p.txt
The randomly generated lists used for testing are provided (as I forgot a 'seed') in the folder 'Step2'
Note: 'data/Step3/NC_000962_alleles_3786strains_var.csv.zip' needs to be unzipped before the next step Warning: The SNP table 'NC_000962_alleles_3786strains_var.csv' is a 900 MB file after decompression
The three modules used in this run are 'filter', which uses 'TB_excluded_regions_with_repeat.txt' to remove SNPs that occur in previously identified repeat regions and PE/PPE genes, 'cons'(erve), which removes those SNPs with more than (1-cons)*100*isolates missing calls, and 'aln' (alignment), which outputs the SNP table in MFASTA format for Tree generation using RAxML.
python parseSNPtable.py -s NC_000962_alleles_3786strains_var.csv -x TB_excluded_regions_with_repeat.txt -r NC_00962_3_1.gbk -m filter,cons,aln -l HCMC_L2_r10p.txt -p HCMC_L2_r10p_ -c 0.95
Conservation Note:
dataset conservation level
r10p, r20p, r100 -c 0.95
r50p, r500 -c 0.99
all, r1000, r2000 -c 0.995
Outputs (of importance to us)
SNP table (CSV) - 'HCMC_L2_r10p_alleles_82strains_var_rF_c95.csv' MFASTA alignment - HCMC_L2_r10p_alleles_82strains_var_rF_c95.mfasta' (Note: these have been renamed to reduce the name lengths)
Resulting SNP tables for all twelve data sets are in 'dat/Step3/SNP_tables.zip'. Resulting MFASTA files are in 'data/Step3/MFASTA.zip'.
Note: RAxML was run on a computational cluster (MASSIVE) to make use of the multiple-threaded version. Each of up to 16 threads were allocated 4 Gb. Each data set was run five times using different seed values for each run - these seeds were 'recycled' for each data set.
Seeds Run
1 2 3 4 5
'-p' 8876251475 1885216875 8282263475 1118374265 8144432745
'-x' 31566 315646 3444666 15623245 553324366
raxml -T 16 -s HCMC_L2_r10p_alleles_82strains_var_rF_c95.mfasta -n HCMC_L2_r10p_t01.tre -f a -m ASC_GTRGAMMA --asc-corr=lewis -p 8876251475 -x 31566 -N 100
Information from each run was collected from the resulting 'RAxML_info' file (e.g. 'set4/RAxML_info.HCMC_L2_r10p_t01.out') and collated in 'set4/RAxML_runs.tsv'. Information collected include the final maximum likelihood Optimization Likelihood (OL), the alpha distribution of GAMMA estimated from the data set along with the individual mutation rates between base pairs (e.g. A<->T) relative to the rate of G<->T, and the base frequencies. The latter three were used as estimates for modelling of sequence evolution in SeqGen (see below), whilst the first value was used to pick the 'best' tree from the five replicates (i.e. ML estimate closest to zero). A edited version of 'data/RAxML_runs.tsv' with only the statistics for the 'best' Tree was also produced ('step4/best_RAxML_runs.tsv').
In each case, the tree used was the 'RAxML_bipartitions' version. The 'best' RAxML tree for each data set was opened in FigTree using version 1.4.4, where each was midpoint rooted and nodes set to 'descending' order. Each tree was then exported in Newick format. The folder 'data/Step4/trees' contains a reformatted 'best' tree for each of the twelve data sets (eg. 'Step4/trees/HCMC_L2_r10p.tre').
Note: SeqGen was also run on a computational cluster (MASSIVE) to allow for the output files (~1.3 GB each simulation). The processing of these output files (Command 2) were also completed on the same cluster environment.
As the amount of simulated SNPs was found to be much higher than expected (due to a poor choice of ascertainment bias correction used in RAxML), for each of the 12 datatsets an internal correction was used where branch lengths were multiplied by the expected number of SNPs (the number of SNPs in the real dataset that created the tree) divided by the observed simulated SNPs in the first run of each of the twelve populations/samplesize combinations ('E/O', see table below for values used here).
Note: as the seqgen output files are so large, these have been excluded from the repository.
The variable sites for each replicate were extracted using processSeqGen.py in the 'scripts' folder, and separated into those calls for internal nodes and tips. It produces the tips as both FASTA and SNP table output.
seqgen -m GTR -a [alpha] -f [#A],[#C],[#G],[#T] -r [A<>C],[A<>G],[A<>T],[C<>G],[C<>T],[G<>T] -of -wa -k1 [treefile]
where values for -a -f and -r are found in 'set4/best_RAxML_runs.tsv' and treefile is the 'Phylip' format input for SeqGen containing the seed sequence (H37Rv, shortened to account for filtered sites). These can be found in 'data/Step5/phy_files'. The shortened sequence length is 4,022,248 bp (4,411,532 total genome length - 389,284 excluded bases).
For HCMC_L2_r10p pre-run 1, the SeqGen command was:
seqgen -m GTR -a 2.76 -f 0.15,0.35,0.34,0.16 -r 1.06,3.29,0.35,0.54,3.11,1 -of -wa -k1 mtb_short_seq_HCMC_L2_r10p.phy
(note: number of SNPs established by running processSeqGen.py - see command 2):
seqgen -m GTR -a [alpha] -f [#A],[#C],[#G],[#T] -r [A<>C],[A<>G],[A<>T],[C<>G],[C<>T],[G<>T] -of -wa -k1 [treefile] -t [E/O]
For HCMC_L2_r10p run 1, the updated SeqGen command was:
seqgen -m GTR -a 2.76 -f 0.15,0.35,0.34,0.16 -r 1.06,3.29,0.35,0.54,3.11,1 -of -wa -k1 mtb_short_seq_HCMC_L2_r10p.phy -t 0.044
python processSeqGen.py -i [seqgen_output] -p [output_prefix]
For HCMC_L2_r10p run 1:
python processSeqGen.py -i [seqgen_output_example] -p HCMC_L2_r10p_r1
E/O corrections
replicate expected observed E/O
HCMC_L2_r10p 5561 125150 0.044
HCMC_L2_r20p 9027 188805 0.048
HCMC_L2_r50p 16824 219728 0.077
HCMC_L2_all 25547 289022 0.088
Global_L2_r10p 4303 161383 0.027
Global_L2_r20p 7956 201160 0.04
Global_L2_r50p 17971 250597 0.052
Global_L2_all 20014 256046 0.078
Global_L124_r100 12911 139064 0.093
Global_L124_r500 29442 213650 0.138
Global_L124_r1000 43038 281296 0.153
Global_L124_r2000 63451 398169 0.159
In the subfolder 'data/Step5/simulated/' there are two sub-folders, FASTA and SNPTables. The results for run 1 to 10 of HCMC_L2_r10p are supplied uncompressed in both sub-folders (sans SeqGen output as discussed above). The rest of the outputs for the 11 other datasets are either zipped into a single file (e.g. 'HCMC_L2_r20p.zip' in 'Step5/simulated/FASTA and /SNPTables') or split into separate zipped files (<100 MB for GitHub, e.g. 'Global_L124_r1000_r1_to_r3.zip' in 'Step5/simulated/FASTA'). Compressed files need be be unpacked before use.
Because I forgot to extract the SNP lists with the previous function, we first need to extract the SNP lists for all 120 replicates.
python SNPTableToList.py -i [SNPTable.csv] -p [output_prefix]
For HCMC_L2_r10p run 1:
python SNPTableToList.py -i HCMC_L2_r10p_r1_alleles.csv -p HCMC_L2_r10p_r1
Once we have the SNP list for each replicate data set, we can extract the expected homoplasic mutation events from the node and tip sequence data.
python extractHomoplasies.py -n [nodes.fasta] -t [tips.fasta] -T [tree] -l [SNP list] -p [output_prefix]
For HCMC_L2_r10p run 1:
python extractHomoplasies.py -n HCMC_L2_r10p_r1_nodes.fasta -t HCMC_L2_r10p_r1.fasta -T HCMC_L2_r10p.tre -l HCMC_L2_r10p_r1_SNPList.txt -p HCMC_L2_r10p_r1
These next two steps are run iteratively for each replicate to directly save the time and memory information for each run. Note that for the simulated data sets only 'intermediate' and 'simple' sorting were tested.
/usr/bin/time -lp snppar -g NC_00962_3_1.gbk -d [output_directory] -t [tree] -s [alleles.csv]
For HCMC_L2_r10p run 1:
/usr/bin/time -lp snppar -g NC_00962_3_1.gbk -d simulated_out/HCMC_L2_r10p/ -t HCMC_L2_r10p.tre -s simulated/HCMC_L2_r10p/HCMC_L2_r10p_r1_alleles.csv
Note: '-E S' is added for 'simple' sorting; 'intermediate' sorting is default
Example output of 'time' command for a SNPPar run:
real 10.17 <- run time (-t)
user 8.71
sys 0.74
90202112 maximum resident set size <- maximum memory use (-m)
0 average shared memory size
0 average unshared data size
0 average unshared stack size
66443 page reclaims
2426 page faults
0 swaps
0 block input operations
0 block output operations
0 messages sent
0 messages received
0 signals received
1246 voluntary context switches
11716 involuntary context switches
python compareResults_hSNP.py -e [homoplasic_mutation_events.csv] -o [homoplasic_events_all_calls.tsv] -c [SNP count] -p [population] -s [sample size] -r [run] -t [time] -m [memory] -T [tree] -S [sorting]
where -e are the homoplasic events produced by SeqGen, and -o are the homoplasic events called by SNPPar, -c is the SNP count for the replicate (alleles.csv), -p is the population (e.g. 'HCMC_L2'), -s is the actual sample size (e.g 84 for 'HCMC_L2_r10p'), -r is the replicate run (e.g. 'r1'), -t is the 'real' time from the 'time' command for the SNPPar run, -m is the maximum memory use from the same command (see example below), and -S is the type of sorting ('I' for 'intermediate' or 'S' for 'simple' - or 'C' for 'complex' with empirical data).
For HCMC_L2_r10p run 1:
python compareResults_hSNP.py -e HCMC_L2_r10p_r1_homoplasic_mutation_events.csv -o simulated_out/HCMC_L2_r10p/homoplasic_events_all_calls.tsv -c 5758 -p HCMC_L2 -s 82 -r r1 -t 10.17 -m 90202112 -T HCMC_L2_r10p.tre -S S
This script either produces a new results file, or appends to an existing file. The resulting file is included ('homoplasic_test_results.tsv'); this is the file with the results for R analysis (see below).
However, the results do need to be curated as they are collated...
Two other files are produced for each comparison, 'incorrect_homoplasic_calls.tsv' and 'incorrect_type_homoplasic_calls.tsv'
For HCMC_L2_r10p run 1, these files are:
S_HCMC_L2_82_r1_incorrect_homoplasic_calls.tsv
S_HCMC_L2_82_r1_incorrect_homoplasic_test_calls.tsv
However, as there are no errors for run 1, these are empty files.
Examples of errors for Global_L124 r1000 run8 ('intermediate' sorting) include:
I_Global_L124_1000_r8_incorrect_homoplasic_calls.tsv
False Negative
2881482 N30 N31 A G
2881482 N31 N32 G A
I_Global_L124_1000_r8_incorrect_homoplasic_test_calls.tsv
expected 149521 Y R
observed 149521 Y P
expected 170130 Y R
observed 170130 Y P
expected 881703 N P
observed 881703 N PP
expected 2316301 N P
observed 2316301 N PP
The incorrect homoplasy call reported (position 2881482) is an example of a homoplasic event that was missed by SNPPar (i.e. false-negative) - here, ASR will have assigned a single event (A->G) to the sister branch of N31-N32 rather than the expected reversion.
The first two type call errors (149521 and 170130) are examples of incorrect type calls at the root node of the tree, both where a reversion is expected, but SNPPar called a parallel event.
The last two 'errors' (881703 and 2316301) are not actual errors, but are called when the mutation event ocurs in overlapping genes (SNPPar reports a mutation event for both genes, whilst the expected results do not include gene information, so only expected once). Note that as SNPs involving more than two events are more difficult to analyse, these are reported by compareResults_hSNP.py at the command line for immediate checking by the user.
The twelve empirical datasets used to generate the phylogenetic trees in Step 4 above were also subject to performance analysis for computational resources. As these datasets include missing calls, all three sorting options were tested, 'intermediate' (default), 'simple' ('-E S' option in SNPPar) and 'complex' ('-E C' option).
Otherwise, the command used for the SNPPar test runs was exactly the same as for the simulated datasets, i.e.:
/usr/bin/time -lp snppar -g NC_00962_3_1.gbk -d [output_directory] -t [tree] -s [alleles.csv]
For HCMC_L2_r10p:
/usr/bin/time -lp snppar -g NC_00962_3_1.gbk -d real_out/HCMC_L2_r10p -t HCMC_L2_r10p.tre -s HCMC_L2_r10p_alleles_82strains_var_rF_c95.csv
Results for all runs are in the folder 'Step8/SNPPar_output_real'
As there were only twelve runs (by three sorting options), output data (specifically time and memory use) was collated by hand rather than code. The results file produced during testing is 'step8/Performance.txt'.
The results files generated in Steps 7 and 8 ('homoplasic_test_results.tsv' and 'Performance.txt' respectively) were then used to analyse the results using R version 3.5.1 in RStudio version 1.1.383. Both the markdown script and resulting output (in HTML format) are provided here.
SNPPar_performance.Rmd
SNPPar_performance.html
Note: if you want to run the RMD script, make sure to change the paths to the two files where appropriate.
Perrin A, Larsonneur E, Nicholson AC, Edwards DJ, Gundlach KM, Whitney AM, Gulvik CA, Bell ME, Rendueles O, Cury J, Hugon P, Clermont D, Enouf V, Loparev V, Juieng P, Monson T, Warshauer D, Elbadawi LI, Spalding Walters M, Crist MB, Noble-Wang J, Borlaug G, Rocha EPC, Criscuolo A, Touchon M, Davis JP, Holt KE, McQuiston JR, Brisse S. Evolutionary dynamics and genomic features of the Elizabethkingia anophelis 2015 to 2016 Wisconsin outbreak strain. Nat Commun 2017;8:15483 doi:10.1038/ncomms15483
CP014805v2.gbk
CP014805v2_CP014805_alleles_1outgroup_69strains_var_regionFiltered_cons0.95_var.csv
final_raxml_tree_no_recomb.tree
snppar -s CP014805v2_CP014805_alleles_1outgroup_69strains_var_regionFiltered_cons0.95_var.csv -t final_raxml_tree_no_recomb.tree -g CP014805v2.gbk -d snppar_output
The 'all_mutation_events.tsv' output file was then used to get the mutation event counts for all genes using 'countMEbyGene.py'.
python countMEbyGene.py -i snppar_output/all_mutation_events.tsv -p Elizabethkingia_all_
output: Elizabethkingia_all_ME_counts_by_gene.txt
As the isolates were named by readset and not CDC identifier as per the paper, the names were converted in the NEXUS format tree using 'isolate_list.txt'.
Final tree: elizabethkingia_node_labelled_nexus_CSID.tre
The file 'homoplasic_mutation_events.tsv' and the final tree were then used to map information back on to the tree using FigTree using version 1.4.4.
Lieberman TD, Michel JB, Aingaran M, Potter-Bynoe G, Roux D, Davis MR Jr, Skurnik D, Leiby N, LiPuma JJ, Goldberg JB, McAdam AJ, Priebe GP, Kishony R. Parallel bacterial evolution within multiple patients identifies candidate pathogenicity genes. Nat Genet 2011;43:1275-1280 doi:10.1038/ng.997
AU0158_ch1.gb
AU0158_ch2.gb
AU0158_ch3.gb
NIHMS335194-supplement-2-alleles_chr1.csv
NIHMS335194-supplement-2-alleles_chr1.csv
NIHMS335194-supplement-2-alleles_chr1.csv
lieberman2011natgen_ss_root.newick
snppar -s NIHMS335194-supplement-2-alleles_chr1.csv -t lieberman2011natgen_ss_root.newick -g AU0158_ch1.gb -d snppar_output -p chr1_
snppar -s NIHMS335194-supplement-2-alleles_chr2.csv -t lieberman2011natgen_ss_root.newick -g AU0158_ch2.gb -d snppar_output -p chr2_
snppar -s NIHMS335194-supplement-2-alleles_chr3.csv -t lieberman2011natgen_ss_root.newick -g AU0158_ch3.gb -d snppar_output -p chr3_
The three 'all_mutation_events.tsv' output file were then used to get the mutation event counts for all genes using 'countMEbyGene.py'.
python countMEbyGene.py -i snppar_output/chr1_all_mutation_events.tsv -p chr1_all_
python countMEbyGene.py -i snppar_output/chr2_all_mutation_events.tsv -p chr2_all_
python countMEbyGene.py -i snppar_output/chr3_all_mutation_events.tsv -p chr3_all_
outputs: (chr1_/chr2_/chr3_)all_ME_counts_by_gene.txt
These were then combined to get the 'all_ME_counts_by_gene.txt' which has been sorted by number of mutation events per gene.
The three 'homoplasic_mutation_events' files were then used to map information back on to the 'chr1_node_labelled_nexus.tre' tree using FigTree using version 1.4.4.
Holt KE, McAdam P, Thai PVK, Thuong NTT, Ha DTM, Lan NN, Lan NH, Nhu NTQ, Hai HT, Ha VTN, Thwaites G, Edwards DJ, Nath AP, Pham K, Ascher DB, Farrar J, Khor CC, Teo YY, Inouye M, Caws M, Dunstan SJ. Frequent transmission of the Mycobacterium tuberculosis Beijing lineage and positive selection for the EsxW Beijing variant in Vietnam. Nat Genet 2018;50:849-856 doi:10.1038/s41588-018-0117-9
NC_00962_3_1.gbk
Global_L124_r2000_out_NC_000962_alleles_2001strains_regionFiltered_cons0.995_var_hardFiltered.csv
Note: this needs decompressing before use. Also, all SNPs that occur in any genes either partially or completely overlapping with excluded regions have also been excluded from the SNP table ('hard filter' option in 'parseSNPTable3.py')
Mtb_GL124_2000_out_c995_root.tre
snppar -s Global_L124_r2000_out_NC_000962_alleles_2001strains_regionFiltered_cons0.995_var_hardFiltered.csv -g NC_00962_3_1.gbk -t Mtb_GL124_2000_out_c995_root.tre -d snppar_output -p Global_L124_all_r2000_out
Prior to further analysis, the complete mutation event list was filtered to remove any events that involved the root node (via sorting in Excel). The resulting mutation events file ('Global_L124_all_r2000_out_all_mutation_events_no_root.tsv') was then passed bck through SNPPar to obtain the homoplasic events within Lineages 1, 2 and 4.
snppar -M Global_L124_all_r2000_out_all_mutation_events_no_root.tsv -t Mtb_GL124_2000_out_c995_root.tre -d snppar_output/no_root_run -p run2_ -P -C -R
Note that we also identified the types of homoplasies('-P', parallel; '-C', convergent; and '-R', revertant options).
The final 'all_mutation_events' and 'homoplasic_mutation_events' were then used to get the mutation event counts for all genes using 'countMEbyGene.py'.
python countMEbyGene.py -i Global_L124_all_r2000_out_all_mutation_events_no_root.tsv -p all
and
python countMEbyGene.py -i run2_homoplasic_events_all_calls.tsv -p h
outputs: 'allME_counts_by_gene.txt' and 'hME_counts_by_gene.txt' respectively
These were then combined with list of genes with no overlap with filtered regions ('zero_gene_tag_list.txt' extracted from 'Global_L124_r2000_out_NC_000962_alleles_2001strains_regionFiltered_cons0.995_var_hardFiltered.txt' via Excel) in analysis using R version 3.5.1 in RStudio version 1.1.383. Both the markdown script and resulting output (in HTML format) are provided here.
GL124_r2000_out_analysis.Rmd
GL124_r2000_out_analysis.html
The file 'Global_L124_all_r2000_out_homoplasic_events_all_calls.tsv' and the tree 'Global_L124_all_r2000_out_node_labelled_nexus.tre' were then used to map information about the katG gene back on to the tree using FigTree using version 1.4.4.