Quantifying the Completeness of Multiple Sequence Alignments used in Phylogenetic and Phylogenomic Research
The software is written in C++, and has been tested under the Linux and MacOS platforms. You need to have C++ compiler installed in your computer to compile the source code. The compilation steps are shown as follows:
$ tar -zxvf AliStat_1.xx.tar.gz
$ cd AliStat_1.xx
$ make
Then an executable file named alistat will appear.
Consider a multiple sequence alignment with m sequences and n sites. Given this alignment, we may compute:
- Ca - Completeness score for the alignment
Ca = Number of unambiguous characters in the alignment / (m x n)
- Cr - Completeness score for individual sequences
Cr = Number of unambiguous characters in a row (sequence) of the alignment / n
- Cc - Completeness score for individual sites
Cc = Number of unambiguous characters in a column (site) of the alignment / m
- Cij - Completeness score for pairs of sequences
Cij = Number of columns with homologous pairs of unambiguous characters in sequences i and j / n
By definition, Cii = 1.
- Iij - Incompleteness score for pairs of sequences
Iij = 1 - Cij
- Pij - P-distance for pairs of sequences.
Let Wij be the set of columns with unambiguous homologous characters of sequences i and j.
Pij = (Number of sites with differences between sequences i and j across the columns in Wij) / |Wij|
Pij is undefined if |Wij| = 0, and Pii = 0.
Syntax:
./alistat <alignment file> <data type> [other options]
./alistat -h
<alignment file> : Multiple alignment file in FASTA format
<data type> : 1 - Single nucleotides (SN);
2 - Di-nucleotides (DN);
3 - Codons (CD);
4 - 10-state genotype data (10GT);
5 - 14-state genotype data (14GT);
6 - Amino acids (AA);
7 - Mixture of nucleotides and amino acids (NA)
(User has to specify the data type for each partition
inside the partition file)
other options:
-b : Report the brief summary of figures to the screen
Output format:
File,#seqs,#sites,Ca,Cr_max,Cr_min,Cc_max,Cc_min,Cij_max,Cij_min
(this option cannot work with the options: -o,-t,-r,-m,-i,-d)
-c <coding_type> : 0 - A, C, G, T [default];
1 - C, T, R (i.e. C, T, AG);
2 - A, G, Y (i.e. A, G, CT);
3 - A, T, S (i.e. A, T, CG);
4 - C, G, W (i.e. C, G, AT);
5 - A, C, K (i.e. A, C, GT);
6 - G, T, M (i.e. G, T, AC);
7 - K, M (i.e. GT, AC);
8 - S, W (i.e. GC, AT);
9 - R, Y (i.e. AG, CT);
10 - A, B (i.e. A, CGT);
11 - C, D (i.e. C, AGT);
12 - G, H (i.e. G, ACT);
13 - T, V (i.e. T, ACG);
(this option is only valid for <data type> = 1)
-o <FILE> : Prefix for output files
(default: <alignment file> w/o .ext)
-n <FILE> : Only consider sequences with names listed in FILE
-p <FILE> : Specify the partitions
For <data type> = 1 - 6, partition file format:
"<partition name 1>=<start pos>-<end pos>, ..."
"<partition name 2>= ..."
Example:
part1=1-50,60-100
part2=101-200
(enumeration starts with 1)
For <data type> = 7, partition file format:
"<SN/DN/CD/10GT/14GT/AA>, <partition name 1>=<start pos>-<end pos>, ..."
"<SN/DN/CD/10GT/14GT/AA>, <partition name 2>= ..."
Example:
SN,part1=1-50,60-100
AA,part2=101-200
CD,part3=201-231
(this option cannot be used with '-s' at the same time)
-s <n1,n2> : Sliding window analysis: window size = n1; step size = n2
(this option cannot be used with '-p' at the same time)
-t <n3,n4,...> : Only output the tables n3, n4, ...
1 - C scores for individual sequences (Cr)
2 - C scores for individual sites (Cc)
3 - Distribution of C scores for individual sites (Cc)
4 - Matrix with C scores for pairs of sequences (Cij)
5 - Matrix with incompleteness scores for pairs of
sequences (Iij = 1 - Cij)
6 - Table with C score and incompleteness scores for pairs
of sequences (Cij & Iij)
(default: the program does not output any tables)
If "-t" option is used but no <n3,n4,...>, then the program
outputs all tables
-r <row|col|both>: Reorder the rows/columns (or both) of the alignment
according to the Cr/Cc scores
All the tables are displayed according to the reordered
alignment
To output the reordered alignment, please also use the
option -m
-m <n5> : Mask the alignment; 0 <= n5 <= 1
Output (1) the alignment with columns Cc >= n5 in the file
'Mask.fst', (2) the alignment with columns Cc < n5 in the
file 'Disc.fst', and (3) the alignment with an extra row in
the first line to indicate whether the column is masked in
the file 'Stat.fst'.
(Special case: if no <n5>, whole alignment is outputted in
the file 'Mask.fst'; if <n5> is 0, the alignment with
columns Cc > 0 is outputted in the file 'Mask.fst')
-i <1|2|3> : Generate heat map image for Cij scores of sequence pairs
1 - Triangular heat map
2 - Rectangular heat map
3 - Both
(if no number, then both triangular & rectangular heat map
files are outputted)
-d : Report the p distances between sequences
in the file with extension '.p-dist.csv'
(default: disabled)
Note: computation of p distances may take long time for
large number of sequences
-u : Color scheme of the heatmaps
1 - Default color scheme, suitable for color-blind persons
2 - Another color scheme
-h : This help page
Please refer to the enclosed user manual for detailed information.
Seven types of sequence data are considered: single nucleotides, di-nucleotides, codons, 10-state genotypes, 14-state genotypes, amino acids, and a mix of single nucleotides and amino acids.
The single nucleotides, di-nucleotides, codons, and mixture of nucleotides and amino acids use the nomenclature outlined above.
The 10- and 14-state genotype data are experimental in the sense that they have not yet been recognised by the wider scientific community as sound sources of genetic information pertaining to genomes of diploid and triploid species. Simulation-based research is currently under way to examine the usefulness of these types of data. How these types of data are generated may be revealed in due course.
When the sequences are regarded as
- single nucleotides, the data represent a 4-state alphabet (i.e., it comprises four letters: A, C, G, and T).
- di-nucleotides, the data represent a 16-state alphabet (i.e., it comprises 16 pairs of letters: AA, …, TT).
- codons, the data represent a 64-state alphabet (i.e., it comprises 64 triplets of letters: AAA, …, TTT).
- amino acids, the data represent a 20-state alphabet (i.e., it comprises 20 letters: A, …, Y).
- 10-state genotype data, the data represent a 10-state alphabet: A, C, G, T, R, Y, K, M, S, and W.
- 14-state genotype data, the data represent a 14-state alphabet: A, C, G, T, R, Y, K, M, S, W, B, D, H, and V.
- Dr. Lars Jermin (Email: lars.jermiin@anu.edu.au)
- Dr. Thomas Wong (Email: thomas.wong@anu.edu.au)
Thomas K F Wong, Subha Kalyaanamoorthy, Karen Meusemann, David K Yeates, Bernhard Misof, Lars S Jermiin, A minimum reporting standard for multiple sequence alignments, NAR Genomics and Bioinformatics, Volume 2, Issue 2, June 2020, lqaa024