Subcommand: edgepca
Perform Edge PCA (Principal Component Analysis) for a set of samples.
Usage: gappa analyze edgepca [options]
Input | |
---|---|
--jplace-path | Required. TEXT:PATH(existing)=[] ... List of jplace files or directories to process. For directories, only files with the extension .jplace[.gz] are processed. |
Settings | |
--kappa | FLOAT=1 Exponent for scaling between weighted and unweighted splitification. |
--epsilon | FLOAT=1e-05 Epsilon to use to determine if a split matrix’s column is constant for filtering. Set to a negative value to deactivate constant column filtering. |
--components | UINT=5 Number of principal coordinates to calculate. Use 0 to calculate all possible coordinates. |
--point-mass | FLAG Treat every pquery as a point mass concentrated on the highest-weight placement. In other words, ignore all but the most likely placement location (the one with the highest LWR), and set its LWR to 1.0. |
--ignore-multiplicities | FLAG Set the multiplicity of each pquery to 1.0. The multiplicity is the equivalent of abundance for placements, and is hence ignored with this flag. |
Color | |
--color-list | TEXT=spectral List of colors to use for the palette. Can either be the name of a color list, a file containing one color per line, or an actual comma-separated list of colors. Colors can be specified in the format #rrggbb using hex values, or by web color names. |
--reverse-color-list | FLAG If set, the order of colors of the --color-list is reversed. |
--mask-color | TEXT=#dfdfdf Color used to indicate masked or invalid values, such as infinities or NaNs. Color can be specified in the format #rrggbb using hex values, or by web color names. |
Output | |
--out-dir | TEXT=. Directory to write output files to. |
--file-prefix | TEXT File prefix for output files. Most gappa commands use the command name as the base name for file output. This option amends the base name, to distinguish runs with different data. |
--file-suffix | TEXT File suffix for output files. Most gappa commands use the command name as the base name for file output. This option amends the base name, to distinguish runs with different data. |
Tree Output | |
--write-newick-tree | FLAG If set, the tree is written to a Newick file. This format cannot store color information. |
--write-nexus-tree | FLAG If set, the tree is written to a Nexus file. This can for example be opened in FigTree. |
--write-phyloxml-tree | FLAG If set, the tree is written to a Phyloxml file. This can for example be used in Archaeopteryx. |
--write-svg-tree | FLAG If set, the tree is written to an SVG file. This gives a file for vector graphics editors. |
Newick Tree Output | |
--newick-tree-branch-length-precision | INT=6 Needs: --write-newick-tree Number of digits to print for branch lengths in Newick format. |
--newick-tree-quote-invalid-chars | FLAG Needs: --write-newick-tree If set, node labels that contain characters that are invalid in the Newick format (i.e., spaces and :;()[],{} ) are put into quotation marks. If not set (default), these characters are instead replaced by underscores, which changes the names, but works better with most downstream tools. |
Svg Tree Output | |
--svg-tree-shape | TEXT:{circular,rectangular}=circular Needs: --write-svg-tree Shape of the tree. |
--svg-tree-type | TEXT:{cladogram,phylogram}=cladogram Needs: --write-svg-tree Type of the tree, either using branch lengths (phylogram), or not (cladogram). |
--svg-tree-stroke-width | FLOAT=5 Needs: --write-svg-tree Svg stroke width for the branches of the tree. |
--svg-tree-ladderize | FLAG Needs: --write-svg-tree If set, the tree is ladderized. |
Global Options | |
--allow-file-overwriting | FLAG Allow overwriting existing output files instead of aborting the command. |
--verbose | FLAG Produce more verbose output. |
--threads | UINT Number of threads to use for calculations. |
--log-file | TEXT Write all output to a log file, in addition to standard output to the terminal. |
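As an illustration of how the options above combine on the command line, the following Python sketch assembles an argument list for the command. The helper `build_edgepca_command` and the example paths are hypothetical and not part of gappa; only flags listed in the tables above are used.

```python
import shlex


def build_edgepca_command(jplace_paths, out_dir=".", components=5,
                          kappa=1.0, write_svg_tree=False, threads=None):
    """Assemble an argument list for `gappa analyze edgepca`.

    Hypothetical helper: it only strings together flags that appear in
    the option tables, and does not validate them the way gappa does.
    """
    if not jplace_paths:
        raise ValueError("--jplace-path is required")
    cmd = ["gappa", "analyze", "edgepca", "--jplace-path", *jplace_paths]
    cmd += ["--out-dir", out_dir]
    cmd += ["--components", str(components)]
    cmd += ["--kappa", str(kappa)]
    if write_svg_tree:
        cmd.append("--write-svg-tree")
    if threads is not None:
        cmd += ["--threads", str(threads)]
    return cmd


# Example paths are made up for illustration.
cmd = build_edgepca_command(["samples/"], out_dir="epca_out", write_svg_tree=True)
print(shlex.join(cmd))

# To actually run it (requires gappa to be installed and on the PATH):
# import subprocess
# subprocess.run(cmd, check=True)
```

Building the command as a list rather than a single string avoids quoting problems when sample paths contain spaces.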
Performs Edge PCA. The command is a re-implementation of guppy epca; see its documentation for more details.
Edge PCA is an analysis method for phylogenetic placement data that reveals consistent differences between samples (jplace files). It uses the imbalance of placements across the edges of the tree, which makes it possible to find differences between placements that may be close together in the tree.
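To make the notion of edge imbalance more concrete, here is a minimal Python sketch on a toy tree. It is not gappa's implementation: the tree encoding, the sign convention (mass below the edge minus mass on the rest of the tree), the exclusion of mass sitting on the edge itself, and the form of the kappa scaling (sign(x) * |x|^kappa) are all assumptions made for illustration.

```python
def edge_imbalances(parent, mass, kappa=1.0):
    """Return {edge: imbalance}, where each edge is named by its child node.

    parent: dict child -> parent (the root has no entry).
    mass:   dict child -> placement mass on the edge above that child.
    """
    total = sum(mass.values())
    # Accumulate the mass strictly below each edge by pushing every
    # edge's mass up through its ancestors.
    below = {node: 0.0 for node in parent}
    for node, m in mass.items():
        anc = parent.get(node)
        while anc in parent:  # stop at the root, which has no edge above it
            below[anc] += m
            anc = parent[anc]
    result = {}
    for node in parent:
        distal = below[node]
        proximal = total - distal - mass.get(node, 0.0)
        diff = distal - proximal
        # Assumed kappa scaling: sign(x) * |x|^kappa.
        sign = (diff > 0) - (diff < 0)
        result[node] = sign * abs(diff) ** kappa
    return result


# Toy tree: root R with children I and C; inner node I has leaves A and B.
toy_parent = {"A": "I", "B": "I", "I": "R", "C": "R"}
toy_mass = {"A": 0.5, "B": 0.25, "C": 0.25}  # made-up masses for one sample
print(edge_imbalances(toy_parent, toy_mass))
```

In this sketch, leaf edges never have any mass below them, so their value only reflects the mass elsewhere in the tree; the inner edge I is the one that separates the two sides of the tree.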
Similar to guppy, the command produces two tables that contain the result of the analysis. The projection.csv table contains the jplace samples projected into principal coordinate space, and the transformation.csv table lists the top eigenvalues (first column) and their corresponding eigenvectors (remaining columns).
Furthermore, for post-processing convenience, we also split the transformation table into two files. The eigenvalues.csv table contains just the eigenvalues, while the eigenvectors.csv table contains the eigenvectors across all edges that were used in the PCA computation.
The correspondence of eigenvectors to edges is a bit tricky: Only the inner edges of the tree (the ones not leading to leaf nodes) have a meaningful edge imbalance value (which is the value used for computing the PCA). Furthermore, some inner edges might have a constant imbalance value, for instance, if no sequences had any placement stored in the outer branches of an edge. In that case, we filter out these edges before computing the PCA as well, as they are not meaningful and might lead to numerical issues if retained.
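The filtering of constant edges described above can be sketched as follows. This is an illustration only: the criterion (a column is treated as constant if its maximum and minimum differ by at most epsilon) is an assumption about how the --epsilon threshold is applied, and the function name is hypothetical.

```python
def filter_constant_columns(matrix, epsilon=1e-5):
    """Drop columns whose value range (max - min) is at most epsilon.

    matrix: list of rows, one row per sample and one column per edge.
    Returns (filtered_matrix, kept_column_indices).
    A negative epsilon disables the filtering, mirroring --epsilon.
    """
    if not matrix:
        return [], []
    n_cols = len(matrix[0])
    if epsilon < 0:
        kept = list(range(n_cols))
    else:
        kept = []
        for j in range(n_cols):
            column = [row[j] for row in matrix]
            if max(column) - min(column) > epsilon:
                kept.append(j)
    filtered = [[row[j] for j in kept] for row in matrix]
    return filtered, kept


# Made-up imbalance values: 3 samples, 3 inner edges.
imbalances = [
    [0.1, 0.5, -0.2],
    [0.1, 0.7, -0.2],
    [0.1, 0.6, -0.2],
]
filtered, kept = filter_constant_columns(imbalances)
print(kept)
```

Keeping track of the surviving column indices is what later allows the eigenvector components to be mapped back to edges of the tree.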
Hence, establishing the correspondence between the (inner, non-constant imbalance) edges and the eigenvector components of the PCA requires some extra work. The edge_indices.newick file contains an annotated tree with inner nodes labeled according to the edge index. This edge index is the first column in eigenvectors.csv, making it possible to link the two. Note that we label the nodes in that file, and not the edges, as the Newick file format does not support the latter; see here for this shortcoming of the Newick file format, and resulting issues. You can for example use Dendroscope to examine the Newick file, or process it programmatically.
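As one possible programmatic route, the sketch below reads an eigenvectors table into a mapping from edge index to components and ranks edges by their loading on one component. The comma delimiter, the handling of a possible header line, and the example values are assumptions; only the fact that the first column holds the edge index is taken from the description above.

```python
import csv
import io


def read_eigenvectors(text, delimiter=","):
    """Map edge index -> list of eigenvector components.

    Assumes the first column holds the edge index and the remaining
    columns hold one value per principal component. Rows whose first
    field is not an integer (such as a possible header) are skipped.
    """
    table = {}
    for row in csv.reader(io.StringIO(text), delimiter=delimiter):
        if not row:
            continue
        try:
            edge_index = int(row[0])
        except ValueError:
            continue
        table[edge_index] = [float(value) for value in row[1:]]
    return table


def top_edges(table, component=0, n=3):
    """Edge indices with the largest absolute loading on one component."""
    ranked = sorted(table, key=lambda e: abs(table[e][component]), reverse=True)
    return ranked[:n]


# Made-up content standing in for eigenvectors.csv.
example = "3,0.5,-0.1\n7,-0.9,0.2\n12,0.1,0.8\n"
vectors = read_eigenvectors(example)
print(top_edges(vectors, component=0, n=2))
```

The returned edge indices can then be looked up among the inner node labels of edge_indices.newick.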
Furthermore, we produce a separate Newick file for each PCA component, named eigenvector_*.newick. These are in NHX format, and annotate the components of the eigenvectors onto the edges, using 0.0 for leaf edges and for those that were filtered out before the PCA. They can for example be displayed by iTOL; search for NHX in the iTOL help to see how those values can be displayed.
The principal components projection of the samples can be plotted and, for example, colored according to some per-sample metadata feature, in order to reveal dependencies between the placements of a sample and its metadata.
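A minimal sketch of the data preparation for such a plot is given below. It groups the projected samples by a metadata label, so that each group can be drawn in its own color with the plotting tool of your choice. The row layout (sample name followed by one coordinate per component), the comma delimiter, and the metadata dictionary are assumptions for illustration.

```python
import csv
import io
from collections import defaultdict


def group_projection(projection_text, metadata, x=0, y=1, delimiter=","):
    """Group (x, y) projection coordinates by a per-sample metadata label.

    Assumes each row is: sample name, then one coordinate per component.
    metadata: dict sample name -> category label (hypothetical input).
    """
    groups = defaultdict(list)
    for row in csv.reader(io.StringIO(projection_text), delimiter=delimiter):
        if len(row) < 2:
            continue
        try:
            coords = [float(value) for value in row[1:]]
        except ValueError:
            continue  # skip a possible header line
        if len(coords) <= max(x, y):
            continue  # not enough components in this row
        label = metadata.get(row[0], "unknown")
        groups[label].append((coords[x], coords[y]))
    return dict(groups)


# Made-up content standing in for projection.csv, plus made-up metadata.
projection = "s1,0.1,0.2\ns2,-0.3,0.4\ns3,0.5,-0.6\n"
sample_site = {"s1": "gut", "s2": "skin", "s3": "gut"}
for label, points in group_projection(projection, sample_site).items():
    print(label, points)
```

Each group of points can then be passed to a scatter plot call with a distinct color per label.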
Furthermore, if the --write-...-tree options are used, the principal components are visualized on the tree.
These trees help to interpret how the projection plot separates samples; that is, they show which edges contribute most to distinguishing samples from each other. The trees can also be re-created from the annotated eigenvector_*.newick files, so that some other form of visualization can be used instead of colors.
When using this method, please do not forget to cite
Lucas Czech, Pierre Barbera, Alexandros Stamatakis. Genesis and Gappa: Processing, Analyzing and Visualizing Phylogenetic (Placement) Data. Bioinformatics, 2020. doi:10.1093/bioinformatics/btaa070
Frederick Matsen, Steven Evans. Edge Principal Components and Squash Clustering: Using the Special Structure of Phylogenetic Placement Data for Sample Comparison. PLOS ONE, 2013. doi:10.1371/journal.pone.0056859