D) hogwash inputs
The phenotype data object must be a matrix with exactly one column, which contains the phenotype data. The rows correspond to samples and should be ordered to match the tips of the phylogenetic tree. The matrix should have both row names and column names, and the row names must exactly match the tree’s tip labels. The phenotype can be either binary (0/1) or continuous. At this time hogwash does not support categorical phenotypes with more than two levels (e.g., ‘A’, ‘B’, and ‘C’).
Discrete phenotype:
| | Antibiotic_resistance |
---|---|
sample_1 | 0 |
sample_2 | 0 |
sample_3 | 1 |
sample_4 | 1 |
Continuous phenotype:
| | Toxin_production |
---|---|
sample_1 | 0.10 |
sample_2 | 1.20 |
sample_3 | 0.05 |
sample_4 | 2.70 |
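
The phenotype rules above can be checked before the data is handed to hogwash. The following is a minimal sketch in Python, not part of hogwash (which is an R package); the function name `classify_phenotype` and its list-based arguments are hypothetical and chosen only for illustration.

```python
def classify_phenotype(row_names, values, tip_labels):
    """Return "binary" or "continuous" for a one-column phenotype.

    All three arguments are plain lists. This helper is hypothetical and
    only mirrors the rules described in the text; it is not a hogwash function.
    """
    # Row names must equal the tip labels, in the same order.
    if list(row_names) != list(tip_labels):
        raise ValueError("row names must match the tree tip labels, in order")
    # One column means exactly one value per sample.
    if len(values) != len(row_names):
        raise ValueError("expected exactly one phenotype value per sample")
    if all(v in (0, 1) for v in values):
        return "binary"
    if all(isinstance(v, (int, float)) for v in values):
        return "continuous"
    # Anything else (for example 'A', 'B', 'C') is not supported.
    raise ValueError("phenotype must be binary (0/1) or continuous")


tips = ["sample_1", "sample_2", "sample_3", "sample_4"]
resistance = classify_phenotype(tips, [0, 0, 1, 1], tips)
toxin = classify_phenotype(tips, [0.10, 1.20, 0.05, 2.70], tips)
print(resistance, toxin)
```

A phenotype coded as letters raises a `ValueError` in this sketch, which matches the restriction on multi-level categorical phenotypes.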
The genotype data object must be a matrix. The rows correspond to samples and should be ordered to match the tips of the phylogenetic tree. The columns correspond to individual genotypes. The matrix should have both row names and column names, and the row names must exactly match the tree’s tip labels. Genotypes can be SNPs (core genome), genes (accessory genome), or other types (indels, pathways, etc.). Genotypes must be coded in binary (0/1).
Genotype:
| | SNP_1 | SNP_2 | SNP_3 | SNP_4 | SNP_5 |
---|---|---|---|---|---|
sample_1 | 0 | 1 | 1 | 0 | 0 |
sample_2 | 0 | 0 | 0 | 1 | 1 |
sample_3 | 1 | 0 | 0 | 1 | 0 |
sample_4 | 1 | 1 | 1 | 0 | 1 |
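
The genotype requirements can be checked in the same spirit. This is a rough Python sketch under the assumption that the matrix has been loaded as a list of rows; `check_genotype` is a hypothetical helper, not a hogwash function.

```python
def check_genotype(row_names, col_names, rows, tip_labels):
    """Return a list of human-readable problems; an empty list means none found."""
    problems = []
    # Rows must be samples in tip-label order.
    if list(row_names) != list(tip_labels):
        problems.append("row names do not match the tip labels in order")
    for name, row in zip(row_names, rows):
        # Every sample needs one value per genotype column.
        if len(row) != len(col_names):
            problems.append(
                f"{name}: expected {len(col_names)} values, found {len(row)}"
            )
        # Genotypes must be coded 0/1.
        if any(v not in (0, 1) for v in row):
            problems.append(f"{name}: genotypes must be coded 0/1")
    return problems


samples = ["sample_1", "sample_2", "sample_3", "sample_4"]
snps = ["SNP_1", "SNP_2", "SNP_3", "SNP_4", "SNP_5"]
genotype = [
    [0, 1, 1, 0, 0],
    [0, 0, 0, 1, 1],
    [1, 0, 0, 1, 0],
    [1, 1, 1, 0, 1],
]
geno_problems = check_genotype(samples, snps, genotype, samples)
print(geno_problems)
```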
The phylogenetic tree should be rooted. If the tree is not rooted, do one of the following:
- root the tree on an outgroup and then remove the outgroup from the tree, phenotype, and genotype (for example, if tip t4 is the outgroup, remove t4 from all three)
- use the midpoint rooting method and reorder your phenotype and genotype to match the new tip order
- supply the unrooted tree to hogwash, and the function will midpoint root it automatically.
The tree must be fully bifurcating. I recommend building your phylogenetic tree with an outgroup, rooting it on the outgroup, and then removing the outgroup prior to running hogwash.
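
Whether a tree is fully bifurcating can be checked directly on its Newick string. The sketch below is a simplified Python illustration, not how hogwash performs the check: it assumes tip and node labels contain no quoted parentheses or commas and that the string has no bracketed comments. The function name `is_fully_bifurcating` is hypothetical.

```python
def is_fully_bifurcating(newick):
    """Return True if every internal node has exactly two children.

    Simplifying assumption: labels contain no parentheses or commas.
    """
    child_counts = []  # one child counter per currently open "("
    for ch in newick:
        if ch == "(":
            child_counts.append(1)  # an opened node has at least one child
        elif ch == ",":
            child_counts[-1] += 1  # each comma adds a sibling
        elif ch == ")":
            if child_counts.pop() != 2:
                return False
    return True


rooted_binary = "((t1:0.1,t2:0.2):0.3,(t3:0.1,t4:0.4):0.2);"
with_polytomy = "((t1:0.1,t2:0.2,t3:0.1):0.3,t4:0.4);"
print(is_fully_bifurcating(rooted_binary), is_fully_bifurcating(with_polytomy))
```

Unrooted trees are commonly written with three children at the top level, so a check like this would also flag them.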
hogwash allows the user to create ancestral reconstructions for individual genotypes and then condense them into meaningful groups.
Requiring that an individual SNP occur in multiple lineages may be too stringent. If instead all relevant SNPs from a gene are grouped together, the power to identify convergent evolution increases, because grouping can capture larger trends in functional impact at the gene level and reduces the multiple testing correction burden. Typical use cases are grouping SNPs into genes or genes into pathways.
The required structure of the grouping genotypes key data object is a matrix. Each row corresponds to a genotype. The first column must have the name of a genotype included in the genotype matrix. The second column must have a name for a group to which the item in the first column belongs. Row names are not required. The column names are used in output plots and therefore must be included.
SNP | GROUP |
---|---|
SNP_1 | GENE_A |
SNP_1 | PATH_A |
SNP_2 | GENE_A |
SNP_3 | GENE_B |
SNP_4 | GENE_C |
SNP_5 | GENE_A |
SNP_6 | GENE_D |
SNP_6 | PATH_A |
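
To make the key structure concrete, the sketch below turns the two-column key into a mapping from each group to its member genotypes. It is an illustrative Python sketch, not hogwash code; `build_groups` is a hypothetical name. Because the example key lists SNP_6, which is not a column in the example genotype matrix, the sketch reports such entries separately rather than guessing how hogwash treats them.

```python
from collections import defaultdict


def build_groups(key, genotype_names):
    """Map each group to its member genotypes.

    key: list of (genotype, group) pairs, one per row of the key.
    Returns (groups, unknown), where unknown lists key genotypes that are
    missing from the genotype matrix.
    """
    known = set(genotype_names)
    groups = defaultdict(list)
    unknown = set()
    for genotype, group in key:
        if genotype in known:
            groups[group].append(genotype)
        else:
            unknown.add(genotype)
    return dict(groups), sorted(unknown)


key = [
    ("SNP_1", "GENE_A"), ("SNP_1", "PATH_A"), ("SNP_2", "GENE_A"),
    ("SNP_3", "GENE_B"), ("SNP_4", "GENE_C"), ("SNP_5", "GENE_A"),
    ("SNP_6", "GENE_D"), ("SNP_6", "PATH_A"),
]
matrix_columns = ["SNP_1", "SNP_2", "SNP_3", "SNP_4", "SNP_5"]
groups, unknown = build_groups(key, matrix_columns)
print(groups)
print(unknown)
```

Note that a genotype may appear in more than one row of the key, so SNP_1 ends up in both GENE_A and PATH_A.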
The user can select either "post-ar" or "pre-ar" as the preferred grouping method. The default is "post-ar". For more, please revisit the grouping section.
Number of permutations: the default value is 10,000.
False discovery rate (FDR): the default value is 0.15.
The user can choose to plot the tree as a "phylogram" (right-facing, square tree; default) or a "fan" (circular). Note that the phylogram is plotted without tree edge lengths, whereas the fan is plotted with them.
The default bootstrap confidence threshold is 0.70, based on the value used in Farhat et al.’s 2013 Nature Genetics paper. However, think carefully about the threshold you choose. For example, IQ-TREE is an increasingly popular tool for building phylogenetic trees, and its ultrafast bootstrap (UFBoot) support values are not equivalent to standard bootstrap support values. UFBoot support values are only considered high confidence at >= 0.95.
This is a key that assigns each sample to a user-defined group, such as strain type, isolation location, or host species. The ID supplied in the column must be a character string. In the output PDF the tree tips are colored by this ID. ID colors are generated automatically and cannot be chosen by the user.
| | strain_id |
---|---|
sample_1 | "ribotype_027" |
sample_2 | "ribotype_014" |
sample_3 | "ribotype_014" |
sample_4 | "ribotype_014" |
Note: hogwash is computationally slow, so very large datasets are unlikely to finish within a reasonable amount of time. We have typically worked with <500 samples and <500,000 genomic variants. In practice, a large genotype matrix can be split into multiple sub-matrices, with hogwash run in parallel on each sub-matrix so that the analysis finishes sooner.
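
Splitting a genotype matrix into sub-matrices amounts to slicing its columns while keeping every sample row. The following Python sketch shows one way to do that; the function name `split_columns` and the chunk size are assumptions for illustration only, and launching hogwash on each piece is left out.

```python
def split_columns(col_names, rows, chunk_size):
    """Yield (names, rows) sub-matrices with at most chunk_size columns each.

    Every sub-matrix keeps all samples (rows) in their original order, so
    each one still lines up with the tree tips.
    """
    if chunk_size < 1:
        raise ValueError("chunk_size must be at least 1")
    for start in range(0, len(col_names), chunk_size):
        stop = start + chunk_size
        yield col_names[start:stop], [row[start:stop] for row in rows]


columns = ["SNP_1", "SNP_2", "SNP_3", "SNP_4", "SNP_5"]
matrix = [
    [0, 1, 1, 0, 0],
    [0, 0, 0, 1, 1],
    [1, 0, 0, 1, 0],
    [1, 1, 1, 0, 1],
]
chunks = list(split_columns(columns, matrix, chunk_size=2))
for names, sub_rows in chunks:
    print(names, sub_rows)
```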
Next: the hogwash outputs.