Input data formats
Data formats in bioinformatics can be problematic, so I have tried to describe each one here in enough detail. While we try to identify errors upon parsing, it's not perfect. Yet.
All of the raw files used in the online examples are available here.
### Phylogenies (Trees)
Phylogenies form the backbone of the visualization as they link together all the other data. While it is possible to use phandango without them for GWAS-type graphs, all metadata, recombination blocks and pan-genome content rely on them.

Trees must be in Newick format and must end in .tre or .tree. Newick is the standard output from most, but not all, tree drawing software (e.g. RAxML). If you need to convert your tree to Newick, try using FigTree, but watch out: single quotation marks are often added around taxon names, and these must be manually removed! Notably, Nexus files are not supported.
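The quote removal mentioned above can be scripted rather than done by hand. The following is a minimal sketch, not part of phandango; the function name `strip_single_quotes` and the example tree are made up for illustration. It assumes taxon names contain no embedded quotes and no characters that are special in Newick (spaces, colons, commas, parentheses), since such names would no longer parse once unquoted.

```python
import re


def strip_single_quotes(newick: str) -> str:
    """Remove single quotes wrapping taxon names in a Newick string."""
    # Match '...' with no quote inside and keep only the inner text.
    return re.sub(r"'([^']*)'", r"\1", newick)


# Hypothetical tree, quoted the way FigTree often exports names.
example = "(('taxon_A':0.1,'taxon_B':0.2):0.05,taxon_C:0.3);"
cleaned = strip_single_quotes(example)
print(cleaned)
```

Check the cleaned tree by eye before loading it, particularly if any names originally contained punctuation.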
### Metadata
Metadata is displayed to the right of the tree (so a tree must exist!) and the taxon names here must match those in the tree. Which columns are displayed can be controlled in the settings menu, and a key can be displayed by pressing *k*.

Format:
- Comma separated values (CSV) file (example here)
- File ending in .csv
- The first line is used for the column headers
- The first column contains the taxon names, which must match those in the tree
Colour selection:
- The colour scale depends on the type of data in each column (binary, ordinal or continuous), which is inferred from the data, but this is far from perfect!
- Adding :o or :c to the end of a column name (in the first row) forces that column to be treated as ordinal or continuous, respectively.
- If you want multiple columns to use the same colours for the same values (e.g. so that the value 42 is the same colour in each column), then group these columns by adding an integer to the suffix - e.g. :o1. You can have as many groups as you like.
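As an illustration of the header suffixes described above, here is a small sketch that writes a metadata CSV with Python's standard `csv` module. The column names, taxon names and values are all invented for the example. The two year columns share the `:o1` group so that the same year gets the same colour in both, and `mic` is forced to continuous with `:c`.

```python
import csv
import io

# Hypothetical header: first column holds taxon names, the rest carry suffixes.
header = ["name", "year_isolated:o1", "year_sequenced:o1", "mic:c"]
# Hypothetical records; the first value in each must match a taxon in the tree.
records = [
    ["taxon_A", 2001, 2003, 0.25],
    ["taxon_B", 2003, 2003, 1.5],
    ["taxon_C", 2001, 2004, 0.06],
]


def build_metadata_csv(header, records):
    """Return the metadata table as CSV text, header line first."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(header)
    writer.writerows(records)
    return buffer.getvalue()


metadata_csv = build_metadata_csv(header, records)
print(metadata_csv)
```

Saving this text to a file ending in .csv gives something that should match the format listed above.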
Converting to GFF3:
- Can often be done using Artemis
- Can be done on the command line with seqret via
seqret -sequence EMBL_FILE_NAME -feature -fformat embl -fopenfile GFF_FILE_NAME -osformat gff -auto
Display:
- All of the semi-colon separated fields are read and displayed when you hover over a gene / region.
- If colour appears in the info field then genes are coloured similarly to Artemis.
Gubbins output is in GFF3 format and must end in .gff or .gff3, similar to the genome annotation (example here). If you have an old gubbins output file (e.g. *rec.tab) then there is a simple python script here which will convert it for you.
Essential fields:
- The second field of each line (except the headers) must be GUBBINS, to distinguish these files from annotation GFFs.
- The semi-colon separated info string (field 9) must contain the following keys: neg_log_likelihood, taxa and snp_count
- Values are surrounded with double quotes, e.g. snp_count="7";
- The taxa field is a list of whitespace separated taxon names which must match taxa in the tree in order to be displayed.
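The essential fields listed above can be checked before loading a file. The sketch below is an assumption-laden illustration, not phandango's own parser: the function name `check_gubbins_line` and the example line are invented, and only the rules stated in this section are tested.

```python
REQUIRED_KEYS = ("neg_log_likelihood", "taxa", "snp_count")


def check_gubbins_line(line, tree_taxa):
    """Return a list of problems found in one non-header GFF line."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 9:
        return [f"expected 9 tab-separated fields, found {len(fields)}"]
    problems = []
    if fields[1] != "GUBBINS":
        problems.append("field 2 is not GUBBINS")
    # Split the info string into key="value" pairs and drop the quotes.
    info = {}
    for item in fields[8].split(";"):
        if "=" in item:
            key, value = item.split("=", 1)
            info[key.strip()] = value.strip().strip('"')
    for key in REQUIRED_KEYS:
        if key not in info:
            problems.append(f"missing {key}")
    # Taxa are whitespace separated and must all be present in the tree.
    unknown = [t for t in info.get("taxa", "").split() if t not in tree_taxa]
    if unknown:
        problems.append("taxa not in tree: " + ", ".join(unknown))
    return problems


# Invented example line; the values are placeholders, not real Gubbins output.
example = (
    'SEQ1\tGUBBINS\tCDS\t100\t500\t0.000\t.\t0\t'
    'neg_log_likelihood="1234.5";taxa="taxon_A taxon_B";snp_count="7"'
)
print(check_gubbins_line(example, {"taxon_A", "taxon_B"}))
```

An empty list means none of the checks failed; header lines starting with `#` should be skipped before calling the function.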
A tab separated txt file (i.e. ending in .txt). This has a default file name like segments_tabular.txt (example here).
Format:
- The first line must be * LIST OF FOREIGN GENOMIC SEGMENTS:*
- The second line (the header) is not used
- Subsequent lines have 6 fields corresponding to (1) block start co-ordinate (integer), (2) block end co-ordinate (integer), (3) origin cluster (integer), (4) home cluster (integer), (5) not used, (6) taxon name (string).
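The layout above can be read with a few lines of Python. This is a rough sketch under stated assumptions: the names `Segment` and `parse_segments` are invented, the header text in the example is a placeholder (the header line is ignored anyway), and fields are split on any whitespace, so taxon names are assumed to contain no spaces.

```python
from typing import NamedTuple


class Segment(NamedTuple):
    start: int
    end: int
    origin_cluster: int
    home_cluster: int
    taxon: str


def parse_segments(text):
    """Parse the text of a segments file into a list of Segment records."""
    lines = text.splitlines()
    if not lines or "LIST OF FOREIGN GENOMIC SEGMENTS" not in lines[0]:
        raise ValueError("first line is not the expected title line")
    segments = []
    # Line 2 is the unused header, so data starts at the third line.
    for number, line in enumerate(lines[2:], start=3):
        if not line.strip():
            continue
        fields = line.split()
        if len(fields) != 6:
            raise ValueError(f"line {number}: expected 6 fields, found {len(fields)}")
        start, end, origin, home, _unused, taxon = fields
        segments.append(Segment(int(start), int(end), int(origin), int(home), taxon))
    return segments


# Invented example content, not real output.
example = (
    "LIST OF FOREIGN GENOMIC SEGMENTS:\n"
    "placeholder header line\n"
    "1200\t3400\t2\t1\t5\ttaxon_A\n"
    "8000\t9100\t3\t1\t7\ttaxon_B\n"
)
for segment in parse_segments(example):
    print(segment)
```

The fifth field is read and discarded, matching the note that it is not used.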
The output file gene_presence_absence.csv is used and this contributes both the annotation data and the block data (example file).
This CSV file is often huge and can cause browsers to crash. There is a simple python script here which minimises this file.
### Scatterplots (Manhattan plots)
GWAS results are in plink format, i.e. a tab delimited file with a header line similar to
#CHR SNP BP minLOG10(P) log10(p) r^2
- The 3rd column is used as the genome co-ordinate
- For seer output, where each result is a k-mer rather than a single base, the 3rd column should be a range of the form x1..x2, e.g.
140..160
- The 5th column (r^2) contributes the colour.
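To make the two forms of the 3rd column concrete, here is a small sketch that turns either a single co-ordinate or an x1..x2 range into a (start, end) pair. The function name `parse_position` is made up, and this is only an illustration of the format, not how phandango itself reads the file.

```python
def parse_position(field):
    """Return (start, end) for a 3rd-column value such as '150' or '140..160'."""
    if ".." in field:
        # k-mer style range, as used for seer output.
        start, end = field.split("..", 1)
        return int(start), int(end)
    # Single base: start and end are the same co-ordinate.
    position = int(field)
    return position, position


print(parse_position("140..160"))
print(parse_position("150"))
```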