Input data formats

Trees
Metadata
Genome Annotations
Genomic data (blocks)
GWAS data

Data formats in bioinformatics can be problematic, so I have tried to make this page detailed enough. We try to identify errors during parsing, but the checks are not perfect. Yet.

All of the raw files used in the online examples are available here.

### Phylogenies (Trees)

Phylogenies form the backbone of the visualization as they link together all the other data. While it is possible to use phandango without them for GWAS-type graphs, all metadata, recombination blocks and pan-genome content rely on them.

Trees must be in Newick format and the file name must end in .tre or .tree. Newick is the standard output of most, but not all, tree-building software (e.g. RAxML). If your tree is in a different format, try converting it to Newick with FigTree, but watch out: single quotation marks are often added around taxon names, and these must be removed manually! Notably, Nexus files are not supported.
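Removing the quotation marks by hand is tedious for large trees. The following is a minimal sketch of one way to do it in Python; the function name `strip_taxon_quotes` is made up for this example, and it assumes the quoted names contain no characters that are special in Newick (commas, colons, brackets), since such names would no longer parse once unquoted.

```python
import re


def strip_taxon_quotes(newick):
    """Remove single quotes wrapped around taxon names, e.g. 'taxon_A' -> taxon_A."""
    return re.sub(r"'([^']*)'", r"\1", newick)


quoted = "(('taxon_A':0.1,'taxon_B':0.2):0.05,'taxon_C':0.3);"
print(strip_taxon_quotes(quoted))
```

Write the result to a file ending in .tre or .tree before loading it.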

### Metadata

Metadata is displayed to the right of the tree (so a tree must exist!) and the taxon names here must match those in the tree. The settings menu controls which columns are displayed, and pressing *k* displays a key.

Format:

Comma-separated values (CSV) file (example here)
File name ending in .csv
The first line is used for the column headers
The first column contains the taxon names, which must match those in the tree

Colour selection:

The colour scale depends on the type of data in each column (binary, ordinal or continuous), which is inferred from the data, but this is far from perfect!
Appending :o or :c to a column name (in the first row) forces that column to be treated as ordinal or continuous, respectively.
If you want multiple columns to use the same colours for the same values (e.g. so that the value 42 is the same colour in each column), then group these columns by adding an integer to the suffix - e.g. :o1. You can have as many groups as you like.
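The header suffixes described above can be generated programmatically. This is only a sketch: the helper names (`header_cell`, `write_metadata`), the column names and the taxon names are invented for illustration, and the first column header (`name`) is an arbitrary choice.

```python
import csv
import io

SUFFIX = {"ordinal": "o", "continuous": "c"}


def header_cell(name, kind=None, group=None):
    """Build a column header, optionally forcing the data type and colour group."""
    if kind is None:
        return name  # no suffix: the type is inferred from the data
    cell = f"{name}:{SUFFIX[kind]}"
    if group is not None:
        cell += str(group)  # e.g. ":o1" so grouped columns share colours
    return cell


def write_metadata(header, rows):
    """Return the metadata table as CSV text (first row = column headers)."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(header)
    writer.writerows(rows)
    return buffer.getvalue()


header = [
    "name",  # first column: taxon names matching the tree
    header_cell("country"),
    header_cell("year", "continuous"),
    header_cell("serotype_2010", "ordinal", group=1),
    header_cell("serotype_2015", "ordinal", group=1),
]
rows = [
    ["taxonA", "UK", 2009, "19F", "19A"],
    ["taxonB", "USA", 2012, "19A", "19A"],
]
metadata_csv = write_metadata(header, rows)
print(metadata_csv)
```

Here the two serotype columns share group 1, so the value 19A should be drawn in the same colour in both.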

### Genome Annotations

Annotations appear in the top right of the display and are nearly essential for interpreting recombination / GWAS results. They must be in [GFF3](http://gmod.org/wiki/GFF3#GFF3_Format) format and end in *.gff* or *.gff3*. Parsing GFF files is error prone so it's worth looking at an [example file](https://raw.githubusercontent.com/jameshadfield/phandangoExampleData/master/gubbinsNAR/Spn23f.gff), especially the first two lines:

```
##gff-version 3
##sequence-region 1
```

Converting to GFF3:

Can often be done using Artemis
Can be done on the command line with seqret via `seqret -sequence EMBL_FILE_NAME -feature -fformat embl -fopenfile GFF_FILE_NAME -osformat gff -auto`

Display:

All of the semi-colon separated fields are read and displayed when you hover over a gene / region.
If colour appears in the info field then genes are coloured similarly to Artemis.
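To check what will be shown on hover, it can help to split the ninth GFF3 column yourself. The sketch below is not phandango's own parser; `parse_attributes` is a hypothetical helper and the example line is made up.

```python
from urllib.parse import unquote


def parse_attributes(column9):
    """Split a GFF3 attribute string (key=value pairs joined by ';') into a dict."""
    attributes = {}
    for field in column9.strip().split(";"):
        if not field:
            continue  # tolerate a trailing semi-colon
        key, _, value = field.partition("=")
        attributes[key.strip()] = unquote(value.strip())  # GFF3 uses URL-style escapes
    return attributes


# Made-up annotation line with nine tab-separated fields
line = "1\tEMBL\tCDS\t197\t1558\t.\t+\t0\tID=gene0001;gene=dnaA;colour=2"
info = parse_attributes(line.split("\t")[8])
print(info)
if "colour" in info:
    print("colour value:", info["colour"])
```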

### Genomic data (recombination blocks, pan genome output)

Currently three different file types are parsed, but it shouldn't be hard to convert any block-like data into one of these formats.

Gubbins

Gubbins output is in GFF3 format and must end in .gff or .gff3, like the genome annotation (example here). If you have an old Gubbins output file (e.g. *rec.tab), there is a simple Python script here which will convert it for you.

Essential fields:

The second field of each line (except the headers) must be GUBBINS, to distinguish these files from annotation GFFs.
The semi-colon separated info string (field 9) must contain the following keys: neg_log_likelihood, taxa and snp_count
Values are surrounded with double quotes, e.g. snp_count="7";
The taxa field is a list of whitespace separated taxon names which must match taxa in the tree in order to be displayed.
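The essential fields above can be checked before loading a file. This is a rough sketch under the rules listed here, not the parser phandango uses; `parse_gubbins_line` is a hypothetical name and the example line is invented.

```python
import re

REQUIRED_KEYS = ("neg_log_likelihood", "taxa", "snp_count")


def parse_gubbins_line(line):
    """Return (start, end, taxa) for a Gubbins block line, or None for other lines."""
    if line.startswith("#"):
        return None  # header / directive line
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 9 or fields[1] != "GUBBINS":
        return None  # not a Gubbins block (e.g. an annotation GFF line)
    # Values are surrounded with double quotes, e.g. snp_count="7";
    info = dict(re.findall(r'(\w+)="([^"]*)"', fields[8]))
    missing = [key for key in REQUIRED_KEYS if key not in info]
    if missing:
        raise ValueError(f"missing keys in info string: {missing}")
    # GFF3 fields 4 and 5 hold the start and end co-ordinates
    return int(fields[3]), int(fields[4]), info["taxa"].split()


# Invented example line
example = (
    "SEQUENCE\tGUBBINS\tCDS\t1000\t2000\t0.000\t.\t0\t"
    'neg_log_likelihood="1234.5";taxa=" taxonA taxonB";snp_count="7";'
)
block = parse_gubbins_line(example)
print(block)
```

Comparing the returned taxon names against the tip names in your tree is a quick way to find blocks that will not be displayed.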

BRAT NextGen

A tab-separated text file (i.e. ending in .txt). The default file name is something like segments_tabular.txt (example here).

Format:

The first line must be *LIST OF FOREIGN GENOMIC SEGMENTS:*
The second line (the header) is not used
Subsequent lines have 6 fields corresponding to (1) block start co-ordinate (integer), (2) block end co-ordinate (integer), (3) origin cluster (integer), (4) home cluster (integer), (5) not used, (6) taxon name (string).
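As an illustration of the layout described above, here is a small sketch that reads such a file into a list of blocks. The function name, the dictionary keys and the example header names are assumptions made for this example. It splits on any whitespace for the first five fields, so it accepts tabs or padded spaces, and keeps the remainder of the line as the taxon name.

```python
def parse_brat_nextgen(text):
    """Parse BRAT NextGen tabular output into a list of block dictionaries."""
    lines = text.splitlines()
    if not lines or "LIST OF FOREIGN GENOMIC SEGMENTS" not in lines[0]:
        raise ValueError("first line must be LIST OF FOREIGN GENOMIC SEGMENTS:")
    blocks = []
    for line in lines[2:]:  # line 2 is the header and is not used
        if not line.strip():
            continue
        fields = line.split(None, 5)
        if len(fields) != 6:
            raise ValueError(f"expected 6 fields, got {len(fields)}: {line!r}")
        blocks.append({
            "start": int(fields[0]),
            "end": int(fields[1]),
            "origin_cluster": int(fields[2]),
            "home_cluster": int(fields[3]),
            # fields[4] is not used
            "taxon": fields[5].strip(),
        })
    return blocks


# Invented example content
example = (
    "LIST OF FOREIGN GENOMIC SEGMENTS:\n"
    "Start\tEnd\tOrigin\tHomeCluster\tBAPSIndex\tStrainName\n"
    "1000\t2000\t2\t1\t1\ttaxonA\n"
    "5000\t5600\t3\t1\t1\ttaxonB\n"
)
blocks = parse_brat_nextgen(example)
print(blocks)
```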

ROARY pan genome

The output file gene_presence_absence.csv is used and this contributes both the annotation data and the block data (example file).

This CSV file is often huge and can cause browsers to crash. There is a simple Python script here which minimises this file.

### Scatterplots (Manhattan plots)

GWAS results are in plink format, i.e. a tab-delimited file with a header line similar to `#CHR SNP BP minLOG10(P) log10(p) r^2`

The 3rd column is used as the genome co-ordinate
For seer output, where each result is a k-mer rather than a single base, the 3rd column should be a range x1..x2, e.g. 140..160
The 5th column - r^2 - contributes the colour.
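The two forms of the 3rd column can be handled with a few lines of Python. This sketch only reads the co-ordinate column; the function names and the example rows are made up, and it is not taken from phandango's source.

```python
def parse_position(bp):
    """Convert the 3rd column into a (start, end) pair of integers."""
    if ".." in bp:
        start, end = bp.split("..")  # k-mer range, e.g. 140..160
        return int(start), int(end)
    position = int(bp)  # single base
    return position, position


def read_positions(text):
    """Return the co-ordinates of every data row in a tab-delimited GWAS file."""
    positions = []
    for line in text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and the header line
        fields = line.split("\t")
        positions.append(parse_position(fields[2]))
    return positions


# Invented example rows
example = (
    "#CHR\tSNP\tBP\tminLOG10(P)\tlog10(p)\tr^2\n"
    "1\tsnp1\t1500\t3.2\t3.2\t0.4\n"
    "1\tkmer1\t140..160\t5.1\t5.1\t0.9\n"
)
positions = read_positions(example)
print(positions)
```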
