Skip to content
Mark Fiers edited this page May 21, 2012 · 16 revisions

Plots

Hagfish can generate a number of different plots. The most detailed plot of these is:

hagfish_cplot2

(Note -there is an older - outdated - script hagfish_cplot, which plots only the ECP)

A cplot2 plot shows all six coverage plots for any given region. The image below, generated from the test set included with the Hagfish distribution, shows a cplot2 plot for a sequence called "contig" starting at position 1 and ending at position 2e5. The image was generated using the following command line:

hagfish_cplot2 -n 6e4 --ymax 200 contig

This specific plot was generated in the course of quality control of a DNA assembly project, meaning that the raw read pairs underlying the assembly were mapped back to the resulting sequence. The following plots are interpreted in this context, but other applications are possible - with (slightly) different interpretations.

A hagfish cplot2 plot

Figure 1, Overview cplot2 of a sequence aptly named 'contig'.

The plot ranges from position 0 to 60,000. The plot is compressed into a single "band". It is possible to zoom in and plot data over multiple bands with the following parameters:

  • -s - start of the plotted region
  • -e - end of the plotted region
  • -n - number of nucleotides per band.

Note - it is possible to define each of these parameters using an exponential format (60,000 can be written as 60e3 or 6e4). So we can zoom in zoom in a on a specific region of the plot using:

hagfish_cplot2 --ymax 200 -n 5e3 -e 1e4 contig

resulting in the following plot:

Zoomed in on the previous plot

Figure 2, Zoomed in on the first 1e4 nucleotides of the plot in Figure 1

a , with 10,000 nucleotides per band (-n 1e4). With the -s parameter it would have been possible to specify a start coordinate for the plot. The left y axis indicates the start position of that specific band, and the type of plot (ICP or ECP). The right y-axis indicates the scale - with a line drawn at the median ok coverage for that complete sequence (not all analyzed sequences). Above the x-axis are the three stacked, inclusive, coverage plots (ICPs): "high" in red, "ok" in green and "low" in blue. Below the x-axis are the three exclusive coverage plots (ECPs), also stacked (but mirrored) (see Home for an explanation of the differences between ICPs and ECPS and what the three categories mean).

The reference sequence contains two gaps (a stretch of NNNs - identified by hagfish_gapfinder) inbetween 3k* & 4k and 12k and 14k. Gaps are indicated by a yellow background in the plot. Additionally, a score is calculated (-S flag), and plotted, by a black line (seeOperation for details on score calculation).

Any region with "sufficient" green ECP coverage (below the y-axis) indicates that that region is completely covered by reads mapping to that region. Any region with "sufficient" green ICP coverage (above the y-axis) indicates that that region is structurally sound, i.e.: based on the read pairs mapped to the genome.

The plot shows two abberant major features. The first is around 3.5k - the first gap. The ICP shows coverage in the "high" category bridging the gap. The ECP has two small bumps in the "high" category on both sides of the gap. A number of conclusions can be drawn from this signature: it indicates, in this case, that the assembly is correctly scaffolded in this location (i.e. the contigs left & right of the gap belong together), but the size of the gap (the yellow block) is over estimated. This can be concluded by the fact that there is coverage in the ICP spanning the gap (there are read pairs spanning the gap), but they fall in the "high" category - meaning they map further apart than expected - hence, the size of the gap is too large.

hagfish_blockplot

The same region plotted above is also displayed in the, more compact, plot below:

Example of a Hagfish blockplot

This plot shows a heatmap, based on the score, and is much better suited to visualize large regions at once.

Clone this wiki locally