Skip to content

3. Phasing assessment with hap mers

Arang Rhie edited this page Apr 23, 2020 · 5 revisions

1. Inherited hap-mer plots

When parental read sets are available, we can get the parental specific mers and overlay on the child’s read set. (See build hap-mer dbs for trios for more details).

Here, we use the parental assemblies of COL and CVI strain to get the parental specific k-mers, and generated the hap-mers by intersecting with the F1’s read set. Download and untar the pre-built inherited meryl dbs: col0.hapmer.meryl.tar and cvi0.hapmer.meryl.tar.

The hap-mer painted on the F1’s read set, along with the cut-offs for the solid k-mers are as follows. The stacked version gives an overall view, unstacked line and filled versions allows visually confirm the hap-mer distribution follows binomial distribution. A properly phased assembly is expected to contain each hap-mer in the corresponding haplotype assembly.

Stacked Filled Line

2. Hap-mer blob plots

The blob plots provides a quick glance on the overall phasing across an assembly. Each dot (circle) represents a sequence, a contig or scaffold, with its size relative to the sequence length. Each sequence is colored by assembly. The x and y axis are the number of hap-mers found.

hapmers_blob

A more phased sequence will have less hap-mers found from the other haplotype, so the closer the circles are aligned along the axis, the better the phasing performance is.

  • Note: There will be an option available soon to transform the axis 45 degrees, to less overlap the axis.

3. Hap-mer completeness (recovery rate)

Hap-mer completeness is computed similar to the k-mer completeness, but this time, per each hap-mer set. The hap-mer itself is already filtered, so we can use it directly as solid k-mers.

hap-mer completeness = found hap-mers in an assembly / total hap-mers

Similar to the overall k-mer completeness, this time, the second column is the hap-mer set used for evaluation.

col	col0.hapmer	18316089	18408825	99.4962
cvi	col0.hapmer	200302	18408825	1.08808
both	col0.hapmer	18344405	18408825	99.6501
col	cvi0.hapmer	132496	18264244	0.725439
cvi	cvi0.hapmer	18148554	18264244	99.3666
both	cvi0.hapmer	18161738	18264244	99.4388

Here, we see the col assembly is capturing 99.5% of the col hap-mers, with 1% of cvi hap-mers which are likely switch errors or base pair errors.
Depending on the de novo phasing assembly strategy, these could be useful to measure if the assembly achieved complete phasing per haplotypes.
both is showing the stats for both assemblies combined, which could be used as a measure for overall haplotype completeness.

4. Spectra copy number analysis per hap-mers

Associated spectra-cn plots are generated per assembly and per hap-mer.
Here are the examples of the col assembly with col0 and cvi0 hapmers.

Line Line

The col assembly shows col0 hap-mers in the expected copy number range, with almost no hap-mer missing. On the other hand, we have almost no cvi0 hap-mers found, regardless of the copy number.

See example/triocanu_clr...spectra-cn.*.png for more details.

5. Phased block statistics and switch error rates

The phased blocks are defined by hap-mers found in the assembly. By using the hap-mers as markers, we can define a phased block where we see more than 2 markers from the same haplotype. Assemblies usually come out with local switches, introduced by consensus errors, and we want to tolerate those short-range switches when calculating switch errors within a phased block.

Short-range switches are defined as consecutive hap-mers from the other haplotype, in a defined length of sequence. By default, merqury is set to

  • num_switch : 100
  • short_range : 20,000

which allows at most 100 marker switches, bounded in 20kb within a phased-block. If there are more markers, or the markers observed are stretching over 20kb, a long-range switch is occurring, switching the haplotype block.

In short, switch=100 and range=20000 allows 0.5% of switches in ~20kb window.

The output summary statistics is saved as *.phased_block.stats.

head *.100_20000.phased_block.stats
==> triocanu_clr.col.100_20000.phased_block.stats <==
triocanu_clr.col.100_20000	354	123,190,467	19	347,996	4,222,794	12,619,854	56418	19792291	0.28505%
==> triocanu_clr.cvi.100_20000.phased_block.stats <==
triocanu_clr.cvi.100_20000	309	122,007,476	24	394,846	5,524,119	10,997,312	49891	19682578	0.253478%

Each column is:

  1. Assembly name with short-range switch parameters
  2. Number of blocks
  3. Total bases in blocks (Block sum)
  4. Smallest block size
  5. Avg. block size
  6. Block N50 size
  7. Longest block size
  8. Number of markers from the other haplotype
  9. Total number of markers in blocks
  10. Switch error rate

Blocks are visualized in the N* plots, ordered by its’ size along the block sum coverage. The blocks are colored according to the haplotype. The trio binned assemblies have small blocks of switches to the other haplotype, which indicates the small fraction of reads being mis-binned. (We expect this will become smaller when the error rate in the long reads go down.)

col_block cvi_block

We can compare the phased blocks vs. contig (and scaffolds when provided) N* plots, to see how relatively a contig (or scaffold) is phased.

col_cont cvi_cont

When using the stats for publication, make sure to report both num_switch and short_range along with the phased block N50 and switch error rate. The larger the num_switch is defined, we tolerate higher short-range switch error rate, thus get longer phased block stats. However, short range switches may be critical for some analysis. In those cases, to get more reliable phased blocks, use smaller num_switches.

For example, to get more stringent phased blocks allowing only 10 switches in a raw, run:

sh $MERQURY/trio/switch_error.sh triocanu_clr.col.sort.bed triocanu_clr.col 10 20000
sh $MERQURY/trio/switch_error.sh triocanu_clr.cvi.sort.bed triocanu_clr.cvi 10 20000
sh $MERQURY/trio/block_n_stats.sh col.fasta triocanu_clr.col.10_20000.phased_block.bed cvi.fasta triocanu_clr.cvi.10_20000.phased_block.bed triocanu_clr.10_20000

Here are the col block continuity as for comparison. Left: 100_20000, right: 10_20000.

col_cont_100 col_cont_10

As we can see, the more stringent we get, the blocks become smaller and local stretches of switches forms its own block.

6. Track each haplotype block in the assembly

There are several files generated loadable on genome browsers (IGV) for further investigating haplotype consistency.

  • phased_block.bed per assembly and haplotype
  • hapmer.bed and hapmer.tdf per assembly and haplotype