How to filter nanomonsv result

Yuichi Shiraishi edited this page May 6, 2021 · 4 revisions

This page is for nanomonsv version 0.3.0 or before.

Quickstart

We provide an in-house script for post-filtering (it may be integrated into the nanomonsv commands in the future).

To see the script's usage, run:

$ python3 misc/post_filter.py -h

Try with the example data from our paper:

python3 misc/post_filter.py misc/example/v0.3.0/COLO829.nanomonsv.result.txt COLO829.nanomonsv.result.filt.txt {path_to_GRCh38_reference_genome}

One of the most effective filters removes insertions and deletions confined to simple repeat regions. To use it, prepare a bgzip-compressed, tabix-indexed BED file of simple repeats as follows:

wget http://hgdownload.soe.ucsc.edu/goldenPath/hg38/database/simpleRepeat.txt.gz   
zcat simpleRepeat.txt.gz | cut -f 2-4 | sort -k1,1 -k2,2n -k3,3n > simpleRepeat.bed   
bgzip -c simpleRepeat.bed > simpleRepeat.bed.gz
tabix -p bed simpleRepeat.bed.gz 

Then run the script with the --simple_repeat_bed option:

python3 misc/post_filter.py misc/example/v0.3.0/COLO829.nanomonsv.result.txt COLO829.nanomonsv.result.filt.txt {path_to_GRCh38_reference_genome} --simple_repeat_bed simpleRepeat.bed.gz

This will label many records as putative false positives.
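
To make "confined to a simple repeat region" concrete, here is a minimal Python sketch of one possible containment test. It is not the logic of misc/post_filter.py: the function names (build_repeat_index, is_confined) are hypothetical, and it assumes 1-based SV positions checked against 0-based, half-open BED intervals, with both breakpoints required to fall inside a single repeat interval.

```python
from bisect import bisect_left
from collections import defaultdict
from itertools import accumulate


def build_repeat_index(bed_rows):
    """Index (chrom, start, end) BED rows per chromosome (hypothetical helper).

    For each chromosome we keep the sorted interval starts and the running
    maximum of the interval ends, so containment can be answered with one
    binary search.
    """
    by_chrom = defaultdict(list)
    for chrom, start, end in bed_rows:
        by_chrom[chrom].append((int(start), int(end)))
    index = {}
    for chrom, intervals in by_chrom.items():
        intervals.sort()
        starts = [s for s, _ in intervals]
        max_ends = list(accumulate((e for _, e in intervals), max))
        index[chrom] = (starts, max_ends)
    return index


def is_confined(index, chrom, pos1, pos2):
    """Return True if 1-based positions pos1 <= pos2 lie in one repeat interval.

    Assumption: a 1-based position p is inside BED [start, end) when
    start < p <= end.
    """
    if chrom not in index:
        return False
    starts, max_ends = index[chrom]
    n = bisect_left(starts, pos1)  # number of intervals with start < pos1
    return n > 0 and max_ends[n - 1] >= pos2


# Toy intervals, not real simpleRepeat data.
repeats = build_repeat_index(
    [("chr1", 100, 200), ("chr1", 150, 400), ("chr2", 10, 50)]
)
```

With these toy intervals, is_confined(repeats, "chr1", 120, 180) is true, while is_confined(repeats, "chr1", 120, 390) is false because no single interval contains both positions. In practice the intervals would come from the tabix-indexed file rather than an in-memory list.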

The rationale for filtering

In its current form, nanomonsv reports many SVs that are likely duplicates or false positives. Here, we describe the sources of duplicates and false positives that we are aware of. In the future, some of these patterns will be removed by default in nanomonsv.

  1. Apparent SVs that are, in fact, the same as an SV in another record.

    In this case, the breakpoints of the two SVs share the same chromosomes and directions, and their positions are close together. They probably represent the same SV but derive from different clusters of supporting reads (e.g., a cluster of soft-clipped reads and a cluster of reads in which the deletion is represented in the CIGAR string). We remove one of the two SVs according to the following rules, listed in descending order of priority:

    • The one with the smaller number of supporting reads.
    • The one with the smaller size of inserted sequences.
    • The one whose first breakpoint coordinate is smaller.
    • The one whose second breakpoint coordinate is smaller.
  2. When two insertions are close together

    In this case, the two insertions actually come from the same insertion. Alignment tools sometimes split one insertion into two parts. In many cases, nanomonsv removes the split insertion in a later filtering step. Occasionally, however, nanomonsv calls both the genuine insertion and a part of the split insertion. Therefore, when two insertions are called in close proximity, we filter out the one with the shorter inserted sequence.

  3. When a non-insertion-type SV is part of another insertion-type SV

    This type of duplication occurs when the inserted sequence comes from another genomic region (such as LINE1 transduction-type insertions). Around long insertions, alignment tools generate both soft-clipped supporting reads (which are clustered and eventually become a non-insertion-type SV such as a translocation) and insertion ('I' in the CIGAR string) supporting reads (whose alignments completely cover the inserted sequence).

    You can find an example with the following command:

    $ grep 584987 misc/example/H2009.nanomonsv.result.txt 
    
  4. Insertions and deletions within simple repeat regions.

    It would be very valuable to identify insertions and deletions in simple repeat regions (especially repeat expansion events). In our experience, however, calls in simple repeat regions are not reliable, so we currently recommend filtering them out.
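
The priority rules for patterns 1 and 2 can be sketched in Python as follows. This is an illustration under stated assumptions, not the code of misc/post_filter.py: the SV record type, its field names, and the max_dist threshold of 10 bp are all hypothetical, since this page does not specify how close two breakpoints must be to count as duplicates.

```python
from typing import NamedTuple


class SV(NamedTuple):
    """Hypothetical minimal SV record; field names are illustrative only."""
    chr1: str
    pos1: int
    dir1: str
    chr2: str
    pos2: int
    dir2: str
    inserted_seq: str
    support: int  # number of supporting reads


def is_duplicate(a, b, max_dist=10):
    """Same chromosomes and directions, and both breakpoints close together.

    max_dist is an assumed threshold, not a value taken from nanomonsv.
    """
    return (
        (a.chr1, a.dir1, a.chr2, a.dir2) == (b.chr1, b.dir1, b.chr2, b.dir2)
        and abs(a.pos1 - b.pos1) <= max_dist
        and abs(a.pos2 - b.pos2) <= max_dist
    )


def removal_key(sv):
    """Pattern 1: the record with the smaller key is the one removed.

    Tuple comparison applies the rules in priority order: supporting reads,
    inserted sequence size, first breakpoint, second breakpoint.
    """
    return (sv.support, len(sv.inserted_seq), sv.pos1, sv.pos2)


def shorter_insertion(a, b):
    """Pattern 2: of two nearby insertions, pick the one to filter out."""
    return min(a, b, key=lambda sv: len(sv.inserted_seq))


def filter_duplicates(records, max_dist=10):
    """Drop one record from every duplicated pair (simple pairwise scan)."""
    removed = set()
    for i, a in enumerate(records):
        for j in range(i + 1, len(records)):
            if i in removed or j in removed:
                continue
            b = records[j]
            if is_duplicate(a, b, max_dist):
                removed.add(i if removal_key(a) <= removal_key(b) else j)
    return [r for k, r in enumerate(records) if k not in removed]


# Toy records, not taken from the example data.
a = SV("chr1", 1000, "+", "chr1", 5000, "-", "", 12)
b = SV("chr1", 1003, "+", "chr1", 5002, "-", "", 7)
c = SV("chr2", 300, "+", "chr3", 900, "+", "ACGT", 5)
kept = filter_duplicates([a, b, c])
```

Here b duplicates a but has fewer supporting reads, so kept contains only a and c. The quadratic scan is chosen for clarity; sorting records by chromosome and position first would be the natural choice for larger result files.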