How to understand nanomonsv result filtering

Yuichi Shiraishi edited this page May 5, 2021 · 4 revisions

The rationale for filtering

From version 0.4.0, the nanomonsv get command performs several basic filtering steps. Here, we describe each filtering item and its rationale.

  1. Apparent SVs that are, in fact, the same as an SV in another record (Duplicate_with_close_SV)

    In this case, the breakpoints of the two structural variations share the same chromosomes and directions, and their positions are close together. These probably represent the same SV, derived from different sources of supporting read clusters (e.g., a cluster of soft-clipped reads and a cluster of reads in which the deletion is represented in the CIGAR string). We remove one of the two SVs according to the following rules, in order of priority from the top:

    • The one with the smaller number of supporting reads.
    • The one with the smaller size of inserted sequences.
    • The one whose first breakpoint coordinate is smaller.
    • The one whose second breakpoint coordinate is smaller.
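As an illustration only, the priority rules above can be read as a lexicographic comparison. The sketch below is not nanomonsv's actual code: the `SVRecord` class, its field names, and `pick_record_to_remove` are hypothetical, and the behavior when every criterion is tied is an assumption, since the rules do not cover that case.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SVRecord:
    # Hypothetical record layout for illustration; not nanomonsv's internal type.
    sv_id: str
    pos_1: int             # first breakpoint coordinate
    pos_2: int             # second breakpoint coordinate
    inserted_seq: str      # inserted sequence ("" if none)
    supporting_reads: int  # number of supporting reads


def priority_key(sv: SVRecord) -> tuple:
    # Tuples compare element by element, which mirrors "in order of priority":
    # supporting reads first, then insert size, then the two coordinates.
    return (sv.supporting_reads, len(sv.inserted_seq), sv.pos_1, sv.pos_2)


def pick_record_to_remove(a: SVRecord, b: SVRecord) -> SVRecord:
    # The record with the smaller key is the one that is removed.
    # Assumption: if every criterion is tied, the first argument is removed.
    return a if priority_key(a) <= priority_key(b) else b


# Two close records assumed to describe the same SV.
rec_a = SVRecord("a", 100, 200, "", 5)
rec_b = SVRecord("b", 101, 201, "", 8)
removed = pick_record_to_remove(rec_a, rec_b)  # rec_a has fewer supporting reads
```

Using a tuple as the sort key keeps the order of the rules visible in one place, so a lower-priority criterion is consulted only when all higher-priority ones are tied.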
  2. When two insertions are close together (Duplicate_with_close_insertion)

    In this case, the two insertions actually come from the same insertion. Sometimes, the alignment tool splits an insertion into two parts. In many cases, nanomonsv can remove the split insertion in a later filtering step. Occasionally, however, nanomonsv calls both the genuine insertion and a part of the split insertion. Therefore, when two insertions are called in close proximity, we filter out the one with the shorter inserted sequence.

  3. When a non-insertion-type SV is part of another insertion-type SV (Duplicate_with_insertion)

    This type of duplication occurs when the inserted sequence comes from other genomic regions (such as LINE1 transduction-type insertions). Around long insertions, alignment tools generate both soft-clipping supporting reads (which are clustered and eventually become a non-insertion-type SV such as a translocation) and insertion ('I' in the CIGAR string) supporting reads (which completely cover the inserted sequence within the alignment).

  4. Insertions or deletions whose size is too small (Too_small_size)

    Currently, nanomonsv focuses on insertions and deletions whose sizes are 100 or larger. Those that do not meet this threshold are generally filtered out at the beginning of the process, but occasionally some may remain at the final step. For now, to be on the safe side, we do not remove these SVs but mark them. In the future, we would like to lower the size threshold to 50.

  5. SVs whose allele frequencies are too low (Too_low_VAF)

    In the nanomonsv get command, we validate SV candidates by collecting the reads around the breakpoints of each putative SV and checking, for each read of the tumor and matched control, whether it contains the putative SV segment sequence (the concatenated sequence around the SV breakpoints). A read that contains it is counted as a “variant supporting read”; a read that does not is classified as a “reference read”. Variant allele frequencies (VAFs) are measured as the number of variant supporting reads divided by the total number of reads (variant supporting reads + reference reads). When these VAFs are below the threshold (0.05 by default), the SVs are marked. This method of calculating the VAF may lead to underestimation, especially when one of the breakpoints has high-level amplification, and we recognize that future improvements may be necessary.
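The VAF calculation described above can be sketched as follows. This is a minimal illustration rather than nanomonsv's implementation: the function and parameter names are hypothetical, and returning `None` (and not marking the SV) when no reads were checked is an assumption that the text does not address.

```python
from typing import Optional


def variant_allele_frequency(n_variant: int, n_reference: int) -> Optional[float]:
    # VAF = variant supporting reads / (variant supporting reads + reference reads).
    total = n_variant + n_reference
    if total == 0:
        # Assumption: the VAF is treated as undefined when no reads were checked.
        return None
    return n_variant / total


def is_too_low_vaf(n_variant: int, n_reference: int, min_vaf: float = 0.05) -> bool:
    # An SV is marked only when its VAF is strictly below the threshold.
    vaf = variant_allele_frequency(n_variant, n_reference)
    return vaf is not None and vaf < min_vaf


# 3 variant supporting reads out of 100 checked reads.
low = is_too_low_vaf(3, 97)
```

With 3 variant supporting reads and 97 reference reads the VAF is 3/100, which falls below the default threshold, so `low` is `True`; an SV at exactly the threshold would not be marked under this strict comparison.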

  6. SVs where one or both breakpoints are located in decoy sequences (SV_with_decoy)

    We believe that many users, like us, use a reference genome that contains decoy sequences. In these cases, nanomonsv detects many SVs involving decoy sequences. However, there is little evidence that these calls are correct, and we currently recommend filtering them out.

About insertions and deletions within simple repeat regions

It would be extremely valuable to identify insertions and deletions in simple repeat regions (especially repeat expansion events). However, in our experience, calls in simple repeat regions are not reliable, so we currently recommend filtering them out. One way to filter them is described in another wiki page.