Skip to content
/ minisv Public

Lightweight mosaic/somatic SV caller for long reads (WIP)

License

Notifications You must be signed in to change notification settings

lh3/minisv

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

30 Commits
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

Getting Started

# Download k8 v1.2; better copy "k8" to a directory in PATH
curl -L https://zenodo.org/records/11357203/files/k8-1.2.tar.bz2?download=1 | tar -jxf -
cp k8-1.2/k8-`uname -m`-`uname -s` k8

# extract candidate SVs
k8 minisv.js e -c "k8 minisv.js" -n COLO829T test/COLO829T.hs38l.paf.gz \
  test/COLO829T.chm13g.paf.gz | bash > COLO829T.rsv
k8 minisv.js extract -n COLO829BL test/COLO829BL.hs38l.paf.gz > COLO829BL.rsv

# single-sample calling
sort -k1,1 -k2,2n -S4G COLO829T.rsv | k8 minisv.js merge - > COLO829T.msv
k8 minisv.js genvcf COLO829T.msv > COLO829T.vcf

# paired calling
sort -k1,1 -k2,2n -S4G COLO829{T,BL}.rsv | k8 minisv.js merge - \
  | grep COLO829T | grep -v COLO829BL > COLO829T.paired.msv

Table of Contents

Introduction

Minisv is a lightweight mosaic/somatic structural variation (SV) caller for long genomic reads. Different from other SV callers, it prefers to combine read alignments against multiple reference genomes. Minisv retains an SV on a read only if the SV is observed on the read alignments against all references. This simple strategy reduces alignment errors and filters out germline SVs in the sample (when used with the assembly of input reads) or in the population (when used with pangenomes).

Given PacBio HiFi reads at high coverage, minisv achieves higher specificity for mosaic SV calling, has comparable accuracy to tumor-normal paired SV callers, and is the only tool accurate enough for calling large chromosomal alterations (CAs) from a single tumor sample without matched normal.

Workflow

Minisv can call 1) germline SVs, 2) somatic SVs in tumor-normal pairs, 3) large somatic SVs in a single tumor, 4) mosaic SVs and 5) de novo SVs in a trio; the exact use also depends on the input data types. Minisv achieves this variety of calling modes in three steps:

  1. Extract SVs from read alignment with extract. This command processes one read at a time, extracts long INDELs or breakends and outputs in a BED-like minisv format. If you have tumor-normal pairs, remember to use -n to specify samples such that you can distinguish the samples in step 3.

  2. Intersect candidate SVs extracted from multiple alignment files with isec. You can skip this step if you use one reference genome only, but you would lose the key advantage of minisv. The first two steps are shared by all calling modes, though the input alignments may be different. You can combine the first two steps with the e command, which provides a more convenient but less flexible interface.

  3. Call SVs supported by multiple reads using merge. This command counts the number of supporting reads for each sample. You can filter out SVs supported by normal reads to call somatic SVs - this is how minisv works with multiple samples jointly.

Calling SVs

Minisv seamlessly works with SAM, PAF or GAF (graph PAF) formats. It requires the ds:Z tag outputted by minigraph-0.21+ or minimap2-2.28+. For minigraph, use -cxlr for long reads. For minimap2, use -cxmap-hifi -s50 --ds for HiFi reads. The minimap2 option for Nanopore reads varies with the read error rate.

Germline SVs

minisv.js extract -b data/hs38.cen-mask.bed aln.hg38l.paf.gz > sv.hg38l.rsv
cat sv.hg38l.rsv | sort -k1,1 -k2,2n -S4g | minisv.js merge - > sv.hg38l.msv
minisv.js genvcf sv.hg38l.msv > sv.hg38l.vcf

For calling germline SVs, you only need one linear reference genome. This is the simplest use case. However, minisv does not infer genotypes. It is not the best tool for germline SV calling.

Somatic SVs in tumor-normal pairs

minisv.js e -n TUMOR -0b data/hs38.cen-mask.bed tumor.hg38l.paf tumor.chm13l.paf \
  tumor.chm13g.paf tumor-to-normal.self.paf | bash > sv-tumor.hg38l+tgs.rsv
minisv.js extract -n NORMAL normal.hg38l.paf > sv-normal.hg38l.rsv
cat sv-tumor.hg38l+tgs.rsv sv-normal.hg38l.rsv | sort -k1,1 -k2,2 -S4g \
  | minisv.js merge - | grep TUMOR | grep -v NORMAL > sv-paired.hg38l+tgs.msv

The last command selects SVs only present in TUMOR but not in NORMAL.

If you want to take the GRCh38 coordinate system, it is recommended to also align reads to T2T-CHM13 and the CHM13 graph. If you have PacBio HiFi reads for the normal, assemble the normal reads with hifiasm, align the tumor reads to the normal assembly and provide the alignment as the last input along with option -0.

Graph alignment greatly reduces alignment errors when the normal assembly is not available. When you have the normal assembly, intersecting with graph alignment can be optional. While graph alignment still improves specificity, it may affect sensitivity a little - a classical sensitivity vs specificity problem.

Calling de novo SVs will be similar.

Large somatic SVs in tumor-only samples

minisv.js e -b data/hs38.cen-mask.bed tumor.hg38l.paf tumor.chm13l.paf \
  tumor.chm13g.paf | bash > sv.hg38l+tg.rsv
cat sv.hg38l+tg.rsv | sort -k1,1 -k2,2 -S4g | minisv.js merge - > sv.hg38l+tg.msv
minisv.js view -IC sv.hg38l+tg.msv

In the lack of the normal assembly in this case, pangenome graph alignment is critical for reducing alignment errors and for filtering out common germline SVs. There may be several hundred SVs called with this procedure. Most of the called SVs below 10 kb are rare germline events, not somatic. Nevertheless, most large chromosomal alterations, such as translocations, foldback inversions (tagged by foldback in the output) and events longer 100 kb, are likely somatic becase such large alterations rarely occur to germline.

Mosaic SVs

minisv.js e -0b data/hs38.cen-mask.bed aln.hg38l.paf aln.chm13l.paf \
  aln.chm13g.paf aln.self.paf | bash > sv.hg38l+tgs.rsv
cat sv.hg38l+tgs.rsv | sort -k1,1 -k2,2 -S4g | minisv.js merge -c2 -s0 - > sv.hg38l+tgs.msv

Having the phased sample assembly is critical to the calling of small mosaic SVs. If you do not have the assembly, please perform graph alignment. However, because there are more small rare SVs than small mosaic SVs, only large mosaic chromosomal alteration calls are reliable. If you have the assembly, graph alignment may still help specificity but may hurt sensitivity around VNTRs.

Comparing SVs

The eval command of minisv compares two or multiple SV callsets. To compare two callsets:

minisv.js eval -b data/hs38.reg.bed -l 100 call1.vcf call2.msv

where -l specifies the minimum SV length and -b specifies confident regions. The command line outputs TP, FN and FP. Minisv considers two SVs, S1 and S2, to be the same if both ends of S1 are within 500 bp from ends of S2 and the INDEL types of S1 and S2 are the same. Minisv compares all types of SVs that can be associated with two ends. You can also specify the minimum read support (-c) and the minimum SV length (-l) on the command line.

If three or more callsets are given on the command line, minisv will generate an output like:

SN  980     0.6091  0.8580  0.6135  0.9104  0.8754  C1
SN  0.0673  110     0.0762  0.1319  0.1119  0.0665  C2
SN  0.8408  0.6727  958     0.6411  0.9216  0.8469  C3
SN  0.1980  0.4000  0.2119  326     0.2313  0.2154  C4
SN  0.7071  0.5818  0.7359  0.6074  536     0.7012  C5
SN  0.8459  0.5636  0.8361  0.6442  0.8731  947     C6

where the diagonal gives the count of SVs and the number at row R and column C equals to the fraction of calls in R found in C. In this example, 84.59% of calls in C6 were found in C1 and 87.54% of calls in C1 found in C6. Generally, higher fraction on a row is correlated with higher sensitivity; higher fraction on a column is correlated with higher specificity for the caller on the column.

If you have 3+ callsets, you may also use option -M for evaluation in the consensus mode. In this mode, an FP is a call in callset C that is not found in another other callsets. Conversely, an FN is a call that is supported by two or more other callsets but is not called in C. Here is sample output:

RN  845  98   0.1160  C1
RP  980  49   0.0500  C1
RN  994  918  0.9235  C2
RP  110  18   0.1636  C2
...

Here the false negative rate (RN) for C1 is 11.6% and the false positive rate (RP) is 5.0% based on the definition above.

Minisv seamlessly parses the VCF format and the minisv format. It has been tested with Severus, Sniffles2, cuteSV, SAVANA, SVision-Pro, nanomonsv, SvABA and GRIPSS.

Limitations

  1. Minisv simply counts supporting reads to call an SV. More complex algorithms, such as phasing, VNTR reposition and machine learning, may improve its accuracy. On the other hand, matching the accuracy of other callers with such a simple algorithm implies the potential of our multi-reference approach.

  2. Minisv does not output genotypes. This makes it less useful for germline SV calling. On the HG002 benchmark data, minisv is close to but not as good as the best germline SV caller.

  3. For the best accuracy, minisv needs the alignment of all reads against multiple reference genomes or pangenome graphs. This greatly increases the running time. A better strategy is to only align reads with SVs to reduce alignment time. It is possible to build a pipeline on top of minisv but this has not been implemented.

About

Lightweight mosaic/somatic SV caller for long reads (WIP)

Resources

License

Stars

Watchers

Forks

Releases

No releases published