8. FAQs
Read mapping around repetitive elements of the genome can be inaccurate, potentially yielding false positive SVs around them. Conversely, some SVs are expected to appear around repeats (i.e. those derived from transposable element insertions). This means that the decision on how to handle SVs around such regions is not trivial. By default, perSVade runs RepeatModeler and RepeatMasker to annotate repetitive regions, and this annotation is used to assess whether removing SV calls overlapping these elements increases the overall accuracy. If so, they are removed.
This can be suboptimal for two reasons. First, the prediction of repeats is time-consuming for large genomes. Second, some repeat families (i.e. simple repeats, low complexity regions) may yield more false positive calls than others (i.e. large transposons), so treating all repetitive elements together may not always be the best option.
perSVade includes two options to circumvent this:
- Skipping the analysis of repeats (`--skip_repeat_analysis`).
- Providing a .tab file with manually-curated repeat regions to be discarded (`--previous_repeats_table`). Note that this option is also useful if you want to predict the repeats only once and use the resulting table for many samples with the same genome.
It is also possible to annotate whether small variants (when running with `run_smallVarsCNV`) are found around repeats (using `--consider_repeats_smallVarCall`). However, if you just want to get small variant calls with no repeat annotation, we recommend using `--skip_repeat_analysis` to avoid excessive, useless computation time.
perSVade crashed due to insufficient resources (allocated RAM or time). Do I have to repeat the whole run?
No worries. perSVade is designed to restart from the last correct step if you re-run the same command in the same output directory. This means that you can simply re-run on a system with more resources and expect the completed steps not to be repeated. In addition, there are some tips that can be useful:
- You may consider readjusting the balance between allocated threads (with `--threads`) and RAM (depends on the running system). Some programs used by perSVade require a minimum amount of RAM per thread, which depends on the input. We have developed perSVade to balance this automatically, but we can't guarantee that it will always work (especially if input genomes/reads are much larger than those tested).
- Another useful option to deal with memory errors is to specify the fraction of RAM that is allocated to a given perSVade run. In several steps, the pipeline needs to calculate the available memory (using the python command `psutil.virtual_memory()`) to allocate resources accordingly. This command returns all the available memory in the computer. Sometimes you may be running on a fraction of the computer's resources (for example, on a fraction of a complete node of a computing cluster), in which case the psutil calculation will overestimate the available RAM. If that is the case you can provide the available fraction through the argument `--fraction_available_mem`. By default, perSVade will calculate the available RAM by filling the memory, which may give errors. It is highly recommended that you provide this option. If you want to use all the allocated memory you should specify `--fraction_available_mem 1.0`.
- By default, perSVade only allocates 50% of the available RAM for some java processes. You can change this through the option `--fractionRAM_to_dedicate`. For example, `--fractionRAM_to_dedicate 0.8` would allocate 80% of the RAM.
- IMPORTANT NOTE: If you are running perSVade in MareNostrum4 or Nord3 you should skip the `--fraction_available_mem` argument, as perSVade already calculates it.
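For example, if you know how much RAM your job was actually granted (e.g. by a cluster scheduler), a minimal Python sketch like this shows how a value for `--fraction_available_mem` can be derived (the 64 Gb allocation is a hypothetical figure):

```python
import psutil

# total RAM that psutil sees on the machine (what perSVade would detect by default)
total_bytes = psutil.virtual_memory().total

# hypothetical RAM actually granted to this job, e.g. by a cluster scheduler
allocated_bytes = 64 * (1024**3)  # 64 Gb

# fraction to pass through --fraction_available_mem (capped at 1.0)
fraction_available_mem = min(1.0, allocated_bytes / total_bytes)
print(round(fraction_available_mem, 3))
```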
perSVade generates a single .tab file for each type of SV (inversions, deletions, tandemDuplications, insertions, translocations and unclassified SVs) in the folder `SVdetection_output/final_gridss_running`. Each of these has a unique set of columns that represent the variants. This is the meaning of these columns (note that text in "" indicates the column names of the corresponding .tab files):
- Simple SVs: deletions, inversions and tandemDuplications (duplication of a region which gets inserted next to the affected region). These are described by a chromosome ("Chr"), "Start" and "End" coordinates of the SV. perSVade outputs one .tab file for each of these SV types.
- Insertions: a region of the genome (indicated by "ChrA", "StartA", "EndA") is copied ("Copied" is TRUE) or cut ("Copied" is FALSE) and inserted into another region (indicated by "ChrB", "StartB"). "EndB" comes from adding to "StartB" the length of the inserted region. There is one .tab file for insertions.
- Translocations: balanced translocations between two chromosomes ("ChrA" and "ChrB"). "StartA" indicates the start of "ChrA" (position 0). "EndA" indicates the position in "ChrA" where the translocation happens. For non-inverted translocations, "StartB" is 0 and "EndB" is the position in "ChrB" where the translocation happens. For inverted translocations, "StartB" is the position in "ChrB" where the translocation happens and "EndB" is the last position of "ChrB". There is one .tab file for translocations.
- Unclassified SVs: there is one .tab file ("unclassified_SVs.tab") that reports all the variants that are called by clove and cannot be assigned to any of the above SV types. These include unclassified breakpoints (which could be part of unresolved/unknown complex variants) and complex inverted SVs (which are non-standard SVs). These types of SVs are not included in the simulations, so the accuracy on them is unknown. This is why we group them together into a single file. For unclassified breakpoints, the "SVTYPE" indicates the orientation of the two breakends (there are 4 possible orientations, and the "SVTYPE" is different depending on whether the two breakends are in the same chromosome or not). "#CHROM" - "POS" indicate one breakend and "CHR2" - "END" the other. "START" is -1 for such unclassified breakpoints. Complex inverted SVs represent variants where a region (indicated by "CHR2", "START", "END") is copied ("SVTYPE" is CVD) or cut ("SVTYPE" is IVT (different chromosomes)), inverted and inserted into "#CHROM"-"POS". Complex inverted translocations ("SVTYPE" is CVT) are variants where a region of a chromosome (defined by "#CHROM", "POS" and "END") is cut, inverted and inserted right at the 3' end.
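As a quick illustration of how these tables can be inspected, here is a minimal pandas sketch (the file name `deletions.tab` and the `<outdir>` placeholder are assumptions; adapt them to the actual files in `SVdetection_output/final_gridss_running`):

```python
import pandas as pd

# path to one of the per-SV-type tables written by perSVade (file name assumed)
deletions_file = "<outdir>/SVdetection_output/final_gridss_running/deletions.tab"

# simple SVs (deletions, inversions, tandemDuplications) are described by "Chr", "Start" and "End"
df_del = pd.read_csv(deletions_file, sep="\t")
df_del["length"] = df_del["End"] - df_del["Start"]
print(df_del[["Chr", "Start", "End", "length"]].head())
```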
perSVade's parameter optimisation is too slow. I just want to run SV calling with some pre-defined filters. How can I do this?
perSVade includes an option to skip the parameter optimisation: `--fast_SVcalling`. This uses some default parameters for the filtering. You can also customize the filters by providing a .json file with them through `--parameters_json_file`. The default parameters encoded as .json are in `misc/default_perSVade_parameters.json`, and this file can be used as a template to provide custom parameters with `--parameters_json_file`.
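For instance, this is a minimal sketch of how the shipped template could be copied and edited to build a custom parameters file (the actual parameter names inside the .json are not listed here, so the commented key is hypothetical):

```python
import json

# load the default parameters shipped with perSVade as a template
with open("misc/default_perSVade_parameters.json") as f:
    parameters = json.load(f)

# adjust whichever filters you want to change; the key below is hypothetical,
# so inspect the loaded dictionary to see the real parameter names
# parameters["some_filter_name"] = 0.8

# write the customized file, to be passed through --parameters_json_file
with open("my_perSVade_parameters.json", "w") as f:
    json.dump(parameters, f, indent=4)
```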
This is a vcf that contains all called SVs and CNVs, in a way that is focused on how each SV affects particular regions of the genome. This is useful for further functional annotation. Each SV can be split across multiple rows, as it can affect several regions of the genome. All the rows that are related to the same SV are identified by the field `variantID` in `INFO`. On top of this, each row has a unique identifier indicated by the field `ID`. Some SVs generate de novo-inserted sequences around the breakends, and each of these de novo insertions is represented in a single row. Note that each of the rows may indicate a region under CNV (with the `SVTYPE` in `INFO` as `DEL`, `DUP` or `TDUP`), a region with some rearrangement (with the `SVTYPE` in `INFO` as `BND`) or a region with a de novo insertion (with the `SVTYPE` in `INFO` as `insertionBND`). These three types of regions can be interpreted by the Ensembl Variant Effect Predictor (VEP, also implemented in perSVade) for functional annotation. The fields `#CHROM` and `POS` indicate the position of the SV, and the field `END` from `INFO` indicates the end of a region under SV when applicable.
More precisely, this is how each of the SV types is represented (check the FAQ How are the SVs encoded into single files? to understand what the fields in `<>` mean):
- Each deletion or tandem duplication is represented with a `variantID=DEL|<Chr>:<Start>-<End>` or `variantID=TDUP|<Chr>:<Start>-<End>`, respectively. There may be up to three rows related to each of these variants:
    - One row for the deleted/duplicated region (for deletions this would have an ID `DEL|CNV|<Chr>:<Start>-<End>`). The fields `POS` and `END` in `INFO` indicate the deleted/duplicated region. The `SVTYPE` from `INFO` is `DEL` or `TDUP`.
    - Up to two rows for the de novo insertions around each of the breakends (for deletions there may be one row with an ID `DEL|insertion-<Chr>-<Start>|<Chr>:<Start>-<End>` and another with an ID `DEL|insertion-<Chr>-<End>|<Chr>:<Start>-<End>`). The `SVTYPE` from `INFO` is `insertionBND`.
- Each inversion is represented with a `variantID=INV|<Chr>:<Start>-<End>`. There may be several rows related to each inversion:
    - One row for each of the breakends of the inverted region (with IDs `INV|BND-<Chr>-<Start>|<Chr>:<Start>-<End>` and `INV|BND-<Chr>-<End>|<Chr>:<Start>-<End>`). The `SVTYPE` from `INFO` is `BND`.
    - One row for each of the de novo insertions around the breakends that generate this inversion. Each of them has an ID `INV|insertion-<#CHROM>-<POS>|<Chr>-<Start>:<End>`, where `<#CHROM>` and `<POS>` are equal to the equivalent vcf fields. The `SVTYPE` from `INFO` is `insertionBND`.
- Each copy-paste insertion is represented with a `variantID=INS|<ChrA>:<StartA>-<EndA>|<ChrB>:<StartB>|copyPaste`. There are several rows related to each insertion:
    - One row for the duplicated region (in ChrA), with an ID `INS|CNV|<ChrA>:<StartA>-<EndA>|<ChrB>:<StartB>|copyPaste`. The fields `POS` and `END` in `INFO` indicate the copied region. The `SVTYPE` from `INFO` is `DUP`.
    - One row for the insertion position (in ChrB), with an ID `INS|BND-<ChrB>-<StartB>|<ChrA>:<StartA>-<EndA>|<ChrB>:<StartB>|copyPaste`. The field `POS` indicates the insertion position. The `SVTYPE` from `INFO` is `BND`.
    - One row for each of the de novo insertions around the breakends that generate this insertion. Each of them has an ID `INS|insertion-<#CHROM>-<POS>|<ChrA>:<StartA>-<EndA>|<ChrB>:<StartB>|copyPaste`, where `<#CHROM>` and `<POS>` are equal to the equivalent vcf fields. The `SVTYPE` from `INFO` is `insertionBND`.
- Each cut-paste insertion is represented with a `variantID=INS|<ChrA>:<StartA>-<EndA>|<ChrB>:<StartB>|cutPaste`. There are several rows related to each insertion:
    - One row for each of the breakends of the cut region (in ChrA), with IDs `INS|BND-<ChrA>-<StartA>|<ChrA>:<StartA>-<EndA>|<ChrB>:<StartB>|cutPaste` and `INS|BND-<ChrA>-<EndA>|<ChrA>:<StartA>-<EndA>|<ChrB>:<StartB>|cutPaste`. The `SVTYPE` from `INFO` is `BND`.
    - One row for the insertion position (in ChrB), with an ID `INS|BND-<ChrB>-<StartB>|<ChrA>:<StartA>-<EndA>|<ChrB>:<StartB>|cutPaste`. The field `POS` indicates the insertion position. The `SVTYPE` from `INFO` is `BND`.
    - One row for each of the de novo insertions around the breakends that generate this insertion. Each of them has an ID `INS|insertion-<#CHROM>-<POS>|<ChrA>:<StartA>-<EndA>|<ChrB>:<StartB>|cutPaste`, where `<#CHROM>` and `<POS>` are equal to the equivalent vcf fields. The `SVTYPE` from `INFO` is `insertionBND`.
- Each balanced translocation is represented with a `variantID=TRA|<ChrA>:<StartA>-<EndA><><ChrB>:<StartB>-<EndB>`. There are several rows related to each translocation:
    - One row for each of the breakend regions, with IDs `TRA|BND-<ChrA>-<EndA>|<ChrA>:<StartA>-<EndA><><ChrB>:<StartB>-<EndB>` for the ChrA breakend and `TRA|BND-<ChrB>-<POS>|<ChrA>:<StartA>-<EndA><><ChrB>:<StartB>-<EndB>` for the ChrB breakend. Note that the `<POS>` is equivalent to the vcf field. The `SVTYPE` from `INFO` is `BND`.
    - One row for each of the de novo insertions around the breakends that generate this translocation. Each of them has an ID `TRA|insertion-<#CHROM>-<POS>|<ChrA>:<StartA>-<EndA><><ChrB>:<StartB>-<EndB>`, where `<#CHROM>` and `<POS>` are equal to the equivalent vcf fields. The `SVTYPE` from `INFO` is `insertionBND`.
- Each unclassified breakpoint is represented with a `variantID=<breakpoint_SVTYPE>like|<#CHROM>:<POS>-<CHR2>:<END>`, where `<breakpoint_SVTYPE>` can be DEL, TAN, INV1, INV2, ITX1, ITX2, INVTX1 or INVTX2. There are several rows related to each unclassified breakpoint:
    - One row for each of the breakend regions, with IDs `<breakpoint_SVTYPE>like|BND-<#CHROM>-<POS>|<#CHROM>:<POS>-<CHR2>:<END>` for the `#CHROM` breakend and `<breakpoint_SVTYPE>like|BND-<CHR2>-<END>|<#CHROM>:<POS>-<CHR2>:<END>` for the `CHR2` breakend. The `SVTYPE` from `INFO` is `BND`.
    - One row for each of the de novo insertions around the breakends that generate this breakpoint. Each of them has an ID `<breakpoint_SVTYPE>like|insertion-<#CHROM>-<POS>|<breakpoint_SVTYPE>like|<#CHROM>:<POS>-<CHR2>:<END>`, where `<#CHROM>` and `<POS>` are equal to the equivalent vcf fields. The `SVTYPE` from `INFO` is `insertionBND`.
- The complex inverted SVs are represented with a `variantID=<varID>|<CHR2>:<START>-<END>|<#CHROM>:<POS>`, where `<varID>` can be CVD (inverted intrachromosomal duplication), CVT (inverted intrachromosomal cut-paste insertion) or IVT (inverted interchromosomal cut-paste insertion). Each of them is split across multiple vcf rows:
    - CVDs are encoded as copy-paste insertions.
    - IVTs are encoded as cut-paste insertions.
    - CVTs are encoded as unclassified breakpoints, but with a `CVT` tag instead of `<breakpoint_SVTYPE>like`. This means that there is one breakend for each end of the CVT.
- The coverage-based deletion/duplication calls are represented with a `variantID=coverageDUP|CNV|<#CHROM>:<POS>-<END>` or `variantID=coverageDEL|CNV|<#CHROM>:<POS>-<END>`, respectively. There is one row for each variant. The `SVTYPE` from `INFO` is `DUP` or `DEL`. The range between `POS` and `END` indicates the span of the duplication/deletion. The field `merged_relative_CN` from `INFO` indicates the most conservative relative copy number state (closest to 1.0) of this region as called by the different CNV callers. In a diploid organism, a merged_relative_CN of 0.5 would indicate monosomy and 1.5 trisomy of the indicated region. The field `median_coverage_corrected` indicates the relative coverage (where 1.0 means a coverage equal to the median of the genome) after correction in the coverage-based CNV calling pipeline.
Other important remarks and fields:
- In SVs, the field `BREAKEND_real_AF` from `INFO` indicates the fraction of reads mapped to that position that support each of the breakends (indicated by `BREAKENDIDs` from `INFO`) that form the SV in the corresponding row. In the case of `DUP`, `DEL` and `TDUP` records, this is a comma-separated string. In the case of `BND` or `insertionBND` records this is a float, as there is only one breakend. This value can give you a rough idea of the genotype of the SV. However, note that an SV can have both homozygous and heterozygous breakends at the same time. SV genotyping is still an unsolved problem. Check https://github.com/PapenfussLab/gridss/wiki/GRIDSS-Documentation#calculating-variant-allele-fraction-vaf for more information on how gridss calculates these variant allele frequencies. In order to generate higher-confidence calls it may be interesting to filter out SVs that have low values of BREAKEND_real_AF (i.e. filter out SVs where none of the breakends has BREAKEND_real_AF>0.1).
- Coverage-based CNV calling is error-prone, so it is advisable to filter out CNVs where `median_coverage_corrected` and `merged_relative_CN` do not match (i.e. duplications (merged_relative_CN>1) with a median_coverage_corrected around 1).
- The field `RELCOVERAGE` from `INFO` indicates the relative coverage normalised by the median coverage across windows of the genome for the region under CNV in tandem duplications, deletions, copy-paste insertions and coverage-inferred CNVs. The fields `RELCOVERAGE_TO_5` and `RELCOVERAGE_TO_3` from `INFO` indicate the relative coverage normalised by the coverage of the 5' or 3' region, respectively. These values can be used as an additional quality control for the called CNVs.
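For example, this is a minimal sketch of how the BREAKEND_real_AF filter suggested above could be applied by parsing the vcf directly (the path to `SV_and_CNV_variant_calling.vcf` is an assumption; adapt it to your output directory):

```python
from collections import defaultdict

vcf_file = "<outdir>/SVcalling_output/SV_and_CNV_variant_calling.vcf"  # path assumed

# collect, for each variantID, the maximum BREAKEND_real_AF across all its rows
max_af_per_variant = defaultdict(float)
with open(vcf_file) as f:
    for line in f:
        if line.startswith("#"):
            continue
        info = dict(x.split("=", 1) for x in line.rstrip("\n").split("\t")[7].split(";") if "=" in x)
        if "variantID" not in info or "BREAKEND_real_AF" not in info:
            continue
        try:
            # comma-separated string for DEL/DUP/TDUP records, a single float for BND/insertionBND
            afs = [float(x) for x in info["BREAKEND_real_AF"].split(",")]
        except ValueError:
            continue
        variant_id = info["variantID"]
        max_af_per_variant[variant_id] = max(max_af_per_variant[variant_id], max(afs))

# keep only SVs where at least one breakend has BREAKEND_real_AF > 0.1
confident_SVs = {v for v, af in max_af_per_variant.items() if af > 0.1}
print(len(confident_SVs), "SVs pass the BREAKEND_real_AF > 0.1 filter")
```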
This is not a trivial task, particularly for SVs and CNVs, where the same variant may be represented slightly differently in two different perSVade runs. For example, we generally consider that two SVs of a given type (i.e. a deletion) are the same if they overlap by >75% and have their breakends <50 bp apart. We have developed some python functions that apply this definition of "same variant" across different samples.
These functions are in `./scripts/sv_functions.py` (see Quick start for an example of how to import this script as a module).
Below are some key functions:
- `sv_functions.get_integrated_small_vars_df_severalSamples` is a function to integrate the files from the small variant and per-gene coverage analyses into single files for all samples. These are the generated files:
    - `merged_coverage.tab` includes the stacked `smallVars_CNV_output/CNV_results/genes_and_regions_coverage.tab` files, where the column `sampleID` indicates the sample ID.
    - `smallVars.tab` includes the stacked `smallVars_CNV_output/variant_calling_ploidy2.tab` files, where the column `sampleID` indicates the sample ID.
    - `smallVars_annot.tab` includes all the annotated variants (from the `smallVars_CNV_output/variant_annotation_ploidy2.tab` files).
- `sv_functions.get_integrated_CNperWindow_df_severalSamples` stacks all the `CNV_calling/final_df_coverage.tab` files across different samples. It generates the `integrated_CNperWindow.tab` file, where the column `sampleID` indicates the sample ID.
- `sv_functions.get_integrated_SV_CNV_df_severalSamples` integrates the SV and coverage-based CNV calling and annotation files for several samples and provides some information about the overlaps between samples. It requires a valid .gff file. These are the generated files:
    - `SV_CNV.simple.tab` includes the stacked `SVcalling_output/SV_and_CNV_variant_calling.tab` files, where the column `sampleID` indicates the sample ID.
    - `SV_CNV_annot.tab` includes the variant annotations as in `SVcalling_output/SV_and_CNV_variant_calling.tab_annotated_VEP.tab` for all samples. This table also includes the fields `is_protein_coding_gene` (a boolean indicating whether the variant overlaps a protein-coding gene) and `is_transcript_disrupting` (a boolean indicating whether the variant disrupts a transcript). Beware that duplications are also considered to be 'transcript-disrupting'.
    - `SV_CNV.tab` is similar to `SV_CNV.simple.tab`, but it includes relevant information to compare the variants across samples. These are some useful extra fields found in this table, for each variant:
        - `overlapping_samples_byBreakPoints_allCalled`: a comma-separated string with the sample IDs where gridss found at least one breakpoint overlapping the variant. This can be considered a false positive-prone estimator of samples that share this variant, as some of the breakpoints may not be real. This field is only meaningful for SVs called by gridss and clove.
        - `overlapping_samples_byBreakPoints_PASS`: a comma-separated string with the sample IDs where gridss found at least one breakpoint (passing the filters of gridss as defined by perSVade) overlapping the variant. This can be considered a false negative-prone estimator of samples that share this variant, as some breakpoints may be missed because only high-confidence breakpoints are considered to compute the overlaps. This field is only meaningful for SVs called by gridss and clove.
        - `overlapping_samples_CNV_atLeast_<min_pct_overlap>` (where `<min_pct_overlap>` may be 0.25, 0.5, 0.75 or 0.9): in coverage-based CNVs, a comma-separated string with the sample IDs that may have the same CNV. A CNV is considered to be found in another sample if it has an equal or more extreme (farther from 1.0) copy number (CN) in at least `<min_pct_overlap>` of the windows of the given CNV. For example, a duplication in a region of sample A is considered to also be found in sample B (according to `overlapping_samples_CNV_atLeast_0.75`) if at least 75% of the windows of that same region in sample B have a CN above or equal to the CN in A. This field is only meaningful for coverage-based CNVs.
        - `variantID_across_samples`: an identifier of each SV/CNV that is unique across different samples. By default, two SVs of a given type (i.e. a deletion) are considered the same if they overlap by >75% and have their breakends <50 bp apart.
Other remarks:
- You can type `help(<function_name>)` from a python terminal to get more details about the arguments of these functions.
- All these functions take an argument called `paths_df`. This should be a table (either a pandas dataframe object or a tab-delimited file with header names) with the columns `sampleID` (a unique identifier of each sample) and `perSVade_outdir` (the path to the perSVade output directory of that sample).
- We recommend using the fields `overlapping_samples_byBreakPoints_allCalled`, `overlapping_samples_byBreakPoints_PASS` and `overlapping_samples_CNV_atLeast_<min_pct_overlap>` if you are interested in filtering out variants in a given sample that are also found in other 'background' samples. For example, using `overlapping_samples_byBreakPoints_allCalled` and `overlapping_samples_CNV_atLeast_0.25` as the overlapping fields would be a conservative way to ensure that the variant is not in any of the 'background' samples.
- The field `variantID_across_samples` is useful for analyses where you want to work with variants shared across different samples (i.e. hierarchical clustering of samples by their variants).
This is indeed a bit tricky. Here we provide an example of running these functions and processing the outputs using pandas dataframes. The example is based on an imaginary dataset where we called variants (both SVs and coverage-based CNVs) for one drug-resistant sample (called "R") and two susceptible samples (called "S1" and "S2"). This would be a workflow for finding variants that are found in "R" but not in "S1" or "S2" and may drive the resistance:
```python
# load the functions
import pandas as pd
import sys
sys.path.insert(0, "<perSVade_dir>/scripts")
import sv_functions as fun

# define an outdir
integrated_files_dir = "./integrated_files"

# define a dataframe with the paths to the perSVade output directories
paths_df = pd.DataFrame({"sampleID" : ["R", "S1", "S2"],
                         "perSVade_outdir": ["./outdir_R", "./outdir_S1", "./outdir_S2"]})

# get a file with the Copy Number (CN) per window, which is necessary to compute the overlaps between CNV calls
integrated_CNperWindow_file = fun.get_integrated_CNperWindow_df_severalSamples(paths_df, integrated_files_dir, threads=4)

# generate files with the integrated SVs and CNVs. This requires a variant annotation file
fun.get_integrated_SV_CNV_df_severalSamples(paths_df, integrated_files_dir, <gff>, <reference_genome>, threads=4, integrated_CNperWindow_file=integrated_CNperWindow_file, tol_bp=50, pct_overlap=0.75, add_overlapping_samples_eachSV=True)

# import the generated df with the variants. This is equivalent to the "SV_and_CNV_variant_calling.vcf" but with INFO fields split across different columns and some extra fields.
df_vars = pd.read_csv("./integrated_files/SV_CNV.tab", sep="\t")

# keep vars that are in "R" and in none of the background samples
def get_df_vars_r_in_bgSamples(r, bg_samples):
    # samples where gridss found at least one breakpoint overlapping this variant
    samples_overlapping_by_breakpoints = set(str(r["overlapping_samples_byBreakPoints_allCalled"]).split(","))
    # samples where >=25% of the windows of this CNV have an equal or more extreme copy number
    samples_overlapping_by_CN = set(str(r["overlapping_samples_CNV_atLeast_0.25"]).split(","))
    all_samples_overlapping = samples_overlapping_by_breakpoints.union(samples_overlapping_by_CN)
    return len(bg_samples.intersection(all_samples_overlapping))>0

df_vars_R = df_vars[(df_vars.sampleID=="R") & ~(df_vars.apply(get_df_vars_r_in_bgSamples, bg_samples={"S1", "S2"}, axis=1))]

# print the variants, using the IDs that are unique across samples
print(set(df_vars_R.variantID_across_samples))
```
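As a follow-up to the example above, this is one way to use `variantID_across_samples` to build a sample-by-variant presence/absence matrix (reusing the `df_vars` dataframe), e.g. as input for hierarchical clustering:

```python
# build a samples x variants presence/absence matrix using the IDs that are unique across samples
presence_matrix = (df_vars
                   .assign(present=1)
                   .pivot_table(index="sampleID",
                                columns="variantID_across_samples",
                                values="present",
                                aggfunc="max",
                                fill_value=0))

print(presence_matrix.shape)  # (number of samples, number of distinct variants)
```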
The `singularity build` command writes files into a cache dir (`$HOME/.singularity/cache` by default) and a temporary dir (`/tmp` by default). As perSVade's image is large, you may run into storage problems if `$HOME` or `/tmp` do not have enough disk quota. You can change these directories through environment variables before running singularity (i.e. with `export SINGULARITY_CACHEDIR=<cachedir>` and `export SINGULARITY_TMPDIR=<tmpdir>`). `<cachedir>` and `<tmpdir>` should be the redefined paths with enough disk space. You can find more information in https://sylabs.io/guides/3.6/user-guide/build_env.html. Note that `<tmpdir>` should have full permissions (given with `sudo chmod 777 <tmpdir>`).
Most of them are written under the provided `--outdir`, but some are also written under `$HOME/.perSVade_dir` (which can be removed at any time). You can specify a `PERSVADE_TMPDIR` environment variable to avoid writing under `$HOME/.perSVade_dir` (i.e. with `export PERSVADE_TMPDIR=/tmp/perSVade_tmpdir`).
The following plots may be useful to orient yourself:
We tested perSVade (only the SV calling pipeline, providing aligned reads as input) on six eukaryotes (three samples per species) using either no parameter optimization (gray) or different types of simulations (black, red, blue) on a machine with 16 cores. Shown are the running time and maximum RAM used. Note that this ignores the resources related to read alignment (which was performed independently). The x axes reflect the reference genome size (left) and the number of mapped read pairs (right), which are correlated with resource consumption. These data may be useful to allocate computational resources for running perSVade.
Of note, perSVade was run with different parameters for the human datasets to avoid excessive resource consumption. First, we skipped the marking of duplicate reads on the .bam files with perSVade’s --skip_marking_duplicates option. Second, we ran the simulations on a subset of the genome (only chromosomes 2, 7, 9, X, Y and mitochondrial). Third, we skipped the ‘homologous’ simulations in human samples due to excessive memory consumption.
This is a cartoon of the process:
This is the core, most novel function of perSVade. There are two main steps:
- Choosing regions of the genome for simulations of SVs. These can be either regions with previously known SVs, regions with pairwise homology (inferred with blastn) or randomly selected regions. In order to use regions with known SVs, you can either provide them (through `--real_bedpe_breakpoints`) or let perSVade infer them from other sequencing datasets (with `--close_shortReads_table`). If specified, `--close_shortReads_table` should be a table with the paths to sequencing datasets of the same (or a closely related) species. perSVade runs the SV calling pipeline with default parameters on these datasets and generates a file with these 'known regions' with SVs. Note that all these options are mutually exclusive. By default, perSVade runs on random regions.
- Choosing a set of optimum parameters through simulations. perSVade generates two simulated genomes (tunable with `--nsimulations`) with 50 SVs of each type (tunable through `--nvars`) based on the reference genome and the provided regions. There are two simulated genomes for each of the desired ploidies, tunable through `--simulation_ploidies`. If `--simulation_ploidies` is not provided, perSVade relies on the `--ploidy` parameter (i.e. if `--ploidy` is 2 the simulated genomes will have only heterozygous variants). For each simulated genome perSVade simulates reads with the same insert size, coverage and read length as the input reads. Then it aligns the reads and runs gridss to obtain a list of 'raw breakpoints'. perSVade then tries several combinations of filters on them (by default >278,000,000, tunable through `--range_filtering_benchmark`) to generate many sets of 'filtered breakpoints'. Each of these is processed with clove to generate a set of 'raw SVs'. perSVade next tries several combinations of filters on each of them to get a set of filtered SVs. These are compared against the true set of SVs (inserted in the simulated genome) to calculate the accuracy (F-value) of each combination of gridss and clove filters on each simulated genome, ploidy and SV type. These filters are optimised for each simulation, and may thus not be accurate on independent sets of SVs (due to overfitting). In order to reduce this effect, perSVade tests how each of these 'best' filters performs on all simulations, ploidies and SV types (not only on those that yielded the given filters as optimum). The heatmap shows the F-value for an example sample, where the filters in the first row are accurate on all simulations (indicating that there is no overfitting on them) and are thus chosen as the final set of best parameters. perSVade writes the accuracy of these best parameters into `<outdir>/SVdetection_output/parameter_optimisation/benchmarking_all_filters_for_all_genomes_and_ploidies/df_cross_benchmark_best.tab`, which will allow you to understand how much you can trust the results.
These optimised filters (or parameters) are then used for calling SVs on the real data. They are stored in `<outdir>/SVdetection_output/final_gridss_running/perSVade_parameters.json`.
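A minimal sketch for inspecting both files after a run (the paths are those quoted above, relative to your `--outdir`; the columns of the accuracy table are not detailed here, so the sketch just prints them):

```python
import json
import pandas as pd

outdir = "<outdir>"  # your perSVade output directory

# the optimised gridss/clove filters used for the real SV calling
with open(outdir + "/SVdetection_output/final_gridss_running/perSVade_parameters.json") as f:
    best_parameters = json.load(f)
print(best_parameters)

# accuracy of these parameters across simulations, ploidies and SV types
accuracy_df = pd.read_csv(
    outdir + "/SVdetection_output/parameter_optimisation/"
             "benchmarking_all_filters_for_all_genomes_and_ploidies/df_cross_benchmark_best.tab",
    sep="\t")
print(accuracy_df.head())
```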
The following plot is an adapted example (for the Candida albicans sample SRR538786) of how perSVade graphically reports the SV calling accuracy for any given input (as in the output folder `parameter_optimisation/plots/cross_accuracy_heatmaps/`):
In order to find the best parameters for SV calling, perSVade generates two simulated genomes (with simName ‘simulation_1’ or ‘simulation_2’) with up to 50 SVs of each type and finds the parameters that yield the highest F-value for each simulated genome and SV type (referred to in the rows as ‘training’ parameters). In order to reduce overfitting, perSVade evaluates how each of these training parameters works on the other simulations and SV types (referred to in the columns as ‘testing’ simulations). Overfitted parameters should perform well on their origin simulations, but not necessarily on the others. The heatmap shows the F-value of these evaluations, hereafter referred to as ‘testing instances’. The parameters chosen for the SV calling on the real data (which work as well as possible on all simulations and SV types) can be identified because each cell has the F-value printed inside. In addition, perSVade saves the accuracy of these chosen parameters in a .tab file. The ‘*’ indicate testing instances where the training and testing are from the same SV type. The ‘=’ indicate testing instances where the training and testing are from the same SV type and simulation (same simName). Note that perSVade may generate <50 SVs for some types (i.e. it may be impossible to find so many balanced translocations for some genomes). The testing instances with <10 SVs of a given type have a small number inside indicating the number of SVs (i.e. in this example there are 6 translocations in ‘simulation_1’). In addition, perSVade can generate simulations with different ploidies (i.e. haploid or diploid genomes), which is indicated in the ‘ploidy’ line. The example only includes diploid heterozygous (‘diploid_hetero’) variants, which is the default behavior for diploid species like C. albicans.
Based on the findings described in the paper of perSVade, we propose the following recommendations for a cost-effective usage of perSVade:
- For SV calling on many datasets of one species with similar properties (similar coverage, read length and insert size), run perSVade using ‘random’ simulations on one sample, and use the optimized parameters for the other samples (skipping optimization). The reported calling accuracy may be overestimated since the simulations are not realistic, but the chosen parameters are expected to be optimal.
- If you are interested in understanding the real SV calling accuracy, run perSVade on realistic simulations (‘homologous’ or ‘known’), which may report an accuracy that is closer to the real one.
The CNV calling relies on the cylowess package. If you installed perSVade with the traditional installation, it should be installed through the `./installation/setup_environment.sh` script. However, we have seen that this does not always work, and you may get an error while trying to import cylowess. Make sure that cylowess is properly installed in perSVade's main conda environment by following these commands (making sure that you use the `python` interpreter of the conda environment):
```bash
git clone https://github.com/histed/lowess-work-carljv
cd lowess-work-carljv
python setup.py install
```
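To verify that the installation worked, a quick check from that same `python` interpreter (the module name `cylowess` is the one referred to above):

```python
# run this with the python interpreter of perSVade's main conda environment
import cylowess
print(cylowess.__file__)  # should point to a file inside the conda environment
```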
If your reference genome has mitochondrial chromosomes you should specify them with `--mitochondrial_chromosome`. For example, if there are two mtDNA chromosomes called 'chr_mito1' and 'chr_mito2' you should provide the argument `--mitochondrial_chromosome chr_mito1,chr_mito2`. If there are no mitochondrial chromosomes you should state `--mitochondrial_chromosome no_mitochondria`.
Note that mtDNA chromosomes are treated differently for SV calling, CNV calling and variant annotation. It is thus mandatory that you provide the mtDNA chromosomes.
perSVade has an overwhelming amount of options and I fear that my command is wrong. How can I check this without running the whole pipeline?
perSVade includes an argument (`--downsampled_coverage`) to use only a random fraction of the input sequencing data (either reads or a sorted bam). You can test that your perSVade command works with such a fraction of the data, which will run much faster. For example, if you add `--downsampled_coverage 10.0` to your command, perSVade will run on a subset of the data that yields an average coverage of 10x. We note that setting very low values (i.e. `--downsampled_coverage 0.1`) may result in errors, because some regions of the genome won't be covered at all or because gridss may not find any breakpoints.
In addition, you can run the parameter optimization for SV calling only on a fraction of chromosomes (i.e. with `--simulation_chromosomes chr1,chr2,chr3`) to speed up the process.
Note that these options can be useful beyond testing purposes. For example, you may use `--downsampled_coverage` to infer the SV calling accuracy on datasets with a given coverage. We also used `--simulation_chromosomes` to optimize perSVade for the human genome (see this).