Skip to content

8. FAQs

MikiSchikora edited this page Nov 17, 2021 · 27 revisions

How are repetitive elements considered in perSVade and why is it important?

Read mapping around repetitive elements of the genome can be inaccurate, potentially yielding false positive SVs around them. Conversely, there are some SVs expected to appear around repeats (i.e.: those derived from transposable elements insertions). This means that the decision on how to handle SVs around such regions is not trivial. In addition, it may be interesting to know which variants (i.e. SVs or small variants) are overlapping repeats for downstream analyses. PerSVade includes the 'infer_repeats' module to find repeats in a genome with RepeatModeler and RepeatMasker. You can input these repeats to several modules (with --repeats_file), which will have the following effects:

  • If repeats are provided, 'optimize_parameters' will assess whether removing SV calls overlapping repeats increases the overall accuracy. If so, the resulting optimized parameters will include a 'filter_overlappingRepeats : True'. If you use these optimized parameters in 'call_SVs', any breakpoint overlapping repeats will be removed.

  • If repeats are provided, 'call_SVs' may filter out SVs that overlap repeats if the SV filtering parameters include a 'filter_overlappingRepeats : True'.

  • If repeats are provided, 'find_known_SVs' will pass them to the 'call_SVs' module.

  • If repeats are provided, 'integrate_SV_CNV_calls' will add a field in the INFO which indicates whether the SVs overlap repeats.

  • If repeats are provided, 'call_small_variants' will add a field in the tabular variant calling file which indicates whether the SVs overlap repeats.

The consideration of repeats can be suboptimal for two reasons. First, the prediction of repeats is time-consuming for large genomes. Second, some repeat families (i.e.: simple repeats, low complexity regions) may yield more false positive calls than others (i.e.: large transposons), so that treating all repetitive elements together may not be always the best option. perSVade includes two options to circumvent this:

  • Skipping the analysis of repeats (--repeats_file skip).

  • Instead of providing the default output of 'infer_repeats', you can filter out the regions to be considered as 'repeats'. For example you could just keep simple repeats and low complexity regions in the 'repeats file' provided to --repeats_file.

Note that --repeats_file is a mandatory argument in all these modules. We think that it is important that each user decides how repeats should be considered in their datasets.

How does the --fraction_available_mem work?

This pipeline calculates the available RAM for several steps (using the python library psutil.virtual_memory()). psutil calculates all the available RAM in the computer. If you are running on a fraction of the computers' resources (i.e. in an HPC cluster), psutil may overestimate the available RAM. If that is the case, '--fraction_available_mem' allows you to correct this overestimation. For example, if you are using 16 cores of an HPC node that has 32 cores (at 2Gb/core), psutil may (wrongly) calculate ~64 Gb of RAM available (the whole node). To solve this you can provide '--fraction_available_mem 0.5' to use only half of the memory calculate by psutil. Note that you may need to find how psutil works on your system to adequately set --fraction_available_mem. If --fraction_available_mem is not provided (default behavior), this pipeline will calculate the available RAM by filling the memory, which may give errors. It is thus highly reccommended that you provide this option. If you want to use all the allocated memory you should specify --fraction_available_mem 1.0.

perSVade crashed due to insufficient resources (allocated RAM or time), do I have to repeat all the running?

No worries. perSVade is designed to restart from the last correct step if you re-run with the same command in the same output directory. This means that you can simply re-run on a system with higher resources and expect no steps to be repeated. In addition, there are some tips that can be useful:

  • You may consider readjusting the balance between allocated threads (with --threads) and RAM (depends on the running system). Some programs used by perSVade require a minimum amount of RAM/thread, which depends on the input. We have developed perSVade to balance this automatically, but we can't guarantee that it will always work (specially if input genomes/reads are much larger than those tested).

  • Another option that can be useful to deal with memory errors is to specify the fraction of RAM that is being allocated to a given perSVade run. This pipeline calculates the available RAM for several steps (using the python library psutil.virtual_memory()). psutil calculates all the available RAM in the computer. If you are running on a fraction of the computers' resources (i.e. in an HPC cluster), psutil may overestimate the available RAM. If that is the case, '--fraction_available_mem' allows you to correct this overestimation. For example, if you are using 16 cores of an HPC node that has 32 cores (at 2Gb/core), psutil may (wrongly) calculate ~64 Gb of RAM available (the whole node). To solve this you can provide '--fraction_available_mem 0.5' to use only half of the memory calculate by psutil. Note that you may need to find how psutil works on your system to adequately set --fraction_available_mem. If --fraction_available_mem is not provided (default behavior), this pipeline will calculate the available RAM by filling the memory, which may give errors. It is thus highly reccommended that you provide this option. If you want to use all the allocated memory you should specify --fraction_available_mem 1.0.

  • By default, perSVade only allocates 50% of the available RAM for some java processes. You can change this through the command --fractionRAM_to_dedicate. For example --fractionRAM_to_dedicate 0.8 would allocate 80% of the RAM.

  • IMPORTANT NOTE: If you are running perSVade in MareNostrum4 or Nord3 you should skip the --fraction_available_mem argument, as perSVade already calculates it.

How are the SVs encoded into single files?

perSVade generates a single .tab file for each type of SV (inversions, deletions, tandemDuplications, insertions, translocations and unclassified SVs) into the folder SVdetection_output/final_gridss_running. Each of these has a unique set of columns that represent the variants. This is the meaning of these columns (note that text in "" indicates the column names of the corresponding .tab files):

  • Simple SVs: deletions, inversions and tandemDuplications (duplication of a region which gets inserted next to the affected region). These are described by a chromosome ("Chr"), "Start" and "End" coordinates of the SV. perSVade outputs one .tab file for each of these SV types.
  • Insertions: a region of the genome (indicated by "ChrA", "StartA", "EndA") is copied ("Copied" is TRUE) or cut ("Copied" is FALSE) and inserted into another region (indicated by "ChrB", "StartB"). "EndB" comes from adding to "StartB" the length of the inserted region. There is one .tab file for insertions.
  • Translocations: balanced translocations between two chromosomes ("ChrA" and "ChrB"). "StartA" indicates the start of "ChrA" (position 0). "EndA" indicates the position in "ChrA" where the translocation happens. For non-inverted translocations, "StartB" is 0 and "EndB" is the position in "ChrB" where the translocation happens. For inverted translocations, "StartB" is the position in "ChrB" where the translocation happens and "EndB" is the last position of "ChrB". There is one .tab file for translocations.
  • Unclassified SVs: There is one .tab file ("unclassified_SVs.tab") that reports all the variants that are called by clove and cannot be assigned to any of the above SV types. These include unclassified breakpoints (which could be part of unresolved/unkown complex variants) and complex inverted SVs (which are non-standard SVs). These types of SVs are not included in the simulations, so that the accuracy on them is unknown. This is why we group them together into a single file. For unclassified breakpoints, the "SVTYPE" indicates which is the orientation of the two breakends (there are 4 possible orientations, and the "SVTYPE" is different depending on if the two breakends are in the same chromosome or not). "#CHROM" - "POS" indicate one breakend and "#CHR" - "END" the other. "START" is -1 for such unclassified breakpoints. Complex inverted SVs represent variants were a region (indicated by "CHR2", "START", "END") is copied ("SVTYPE" is CVD) or cut ("SVTYPE" is IVT (different chromosomes)), inverted and inserted into "#CHROM"-"POS". Complex inverted translocations ("SVTYPE" is CVT) are variants where a region of a chromosome (defined by "#CHROM", "POS" and "END") is cut, inverted and inserted right in the 3'.

perSVade's parameter optimisation is too slow. I just want to run SV calling with some pre-defined filters. How can I do this?

perSVade includes an option to skip the parameter optimisation: --fast_SVcalling. This uses some default parameters for the filtering. You can also customize the filters by providing a .json file with them through --parameters_json_file. The default parameters encoded as .json are in misc/default_perSVade_parameters.json, and this file can be used as a template to provide custom parameters with --parameters_json_file.

What is in SV_and_CNV_variant_calling.vcf?

This is a vcf that contains all called SVs and CNVs, in a way that is focused on how each SV affects particular regions of the genome. This is useful for further functional annotation. Each SV can be split across multiple rows, as it can affect several regions of the genome. All the rows that are related to the same SV are identified by the field variantID in INFO. On top of this, each row has a unique identifier indicated by the field ID. Some SVs generate de novo-inserted sequences around the breakends, and each of these de novo insertions are represented in a single row. Note that each of the rows may indicate a region under CNV (with the SVTYPE in INFO as DEL, DUP or TDUP), a region with some rearrangement (with the SVTYPE in INFO as BND) or a region with a de novo insertion (with the SVTYPE in INFO as insertionBND). These three types of regions can be interpreted by the Ensembl Variant Effect Predictor (VEP, also implemented in perSVade) for functional annotation. The fields #CHROM and POS indicate the position of the SV, and the field END from INFO indicates the end of a region under SV when applicable.

More precisely, this is how each of the SV types are represented (check the FAQ How are the SVs encoded into single files? to understand what the fileds in <> mean):

  • Each deletion or tandem duplication is represented with a variantID=DEL|<Chr>:<Start>-<End> or variantID=TDUP|<Chr>:<Start>-<End>, respectively. There may be up to three rows related to each of these variants:

    • One row for the deleted/duplicated region (for deletions this wold have an ID DEL|CNV|<Chr>:<Start>-<End>). The fields POS and END in INFO indicate the deleted/duplicated region. The SVTYPE from INFO is DEL or TDUP.

    • Up to two rows for each of the de novo insertions (for deletions there may be one row with an ID DEL|insertion-<Chr>-<Start>|<Chr>:<Start>-<End> and another with an ID DEL|insertion-<Chr>-<End>|<Chr>:<Start>-<End>) around each of the breakends. The SVTYPE from INFO is insertionBND.

  • Each inversion is is represented with a variantID=INV|<Chr>:<Start>-<End>. There may be several rows related to each inversion:

    • One row for each of the breakends of the inverted region (with an ID INV|BND-<Chr>-<Start>|<Chr>:<Start>-<End> and INV|BND-<Chr>-<End>|<Chr>:<Start>-<End>). The SVTYPE from INFO is BND.

    • One row for each of the de novo insertions around the breakends tha generate this inversion. Each of them has an ID INV|insertion-<#CHROM>-<POS>|<Chr>-<Start>:<End>, where #CHROM and <POS> are equal to the equivalent vcf fields. The SVTYPE from INFO is insertionBND.

  • Each copy-paste insertion is represented with a variantID=INS|<ChrA>:<StartA>-<EndA>|<ChrB>:<StartB>|copyPaste. There are several rows related to each insertion:

    • One row for the duplicated region (in ChrA), with an ID INS|CNV|<ChrA>:<StartA>-<EndA>|<ChrB>:<StartB>|copyPaste. The fields POS and END in INFO indicated the copied region. The SVTYPE from INFO is DUP.

    • One row for the insertion position (in ChrB), with an ID INS|BND-<ChrB>-<StartB>|<ChrA>:<StartA>-<EndA>|<ChrB>:<StartB>|copyPaste. The field POS indicate the insertion position. The SVTYPE from INFO is BND.

    • One row for each of the de novo insertions around the breakends tha generate this insertion. Each of them has an ID INS|insertion-<#CHROM>-<POS>|<ChrA>:<StartA>-<EndA>|<ChrB>:<StartB>|copyPaste, where #CHROM and <POS> are equal to the equivalent vcf fields. The SVTYPE from INFO is insertionBND.

  • Each cut-paste insertion is represented with a variantID=INS|<ChrA>:<StartA>-<EndA>|<ChrB>:<StartB>|cutPaste. There are several rows related to each insertion:

    • One row for each of the breakends of the cut region (in ChrA), with IDs INS|BND-<ChrA>-<StartA>|<ChrA>:<StartA>-<EndA>|<ChrB>:<StartB>|cutPaste and INS|BND-<ChrA>-<EndA>|<ChrA>:<StartA>-<EndA>|<ChrB>:<StartB>|cutPaste. The SVTYPE from INFO is BND.

    • One row for the insertion position (in ChrB), with an ID INS|BND-<ChrB>-<StartB>|<ChrA>:<StartA>-<EndA>|<ChrB>:<StartB>|cutPaste. The field POS indicate the insertion position. The SVTYPE from INFO is BND.

    • One row for each of the de novo insertions around the breakends tha generate this insertion. Each of them has an ID INS|insertion-<#CHROM>-<POS>|<ChrA>:<StartA>-<EndA>|<ChrB>:<StartB>|copyPaste, where #CHROM and <POS> are equal to the equivalent vcf fields. The SVTYPE from INFO is insertionBND.

  • Each balanced translocation is represented with a variantID=TRA|<ChrA>:<StartA>-<EndA><><ChrB>:<StartB>-<EndB>. There are several rows related to each translocation:

    • One row for each of the breakend regions, with IDs TRA|BND-<ChrA>-<EndA>|<ChrA>:<StartA>-<EndA><><ChrB>:<StartB>-<EndB> for the ChrA breakend and TRA|BND-<ChrB>-<POS>|<ChrA>:<StartA>-<EndA><><ChrB>:<StartB>-<EndB> for the ChrB breakend. Note that the <POS> is equivalent to the vcf field. The SVTYPE from INFO is BND.

    • One row for each of the de novo insertions around the breakends tha generate this translocation. Each of them has an ID TRA|insertion-<#CHROM>-<POS>|<ChrA>:<StartA>-<EndA><><ChrB>:<StartB>-<EndB>, where #CHROM and <POS> are equal to the equivalent vcf fields. The SVTYPE from INFO is insertionBND.

  • Each unclassified breakpoint is represented with a variantID=<breakpoint_SVTYPE>like|<#CHROM>:<POS>-<CHR2>:<END>, where <breakpoint_SVTYPE> can be DEL, TAN, INV1, INV2, ITX1, ITX2, INVTX1, INVTX2. There are several rows related to each unclassified breakpoint:

    • One row for each of the breakend regions, with IDs <breakpoint_SVTYPE>like|BND-<#CHROM>-<POS>|<#CHROM>:<POS>-<CHR2>:<END> for the #CHROM breakend and <breakpoint_SVTYPE>like|BND-<CHR2>-<END>|<#CHROM>:<POS>-<CHR2>:<END> for the CHR2 breakend. The SVTYPE from INFO is BND.

    • One row for each of the de novo insertions around the breakends tha generate this breakpoint. Each of them has an ID <breakpoint_SVTYPE>like|insertion-<#CHROM>-<POS>|<breakpoint_SVTYPE>like|<#CHROM>:<POS>-<CHR2>:<END>, where #CHROM and <POS> are equal to the equivalent vcf fields. The SVTYPE from INFO is insertionBND.

  • The complex inverted SVs are represented with a variantID=<varID>|<CHR2>:<START>-<END>|<#CHR2>:<POS>, where <varID> can be CVD (inverted intrachromosomal duplication), CVT (inverted intrachromosomal cut-paste insertion) or IVT (inverted interchromosomal cut-paste insertion). Each of them is split across multiple vcf rows:

    • CVD's are encoded as copy-paste insertions.
    • IVT's are encoded as cut-paste insertions.
    • CVT's are encoded as unclassified breakpoints, but with an CVT tag instead of <breakpoint_SVTYPE>like. This means that there is one breakend for each end of the CVT.
  • The coverage-based deletion/duplication calls are represented with a variantID=coverageDUP|CNV|<#CHROM>:<POS>-<END> or variantID=coverageDEL|CNV|<#CHROM>:<POS>-<END>, respectively. There is one row for each variant. The SVTYPE from INFO is DUP or DEL. The range between POS and END indicates the span of the duplication/deletion. The field merged_relative_CN from INFO indicates the most conservative relative copy number state (closest to 1.0) of this region as called by the different CNV callers. In a diploid organism, merged_relative_CN of 0.5 would indicate monosomy and 1.5 trisomy of the indicated region. The field median_coverage_corrected indicates the relative coverage (where 1.0 would mean a coverage as the median of the genome) after correction in the coverage-based CNV calling pipeline.

Other important remarks and fields:

  • In SVs, the field BREAKEND_real_AF from INFO indicates the fraction of reads mapped to that position that support each of the breakends (indicated by BREAKENDIDs from INFO) that form the SV in the corresponding row. In the case of DUP, DEL and TDUP records, this is a comma-sepparated string. In the case of BND or insertionBND records this is a float, as there is only one breakend. This value can give you a rough idea of the genortype of the SV. However, note that an SV can have both homozygous and heterozygous breakends at the same time. SV genotyping is still an unsolved problem. Check https://github.com/PapenfussLab/gridss/wiki/GRIDSS-Documentation#calculating-variant-allele-fraction-vaf for more information on how gridss calculates these variant allele frequencies. In order to generate higher-confidence calls it may be interesting to filter out SVs that have low values of BREAKEND_real_AF (i.e. filter out SVs where none of the breakends has BREAKEND_real_AF>0.1).

  • Coverage-based CNV calling is error-prone, so that it is advisable to filter out CNVs where median_coverage_corrected and merged_relative_CN do not match (i.e: duplications (merged_relative_CN>1) with a median_coverage_corrected around 1).

  • The field RELCOVERAGE from INFO indicates the relative coverage normalised by the median coverage across windows of the genome for the region under CNV in tandem duplications, deletions, copy-paste insertions and coverage-inferred CNVs. The fields RELCOVERAGE_TO_5 and RELCOVERAGE_TO_3 from INFO indicate the relative coverage normalised by the coverage of the 5' or 3' region, respectively. These values can be used as an additional quality control for the called CNVs.

How can I compare the variants called in different samples?

This is not a trivial task, particularly for SVs and CNVs, where the same variant may be represented slightly different in two different perSVade runs. For example, we generally consider that two SVs of a given type (i.e. a deletion) are the same if they overlap by >75% and have their breakends <50 bp appart. We have developed some python functions to introduce this definition of "same variant" across different samples.

These functions are in ./scripts/sv_functions.py (see Quick start for an example of how to import this script as a module).

Below are some key functions:

  • sv_functions.get_integrated_small_vars_df_severalSamples is a function to integrate the files from the small variant and per gene coverage analysis, into single files for all samples. These are the generated files:

    • merged_coverage.tab includes the stacked smallVars_CNV_output/CNV_results/genes_and_regions_coverage.tab files, where the column sampleID indicates the the sampleID.

    • smallVars.tab includes the stacked smallVars_CNV_output/variant_calling_ploidy2.tab files, where the column sampleID indicates the the sampleID.

    • smallVars_annot.tab includes all the annotated variants (from the smallVars_CNV_output/variant_annotation_ploidy2.tab files).

  • sv_functions.get_integrated_CNperWindow_df_severalSamples stacks all the CNV_calling/final_df_coverage.tab files across different samples. It generates the integrated_CNperWindow.tab file where the column sampleID indicates the the sampleID.

  • sv_functions.get_integrated_SV_CNV_df_severalSamples integrates the SV and coverage-based CNV calling and annotation files for several samples and provides some information about the overlaps between samples. It requires a valid .gff file. These are the generated files:

    • SV_CNV.simple.tab includes the stacked SVcalling_output/SV_and_CNV_variant_calling.tab files, where the column sampleID indicates the the sampleID.

    • SV_CNV_annot.tab includes the variant annotations as in SVcalling_output/SV_and_CNV_variant_calling.tab_annotated_VEP.tab for all samples. This table also includes the fields is_protein_coding_gene (a boolean indicating if the variant overlaps a protein-coding gene) and is_transcript_disrupting (a boolean indicating if the variant disrupts a transcript). Beware that duplications are also considered to be 'transcript-disrupting'.

    • SV_CNV.tab is similar to SV_CNV.simple.tab, but it includes relevant information to compare the variants across samples. These are some useful extra fields that are found in this table, for each variant:

      • overlapping_samples_byBreakPoints_allCalled: A comma-sepparated string with the sample IDs where gridss found at least one breakpoint overlapping the variant. This can be considered as a false positive-prone estimator of samples that share this variant, as some of the breakpoints may not be real. This field is only meaningful for SVs called by gridss and clove.

      • overlapping_samples_byBreakPoints_PASS: A comma-sepparated string with the sample IDs where gridss found at least one breakpoint (passing the filters of gridss as defined by perSVade) overlapping the variant. This can be considered as a false negative-prone estimator of samples that share this variant, as some of the breakpoints may be missed due to only considering high-confidence breakpoints to compute the overlaps. This field is only meaningful for SVs called by gridss and clove.

      • overlapping_samples_CNV_atLeast_<min_pct_overlap> (where <min_pct_overlap> may be 0.25, 0.5, 0.75 or 0.9): In coverage-based CNVs, a comma-sepparated string with the sample IDs that may have the same CNV. A CNV is considered to be found in another sample if it has an equal or more extreme (farther from 1.0) copy number (CN) in at least <min_pct_overlap> of the windows of the given CNV. For example, a duplication in a region of sample A is considered to be also found sample B (according to overlapping_samples_CNV_atLeast_0.75) if at least 75% of the windows of that same region in sampleB have a CN number above or equal the CN in A. This field is only meaningful for coverage-based CNVs.

      • variantID_across_samples: is an identifier of each SV/CNV that is unique across different samples. By default, two SVs of a given type (i.e. a deletion) are the same if they overlap by >75% and have their breakends <50 bp appart.

Other remarks:

  • You can type help(<function_name>) from a python terminal to get more details about the arguments of these functions.

  • All these functions take an argument called paths_df. This should be a table (either a pandas dataframe object or a tab-delimited file with header names) with the columns sampleID (this should be a unique identifier of each sample) and perSVade_outdir (the path to the perSVade outdir of that sample).

  • We recommend using the fields overlapping_samples_byBreakPoints_allCalled, overlapping_samples_byBreakPoints_PASS and overlapping_samples_CNV_atLeast_<min_pct_overlap> if we are interested in filtering out variants in a given sample that are found in other 'background' samples. For example, using overlapping_samples_byBreakPoints_allCalled and overlapping_samples_CNV_atLeast_0.25 as the overlapping fields would be a conservative way to ensure that the variant is not in any of the 'background' samples.

  • The field variantID_across_samples is useful for analyses where we want to work with variants shared across different samples (i.e.: hierarchical clustering of samples by the variants)

How do I deal with the output of the get_integrated_SV_CNV_df_severalSamples function?

This is indeed a bit tricky. We here provide an example of running this function and processing the outputs using pandas dataframes. This example is related to an imaginary dataset where we called variants (both SVs and coverage-based CNVs) for one drug resistant sample (called "R") and two susceptible samples (called "S1" and "S2"). This would be a workflow for finding variants that are found in "R" and not in "S1" or "S2" and may drive the resistance:

# load the functions
import pandas as pd
import sys 

sys.path.insert(0, "<perSVade_dir>/scripts")
import sv_functions as fun

# define an outdir
integrated_files_dir = "./integrated_files"

# define a dataframe with the paths to the perSVade output directories
paths_df = pd.DataFrame({"sampleID" : ["R", "S1", "S2"],
                         "perSVade_outdir": ["./outdir_R", "./outdir_S1", "./outdir_S2"]})

# get a file with the Copy Number (CN) per window, which is necessary to compute the overlaps between CNV calls
integrated_CNperWindow_file = fun.get_integrated_CNperWindow_df_severalSamples(paths_df, integrated_files_dir, threads=4)

# generate files with the integrated SVs and CNVs. This requires a variant annotation file
fun.get_integrated_SV_CNV_df_severalSamples(paths_df, integrated_files_dir, <gff>, <reference_genome>, threads=4, integrated_CNperWindow_file=integrated_CNperWindow_file, tol_bp=50, pct_overlap=0.75, add_overlapping_samples_eachSV=True)

# import the generated df with the variants. This is equivalent to the "SV_and_CNV_variant_calling.vcf" but with INFO fields split across different columns and some extra fields.
df_vars = pd.read_csv("./integrated_files/SV_CNV.tab", sep="\t")

# keep vars that are only in "R" and none of the others
def get_df_vars_r_in_bgSamples(r, bg_samples):

    samples_overlapping_by_breakpoints = set(str(r["overlapping_samples_byBreakPoints_allCalled"]).split(","))
    samples_overlapping_by_CN = set(str(r["overlapping_samples_CNV_atLeast_0.25"]).split(","))

    all_samples_overlapping = samples_overlapping_by_breakpoints.union(samples_overlapping_by_CN)
    return len(bg_samples.intersection(all_samples_overlapping))>0

df_vars_R = df[(df_vars.sampleID=="R") & ~(df_vars.apply(get_df_vars_r_in_bgSamples, bg_samples={"S1", "S2"}, axis=1))]

# print the variants, using the IDs that are unique across samples 
print(set(df_vars_R.variantID_across_samples))

Running singularity build takes too much disk space. How can I solve this?

The singularity build command writes files into a cachedir ($HOME/.singularity/cache by default) and a temporary dir (/tmp by default). As perSVade's is a large image, you may get into storage problems if $HOME or /tmp do not have enough disk quota. You can change these directories through environmental variables before running singularity (i.e. with export SINGULARITY_CACHEDIR=<cachedir> and export SINGULARITY_TMPDIR=<tmpdir>). <cachedir> and <tmpdir> should be the redefined paths with enough disk. You can find more information in https://sylabs.io/guides/3.6/user-guide/build_env.html. Note that <tmpdir> should have full permissions (given with sudo chmod 777 <tmpdir>).

Where does perSVade write temporary files? Can I change it?

Most of them are written under the provided --outdir, but there are also some written under $HOME/.perSVade_dir (which can be removed any time). You can specify a PERSVADE_TMPDIR environmental varibale to not write under $HOME/.perSVade_dir (i.e. with export PERSVADE_TMPDIR=/tmp/perSVade_tmpdir)

Which resources should I allocate for perSVade?

The following plots may be useful to orient yourself:

We tested perSVade (only the SV calling pipeline providing aligned reads as inputs) on six eukaryotes (three samples / species) using either no parameter optimization (gray) or different types of simulations (black, red, blue) in a machine with 16 cores. Shown are the running time and maximum RAM used. Note that this is ignoring the resources related to read alignment (which was performed independently). The x axes reflect the reference genome size (left) and the number of mapped read pairs (right), which are correlated with resource consumption. This data may be useful to allocate computational resources for running perSVade.

Of note, perSVade was run with different parameters for the human datasets to avoid excessive resource consumption. First, we skipped the marking of duplicate reads on the .bam files with perSVade’s --skip_marking_duplicates option. Second, we ran the simulations on a subset of the genome (only chromosomes 2, 7, 9, X, Y and mitochondrial). Third, we skipped the ‘homologous’ simulations in human samples due to excessive memory consumption.

How does the parameter optimization for SV calling work?

This is a cartoon of the process:

This is the core, most novel function of perSVade. There are two main steps:

  • Choosing regions of the genome for simulations of SVs. These can be either around regions with previously known SVs, regions with pairwise homology (inferred with blastn) or randomly selected regions. In order to use regions with known SVs, you can either provide them (through --real_bedpe_breakpoints) or let perSVade infer them from other sequencing datasets (with --close_shortReads_table). If specified,--close_shortReads_table should be a table with the paths to sequencing datasets of the same (or close) species. perSVade runs the SV calling pipeline with default parameters on these datasets and generates a file with these 'known regions' with SVs. Note that all these options are mutually exclusive. By default, perSVade runs on random regions.

  • Choose a set of optimum parameters through simulations. perSVade generates two simulated genomes (tunable with --nsimulations) with 50 SVs of each type (tunable through --nvars) based on the reference genome and the provided regions. There are two simulated genomes for each of the desired ploidies, tunable through --simulation_plodies. If --simulation_plodies is not provided, perSVade will rely on the --ploidy parameter (i.e. if --ploidy is 2 the simulated genomes will have only heterozygous variants). For each simulated genome perSVade simulates reads with equal insert size, coverage and read length as the input reads. Then it aligns the reads and runs gridss to obtain a list of 'raw breakpoints'. PerSVade then tries several combinations of filters on them (by default >278,000,000, which is tunable through --range_filtering_benchmark) to generate many 'filtered breakpoints'. Each of these is processed with clove to generate a set of 'raw SVs'. PerSVade next tries several combinations of filters on each of them to get a set of filtered SVs. These are compared against the true set of SVs (inserted in the simulated genome) to calculate the accuracy (F-value) of each combination of gridss and clove filters on each simulated genome, ploidy and SV type. These filters are optimised for each simulation, and thus may not be accurate on independent sets of SVs (due to overfitting). In order to reduce this effect, perSVade tests how each of these 'best' filters perform on all simulations, ploidies and SV types (not only in those that yielded the given filters as optimum). The heatmap shows the F-value for an example sample, where the filters in the first row are accurate on all simulations (indicating that there is no overfitting on them) and thus they are chosen as the final set of best parameters. PerSVade writes the accuracy of these best parameters into <outdir>/SVdetection_output/parameter_optimisation/benchmarking_all_filters_for_all_genomes_and_ploidies/df_cross_benchmark_best.tab, which will allow you to understand how much you can trust the results.

These optimised filters (or parameters) are used for calling SVs in the real SV calling. They are stored in <outdir>/SVdetection_output/final_gridss_running/perSVade_parameters.json.

How does perSVade report the SV calling accuracy?

The following plot is an adapted example (for Candida albicans’s sample SRR538786) of how perSVade graphically reports the SV calling accuracy for any given input (as in the output folder parameter_optimisation/plots/cross_accuracy_heatmaps/):

In order to find the best parameters for SV calling, perSVade generates two simulated genomes (with simName ‘simulation_1’ or ‘simulation_2’) with up to 50 SVs of each type and finds the parameters that yield the highest F-value for each simulated genome and SV type (referred in the rows as ‘training’ parameters). In order to reduce overfitting, perSVade evaluates how each of these training parameters work on the other simulations and SV types (referred in the columns as ‘testing’ simulations). Overfitted parameters should perform well on their origin simulations, but not necessarily on the others. The heatmap shows the F-value of these evaluations, hereafter referred as ‘testing instances’. The chosen parameters for the SV calling on the real data (which work as good as possible on all simulations and SV types) can be identified because each cell has the F-value printed inside. In addition, perSVade saves the accuracy of these chosen parameters in a -tab file. The ‘*’ indicate testing instances where the training and testing are from the same SV type. The ‘=’ indicate testing instances where the training and testing are from the same SV type and simulation (same simName). Note that perSVade may generate <50 SVs for some types (i.e.: it may be impossible to find so many balanced translocations for some genomes). The testing instances with <10 SVs of a given type have a small number inside indicating the number of SVs (i.e. in this example there are 6 translocations in ‘simulation_1’). On another line, perSVade can generate simulations with different ploidies (i.e.: haploid or diploid genomes), which is indicated in the ‘ploidy’ line. The example only includes diploid heterozygous (‘diploid_hetero’) variants, which is the default behavior for diploid species like C. albicans.

Which type of parameter optimization should I use?

Based on the findings described in the paper of perSVade, we propose the following recommendations for a cost-effective usage of perSVade:

  • For SV calling on many datasets of one species with similar properties (similar coverage, read length and insert size), run perSVade using ‘random’ simulations on one sample, and use the optimized parameters for the other samples (skipping optimization). The reported calling accuracy may be overestimated since the simulations are not realistic, but the chosen parameters are expected to be optimal.

  • If you are interested in understanding the real SV calling accuracy, run perSVade on realistic simulations (‘homologous’ or ‘known’), which may report an accuracy that is closer to the real one.

The package cylowess was not installed properly. How can I solve this?

The CNV calling relies on the cylowess package. If you installed perSVade with the traditional installation you'll install it through the ./installation/setup_environment.sh script. However, we have seen that this does not always work, and you may get an error while trying to import cylowess. Make sure that cylowess is properly installed in perSVade's main conda environment following these commands (and making sure that you are using the python interpreter of the conda environment):

git clone https://github.com/histed/lowess-work-carljv
cd lowess-work-carljv
python setup.py install

How are mitochondrial chromosomes considered in perSVade?

If your reference genome has mitochondrial chromosomes you should specify them with --mitochondrial_chromosome. For example if there are two mtDNA chromosomes called 'chr_mito1' and 'chr_mito2' you should provide the argument --mitochondrial_chromosome chr_mito1,chr_mito2. If there are no mitochondrial chromosomes you should state --mitochondrial_chromosome no_mitochondria.

Note that mtDNA chromosomes are treated differently for SV, CNV calling and variant annotation. It is thus mandatory that you provide the mtDNA chromosomes.

perSVade has an overwhelming amount of options and I fear that my command is wrong. How can I check this without running all the pipeline?

perSVade includes an argument (--downsampled_coverage) to use only a random fraction of the input sequencing data (either reads or a sorted bam). You can test that your perSVade command works with such fraction of the data, which will run much faster. For example, if you add --downsampled_coverage 10.0 to your command perSVade will run on a subset of the data that yields an average coverage of 10x. We note that setting very low values (i.e. --downsamped_coverage 0.1) may result in errors because some regions of the genome won't be covered at all or because none gridss may not find any breakpoints.

In addition, you can run the parameter optimization for SV calling only on a fraction of chromosomes (i.e. with --simulation_chromosomes chr1,chr2,chr3) to speed up the process.

Note that these options can be useful beyond testing purposes. For example, you may use --downsampled_coverage to infer the SV calling accuracy on datasets with a given coverage. On another line, we used simulation_chromosomes to optimize perSVade for the human genome (see this).

Clone this wiki locally