8. FAQs
Read mapping around repetitive elements of the genome can be inaccurate, potentially yielding false positive SVs around them. Conversely, some SVs are expected to appear around repeats (e.g., those derived from transposable element insertions). This means that the decision on how to handle SVs around such regions is not trivial. In addition, it may be interesting to know which variants (SVs or small variants) overlap repeats for downstream analyses. perSVade includes the 'infer_repeats' module to find repeats in a genome with RepeatModeler and RepeatMasker. You can input these repeats to several modules (with `--repeats_file`), which will have the following effects:
- If repeats are provided, 'optimize_parameters' will assess whether removing SV calls overlapping repeats increases the overall accuracy. If so, the resulting optimized parameters will include `filter_overlappingRepeats : True`. If you use these optimized parameters in 'call_SVs', any breakpoint overlapping repeats will be removed.
- If repeats are provided, 'call_SVs' may filter out SVs that overlap repeats if the SV filtering parameters include `filter_overlappingRepeats : True`.
- If repeats are provided, 'find_known_SVs' will pass them to the 'call_SVs' module.
- If repeats are provided, 'integrate_SV_CNV_calls' will add a field in the INFO which indicates whether the SVs overlap repeats.
- If repeats are provided, 'call_small_variants' will add a field in the tabular variant calling file which indicates whether the variants overlap repeats.
The consideration of repeats can be suboptimal for two reasons. First, the prediction of repeats is time-consuming for large genomes. Second, some repeat families (e.g., simple repeats, low complexity regions) may yield more false positive calls than others (e.g., large transposons), so treating all repetitive elements together may not always be the best option. perSVade includes two options to circumvent this:
- Skipping the analysis of repeats (`--repeats_file skip`).
- Instead of providing the default output of 'infer_repeats', you can filter the regions to be considered as 'repeats'. For example, you could keep only simple repeats and low complexity regions in the 'repeats file' provided to `--repeats_file` (see the sketch below).
Note that `--repeats_file` is a mandatory argument in all these modules. We think it is important that each user decides how repeats should be considered in their datasets.
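If you go for the filtering option above, here is a minimal sketch of such a step. It assumes the repeats file is a tab-separated table with a column describing the repeat class (here called 'type'; check the header of your 'infer_repeats' output for the actual column name and class labels):

```python
# Hypothetical sketch: keep only simple repeats and low-complexity regions in
# the file passed to --repeats_file. The input path, the 'type' column and the
# class labels are assumptions; check your 'infer_repeats' output.
import pandas as pd

repeats = pd.read_csv("infer_repeats_output.tab", sep="\t")
keep_classes = {"Simple_repeat", "Low_complexity"}  # RepeatMasker-style class names
filtered = repeats[repeats["type"].isin(keep_classes)]
filtered.to_csv("filtered_repeats.tab", sep="\t", index=False)
```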
How does the --fraction_available_mem work?
This pipeline calculates the available RAM for several steps (using the python function `psutil.virtual_memory()`). psutil measures all the available RAM in the computer. If you are running on a fraction of the computer's resources (e.g., in an HPC cluster), psutil may overestimate the available RAM. If that is the case, `--fraction_available_mem` allows you to correct this overestimation. For example, if you are using 16 cores of an HPC node that has 32 cores (at 2 Gb/core), psutil may (wrongly) report ~64 Gb of RAM available (the whole node). To solve this you can provide `--fraction_available_mem 0.5` to use only half of the memory calculated by psutil. Note that you may need to find out how psutil behaves on your system to adequately set `--fraction_available_mem`. If `--fraction_available_mem` is not provided (default behavior), this pipeline will estimate the available RAM by filling the memory, which may give errors. It is thus highly recommended that you provide this option. If you want to use all the allocated memory you should specify `--fraction_available_mem 1.0`.
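As a rough illustration of the idea (a minimal sketch, not perSVade's actual implementation):

```python
# Minimal sketch of what --fraction_available_mem corrects for: psutil reports
# the memory of the whole machine, so we scale it by the fraction of the
# machine that was actually allocated to us.
import psutil

fraction_available_mem = 0.5  # e.g., 16 of 32 cores on a shared HPC node
reported_gb = psutil.virtual_memory().available / 1e9  # psutil sees the whole node
usable_gb = reported_gb * fraction_available_mem
print(f"psutil reports {reported_gb:.1f} Gb available; treating {usable_gb:.1f} Gb as usable")
```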
perSVade crashed due to insufficient resources (allocated RAM or time). Do I have to repeat the whole run?
No worries. perSVade is designed to restart from the last correct step if you re-run the same command in the same output directory. This means you can simply re-run on a system with more resources and no completed steps will be repeated. In addition, some tips can be useful:
- You may consider readjusting the balance between allocated threads (with `--threads`) and RAM (this depends on the running system). Some programs used by perSVade require a minimum amount of RAM per thread, which depends on the input. We have developed perSVade to balance this automatically, but we can't guarantee that it will always work (especially if input genomes/reads are much larger than those tested).
- Another option that can be useful to deal with memory errors is to specify the fraction of RAM that is being allocated to a given perSVade run. Check the previous FAQ How does the --fraction_available_mem work? for more information.
- By default, perSVade only allocates 50% of the available RAM for some java processes. You can change this through the `--fractionRAM_to_dedicate` option. For example, `--fractionRAM_to_dedicate 0.8` would allocate 80% of the RAM.
- IMPORTANT NOTE: If you are running perSVade in MareNostrum4 or Nord3 you should skip the `--fraction_available_mem` argument, as perSVade already calculates it.
The following plots may be useful to orient yourself:
We tested the SV calling of perSVade on six eukaryotes (three samples per species) using either no parameter optimization (gray) or different types of simulations (black, red, blue) in a machine with 16 cores. Shown are the running time and the maximum RAM used. Note that this ignores the resources related to read alignment (which was performed independently). Each point reflects the resources used by 'optimize_parameters' (except in the gray points), 'call_SVs' and 'integrate_SV_CNV_calls'. The x axes reflect the reference genome size (left) and the number of mapped read pairs (right), which are correlated with resource consumption. This data may be useful for allocating computational resources to run perSVade.
Of note, perSVade's 'optimize_parameters' was run with particular parameters for the human datasets to avoid excessive resource consumption. First, we skipped the marking of duplicate reads on the .bam files with the `--skip_marking_duplicates` option of the 'align_reads' module. Second, we ran the simulations on a subset of the genome (only chromosomes 2, 7, 9, X, Y and mitochondrial). Third, we skipped the 'homologous' simulations in human samples due to excessive memory consumption.
How are the SVs encoded into single files?
perSVade 'call_SVs' generates a single .tab file for each type of SV (inversions, deletions, tandemDuplications, insertions, translocations and unclassified SVs). Each of these has a unique set of columns that represent the variants. This is the meaning of these columns (note that text in "" indicates the column names of the corresponding .tab files):
- Simple SVs: deletions, inversions and tandemDuplications (duplication of a region which gets inserted next to the affected region). These are described by a chromosome ("Chr"), "Start" and "End" coordinates of the SV. perSVade outputs one .tab file for each of these SV types.
- Insertions: a region of the genome (indicated by "ChrA", "StartA", "EndA") is copied ("Copied" is TRUE) or cut ("Copied" is FALSE) and inserted into another region (indicated by "ChrB", "StartB"). "EndB" comes from adding to "StartB" the length of the inserted region. There is one .tab file for insertions.
- Translocations: balanced translocations between two chromosomes ("ChrA" and "ChrB"). "StartA" indicates the start of "ChrA" (position 0). "EndA" indicates the position in "ChrA" where the translocation happens. For non-inverted translocations, "StartB" is 0 and "EndB" is the position in "ChrB" where the translocation happens. For inverted translocations, "StartB" is the position in "ChrB" where the translocation happens and "EndB" is the last position of "ChrB". There is one .tab file for translocations.
- Unclassified SVs: there is one .tab file ("unclassified_SVs.tab") that reports all the variants that are called by clove and cannot be assigned to any of the above SV types. These include unclassified breakpoints (which could be part of unresolved/unknown complex variants) and complex inverted SVs (which are non-standard SVs). These types of SVs are not included in the simulations, so their calling accuracy is unknown; this is why we group them together into a single file. For unclassified breakpoints, the "SVTYPE" indicates the orientation of the two breakends (there are 4 possible orientations, and the "SVTYPE" differs depending on whether the two breakends are in the same chromosome or not). "#CHROM" - "POS" indicate one breakend and "CHR2" - "END" the other. "START" is -1 for such unclassified breakpoints. Complex inverted SVs represent variants where a region (indicated by "CHR2", "START", "END") is copied ("SVTYPE" is CVD) or cut ("SVTYPE" is IVT, for different chromosomes), inverted and inserted into "#CHROM"-"POS". Complex inverted translocations ("SVTYPE" is CVT) are variants where a region of a chromosome (defined by "#CHROM", "POS" and "END") is cut, inverted and inserted right at its 3' side.
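For downstream analyses you can load these .tab files with pandas. A minimal sketch (all file names except "unclassified_SVs.tab" are assumptions; check your 'call_SVs' output directory for the exact names):

```python
# Load the per-SV-type tables produced by 'call_SVs'. File names other than
# unclassified_SVs.tab are assumptions; adjust to your output directory.
import pandas as pd

outdir = "./call_SVs_output"  # hypothetical output directory
deletions = pd.read_csv(f"{outdir}/deletions.tab", sep="\t")     # "Chr", "Start", "End"
insertions = pd.read_csv(f"{outdir}/insertions.tab", sep="\t")   # "ChrA"/"StartA"/"EndA", "ChrB"/"StartB"/"EndB", "Copied"
unclassified = pd.read_csv(f"{outdir}/unclassified_SVs.tab", sep="\t")

# e.g., keep only copy-paste insertions
copy_paste_insertions = insertions[insertions["Copied"] == True]
```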
This is a cartoon of the process (note that perSVade's modules have the same names as in the pipeline overview):
'optimize_parameters' is the core, most novel function of perSVade. There are two main steps:
- Choose regions of the genome for the simulation of SVs. These can be either around some specific regions (with the argument `--regions_SVsimulations <regions>.bedpe`) or randomly placed across the genome (with the argument `--regions_SVsimulations random`). Note that perSVade has modules to infer either regions with previously known SVs (through 'find_knownSVs_regions') or regions with pairwise homology (through 'find_homologous_regions'). SV simulations around such regions may be more realistic than random simulations, which is why you may consider them.
- Choose a set of optimum parameters through simulations (with the 'optimize_parameters' module). By default, perSVade generates two simulated genomes (tunable with `--nsimulations`) with 50 SVs of each type (tunable through `--nvars`) based on the reference genome and the provided regions. There are two simulated genomes for each of the desired ploidies, tunable through `--simulation_ploidies`. We usually set `--simulation_ploidies haploid` for haploid organisms and `--simulation_ploidies diploid_hetero` for diploids (which means that the simulated genomes will have only heterozygous variants). For each simulated genome perSVade simulates reads with the same insert size, coverage and read length as the input mapped reads (provided as `sbam`). Then it aligns the reads and runs gridss to obtain a list of 'raw breakpoints'. perSVade then tries several combinations of filters on them (by default >278,000,000, tunable through `--range_filtering_benchmark`) to generate many sets of 'filtered breakpoints'. Each of these is processed with clove to generate a set of 'raw SVs'. perSVade next tries several combinations of filters on each of them to get a set of filtered SVs. These are compared against the true set of SVs (inserted in the simulated genome) to calculate the accuracy (F-value) of each combination of gridss and clove filters on each simulated genome, ploidy and SV type. These filters are optimised for each simulation, and thus may not be accurate on independent sets of SVs (due to overfitting). In order to reduce this effect, perSVade tests how each of these 'best' filters performs on all simulations, ploidies and SV types (not only on those that yielded the given filters as optimum). The heatmap shows the F-value for an example sample (BG2 from Candida glabrata), where the filters in the second row are accurate on all simulations (indicating that there is no overfitting on them) and are thus chosen as the final set of best parameters. perSVade writes the accuracy of these best parameters into `<outdir>/optimized_parameters_accuracy.tab`, which will allow you to understand how much you can trust the results.
These optimised filters (or parameters) are written into `<outdir>/optimized_parameters.json`, and they may be used for calling SVs with 'call_SVs'.
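For reference, the F-value used to rank filter combinations is the standard F-score (harmonic mean of precision and recall). A minimal sketch of how such a value is derived from true positive, false positive and false negative counts (perSVade's exact bookkeeping may differ):

```python
# Standard F-score from TP/FP/FN counts. perSVade reports an F-value of this
# kind per simulated genome, ploidy and SV type (exact bookkeeping may differ).
def f_value(tp: int, fp: int, fn: int) -> float:
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

print(f_value(45, 5, 5))  # e.g., 45/50 simulated deletions recovered, 5 extra calls -> 0.9
```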
Based on the findings described in the perSVade paper, we propose the following recommendations for a cost-effective usage of perSVade:
- For SV calling on many datasets of one species with similar properties (similar coverage, read length and insert size), run perSVade using 'random' simulations on one sample, and use the optimized parameters for the other samples (skipping optimization). The reported calling accuracy may be overestimated since the simulations are not realistic, but the chosen parameters are expected to be optimal.
- If you are interested in understanding the real SV calling accuracy, run perSVade on realistic simulations ('homologous' or 'known'), which may report an accuracy that is closer to the real one.
perSVade's parameter optimisation is too slow. I just want to run SV calling with some pre-defined filters. How can I do this?
perSVade's 'call_SVs' requires an argument called `--SVcalling_parameters`. This can be either a .json file with some parameters or the string 'default', which tells 'call_SVs' to run on default parameters. The .json file can be the one output by 'optimize_parameters' or a custom one (see this file for a template, and the hedged sketch after the filter list below). Of note, we found that the default parameters do not work well in some datasets, so they should be used carefully. If you want to skip parameter optimization, we recommend that you provide a .json file customized after reading about the meaning of each of the filters (see the FAQ Which are the filters used by perSVade? below).
Which are the filters used by perSVade?
- `min_Nfragments`: Minimum number of reads supporting a breakend in gridss to be accepted (default is 5).
- `min_af`: Minimum variant allele frequency of a breakend in gridss to be accepted (default is 0.25).
- `min_QUAL`: Minimum quality (QUAL field of the vcf file) of a breakend in gridss to be accepted (default is 0).
- `max_to_be_considered_small_event`: Maximum length of a breakpoint in gridss to be considered a small event (default is 1000). Events shorter than this value are considered "small events", which are treated particularly by other filtering steps.
- `min_length_inversions`: Minimum length of inversion-like breakends in gridss to be accepted (default is 40).
- `maximum_lenght_inexactHomology`: Maximum length of the inexact homology region around a breakend in gridss to be accepted (default is 50). This filter is not applied to "small events", as defined by `max_to_be_considered_small_event`.
- `maximum_microhomology`: Maximum length of the exact homology (microhomology) region around a breakend in gridss to be accepted (default is 50).
- `maximum_strand_bias`: Maximum strand bias of a breakend in gridss to be accepted (default is 0.99). This filter is only applied to "small events", as defined by `max_to_be_considered_small_event`.
- `filter_noReadPairs`: Discards gridss breakends without discordant read pair support (default is false). This filter is not applied to "small events", as defined by `max_to_be_considered_small_event`.
- `filter_noSplitReads`: Discards gridss breakends without split-read evidence (default is false). This filter is only applied to "small events", as defined by `max_to_be_considered_small_event`.
- `filter_overlappingRepeats`: Discards gridss breakends overlapping repetitive elements (default is false). By default, perSVade generates a file including the regions with repeats (using RepeatModeler (Flynn et al. 2020) (v2.0.1) and RepeatMasker (Chen 2004) (v4.0.9)). The user can also provide a file with these regions to speed up the computation.
- `filter_polyGC`: Discards gridss breakends with long inserted G or C sequences (>15bp) (default is true).
- `wrong_FILTERtags`: A set of values in the FILTER field of the gridss vcf which flag discarded breakends (default is ["NO_ASSEMBLY"]).
- `range_filt_DEL_breakpoints`: A range of lengths in which DEL-like breakends (as defined by gridss) are discarded if the breakend has a region with inexact homology above 5bp (default is [0, 1]). For example, if set to [500, 1000], DEL-like breakends whose length is between 500 and 1000bp with an inexact homology sequence >5 bp would be discarded.
- `dif_between_insert_and_del`: The margin given for comparing the length of the inserted sequence (len_seq) on a gridss DEL-like breakend and the length of the actual event (len_event) (default is 5). If len_seq > (len_event - dif_between_insert_and_del), the breakend is filtered out. This filter is only applied to "small events", as defined by `max_to_be_considered_small_event`.
- `max_rel_coverage_to_consider_del`: The maximum relative coverage that a region spanning a DEL-like breakpoint (as defined by clove) can have to be classified as an actual deletion (default is 0.1).
- `min_rel_coverage_to_consider_dup`: The minimum relative coverage that a region spanning a TAN-like breakpoint (as defined by clove) can have to be classified as an actual tandem duplication (default is 1.8).
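As announced above, here is a hedged sketch of building a custom parameters .json from these filters. The exact schema expected by `--SVcalling_parameters` is defined by the official template file, so treat the flat structure here as illustrative only:

```python
# Illustrative only: write a custom SV-calling parameters .json using the
# filter names and defaults documented above. The real schema expected by
# --SVcalling_parameters is defined by the official template file.
import json

parameters = {
    "min_Nfragments": 8,                    # stricter than the default of 5
    "min_af": 0.25,
    "min_QUAL": 0,
    "max_to_be_considered_small_event": 1000,
    "min_length_inversions": 40,
    "maximum_lenght_inexactHomology": 50,   # this spelling is the documented parameter name
    "maximum_microhomology": 50,
    "maximum_strand_bias": 0.99,
    "filter_noReadPairs": False,
    "filter_noSplitReads": False,
    "filter_overlappingRepeats": True,      # drop breakends overlapping repeats
    "filter_polyGC": True,
    "wrong_FILTERtags": ["NO_ASSEMBLY"],
    "range_filt_DEL_breakpoints": [0, 1],
    "dif_between_insert_and_del": 5,
    "max_rel_coverage_to_consider_del": 0.1,
    "min_rel_coverage_to_consider_dup": 1.8,
}

with open("my_SVcalling_parameters.json", "w") as fh:
    json.dump(parameters, fh, indent=4)
```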
This is a vcf that contains all called SVs and CNVs, in a way that is focused on how each SV affects particular regions of the genome. This is useful for further functional annotation. Each SV can be split across multiple rows, as it can affect several regions of the genome. All the rows that are related to the same SV are identified by the field `variantID` in `INFO`. On top of this, each row has a unique identifier indicated by the field `ID`. Some SVs generate de novo-inserted sequences around the breakends, and each of these de novo insertions is represented in a single row. Note that each of the rows may indicate a region under CNV (with the `SVTYPE` in `INFO` as `DEL`, `DUP` or `TDUP`), a region with some rearrangement (with the `SVTYPE` in `INFO` as `BND`) or a region with a de novo insertion (with the `SVTYPE` in `INFO` as `insertionBND`). These three types of regions can be interpreted by the Ensembl Variant Effect Predictor (VEP, also implemented in perSVade) for functional annotation. The fields `#CHROM` and `POS` indicate the position of the SV, and the field `END` from `INFO` indicates the end of a region under SV when applicable.
More precisely, this is how each of the SV types is represented (check the FAQ How are the SVs encoded into single files? to understand what the fields in `<>` mean; a small parsing sketch follows this list):
- Each deletion or tandem duplication is represented with a `variantID=DEL|<Chr>:<Start>-<End>` or `variantID=TDUP|<Chr>:<Start>-<End>`, respectively. There may be up to three rows related to each of these variants:
    - One row for the deleted/duplicated region (for deletions this would have an ID `DEL|CNV|<Chr>:<Start>-<End>`). The fields `POS` and `END` in `INFO` indicate the deleted/duplicated region. The `SVTYPE` from `INFO` is `DEL` or `TDUP`.
    - Up to two rows for the de novo insertions around each of the breakends (for deletions there may be one row with an ID `DEL|insertion-<Chr>-<Start>|<Chr>:<Start>-<End>` and another with an ID `DEL|insertion-<Chr>-<End>|<Chr>:<Start>-<End>`). The `SVTYPE` from `INFO` is `insertionBND`.
- Each inversion is represented with a `variantID=INV|<Chr>:<Start>-<End>`. There may be several rows related to each inversion:
    - One row for each of the breakends of the inverted region (with IDs `INV|BND-<Chr>-<Start>|<Chr>:<Start>-<End>` and `INV|BND-<Chr>-<End>|<Chr>:<Start>-<End>`). The `SVTYPE` from `INFO` is `BND`.
    - One row for each of the de novo insertions around the breakends that generate this inversion. Each of them has an ID `INV|insertion-<#CHROM>-<POS>|<Chr>-<Start>:<End>`, where `#CHROM` and `<POS>` are equal to the equivalent vcf fields. The `SVTYPE` from `INFO` is `insertionBND`.
- Each copy-paste insertion is represented with a `variantID=INS|<ChrA>:<StartA>-<EndA>|<ChrB>:<StartB>|copyPaste`. There are several rows related to each insertion:
    - One row for the duplicated region (in ChrA), with an ID `INS|CNV|<ChrA>:<StartA>-<EndA>|<ChrB>:<StartB>|copyPaste`. The fields `POS` and `END` in `INFO` indicate the copied region. The `SVTYPE` from `INFO` is `DUP`.
    - One row for the insertion position (in ChrB), with an ID `INS|BND-<ChrB>-<StartB>|<ChrA>:<StartA>-<EndA>|<ChrB>:<StartB>|copyPaste`. The field `POS` indicates the insertion position. The `SVTYPE` from `INFO` is `BND`.
    - One row for each of the de novo insertions around the breakends that generate this insertion. Each of them has an ID `INS|insertion-<#CHROM>-<POS>|<ChrA>:<StartA>-<EndA>|<ChrB>:<StartB>|copyPaste`, where `#CHROM` and `<POS>` are equal to the equivalent vcf fields. The `SVTYPE` from `INFO` is `insertionBND`.
- Each cut-paste insertion is represented with a `variantID=INS|<ChrA>:<StartA>-<EndA>|<ChrB>:<StartB>|cutPaste`. There are several rows related to each insertion:
    - One row for each of the breakends of the cut region (in ChrA), with IDs `INS|BND-<ChrA>-<StartA>|<ChrA>:<StartA>-<EndA>|<ChrB>:<StartB>|cutPaste` and `INS|BND-<ChrA>-<EndA>|<ChrA>:<StartA>-<EndA>|<ChrB>:<StartB>|cutPaste`. The `SVTYPE` from `INFO` is `BND`.
    - One row for the insertion position (in ChrB), with an ID `INS|BND-<ChrB>-<StartB>|<ChrA>:<StartA>-<EndA>|<ChrB>:<StartB>|cutPaste`. The field `POS` indicates the insertion position. The `SVTYPE` from `INFO` is `BND`.
    - One row for each of the de novo insertions around the breakends that generate this insertion. Each of them has an ID `INS|insertion-<#CHROM>-<POS>|<ChrA>:<StartA>-<EndA>|<ChrB>:<StartB>|cutPaste`, where `#CHROM` and `<POS>` are equal to the equivalent vcf fields. The `SVTYPE` from `INFO` is `insertionBND`.
- Each balanced translocation is represented with a `variantID=TRA|<ChrA>:<StartA>-<EndA><><ChrB>:<StartB>-<EndB>`. There are several rows related to each translocation:
    - One row for each of the breakend regions, with IDs `TRA|BND-<ChrA>-<EndA>|<ChrA>:<StartA>-<EndA><><ChrB>:<StartB>-<EndB>` for the ChrA breakend and `TRA|BND-<ChrB>-<POS>|<ChrA>:<StartA>-<EndA><><ChrB>:<StartB>-<EndB>` for the ChrB breakend. Note that the `<POS>` is equivalent to the vcf field. The `SVTYPE` from `INFO` is `BND`.
    - One row for each of the de novo insertions around the breakends that generate this translocation. Each of them has an ID `TRA|insertion-<#CHROM>-<POS>|<ChrA>:<StartA>-<EndA><><ChrB>:<StartB>-<EndB>`, where `#CHROM` and `<POS>` are equal to the equivalent vcf fields. The `SVTYPE` from `INFO` is `insertionBND`.
- Each unclassified breakpoint is represented with a `variantID=<breakpoint_SVTYPE>like|<#CHROM>:<POS>-<CHR2>:<END>`, where `<breakpoint_SVTYPE>` can be DEL, TAN, INV1, INV2, ITX1, ITX2, INVTX1 or INVTX2. There are several rows related to each unclassified breakpoint:
    - One row for each of the breakend regions, with IDs `<breakpoint_SVTYPE>like|BND-<#CHROM>-<POS>|<#CHROM>:<POS>-<CHR2>:<END>` for the `#CHROM` breakend and `<breakpoint_SVTYPE>like|BND-<CHR2>-<END>|<#CHROM>:<POS>-<CHR2>:<END>` for the `CHR2` breakend. The `SVTYPE` from `INFO` is `BND`.
    - One row for each of the de novo insertions around the breakends that generate this breakpoint. Each of them has an ID `<breakpoint_SVTYPE>like|insertion-<#CHROM>-<POS>|<breakpoint_SVTYPE>like|<#CHROM>:<POS>-<CHR2>:<END>`, where `#CHROM` and `<POS>` are equal to the equivalent vcf fields. The `SVTYPE` from `INFO` is `insertionBND`.
- The complex inverted SVs are represented with a `variantID=<varID>|<CHR2>:<START>-<END>|<#CHROM>:<POS>`, where `<varID>` can be CVD (inverted intrachromosomal duplication), CVT (inverted intrachromosomal cut-paste insertion) or IVT (inverted interchromosomal cut-paste insertion). Each of them is split across multiple vcf rows:
    - CVDs are encoded as copy-paste insertions.
    - IVTs are encoded as cut-paste insertions.
    - CVTs are encoded as unclassified breakpoints, but with a `CVT` tag instead of `<breakpoint_SVTYPE>like`. This means that there is one breakend for each end of the CVT.
- The coverage-based deletion/duplication calls are represented with a `variantID=coverageDUP|CNV|<#CHROM>:<POS>-<END>` or `variantID=coverageDEL|CNV|<#CHROM>:<POS>-<END>`, respectively. There is one row for each variant. The `SVTYPE` from `INFO` is `DUP` or `DEL`. The range between `POS` and `END` indicates the span of the duplication/deletion. The field `merged_relative_CN` from `INFO` indicates the most conservative relative copy number state (closest to 1.0) of this region as called by the different CNV callers. In a diploid organism, a merged_relative_CN of 0.5 would indicate monosomy and 1.5 trisomy of the indicated region. The field `median_coverage_corrected` indicates the relative coverage (where 1.0 means a coverage equal to the median of the genome) after correction in the coverage-based CNV calling pipeline.
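Since all rows of one SV share the `variantID` INFO field, you can group records per variant with a few lines of plain parsing. A minimal sketch (assuming the standard vcf column layout; adapt the file path to your output):

```python
# Group the rows of the integrated SV/CNV vcf by their variantID INFO field so
# that all records of one SV can be inspected together. Standard vcf column
# layout is assumed; adapt the path to your output.
from collections import defaultdict

rows_by_variant = defaultdict(list)
with open("SV_and_CNV_variant_calling.vcf") as fh:
    for line in fh:
        if line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        info = dict(kv.split("=", 1) for kv in fields[7].split(";") if "=" in kv)
        rows_by_variant[info.get("variantID")].append(fields)

for variant_id, rows in rows_by_variant.items():
    print(variant_id, "->", len(rows), "rows")
```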
Other important remarks and fields:
- In SVs, the field `BREAKEND_real_AF` from `INFO` indicates the fraction of reads mapped to that position that support each of the breakends (indicated by `BREAKENDIDs` from `INFO`) that form the SV in the corresponding row. In the case of `DUP`, `DEL` and `TDUP` records, this is a comma-separated string. In the case of `BND` or `insertionBND` records this is a float, as there is only one breakend. This value can give you a rough idea of the genotype of the SV. However, note that an SV can have both homozygous and heterozygous breakends at the same time; SV genotyping is still an unsolved problem. Check https://github.com/PapenfussLab/gridss/wiki/GRIDSS-Documentation#calculating-variant-allele-fraction-vaf for more information on how gridss calculates these variant allele frequencies. In order to generate higher-confidence calls it may be interesting to filter out SVs that have low values of BREAKEND_real_AF (e.g., filter out SVs where none of the breakends has BREAKEND_real_AF>0.1).
- Coverage-based CNV calling is error-prone, so it is advisable to filter out CNVs where `median_coverage_corrected` and `merged_relative_CN` do not match (e.g., duplications (merged_relative_CN>1) with a median_coverage_corrected around 1). A sketch implementing both suggested filters follows this list.
- The field `RELCOVERAGE` from `INFO` indicates the relative coverage normalised by the median coverage across windows of the genome for the region under CNV in tandem duplications, deletions, copy-paste insertions and coverage-inferred CNVs. The fields `RELCOVERAGE_TO_5` and `RELCOVERAGE_TO_3` from `INFO` indicate the relative coverage normalised by the coverage of the 5' or 3' region, respectively. These values can be used as an additional quality control for the called CNVs.
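As mentioned in the list above, here is a hedged sketch of both suggested filters applied to a tabular export of these calls (column names follow the INFO fields documented above; the thresholds for the coverage/copy-number consistency check are arbitrary illustrations):

```python
# Sketch of the two quality filters suggested above, applied to a tabular
# export of the SV/CNV calls (e.g., the SV_CNV.tab described in the next FAQ).
# Column names follow the documented INFO fields; thresholds are illustrative.
import pandas as pd

df = pd.read_csv("SV_CNV.tab", sep="\t")

def max_breakend_af(value):
    """BREAKEND_real_AF can be a float or a comma-separated string of floats."""
    try:
        return max(float(x) for x in str(value).split(","))
    except ValueError:  # e.g., missing value on coverage-based CNV rows
        return float("nan")

# 1) flag SVs where no breakend reaches BREAKEND_real_AF > 0.1
low_af = df["BREAKEND_real_AF"].apply(max_breakend_af) <= 0.1

# 2) flag coverage-based CNVs whose copy number and relative coverage disagree
inconsistent_cnv = (
    ((df["merged_relative_CN"] > 1) & (df["median_coverage_corrected"] < 1.2)) |
    ((df["merged_relative_CN"] < 1) & (df["median_coverage_corrected"] > 0.8))
)

df_filtered = df[~(low_af | inconsistent_cnv)]
```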
This is not a trivial task, particularly for SVs and CNVs, where the same variant may be represented slightly differently in two different perSVade runs. For example, we generally consider that two SVs of a given type (e.g., a deletion) are the same if they overlap by >75% and have their breakends <50 bp apart. We have developed some python functions to apply this definition of "same variant" across different samples. These functions are in `./scripts/sv_functions.py` (see Quick start for an example of how to import this script as a module). Below are some key functions:
- `sv_functions.get_integrated_small_vars_df_severalSamples` is a function to integrate the files from the small variant and per-gene coverage analysis into single files for all samples. These are the generated files:
    - `merged_coverage.tab` includes the stacked `smallVars_CNV_output/CNV_results/genes_and_regions_coverage.tab` files, where the column `sampleID` indicates the sample.
    - `smallVars.tab` includes the stacked `smallVars_CNV_output/variant_calling_ploidy2.tab` files, where the column `sampleID` indicates the sample.
    - `smallVars_annot.tab` includes all the annotated variants (from the `smallVars_CNV_output/variant_annotation_ploidy2.tab` files).
- `sv_functions.get_integrated_CNperWindow_df_severalSamples` stacks all the `CNV_calling/final_df_coverage.tab` files across different samples. It generates the `integrated_CNperWindow.tab` file, where the column `sampleID` indicates the sample.
- `sv_functions.get_integrated_SV_CNV_df_severalSamples` integrates the SV and coverage-based CNV calling and annotation files for several samples and provides some information about the overlaps between samples. It requires a valid .gff file. These are the generated files:
    - `SV_CNV.simple.tab` includes the stacked `SVcalling_output/SV_and_CNV_variant_calling.tab` files, where the column `sampleID` indicates the sample.
    - `SV_CNV_annot.tab` includes the variant annotations as in `SVcalling_output/SV_and_CNV_variant_calling.tab_annotated_VEP.tab` for all samples. This table also includes the fields `is_protein_coding_gene` (a boolean indicating if the variant overlaps a protein-coding gene) and `is_transcript_disrupting` (a boolean indicating if the variant disrupts a transcript). Beware that duplications are also considered to be 'transcript-disrupting'.
    - `SV_CNV.tab` is similar to `SV_CNV.simple.tab`, but it includes relevant information to compare the variants across samples. These are some useful extra fields found in this table, for each variant:
        - `overlapping_samples_byBreakPoints_allCalled`: a comma-separated string with the sample IDs where gridss found at least one breakpoint overlapping the variant. This can be considered a false positive-prone estimator of the samples that share this variant, as some of the breakpoints may not be real. This field is only meaningful for SVs called by gridss and clove.
        - `overlapping_samples_byBreakPoints_PASS`: a comma-separated string with the sample IDs where gridss found at least one breakpoint (passing the filters of gridss as defined by perSVade) overlapping the variant. This can be considered a false negative-prone estimator of the samples that share this variant, as some breakpoints may be missed because only high-confidence breakpoints are considered to compute the overlaps. This field is only meaningful for SVs called by gridss and clove.
        - `overlapping_samples_CNV_atLeast_<min_pct_overlap>` (where `<min_pct_overlap>` may be 0.25, 0.5, 0.75 or 0.9): in coverage-based CNVs, a comma-separated string with the sample IDs that may have the same CNV. A CNV is considered to be found in another sample if it has an equal or more extreme (farther from 1.0) copy number (CN) in at least `<min_pct_overlap>` of the windows of the given CNV. For example, a duplication in a region of sample A is considered to also be found in sample B (according to `overlapping_samples_CNV_atLeast_0.75`) if at least 75% of the windows of that same region in sample B have a CN above or equal to the CN in A. This field is only meaningful for coverage-based CNVs.
        - `variantID_across_samples`: an identifier of each SV/CNV that is unique across different samples. By default, two SVs of a given type (e.g., a deletion) are the same if they overlap by >75% and have their breakends <50 bp apart.
Other remarks:
- You can type `help(<function_name>)` from a python terminal to get more details about the arguments of these functions.
- All these functions take an argument called `paths_df`. This should be a table (either a pandas dataframe object or a tab-delimited file with header names) with the columns `sampleID` (a unique identifier of each sample) and `perSVade_outdir` (the path to the perSVade outdir of that sample).
- We recommend using the fields `overlapping_samples_byBreakPoints_allCalled`, `overlapping_samples_byBreakPoints_PASS` and `overlapping_samples_CNV_atLeast_<min_pct_overlap>` if you are interested in filtering out variants in a given sample that are found in other 'background' samples. For example, using `overlapping_samples_byBreakPoints_allCalled` and `overlapping_samples_CNV_atLeast_0.25` as the overlapping fields would be a conservative way to ensure that the variant is not in any of the 'background' samples.
- The field `variantID_across_samples` is useful for analyses where we want to work with variants shared across different samples (e.g., hierarchical clustering of samples by their variants).
This is indeed a bit tricky. Here we provide an example of running this function and processing the outputs using pandas dataframes. This example relates to an imaginary dataset where we called variants (both SVs and coverage-based CNVs) for one drug-resistant sample (called "R") and two susceptible samples (called "S1" and "S2"). This would be a workflow for finding variants that are found in "R" and not in "S1" or "S2" and may drive the resistance:
```python
# load the functions
import pandas as pd
import sys
sys.path.insert(0, "<perSVade_dir>/scripts")
import sv_functions as fun
# define an outdir
integrated_files_dir = "./integrated_files"
# define a dataframe with the paths to the perSVade output directories
paths_df = pd.DataFrame({"sampleID" : ["R", "S1", "S2"],
"perSVade_outdir": ["./outdir_R", "./outdir_S1", "./outdir_S2"]})
# get a file with the Copy Number (CN) per window, which is necessary to compute the overlaps between CNV calls
integrated_CNperWindow_file = fun.get_integrated_CNperWindow_df_severalSamples(paths_df, integrated_files_dir, threads=4)
# generate files with the integrated SVs and CNVs. This requires a variant annotation file
fun.get_integrated_SV_CNV_df_severalSamples(paths_df, integrated_files_dir, <gff>, <reference_genome>, threads=4, integrated_CNperWindow_file=integrated_CNperWindow_file, tol_bp=50, pct_overlap=0.75, add_overlapping_samples_eachSV=True)
# import the generated df with the variants. This is equivalent to the "SV_and_CNV_variant_calling.vcf" but with INFO fields split across different columns and some extra fields.
df_vars = pd.read_csv("./integrated_files/SV_CNV.tab", sep="\t")
# keep vars that are only in "R" and none of the others
def get_df_vars_r_in_bgSamples(r, bg_samples):
    # return True if any 'background' sample shares this variant, either by
    # overlapping breakpoints or by overlapping copy number
    samples_overlapping_by_breakpoints = set(str(r["overlapping_samples_byBreakPoints_allCalled"]).split(","))
    samples_overlapping_by_CN = set(str(r["overlapping_samples_CNV_atLeast_0.25"]).split(","))
    all_samples_overlapping = samples_overlapping_by_breakpoints.union(samples_overlapping_by_CN)
    return len(bg_samples.intersection(all_samples_overlapping))>0

df_vars_R = df_vars[(df_vars.sampleID=="R") & ~(df_vars.apply(get_df_vars_r_in_bgSamples, bg_samples={"S1", "S2"}, axis=1))]
# print the variants, using the IDs that are unique across samples
print(set(df_vars_R.variantID_across_samples))
```
Note the following points to troubleshoot any problems with this `singularity build`:
- As it is written, this command will ask for a 'Docker Username' and 'Docker Password'; you can just press enter there (there is no need to have a user).
- The `--docker-login` argument may be neither accepted nor necessary in some singularity versions.
- You may run out of disk space (`singularity build` may need 50Gb) depending on where you are building the image (see the following FAQ Where does perSVade write temporary files? Can I change it? for a solution).
- You may get a fatal error like this: "While performing build: While searching for mksquashfs: exec: "mksquashfs": executable file not found in $PATH". This may happen because singularity does not know where to find `mksquashfs`. You can tell it where it is by editing the file `singularity.conf`, which is installed in `<conda_dir>/etc/singularity/singularity.conf` if you used conda to install singularity. Find the line that has 'mksquashfs path', which will likely be commented out. Replace this line with `mksquashfs path = <conda_dir>/bin/mksquashfs`. Make sure that the `mksquashfs` path is correct by running `which mksquashfs`.
The `singularity build` command writes files into a cache dir (`$HOME/.singularity/cache` by default) and a temporary dir (`/tmp` by default). As perSVade's image is large, you may run into storage problems if `$HOME` or `/tmp` do not have enough disk quota. You can change these directories through environment variables before running singularity (i.e. with `export SINGULARITY_CACHEDIR=<cachedir>` and `export SINGULARITY_TMPDIR=<tmpdir>`). `<cachedir>` and `<tmpdir>` should be the redefined paths with enough disk. You can find more information in https://sylabs.io/guides/3.6/user-guide/build_env.html. Note that `<tmpdir>` should have full permissions (given with `sudo chmod 777 <tmpdir>`).
The singularity image relies on several conda environments installed inside the image. Some users encountered buggy interactions between their native python installation and the singularity environments (see this issue), which may appear as errors while trying to import python packages. The root of this problem is that the python `sys.path` object from the container includes the path `~/.local/lib/python3.6/site-packages` from the host machine before the (expected) `/opt/conda/envs/perSVade_env/lib/python3.6/site-packages`. This is a known conda bug, and we have found two possible workarounds:
- Insert the correct `site-packages` path at the beginning of sys.path before running perSVade, with `singularity exec -e mikischikora_persvade_<tag>.sif bash -c 'source /opt/conda/etc/profile.d/conda.sh && conda activate perSVade_env && export PYTHONPATH=/opt/conda/envs/perSVade_env/lib/python3.6/site-packages && /perSVade/scripts/perSVade <module_name>'`.
- Remove the host machine's `~/.local/lib/python3.6/site-packages` from sys.path before running perSVade, with `singularity exec -e mikischikora_persvade_<tag>.sif bash -c 'source /opt/conda/etc/profile.d/conda.sh && conda activate perSVade_env && export PYTHONNOUSERSITE=True && /perSVade/scripts/perSVade <module_name>'`.
We have noticed that some `picard` commands throw the following message in some machines:
```
ERROR: This cross-compiler package contains no program <conda_dir>/envs/<env_name>_picard_env/bin/x86_64-conda_cos6-linux-gnu-gfortran
INFO: activate-gfortran_linux-64.sh made the following environmental changes: +HOST=x86_64-conda_cos6-linux-gnu -HOST=x86_64-conda-linux-gnu
```
This does not stop the execution, because it is not an error. It is just a message that appears on some machines stating that some environmental changes were performed, and you should not worry about the results. Note that perSVade will stop the execution if any step goes wrong.
Where does perSVade write temporary files? Can I change it?
Most of them are written under the provided `--outdir`, but some are also written under `$HOME/.perSVade_dir` (which can be removed at any time). You can set a `PERSVADE_TMPDIR` environment variable to avoid writing under `$HOME/.perSVade_dir` (e.g., with `export PERSVADE_TMPDIR=/tmp/perSVade_tmpdir`).
The CNV calling relies on the cylowess package. If you installed perSVade with the traditional installation, it is installed through the `./installation/setup_environment.sh` script. However, we have seen that this does not always work, and you may get an error while trying to import cylowess. Make sure that cylowess is properly installed in perSVade's main conda environment with the following commands (making sure that you are using the `python` interpreter of the conda environment):
```
git clone https://github.com/histed/lowess-work-carljv
cd lowess-work-carljv
python setup.py install
```
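You can then check that the import resolves to the conda environment's site-packages (a quick sanity check, not part of perSVade itself):

```python
# Quick sanity check that cylowess imports, and that it resolves to the conda
# environment's site-packages rather than a stale user installation.
import cylowess
print(cylowess.__file__)
```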
If your reference genome has mitochondrial chromosomes you should specify them with `--mitochondrial_chromosome`. For example, if there are two mtDNA chromosomes called 'chr_mito1' and 'chr_mito2' you should provide the argument `--mitochondrial_chromosome chr_mito1,chr_mito2`. If there are no mitochondrial chromosomes you should state `--mitochondrial_chromosome no_mitochondria`.
Note that mtDNA chromosomes are treated differently for SV calling, CNV calling and variant annotation. It is thus mandatory that you provide the mtDNA chromosomes.
Variant annotation is often error-prone because the compatibility between VEP (the program used by perSVade to annotate variants) and the provided gff is not guaranteed. Gff files can have a flexible format, which can sometimes be incompatible with perSVade. While there is no universal solution to every problem, we have seen that validating your gff (through this link) is often useful. If your problem persists, raise a Github issue and we can probably work it out.