FAQ
Runtime varies and is machine-specific, but you can reduce it by adding more stringent filters for your VCF when running 01-filter_vcf.sh.
Currently, the default filtering retains only variants whose FILTER value is PASS or unset (.), but users can add criteria such as read depth and allele frequency to filter out common variants.
# define default filters
cmd="bcftools view -f 'PASS,.' $vcf_file"
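As a rough illustration of adding stricter criteria, the sketch below composes a bcftools view command that keeps the default FILTER setting and appends an include expression through the -i option. The helper name build_filter_cmd and the tag names INFO/DP and INFO/AF are assumptions for this example; check which depth and frequency annotations your VCF actually carries before using them.

```python
import shlex


def build_filter_cmd(vcf_file, min_depth=None, max_af=None):
    """Compose a bcftools view command string with optional extra filters.

    INFO/DP and INFO/AF are assumed tag names; your VCF may use others.
    """
    cmd = ["bcftools", "view", "-f", "PASS,."]  # same default as above
    clauses = []
    if min_depth is not None:
        clauses.append(f"INFO/DP>={min_depth}")
    if max_af is not None:
        clauses.append(f"INFO/AF<{max_af}")
    if clauses:
        # -i keeps only records for which the expression is true
        cmd += ["-i", " && ".join(clauses)]
    cmd.append(vcf_file)
    # Quote each argument so the string is safe to paste into a shell
    return " ".join(shlex.quote(part) for part in cmd)


print(build_filter_cmd("sample.vcf.gz", min_depth=10, max_af=0.01))
```

The thresholds here are placeholders, not recommendations; stricter values remove more variants and shorten downstream steps at the cost of sensitivity.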
You can use any version of the ClinVar database when using the "custom" workflow.
Simply download the clinvar.vcf.gz file and place it into the /data directory.
You can run download_db_files.sh separately to download the required variant submission files (variant_summary.txt and submission_summary.txt) and place them into the /data directory.
AutoGVP will automatically detect that these files are in the /data folder and will not download new ones if they are present.
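To confirm ahead of a run that the files are where they are expected, a check along these lines can help. This is only a sketch of the idea, not AutoGVP's own detection logic; the helper name missing_db_files is hypothetical, and the file names are the ones mentioned above, which may differ in your release (for example, if they are compressed).

```python
from pathlib import Path

# File names taken from the text above; adjust if your copies are named differently.
REQUIRED_FILES = ("variant_summary.txt", "submission_summary.txt")


def missing_db_files(data_dir, required=REQUIRED_FILES):
    """Return the required files that are not present in data_dir."""
    data_path = Path(data_dir)
    return [name for name in required if not (data_path / name).is_file()]


if __name__ == "__main__":
    missing = missing_db_files("data")
    if missing:
        print("Not found in data/:", ", ".join(missing))
    else:
        print("All submission files are present.")
```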
As of AutoGVP v.0.4.1, AutoPVS1 hg38 uses gene symbols from VEP 104.
Therefore, we recommend running VEP v104 to ensure optimal tool compatibility.
Alternatively, if using VEP > v104, we recommend lifting over the gene symbols in the PVS1.level file located in the AutoPVS1 data folder using a custom Python script, where hgnc_tsv is the gene name database TSV file from the monthly HGNC server.
Example command, with results used to replace the PVS1.level file:
python3 D3b-DGD-Collaboration/scripts/update_gene_symbols.py -g hgnc_complete_set_2021-06-01.txt -f PVS1.level -z GENE level -u GENE -o results --explode_records 2> old_new.log
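For readers who want to see what a symbol liftover involves, here is a minimal sketch of the general idea. It is not the update_gene_symbols.py script referenced above. It assumes the HGNC TSV has symbol and prev_symbol columns with multiple previous symbols separated by "|", and that PVS1.level holds tab-separated gene and level columns with no header; verify both against your files. The gene names in the toy data are placeholders.

```python
import csv
import io


def build_symbol_map(hgnc_handle):
    """Return (current_symbols, previous_to_current) from an HGNC-style TSV."""
    current, previous = set(), {}
    for row in csv.DictReader(hgnc_handle, delimiter="\t"):
        current.add(row["symbol"])
        for old in filter(None, (row.get("prev_symbol") or "").split("|")):
            # Keep the first mapping seen if a previous symbol is ambiguous
            previous.setdefault(old, row["symbol"])
    return current, previous


def update_symbols(level_lines, current, previous):
    """Replace outdated gene symbols in 'GENE<TAB>level' lines."""
    updated = []
    for line in level_lines:
        gene, level = line.rstrip("\n").split("\t")
        if gene not in current:  # leave approved symbols untouched
            gene = previous.get(gene, gene)
        updated.append(f"{gene}\t{level}")
    return updated


# Toy data for illustration only
hgnc = io.StringIO(
    "symbol\tprev_symbol\nGENE_NEW\tGENE_OLD|GENE_OLDER\nGENE_B\t\n"
)
current, previous = build_symbol_map(hgnc)
print(update_symbols(["GENE_OLD\tL1", "GENE_B\tL2"], current, previous))
```

Symbols that are neither current nor a known previous symbol pass through unchanged, so it is worth logging them for manual review.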
We provide most_severe and latest as options when running run_autogvp.sh to allow users to tailor the workflow to their use case. Although we have not seen major differences in our testing, there may be specific cases where this matters.
The default is to use the "latest" submission, based on the assumption that technology and methods (and therefore variant predictions) improve over time.
However, if the latest entries were deposited in the ClinVar database around the same time, it may be more appropriate to use most_severe.
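The difference between the two options can be pictured with a small sketch. This is an illustration of the concept only, not AutoGVP's implementation: the function name pick_submission, the record fields, and the severity ranking are all assumptions made for the example.

```python
from datetime import date

# Hypothetical ranking for illustration; higher means more severe.
SEVERITY = {
    "Benign": 0,
    "Likely benign": 1,
    "Uncertain significance": 2,
    "Likely pathogenic": 3,
    "Pathogenic": 4,
}


def pick_submission(submissions, mode="latest"):
    """Choose one submission per variant by date or by severity."""
    if mode == "latest":
        return max(submissions, key=lambda s: s["date"])
    if mode == "most_severe":
        # Break severity ties by preferring the more recent submission
        return max(
            submissions,
            key=lambda s: (SEVERITY[s["significance"]], s["date"]),
        )
    raise ValueError(f"unknown mode: {mode}")


submissions = [
    {"significance": "Pathogenic", "date": date(2023, 1, 10)},
    {"significance": "Uncertain significance", "date": date(2023, 1, 24)},
]
print(pick_submission(submissions, "latest")["significance"])
print(pick_submission(submissions, "most_severe")["significance"])
```

In this toy case the two submissions are two weeks apart and disagree, which is the kind of situation where the choice of option changes the result.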
We strongly recommend manual inspection for genes of interest.
You can find a current list of MedGen Concept IDs and their associated diseases in the MedGenIDMappings.txt.gz file and filter on diseases of interest.
Provide the resulting Concept IDs as a text file with one ID per line and no column header.
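One way to build such a list is sketched below. It assumes MedGenIDMappings.txt.gz is pipe-delimited with the Concept ID in the first field and the preferred disease name in the second, and that header lines start with "#"; confirm this against the file you download. The helper name concept_ids_for and the IDs in the toy rows are placeholders.

```python
def concept_ids_for(lines, keyword):
    """Collect Concept IDs whose disease name contains keyword (case-insensitive)."""
    ids = set()
    for line in lines:
        if line.startswith("#"):  # skip header/comment lines
            continue
        fields = line.rstrip("\n").split("|")
        if len(fields) >= 2 and keyword.lower() in fields[1].lower():
            ids.add(fields[0])
    return sorted(ids)


# Toy rows for illustration; with the real file, pass
# gzip.open("MedGenIDMappings.txt.gz", "rt") instead of this list.
rows = [
    "#CUI|pref_name|source_id|source|\n",
    "C0000001|Example neoplasm|X1|SRC|\n",
    "C0000002|Unrelated condition|X2|SRC|\n",
    "C0000001|Example neoplasm|X9|OTHER|\n",
]

# One Concept ID per line, no header
print("\n".join(concept_ids_for(rows, "neoplasm")))
```

Deduplicating matters because the same Concept ID can appear once per source vocabulary.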
We provide two Concept ID lists in the data/ folder:
- A full disease Concept ID list pulled from the disease_names text file on the ClinVar FTP. This file was filtered for IDs for which Category == "disease", and disease IDs associated with "Not specified" and "Not provided" were removed.
- A CPG-associated Concept ID list, including only those Concept IDs associated with variants in 214 cancer predisposition genes. Gene-concept ID associations can be pulled from the gene_condition_source_id text file on the ClinVar FTP.
Both lists were derived from ClinVar FTP files pulled on 10/05/2023.
CAVATICA is an open, public, cloud-based tool that runs our workflow, which builds on upstream Kids First workflows (Germline SNV Annotations, Pathogenicity-Preprocessing). The Kids First germline harmonization workflow contains custom annotations. This workflow is intended for users who wish to apply the standard Kids First workflows to their own data or to run AutoGVP on harmonized Kids First datasets.