Ryan Corbett edited this page Nov 29, 2023 · 9 revisions

Frequently Asked Questions (FAQ)

AutoGVP is running too slowly. How can I improve the runtime?

Runtime varies by machine, but there are several ways to reduce it. One is to apply more stringent filters to your VCF when running 01-filter_vcf.sh. By default, filtering retains only variants whose FILTER value is PASS or unset (.), but you can add further criteria, such as read depth or allele frequency, for example to filter out common variants.

# define default filters
cmd="bcftools view -f 'PASS,.' $vcf_file"
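As a rough sketch of what a stricter filter could look like, the snippet below extends the default command with a bcftools include expression (-i). The tag names INFO/DP and INFO/AF, the thresholds, and the file name are assumptions for illustration, not AutoGVP defaults; check your VCF header for the tags it actually defines.

```shell
# Sketch only: tag names, thresholds, and file name are hypothetical.
vcf_file="input.vcf.gz"
min_depth=10     # minimum read depth to keep a site
max_af=0.01      # maximum allele frequency to keep a site

# extra criteria passed to bcftools view via -i/--include
extra_filter="INFO/DP>=${min_depth} && INFO/AF<${max_af}"

# default PASS/unset filter plus the extra criteria
cmd="bcftools view -f 'PASS,.' -i '${extra_filter}' $vcf_file"

# print the command for inspection before running it
echo "$cmd"
```

Sites that lack a tag used in the expression will generally not satisfy the comparison, so it is worth confirming that rare or novel variants without an allele frequency annotation are not dropped unintentionally.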

What if I want to use a different version of ClinVar database or submission files?

You can use any version of the ClinVar database with the "custom" workflow. Download the clinvar.vcf.gz file and place it in the /data directory. You can also run download_db_files.sh separately to download the required variant submission files (variant_summary.txt and submission_summary.txt) and place them in the /data directory. AutoGVP will detect that these files are present in the /data folder and will not download new ones.
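Before a run, it can help to confirm that all three files are in place. The helper below is a hypothetical convenience function, not part of AutoGVP; it only checks that the files named above exist and are non-empty in a given directory.

```shell
# Hypothetical helper (not part of AutoGVP): report missing ClinVar files.
check_clinvar_files() {
  local data_dir="$1"
  local status=0
  local f
  for f in clinvar.vcf.gz variant_summary.txt submission_summary.txt; do
    # -s is true only if the file exists and is not empty
    if [ ! -s "$data_dir/$f" ]; then
      echo "missing: $data_dir/$f" >&2
      status=1
    fi
  done
  return "$status"
}
```

For example, `check_clinvar_files data && echo "custom ClinVar files found"` prints the message only when all three files are present.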

Why do you recommend lifting over AutoPVS1 gene symbols if not using VEP v104?

As of AutoGVP v0.4.1, AutoPVS1 hg38 uses gene symbols from VEP v104. We therefore recommend running VEP v104 to ensure optimal tool compatibility. Alternatively, if you are using a VEP version later than v104, we recommend lifting over the gene symbols in the PVS1.level file, located in the AutoPVS1 data folder, using this custom Python script, where hgnc_tsv is the gene name database TSV file from the monthly HGNC server here.

Example command; the results are used to replace the PVS1.level file:

python3 D3b-DGD-Collaboration/scripts/update_gene_symbols.py -g hgnc_complete_set_2021-06-01.txt -f PVS1.level -z GENE level -u GENE -o results --explode_records 2> old_new.log

What is the recommended conflict_res option?

We provide most_severe and latest as options when running run_autogvp.sh so that users can tailor the behavior to their use case. Although we have not seen major differences in our testing, there may be specific cases where the choice matters. The default is latest, which uses the most recent submission on the assumption that technology and methods (and therefore variant predictions) improve over time. However, if the latest entries were deposited in ClinVar around the same time, most_severe may be more appropriate. We strongly recommend manual inspection for genes of interest.
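To make the difference between the two options concrete, here is a toy sketch. It is not AutoGVP's implementation: the input layout (a tab-separated file with variant ID, classification, and an ISO date), the function name, and the severity ranking are all assumptions chosen for illustration.

```shell
# Toy illustration only; NOT how AutoGVP resolves conflicts internally.
# Input: TSV with hypothetical columns variant_id, classification, date (YYYY-MM-DD).
resolve_conflicts() {
  local mode="$1"
  local tsv="$2"
  local tab
  tab=$(printf '\t')
  case "$mode" in
    latest)
      # newest date first within each variant, then keep the first row per variant
      sort -t "$tab" -k1,1 -k3,3r "$tsv" |
        awk -F'\t' '!seen[$1]++ { print $1 "\t" $2 }'
      ;;
    most_severe)
      # keep the highest-ranked classification per variant (assumed ranking)
      awk -F'\t' '
        BEGIN {
          rank["Benign"] = 1; rank["Likely benign"] = 2
          rank["Uncertain significance"] = 3
          rank["Likely pathogenic"] = 4; rank["Pathogenic"] = 5
        }
        rank[$2] > best[$1] { best[$1] = rank[$2]; call[$1] = $2 }
        END { for (v in call) print v "\t" call[v] }
      ' "$tsv" | sort
      ;;
    *)
      echo "unknown mode: $mode" >&2
      return 1
      ;;
  esac
}
```

With submissions of Benign (2019), Pathogenic (2021), and Likely benign (2023) for the same variant, the latest mode in this sketch returns Likely benign while most_severe returns Pathogenic, which is the kind of case where manual inspection is worthwhile.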

How can I create my own Concept ID list?

You can find a current list of MedGen Concept IDs and their associated diseases in the MedGenIDMappings.txt.gz file here; filter it for your diseases of interest. Provide the resulting list as a text file of Concept IDs with no column header. We provide two Concept ID lists in the data/ folder:

  1. A full disease Concept ID list pulled from the disease_names text file on the ClinVar FTP. This file was filtered for IDs for which Category == "disease", and disease IDs associated with "Not specified" and "Not provided" were removed.
  2. A CPG-associated Concept ID list, including only those Concept IDs associated with variants in 214 cancer predisposition genes. Gene-Concept ID associations can be pulled from the gene_condition_source_id text file on the ClinVar FTP.

Both lists were derived from ClinVar FTP files pulled on 10/05/2023.
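One way to build a custom list is sketched below. It assumes MedGenIDMappings.txt.gz is pipe-delimited with the Concept ID in the first column and the disease name in the second, and that lines beginning with # are headers; please confirm this against the file you download. The function name and keyword matching are illustrative only.

```shell
# Sketch: extract Concept IDs whose disease name contains a keyword.
# Assumed layout (verify against the real file): ConceptID|disease name|...
make_concept_id_list() {
  local pattern="$1"
  local mappings="$2"
  gzip -dc "$mappings" |
    awk -F'|' -v pat="$pattern" '
      /^#/ { next }                                  # skip header lines
      index(tolower($2), tolower(pat)) { print $1 }  # case-insensitive substring match
    ' |
    sort -u   # one unique Concept ID per line, no column header
}
```

For example, `make_concept_id_list "neuroblastoma" MedGenIDMappings.txt.gz > my_concept_ids.txt` writes a headerless, one-ID-per-line text file; the keyword and output file name here are placeholders.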

Why is there a CAVATICA version?

CAVATICA is an open, public, cloud-based tool that uses our workflow, which is based on upstream Kids First workflows (Germline SNV Annotations, Pathogenicity-Preprocessing). The Kids First germline harmonization workflow contains custom annotations. The CAVATICA version is intended for users who wish to apply the standard Kids First workflows to their data, or to apply AutoGVP to harmonized Kids First datasets.