Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Indel Annotation Problem with Uploaded Variation Column #1799

Open
hkirmak opened this issue Nov 20, 2024 · 9 comments
Open

Indel Annotation Problem with Uploaded Variation Column #1799

hkirmak opened this issue Nov 20, 2024 · 9 comments
Assignees

Comments

@hkirmak
Copy link

hkirmak commented Nov 20, 2024

Describe the issue

After performing variant calling with GATK's HaplotypeCaller and normalize multiallelic sites usign bcftools, I annotate my variants usign Ensembl VEP v113.2. There seems to be a discrepancy in the annotated variants especially variants within the homopolymer regions. In my VCF file, there is a deletion within a homopolymer region. Ensembl VEP annotates this variation as "indel" and creates a uploaded variation with ambiguous values.

Additional information

Please fill in the following sections to help us find the source of your issue as quickly as possible.

System

  • VEP version: 113.2
  • VEP Cache version: 113
  • Perl version: 5.32.1
  • OS: Ubuntu
  • tabix installed

Full VEP command line

## VEP command-line: vep --assembly GRCh38 --cache --cache_version 113 --custom [PATH]/clinvar.vcf.gz,ClinVar,vcf,exact,0,CLNSIG,CLNREVSTAT,CLNDN,CLNVC,CLNDISDB,[PATH]/1000GENOMES-phase_3.vcf.gz,1000Genomes,vcf,exact,0,[PATH]/CosmicCodingMuts.normal.vcf.gz,COSMIC,vcf,exact,0,LEGACY_ID,[PATH]/TurkishVariomeDB.sorted.vcf.gz,TURKISHVARIOME,vcf,exact,0,[PATH]/UK10K_COHORT.20160215.sites.vcf.gz,UK10K,vcf,exact,0,[PATH]/hg38.phyloP100way.bw,phyloP100way,bigwig,exact,[PATH]/gnomad.exomes.v4.0.coverage.summary.bed.gz,GnomadExomeCoverage,bed,overlap,[PATH]/gnomad.genomes.r3.0.1.coverage.summary.bed.gz,GnomadGenomeCoverage,bed,overlap,[PATH]/gnomad_exomes.vcf.gz,gnomADe,vcf,exact,0,AC,AN,AF,nhomalt,AF_afr,AF_amr,AF_asj,AF_eas,AF_fin,AF_mid,AF_nfe,AF_remaining,AF_sas,lcr,[PATH]/gnomad_genomes.vcf.gz,gnomADg,vcf,exact,0,AC,AN,AF,nhomalt,AF_afr,AF_amr,AF_asj,AF_eas,AF_fin,AF_mid,AF_nfe,AF_remaining,AF_sas,lcr,segdup --database 0 --dir_cache [PATH]/cache --dir_plugins [PATH]/plugins --distance 500 --everything --exclude_predicted --fasta [PATH]/resources_broad_hg38_v0_Homo_sapiens_assembly38.fasta --fields Uploaded_variation,Location,Gene,MVP_score,ExAC_AC,ExAC_AF,ExAC_AMR_AF,ExAC_EAS_AF,ExAC_FIN_AF,ExAC_NFE_AF,ExAC_SAS_AF,1000Genomes_1000Gp3_AF,1000Genomes_MAF,Feature,Feature_type,Consequence,cDNA_position,CDS_position,Protein_position,Amino_acids,Codons,Existing_variation,IMPACT,STRAND,VARIANT_CLASS,SYMBOL,HGNC_ID,BIOTYPE,CANONICAL,MANE_SELECT,MANE_PLUS_CLINICAL,SWISSPROT,LOVD,SOURCE,SIFT,SIFT4G,PolyPhen,DOMAINS,HGVSc,HGVSp,HGVSg,AF,AFR_AF,AMR_AF,EAS_AF,EUR_AF,SAS_AF,AA_AF,EA_AF,MAX_AF,PUBMED,LoFtool,ada_score,rf_score,CADD_PHRED,CADD_RAW,Condel,BLOSUM62,DANN_score,DEOGEN2_pred,DEOGEN2_score,Eigen-raw_coding,Eigen-phred_coding,FATHMM_pred,FATHMM_score,GERP++_RS,LRT_pred,LRT_score,MetaLR_pred,MetaLR_score,MutationTaster_score,MutationTaster_pred,MutationAssessor_score,MutationAssessor_pred,PROVEAN_pred,PROVEAN_score,SIFT4G_pred,SIFT4G_score,REVEL,genename,phastCons100way_vertebrate,phastCons17way_primate,phastCons470way_mammalian,phyloP100way_vertebrate,phyloP17way_primate,phyloP470way_mammalian,MaxEntScan_alt,MaxEntScan_diff,MaxEntScan_ref,GeneSplicer,ClinVar,ClinVar_CLNSIG,ClinVar_CLNREVSTAT,ClinVar_CLNDN,ClinVar_CLNVC,ClinVar_CLNDISDB,ClinVar_CLNVCSO,EXON,COSMIC_LEGACY_ID,COSMIC_CNT,UNIPROT_ISOFORM,MetaRNN_score,TURKISHVARIOME_AF,UK10K_AF,OtherDifficult,Mappability,LowComplexity,GCcontent,FunctionalTechnicallyDifficult,gnomADe_AC,gnomADe_AN,gnomADe_AF,gnomADe_nhomalt,gnomADe_AF_afr,gnomADe_AF_amr,gnomADe_AF_asj,gnomADe_AF_eas,gnomADe_AF_fin,gnomADe_AF_mid,gnomADe_AF_nfe,gnomADe_AF_remaining,gnomADe_AF_sas,gnomADe_lcr,gnomADe_segdup,gnomADg_AC,gnomADg_AN,gnomADg_AF,gnomADg_nhomalt,gnomADg_AF_afr,gnomADg_AF_amr,gnomADg_AF_asj,gnomADg_AF_eas,gnomADg_AF_fin,gnomADg_AF_mid,gnomADg_AF_nfe,gnomADg_AF_remaining,gnomADg_AF_sas,gnomADg_lcr,gnomADg_segdup,GnomadExomeCoverage,GnomadGenomeCoverage,NMD --force_overwrite --fork --format vcf --hgvs --hgvsg --individual all --input_file [PATH]/S26.vcf --merged --numbers --offline --output_file [PATH]/S26.tsv --plugin LoFtool --plugin dbscSNV,[PATH]/dbscSNV1.1_GRCh38.txt.gz --plugin GO --plugin NMD --plugin LOVD --plugin CADD,[PATH]/whole_genome_SNVs.tsv.gz --plugin REVEL,[PATH]/revel_grch38.tsv.gz --plugin Phylop1 --plugin Condel --plugin Blosum62 --plugin dbNSFP,[PATH]/dbNSFP4.9c_grch38.gz,transcript_match=1,MVP_score,genename,ExAC_AC,gnomAD_exomes_AC,gnomAD_genomes_AC,gnomAD_exomes_AF,gnomAD_genomes_AF,1000Gp3_AF,ExAC_AF,ExAC_AMR_AF,ExAC_EAS_AF,ExAC_FIN_AF,ExAC_NFE_AF,ExAC_SAS_AF,DANN_score,DEOGEN2_pred,DEOGEN2_score,Eigen-raw_coding,Eigen-phred_coding,FATHMM_score,FATHMM_pred,GERP++_RS,LRT_score,LRT_pred,MetaRNN_score,MetaRNN_rankscore,MetaRNN_pred,MetaLR_score,MetaLR_pred,MutationAssessor_score,MutationAssessor_pred,MutationTaster_score,MutationTaster_pred,PROVEAN_score,PROVEAN_pred,SIFT4G_score,SIFT4G_pred,phyloP17way_primate,phyloP470way_mammalian,phastCons100way_vertebrate,phastCons17way_primate,phastCons470way_mammalian,gnomAD_exomes_flag,gnomAD_genomes_flag,gnomAD_exomes_nhomalt,gnomAD_genomes_nhomalt --plugin MaxEntScan,[PATH]/fordownload --plugin GeneSplicer,[PATH]/genesplicer,[PATH]/human --tab --verbose

Full error message

There is no error message.

Data files (if applicable)

They include:

  • The input file (variants as in VCF):

#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT s1
chr1 235380161 . TTGTGTGTG T 559.02 PASS AC=1;AF=0.5;AN=2;DB;DP=95;ExcessHet=3.0103;FS=0;MLEAC=1;MLEAF=0.5;MQ=60;QD=27.95;SOR=1.244 GT:AD:DP:GQ:PL:SB 1/0:0,12:20:99:576,216,456:0,0,6,10
chr1 235380161 . TTGTGTGTG TTGTGTG 559.02 PASS AC=1;AF=0.5;AN=2;DB;DP=95;ExcessHet=3.0103;FS=0;MLEAC=1;MLEAF=0.5;MQ=60;QD=27.95;SOR=1.244 GT:AD:DP:GQ:PL:SB 0/1:0,8:20:99:576,384,358:0,0,6,10

  • The output file (tab separated direct VEP command line output):
    chr1_235380168_TG/TGTGTGTG/TGTGTG chr1:235380168-235380169 ENSG00000284770 - - - - - - - - - ENST00000366601 Transcript intron_variant - - - - - - MODIFIER 1 indel TBCE HGNC:11582 protein_coding - - - - Ensembl - - - - ENST00000366601.8:c.100+20_100+21insTGTGTG - chr1:g.235380209_235380214dup - - - - - - - - - - 0.901 - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - Q15813,A0A2R8Y809,A0A2R8Y4P2,A0A2R8YER9,A0A2R8Y4V6,A0A2R8Y5Q8,A0A2R8YHI9,A0A2R8Y7E7,A0A2R8YFM3,A0A2R8Y787,A0A2R8Y6G3,A0A2R8Y7H8,A0A2U3TZJ6,A0A2R8Y5H6,A0A2R8Y784
    chr1_235380168_TG/TGTGTGTG/TGTGTG chr1:235380168-235380169 ENSG00000284770 - - - - - - - - - ENST00000366601 Transcript intron_variant - - - - - - MODIFIER 1 indel TBCE HGNC:11582 protein_coding - - - - Ensembl - - - - ENST00000366601.8:c.100+20_100+21insTGTG - chr1:g.235380211_235380214dup - - - - - - - - - - 0.901 - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
    (...) so on

According to the VEP documentation, these variatns should be deletion

@dglemos
Copy link
Contributor

dglemos commented Nov 20, 2024

Hi @hkirmak,
Thanks for sending a detailed explanation of the issue.
Unfortunately I cannot reproduce the issue, in my output these two variants are deletion.

Can you please re-run VEP for the two variants above with the following options:

vep
--assembly GRCh38
--cache
--cache_version 113
--database 0
--dir_cache [PATH]/cache
--dir_plugins [PATH]/plugins
--distance 500
--everything
--fasta [PATH]/resources_broad_hg38_v0_Homo_sapiens_assembly38.fasta
--force_overwrite
--format vcf
--hgvs
--hgvsg
--input_file [PATH]/S26.vcf
--merged
--numbers
--offline
--output_file [PATH]/S26.tsv
--tab
--verbose

@hkirmak
Copy link
Author

hkirmak commented Nov 20, 2024

Hi,
Thank you for your quick response.

I ran the command you have sent above in a clean environment but it generated the following error:

ERROR: Unexpected extra command-line parameter(s): 0 at /miniconda3/envs/junk/bin/vep line 30.

We usually do not use database option as it is incompatible with the --cache option and offline running according to Ensembl-VEP documentation: https://www.ensembl.org/info/docs/tools/vep/script/vep_options.html#basic

I ran the command without the --database option (completed without any errors but there were some warnings about transcript-assembly mismatches, expected since --verbose is invoked) and the variants appear as deletions as you have found.

The command I use is much more different as it involves other options and custom annotations. I will check what may be causing such a problem, if you have another suggestion for our problem that would also be greatly appreciated.

Thank you.

@hkirmak
Copy link
Author

hkirmak commented Nov 20, 2024

I checked my command and the only difference is that I use the option "--individual all". I am not entirely sure what this option points to. As in the documentation:
"Consider only alternate alleles present in the genotypes of the specified individual(s). May be a single individual, a comma-separated list or "all" to assess all individuals separately. Individual variant combinations homozygous for the given reference allele will not be reported. Each individual and variant combination is given on a separate line of output. Only works with VCF files containing individual genotype data; individual IDs are taken from column headers. Not used by default"
However I expected to see different alleles in different lines from this documentation. I will try to run the command without this option.

@dglemos
Copy link
Contributor

dglemos commented Nov 20, 2024

Thanks for checking.
I copied --database 0 from your command under "Full VEP command line" but as you said this is not a valid option.
The option --individual all shouldn't interfere with the variant class - I'll double check.

chr1_235380168_TG/TGTGTGTG/TGTGTG chr1:235380168-235380169 ENSG00000284770 - - - - - - - - - ENST00000366601 Transcript intron_variant - - - - - - MODIFIER 1 indel TBCE HGNC:11582 protein_coding - - - - Ensembl - - - - ENST00000366601.8:c.100+20_100+21insTGTGTG - chr1:g.235380209_235380214dup - - - - - - - - - - 0.901 - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - Q15813,A0A2R8Y809,A0A2R8Y4P2,A0A2R8YER9,A0A2R8Y4V6,A0A2R8Y5Q8,A0A2R8YHI9,A0A2R8Y7E7,A0A2R8YFM3,A0A2R8Y787,A0A2R8Y6G3,A0A2R8Y7H8,A0A2U3TZJ6,A0A2R8Y5H6,A0A2R8Y784
chr1_235380168_TG/TGTGTGTG/TGTGTG chr1:235380168-235380169 ENSG00000284770 - - - - - - - - - ENST00000366601 Transcript intron_variant - - - - - - MODIFIER 1 indel TBCE HGNC:11582 protein_coding - - - - Ensembl - - - - ENST00000366601.8:c.100+20_100+21insTGTG - chr1:g.235380211_235380214dup - - - - - - - - - - 0.901 - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -

This output displays the two input variants merged. VEP is not supposed to do that, it should return annotations for each variant separately.

The command I use is much more different as it involves other options and custom annotations. I will check what may be causing such a problem, if you have another suggestion for our problem that would also be greatly appreciated.

You should review the --field, I don't think these fields match the fields from your output example.

@dglemos
Copy link
Contributor

dglemos commented Nov 20, 2024

The problem is caused by the option --individual all. This option takes into account the alternate alleles present in the genotypes of the specified individual(s), in your vcf is one individual represented ass1.

We are going to review this option.
Until the issue is resolved, I recommend you remove this option from your command.

Sorry for the inconvenience.

@hkirmak
Copy link
Author

hkirmak commented Nov 21, 2024

Thank you. I will proceed without the --individual all option until this issue is resolved.
I am not sure if I should close this issue since maybe it is needed for resolving this issue, I will leave that to you if that is okay.

@dglemos
Copy link
Contributor

dglemos commented Nov 21, 2024

Can you explain what you're trying to get by using the --individual option?
This option returns a separate variant annotation row for each individual. If you want to returns a list of individuals and their zygosity you can use option --individual_zyg (documentation here).

@hkirmak
Copy link
Author

hkirmak commented Nov 21, 2024

I was using the --individual option for annotating multiallelic sites (because that was what I understood from the VEP documentation) but then I had to use bcftools normalization for these sites as they generated downstream filtering errors prior to annotation step, therefore --individual option is not a necessity for me anymore, but I hadn't removed it from the VEP command.

@dglemos
Copy link
Contributor

dglemos commented Nov 21, 2024

The option --individual takes into account the genotype data while annotating variants, this is probably not the option you're looking for.
If at some point you have problems with allele representation (normalisation) then you could check the option --minimal which converts alleles to their most minimal representation before consequence calculation (documentation here)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants