All variants are intergenic with NCBI GFF #1620
Comments
Hey @dzc0104, thank you for your question. The problem is related to using the NCBI GTF/GFF annotation for microorganisms: we currently require the GTF/GFF annotation to explicitly describe each transcript and its exons. For your use case, you could use the following modified annotation:
As this is not the first time we have received this question (see #1074), I am going to talk with the team about the possibility of supporting these NCBI GTF/GFF annotation files for microorganisms. Maybe we can consider each CDS as a single-exon transcript. I will keep you updated on this. Best regards,
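The idea of treating each CDS as a single-exon transcript can be sketched as a small shell filter. This is only an illustration and not the modified annotation mentioned above: the function name `add_transcripts`, the `rna-` and `exon-` ID prefixes, and the `biotype=protein_coding` attribute are assumptions, and the sketch presumes that every CDS carries a `Parent` attribute pointing at a `gene-...` ID, as in the NCBI file from this issue.

```shell
# Hypothetical helper: reads a GFF3 on stdin and, for every CDS whose parent
# is a gene, inserts an mRNA line and an exon line with the same coordinates,
# then re-parents the CDS to the new mRNA.
add_transcripts() {
  awk '
  # Return the value of attribute "key" from a GFF3 attribute column.
  function attr(s, key,    n, i, parts) {
    n = split(s, parts, ";")
    for (i = 1; i <= n; i++)
      if (index(parts[i], key "=") == 1)
        return substr(parts[i], length(key) + 2)
    return ""
  }
  BEGIN { FS = OFS = "\t" }
  /^#/ { print; next }
  $3 == "CDS" {
    gene = attr($9, "Parent")
    cds  = attr($9, "ID")
    if (gene ~ /^gene-/) {
      tx = "rna-" cds   # assumed naming scheme for the synthetic transcript
      print $1, $2, "mRNA", $4, $5, ".", $7, ".", "ID=" tx ";Parent=" gene ";biotype=protein_coding"
      print $1, $2, "exon", $4, $5, ".", $7, ".", "ID=exon-" cds ";Parent=" tx
      sub("Parent=" gene, "Parent=" tx, $9)
    }
  }
  { print }
  '
}
```

It could be placed in front of the existing sort and compression step, for example `grep -v '^#' genomic.gff | add_transcripts | sort -k1,1 -k4,4n -k5,5n -t$'\t' | bgzip > genomic.gff.gz`. A CDS that is split across several lines sharing one ID would need extra handling that this sketch does not attempt.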
Thank you for the response @nuno-agostinho
Hi @dzc0104, I manually created the file by basically:
Tell me if you need further instructions.
If you downloaded the GFF3 annotation via the Cheers,
@nuno-agostinho Yay! It worked. Thank you very much, Nuno. Regards,
@nuno-agostinho I still have a question. How can position 77 be associated with several genes, namely F, M, NP, and P? During my analysis, I observed that genomic position 77 is annotated with the gene symbols F, M, NP, and P across various transcripts, like this: I got this information from the dataset at https://www.ncbi.nlm.nih.gov/nuccore/AF077761, which includes details about gene symbols and transcript types, but I'm not sure what it means biologically to have different genes at the same position.
Hi @dzc0104, The only results associated with genes F and M are However, the default distance between a variant and a transcript used by VEP to annotate up/downstream variants is 5 000 bp (optimised for vertebrates) and the genome you mentioned is small (15 186 bp). Please try to decrease the Hope this makes it clear. Cheers,
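The effect of the up/downstream window can be illustrated with a little arithmetic. The sketch below is an assumption-laden toy, not VEP's own logic: `genes_within` is a hypothetical helper that reads a tab-separated table of gene name, start, and end, and prints every gene lying within a given distance of a position, together with that distance. The coordinates for N and P are taken from the GFF sample in this issue; the other genes are left out because their coordinates are not shown here.

```shell
# Hypothetical helper: $1 = variant position, $2 = maximum distance in bp.
# Reads "name<TAB>start<TAB>end" on stdin and prints "name<TAB>distance"
# for every gene whose interval lies within the given distance.
genes_within() {
  awk -v pos="$1" -v max="$2" 'BEGIN { FS = "\t" }
  {
    d = 0                                   # 0 means the position is inside the gene
    if (pos + 0 < $2 + 0)      d = $2 - pos # gene starts after the position
    else if (pos + 0 > $3 + 0) d = pos - $3 # gene ends before the position
    if (d <= max + 0) print $1 "\t" d
  }'
}

# Gene intervals copied from the GFF sample (N: 56-1801, P: 1804-3254).
printf 'N\t56\t1801\nP\t1804\t3254\n' | genes_within 77 5000
printf 'N\t56\t1801\nP\t1804\t3254\n' | genes_within 77 100
```

With a 5 000 bp window, position 77 is reported against both N (distance 0) and P (distance 1727); with a 100 bp window only N remains. On a 15 186 bp genome, a 5 000 bp window therefore reaches a large share of the genes from almost any position.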
Hi @nuno-agostinho, Thank you for your assistance. As part of my data analysis, I've identified synonymous variants and I'm now exploring their potential impacts. While synonymous variants do not change the amino acid sequence and traditionally aren't thought to affect protein function, they can affect RNA stability, protein folding, splicing regulation, and regulatory elements. I've used the Variant Effect Predictor (VEP) with the SIFT option (-sift b), but unfortunately I didn't receive any relevant data in the output. Does this mean that no predictions are available for my variants? Here's the command I used: Additionally, I'm seeking recommendations for other tools to analyze the functional impacts of synonymous variants, particularly those focusing on RNA-level effects, splicing regulation, and non-protein-coding impacts. Thank you for your guidance! 😊 I have attached the link to the VCF file. Best regards,
Hi @dzc0104, VEP only returns pre-computed SIFT results stored in Ensembl databases in Regarding additional tools to help predict variant consequences, some articles list such tools:
Hope this information was useful. Cheers,
Hi @nuno-agostinho, I have a similar issue as the one originally reported by @dzc0104 regarding intergenic variant calling. I've built .gff3 files using both prokka and bakta for reference genomes against which I'm looking to find variants. Here's an excerpt of a bakta .gff3 below:
I've tried to make use of your method here:
and even changing CDS to gene in the .gff3 file and including a biotype to remedy the warning (just on the off chance...):
However, I still receive warnings ( Any recommendations here, or if you'd like me to provide test data, do let me know. Cheers,
Hi @Joshua-Macleod, Based on that warning, I would say that those lines have no field indicating their biotype, so VEP can't determine whether they are part of a Could you show me the lines in your GFF3 file relative to Best,
Hi @nuno-agostinho, Thanks for getting back to me. Here are the lines:
Worth noting, these aren't loci output by VEP (edit: presumably they wouldn't be, for the same reason they're noted in the warnings - I didn't put two and two together). Cheers,
Hi,
I am attempting to annotate a customized VCF file using NCBI's GFF and FASTA (.fna) files for the Newcastle disease virus (https://www.ncbi.nlm.nih.gov/datasets/genome/GCF_004786615.1/). However, all the variants are being classified as intergenic, which is not the case when they are viewed in IGV.
System
###Script
#To install the bgzip and tabix (I did it in my local terminal)
#Download htslib-1.19.1.tar.gz
tar -zxvf htslib-1.19.1.tar.gz
cd htslib-1.19.1
#removing header line of gff as vep does not work with files having header line (local terminal)
grep -v '^#' genomic.gff | sort -k1,1 -k4,4n -k5,5n -t$'\t' | bgzip > genomic.gff.gz
tabix -p gff genomic.gff.gz
#for compressing fasta file (local terminal and transfer all the files in super computer later)
bgzip -c GCF_004786615.1_ASM478661v1_genomic.fna > GCF_004786615.1_ASM478661v1_genomic.fna.gz
#for indexing fasta file
samtools faidx GCF_004786615.1_ASM478661v1_genomic.fna.gz
#creating a synonyms file that maps the chromosome names used in the VCF to those used in the GFF file
zcat iso1_filtered.snp.vcf.gz | grep -v '^#' | sort -k1,1 -o sorted_iso1.vcf
cut -f1 sorted_iso1.vcf > 1snpsynonyms.txt
zcat genomic.gff.gz | grep -v '^#' | sort -k1,1 -o sorted.gff
#variant annotation for SNPs using ASM478661v1
vep -i iso1_filtered.snp.vcf.gz --gff /home/shared/hauck_research/Deepa_NDV_updated/troubleshooting/ncbiASM478661/ncbi_dataset/data/GCF_004786615.1/genomic.gff.gz --fasta /home/shared/hauck_research/Deepa_NDV_updated/troubleshooting/ncbiASM478661/ncbi_dataset/data/GCF_004786615.1/GCF_004786615.1_ASM478661v1_genomic.fna.gz --synonyms 1snpsynonyms.txt --species avian_orthoavulavirus
Full error message
I did not get any warning messages when the script ran, but every variant in the output file was classified as intergenic.
Data files
A sample of the GFF after
NC_075404.1 RefSeq region 1 15186 . + . ID=NC_075404.1:1..15186;Dbxref=taxon:2560319;country=United Kingdom: N. Ireland;gbkey=Src;genome=genomic;isolate=chicken/N. Ireland/Ulster/67;mol_type=genomic RNA;old-name=Newcastle disease virus
NC_075404.1 RefSeq gene 56 1801 . + . ID=gene-QKC91_gp1;Dbxref=GeneID:80527638;Name=N;gbkey=Gene;gene=N;gene_biotype=protein_coding;locus_tag=QKC91_gp1
NC_075404.1 RefSeq CDS 122 1591 . + 0 ID=cds-YP_010790286.1;Parent=gene-QKC91_gp1;Dbxref=GenBank:YP_010790286.1,GeneID:80527638;Name=YP_010790286.1;gbkey=CDS;gene=N;locus_tag=QKC91_gp1;product=nucleoprotein;protein_id=YP_010790286.1
NC_075404.1 RefSeq gene 1804 3254 . + . ID=gene-QKC91_gp2;Dbxref=GeneID:80527633;Name=P;gbkey=Gene;gene=P;gene_biotype=protein_coding;locus_tag=QKC91_gp2
.....
A sample of the compressed VCF
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT iso1
NODE_1_length_6008_cov_909.877255 980 . T C 12078.64 PASS AC=1;AF=0.500;AN=2;BaseQRankSum=0.924;DP=624;ExcessHet=0.0000;FS=1.120;MLEAC=1;MLEAF=0.500;MQ=60.00;MQRankSum=0.000;QD=19.87;ReadPosRankSum=0.149;SOR=0.728 GT:AD:DP:GQ:PL 0/1:236,372:608:99:12086,0,6929
NODE_1_length_6008_cov_909.877255 3666 . C T 15573.64 PASS AC=1;AF=0.500;AN=2;BaseQRankSum=-0.079;DP=770;ExcessHet=0.0000;FS=7.765;MLEAC=1;MLEAF=0.500;MQ=60.00;MQRankSum=0.000;QD=20.88;ReadPosRankSum=0.795;SOR=0.362 GT:AD:DP:GQ:PL 0/1:235,511:746:99:15581,0,5829
NODE_1_length_6008_cov_909.877255 3812 . A G 534.64 ReadPosRankSum-8 AC=1;AF=0.500;AN=2;BaseQRankSum=1.096;DP=826;ExcessHet=0.0000;FS=15.515;MLEAC=1;MLEAF=0.500;MQ=60.00;MQRankSum=0.000;QD=0.66;ReadPosRankSum=-12.298;SOR=2.487 GT:AD:DP:GQ:PL 0/1:722,85:807:99:542,0,23105
NODE_1_length_6008_cov_909.877255 4631 . T C 1817.64 ReadPosRankSum-8 AC=1;AF=0.500;AN=2;BaseQRankSum=-3.725;DP=846;ExcessHet=0.0000;FS=22.208;MLEAC=1;MLEAF=0.500;MQ=60.00;MQRankSum=0.000;QD=2.24;ReadPosRankSum=-13.945;SOR=1.685 GT:AD:DP:GQ:PL 0/1:680,133:813:99:1825,0,21905
NODE_2_length_2668_cov_848.858356 289 . G A 924.64 PASS AC=1;AF=0.500;AN=2;BaseQRankSum=-1.811;DP=720;ExcessHet=0.0000;FS=0.000;MLEAC=1;MLEAF=0.500;MQ=59.97;MQRankSum=0.000;QD=1.50;ReadPosRankSum=-5.861;SOR=0.631 GT:AD:DP:GQ:PL 0/1:531,87:618:99:932,0,16256
.....
Synonyms text file format
NODE_1_length_6008_cov_909.877255 NC_075404.1
NODE_1_length_6008_cov_909.877255 NC_075404.1
NODE_1_length_6008_cov_909.877255 NC_075404.1
NODE_1_length_6008_cov_909.877255 NC_075404.1
NODE_2_length_2668_cov_848.858356 NC_075404.1
NODE_2_length_2668_cov_848.858356 NC_075404.1
.....
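The synonyms file above repeats the same mapping once per variant, whereas a name lookup only needs one line per distinct sequence name. A minimal sketch of building such a file straight from the VCF follows; `make_synonyms` is a hypothetical helper, and mapping every contig to a single reference name simply mirrors the file shown above rather than recommending it. Note that a synonyms file only renames sequences; it does not translate positions from one sequence to another.

```shell
# Hypothetical helper: $1 = sequence name used in the GFF (e.g. NC_075404.1).
# Reads an uncompressed VCF on stdin and prints one tab-separated
# "vcf_name<TAB>gff_name" line per distinct name in the first column.
make_synonyms() {
  awk -v ref="$1" 'BEGIN { FS = "\t" }
  !/^#/ && !seen[$1]++ { print $1 "\t" ref }'
}
```

It could be used as `zcat iso1_filtered.snp.vcf.gz | make_synonyms NC_075404.1 > 1snpsynonyms.txt`, which keeps the first occurrence of each name in VCF order.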
VEP output
ENSEMBL VARIANT EFFECT PREDICTOR v104.3
Output produced at 2024-02-09 19:23:53
Using API version 104, DB version ?
ensembl-funcgen version 104.f1c7762
ensembl-io version 104.1d3bb6e
ensembl version 104.1af1dce
ensembl-variation version 104.20f5335
Column descriptions:
Uploaded_variation : Identifier of uploaded variant
Location : Location of variant in standard coordinate format (chr:start or chr:start-end)
Allele : The variant allele used to calculate the consequence
Gene : Stable ID of affected gene
Feature : Stable ID of feature
Feature_type : Type of feature - Transcript, RegulatoryFeature or MotifFeature
Consequence : Consequence type
cDNA_position : Relative position of base pair in cDNA sequence
CDS_position : Relative position of base pair in coding sequence
Protein_position : Relative position of amino acid in protein
Amino_acids : Reference and variant amino acids
Codons : Reference and variant codon sequence
Existing_variation : Identifier(s) of co-located known variants
Extra column keys:
IMPACT : Subjective impact classification of consequence type
DISTANCE : Shortest distance from variant to transcript
STRAND : Strand of the feature (1/-1)
FLAGS : Transcript quality flags
SOURCE : Source of transcript
genomic.gff.gz : /home/shared/hauck_research/Deepa_NDV_updated/troubleshooting/ncbiASM478661/ncbi_dataset/data/GCF_004786615.1/genomic.gff.gz (overlap)
#Uploaded_variation Location Allele Gene Feature Feature_type Consequence cDNA_position CDS_position Protein_position Amino_acids Codons Existing_variation Extra
NODE_1_length_6008_cov_909.877255_980_T/C NODE_1_length_6008_cov_909.877255:980 C - - - intergenic_variant - - - - - - IMPACT=MODIFIER
NODE_1_length_6008_cov_909.877255_3666_C/T NODE_1_length_6008_cov_909.877255:3666 T - - - intergenic_variant - - - - - - IMPACT=MODIFIER
NODE_1_length_6008_cov_909.877255_3812_A/G NODE_1_length_6008_cov_909.877255:3812 G - - - intergenic_variant - - - - - - IMPACT=MODIFIER
NODE_1_length_6008_cov_909.877255_4631_T/C NODE_1_length_6008_cov_909.877255:4631 C - - - intergenic_variant - - - - - - IMPACT=MODIFIER
....