GFF format for custom Bacterial genome #1074

Closed
Adrian-Howard opened this issue Nov 10, 2021 · 3 comments
@Adrian-Howard

Hello,

I am trying to run VEP in offline mode with a bacterial genome, but variants that fall in a CDS region are reported as intergenic.

My reference files were downloaded directly from NCBI Genome, and I have prepared them as suggested at https://www.ensembl.org/info/docs/tools/vep/script/vep_cache.html#gff

This is an example of my gff file:
CP054425.1 RefSeq gene 1 1503 . + . ID=gene-EE567_RS00005;Name=dnaA;gbkey=Gene;gene=dnaA;gene_biotype=protein_coding;locus_tag=EE567_RS00005;old_locus_tag=EE567_000005
CP054425.1 Protein Homology CDS 1 1503 . + 0 ID=cds-WP_007057882.1;Parent=gene-EE567_RS00005;Dbxref=Genbank:WP_007057882.1;Name=WP_007057882.1;gbkey=CDS;gene=dnaA;inference=COORDINATES: similar to AA sequence:RefSeq:WP_013139947.1;locus_tag=EE567_RS00005;product=chromosomal replication initiator protein DnaA;protein_id=WP_007057882.1;transl_table=11
CP054425.1 RefSeq gene 2238 3362 . + . ID=gene-EE567_RS00010;Name=dnaN;gbkey=Gene;gene=dnaN;gene_biotype=protein_coding;locus_tag=EE567_RS00010;old_locus_tag=EE567_000010
CP054425.1 Protein Homology CDS 2238 3362 . + 0 ID=cds-WP_012576463.1;Parent=gene-EE567_RS00010;Dbxref=Genbank:WP_012576463.1;Name=WP_012576463.1;gbkey=CDS;gene=dnaN;inference=COORDINATES: similar to AA sequence:RefSeq:WP_007051765.1;locus_tag=EE567_RS00010;product=DNA polymerase III subunit beta;protein_id=WP_012576463.1;transl_table=11

This is the output I get:
#Uploaded_variation Location Allele Consequence IMPACT SYMBOL Gene Feature_type Feature BIOTYPE EXON INTRON HGVSc HGVSp cDNA_position CDS_position Protein_position Amino_acids Codons STRAND
CP054425.1_459_C/G CP054425.1:459 G intergenic_variant MODIFIER - - - - - - - - - - - - - - -
CP054425.1_462_G/A CP054425.1:462 A intergenic_variant MODIFIER - - - - - - - - - - - - - - -
CP054425.1_723_C/T CP054425.1:723 T intergenic_variant MODIFIER - - - - - - - - - - - - - - -

I have tried modifying the GFF to include a transcript and to make it simpler, as follows, but I still have problems with the output.
CP054425.1 RefSeq gene 1 1503 . + . ID=gene-EE567_RS00005;Name=dnaA
CP054425.1 Protein Homology transcript 1 1503 . + . ID=transcript-TWP_007057882.1;Parent=gene-EE567_RS00005;biotype=protein_coding
CP054425.1 Protein Homology exon 1 1503 . + . ID=exon-WP_007057882.1;Parent=transcript-TWP_007057882.1
CP054425.1 RefSeq gene 2238 3362 . + . ID=gene-EE567_RS00010;Name=dnaN
CP054425.1 Protein Homology transcript 2238 3362 . + . ID=transcript-TWP_012576463.1;Parent=gene-EE567_RS00010;biotype=protein_coding
CP054425.1 Protein Homology exon 2238 3362 . + . ID=exon-WP_012576463.1;Parent=transcript-TWP_012576463.1

This is the output I get:
#Uploaded_variation Location Allele Consequence IMPACT SYMBOL Gene Feature_type Feature BIOTYPE EXON INTRON HGVSc HGVSp cDNA_position CDS_position Protein_position Amino_acids Codons STRAND
CP054425.1_459_C/G CP054425.1:459 G intergenic_variant MODIFIER dnaA gene-EE567_RS00005 Transcript transcript-TWP_007057882.1 protein_coding 1/1 - transcript-TWP_007057882.1:n.459C>G - 459 - - - - 1
CP054425.1_459_C/G CP054425.1:459 G upstream_gene_variant MODIFIER dnaN gene-EE567_RS00010 Transcript transcript-TWP_012576463.1 protein_coding - - - - - - - - - 1
CP054425.1_459_C/G CP054425.1:459 G upstream_gene_variant MODIFIER EE567_RS00020 gene-EE567_RS00020 Transcript transcript-TWP_032743437.1 protein_coding - - - - - - - - - 1
CP054425.1_459_C/G CP054425.1:459 G upstream_gene_variant MODIFIER recF gene-EE567_RS00015 Transcript transcript-TWP_032743438.1 protein_coding - - - - - - - - - 1
CP054425.1_459_C/G CP054425.1:459 G upstream_gene_variant MODIFIER gyrB gene-EE567_RS00025 Transcript transcript-TWP_172664385.1 protein_coding - - - - - - - - - 1

This is my vep command line:
vep -e -i Bio-Kult_Bi-26_all_Q30DP_norm.vcf -o Bio-Kult_Bi-26_all_Q30DP_norm_ann --tab --fields "Uploaded_variation,Location,Allele,Consequence,IMPACT,SYMBOL,Gene,Feature_type,Feature,BIOTYPE,EXON,INTRON,HGVSc,HGVSp,cDNA_position,CDS_position,Protein_position,Amino_acids,Codons,STRAND" --custom B_longum_subps_infantis_Bi-26_transcript.gff.gz,Bi-26,gff --fasta B_longum_subps_infantis_Bi-26.fasta.gz

I will play around with the gff file to see if it works.

Best,
Adrian

@dglemos
Contributor

dglemos commented Nov 10, 2021

Hi @Adrian-Howard,
Bacterial genomes in NCBI are annotated with CDS features only, which means that in the GFF file each CDS is attached directly to its gene, without any transcript or exon features. That is why the variants are being annotated as intergenic.

Could you try again with the following GFF example:

CP054425.1 RefSeq gene 1 1503 . + . ID=gene-EE567_RS00005;Name=dnaA
CP054425.1 Protein Homology transcript 1 1503 . + . ID=transcript-TWP_007057882.1;Parent=gene-EE567_RS00005
CP054425.1 Protein Homology exon 1 1503 . + . ID=exon-WP_007057882.1;Parent=transcript-TWP_007057882.1
CP054425.1 RefSeq gene 2238 3362 . + . ID=gene-EE567_RS00010;Name=dnaN
CP054425.1 Protein Homology transcript 2238 3362 . + . ID=transcript-TWP_012576463.1;Parent=gene-EE567_RS00010
CP054425.1 Protein Homology exon 2238 3362 . + . ID=exon-WP_012576463.1;Parent=transcript-TWP_012576463.1
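
For a whole genome, the same restructuring could be scripted rather than edited by hand. The following is only a minimal sketch, not part of VEP: the function names and the `transcript-`/`exon-` ID scheme are hypothetical, it assumes a tab-delimited NCBI-style GFF3 in which each CDS has `Parent=gene-...`, and it assumes each CDS is a single segment. Keeping the CDS line (re-parented to the new transcript) and adding `biotype=protein_coding` are assumptions based on the VEP GFF documentation linked above, not something confirmed in this thread.

```python
def parse_attrs(field):
    """Split a GFF3 attribute column into an ordered list of (key, value) pairs."""
    return [tuple(kv.split("=", 1)) for kv in field.split(";") if "=" in kv]


def add_transcript_layer(lines):
    """Yield GFF3 lines with a transcript and an exon inserted between gene and CDS."""
    for raw in lines:
        line = raw.rstrip("\n")
        cols = line.split("\t")
        # Pass through comments, blank lines and anything that is not a CDS.
        if line.startswith("#") or len(cols) != 9 or cols[2] != "CDS":
            yield line
            continue
        attrs = parse_attrs(cols[8])
        lookup = dict(attrs)
        parent = lookup.get("Parent", "")
        cds_id = lookup.get("ID", "")
        # Only rewrite CDS features that hang directly off a gene.
        if not parent.startswith("gene-") or not cds_id:
            yield line
            continue
        suffix = cds_id[len("cds-"):] if cds_id.startswith("cds-") else cds_id
        tx_id = "transcript-" + suffix  # hypothetical naming scheme
        # Transcript and exon reuse the CDS coordinates; phase is set to ".".
        tx = cols[:7] + [".", f"ID={tx_id};Parent={parent};biotype=protein_coding"]
        tx[2] = "transcript"
        exon = cols[:7] + [".", f"ID=exon-{suffix};Parent={tx_id}"]
        exon[2] = "exon"
        # Re-parent the CDS to the new transcript, keeping its other attributes.
        new_attrs = ";".join(f"{k}={tx_id if k == 'Parent' else v}" for k, v in attrs)
        cds = cols[:8] + [new_attrs]
        yield "\t".join(tx)
        yield "\t".join(exon)
        yield "\t".join(cds)


example = [
    "CP054425.1\tRefSeq\tgene\t1\t1503\t.\t+\t.\tID=gene-EE567_RS00005;Name=dnaA",
    "CP054425.1\tProtein Homology\tCDS\t1\t1503\t.\t+\t0\t"
    "ID=cds-WP_007057882.1;Parent=gene-EE567_RS00005;transl_table=11",
]
converted = list(add_transcript_layer(example))
for row in converted:
    print(row)
```

The output would still need to be sorted, bgzip-compressed and tabix-indexed as described in the linked documentation before passing it to `--custom`. A CDS split across several lines with the same ID would get one transcript per segment with this sketch, so those cases would need separate handling.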

@dglemos self-assigned this Nov 10, 2021
@dglemos assigned nuno-agostinho and unassigned dglemos Mar 29, 2022
@nuno-agostinho
Contributor

Hi @Adrian-Howard!

Did the comment by @dglemos help you with this issue?

Please reach out if you still want to discuss this.

Thanks,
Nuno

@nuno-agostinho
Contributor

Hi @Adrian-Howard, hope everything is going well with you.

I'll mark this issue as stale for now, but please feel free to open it again if you are still having issues. Thanks!

@nuno-agostinho closed this as not planned Jul 25, 2022