Errors in submitting annotations to NCBI #330

Closed
kevinmyers opened this issue Sep 27, 2024 · 5 comments · Fixed by #334

@kevinmyers

I submitted Bakta annotations to NCBI this week and over half of them had fatal errors. They weren't hard to fix, but I wanted to let you know in case a future update can avoid them. I am using Bakta version 1.9.1, installed via conda, and ran it with the --compliant flag.

FATAL: SUSPECT_PRODUCT_NAMES: 1 feature equals 'tmRNA'. Is this a tmRNA or is it a protein?
(Looking at the product it appears to be a hypothetical protein, so I changed it to that)

FATAL: SUSPECT_PRODUCT_NAMES: 1 feature starts with '-'
(Product name: putative-PNPOx domain-containing protein)

FATAL: SUSPECT_PRODUCT_NAMES: 2 features start with '''
(Product name: 'chromo' domain containing protein)
(Product name: 'Cold-shock' DNA-binding domain)

FATAL: 1 feature contains 'remnant'
(Product name: Remnant of transposase, IS3 family)

FATAL: SUSPECT_PRODUCT_NAMES: 2 features contain '#'
(Product name: ATPase/5###-3### helicase helicase subunit RecD of the DNA repair enzyme RecBCD (exonuclease V))
(Product name: 3###-5### helicase subunit RecB of the DNA repair enzyme RecBCD (exonuclease V))
(Product name: putative DNA-binding protein with ###double-wing### structural motif, MmcQ/YjbR family)
(Product name: Anthranilate synthase, amidotransferase component Para-aminobenzoate synthase, amidotransferase component # TrpAbPabAb)
(Product name: Chorismate mutase I # AroHI)

FATAL: RRNA_NAME_CONFLICTS: 3 rRNA product names are not standard. Correct the names to the standard format, eg "16S ribosomal RNA"
(Product name: (partial) 23S ribosomal RNA)
(Product name: (5' truncated) 16S ribosomal RNA)
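The messages above can be turned into a quick local screen before submission. This is only a minimal sketch: the rule table and the names `SUSPECT_RULES`, `find_suspect_products`, and `is_standard_rrna_name` are hypothetical, the patterns are read literally off the quoted messages, and none of this is NCBI's actual validator. A literal check will not reproduce every hit (for example, it does not flag "putative-PNPOx domain-containing protein", since the thread does not say how NCBI arrives at "starts with '-'" for that name).

```python
import re

# Hypothetical rule table: (reason, predicate). The patterns mirror the
# messages quoted in this issue; they are not NCBI's real validation rules.
SUSPECT_RULES = [
    ("equals 'tmRNA'", lambda p: p.strip().lower() == "tmrna"),
    ("starts with '-'", lambda p: p.startswith("-")),
    ("starts with a single quote", lambda p: p.startswith("'")),
    ("contains 'remnant'", lambda p: "remnant" in p.lower()),
    ("contains '#'", lambda p: "#" in p),
]

# Assumption: a standard rRNA name looks like "16S ribosomal RNA" with no
# prefix or suffix, following the example given in the error message.
RRNA_NAME = re.compile(r"^\d+(\.\d+)?S ribosomal RNA$")


def find_suspect_products(products):
    """Return (product, reason) pairs for every rule a product name trips."""
    hits = []
    for product in products:
        for reason, predicate in SUSPECT_RULES:
            if predicate(product):
                hits.append((product, reason))
    return hits


def is_standard_rrna_name(product):
    """True if the rRNA product name matches the assumed standard format."""
    return RRNA_NAME.match(product) is not None
```

Running `find_suspect_products` over the product column of the TSV output would list candidates to fix by hand before uploading.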

@kevinmyers kevinmyers added the bug Something isn't working label Sep 27, 2024
@oschwengers oschwengers self-assigned this Sep 27, 2024
@oschwengers oschwengers added enhancement New feature or request and removed bug Something isn't working labels Sep 27, 2024
@oschwengers oschwengers added this to the v1.10.0 milestone Sep 27, 2024
@oschwengers
Owner

oschwengers commented Sep 27, 2024

Hi @kevinmyers ,
thanks a lot for reaching out and reporting these. It's hard to keep up with all potential submission issues, especially with cluttered protein names, but we will do our best.

I will soon add a few sanitizing rules and steps so that Bakta can handle as many of these cases as possible.

Thanks again. I'll keep you updated.

@kevinmyers
Author

No problem. Bakta is the best annotation tool I've used for annotating our metagenomics samples. I love it and am happy to do whatever I can to help improve it.

@oschwengers
Owner

OK, I have added a few additional checks and product improvements fixing the following:

  • FATAL: SUSPECT_PRODUCT_NAMES: 2 features start with ''' -> fixed in ae4142b
  • FATAL: SUSPECT_PRODUCT_NAMES: 2 features contain '#' -> fixed in 4d9f8b6
  • FATAL: RRNA_NAME_CONFLICTS: 3 rRNA product names are not standard. Correct the names to the standard format, eg "16S ribosomal RNA" -> fixed in #333 (Revise truncated pseudo attributes)
  • FATAL: SUSPECT_PRODUCT_NAMES: 1 feature starts with '-' -> fixed in ae4142b
  • FATAL: 1 feature contains 'remnant' -> fixed in 51f4d11
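For readers who need to clean existing annotations by hand, the character-level cases in this list could be handled roughly as follows. This is a sketch under stated assumptions, not the code from the commits above: `sanitize_product` is a hypothetical helper, and treating a run of '#' as a mangled quote or prime character that can simply be dropped is a guess based on the example names in the report. The 'remnant' and rRNA cases are left out because the thread does not describe how they were resolved.

```python
import re


def sanitize_product(product: str) -> str:
    """Clean a product name (hypothetical helper, not Bakta's implementation)."""
    name = product.strip()
    # Drop a trailing "# Symbol" comment, e.g. "Chorismate mutase I # AroHI".
    name = re.sub(r"\s+#\s+\S+$", "", name)
    # Assumption: a run of '#' is a mangled quote or prime character; remove it.
    name = re.sub(r"#+", "", name)
    # Unwrap a leading single-quoted word: "'chromo' domain" -> "chromo domain".
    name = re.sub(r"^'([^']*)'", r"\1", name)
    # Remove any leftover leading dashes or quotes, then collapse whitespace.
    name = name.lstrip("-'")
    name = re.sub(r"\s+", " ", name).strip()
    return name
```

Applied to the reported names, this turns "'Cold-shock' DNA-binding domain" into "Cold-shock DNA-binding domain" and strips the trailing "# AroHI" from the chorismate mutase entry.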

However, for the following one I need an example, or better, the exact feature entry, e.g. from the TSV or JSON file. This would help to pinpoint these cases.

  • FATAL: SUSPECT_PRODUCT_NAMES: 1 feature equals 'tmRNA'. Is this a tmRNA or is it a protein?

@kevinmyers
Author

Thanks @oschwengers!

I'm attaching one of the discrepancy reports for the tmRNA problem. Here are the associated lines in the GFF file:

LacMBR1_d26_Ctrl_pb1	Prodigal	gene	585203	585481	.	-	.	ID=ACE6IH_02570_gene;locus_tag=ACE6IH_02570

LacMBR1_d26_Ctrl_pb1	Prodigal	CDS	585203	585481	.	-	0	ID=ACE6IH_02570;Name=hypothetical_protein;locus_tag=ACE6IH_02570;product=hypothetical_protein;Parent=ACE6IH_02570_gene;inference=ab initio prediction:Prodigal:2.6;Note=RefSeq:WP_048373218.1,SO:0001217,UniParc:UPI0006533E88,UniRef:UniRef100_A0A0J6JD09,UniRef:UniRef50_A0A7Y1EVI1,UniRef:UniRef90_A0A6A7YFX6

Discrepancy_UW_FK_PSEUD1_1_out.txt
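Feature lines like the two above can be looked up programmatically by locus tag. The sketch below assumes standard tab-separated GFF3 with nine columns and `key=value` attributes separated by `;`; the function name `parse_gff3_line` and the returned dictionary layout are illustrative choices, not part of Bakta.

```python
from urllib.parse import unquote


def parse_gff3_line(line):
    """Split one GFF3 feature line into its columns and attribute dictionary."""
    cols = line.rstrip("\n").split("\t")
    if len(cols) != 9:
        raise ValueError(f"expected 9 tab-separated columns, got {len(cols)}")
    attributes = {}
    for pair in cols[8].split(";"):
        if "=" in pair:
            key, value = pair.split("=", 1)
            # GFF3 percent-encodes reserved characters in attribute values.
            attributes[key] = unquote(value)
    return {
        "seqid": cols[0],
        "source": cols[1],
        "type": cols[2],
        "start": int(cols[3]),
        "end": int(cols[4]),
        "strand": cols[6],
        "attributes": attributes,
    }
```

Filtering on `type == "CDS"` and `attributes["locus_tag"]` then gives the product name and the database cross-references listed in `Note`.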

@oschwengers
Owner

Hmm, very odd/interesting. There are indeed entries in UniRef solely annotated with TmRNA. Bakta now discards these annotations, since they're not very informative anyway.
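The idea of discarding such annotations can be illustrated with a small sketch. It is not Bakta's actual logic, which is not shown in this thread: the blocklist `UNINFORMATIVE_CDS_PRODUCTS`, the function `resolve_cds_product`, and the "hypothetical protein" fallback (the value the reporter chose manually) are all assumptions for illustration.

```python
# Hypothetical blocklist of product names that say nothing useful about a CDS.
UNINFORMATIVE_CDS_PRODUCTS = {"tmrna"}


def resolve_cds_product(candidates, fallback="hypothetical protein"):
    """Return the first informative candidate product, else the fallback."""
    for candidate in candidates:
        if candidate.strip().lower() not in UNINFORMATIVE_CDS_PRODUCTS:
            return candidate
    return fallback
```

With this approach a CDS whose only database hit is named "TmRNA" ends up as "hypothetical protein", while a CDS with any other candidate name keeps that name.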

Again, thanks a lot for reporting! These changes are now public in the main branch and will be released with v1.10.0 soon. If you run into more of these common fatal errors, please do not hesitate to report them (in new issues).
