So I have created a custom database by taking the refseq proteins and adding proteins from a database called RUG2. When I run classification with the regular refseq database on one of my samples, I get about 18M classified reads. When I run the same sample with refseq plus RUG2, I only get about 11K classified reads. I don't understand why adding proteins to an existing database to create a new one results in so many fewer classifications. I'm happy to share any files you need to debug the issue. Any help would be highly appreciated.
The taxonomy must also work out for the RUG2 database: does your FASTA file have proper headers with taxonomy IDs that are also contained in your names.dmp / nodes.dmp?
What happens when you build a kaiju index of only the RUG2 database and classify the reads against it?
Use one of the sequences from your DB as input to kaiju -p to classify it; it should be found (obviously).
So if I have some headers in my custom fasta file that do NOT have tax IDs that occur in nodes.dmp... will that cause problems?
When I run kaiju using only the RUG2 database, I get very few classifications.
When I get one of the RUG2 protein sequences and run it against my custom DB with kaiju -p, it DOES NOT classify it. So that's obviously a problem.
Looks like if there is any header where the tax ID does not occur in nodes.dmp then it screws up the database. Once I took out the proteins that had tax IDs that don't occur in nodes.dmp (and proteins with X's in them), the database built properly and it seems to be classifying reads well.
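For anyone hitting the same problem, here is a rough Python sketch of the filtering step described above. It is not part of kaiju. It assumes that each FASTA header is either a bare numeric tax ID or an arbitrary prefix followed by an underscore and the tax ID (e.g. `>7_1358`), and that nodes.dmp is the usual NCBI dump with the tax ID in the first `|`-separated column. The function names are made up for illustration.

```python
def load_taxids(nodes_lines):
    """Collect the tax IDs found in the first column of nodes.dmp."""
    taxids = set()
    for line in nodes_lines:
        if line.strip():
            taxids.add(line.split("|", 1)[0].strip())
    return taxids


def header_taxid(header):
    """Return the tax ID from a header such as '>1358' or '>7_1358'.

    Assumption: the tax ID is whatever follows the last underscore
    in the first whitespace-delimited token of the header.
    """
    fields = header[1:].split()
    name = fields[0] if fields else ""
    return name.rsplit("_", 1)[-1]


def filter_fasta(fasta_lines, valid_taxids):
    """Split records into (kept, dropped) lists of (header, sequence).

    A record is dropped if its tax ID is not in valid_taxids or if
    its sequence contains an X.
    """
    kept, dropped = [], []
    header, chunks = None, []

    def flush():
        # Classify the record collected so far, if there is one.
        if header is None:
            return
        seq = "".join(chunks)
        ok = header_taxid(header) in valid_taxids and "X" not in seq.upper()
        (kept if ok else dropped).append((header, seq))

    for line in fasta_lines:
        line = line.strip()
        if line.startswith(">"):
            flush()
            header, chunks = line, []
        elif line:
            chunks.append(line)
    flush()
    return kept, dropped


if __name__ == "__main__":
    # Tiny in-memory example; real input would come from open(...) on
    # nodes.dmp and the custom FASTA file.
    nodes = ["1\t|\t1\t|\tno rank\t|\n", "1358\t|\t1357\t|\tspecies\t|\n"]
    fasta = [">1_1358", "MKVLLA", ">2_999999", "MKT", ">3_1358", "MXK"]
    kept, dropped = filter_fasta(fasta, load_taxids(nodes))
    print("kept:", [h for h, _ in kept])
    print("dropped:", [h for h, _ in dropped])
```

Both functions accept any iterable of lines, so open file handles can be passed directly. Writing the kept records back out to a new FASTA file before building the index is left out to keep the sketch short.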