Databases for v4.3 #292

Closed
vragh opened this issue Jun 4, 2021 · 3 comments

@vragh

vragh commented Jun 4, 2021

Dear SortMeRNA team,

Thank you for the amazing tool!!

I see that v4.3 is out, and the release page lists a bunch of databases. The v4.0 documentation page lists yet another bunch of databases (which I can't seem to find).

The documentation for v4.3 seems to be WIP, so I'd like to ask:

  • Which databases should I use? The one in the release, or these other SILVA/Rfam databases?
  • If it's the ones from the release, which one in particular? What's the difference between smr_v4.3_sensitive_db.fasta and smr_v4.3_sensitive_db_rfam_seeds.fasta? What's the relationship between these and the separate SILVA/Rfam databases mentioned above?
  • If I were to just use all the SILVA and Rfam databases (not yours, but ones straight from the source), can I just specify all of them? Would this be folly? Is it better to use one of your bundled databases instead?

Your assistance would be much appreciated. (Super sorry for asking a ton of questions, I'm just really confused.)

@ekopylova
Contributor

Hello @vragh,

We recommend using smr_v4.3_default_db.fasta.

The databases differ in the percent identity (% ID) at which the sequences are clustered for each kingdom and rRNA component.
Specifically:

smr_v4.3_default_db.fasta -> bac-16S 90%, 5S & 5.8S seeds, rest 95% (benchmark accuracy: 99.899%)
smr_v4.3_sensitive_db.fasta -> all 97% (benchmark accuracy: 99.907%)
smr_v4.3_sensitive_db_rfam_seeds.fasta -> all 97%, except the Rfam database, which includes the full seed database sequences

The accuracy (based on sensitivity and selectivity) is very good for both the default and the "sensitive" database; however, the "sensitive" database will run at least 2x slower.

Yes, you can use the complete SILVA and Rfam databases. However, you will not gain much in terms of sensitivity (e.g. sensitivity for the default database is already >99.9%), and the program will run much slower.

Best,
Jenya
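
For anyone scripting a run with one of these databases, here is a minimal sketch of assembling the command line from Python. It assumes the `--ref`, `--reads`, and `--workdir` options as used by SortMeRNA v4 (please check `sortmerna -h` for your installed version); the reads file name and working directory are hypothetical placeholders. The function only builds the argument list and does not launch the program.

```python
from typing import List, Sequence


def build_sortmerna_command(
    refs: Sequence[str],
    reads: Sequence[str],
    workdir: str = "sortmerna_run",  # hypothetical working directory
) -> List[str]:
    """Build an argument list with one --ref per database and one --reads per file."""
    if not refs:
        raise ValueError("at least one reference database is required")
    if not reads:
        raise ValueError("at least one reads file is required")

    cmd = ["sortmerna"]
    for ref in refs:
        # Assumption: each reference FASTA is passed with its own --ref option.
        cmd += ["--ref", ref]
    for reads_file in reads:
        cmd += ["--reads", reads_file]
    cmd += ["--workdir", workdir]
    return cmd


# Recommended default database from the comment above; the reads file is a placeholder.
cmd = build_sortmerna_command(
    refs=["smr_v4.3_default_db.fasta"],
    reads=["sample_R1.fastq.gz"],
)
print(" ".join(cmd))
```

The resulting list can be handed to `subprocess.run(cmd, check=True)` once the paths point at real files. Passing several entries in `refs` corresponds to specifying multiple databases, with the runtime cost described above.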

@vragh
Author

vragh commented Jun 8, 2021

Hi @ekopylova,

Thank you for the pointers. I decided to go ahead with smr_v4.3_sensitive_db_rfam_seeds.fasta.

vragh closed this as completed Jun 8, 2021

@ppericard
Contributor

@ekopylova, @vragh, just to be sure that the information is complete:

smr_v4.3_sensitive_db.fasta is the most complete database. As noted above, it includes all databases clustered at 97% identity, including the full Rfam 5S & 5.8S database (~150,000 sequences before clustering).

On the other hand, smr_v4.3_sensitive_db_rfam_seeds.fasta includes all databases clustered at 97% identity, except for Rfam, where it includes only the seed sequences for 5S & 5.8S (~800 sequences).

I'm not sure there is much difference in running time between these two databases, but if you're looking for maximum sensitivity, I would advise using smr_v4.3_sensitive_db.fasta.

Best,
Pierre
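
Since the two "sensitive" databases differ mainly in how many Rfam 5S & 5.8S sequences they contain, one quick way to compare them locally is to count the FASTA records in each file. The following is a minimal sketch that counts header lines (those starting with `>`); the helper names are made up for illustration, and it assumes plain, uncompressed FASTA files.

```python
from typing import Iterable


def count_fasta_records(lines: Iterable[str]) -> int:
    """Count FASTA records by counting header lines that start with '>'."""
    return sum(1 for line in lines if line.startswith(">"))


def count_fasta_file(path: str) -> int:
    """Count records in an uncompressed FASTA file on disk."""
    with open(path, "r", encoding="utf-8", errors="replace") as handle:
        return count_fasta_records(handle)


# Tiny in-memory example with two made-up records.
example = [">seq1", "ACGUACGU", ">seq2", "GGCC", "AAUU"]
print(count_fasta_records(example))
```

Running `count_fasta_file` on smr_v4.3_sensitive_db.fasta and on smr_v4.3_sensitive_db_rfam_seeds.fasta should show how much the two actually differ in size after clustering.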
