provide/update/refactor documentation around k-mer sizes and scaled #2918

Closed
ctb opened this issue Jan 13, 2024 · 1 comment · Fixed by #2921

ctb commented Jan 13, 2024

per @LilyAnderssonLee review

> I may have overlooked it in your documentation. It would be helpful if you could provide some recommendations for the selection of k-mers and scales.

and

> What are the consequences of using different scales? Do you recommend using the same scale to generate the signatures for genomes or databases?

ctb commented Jan 14, 2024

Partially answered here https://sourmash.readthedocs.io/en/latest/using-sourmash-a-guide.html#what-k-mer-size-s-should-i-use and here https://sourmash.readthedocs.io/en/latest/using-sourmash-a-guide.html#what-resolution-should-my-signatures-be-how-should-i-create-them. I think we should add a pointer to this in the FAQ to make it more discoverable, and also highlight this section:

> One additional wrinkle is that we provide a number of precalculated databases at k=21, k=31, and k=51. It is often convenient to calculate signatures at these sizes so that you can use these databases.

@ctb ctb closed this as completed in #2921 Jan 15, 2024
ctb added a commit that referenced this issue Jan 15, 2024
Adds the following FAQ entry to address #2918:

> ## What scaled values should I use with sourmash?
>
> We recommend scaled=1000 or scaled=10000 when working with bacterial
> and archaeal sketches and DNA. We have quite a bit of experience with
> this, and even some
> [published benchmarks](https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-022-05103-0)
> showing that this works very well. You may need to use lower scaled
> values with smaller query and target sequences, such as viral genomes
> or genes, but we do not have systematic advice on this.
>
> That having been said, you can always use a lower scaled value - the only
> consequence is that memory and compute requirements increase.
>
> Also, sourmash will automatically use the larger of two scaled values
> when comparing two sketches with different scaled values. So if, for example,
> you use [the precomputed databases](databases.md), you will always end up
> using your query sketches at a minimum scaled of 1000, even if you created
> them with a lower scaled value.
>
> Please also see [What resolution should my signatures be?](using-sourmash-a-guide.md#what-resolution-should-my-signatures-be-how-should-i-create-them).

Fixes #2918

---------

Co-authored-by: Colton Baumler <63077899+ccbaumler@users.noreply.github.com>
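
The FAQ entry's point that two sketches with different scaled values are compared at the larger one can be illustrated with a small conceptual sketch. This is not sourmash's code or API: the function names (`sketch`, `downsample`, `compare`) are hypothetical, the hash is a standard-library stand-in, and reverse-complement handling is omitted. The only idea it relies on is that a scaled sketch keeps the hashes falling below roughly 2^64 / scaled, so a sketch built at a lower scaled value can always be filtered down to a higher one, but not the other way around.

```python
import hashlib

MAX_HASH = 2**64  # size of the 64-bit hash space


def kmers(seq, k):
    """Yield every overlapping k-mer of seq."""
    return (seq[i:i + k] for i in range(len(seq) - k + 1))


def hash_kmer(kmer):
    """Stand-in 64-bit hash (assumption: chosen only to stay in the stdlib)."""
    digest = hashlib.blake2b(kmer.encode(), digest_size=8).digest()
    return int.from_bytes(digest, "big")


def sketch(seq, k, scaled):
    """Keep roughly 1/scaled of the k-mer hashes: those below the threshold."""
    threshold = MAX_HASH // scaled
    return {h for h in map(hash_kmer, kmers(seq, k)) if h < threshold}


def downsample(hashes, scaled):
    """Filter an existing sketch to a larger scaled value (a smaller subset)."""
    threshold = MAX_HASH // scaled
    return {h for h in hashes if h < threshold}


def compare(a, scaled_a, b, scaled_b):
    """Jaccard similarity at the larger scaled value, plus the scaled used."""
    common = max(scaled_a, scaled_b)
    a, b = downsample(a, common), downsample(b, common)
    union = a | b
    similarity = len(a & b) / len(union) if union else 0.0
    return similarity, common
```

Under these assumptions, a query sketched at scaled=100 and compared against a database sketched at scaled=1000 is effectively reduced to scaled=1000 first, which is why a lower scaled value costs memory and compute without adding resolution in that comparison.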