-
Notifications
You must be signed in to change notification settings - Fork 80
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Can lca summarize kmers counts be transformed to a corresponding fraction of the community? #1011
Comments
hi, I'm afraid we don't have that functionality implemented yet! It's briefly mentioned here as one of the future for the LCA databases: but I don't have a timeline for that work just yet. Let's leave this issue open and I'll update you when I know more. |
Hmm, you know... I think I could give you some code that takes your gather output and a lineage spreadsheet, and does the summary on the f_unique_weighted - the only tricky bits there are the lineage loading, and I have lots of code for that in various projects. It fits pretty well with stuff I want to add to sourmash anyway. I'll whip that up when I get a chance. |
I also have something that kind of does that =P This is the relevant block of code that generates a dataframe with lineage information in |
Thanks @luizirber and @ctb for your messages. |
hi @cesarcardonau yep, understood! I have a few thoughts -- first, very few nucleotide k-mers are informative beyond genus, as far as we can tell - see http://ivory.idyll.org/blog/2017-how-specific-kmers.html. A fair number of those that appear to be may in fact be due to contamination or taxonomic mistakes (see http://ivory.idyll.org/blog/2020-sourmash-gtdb-oddities.html for the source of this intuition). second, now that we are using more complete / larger databases such as GTDB, I suspect that gather is going to be about as sensitive as LCA methods, and a lot more specific. we certainly do find that gather outperforms most taxonomic assignment software (I'm not sure which links @luizirber wants to point, that work will be available soon). third, some in depth work we are doing in our charcoal project suggests that LCA-based identification is not super reliable. fourth, and I guess the most practical point, is that we might not support the specific LCA feature you want for a bit :) |
I've implemented a simple script to do this reporting for gather results, attached as a .txt file. The key bits of code are this function, which:
To run the attached script, you'll need gather results, and a lineage spreadsheet in the style that |
some example results here:
|
so... I think I totally misunderstood this issue because I was focused on something else 😂 It is relatively straightforward to support queries into LCA databases with signatures that have abundances. Totally different problem than having LCA databases support tracking of abundances, which is what we're discussing in #581 and doing in #1015 My apologies. No promises, but this should be a relatively short hack... |
Thanks @ctb, this sounds promising. |
yes, exactly! Sorry, I was just confused! |
great, we are on the same page! thanks for clarifying! |
Working on this in #1022. Not quite as easy as I'd expected - there seem to be multiple places where abundances are effectively flattened! - but I've got some tests set up and will work through it. |
hi @cesarcardonau I've got something that looks right to me in #1022. I have some more tests to write, but in the meantime it should be ready for use if you want to try it out; you'd need to install that branch in developer mode, per these instructions. Let me know if you are interested and need help getting that to work. Alternatively we will probably cut a new release (3.3.1?) in the next few weeks and this should be in it by then. |
Thanks so much for @ctb. I am okay waiting for the next release, I appreciate you guys prioritizing this issue and give such a quick turnaround. Excited to try it in a few weeks or so. |
@cesarcardonau and @taylorreiter here are some attempts by me to ground-truth the calculation. Feedback welcome! sourmash lca summarize --with-abundanceHMP signature from sample G36354Using the GTDB database, gather reports the following genomic breakdown (using a custom script (see comment above) that aggregates taxonomy of gather matches to higher levels) --
vs sourmash lca summarize --with-abundance:
So we see that these reports are pretty consonant in numbers, even for pretty disparate types of analyses! Fake/simulated abundancesI built a fake signature from 5x one genome, 1x another genome, like so:
Note here that 1.fa, an Archaeum, is about 1/4 the size of 63.fa, a Bacterium -- % wc ../podar-ref/{1,63}.fa
21242 21246 1508077 ../podar-ref/1.fa
66992 67018 5426146 ../podar-ref/63.fa
88234 88264 6934223 total that is, one genome (archaea) is 22% of the DNA while the other (bacteria) is 78% of the DNA. >>> size1 = 1508077
>>> size2 = 5426146
>>> size1 / (size1 + size2)
0.21748319891067824
>>> size2 / (size1 + size2)
0.7825168010893218 This is reflected in the output of sourmash lca summarize at the kingdom level ---
so that matches well, in terms of ratios of k-mers! When you add % sourmash lca summarize --db ../podar-ref.lca.json.gz --query xxx.sig --with-abundance
43.2% 563 Bacteria
56.8% 740 Archaea which (approximately) matches the math weighting the first genome by a factor of five -- >>> size1 = 1508077
>>> size2 = 5426146
>>> size1_weighted = size1*5
>>> total = size1_weighted + size2
>>> f_1 = size1_weighted / total
>>> f_2 = size2 / total
>>> f_1
0.5815267784421292
>>> f_2
0.4184732215578708 huzzah! This all seems to make sense to me. This is something I'll add into the docs in #1022, would love to entertain any questions you have! |
Note that this script will fail if the gather output has matches from multiple databases and the lineage file has not been combined across those databases. It fails on line 88 from |
Hi,
I am looking forward to using lca summarize to see the different taxonomical compositions of my metagenome.
From my first test runs and your documentation [1] it looks like I'm only seeing the number of k-mers that are matching each taxonomic level. I was hoping to see what relative fraction of the community is associated with each taxonomic level, similar to what I get with the f_unique_weighted column in the gather commands, this column adds to 1 (or very close to it) for each gather execution.
My ultimate question: is there a way to go from lca summarize kmers counts to the fraction of the community?
Thanks! Cesar
[1] https://sourmash.readthedocs.io/en/latest/command-line.html
"The output is space-separated and consists of three columns: the percentage of total k-mers that have this classification; the number of k-mers that have this classification; and the lineage classification"
The text was updated successfully, but these errors were encountered: