-
Notifications
You must be signed in to change notification settings - Fork 34
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Top candidates from first Zoonotic run for assembler testing #89
Comments
Here are draft reference-based assemblies for the 136 datasets listed above: There are three directories:
The top row of the human-readable alignment looks like this:
The symbols indicate read depth:
The example above is typical of what I've seen in several assemblies. Notice how there is strong fluctuation in depth from at least 8 reads down to 1 or none. Hence my lobbying on the call today to provide a per-base quality metric for the assemblies because downstream analyses typically assume that a base call is good. |
The above assemblies were generated by a fully automated protocol as a proof of concept that this can be implemented in an HPC framework in reasonable time. The assembler is a very quick and dirty hack, but the contigs should be good enough to serve as a sanity check for de novo efforts. De novo assemblies will also be a check on my reference assemblies -- comparisons solicited! |
I've been using a plot of Divergence vs. Hits to focus on divergent hits of a certain hit threshold (when you see it you'll understand why). It required removing the 'noise' with respect to blacklisted hits. I was going through accessions/library which are <85% divergence and >100 read hits. Pre-filtering Scatter PlotPost-filtering Scatter PlotThere's still a bit more work to do here with clean-up, but if See SRA Entry |
This looks like a good hit to me. Cov2m + new summarizer would find it. If there are many like this, that supports your argument for including fragments in the reference. I don't understand your scatterplots. What I would like to see is a histogram of dataset counts binned by identity (100, 99, .... 80) to a reference with full-length only genomes (FLOM). If this histogram is flat or has a fat tail, then there is a strong case for including metagenomic fragments. If it has a thin tail, then fragments may be more trouble than they're worth. |
That's what the scatter plot is showing for cov2m. Remember how in the simulated divergence we had the drop map% as a function of divergence. Here we see a drop in hits (even in a very course measurement of total hits) following divergence. Therefore divergent + many hits (i.e. the outliers) are where we find novel viruses. |
Still not really following, but if you are seeing a fat tail with cov2m then that is a very promising result because it suggests we can find many novel highly diverged Cov's (many = more than I was expecting, anyway). I guess that's what you're trying to say & I'm having difficulty following the details. Assuming this result, I totally agree we should push sensitivity by using reference fragments, but then the difficulty of implementing an analysis pipeline and generating a high-quality data repository in 6 weeks looks daunting / borderline impossible because we need 1. automated detection with a fragmented reference followed by 2. HPC de novo assembly against hosts with varying distance from known genomes, which from my perspective looks like a non-trivial research problem which is not solvable in the time available. Maybe the next priority is to focus on validating & documenting your result & seeking more funding from fast-moving entrepreneurial / non-traditional grants to buy the additional time needed? |
@ababaian can you explain in more detail how you made the scatterplot/histogram and how you interpret it, preferably in prose rather than pointing to a notebook? This analysis looks tricky to me. The naive way is to parse pan_genome lines in s3://serratus-public/out/200505_zoonotic/summary/, but these look to be worthless (my responsibility on that point, of course). With luck, the next summarizer might enable this analysis, but I'm doubtful, I'm expecting I'll need to use the tinyhit files pending further refinement of the summarizer. If the new summary reports are better they'll still need some careful post-processing to measure coverage on Cov only and check for bad (black) regions which cause spurious coverage. Based on my Cov-2 multiple alignment, I expect hits at 80% id to have coverage only of one well-conserved gene, in which case I think we'll need to do some manual analysis to convince ourselves that low-coverage / low-identity hits are true positives and rule out difficult cases such as Cov's in the reference set which have incorporated mammal genes. IMO we need to understand a bit more about the phylogenetics of the virus before we can automate detection at high divergence, doubly so if we use a fragmented reference. |
So I do exactly that. I parse every single hit in the First pass was a bit of a wash because of the 'junk' hits, but by focusing on the quadrant <85% divergence and >100 read hits, I identify them quickly. I then added blacklist to filter out this junk and I was left with a few libraries that had divergent and plentyful matches. I haven't wrapped my head around how to automate a process like this, but it will help refine the genome very quickly and hopefully with a bit more work can give us some 'criteria' for defining putative CoV+ libraries to assembly. |
Excellent, this is exactly the kind of approach I was thinking of. I was focusing on higher identities first to make sure I could assemble easier cases. You're digging the tunnel from the other end; hopefully we will meet in the middle. |
See also: |
Agreed. Both ERR2756788 (score=86) and SRR3292629 (score=100) are in my top zoonotic candidates from which I picked some representatives as the assembler test cases. |
@ababaian said: "For instance with 100,000 SARS-CoV-2 reads, some will leak into MERS at a fixed rate." Probably due to bowtie2's randomized tie-breaking:
|
We knew this would happen right. This is why we went to a common pan-genome. Also found a sick kitty :'( Edit: Crossed my wires. |
Ferrets are kitties? My latest summarizer thinks this one is noise due to unmasked black regions. |
What are the next steps for this topic? Can we refine this into a set of deliverables? |
The purpose of the issue was to announce deliverables :-) We are de facto using Github as a sort of slack and we're making up protocols for announcements and discussions as we go along. Maybe announcements are better suited to slack, but FWIW I am a novice slack user and so far I strongly dislike it and don't understand how to use it productively. My learning curve is exploding exponentially right now and I'm trying to flatten the curve :-) Back on topic, I will close the issue because it seems to have served it's non-issue purpose. |
This is an inventory of top hits, I think we've gone through the zoonotic1 dataset now and we need to centralize this list of hits like Robert has done, add the few he may have missed and this is something we can hand off to the assembly team. I propose creating a wiki page holding this data since it is relevant long term. We don't have a central database to hold meta-data like hits, contigs etc... yet. |
I am closing this issue and punting to @ababaian to create a Wiki / open an issue for Wiki creation. |
K I'll add this to my TODO and use this issue then :P |
Hi @ababaian , I made a stub for this on the Serratus Assembly Wiki page: |
@rcedgar I hear you regarding the learning curve. Here's what has been useful break-down for me coordinating work on GitHub on other projects:
My $0.02! |
"Place where persistent documentation and information lives." Conflicts with doc/ in the repo, which in Artem's convention is notebooks? This stuff should go in CONTRIBUTING. It's not easy to grasp, but I can see the need for some organization here. I don't know where to put things or look for things so there is wasted and duplicated effort, I'm sure I'm not the only one. |
@ababaian What is the scope of the doc/ directory in the 'serratus' repo? On my other projects, only code-specific documentation goes in the source code repo in the doc or docs top-level directory. But higher-level project documentation would go in the Wiki. @rcedgar The Wiki is also a Git repo, FYI. |
Serratus Documentation
|
Is there anything addition regarding this issue that is needed or are we good to close this? |
Good to close. As owner, I'm closing. |
Here are the top candidates I found in a preliminary analysis:
zoonotic_candidates.tsv
Fields are:
The text was updated successfully, but these errors were encountered: