feature request for dereplicate
#81
Comments
I think allowing selection between longest or most abundant by exposing clust_size etc is a good idea for q2-vsearch (OTU clustering), but does not matter that much here, since the sequence selected as the centroid has little or no bearing on the consensus taxonomy performed at the other end.
I am not sure that vsearch would have appropriate options for this, e.g., to take a feature table as input for defining centroids. This sounds more like a closed-reference OTU clustering approach, so it is fairly off topic here. Exposing randseed may be a better option to avoid pseudo-random behavior in centroid selection. Would that suit your need?
Thanks @nbokulich , Here's what I was wondering... For a particular cluster, I have the following four bits of information:
Say one cluster contains three ASVs, with these bits of information:
When I run I'm pretty sure that's what happens. My concern is more along the lines of having two programs that appear to do the same thing, but produce two potentially different outputs. Under the hood, both qiime vsearch cluster-features-de-novo and qiime rescript dereplicate are calling upon I find the rescript dereplication functionality super valuable, and I want to use it primarily to perform the clustering of my data. However, I was a bit concerned when I realized that I'd get a different number of clusters when I used that program versus the original vsearch method in QIIME, and it feels like the better way of defining clusters is with abundances instead of sequence length. Again though, I'm not sure how much of a difference that really makes. Appreciate your insights - thanks very much.
Thanks @devonorourke
Correct
They do not appear to do the same thing, though. They both use VSEARCH under the hood, but for two different things, in two different ways, and the help documentation (for both the actions as well as the plugins overall) should all make that fairly clear. Very happy to consider a pull request to clarify this further in the documentation.
that is not what I think your idea to use this for clustering observational data that have been taxonomically classified and have abundance info is really quite interesting, and could expand the functionality of RESCRIPt in new ways (possibly as a new action?). This needs more experimentation and testing before adding this to RESCRIPt, though, as it really bends the original purpose of Thanks!
Thanks @nbokulich ,
Perhaps my interest in adding the functionality of applying LCA + 'super' (or other) processes available in RESCRIPt would be better suited for q2-vsearch instead. Given that the abundance information is already available within My overall plan was to: I can do all of this with current QIIME/RESCRIPt tools, but the one hitch is that I'm having to use a clustering vsearch process that sorts the fasta based on length, rather than abundance. Maybe that's sufficient. Thanks!
No, I don't think so. What you are doing is quite unusual, and certainly out of scope for q2-vsearch (since you are not really trying to perform OTU clustering per se, are you? more collapsing features by taxonomy and sequence similarity). To put it another way, it would be easier to add this functionality to RESCRIPt than to q2-vsearch. I am still not really sure what you are trying to actually accomplish though, after reading through those forum topics. You are combining OTU clustering and taxonomy dereplication in a strange way. If you are trying to perform ensemble classification (e.g., to find taxonomy consensus across multiple classifiers), your workflow should just consist of: On the other hand, if you want to cluster the data with q2-vsearch just do so prior to taxonomy classification. The point of clustering is to reduce complexity, e.g., to limit computational resources during steps like taxonomy classification. Why cluster afterwards? At that point the benefit of reducing complexity is behind you. I am happy to leave this issue open if you want to test out some new functionality, and present a use case for it. Thanks @devonorourke !
Hi all,
There was a recent question posted to the QIIME forum that got me thinking about why the results might differ between running stand-alone QIIME clustering and vsearch-standalone clustering.
I'm curious if it would be a significant challenge to add an additional component to qiime rescript dereplicate that would allow the user to supply a feature-table file. This information is only relevant when users want to invoke the --p-perc-identity argument. At the moment, this triggers vsearch to use --clust_size, but because the fasta header is formatted with only feature information (and not abundance information), the centroid selection defaults to the longest sequence in the cluster, rather than the most abundant sequence. If the feature-table information could be imported with qiime rescript dereplicate, the relevant abundance information might then be available for each sequence feature (summed across all samples with that sequence feature), and inserted into the clustering command.
I'm not sure whether or not you want the default to be the length of the sequence or the abundance of the sequence feature - my guess is the sequence abundances are probably more relevant when defining a cluster than the length, but perhaps this varies a lot across marker genes, and the sequence fragments generated therein.
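A rough sketch of what the proposal could look like in practice: sum each feature's counts across samples, then write the totals into the fasta headers. The ";size=N" header annotation is, as far as I know, the convention vsearch reads when size input is enabled. The function names, the dict-based table, and the example data are hypothetical stand-ins, not RESCRIPt or QIIME 2 API, and the sketch assumes every sequence has an entry in the table.

```python
from collections import Counter


def total_abundances(table):
    """Sum each feature's counts across all samples.

    `table` is a hypothetical stand-in for a feature table:
    {sample_id: {feature_id: count}}.
    """
    totals = Counter()
    for counts in table.values():
        totals.update(counts)  # Counter.update adds counts per key
    return totals


def to_sized_fasta(sequences, totals):
    """Return FASTA text with ';size=N' appended to each header,
    ordered from most to least abundant (ties broken by feature id)."""
    ordered = sorted(sequences, key=lambda fid: (-totals[fid], fid))
    lines = []
    for fid in ordered:
        lines.append(f">{fid};size={totals[fid]}")
        lines.append(sequences[fid])
    return "\n".join(lines) + "\n"


# Hypothetical example data
table = {"s1": {"a": 3, "b": 1}, "s2": {"a": 2, "c": 7}}
sequences = {"a": "ACGT", "b": "ACGTA", "c": "AC"}

totals = total_abundances(table)
fasta = to_sized_fasta(sequences, totals)
print(fasta)
```

With headers annotated this way, an abundance-sorted clustering step would have the information it needs, whereas headers carrying only feature ids leave nothing to sort on but the sequences themselves.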
Thanks for your consideration!