feature request for dereplicate #81

Open
devonorourke opened this issue Oct 5, 2020 · 5 comments

@devonorourke
Contributor

Hi all,

There was a recent question posted to the QIIME forum that got me thinking about why the results might differ between running stand-alone QIIME clustering and vsearch-standalone clustering.

I'm curious if it would be a significant challenge to add an additional component to qiime rescript dereplicate that would allow for the user to supply a feature-table file. This information is only relevant when users want to invoke the --p-perc-identity argument. At the moment, this triggers vsearch to use --cluster_size, but because the fasta header is formatted with only feature information (and not abundance information), the centroid selection defaults to the longest sequence in the cluster rather than the most abundant sequence. If the feature-table information could be imported with qiime rescript dereplicate, the relevant abundance information might then be available for each sequence feature (summed across all samples with that sequence feature), and inserted into the clustering command.
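To make the idea concrete, here is a minimal sketch of the preprocessing step I have in mind. It assumes the feature table has already been read into a plain dict of per-sample counts; the helper names `total_abundance` and `annotate_fasta` are hypothetical and are not part of RESCRIPt or q2-vsearch. The only vsearch convention relied on is the `;size=N` header annotation, which vsearch reads when run with `--sizein`.

```python
from collections import defaultdict


def total_abundance(feature_table):
    """Sum each feature's counts across all samples.

    feature_table: dict mapping sample id -> {feature id: count}
    (an assumed in-memory layout, not a QIIME 2 artifact).
    """
    totals = defaultdict(int)
    for counts in feature_table.values():
        for feature_id, count in counts.items():
            totals[feature_id] += count
    return dict(totals)


def annotate_fasta(records, totals):
    """Build FASTA text whose headers carry a ';size=N' annotation.

    records: iterable of (feature id, sequence) pairs.
    Features missing from the table fall back to size=1 (an assumption).
    """
    lines = []
    for feature_id, seq in records:
        size = totals.get(feature_id, 1)
        lines.append(f">{feature_id};size={size}")
        lines.append(seq)
    return "\n".join(lines) + "\n"


# Toy data only: counts and sequences are made up for illustration.
table = {
    "sampleA": {"asv_1": 6000, "asv_2": 10},
    "sampleB": {"asv_1": 4000, "asv_2": 20, "asv_3": 10001},
}
records = [("asv_1", "ACGTACGT"), ("asv_2", "ACGTACGTAC"), ("asv_3", "ACGTACGTA")]

fasta = annotate_fasta(records, total_abundance(table))
print(fasta)
```

The resulting file could then be passed to vsearch with `--sizein` so that `--cluster_size` has real abundances to sort on. Whether dereplicate should do this internally is the open question.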

I'm not sure whether or not you want the default to be the length of the sequence or the abundance of the sequence feature - my guess is the sequence abundances are probably more relevant when defining a cluster than the length, but perhaps this varies a lot across marker genes, and the sequence fragments generated therein.

Thanks for your consideration!

@nbokulich
Collaborator

I think allowing selection between longest or most abundant by exposing cluster_size etc. is a good idea for q2-vsearch (OTU clustering), but does not matter that much here, since the sequence selected as the centroid has little or no bearing on the consensus taxonomy performed at the other end.

> allow for the user to supply a feature-table file

I am not sure that vsearch would have appropriate options for this, e.g., to take a feature table as input for defining centroids. This sounds more like a closed-reference OTU clustering approach, so is fairly off topic here.

Exposing randseed may be a better option to avoid pseudo-random behavior in centroid selection. Would that suit your need?

@devonorourke
Contributor Author

Thanks @nbokulich ,

Here's what I was wondering...

For a particular cluster, I have the following four bits of information:

  • the ASV feature id in the cluster,
  • the length of the ASV sequence,
  • the abundances of that ASV across all my samples,
  • the taxonomic information of that ASV.

Say one cluster contains three ASVs, with these bits of information:

| ASVid | ASVlength (nt) | ASVabund (nSeqs) | ASVtaxastring |
|---|---|---|---|
| asv_1 | 180 | 10,000 | o:Hymenoptera; f:Apidae; g:Bombini; s:lapidarius |
| asv_2 | 190 | 30 | o:Hymenoptera; f:Apidae; g:Bombini; s: |
| asv_3 | 181 | 10,001 | o:Hymenoptera; f:; g:; s: |

When I run qiime rescript dereplicate at the moment, I'm pretty sure the centroid that is retained for the cluster will be asv_2 because it is the longest sequence. I think I was confused initially about the consequence of this though - it sounds like it doesn't matter which of these three ASVs serves as the centroid, because all of the taxonomic information is subsequently considered when determining the final taxonomic information applied to the cluster. Am I correct in thinking that in the step where the .uc file is parsed, we'd be grouping across all three potential taxonomy strings, and retaining the most complete one in my example, assuming I was using the 'super' mode? In other words, the output file would have just a single featureID for the cluster - the longest one, asv_2 - with a taxonomic string from asv_1?

I'm pretty sure that's what happens. My concern is more along the lines of having two programs that appear to do the same thing, but produce two potentially different outputs. Under the hood, both qiime vsearch cluster-features-de-novo and qiime rescript dereplicate are calling upon vsearch --cluster_size to do the clustering, but the two programs will produce different centroids because one option will use the sequence length (dereplicate) while the other uses the sequence abundance (cluster-features-de-novo) to define the centroid.
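A small sketch of the difference, using the three ASVs from the table above. This is a deliberate simplification: it assumes all three sequences end up in the same cluster and only asks which one would be listed first under each sort order. It does not call vsearch, and the `Asv` type is a hypothetical container for illustration.

```python
from typing import NamedTuple


class Asv(NamedTuple):
    asv_id: str
    length: int      # sequence length in nt
    abundance: int   # reads summed across all samples


# Values copied from the example cluster in this comment.
cluster = [
    Asv("asv_1", 180, 10_000),
    Asv("asv_2", 190, 30),
    Asv("asv_3", 181, 10_001),
]

# Without abundance annotations, the longest sequence comes first.
by_length = max(cluster, key=lambda a: a.length)

# With abundance annotations, the most abundant sequence comes first.
by_abundance = max(cluster, key=lambda a: a.abundance)

print(by_length.asv_id)     # asv_2
print(by_abundance.asv_id)  # asv_3
```

Because the clustering is greedy, the input order can also change which sequences get grouped together, not just which member is the centroid. I suspect that is why the number of clusters differs between the two commands.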

I find the rescript dereplication functionality super valuable, and I want to use it primarily to perform the clustering of my data. However, I was a bit concerned when I realized that I'd get a different number of clusters when I used that program versus the original vsearch method in QIIME, and it feels like the better way of defining clusters is with abundances instead of sequence length. Again though, I'm not sure how much of a difference that really makes.

Appreciate your insights - thanks very much.

@nbokulich
Collaborator

Thanks @devonorourke

> we'd be grouping across all three potential taxonomy strings, and retaining the most complete one in my example, assuming I was using the 'super' mode?

Correct

> My concern is more along the lines of having two programs that appear to do the same thing, but produce two potentially different outputs.

They do not appear to do the same thing, though. They both use VSEARCH under the hood, but for two different things, in two different ways, and the help documentation (for both the actions as well as the plugins overall) should all make that fairly clear. Very happy to consider a pull request to clarify this further in the documentation.

> I find the rescript dereplication functionality super valuable, and I want to use it primarily to perform the clustering of my data.

That is not what dereplicate is designed to do right now, though. dereplicate was designed for dereplicating/clustering reference data and is not designed to handle abundance information.

I think your idea to use this for clustering observational data that have been taxonomically classified and have abundance info is really quite interesting, and could expand the functionality of RESCRIPt in new ways (possibly as a new action?). This needs more experimentation and testing before adding this to RESCRIPt, though, as it really bends the original purpose of dereplicate. I recommend making a separate branch to test this out... and will leave this issue open (meaning, I think it's worth a shot!)

Thanks!

@devonorourke
Contributor Author

Thanks @nbokulich ,
My intention for this comment...

> They do not appear to do the same thing

... was to point out that the two commands are both applying vsearch --cluster_size. Agreed, their intended use cases are different.

Perhaps my interest in adding the functionality of applying LCA + 'super' (or other) processes available in RESCRIPt would be better suited for q2-vsearch instead. Given that the abundance information is already available within qiime vsearch cluster-features-de-novo, maybe I should have started a feature request there instead of RESCRIPt?

My overall plan was to:
1a. Classify a set of dereplicated sequences (ASVs) using VSEARCH
1b. Classify the same set of ASVs using sklearn/naive Bayes
2a. Cluster the ASVs at a fixed %identity for VSEARCH results
2b. Cluster the ASVs at the same %identity for sklearn/naive Bayes
3. Merge the taxa to find the best possible string

I can do all of this with current QIIME/RESCRIPt tools, but the one hitch is that I'm having to use a vsearch clustering process that sorts the fasta based on length, rather than abundance. Maybe that's sufficient.

Thanks!

@nbokulich
Collaborator

> Perhaps my interest in adding the functionality of applying LCA + 'super' (or other) processes available in RESCRIPt would be better suited for q2-vsearch instead

No, I don't think so. What you are doing is quite unusual, and certainly out of scope for q2-vsearch (since you are not really trying to perform OTU clustering per se, are you? more collapsing features by taxonomy and sequence similarity). To put it another way, it would be easier to add this functionality to RESCRIPt than to q2-vsearch.

I am still not really sure what you are trying to actually accomplish though, after reading through those forum topics. You are combining OTU clustering and taxonomy dereplication in a strange way. If you are trying to perform ensemble classification (e.g., to find taxonomy consensus across multiple classifiers), your workflow should just consist of:
1a. Classify a set of dereplicated sequences (ASVs) using VSEARCH
1b. Classify the same set of ASVs using sklearn/naive Bayes
2. Merge the taxa to find the best possible string
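As a rough illustration of what the merge step could look like for a single ASV, here is a minimal sketch. It is not the code RESCRIPt uses; `split_ranks` and `merge_pair` are hypothetical helpers, and the rule applied (keep the longer string when one assignment extends the other, otherwise fall back to the shared prefix) is only one possible reading of "best possible string". Empty ranks such as `s:` are assumed to mean "unassigned".

```python
def split_ranks(taxon):
    """Split a ';'-delimited taxonomy string, stopping at the first empty rank."""
    ranks = []
    for rank in taxon.split(";"):
        rank = rank.strip()
        if not rank or rank.endswith(":"):
            break  # assumption: a bare prefix like 's:' means unassigned
        ranks.append(rank)
    return ranks


def merge_pair(taxon_a, taxon_b):
    """Merge two classifier assignments for the same ASV."""
    ranks_a, ranks_b = split_ranks(taxon_a), split_ranks(taxon_b)
    shared = []
    for a, b in zip(ranks_a, ranks_b):
        if a != b:
            break
        shared.append(a)
    if len(shared) == min(len(ranks_a), len(ranks_b)):
        # One assignment is a prefix of the other: keep the deeper one.
        longer = ranks_a if len(ranks_a) >= len(ranks_b) else ranks_b
        return "; ".join(longer)
    # The assignments conflict: keep only the ranks they agree on.
    return "; ".join(shared)


vsearch_call = "o:Hymenoptera; f:Apidae; g:Bombini; s:lapidarius"
sklearn_call = "o:Hymenoptera; f:Apidae; g:Bombini; s:"
print(merge_pair(vsearch_call, sklearn_call))
# o:Hymenoptera; f:Apidae; g:Bombini; s:lapidarius

print(merge_pair("o:Hymenoptera; f:Apidae", "o:Hymenoptera; f:Vespidae"))
# o:Hymenoptera
```

Nothing in this step requires clustering the ASVs first, which is the point of the shorter workflow above.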

On the other hand, if you want to cluster the data with q2-vsearch just do so prior to taxonomy classification. The point of clustering is to reduce complexity, e.g., to limit computational resources during steps like taxonomy classification. Why cluster afterwards? At that point the benefit of reducing complexity is behind you.

I am happy to leave this issue open if you want to test out some new functionality, and present a use case for it.

Thanks @devonorourke !
