Add uchime2_denovo to close #92 #100

Open
wants to merge 11 commits into
base: dev
Changes from 9 commits
12 changes: 8 additions & 4 deletions q2_vsearch/_chimera.py
@@ -15,7 +15,8 @@
 from ._format import UchimeStatsFmt
 
 
-_uchime_defaults = {'dn': 1.4,
+_uchime_defaults = {'method': 'uchime',
+                    'dn': 1.4,
                     'mindiffs': 3,
                     'mindiv': 0.8,
                     'minh': 0.28,
@@ -68,26 +69,29 @@ def _uchime_ref(sequences, table, reference_sequences, dn, mindiffs,
 
 def uchime_denovo(sequences: DNAFASTAFormat,
                   table: biom.Table,
+                  method: str = _uchime_defaults['method'],
                   dn: float = _uchime_defaults['dn'],
                   mindiffs: int = _uchime_defaults['mindiffs'],
                   mindiv: float = _uchime_defaults['mindiv'],
                   minh: float = _uchime_defaults['minh'],
                   xn: float = _uchime_defaults['xn']) \
         -> (DNAFASTAFormat, DNAFASTAFormat, UchimeStatsFmt):
     cmd, chimeras, nonchimeras, uchime_stats = \
-        _uchime_denovo(sequences, table, dn, mindiffs, mindiv, minh, xn)
+        _uchime_denovo(sequences, table, method,
+                       dn, mindiffs, mindiv, minh, xn)
     return chimeras, nonchimeras, uchime_stats
 
 
-def _uchime_denovo(sequences, table, dn, mindiffs, mindiv, minh, xn):
+def _uchime_denovo(sequences, table, method,
+                   dn, mindiffs, mindiv, minh, xn):
     # this function only exists to simplify testing
     chimeras = DNAFASTAFormat()
     nonchimeras = DNAFASTAFormat()
     uchime_stats = UchimeStatsFmt()
     with tempfile.NamedTemporaryFile() as fasta_with_sizes:
         _fasta_with_sizes(str(sequences), fasta_with_sizes.name, table)
         cmd = ['vsearch',
-               '--uchime_denovo', fasta_with_sizes.name,
+               '--' + method + '_denovo', fasta_with_sizes.name,
colinbrislawn marked this conversation as resolved.
                '--uchimeout', str(uchime_stats),
                '--nonchimeras', str(nonchimeras),
                '--chimeras', str(chimeras),
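
A minimal sketch of how the vsearch option is derived from `method` in the change above. The helper name `build_denovo_flag` and the explicit validation are assumptions for illustration only; in this PR the allowed values are constrained by the `Choices` declaration in plugin_setup.py, not inside `_uchime_denovo`.

```python
# Hypothetical helper, not part of q2-vsearch.
# The allowed values mirror the Choices list declared in plugin_setup.py.
ALLOWED_METHODS = ('uchime', 'uchime2', 'uchime3')


def build_denovo_flag(method: str) -> str:
    """Return the vsearch command-line option for the chosen method."""
    if method not in ALLOWED_METHODS:
        raise ValueError(
            'Unknown method %r; expected one of %r' % (method, ALLOWED_METHODS))
    # Same string concatenation as in the diff: '--' + method + '_denovo'
    return '--' + method + '_denovo'


for name in ALLOWED_METHODS:
    print(name, '->', build_denovo_flag(name))
```

Because the flag is assembled by concatenation, any string that slips past the `Choices` check would be passed to vsearch verbatim, which is why the sketch validates before building the option.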
29 changes: 29 additions & 0 deletions q2_vsearch/citations.bib
@@ -19,3 +19,32 @@ @article{rideout2014subsampled
   publisher={PeerJ Inc.},
   doi={10.7717/peerj.545}
 }
+
+@article{edgar2011uchime,
+  title={UCHIME improves sensitivity and speed of chimera detection},
+  author={Edgar, Robert C and Haas, Brian J and Clemente, Jose C and Quince, Christopher and Knight, Rob},
+  journal={Bioinformatics},
+  volume={27},
+  number={16},
+  pages={2194--2200},
+  year={2011},
+  publisher={Oxford University Press}
+}
+
+@article{edgar2016uchime2,
+  title={UCHIME2: improved chimera prediction for amplicon sequencing},
+  author={Edgar, Robert C},
+  journal={BioRxiv},
+  pages={074252},
+  year={2016},
+  publisher={Cold Spring Harbor Laboratory}
+}
+
+@article{edgar2016unoise2,
+  title={UNOISE2: improved error-correction for Illumina 16S and ITS amplicon sequencing},
+  author={Edgar, Robert C},
+  journal={BioRxiv},
+  pages={081257},
+  year={2016},
+  publisher={Cold Spring Harbor Laboratory}
+}
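
The three entries above are looked up by their BibTeX keys from plugin_setup.py. The sketch below uses a plain `dict` as a stand-in and assumes the plugin's `citations` object behaves like a mapping keyed by entry ID. It shows per-key lookup, and that several comma-separated keys inside one pair of brackets form a single tuple key in Python.

```python
# Stand-in for the plugin's citations object (assumption: mapping keyed by
# BibTeX entry ID). The values here are just the titles from citations.bib.
citations = {
    'edgar2011uchime':
        'UCHIME improves sensitivity and speed of chimera detection',
    'edgar2016uchime2':
        'UCHIME2: improved chimera prediction for amplicon sequencing',
    'edgar2016unoise2':
        'UNOISE2: improved error-correction for Illumina 16S and ITS '
        'amplicon sequencing',
}

keys = ['edgar2011uchime', 'edgar2016uchime2', 'edgar2016unoise2']

# One lookup per key yields one citation per key.
selected = [citations[key] for key in keys]

# citations['a', 'b'] is the same as citations[('a', 'b')]: a single lookup
# with a tuple key, which a mapping keyed by strings does not contain.
try:
    citations['edgar2011uchime', 'edgar2016uchime2']
    tuple_lookup_ok = True
except KeyError:
    tuple_lookup_ok = False

print(len(selected), tuple_lookup_ok)
```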
28 changes: 18 additions & 10 deletions q2_vsearch/plugin_setup.py
@@ -371,7 +371,7 @@
         'nonchimeras': 'The non-chimeric sequences.',
         'stats': 'Summary statistics from chimera checking.'
     },
-    name='Reference-based chimera filtering with vsearch.',
+    name='Reference-based chimera filtering.',
     description=('Apply the vsearch uchime_ref method to identify chimeric '
                  'feature sequences. The results of this method can be used '
                  'to filter chimeric features from the corresponding feature '
@@ -385,6 +385,8 @@
         'sequences': FeatureData[Sequence],
         'table': FeatureTable[Frequency]},
     parameters={
+        'method': qiime2.plugin.Str % qiime2.plugin.Choices(
+            ['uchime', 'uchime2', 'uchime3']),
         'dn': qiime2.plugin.Float % qiime2.plugin.Range(0., None),
         'mindiffs': qiime2.plugin.Int % qiime2.plugin.Range(1, None),
         'mindiv': qiime2.plugin.Float % qiime2.plugin.Range(0., None),
@@ -404,12 +406,17 @@
                   'abundances).'),
     },
     parameter_descriptions={
+        'method': ('Denovo chimera detection based on uchime (Edgar 2011), '
Contributor Author

@colinvwood While trying to keep this short and sweet, I've added a little more detail.

How does this look?

I think we should cite Rob's papers too. Let me work on adding those two papers to the .bib file...

Contributor

This is good 👍🏻

Contributor Author

Did you want me to add these citations to the .bib and link them up, or are we good to go?

Contributor

If you're referring to each of the citations for each of the algorithm iterations, I think that yes we should do that. If you're referring to "Rob's papers" I'm unsure which ones those are exactly.

+                   'uchime2 (Edgar 2016), or uchime3 (Edgar 2016).'),
         'dn': ('No vote pseudo-count, corresponding to the parameter n in '
                'the chimera scoring function.'),
-        'mindiffs': 'Minimum number of differences per segment.',
-        'mindiv': 'Minimum divergence from closest parent.',
+        'mindiffs': ('Minimum number of differences per segment. '
+                     'Ignored for uchime2 and uchime3.'),
+        'mindiv': ('Minimum divergence from closest parent. '
+                   'Ignored for uchime2 and uchime3.'),
         'minh': ('Minimum score (h). Increasing this value tends to reduce '
-                 'the number of false positives and to decrease sensitivity.'),
+                 'the number of false positives and to decrease sensitivity. '
+                 'Ignored for uchime2 and uchime3.'),
         'xn': ('No vote weight, corresponding to the parameter beta in the '
                'scoring function.'),
     },
@@ -418,12 +425,13 @@
         'nonchimeras': 'The non-chimeric sequences.',
         'stats': 'Summary statistics from chimera checking.'
     },
-    name='De novo chimera filtering with vsearch.',
-    description=('Apply the vsearch uchime_denovo method to identify chimeric '
-                 'feature sequences. The results of this method can be used '
-                 'to filter chimeric features from the corresponding feature '
-                 'table. For more details, please refer to the vsearch '
-                 'documentation.')
+    name='De novo chimera filtering.',
+    description=('Apply one of the vsearch uchime*_denovo methods to '
+                 'identify chimeric feature sequences. '
+                 'The results of these methods can be used to filter chimeric '
+                 'features from the corresponding feature table. '
+                 'For more details, please refer to the vsearch manual.'),
+    citations=[citations['edgar2011uchime'], citations['edgar2016uchime2'], citations['edgar2016unoise2']]
colinbrislawn marked this conversation as resolved.
 )
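
The parameter descriptions above state that `mindiffs`, `mindiv` and `minh` are ignored for uchime2 and uchime3. One possible follow-up, which is not part of this PR, would be to warn users who set one of these to a non-default value with a method that ignores it. The sketch below is an assumption about how that could look; the function names are hypothetical, and the defaults are the ones shown in `_uchime_defaults`.

```python
import warnings

# Defaults for the uchime-only parameters, as shown in _uchime_defaults.
UCHIME_ONLY_DEFAULTS = {'mindiffs': 3, 'mindiv': 0.8, 'minh': 0.28}


def ignored_parameters(method, **params):
    """List uchime-only parameters that were changed but will be ignored."""
    if method == 'uchime':
        return []
    return [name for name, default in UCHIME_ONLY_DEFAULTS.items()
            if name in params and params[name] != default]


def warn_if_ignored(method, **params):
    """Emit a UserWarning naming any ignored parameters, and return them."""
    ignored = ignored_parameters(method, **params)
    if ignored:
        warnings.warn('%s ignores: %s' % (method, ', '.join(ignored)))
    return ignored
```

For example, `warn_if_ignored('uchime3', mindiffs=4, minh=0.42)` would warn about both parameters, while the same call with `method='uchime'` stays silent.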


3 changes: 3 additions & 0 deletions q2_vsearch/tests/test_chimera.py
Contributor

What are your thoughts on adding tests for these new algorithm versions (beyond testing the command string)?

Contributor Author

I think there's a trivial test that shows both working.

I don't have an example in which these methods differ... Would you like me to try to find one?

Contributor

It's true that there is a test in which we pass "uchime3" as the method; however, without a test that shows differing behavior, we technically can't be sure that the underlying software is actually running that method.

I understand if it's too difficult to contrive input data that shows different expected behavior for the different algorithm methods, but if it is reasonably easy to do so it would be best.
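
As a starting point for the discussion above, here is a sketch of a command-string test that covers all three methods in one loop. It only checks the assembled command, so it does not address the reviewer's point about differentiating behavior in vsearch itself. `build_command` is a hypothetical stand-in for the command list built in `_uchime_denovo`, used so the example runs without vsearch installed.

```python
import unittest

METHODS = ('uchime', 'uchime2', 'uchime3')


def build_command(method, fasta_path):
    """Hypothetical stand-in for the command list built in _uchime_denovo."""
    return ['vsearch', '--' + method + '_denovo', fasta_path]


class MethodFlagTests(unittest.TestCase):
    def test_each_method_sets_only_its_own_flag(self):
        for method in METHODS:
            # subTest reports each method separately if one of them fails
            with self.subTest(method=method):
                cmd = build_command(method, 'seqs.fasta')
                denovo_flags = [arg for arg in cmd
                                if arg.endswith('_denovo')]
                self.assertEqual(denovo_flags, ['--' + method + '_denovo'])
```

Comparing list elements rather than searching the joined string keeps the check exact: each command must contain one `*_denovo` option, and it must be the one for the requested method. The class can be run with `python -m unittest`.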

@@ -47,6 +47,7 @@ def test_uchime_denovo(self):
 
         obs_chime = _read_seqs(chime)
         exp_chime = [self.input_sequences_list[3]]
+        # >feature4 is the chimera!
         self.assertEqual(obs_chime, exp_chime)
 
         # sequences are reverse-sorted by abundance in output
@@ -105,8 +106,10 @@ def test_uchime_denovo_no_chimeras_alt_params(self):
         with redirected_stdio(stderr=os.devnull):
             cmd, chime, nonchime, stats = _uchime_denovo(
                 sequences=self.input_sequences, table=self.input_table,
+                method='uchime3',
                 dn=42.42, mindiffs=4, mindiv=0.5, minh=0.42, xn=9.0)
         cmd = ' '.join(cmd)
+        self.assertTrue('--uchime3_denovo' in cmd)
         self.assertTrue('--dn 42.42' in cmd)
         self.assertTrue('--mindiffs 4' in cmd)
         self.assertTrue('--mindiv 0.5' in cmd)