
feature request: use false positive rate instead of error rate? #17

Open
eboyden opened this issue Apr 7, 2021 · 2 comments

Comments

eboyden commented Apr 7, 2021

Hi, I'm a big fan of this software, but I was wondering if it might make sense to provide the option to threshold based on a false positive rate instead of an error rate (similar to what SeqPurge does using a binomial distribution calculation), since longer overlaps should be more tolerant of higher error rates. We've found that we obtain the best performance when piping multiple instances of NGmerge to grossly simulate this effect; e.g., to simulate a 1E-6 FP threshold, we allow 8% errors for overlaps of 10-14 bp, 17% errors for overlaps of 15-19 bp, and 23% errors for overlaps of 20+ bp. But obviously this is still overly stringent for longer overlaps, not to mention time consuming.
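The SeqPurge-style calculation described above can be sketched directly: for a candidate overlap of length n between two unrelated reads, the number of chance matches follows a binomial with a per-base random-match probability of 1/4 (assuming uniform base composition), and the allowed mismatch count is the largest k whose cumulative probability stays under the FP threshold. A minimal illustration of the idea, not NGmerge's or SeqPurge's actual implementation:

```python
from math import comb

def max_mismatches(overlap_len, alpha=1e-6, p_match=0.25):
    """Largest mismatch count k such that a random (non-homologous) overlap
    of this length would match with <= k mismatches with probability <= alpha.
    Returns -1 if even a perfect chance match is more likely than alpha."""
    best = -1
    for k in range(overlap_len + 1):
        # P(random overlap shows <= k mismatches): binomial lower tail
        p = sum(comb(overlap_len, i)
                * (1 - p_match) ** i
                * p_match ** (overlap_len - i)
                for i in range(k + 1))
        if p <= alpha:
            best = k
        else:
            break
    return best

for n in (10, 15, 20, 30, 50):
    k = max_mismatches(n)
    print(n, k)
```

At a 1E-6 threshold this yields 0 allowed mismatches at 10 bp, 2 at 15 bp, and 4 at 20 bp, consistent with the 8%/17%/23% staircase in the comment above, with the tolerated error fraction continuing to rise for longer overlaps.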

jsh58 (Owner) commented Apr 17, 2021

Thanks for the question. This is an interesting topic that requires two separate answers, for the two modes of NGmerge:

  • In stitch mode, I have found that relaxing the allowed errors (increasing -p) causes increased false positives -- that is, placing reads in an incorrect overlapping alignment. This occurs, for example, with reads derived from genomes with numerous pseudo-repetitive regions. In such cases, longer overlaps should not necessarily be more tolerant of errors, and what you suggest would worsen the situation.
  • In adapter-removal mode, there is an additional check that can be made: the putative adapter sequences can be examined via the -c <file> option. As stated in its description:

    If the sequences that appear in the 'Adapter' columns are not consistent, they may be false positives, and one should consider decreasing -p or increasing -e.
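That consistency check could be automated with a simple tally: count the sequences reported in the 'Adapter' columns and flag a run where no single sequence dominates. This is only a hedged sketch; the threshold and the parsing of the -c log into a list of sequences are assumptions, not part of NGmerge:

```python
from collections import Counter

def adapter_consistency(adapter_seqs):
    """Fraction of reported 'adapter' sequences that equal the single most
    common one; a low value suggests false-positive trimming."""
    counts = Counter(s for s in adapter_seqs if s)  # skip empty entries
    if not counts:
        return 0.0
    _, top_n = counts.most_common(1)[0]
    return top_n / sum(counts.values())

# Consistent shotgun-library adapters (9 of 10 identical):
print(adapter_consistency(["AGATCGGAAG"] * 9 + ["TTTACGGAAG"]))  # 0.9
```

If the returned fraction is low, the advice quoted above applies: consider decreasing -p or increasing -e.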

eboyden (Author) commented Jun 18, 2021

  • To your first point, the risk really depends on the dataset and how one is using NGmerge. For example, not only do we use it to trim dovetails of otherwise good read pairs, we also sometimes use it in stitch mode with impossibly high -m but low -e to stitch and remove dovetailed reads, allowing only unstitched (undovetailed) reads to pass forward. In this case, we're willing to tolerate a slightly higher FP stitching rate if it means cleaner data. But being able to tune the FP rate directly (with an error rate that automatically adjusts as a function of overlap length) would be preferable to only being able to tune the error rate and minimum overlap.
  • To your second point, this only works when the "adapters" are consistent, e.g. sequencing adapters for a shotgun library. For some types of amplicon sequencing, when the 5' primer sequences have already been removed from the reads, the 3' dovetails will be the reverse complements of those primer sequences, and therefore they will be inconsistent by design.
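The second point can be illustrated with a tiny sketch using hypothetical primer sequences: once the 5' primers have been trimmed, the dovetail at the 3' end of R1 is the reverse complement of each amplicon's R2 primer, so the 'Adapter' columns differ across amplicons even when every trim is a true positive:

```python
def revcomp(seq):
    """Reverse complement of a DNA sequence (uppercase ACGT only)."""
    comp = {"A": "T", "C": "G", "G": "C", "T": "A"}
    return "".join(comp[b] for b in reversed(seq))

# Two hypothetical amplicons with different (already-removed) R2 primers:
primers = ["GATTACA", "CCATGGT"]
# The 'adapter' seen at the 3' end of R1 is the reverse complement of each
# R2 primer, hence inconsistent by design across amplicons.
print([revcomp(p) for p in primers])  # ['TGTAATC', 'ACCATGG']
```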

In any case, thanks for the response and the software. I understand that implementing feature requests is time consuming and not always a high priority - just letting you know there's interest if you (or anyone) were inclined.
