{ts} whitelist filter pos errors #309

Merged
merged 6 commits into from
Feb 6, 2019


TomSmithCGAT
Member

  • Adds options to whitelist to detect putative errors in the CBs above the knee threshold
  • Adds tests to cover new options

See (#138) for motivation

Member

@IanSudbery IanSudbery left a comment

Why use the Levenshtein distance here rather than the Hamming distance? Is it just that the number of comparisons is low enough to make it feasible? Should we be worried that indels in the CB will likely mess up the UMI?
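
To make the distinction in this question concrete, here is a minimal sketch in plain Python. It is not the code under review: the function names and the example barcodes are invented for illustration. It shows that a single deletion in a fixed-length barcode window looks like many mismatches under the Hamming distance but only two edits under the Levenshtein distance, and that the base pulled into the window would otherwise have belonged to whatever follows the CB (assumed here to be the UMI).

```python
def hamming(a: str, b: str) -> int:
    """Count mismatching positions; only defined for equal-length strings."""
    if len(a) != len(b):
        raise ValueError("Hamming distance needs equal-length strings")
    return sum(x != y for x, y in zip(a, b))


def levenshtein(a: str, b: str) -> int:
    """Minimum number of substitutions, insertions and deletions turning a into b."""
    previous = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        current = [i]
        for j, y in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,             # delete x from a
                current[j - 1] + 1,          # insert y into a
                previous[j - 1] + (x != y),  # substitute, or match at no cost
            ))
        previous = current
    return previous[-1]


# Hypothetical 8-base barcodes, chosen only for illustration.
true_cb = "ACGTACGT"
substituted = "ACGAACGT"  # one base substituted
shifted = "CGTACGTA"      # first base deleted, so the next read base slides into the window

print(hamming(true_cb, substituted), levenshtein(true_cb, substituted))
print(hamming(true_cb, shifted), levenshtein(true_cb, shifted))
```

In the second pair every position mismatches, so a Hamming-based search would not link the two barcodes at any small threshold, while the Levenshtein distance stays at 2.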

Finally, I wonder if we can reduce the number of options by not having separate switches for activating something and selecting its mode, and instead combining the two into a single option? So plain umi_tools whitelist does no error detection, whereas umi_tools whitelist --ed-above-threshold=discard enables error detection and discards the affected reads?

@TomSmithCGAT
Member Author

TomSmithCGAT commented Feb 1, 2019

@IanSudbery - Regarding INDELs in the cell barcodes: my rationale is probably best explained by going back to this blog post: https://cgatoxford.wordpress.com/2017/05/23/estimating-the-number-of-true-cell-barcodes-in-single-cell-rna-seq-part-2/

I think INDELs in the CBs above the knee definitely do exist so we should support their detection. Rather stupidly, I hadn't considered the impact on the UMI. To my mind, the possible solutions are (in my order of preference):

  1. Discard all reads from CBs with a possible INDEL but allow correction of substitutions, e.g. make --ed-above-threshold=[discard|correct] affect only CBs with possible substitutions.
  2. [CURRENT BEHAVIOUR]. Discard all reads from CBs with a possible INDEL or substitution (via --ed-above-threshold=discard). We should add a warning to the user if they switch to correct, to make the risks clear.
  3. Correct the UMIs when the CB may contain an INDEL. This is harder to implement in the current workflow since we'd need to pass on information about the INDEL to extract. More importantly, I'd be concerned about screwing up UMIs for false positive INDELs. I'd always favour the conservative approach of discarding the CB entirely.
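
Solution 1 can be sketched as a small decision function. This is an illustrative sketch only, not the code in this PR: the names classify and resolve, the returned labels, and the default threshold of 2 are all assumptions. A candidate CB is treated as a substitution error when it matches a true CB within the threshold by position-wise mismatches, and as an INDEL error when only an edit distance that allows insertions and deletions brings it within the threshold.

```python
from typing import Optional


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance via the standard row-by-row dynamic programme."""
    previous = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        current = [i]
        for j, y in enumerate(b, start=1):
            current.append(min(previous[j] + 1,
                               current[j - 1] + 1,
                               previous[j - 1] + (x != y)))
        previous = current
    return previous[-1]


def classify(candidate: str, true_cb: str, threshold: int) -> Optional[str]:
    """Return 'substitution', 'indel', or None if the pair is not within threshold."""
    if candidate == true_cb:
        return None
    if len(candidate) == len(true_cb):
        mismatches = sum(x != y for x, y in zip(candidate, true_cb))
        if mismatches <= threshold:
            return "substitution"
    if edit_distance(candidate, true_cb) <= threshold:
        return "indel"
    return None


def resolve(candidate: str, true_cb: str, mode: Optional[str],
            threshold: int = 2) -> str:
    """Hypothetical policy for solution 1; returns 'keep', 'discard' or 'correct'."""
    if mode not in (None, "discard", "correct"):
        raise ValueError(f"unknown mode: {mode!r}")
    if mode is None:
        return "keep"  # error detection switched off
    kind = classify(candidate, true_cb, threshold)
    if kind is None:
        return "keep"  # not a putative error of this true CB
    if kind == "indel" or mode == "discard":
        return "discard"  # putative INDEL CBs are never corrected
    return "correct"  # substitution CB, and the user asked for correction


print(resolve("ACGAACGT", "ACGTACGT", "correct"))  # substitution case
print(resolve("CGTACGTA", "ACGTACGT", "correct"))  # INDEL-like case
```

The point of the design is that the user-selected mode only changes the outcome for substitution CBs, so a false-positive INDEL call can cost reads but cannot silently rewrite UMIs.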

@TomSmithCGAT
Member Author

As suggested, I've removed the ed-resolution option and kept only the option ed-above-threshold=[discard/correct], which defaults to None.
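
For readers unfamiliar with the pattern, a single option that both enables a feature and selects its mode can be sketched as below. This uses the standard library's argparse purely for illustration; how umi_tools actually declares the option is not shown in this thread, and the parser name, dest name and help text here are assumptions.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Illustrative parser with one option that enables and configures error detection."""
    parser = argparse.ArgumentParser(prog="whitelist-sketch")
    parser.add_argument(
        "--ed-above-threshold",
        dest="ed_above_threshold",
        choices=["discard", "correct"],
        default=None,  # None means error detection is not run at all
        help="Detect putative error CBs above the knee and discard or correct them",
    )
    return parser


def error_detection_enabled(args: argparse.Namespace) -> bool:
    """Error detection is active exactly when a mode was supplied."""
    return args.ed_above_threshold is not None


parser = build_parser()
off = parser.parse_args([])
on = parser.parse_args(["--ed-above-threshold=discard"])
print(error_detection_enabled(off), off.ed_above_threshold)
print(error_detection_enabled(on), on.ed_above_threshold)
```

Because the default is None, no separate on/off switch is needed, which is the simplification requested in the review.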

I've also implemented solution 1 above, i.e. putative INDEL CBs are always discarded. The discarding/correcting behaviour is logged as shown below. Note that this output is from the tests, where we allow 3 errors in a 16-20 base CB in order to actually detect some "error" CBs, hence the relatively large number of putative CB errors detected.

2019-02-06 15:52:02,654 INFO Top 414 cell barcodes passed the selected threshold
2019-02-06 15:52:02,991 INFO CBs above the knee corrected due to possible substitutions: 6
2019-02-06 15:52:02,991 INFO CBs above the knee discarded due to possible INDELs: 26
2019-02-06 15:52:02,991 INFO CBs above the knee discarded due to possible errors from multiple other CBs: 0


@TomSmithCGAT
Member Author

As far as I'm aware, this branch is now good to merge?

@IanSudbery
Member

IanSudbery commented Feb 6, 2019 via email

@TomSmithCGAT TomSmithCGAT merged commit f9518fa into master Feb 6, 2019
@TomSmithCGAT TomSmithCGAT deleted the {TS}-WhitelistFilterPosErrors branch August 4, 2022 11:56