{ts} whitelist filter pos errors #309

TomSmithCGAT · 2019-01-24T16:01:03Z

Adds options to whitelist to detect putative errors in the CBs above the knee threshold
Adds tests to cover new options

See (#138) for motivation

IanSudbery

Why use the Levenshtein distance here rather than the Hamming distance? Is it just that the number of comparisons is low enough to make it feasible? Should we be worried that indels in the CB will likely mess up the UMI?
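The concern about indels can be illustrated with a small sketch. It assumes a read layout in which the CB sits directly before the UMI and both are extracted by fixed position; the sequences, lengths and the `extract` helper are made up for illustration and are not taken from UMI-tools.

```python
def hamming(a, b):
    """Number of mismatching positions between two equal-length strings."""
    if len(a) != len(b):
        raise ValueError("Hamming distance needs equal-length strings")
    return sum(x != y for x, y in zip(a, b))


def extract(read, cb_len, umi_len):
    """Hypothetical fixed-position extraction: CB first, UMI straight after."""
    return read[:cb_len], read[cb_len:cb_len + umi_len]


# Made-up sequences, for illustration only.
TRUE_CB = "ACGTACGTTGCA"
TRUE_UMI = "GGATCCAT"
read = TRUE_CB + TRUE_UMI + "TTTTT"

# A substitution at CB position 3 leaves every downstream base in place.
sub_read = read[:3] + "G" + read[4:]
# A deletion at the same position shifts every downstream base left by one.
del_read = read[:3] + read[4:]

sub_cb, sub_umi = extract(sub_read, len(TRUE_CB), len(TRUE_UMI))
del_cb, del_umi = extract(del_read, len(TRUE_CB), len(TRUE_UMI))

print("substitution:", sub_cb, sub_umi, hamming(sub_cb, TRUE_CB))
print("deletion:    ", del_cb, del_umi, hamming(del_cb, TRUE_CB))
```

Under these assumptions the substitution read keeps the true UMI, whereas the deletion read yields a UMI that has lost its first base and gained one base of downstream sequence, so correcting only the CB would not restore it.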

Finally, I wonder if we can reduce the number of options: rather than having separate switches for activating something and selecting its mode, could we combine the two into a single option? So umi_tools whitelist doesn't do error detection, while umi_tools whitelist --ed-above-threshold=discard does do error detection and discards the affected reads?
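A minimal sketch of the single-option idea, using the standard-library argparse purely for illustration (this is not the project's own option parser, and the program name and help text are invented). Leaving the option unset means no error detection; setting it both enables detection and selects the mode.

```python
import argparse


def build_parser():
    """Illustrative parser with one option that both enables and configures."""
    parser = argparse.ArgumentParser(prog="whitelist-sketch")
    parser.add_argument(
        "--ed-above-threshold",
        dest="ed_above_threshold",
        choices=["discard", "correct"],
        default=None,  # unset: error detection above the knee is skipped
        help="detect putative CB errors above the knee and discard or correct them",
    )
    return parser


no_ed = build_parser().parse_args([])
with_ed = build_parser().parse_args(["--ed-above-threshold=discard"])

print(no_ed.ed_above_threshold)    # None
print(with_ed.ed_above_threshold)  # discard
```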

umi_tools/umi_methods.py

umi_tools/whitelist.py

TomSmithCGAT · 2019-02-01T10:28:10Z

@IanSudbery - Regarding INDELs in the cell barcodes. My rationale for this is probably best explained by going back to this blog post: https://cgatoxford.wordpress.com/2017/05/23/estimating-the-number-of-true-cell-barcodes-in-single-cell-rna-seq-part-2/

I think INDELs in the CBs above the knee definitely do exist so we should support their detection. Rather stupidly, I hadn't considered the impact on the UMI. To my mind, the possible solutions are (in my order of preference):

1. Discard all reads from CBs with a possible INDEL but allow correction of substitutions, e.g. make --ed-above-threshold=[discard|correct] only affect substitution CBs.
2. [CURRENT BEHAVIOUR] Discard all reads from CBs with a possible INDEL or substitution (via --ed-above-threshold=discard). We should add a warning to the user if they switch to correct, to make the risks clear.
3. Correct the UMIs when the CB may contain an INDEL. This is harder to implement in the current workflow since we'd need to pass information about the INDEL on to extract. More importantly, I'd be concerned about screwing up UMIs for false-positive INDELs. I'd always favour the conservative approach of discarding the CB entirely.

TomSmithCGAT · 2019-02-06T15:53:45Z

As suggested, I've removed the ed-resolution option and just left the option ed-above-threshold=[discard/correct], defaulted to None.

I've also implemented solution 1 above, i.e. putative INDEL CBs are always discarded. The discarding/correcting behaviour is logged as shown below. Note that this output is from the tests, where we allow 3 errors in a 16-20 base CB in order to actually detect some "error" CBs, hence the relatively large number of putative CB errors detected.

2019-02-06 15:52:02,654 INFO Top 414 cell barcodes passed the selected threshold
2019-02-06 15:52:02,991 INFO CBs above the knee corrected due to possible substitutions: 6
2019-02-06 15:52:02,991 INFO CBs above the knee discarded due to possible INDELs: 26
2019-02-06 15:52:02,991 INFO CBs above the knee discarded due to possible errors from multiple other CBs: 0
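The behaviour described above can be sketched roughly as follows. This is not the code from the PR: the function names, the counter keys and the rule for treating a pair as a substitution (equal length and Hamming distance within the allowed errors) are assumptions, and the handling of CBs that match several other CBs is inferred from the third log line.

```python
from collections import Counter


def hamming(a, b):
    """Mismatch count for equal-length strings, or None if lengths differ."""
    if len(a) != len(b):
        return None
    return sum(x != y for x, y in zip(a, b))


def levenshtein(a, b):
    """Plain dynamic-programming edit distance (substitutions and indels)."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        cur = [i]
        for j, y in enumerate(b, start=1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (x != y)))
        prev = cur
    return prev[-1]


def resolve_above_knee(counts, mode, errors=1):
    """counts maps each above-knee CB to its read count.

    mode is None (no error detection), "discard" or "correct".
    Returns (kept CBs, {error CB: CB it is corrected to}, tally).
    """
    kept, corrections, tally = set(counts), {}, Counter()
    if mode is None:
        return kept, corrections, tally
    ordered = sorted(counts, key=counts.get)  # least abundant first
    for ix, cb in enumerate(ordered):
        # Compare only against CBs with equal or higher counts.
        parents = [p for p in ordered[ix + 1:] if levenshtein(cb, p) <= errors]
        if not parents:
            continue
        kept.discard(cb)  # an error CB never stays on the whitelist
        if len(parents) > 1:
            tally["discarded, multiple candidates"] += 1
            continue
        h = hamming(cb, parents[0])
        if h is not None and h <= errors:
            if mode == "correct":
                corrections[cb] = parents[0]
                tally["corrected, substitution"] += 1
            else:
                tally["discarded, substitution"] += 1
        else:
            # Indels would also shift the UMI, so these are never corrected.
            tally["discarded, INDEL"] += 1
    return kept, corrections, tally
```

In this sketch a CB explained by a single substitution is only mapped onto its neighbour when the mode is "correct", while a CB explained by an indel is dropped in both modes.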

TomSmithCGAT · 2019-02-06T15:54:08Z

As far as I'm aware, this branch is now good to merge?

IanSudbery · 2019-02-06T16:02:03Z

okay, good to go

On Wed, 6 Feb 2019 at 15:55 Tom Smith ***@***.***> wrote:

> In umi_tools/umi_methods.py:
>
> > @@ -438,6 +440,70 @@ def getUserDefinedBarcodes(whitelist_tsv, getErrorCorrection=False):
> >      return set(cell_whitelist), false_to_true_map
> >
> > +def checkError(barcode, whitelist, errors=1):
> > +    '''
> > +    Check for errors (substitutions, insertions, deletions) between a barcode
> > +    and a set of whitelist barcodes.
> > +
> > +    Returns the whitelist barcodes which match the input barcode
> > +    allowing for errors. Returns as soon as two are identified.
> > +    '''
> > +
> > +    near_matches = []
> > +    comp_regex = regex.compile("(%s){e<=%i}" % (barcode, errors))
>
> Turns out regex is pretty quick. We can reduce run time for this step by approximately 20% but at a cost of increased dependencies. For 10,000 CBs above the knee, the current run-time is ~90s. For now, I'll leave this as it is.
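For readers without the regex package, the behaviour stated in the checkError docstring (return the near matches, stopping as soon as two are found) can be approximated with the standard library alone. This is a stand-in sketch, not the PR's implementation: `within_edits` and `near_matches` are hypothetical names, a bounded Levenshtein check replaces the fuzzy regex, and skipping an identical barcode is an assumption.

```python
def within_edits(a, b, k):
    """True if the Levenshtein distance between a and b is at most k.

    Gives up early once every cell of a DP row exceeds k, because the
    final distance can never be smaller than the minimum of any row.
    """
    if abs(len(a) - len(b)) > k:
        return False  # a length gap alone already needs more than k indels
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        cur = [i]
        for j, y in enumerate(b, start=1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (x != y)))
        if min(cur) > k:
            return False
        prev = cur
    return prev[-1] <= k


def near_matches(barcode, whitelist, errors=1):
    """Whitelist barcodes within `errors` edits of `barcode`.

    Stops at the second hit: two candidates already make the barcode
    ambiguous, so further comparisons would add nothing.
    """
    hits = []
    for candidate in whitelist:
        if candidate == barcode:
            continue  # assumption: an identical barcode is not an error
        if within_edits(barcode, candidate, errors):
            hits.append(candidate)
            if len(hits) > 1:
                return hits
    return hits


print(near_matches("ACGT", ["ACGT", "ACGA", "ACG", "TTTT", "ACGG"]))
```

The length pre-filter and the early exits are cheap ways to cut the number of full comparisons; no claim is made here about how this compares with the regex timings quoted above.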

TomSmithCGAT added 2 commits January 24, 2019 15:57

adds options to detect and discard/correct CBs above knee with putative sequence errors

2021264

clarifies whitelist doc

4ab01ea

TomSmithCGAT requested a review from IanSudbery January 24, 2019 16:01

resolve conflicts with master

73e3477

IanSudbery reviewed Jan 31, 2019

TomSmithCGAT added 3 commits February 6, 2019 13:10

simplified error detection above knee options

4da0818

updates above-knee CB filtering to only correct substitutions

6694437

typo

b4c6871

TomSmithCGAT merged commit f9518fa into master Feb 6, 2019

TomSmithCGAT deleted the {TS}-WhitelistFilterPosErrors branch August 4, 2022 11:56
