Inconsistent result of no reads from the same coordinate in UMI-tools dedup #567

Closed
camelest opened this issue Nov 28, 2022 · 1 comment

camelest commented Nov 28, 2022

Hi, thank you so much for the wonderful tool.

I have encountered a strange result: the original file contains these 2 reads (subsampled public 10x Chromium 5' scRNA-seq data, mapped with STAR v2.7.10a)

SRR12018267.78007711	83	chr9	5437908	255	61M40S	=	5436598	-1371	CAGTAGATGACGCACCTCAGCCAATTCGCGCAGCCCTCAGCTTCTTTAAAGAGCCGGCACTCCCCATATAAGAAATNACCGCCGGTGGCCTACTCGTAGAG	FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF::FFFF:F:#FFFFFFFFFFFFFFFF:FFFFFFF	NH:i:1	HI:i:1	nM:i:0	AS:i:161	CR:Z:CTCTACGAGTAGGCCA	UR:Z:CCGGCGGTNA	GX:Z:-	GN:Z:-	sS:Z:CTCTACGAGTAGGCCACCGGCGGTNATTTCTTATATGGGGAGTGCCGGCTCTTTAAAGAAGCTGAGGGCTGCGCGAATTGGCTGAGGTGCGTCATCTACTG	sQ:Z:FFFFFFF:FFFFFFFFFFFFFFFF#:F:FFFF::FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF	sM:i:-23	CB:Z:-	UB:Z:-
SRR12018267.400620518	83	chr9	5437908	255	61M40S	=	5431901	-6068	CAGTAGATGACGCACCTCAGCCAATTCGCGCAGCCCTCAGCTTCTTTAAAGAGCCGGCACTCCCCAGCTCAGAAATGACCGCCGGTGGCCTACTCGTAGAG	FFFFFFFFFFF:FFFFFFF:FFFFFFFFFFFFFFFFFFFFFFFF:F:F:F:FFFFFFFFFFFFFFF,:,FFFF,FFFFFFF:FFFFFFFFFFFFFFFFF::	NH:i:1	HI:i:1	nM:i:0	AS:i:161	CR:Z:CTCTACGAGTAGGCCA	UR:Z:CCGGCGGTCA	GX:Z:-	GN:Z:-	sS:Z:CTCTACGAGTAGGCCACCGGCGGTCATTTCTGAGCTGGGGAGTGCCGGCTCTTTAAAGAAGCTGAGGGCTGCGCGAATTGGCTGAGGTGCGTCATCTACTG	sQ:Z:::FFFFFFFFFFFFFFFFF:FFFFFFF,FFFF,:,FFFFFFFFFFFFFFF:F:F:F:FFFFFFFFFFFFFFFFFFFFFFFF:FFFFFFF:FFFFFFFFFFF	sM:i:0	CB:Z:CTCTACGAGTAGGCCA	UB:Z:CCGGCGGTCA

which somehow disappear after deduplication. Across different umi_tools dedup runs, I sometimes see 5 examples of similar read groups and sometimes only 2.

What is stranger is that if I create a test.bam containing just these 2 reads, deduplication always chooses 1 representative read.

Do you have any idea what is going on here? I have read through #458 but still could not figure out why this happens. Is it because I subsampled the data? Thank you so much for your help.


umitools v1.1.2
umi_tools dedup --per-cell -I input.bam --extract-umi-method=tag --umi-tag=UR --cell-tag=CR -S output.bam

camelest changed the title from "Inconsistent result on different runs of umitools dedup ended up to no representative reads from the same mapping coordinate" to "Inconsistent result of no reads from the same coordinate in UMI-tools dedup" on Nov 28, 2022
@TomSmithCGAT
Member

Hi @camelest,

umi_tools is not deterministic by default, so different runs can yield different results. There's an open PR to make it deterministic (#550), which links to other issues describing how to get deterministic output in the current version, if you want to read further.
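
As general background, one common source of run-to-run variation in Python programs is string hash randomization, which changes the iteration order of sets of strings between interpreter runs unless `PYTHONHASHSEED` is fixed. Whether this is the cause, or the only cause, of the variation in umi_tools is an assumption here; #550 and the issues it links are the authoritative source. The sketch below only shows the Python behaviour itself. The first two UMIs come from the reads above, and the other two are hypothetical.

```python
import os
import subprocess
import sys

# Child program: print the iteration order of a set of UMI strings.
# The last two UMIs are hypothetical, added only to make the set larger.
CHILD = "print(list({'CCGGCGGTNA', 'CCGGCGGTCA', 'ACGGCGGTCA', 'CCGGCGGTCT'}))"


def set_order(hash_seed):
    """Run CHILD in a fresh interpreter with PYTHONHASHSEED set to hash_seed."""
    env = dict(os.environ, PYTHONHASHSEED=str(hash_seed))
    result = subprocess.run(
        [sys.executable, "-c", CHILD],
        env=env,
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout.strip()


# With a fixed hash seed, every fresh interpreter iterates the set identically.
first = set_order(0)
second = set_order(0)
print(first == second)
```

Without a fixed seed, the order may differ from one interpreter run to the next, so any code that breaks ties by iteration order can give different answers on different runs.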

Without seeing the full input and output for all reads with the same alignment coordinates as the reads above, it's not possible to be certain what's happening. However, I expect you have more reads with the same alignment coordinates and UMIs similar enough to form a network with more than one possible solution.
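
To illustrate how a UMI network can have more than one solution, here is a minimal sketch of a directional-style grouping applied to the two UMIs from the reads above. It is a simplified illustration, not the umi_tools implementation: the function names are hypothetical, and the edge rule (at most 1 mismatch and `count[a] >= 2 * count[b] - 1`) is assumed from the documented directional method.

```python
from itertools import permutations


def hamming(a, b):
    """Number of mismatching positions between two equal-length UMIs."""
    return sum(x != y for x, y in zip(a, b))


def directional_edges(counts, threshold=1):
    """Edge a -> b if the UMIs are within `threshold` mismatches
    and counts[a] >= 2 * counts[b] - 1 (assumed directional rule)."""
    edges = {umi: set() for umi in counts}
    for a, b in permutations(counts, 2):
        if hamming(a, b) <= threshold and counts[a] >= 2 * counts[b] - 1:
            edges[a].add(b)
    return edges


def representatives(counts, order):
    """Visit UMIs in `order`; each unclaimed UMI becomes a representative
    and claims every UMI reachable from it."""
    edges = directional_edges(counts)
    seen = set()
    reps = []
    for umi in order:
        if umi in seen:
            continue
        reps.append(umi)
        seen.add(umi)
        stack = [umi]
        while stack:
            node = stack.pop()
            for nxt in edges[node]:
                if nxt not in seen:
                    seen.add(nxt)
                    stack.append(nxt)
    return reps


# UR tags of the two reads in the issue, each observed once.
umi_n = "CCGGCGGTNA"
umi_c = "CCGGCGGTCA"
counts = {umi_n: 1, umi_c: 1}

print(hamming(umi_n, umi_c))                      # 1
print(representatives(counts, [umi_n, umi_c]))    # ['CCGGCGGTNA']
print(representatives(counts, [umi_c, umi_n]))    # ['CCGGCGGTCA']
```

Because both UMIs have a count of 1, the edge rule holds in both directions, so either UMI can end up as the representative depending on which is visited first. With more reads at the same coordinates, the same kind of tie could leave neither of these two reads in the output.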

Tom
