-
Notifications
You must be signed in to change notification settings - Fork 101
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Bismark silently outputs incorrect results when UMIs are added using Illuminas bcl-convert #699
Comments
Thanks Lars, I'll await further developments. Hope all is well? |
Hi Felix, I'm doing well, I hope you are too :) I double checked and this is an issue. Tools like Illuminas bcl-convert and umi-tools places the UMI like this (umi-tools by default uses The function Line 6207 in 37e2cad
replaces spaces with underscores so the sam record ID is A00001:001:HN2F7DRX1:1:1101:1452:1000:CAAGAG_1:N:0:AATGACGC This causes index to be treated as the UMI. No warning or error is given (either by deduplicate_bismark or umi-tools dedup) and the estimated number of duplicates is massively inflated. This is extra problematic as this means the workflow
will also cause this problem. Here is an example:
|
Hi Lars, sorry for the late reaction this issue. I can completely see how this issue arises, and that the behaviour you describe is not desirable. I was hoping to find a good suggestion of how to handle this so that users are aware of it... Just to quickly mention that currently,
I can currently think of a few different actions we could take:
What are your thoughts on this? |
Hi Felix I agree, I like all the options you suggest, but I think options 2 and 3 are the most relevant. A warning/error that it looks like you are deduplicating based on index sequence would probably have saved me. I prefer an error, since warnings are more easily missed/ignored but I can also see how that could be annoying. Is there some way to check if the first N reads all have this pattern and an identical "UMI"? Then it surely must be an error? |
Thanks for the comments. For 2), yes we can check the first say 1000 lines and see if all 'UMIs' are the same (as in the sequencing index), and then die. The |
Which part of using the |
Yea I meant if you had it processed already without |
This is not really a bug as the documentation clearly states how deduplicate_bismark expects UMIs to be handled, but it is an easy mistake to make.
As documented in deduplicate_bismark, Bismark expects UMIs of the form:
@A00001:001:HN2F7DRX1:1:1101:1452:1000 1:N:0:AATGACGC:CAAGAG
But if Illuminas bcl-convert is used with OverrideCycles to handle UMIs, the read ID looks like this
@A00001:001:HN2F7DRX1:1:1101:1452:1000:CAAGAG 1:N:0:AATGACGC
The UMI is highlighted in bold.
This means the sample index is used as a UMI, and no warning or error is emitted.
I propose running a pre-flight check to detect this scenario, and potentially to support the UMI location chosen by Illumina.
EDIT: I might have been completely off. I'll close it for now.
The text was updated successfully, but these errors were encountered: