-
Notifications
You must be signed in to change notification settings - Fork 192
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Paired end data without length #357
Comments
Using paired-end mode forces the use of template, or fragment length as this is how UMI-tools calculates the mapping location of read2 without actually having to find read2. It doesn't use read length. |
Hi Ian. |
Hi Simon, Sorry, this is not something we do at the moment. The closest we get is that It wouldn't actaully be a big change to create an option for what you are asking, however, neither me, nor my co-author @TomSmithCGAT have time at the moment to do this. If you wanted to have a go, we'd happily accept a pull request. A guess, the best way to do this would be to create a new option (something along the lines of "ignore read2 position", then find where the read "key" is defined in |
Hi Ian, Thanks for the response. As far as I can see, the only place the template length is taken is in sam_methods.py, where line 504 adds Thanks, |
Yeah, so I'd guess I'd alter that to say something like (not
self.options.ignore_tlen)*self.options.paried*read.tlen
Add it as an option, create a test for it to add to the test suite (tests
are specified in tests/tests.yml).
…On Wed, 21 Aug 2019 at 14:27, Simon Pearce ***@***.***> wrote:
Hi Ian,
Thanks for the response. As far as I can see, the only place the template
length is taken is in sam_methods.py, where line 504 adds
self.options.paired*read.tlen to the key.
I've made myself a local copy of umi_tools with this replaced by 0, and
that seems to be working at the moment. If I have time I'll try and add it
as an option.
Thanks,
Simon
—
You are receiving this because you commented.
Reply to this email directly, view it on GitHub
<#357?email_source=notifications&email_token=ABJELDVFCQEXOLNNOHSZIQTQFU7D3A5CNFSM4IJVD3M2YY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGOD4ZU6DY#issuecomment-523456271>,
or mute the thread
<https://github.com/notifications/unsubscribe-auth/ABJELDW57MUU2MIZYPMCXNTQFU7D3ANCNFSM4IJVD3MQ>
.
|
My colleague Stephen Kitcatt has helped modify the code as discussed below and is in the pull request. I have been using this change internally for the last few months, and it can make a big difference to the number of reads flagged as duplicates. For instance sometimes the aligner decides to include a non-reference base on the end of the read, and sometimes it decides to soft-mask it. This leads to having slightly different lengths, which then don't get marked as duplicates without this change. |
Hello @SPPearce and @IanSudbery I was thinking another approach to the problem @SPPearce mentions is to not actually ignore template length, but rather do a better job in determining a bona fide set of different template lengths for a given UMI + R1 start coordinate. Consider that we still want to incorporate the template length along with the UMI and R1 start coordinate; however, we do not want to run into the problems @SPPearce mentions where duplicate may "escape" a bin because of how the aligner mapped the terminal bases of the read. Simply ignoring the template length does not solve this problem. Perhaps you implement some logic to check if the observed template length for a UMI + position matches a previously observed length within some threshold; if so, consider it a duplicate of this bin. If not, create a new length group that would be deduplicated separately. In such a way, we do not use length in a "hard" manner, but allow some wiggle room in the alignment of the terminus of R2. With this, we could deduplicate differently on template lengths but also avoid the duplicate escape problem. This solution should fix @SPPearce's problem while accommodating library prep chemistries that allow several different template lengths for each UMI + R1 start coordinate. There may be problems with this approach that I'm not thinking about, so it'd be nice to hear your thoughts. Hopefully this wouldn't end up being too memory intensive. |
Hi Ian, Ian |
@IanSudbery happy to see this PR should likely resolve this issue: |
Hi,
I have paired end data which I'd like to umi dedup without taking into account the read length. As far as I can see, if I specify paired end then it forces the length to be included, and I would like to turn this off.
Is there a way to do this (or a work around at all)?
Thanks,
Simon
The text was updated successfully, but these errors were encountered: