Paired end data without length #357

SPPearce · 2019-08-06T11:00:08Z

Hi,

I have paired end data which I'd like to umi dedup without taking into account the read length. As far as I can see, if I specify paired end then it forces the length to be included, and I would like to turn this off.

Is there a way to do this (or a work around at all)?

Thanks,
Simon

IanSudbery · 2019-08-06T14:56:25Z

Using paired-end mode forces the use of template, or fragment length as this is how UMI-tools calculates the mapping location of read2 without actually having to find read2. It doesn't use read length.

SPPearce · 2019-08-11T06:48:31Z

Hi Ian.
Sorry, fragment length is what i meant. I'd like to dedup the reads based solely on read1, whether or not the fragment length is the same.
I have reads where the primers for the second end is not perfectly attached at the end of the initial fragment after an initial amplification step, so read 1 is identical but read 2 has slightly different lengths.

SPPearce · 2019-08-19T05:50:48Z

To be specific, this is the kind of reads I have:

All five have the same UMI and identical read 1s, but the fragment length is slightly different. I'd like to just remove the paired ends based on their read 1 position.

IanSudbery · 2019-08-19T11:26:42Z

Hi Simon,

Sorry, this is not something we do at the moment. The closest we get is that group outputs all input reads, and so if you ran it in non-paired mode, you'd still get the the read2s.

It wouldn't actaully be a big change to create an option for what you are asking, however, neither me, nor my co-author @TomSmithCGAT have time at the moment to do this.

If you wanted to have a go, we'd happily accept a pull request. A guess, the best way to do this would be to create a new option (something along the lines of "ignore read2 position", then find where the read "key" is defined in sam_methods.get_bundles to flip the paired mode status (it'd have to be in paired mode everywhere else. Thats just a first guess, and you'd probably have to do some testing as well. Happy to offer advice.

SPPearce · 2019-08-21T13:27:24Z

Hi Ian,

Thanks for the response. As far as I can see, the only place the template length is taken is in sam_methods.py, where line 504 adds self.options.paired*read.tlen to the key.
I've made myself a local copy of umi_tools with this replaced by 0, and that seems to be working at the moment. If I have time I'll try and add it as an option.

Thanks,
Simon

IanSudbery · 2019-08-21T13:34:37Z

Yeah, so I'd guess I'd alter that to say something like (not self.options.ignore_tlen)*self.options.paried*read.tlen Add it as an option, create a test for it to add to the test suite (tests are specified in tests/tests.yml).

…

On Wed, 21 Aug 2019 at 14:27, Simon Pearce ***@***.***> wrote: Hi Ian, Thanks for the response. As far as I can see, the only place the template length is taken is in sam_methods.py, where line 504 adds self.options.paired*read.tlen to the key. I've made myself a local copy of umi_tools with this replaced by 0, and that seems to be working at the moment. If I have time I'll try and add it as an option. Thanks, Simon — You are receiving this because you commented. Reply to this email directly, view it on GitHub <#357?email_source=notifications&email_token=ABJELDVFCQEXOLNNOHSZIQTQFU7D3A5CNFSM4IJVD3M2YY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGOD4ZU6DY#issuecomment-523456271>, or mute the thread <https://github.com/notifications/unsubscribe-auth/ABJELDW57MUU2MIZYPMCXNTQFU7D3ANCNFSM4IJVD3MQ> .

SPPearce · 2019-11-22T10:18:31Z

My colleague Stephen Kitcatt has helped modify the code as discussed below and is in the pull request. I have been using this change internally for the last few months, and it can make a big difference to the number of reads flagged as duplicates. For instance sometimes the aligner decides to include a non-reference base on the end of the read, and sometimes it decides to soft-mask it. This leads to having slightly different lengths, which then don't get marked as duplicates without this change.

ijhoskins · 2020-02-03T18:27:36Z

Hello @SPPearce and @IanSudbery I was thinking another approach to the problem @SPPearce mentions is to not actually ignore template length, but rather do a better job in determining a bona fide set of different template lengths for a given UMI + R1 start coordinate.

Consider that we still want to incorporate the template length along with the UMI and R1 start coordinate; however, we do not want to run into the problems @SPPearce mentions where duplicate may "escape" a bin because of how the aligner mapped the terminal bases of the read. Simply ignoring the template length does not solve this problem.

Perhaps you implement some logic to check if the observed template length for a UMI + position matches a previously observed length within some threshold; if so, consider it a duplicate of this bin. If not, create a new length group that would be deduplicated separately. In such a way, we do not use length in a "hard" manner, but allow some wiggle room in the alignment of the terminus of R2.

With this, we could deduplicate differently on template lengths but also avoid the duplicate escape problem. This solution should fix @SPPearce's problem while accommodating library prep chemistries that allow several different template lengths for each UMI + R1 start coordinate.

There may be problems with this approach that I'm not thinking about, so it'd be nice to hear your thoughts. Hopefully this wouldn't end up being too memory intensive.

IanSudbery · 2020-02-06T10:04:56Z

Hi Ian,
I'll have to think about how this might be implemented. I'll try to reply more fully at the weekend when I've got some time to spare.

Ian

ijhoskins · 2020-02-26T19:09:01Z

@IanSudbery happy to see this PR should likely resolve this issue:
#180

skitcattCRUKMI mentioned this issue Nov 21, 2019

Added command line flag --ignore-tlen #382

Merged

ijhoskins mentioned this issue Feb 6, 2020

Feature request- allow UMI groups to be extracted but not removed from read #394

Closed

ijhoskins mentioned this issue Feb 26, 2020

Include read/template length in group stats output #400

Closed

TomSmithCGAT closed this as completed Mar 7, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Paired end data without length #357

Paired end data without length #357

SPPearce commented Aug 6, 2019

IanSudbery commented Aug 6, 2019

SPPearce commented Aug 11, 2019

SPPearce commented Aug 19, 2019

IanSudbery commented Aug 19, 2019

SPPearce commented Aug 21, 2019

IanSudbery commented Aug 21, 2019 via email

SPPearce commented Nov 22, 2019

ijhoskins commented Feb 3, 2020

IanSudbery commented Feb 6, 2020

ijhoskins commented Feb 26, 2020

Paired end data without length #357

Paired end data without length #357

Comments

SPPearce commented Aug 6, 2019

IanSudbery commented Aug 6, 2019

SPPearce commented Aug 11, 2019

SPPearce commented Aug 19, 2019

IanSudbery commented Aug 19, 2019

SPPearce commented Aug 21, 2019

IanSudbery commented Aug 21, 2019 via email

SPPearce commented Nov 22, 2019

ijhoskins commented Feb 3, 2020

IanSudbery commented Feb 6, 2020

ijhoskins commented Feb 26, 2020