Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Paired end data without length #357

Closed
SPPearce opened this issue Aug 6, 2019 · 10 comments
Closed

Paired end data without length #357

SPPearce opened this issue Aug 6, 2019 · 10 comments

Comments

@SPPearce
Copy link

SPPearce commented Aug 6, 2019

Hi,

I have paired end data which I'd like to umi dedup without taking into account the read length. As far as I can see, if I specify paired end then it forces the length to be included, and I would like to turn this off.

Is there a way to do this (or a work around at all)?

Thanks,
Simon

@IanSudbery
Copy link
Member

Using paired-end mode forces the use of template, or fragment length as this is how UMI-tools calculates the mapping location of read2 without actually having to find read2. It doesn't use read length.

@SPPearce
Copy link
Author

Hi Ian.
Sorry, fragment length is what i meant. I'd like to dedup the reads based solely on read1, whether or not the fragment length is the same.
I have reads where the primers for the second end is not perfectly attached at the end of the initial fragment after an initial amplification step, so read 1 is identical but read 2 has slightly different lengths.

@SPPearce
Copy link
Author

To be specific, this is the kind of reads I have:
Screenshot 2019-08-06 at 12 42 53
All five have the same UMI and identical read 1s, but the fragment length is slightly different. I'd like to just remove the paired ends based on their read 1 position.

@IanSudbery
Copy link
Member

Hi Simon,

Sorry, this is not something we do at the moment. The closest we get is that group outputs all input reads, and so if you ran it in non-paired mode, you'd still get the the read2s.

It wouldn't actaully be a big change to create an option for what you are asking, however, neither me, nor my co-author @TomSmithCGAT have time at the moment to do this.

If you wanted to have a go, we'd happily accept a pull request. A guess, the best way to do this would be to create a new option (something along the lines of "ignore read2 position", then find where the read "key" is defined in sam_methods.get_bundles to flip the paired mode status (it'd have to be in paired mode everywhere else. Thats just a first guess, and you'd probably have to do some testing as well. Happy to offer advice.

@SPPearce
Copy link
Author

Hi Ian,

Thanks for the response. As far as I can see, the only place the template length is taken is in sam_methods.py, where line 504 adds self.options.paired*read.tlen to the key.
I've made myself a local copy of umi_tools with this replaced by 0, and that seems to be working at the moment. If I have time I'll try and add it as an option.

Thanks,
Simon

@IanSudbery
Copy link
Member

IanSudbery commented Aug 21, 2019 via email

@SPPearce
Copy link
Author

My colleague Stephen Kitcatt has helped modify the code as discussed below and is in the pull request. I have been using this change internally for the last few months, and it can make a big difference to the number of reads flagged as duplicates. For instance sometimes the aligner decides to include a non-reference base on the end of the read, and sometimes it decides to soft-mask it. This leads to having slightly different lengths, which then don't get marked as duplicates without this change.

@ijhoskins
Copy link

Hello @SPPearce and @IanSudbery I was thinking another approach to the problem @SPPearce mentions is to not actually ignore template length, but rather do a better job in determining a bona fide set of different template lengths for a given UMI + R1 start coordinate.

Consider that we still want to incorporate the template length along with the UMI and R1 start coordinate; however, we do not want to run into the problems @SPPearce mentions where duplicate may "escape" a bin because of how the aligner mapped the terminal bases of the read. Simply ignoring the template length does not solve this problem.

Perhaps you implement some logic to check if the observed template length for a UMI + position matches a previously observed length within some threshold; if so, consider it a duplicate of this bin. If not, create a new length group that would be deduplicated separately. In such a way, we do not use length in a "hard" manner, but allow some wiggle room in the alignment of the terminus of R2.

With this, we could deduplicate differently on template lengths but also avoid the duplicate escape problem. This solution should fix @SPPearce's problem while accommodating library prep chemistries that allow several different template lengths for each UMI + R1 start coordinate.

There may be problems with this approach that I'm not thinking about, so it'd be nice to hear your thoughts. Hopefully this wouldn't end up being too memory intensive.

@IanSudbery
Copy link
Member

Hi Ian,
I'll have to think about how this might be implemented. I'll try to reply more fully at the weekend when I've got some time to spare.

Ian

@ijhoskins
Copy link

@IanSudbery happy to see this PR should likely resolve this issue:
#180

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants