Also report position of 2nd read in "group --group-out" table #180
base: master
Conversation
Hi @fgp - I've had a look through your pull request and this seems to be a sensible approach. However, I think this will run into issues when the read pairs have multiple alignments, since one could see the same read1 before observing the read2, or vice versa. We can of course check for this and store the multiple read1s/read2s, but I think ultimately we would want to introduce a flag to allow the user to turn off the behaviour you introduce above when they have multi-mapped reads - reverting to outputting all read2s as they are observed.

I had a few comments for review, but before I add them, I wonder if it might be worth considering another approach, as I'm slightly concerned that the code for get_bundles.call() is getting a bit unwieldy, and will only get more so if we make this behaviour optional as I suggest. Perhaps a cleaner solution in terms of maintaining the code might be to stick with the current approach of only tagging the read1s but introduce a

If you'd like I can make a branch with this approach implemented and you could check it out with your input files.

@IanSudbery - Any thoughts on the relative merits of handling read2s prior and post UMI grouping?
I do understand your concern about the complexity of

The reason seems to be that the mapping position as returned by

I do agree that
Hmmm... yeah, I hadn't really considered this. So there is a major advantage to considering the read2s in a single pass. OK, how about we replace lines 1299-1391 of

What do you think? If you're happy with this, do you want to make the changes and I'll review the pull request fully.
Ok, I moved the pair-building code out of get_bundles. I've split the changes up into smaller units now. The first two commits on the group_endpos branch now fix what look like oversights in the existing code to me -- one is a redundant check for read.is_mapped, the other seems to count unmapped reads twice on some occasions. The next four introduce the new behaviour.

Finally, I still need to add a test case for it.
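To illustrate the pair-building generator idea discussed here, a toy version might look like the following. This is a sketch only: `pair_reads`, `NO_MATE` and the `Read` stand-in are made up for illustration, and the real get_readpairs also tracks positions and handles improper pairs.

```python
from collections import namedtuple

NO_MATE = object()  # sentinel marking a missing mate

def pair_reads(reads):
    """Yield (read1, read2) tuples from a stream of paired reads,
    matching mates on query name. Minimal sketch: assumes each qname
    occurs at most twice, and ignores mapping positions entirely."""
    pending = {}  # qname -> first mate seen
    for read in reads:
        mate = pending.pop(read.qname, NO_MATE)
        if mate is NO_MATE:
            pending[read.qname] = read
        else:
            # order the pair as (read1, read2)
            yield (read, mate) if read.is_read1 else (mate, read)
    # leftovers have no mate in the stream (e.g. mate on another contig)
    for read in pending.values():
        yield read, None

# stand-in for pysam's AlignedSegment, for illustration only
Read = namedtuple("Read", "qname is_read1")
```

The single pass keeps only unmatched reads in memory, which is what makes the streaming approach attractive compared to a per-chromosome re-sweep.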
Great. I'll review this tomorrow. I think @IanSudbery would like to look this over too.
(branch force-pushed from 359fed0 to 85ec391)
I've now added two tests, and updated the "dedup" tool (and ReadDeduplicator). Currently, the dedup_single_stats test seems to fail; I'll look into that...
(branch force-pushed from 88d9c9a to 56d8c07)
I've fixed the test failures and style violations. On my machine (using python 2.7), the tests now pass.

From my end, this is ready for review. Documentation still needs to be updated, probably; not sure where to explain the new behaviour, though.
Apart from the worries I have highlighted in the review, I am also worried about the memory and run-time implications of this. We used to have an implementation that used a queue to output read pairs, but it was replaced with the post-process re-sweep of the chromosome because the former was causing an explosion in memory usage, and run times in the days rather than minutes range.
I'm going to need to do some substantial benchmarking before we'd be happy to merge.
I do have an alternative idea for how we might get around the problem of using the proper position of the read2 rather than the tlen, but I need to think about it a little....
umi_tools/umi_methods.py
Outdated
if get_readpairs.NoMate in queue_entry:
    # Incomplete entry. Turn into singleton entry
    read = next(r for r in queue_entry if r is not get_readpairs.NoMate)
    U.warn("inconsistent BAM file: only one of the two mates of proper pair %s found on chromosome %s, treating as unpaired" %
I think that this is probably going to be very common for people that are mapping to the transcriptome, where each transcript is a different contig.
Ok, but in this case does --paired mode make any sense at all?
umi_tools/umi_methods.py
Outdated
# Read is part of an already queued (incomplete) read pair. Complete it.
queue_entry = self.incomplete_queue_entries.pop(read.qname)
if queue_entry[read_i] is not get_readpairs.NoMate:
    U.warn("inconsistent BAM file: both mates %s flagged to be read%d" %
This is going to be a problem with multi-mappers. Many people allow multi-mappers, and they are unavoidable when mapping to the transcriptome.
True, but again I don't really see how multi-mapping and paired reads should work together at all. The fundamental problem seems to be that there are n*n possible pairings if both read1 and read2 are mapped to n locations. And the BAM format at best may tell us which the "primary" pair is, but not how the secondary mappings should be paired. Maybe I'm missing the point, though -- I haven't really worked with multi-mapped data so far...
On second thought, I could do the same as TwoPassPairWriter does, and use reference name and mapping position as additional keys to identify mate pairs. Would that alleviate these concerns?
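The keying scheme this suggests might look roughly like the following. This is a sketch only: the field names follow pysam's AlignedSegment, but `self_key`/`mate_key` are hypothetical helpers, not TwoPassPairWriter's actual code.

```python
from collections import namedtuple

# stand-in for pysam's AlignedSegment fields, for illustration only
Read = namedtuple("Read", "query_name reference_id reference_start "
                          "next_reference_id next_reference_start")

def self_key(read):
    """Key under which a read is filed while waiting for its mate."""
    return (read.query_name, read.reference_id, read.reference_start)

def mate_key(read):
    """Key under which this read's mate should be found, built from the
    mate coordinates stored on the read itself."""
    return (read.query_name, read.next_reference_id, read.next_reference_start)
```

Because each alignment records its mate's contig and position, two multi-mapped alignments of the same qname at different locations file under different keys and can no longer be confused for one another.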
umi_tools/umi_methods.py
Outdated
self.incomplete_queue_entries[read.qname] = queue_entry

# Output queued reads and readpairs up to the first incomplete pair
while self.queue and sum(1 for r in self.queue[0] if r is get_readpairs.NoMate) == 0:
Yeah, this is definitely going to become a mess if you have, say, two alignments for read1 followed by an alignment for read2. The two alignments of read1 will be treated as a read1 and a read2 and output as a pair, then read2 will be encountered and a new pair will be created.
Hm, yeah, multi-mapping breaks this. Again, I'm inclined to simply say that multi-mapping and --paired are contradictory, but again I might be missing something...
umi_tools/umi_methods.py
Outdated
for read in inreads:
for read, read2 in generator(inreads):
Since we are waiting for pairs to be output, can we any longer assume that reads will be output in the same order they are in the file? Specifically, if a mate is not found, will it not be output until the end? But in order to collect all reads that map to a location, we assume that they are sorted by pos.
I thought I had done that correctly, but I realize now there is a bug in the current version. The idea was to keep the order "intact enough" for get_bundle's purposes. The original version (where collecting pairs was handled within get_bundles) was fine, but it seems I have introduced a bug when factoring get_readpairs out.
What get_bundle needs is basically that read pairs are ordered according to read1. Currently, that is the case if read1 occurs first in the file. But if read2 occurs first, the pair will be output at the correct position for read2, not read1. (In my old pipeline that was indeed the correct thing to do, because I used the leftmost mapped position of a pair for deduplication.)
This is easily fixed AFAICS by enqueueing a pair not immediately, but only after read1 (even if that's the second read of the pair in the BAM file) has been observed.
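One way to picture the fix: completed pairs enter an output buffer keyed by read1's position (even when read2 was seen first), and a pair is only released once no still-incomplete pair could precede it. The following is a made-up sketch of that idea, not umi_tools' actual queue.

```python
import heapq

class Read1OrderedQueue:
    """Buffer completed (read1, read2) pairs and release them in
    ascending read1 position. A pair becomes safe to emit only once no
    still-incomplete pair could have an earlier read1. Sketch only."""

    def __init__(self):
        self._heap = []
        self._counter = 0  # tie-breaker keeps push order stable

    def push(self, read1_pos, pair):
        """File a completed pair under its read1 coordinate."""
        heapq.heappush(self._heap, (read1_pos, self._counter, pair))
        self._counter += 1

    def pop_ready(self, min_incomplete_read1_pos):
        """Yield pairs whose read1 position lies strictly before the
        smallest read1 position of any still-incomplete pair."""
        while self._heap and self._heap[0][0] < min_incomplete_read1_pos:
            yield heapq.heappop(self._heap)[2]
```

The caller would track the smallest read1 position among pairs still missing a mate and pass it to `pop_ready` after each completed pair.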
This is now fixed, see commits below.
Regarding the performance -- I can certainly do some benchmarking. Also, my existing pipeline uses the same kind of queuing to process pairs together, and I haven't run into any problems with memory consumption or runtime. In essence, the memory consumption is bounded by the BAM file's maximum coverage, if you call a base "covered" when it lies within a sequenced fragment (but wasn't necessarily sequenced itself). Even if that number reaches a million (which amounts to extremely deep sequencing), and each read takes 1k to store, a gigabyte of memory would still suffice. Runtime-wise, the queuing should be O(1) per read.
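The bound claimed above is easy to sanity-check numerically (the figures are the comment's own illustrative numbers, not measurements):

```python
def queue_memory_gib(max_pending_reads, bytes_per_read=1024):
    """Rough upper bound on the pairing queue's memory footprint:
    at most `max_pending_reads` reads held at once, ~1 KiB each."""
    return max_pending_reads * bytes_per_read / float(2 ** 30)
```

At 2**20 (roughly one million) pending reads and 1 KiB per read, the queue tops out around 1 GiB.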
I've changed get_readpairs to report pairs in ascending mapping position of read1 -- this should be sufficient to not break the clustering algorithm. get_readpairs now also uses the query name plus the mapping position of read1 as the "pair identification key" -- this should work better for multi-mappings. I can't really test this, though -- I don't have any multi-mapped paired-end data to test with.

Apart from the general worries about performance, this should address the raised concerns, I think. I'm currently doing some benchmarking (with pretty much worst-case files - a short bacterial genome, but sequenced extremely deeply, so the number of "pending" pairs is probably in the many thousands at any time) -- I'll post the results soonish.
Would love to test the new paired-end position/template length adjustment for clipped bases. Any estimate on when this PR will merge?
…red is not specified
In --paired mode, get_bundles now uses get_readpairs to process the first and second read of a pair together. The 2nd reads are now stored in the bundle as well, in a list parallel to "read" called "read2". This currently works only for all_reads=True. So far, the second read is only stored. The next commits will introduce changes that use the information from the second read to improve the grouping.
…d mate
The output contains a new column "position2" which contains the position of the 2nd read, computed in the same way as the position of read1 (i.e., the position is the position of the 1st base in sequencing (not reference!) direction, and is adjusted for possible clipped bases). This uses the previously introduced support for collecting read pairs in get_bundles.
…assPairWriter
This includes changes to ReadDeduplicator, which now uses the 2nd reads reported by get_bundles, and now returns a list of read pairs instead of a list of plain reads.
…th reads as a grouping key
… read1
Previously, they were reported in ascending mapping position of the left (in reference direction) read, which for get_bundle's purposes is incorrect. The test outputs are updated accordingly.
…r's key in get_readpairs
This should make get_readpairs work also for multi-mapped data, as long as mates aren't re-used in multiple (multi-mapped) pairs.
This extends the "group" tool's "--group-out" output to report the position of the 2nd read (in a separate column called "position2") of proper mate pairs. For improper pairs, only the first read's position is reported, and the 2nd read is ignored (same as now).
Implementation-wise, the main work happens in bundle_iterator, which now collects 2nd reads in a separate list parallel to "read" called "read2". For now, only all_reads=True is handled correctly -- for all_reads=False we must decide whether to always keep the best pair, or whether to keep the best (in some sense) 1st and 2nd read separately (i.e. the kept reads may stem from different pairs).

To recover pairs of reads from the (position-sorted) input file, 2nd reads are stashed away until we reach the corresponding 1st read. Conversely, if the 1st read is found first, its processing is delayed until the 2nd read is reached. To avoid prematurely returning a bundle containing a position for which 1st reads are still deferred, the number of deferred reads for each position is tracked, and returning these positions (plus any later ones) is deferred as well.
This happens only for reads that are part of proper pairs -- improper pairs (as flagged by the mapper) are treated the same as now, i.e. only the first read is processed. The code assumes that for pairs flagged as proper, the two mates map to the same chromosome. If this assumption is violated, a warning is emitted.
Since the "group" code now has both reads at its disposal, I also changed the way the "--paired" mode works regarding the insert size. So far, the code used the tlen reported in the BAM file, which may change due to slight differences in mapping, and thus prevent UMIs from being grouped. Instead of tlen, the code now uses the position of read2 as computed by
get_read_position
(i.e. the position adjusted for clipped bases) as an additional key in--paired
mode. This makes the grouping less sensitive to slight differences in mapping, and for my samples produces better results than the tlen-based approach. Similarly, read length and whether the read looks like its extending to the next exon is now taken into account for read2 is well (if the corresponding options are enabled). And finally, the UG tag is now added to read2 as well (since that was now easy to do, and seemed sensible to me).So far, I've only done minimal tests -- it basically processes one of my samples successfully, and seems to produce the same results as my old pipeline. If this is deemed to be an acceptable approach, I'll fix the remaining issues and do some more tests.
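For readers unfamiliar with the clipped-base adjustment mentioned above, the idea behind get_read_position can be sketched as follows. This is a simplified stand-alone version, not the actual umi_tools code; `cigartuples` follows pysam's (operation, length) convention.

```python
# standard BAM CIGAR operation codes for clipping
BAM_CSOFT_CLIP = 4
BAM_CHARD_CLIP = 5
CLIPPED = (BAM_CSOFT_CLIP, BAM_CHARD_CLIP)

def clip_adjusted_start(reference_start, reference_end, cigartuples, is_reverse):
    """Position of a read's first base in sequencing direction, shifted
    outward by clipped bases at that end. Sketch of the idea behind
    get_read_position, not the actual umi_tools implementation."""
    if is_reverse:
        # 5' end is the rightmost reference position; add clips past it
        clip = 0
        for op, length in reversed(cigartuples):
            if op in CLIPPED:
                clip += length
            else:
                break
        return reference_end + clip
    # forward read: subtract leading clips from the leftmost position
    clip = 0
    for op, length in cigartuples:
        if op in CLIPPED:
            clip += length
        else:
            break
    return reference_start - clip
```

Because two alignments of the same fragment that differ only in how many bases the mapper clipped resolve to the same adjusted position, keying on this value is more stable than keying on tlen.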
best,
Florian