Also report position of 2nd read in "group --group-out" table #180

fgp · 2017-09-08T09:42:01Z

This extends the "group" tool's "--group-out" output to report the position of the 2nd read (in a separate column called "position2") of proper mate pairs. For improper pairs, only the first read's position is reported, and the 2nd read is ignored (same as now).

Implementation-wise, the main works happens in bundle_iterator, which now collects 2nd reads in a separate list parallel to "read" called "read2". For now, only all_reads=True is handled correctly -- for all_reads=False we must decide whether to always keep the best pair, or whether to keep the best (in some sense) 1st and 2nd read separately (i.e. the kept reads may stem from different pairs). To recover pairs of reads from the (position-sorted) input file, 2nd reads are stashed away until we reach the corresponding 1st read. Conversely, if the 1st is found first, it's processing is delayed until the 2nd read is reached. To avoid prematurely returning a bundle containing a position for which 1st reads are still deferred, the number of deferred reads for each position is tracked and the returning these position (plus any later ones) is deferred as well.

This happens only for reads that are part of proper pairs -- improper pairs (as flagged by the mapper) are treated same as now, i.e. only the first read is processed. The code assumes that for pairs flagged as proper, the two mates map to the same chromosome. If this assumption is violated, a warning is emitted.

Since the "group" code now has both reads at its disposal, I also changed the way the "--paired" mode works regarding the insert size. So far, the code used the tlen reported in the BAM file, which may change due to slight differences in mapping, and thus prevent UMIs from being grouped. Instead of tlen, the code now uses the position of read2 as computed by get_read_position (i.e. the position adjusted for clipped bases) as an additional key in --paired mode. This makes the grouping less sensitive to slight differences in mapping, and for my samples produces better results than the tlen-based approach. Similarly, read length and whether the read looks like its extending to the next exon is now taken into account for read2 is well (if the corresponding options are enabled). And finally, the UG tag is now added to read2 as well (since that was now easy to do, and seemed sensible to me).

So far, I've only done minimal tests -- it basically processes one of my samples successfully, and seems to produce the same results as my old pipeline. If this is deemed to be an acceptable approach, I'll fix the remaining issues and do some more tests.

best,
Florian

TomSmithCGAT · 2017-09-11T11:47:38Z

Hi @fgp - I've had a look through your pull request and this seems to be a sensible approach. However, I think this will run into issues when the read pairs have multiple alignments since one could see the same read1 before observing the read2, or vis versa. We can of course check for this and store the multiple read1s/read2s but I think ultimately we would want to introduce a flag to allow the user to turn off the behaviour you introduce above when they have multi-mapped reads - reverting to outputting all reads2 as they are observed. I had a few comments for review but before I add them, I wonder if it might be worth considering another approach as I'm slightly concerned that the code for get_bundles.call() is getting a bit unwieldy, and will only get more so if we make this behaviour optional as I suggest.

Perhaps a cleaner solution in terms of maintaining the code might be to stick with the current approach of only tagging the read1s but introduce a --tag-read2 option (with documentation to say this isn't supported for reads with multiple alignments). This option would trigger re-parsing the BAM and adding information to the read2s. The position of read 2 would also be identified and added to the group flatfile. This would make the process of tagging the read2s easier to understand since it would be separated from the process of identifying read groups. We could of course still implement all the same checks for read pair validity which you have included. There would be a slight increase in run time but this would only be very minor.

If you'd like I can make a branch with this approach implemented and you could check it out with your input files.

@IanSudbery - Any thoughts on the relative merits of handling read2s prior and post UMI grouping?

fgp · 2017-09-11T12:26:33Z

I do understand your concern about the complexity of get_bundles.call(), but handling read2 in a second pass after determining UMI groups has the (severe, IMHO) disadvantage that the information from read2 can no longer be used to influence the grouping. As it stands, the patch uses the second read's mapping position as a group key, same as the first read's position. At least for my samples, this yields a much better grouping than the unpatches version in --paired mode which uses only the first read's position plus the fragment length as reported in the BAM file (tlen).

The reason seems to be that the mapping position as returned by get_read_position (i.e. adjusted for clipped bases) is quite resilient to slight differences in mapping between reads of the same UMI (which I assume is why that adjustment is done in the first place). tlen, on the other hand, can easily change by a few bases if read2 is mapped slightly differently (i.e. fewer or more bases are clipped).

I do agree that get_bundles.call() is unwieldy, but I believe that there are better ways to fix that. I could try to move the pair-building code from bundle_iterator to a separate iterator (say, pair_iterator). bundle_iterator would then receive pairs of reads, and the only additional code it would require is the actual handling of read2 (which isn't much).

TomSmithCGAT · 2017-09-11T13:00:28Z

Hmmm.. yeah I hadn't really considered this. So there is a major advantage to considering the read2s in a single pass. OK, how about we replace lines 1299-1391 of get_bundle.__call__() with a call to self.check_unmapped_paired() with the code as per the master branch. Then we can create a new child class from bundle_iterator called, e.g pair_iterator with a check_unmapped_paired method which includes the functionality you've added here. In group we can decide which iterator to use (suggest default to bundle_iterator with option --parse-read2 to switch to pair_iterator).

What do you think? If you're happy with this, do you want to make the changes and I'll review the pull request fully.

fgp · 2017-09-12T17:15:41Z

Ok, I moved the pair-building code out of get_bundles. There is now a separate functor called get_readpairs that does all the legwork of forming pairs and complaining about inconsistent BAM files.

I've split the changes up onto smaller units now.

The first two commits on the group_endpos branch now fix what looks like oversights in the existing code to me -- one is a redundant check for read.is_mapped, the other seems to count unmapped reads twice in some occasions.

The next four introduce get_readpairs and its trivial counter-part get_singletons for single-end mode, extend get_bundles to use them, and extend the group tool to output position2.

Finally, get_bundles is taught to use the information from read2 to improve the grouping and to drop improperly paired reads.

I still need to add a test case for --paired mode and probably update the documentation, but I hope the code is close to being in a merge-able state now.

TomSmithCGAT · 2017-09-12T17:52:33Z

Great. I'll review this tomorrow. I think @IanSudbery would also like to look this over too.

fgp · 2017-09-12T21:57:53Z

I've now added two tests, and updated the "dedup" tool (and ReadDeduplicator) to use the pair support of get_bundles instead of TwoPassPairWriter.

Currently, the dedup_single_stats test seems to fail, I'll look into that...

fgp · 2017-09-13T06:52:19Z

I've fixed the test failures and style violations. On my machine (using python 2.7) test_umi_tools.py and test_style.py now both succeed. We'll see if the CI build succeeds now as well.

From my end, this is ready for review. Documentation still needs to be updated probably, not sure where to explain the new behaviour though.

IanSudbery

Apart from the worries I have highlighted in the review, I am also worried about the memory and run-time implications of this. We used to have an implementation that used a queue to output read pairs, but it was replaced with the post process resweep of the chr because the former was causing an explosion in memory usage and run times in the days rather than the minutes range.

I'm going to need to some substantial benchmarking before we'd be happy to merge.

I do have an alternative idea for how we might get around the problem of using the proper position of the read2 rather than the tlen, but I need to think about it a little....

IanSudbery · 2017-09-13T12:35:32Z

umi_tools/umi_methods.py

+            if get_readpairs.NoMate in queue_entry:
+                # Incomplete entry. Turn into singleton entry
+                read = next(r for r in queue_entry if r is not get_readpairs.NoMate)
+                U.warn("inconsistent BAM file: only one of the two mates of proper pair %s found on chromosome %s, treating as unpaired" %


I think that this is probably going to be very common for people that are mapping to the transcriptome, where each transcript is a different contig.

Ok, but in this case does --paired mode make any sense at all?

IanSudbery · 2017-09-13T12:36:38Z

umi_tools/umi_methods.py

+                # Read is part of an already queued (incomplete) read pair. Complete it.
+                queue_entry = self.incomplete_queue_entries.pop(read.qname)
+                if queue_entry[read_i] is not get_readpairs.NoMate:
+                    U.warn("inconsistent BAM file: both mates %s flagged to be read%d" %


Is is going to be a problem with multi mappers. Many people allow multimappers, and they are unavoidable when mapping to the transcriptome.

True, but again I don't really see how multi-mapping and paired reads should work together at all. The fundamental problem seems to be that there are n*n possible pairing if both read1 and read2 are mapped to n locations. And the BAM format at best may tell us which the "primary" pair is, but not how the secondary mapping should be paired. Maybe I'm missing the point, though -- I haven't really worked with multi-mapped data so far...

On second thought, I could do the same that TwoPassPairWriter does, and use reference name and mapping position as additional keys to identify mate pairs. Would that alleviate these concerns?

IanSudbery · 2017-09-13T12:39:18Z

umi_tools/umi_methods.py

+                self.incomplete_queue_entries[read.qname] = queue_entry
+
+            # Output queued reads and readpairs up to the first incomplete pair
+            while self.queue and sum(1 for r in self.queue[0] if r is get_readpairs.NoMate) == 0:


Yeah, this is definately going to become a mess if you have say to alignments for read1 followed by an alignment for read2. The two alignments of read one will be treated as a read one and a read two, and output as a pair, then read 2 will be encountered and a new pair will be created.

Hm, yeah, multi-mapping breaks this. Again, I'm inclined to simply say "multi-mapping" and --paired are contradictory, but again I might be missing something...

IanSudbery · 2017-09-13T12:54:16Z

umi_tools/umi_methods.py


-        for read in inreads:
+        for read, read2 in generator(inreads):


Since we are waiting for pairs to be output, can we any longer assume that reads will be output in the same order that they are in in the file. Specifically, if a mate is not found, it will not be output until the end? But in order to collect all reads that map to a location, we assume that they are sorted by pos.

I thought I had done that correctly, but I realize now there is be a bug in the current version. The idea was to keep the order "intact enough" for get_bundle's purposes. The original version (where collecting pairs was handled within get_bundles) was fine, but it seems I have introduced a bug when factoring get_readpairs out.

What get_bundle needs is basically that read pairs are ordered according to read1. Currently, that is the case if read1 occurs first in the file. But if read2 occurs first, the pair will be output at the correct position for read2, not read1. (In my old pipeline that was indeed the correct thing to do, because I used the leftmost mapped position of a pair for deduplication.)

This is easily fixed AFAICS by enqueueing a pair not immediately, but only after read1 (even if that's the second read of the pair in the BAM file) has been observed.

This is now fixed, see commits below.

fgp · 2017-09-13T16:01:37Z

Regarding the performance -- I can certainly do some benchmarking.

Also, my existing pipeline uses the same kind of queuing to process pairs together, and I haven't run into any problems with memory consumption or runtime. In essence, the memory consumption is bound by the BAM file's maximum coverage, if you call a base "covered" that lies within a sequenced fragment (but wasn't necessarily sequenced itself). Even if that number reaches a million (which amounts to extremely deep sequencing), and each read takes 1k to store, a gigabyte of memory would still suffice. Runtime-wise, the queuing should be O(1) in the number of reads.

fgp · 2017-10-02T10:48:44Z

I've changed get_readpairs to report pairs in ascending mapping position of read1 -- this should be sufficient to not break the clustering algorithm

get_readpairs now also uses query name plus the mapping position of read1 as the "pair identification key" -- this should work better for multi-mappings. I can't really test this, though -- I don't have any multi-mapped paired-end data to test with

Apart from the general worries about performance, this should address the raised concerns I think. I'm currently doing some benchmarking (with pretty much worst-case files - a short bacterial genome, but sequenced extremely deeply, so the number of "pending" pairs is probably in the many thousands at any time) -- I'll post the results soonish

ijhoskins · 2020-02-26T19:10:46Z

Would love to test the new paired-end position/template length adjustment for clipped bases. Any estimate on when this PR will merge?

…pped is set.

…ad1, read2)

…red is not specified

In --paired mode, get_bundles now uses get_readpairs to process the first and second read of a pair together. The 2nd reads are now stored in the bundle as well, in a list parallel to "read" called "read2". This currently works only for all_reads=True. So far, the second read is only stored. The next commits will introduce changes that use the information from the second read to improve the grouping.

…d mate The output contains a new column "position2" which contains the position of the 2nd read, computed in the same way as the position of read1 (i.e, the position is the position of the 1st base in sequencing (not reference!) direction, and is adjusted for possible clipped bases. This uses the previously introduced support for collecting read pairs in get_bundles.

…assPairWriter This includes changes to ReadDeduplicator, which now uses the 2nd reads reported by get_bundles, and now returns a list of read pairs instead of a list of plain reads.

…th reads as a grouping key

…ed mode

… read1 Previously, they were reported in ascending mapping position of the left (in reference direction) read, which for get_bundle's purposes is incorrect. The test outputs are updated accordingly

…red_contig

…r's key in get_readpairs This should make get_readpairs work also for multi-mapped data, as long as mates aren't re-used in multiple (multi-mapped) pairs.

fgp force-pushed the group_endpos branch from 4cce2bb to 5f6d510 Compare September 12, 2017 17:03

fgp force-pushed the group_endpos branch 2 times, most recently from 359fed0 to 85ec391 Compare September 12, 2017 21:53

fgp force-pushed the group_endpos branch 2 times, most recently from 88d9c9a to 56d8c07 Compare September 13, 2017 06:47

fgp force-pushed the group_endpos branch from 56d8c07 to e5fce53 Compare September 13, 2017 08:07

IanSudbery reviewed Sep 13, 2017

View reviewed changes

fgp force-pushed the group_endpos branch from e5fce53 to fb56cff Compare September 13, 2017 14:31

fgp force-pushed the group_endpos branch from b1d52b5 to 91d73be Compare September 28, 2017 15:21

TomSmithCGAT mentioned this pull request Oct 13, 2017

Output 'cell' and 'end' columns with group --group-out #178

Closed

fgp force-pushed the group_endpos branch from 91d73be to 32fa029 Compare January 26, 2018 15:08

This was referenced Feb 26, 2020

Include read/template length in group stats output #400

Closed

Paired end data without length #357

Closed

fgp added 7 commits February 29, 2020 15:24

Don't increment "Input Reads" twice for unmapped reads if return_unma…

3097211

…pped is set.

Removed redundant check for read.is_mapped

a86c86e

Added get_readpairs, which returns properly paired reads as pairs (re…

1bc37ae

…ad1, read2)

Added (trivial) get_singletons, which replaces get_readpairs if --pai…

e80efdc

…red is not specified

Upated "dedup" to use the pair support in get_bundles instead of TwoP…

6af08d2

…assPairWriter This includes changes to ReadDeduplicator, which now uses the 2nd reads reported by get_bundles, and now returns a list of read pairs instead of a list of plain reads.

fgp added 8 commits February 29, 2020 15:43

In --paired mode, apply the mapping quality filter to 2nd reads as well

dbec68e

In --paired mode, use the mapping position, length and splicing of bo…

45ef109

…th reads as a grouping key

Added test case count_paired_wide_contig with tests "count" in --pair…

585e7db

…ed mode

Added test case dedup_paired_contig which tests "dedup" in --paired mode

aef493d

Added test case group_paired_contig which tests "group" in --paired mode

2503f36

Fixed get_readpairs to reports pairs in ascending mapping position of…

3772ede

… read1 Previously, they were reported in ascending mapping position of the left (in reference direction) read, which for get_bundle's purposes is incorrect. The test outputs are updated accordingly

Added python3 versions of the tests dedup_paired_contig and group_pai…

153aafa

…red_contig

Use mapping position of read1 in addition to query name as a mate pai…

2f9380f

…r's key in get_readpairs This should make get_readpairs work also for multi-mapped data, as long as mates aren't re-used in multiple (multi-mapped) pairs.

fgp force-pushed the group_endpos branch from 622e48f to 2f9380f Compare February 29, 2020 15:05

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Also report position of 2nd read in "group --group-out" table #180

Also report position of 2nd read in "group --group-out" table #180

fgp commented Sep 8, 2017 •

edited

Loading

TomSmithCGAT commented Sep 11, 2017 •

edited

Loading

fgp commented Sep 11, 2017

TomSmithCGAT commented Sep 11, 2017 •

edited

Loading

fgp commented Sep 12, 2017

TomSmithCGAT commented Sep 12, 2017

fgp commented Sep 12, 2017

fgp commented Sep 13, 2017

IanSudbery left a comment •

edited

Loading

IanSudbery Sep 13, 2017

fgp Sep 13, 2017

IanSudbery Sep 13, 2017

fgp Sep 13, 2017

fgp Sep 13, 2017

IanSudbery Sep 13, 2017

fgp Sep 13, 2017

IanSudbery Sep 13, 2017

fgp Sep 13, 2017 •

edited

Loading

fgp Sep 13, 2017

fgp commented Sep 13, 2017

fgp commented Oct 2, 2017

ijhoskins commented Feb 26, 2020

Also report position of 2nd read in "group --group-out" table #180

Are you sure you want to change the base?

Also report position of 2nd read in "group --group-out" table #180

Conversation

fgp commented Sep 8, 2017 • edited Loading

TomSmithCGAT commented Sep 11, 2017 • edited Loading

fgp commented Sep 11, 2017

TomSmithCGAT commented Sep 11, 2017 • edited Loading

fgp commented Sep 12, 2017

TomSmithCGAT commented Sep 12, 2017

fgp commented Sep 12, 2017

fgp commented Sep 13, 2017

IanSudbery left a comment • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

fgp Sep 13, 2017 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

fgp commented Sep 13, 2017

fgp commented Oct 2, 2017

ijhoskins commented Feb 26, 2020

fgp commented Sep 8, 2017 •

edited

Loading

TomSmithCGAT commented Sep 11, 2017 •

edited

Loading

TomSmithCGAT commented Sep 11, 2017 •

edited

Loading

IanSudbery left a comment •

edited

Loading

fgp Sep 13, 2017 •

edited

Loading