Deal with "broken-paired" input/output, for better streaming. #733

ctb · 2015-01-18T13:09:01Z

As part of handling broken paired input (#732), a stopgap measure might be to add an option to output ONLY successfully paired reads when trimming/truncating.

A bit more explanation: right now, filter-abund pays no attention to paired-end reads. This means that if PE reads are input, the output can contain a mix of intact pairs and orphaned /1 or /2 reads. This breaks downstream scripts that expect purely paired reads, so the upstream output has to be post-processed in various ways. This is tolerable (but ugly) when not doing streaming, but horrible for streaming. How do we fix this?

I see a few options:

First, we could have scripts that post-process output and discard or sidetrack single-ended reads, making sure that only paired reads remain. This is a good stopgap measure, and is in line with all of the streaming work.

Second, we could integrate that kind of code into existing scripts, so that e.g. filter-abund provides the option of discarding (or sidetracking) all orphan reads. Getting the command line options right for this is a bit of a UX nightmare tho ;).

Third, we could alter all of the scripts to properly accept "broken paired" input. This is probably the right long-term solution, but every time I think about how to do it in code, my brain locks up.

@camillescott any thoughts on doing the third with your multithreading input work?

camillescott · 2015-01-19T19:44:50Z

I'm not entirely sure what the difference between 1 and 3 here is. In what way would the scripts "accept" broken paired input? The obvious solution to me seems to be to check the pairs as we parse them, and spit orphans out into their own file -- this is kind of the standard way most utilities that respect paired reads seem to handle it (when they do handle it, instead of just core dumping). I see no reason this couldn't be supported by the multithreading work. Am I missing something here @ctb?
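For illustration, here is a minimal sketch of the "check the pairs as we parse them, and spit orphans out into their own file" idea. It is not code from this project: `Read`, `pair_key`, `split_pairs_and_orphans` and `write_split` are hypothetical names, and it assumes interleaved input where mates are adjacent and named with `/1` and `/2` suffixes.

```python
from collections import namedtuple

# Hypothetical minimal record type; a real parser would also carry quality scores.
Read = namedtuple("Read", ["name", "sequence"])


def pair_key(name):
    """Return (base_name, mate_number); mate_number is None without a /1 or /2 suffix."""
    if name.endswith("/1"):
        return name[:-2], 1
    if name.endswith("/2"):
        return name[:-2], 2
    return name, None


def split_pairs_and_orphans(reads):
    """Yield ("pair", r1, r2) or ("orphan", r, None) from an interleaved stream."""
    pending = None  # a /1 read waiting to see whether its /2 mate comes next
    for read in reads:
        base, mate = pair_key(read.name)
        if pending is not None:
            if mate == 2 and pair_key(pending.name)[0] == base:
                yield "pair", pending, read
                pending = None
                continue
            # The next read is not the mate, so the held /1 read is an orphan.
            yield "orphan", pending, None
            pending = None
        if mate == 1:
            pending = read
        else:
            yield "orphan", read, None
    if pending is not None:
        yield "orphan", pending, None


def write_split(reads, paired_out, orphan_out):
    """Write pairs (interleaved FASTA) to paired_out and orphans to orphan_out."""
    for kind, first, second in split_pairs_and_orphans(reads):
        if kind == "pair":
            for r in (first, second):
                paired_out.write(f">{r.name}\n{r.sequence}\n")
        else:
            orphan_out.write(f">{first.name}\n{first.sequence}\n")
```

Only one read is ever held back, so this works on a stream without buffering the whole input. Other header conventions for marking mates would need a different `pair_key`.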

mr-c · 2015-01-19T19:56:18Z

Broken paired should be easy; wasn't that the whole point of interleaved paired files in the first place?

ctb · 2015-01-19T20:49:29Z

On Mon, Jan 19, 2015 at 11:44:51AM -0800, Camille Scott wrote:

> I'm not entirely sure what the difference between 1 and 3 here is. In what way would the scripts "accept" broken paired input? The obvious solution to me seems to be to check the pairs as we parse them, and spit orphans out into their own file -- this is kind of the standard way most utilities that respect paired reads seem to handle it (when they do handle it, instead of just core dumping). I see no reason this couldn't be supported by the multithreading work. Am I missing something here @ctb?

Probably just a brainfart on my part. A simple implementation and support
in your PR (vice semantic versioning) would be wunderbar.

camillescott · 2015-01-19T21:17:41Z

Can do. Also, unclear what you mean by "multithreaded input" :)

ctb · 2015-01-19T22:59:44Z

On Mon, Jan 19, 2015 at 01:17:41PM -0800, Camille Scott wrote:

> Can do. Also, unclear what you mean by "multithreaded input" :)

Multithreaded reading.

Look, if you make it work easily, then obviously I was wrong and dumb
and you can rub it in and I will happily take it :)

mr-c · 2015-01-22T21:53:21Z

All scripts accept broken-paired inputs.
All scripts treat pairs in a correct way (possibly overridable).
All scripts output broken-paired formatted sequence files.
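One way to read these three points together is a filter that takes a broken-paired stream, looks at mates jointly, and emits a broken-paired stream again. The sketch below is only an illustration under assumptions: `is_pair`, `filter_broken_paired` and the `require_both` switch are hypothetical, records are plain `(name, sequence)` tuples, and mates are assumed to be adjacent with `/1` and `/2` suffixes.

```python
def is_pair(first, second):
    """True if two (name, sequence) records are /1 and /2 mates of one fragment."""
    a, b = first[0], second[0]
    return a.endswith("/1") and b.endswith("/2") and a[:-2] == b[:-2]


def filter_broken_paired(records, keep, require_both=False):
    """Filter an interleaved, possibly broken-paired stream of records.

    keep is a predicate on a single record. With require_both=False each read is
    judged on its own, so a pair may come out as a single orphan. With
    require_both=True a pair is emitted only if both mates pass; reads that
    arrive as orphans are still judged individually (an assumption of this sketch).
    """
    it = iter(records)
    held = next(it, None)
    while held is not None:
        nxt = next(it, None)
        if nxt is not None and is_pair(held, nxt):
            keep_first, keep_second = keep(held), keep(nxt)
            if keep_first and keep_second:
                yield held
                yield nxt
            elif not require_both:
                if keep_first:
                    yield held
                if keep_second:
                    yield nxt
            held = next(it, None)
        else:
            # held has no mate directly after it, so treat it as an orphan.
            if keep(held):
                yield held
            held = nxt
```

Because output order follows input order, surviving mates stay adjacent, so the result is again valid broken-paired input for the next script in a pipeline. The `require_both` default is the kind of "possibly overridable" behavior the list mentions.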

ctb · 2015-07-18T20:06:38Z

I believe this is now done, after #759 and the follow-up work. Will double check before 2.0.

ctb · 2015-07-19T15:29:36Z

Relevant to the following scripts, which should all handle broken-paired input properly:

ctb · 2015-07-19T15:30:51Z

So it all looks good, with the caveat that split-paired-reads is still being modified in #847 to deal with orphaned reads.

ctb added the discussion-needed label Jan 18, 2015

ctb mentioned this issue Feb 7, 2015

"broken" paired-read support for a few scripts #759

Merged

ctb removed the discussion-needed label Jul 18, 2015

ctb added this to the 2.0 milestone Jul 18, 2015

ctb closed this as completed Jul 19, 2015

ctb mentioned this issue Jul 19, 2015

Document broken-paired-reads behavior for 2.0+ #1181

Open
