Deal with "broken-paired" input/output, for better streaming. #733

Closed
ctb opened this issue Jan 18, 2015 · 9 comments

@ctb
Member

ctb commented Jan 18, 2015

As part of handling broken-paired input (#732), a stopgap measure might be to add an option to output ONLY successfully paired reads when trimming/truncating.

A bit more explanation: right now, filter-abund pays no attention to paired-end reads. This means that if PE reads are input, the output can contain a mix of intact pairs and orphaned /1 or /2 reads. This will break downstream scripts that expect pure paired reads, necessitating various post-processing of upstream script output. This is tolerable (but ugly) when not doing streaming, but horrible for streaming. How do we fix this?

I see a few options:

First, we could have scripts that post-process output and discard or sidetrack single-ended reads, and make sure that only paired reads remain. This is a good stopgap measure, and is in line with all of the streaming work.

Second, we could integrate that kind of code into existing scripts, so that e.g. filter-abund provides the option of discarding (or sidetracking) all orphan reads. Getting the command line options right for this is a bit of a UX nightmare tho ;).

Third, we could alter all of the scripts to properly accept "broken paired" input. This is probably the right long-term solution, but every time I think about how to do it in code, my brain locks up.

@camillescott any thoughts on doing the third with your multithreading input work?

@camillescott
Member

I'm not entirely sure what the difference between 1 and 3 here is. In what way would the scripts "accept" broken paired input? The obvious solution to me seems to be to check the pairs as we parse them, and spit orphans out into their own file -- this is kind of the standard way most utilities that respect paired reads seem to handle it (when they do handle it, instead of just core dumping). I see no reason this couldn't be supported by the multithreading work. Am I missing something here @ctb?
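For illustration, the "check the pairs as we parse them" idea can be sketched in a few lines of Python. This is a minimal sketch and not the project's implementation: the `Read` type, `is_pair`, and `split_pairs` are hypothetical names, and it assumes mates are adjacent in the input and are named with the old-style `/1` and `/2` suffixes only.

```python
from typing import Iterable, Iterator, NamedTuple, Optional, Tuple


class Read(NamedTuple):
    """Hypothetical minimal record: just a name and a sequence."""
    name: str
    sequence: str


def is_pair(first: Read, second: Read) -> bool:
    """True if `first` is a /1 read and `second` is its /2 mate.

    Assumption: only the 'name/1' and 'name/2' convention is handled.
    """
    return (
        first.name.endswith("/1")
        and second.name.endswith("/2")
        and first.name[:-2] == second.name[:-2]
    )


def split_pairs(reads: Iterable[Read]) -> Iterator[Tuple[Read, Optional[Read]]]:
    """Yield (read1, read2) for intact pairs and (read, None) for orphans.

    Only one read is buffered at a time, so this works on a stream.
    """
    previous: Optional[Read] = None
    for read in reads:
        if previous is None:
            previous = read
        elif is_pair(previous, read):
            yield previous, read
            previous = None
        else:
            # `previous` has no mate next to it, so it is an orphan.
            yield previous, None
            previous = read
    if previous is not None:
        yield previous, None
```

A caller could then write tuples whose second element is `None` to an orphans file and everything else to the paired output, without ever holding more than one read in memory.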

@mr-c
Contributor

mr-c commented Jan 19, 2015

Broken paired should be easy; wasn't that the whole point of interleaved paired files in the first place?

@ctb
Member Author

ctb commented Jan 19, 2015

On Mon, Jan 19, 2015 at 11:44:51AM -0800, Camille Scott wrote:

> I'm not entirely sure what the difference between 1 and 3 here is. In what way would the scripts "accept" broken paired input? The obvious solution to me seems to be to check the pairs as we parse them, and spit orphans out into their own file -- this is kind of the standard way most utilities that respect paired reads seem to handle it (when they do handle it, instead of just core dumping). I see no reason this couldn't be supported by the multithreading work. Am I missing something here @ctb?

Probably just a brainfart on my part. A simple implementation and support in your PR (vice semantic versioning) would be wunderbar.

@camillescott
Member

Can do. Also, unclear what you mean by "multithreaded input" :)

@ctb
Member Author

ctb commented Jan 19, 2015

On Mon, Jan 19, 2015 at 01:17:41PM -0800, Camille Scott wrote:

> Can do. Also, unclear what you mean by "multithreaded input" :)

Multithreaded reading.

Look, if you make it work easily, then obviously I was wrong and dumb and you can rub it in and I will happily take it :)

@mr-c
Contributor

mr-c commented Jan 22, 2015

  • all scripts accept broken-paired inputs
  • all scripts treat pairs in a correct way (possibly overridable)
  • all scripts output broken-paired formatted sequence files.
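As a rough illustration of the three points above, here is a sketch of a pair-aware filter. It is an assumption-laden example rather than any script's actual code: `filter_units`, `keep`, and the `pairs_only` switch are hypothetical names, and the input is assumed to be already grouped into `(read1, read2)` tuples, with `read2` set to `None` for an orphan.

```python
from typing import Callable, Iterable, Iterator, Optional, Tuple

Record = Tuple[str, str]                  # (name, sequence); hypothetical
Unit = Tuple[Record, Optional[Record]]    # a pair, or (orphan, None)


def filter_units(
    units: Iterable[Unit],
    keep: Callable[[Record], bool],
    pairs_only: bool = False,
) -> Iterator[Record]:
    """Apply `keep` to every read and emit a flat broken-paired stream.

    Surviving mates stay adjacent. If only one mate survives, it is
    emitted as an orphan unless `pairs_only` is set, in which case only
    intact pairs are written.
    """
    for first, second in units:
        if second is None:
            # Orphan on input: pass it through only if orphans are allowed.
            if not pairs_only and keep(first):
                yield first
            continue
        keep_first, keep_second = keep(first), keep(second)
        if keep_first and keep_second:
            yield first
            yield second
        elif not pairs_only:
            if keep_first:
                yield first
            elif keep_second:
                yield second
```

With `pairs_only=False` the output is itself broken-paired, so it can be fed straight into another script with the same contract. With `pairs_only=True` it matches the stopgap from the opening comment, where only successfully paired reads are output.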

@ctb
Member Author

ctb commented Jul 18, 2015

I believe this is now done, after #759 and the changes that followed it. Will double check before 2.0.

ctb added this to the 2.0 milestone Jul 18, 2015

@ctb
Member Author

ctb commented Jul 19, 2015

Relevant to the following scripts, which should all handle broken-paired input properly:

  • extract-paired-reads.py
  • interleave-reads.py
  • normalize-by-median.py
  • sample-reads-randomly.py
  • split-paired-reads.py
  • trim-low-abund.py

@ctb
Member Author

ctb commented Jul 19, 2015

So it all looks good, with the caveat that split-paired-reads is still being modified in #847 to deal with orphaned reads.
