Deal with "broken-paired" input/output, for better streaming. #733
Comments
I'm not entirely sure what the difference between 1 and 3 here is. In what way would the scripts "accept" broken paired input? The obvious solution to me seems to be to check the pairs as we parse them, and spit orphans out into their own file -- this is kind of the standard way most utilities that respect paired reads seem to handle it (when they do handle it, instead of just core dumping). I see no reason this couldn't be supported by the multithreading work. Am I missing something here @ctb?
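For illustration, the "check the pairs as we parse them" idea can be sketched roughly as below. This is a minimal sketch, not khmer's actual reader: the names `strip_mate_suffix` and `broken_paired_reader` are hypothetical, records are assumed to be plain `(name, sequence)` tuples, and only the `/1` and `/2` naming convention is recognized.

```python
def strip_mate_suffix(name):
    """Return (base_name, mate), where mate is '1', '2', or None.

    Assumption: mates are marked only by a trailing /1 or /2 on the
    first whitespace-delimited token of the read name.
    """
    token = name.split()[0]
    if token.endswith('/1') or token.endswith('/2'):
        return token[:-2], token[-1]
    return token, None


def broken_paired_reader(records):
    """Yield (read1, read2) for proper pairs and (read, None) for orphans.

    `records` is any iterable of (name, sequence) tuples in interleaved
    order, possibly with missing mates ("broken paired").
    """
    held = None  # a /1 read still waiting for its /2 mate
    for record in records:
        base, mate = strip_mate_suffix(record[0])
        if held is not None:
            held_base, _ = strip_mate_suffix(held[0])
            if mate == '2' and base == held_base:
                yield held, record  # proper pair
                held = None
                continue
            yield held, None  # the held /1 read turned out to be an orphan
            held = None
        if mate == '1':
            held = record  # wait and see whether the next record is its mate
        else:
            yield record, None  # a /2 or unmarked read with no preceding /1
    if held is not None:
        yield held, None  # trailing /1 with no mate
```

Because only one record is ever held back, this works on a stream; a caller could write the `(read, None)` results to a separate orphans file and everything else to the paired output.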
Broken paired should be easy; wasn't that the whole point of interleaved paired files in the first place?
On Mon, Jan 19, 2015 at 11:44:51AM -0800, Camille Scott wrote:
Probably just a brainfart on my part. A simple implementation and support
Can do. Also, unclear what you mean by "multithreaded input" :)
On Mon, Jan 19, 2015 at 01:17:41PM -0800, Camille Scott wrote:
Multithreaded reading. Look, if you make it work easily, then obviously I was wrong and dumb
I believe this is now done, after #759 and ensuing. Will double check before 2.0.
Relevant to the following scripts, which should all handle broken-paired input properly:
So it all looks good, with the caveat that split-paired-reads is still being modified in #847 to deal with orphaned reads.
As part of #732 handling broken paired input, a stopgap measure might be to add an option to ONLY output successfully paired reads when trimming/truncating.
A bit more explanation: right now, filter-abund pays no attention to paired-end reads. This means that if PE reads are input, the output can be a mix of intact pairs and orphaned /1 or /2 reads whose mates were discarded. This will break downstream scripts that expect pure paired reads, necessitating various post-processing of upstream script output. This is tolerable (but ugly) when not doing streaming, but horrible for streaming. How do we fix this?
I see a few options:
first, we could have scripts that post-process output and discard or sidetrack single-ended reads, and make sure that only paired reads remain. This is a good stopgap measure, and is in line with all of the streaming work.
second, we could integrate that kind of code into existing scripts, so that e.g. filter-abund provides the option of discarding (or sidetracking) all orphan reads. Getting the command line options right for this is a bit of a UX nightmare tho ;).
third, we could alter all of the scripts to properly accept "broken paired" input. This is probably the right long-term solution, but every time I think about how to do it in code, my brain locks up.
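To make the second option concrete, here is a rough sketch of a pair-aware filter with a "pairs only" switch. It is not filter-abund's code: `filter_interleaved`, the `keep` predicate, and the `pairs_only` flag are all hypothetical names, and the input is assumed to already be grouped into `(read1, read2)` tuples of `(name, sequence)` records.

```python
def filter_interleaved(pairs, keep, pairs_only=False):
    """Apply a per-read predicate `keep` to (read1, read2) pairs.

    Yields (read1, read2) when both mates pass. When exactly one mate
    passes, yields (survivor, None), unless `pairs_only` is set, in
    which case the orphan is dropped. Pairs where neither mate passes
    are always dropped.
    """
    for read1, read2 in pairs:
        ok1, ok2 = keep(read1), keep(read2)
        if ok1 and ok2:
            yield read1, read2
        elif not pairs_only:
            if ok1:
                yield read1, None  # /2 mate was filtered out
            elif ok2:
                yield read2, None  # /1 mate was filtered out
```

With `pairs_only=True` the output contains only intact pairs, which is what downstream scripts expecting pure paired input need; with the default, orphans are tagged with `None` so a caller can sidetrack them to a separate file instead of discarding them.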
@camillescott any thoughts on doing the third with your multithreading input work?