
khmer should be graceful with respect to errors while processing multiple files #87

Closed
camillescott opened this issue Jul 26, 2013 · 11 comments


@camillescott
Member

If we are processing multiple files and an error occurs then we should show that to the user and move on to the next file.

There should be a flag to disable this behavior, in which case we should tell the user there was an error in a specific file and quit gracefully.

"It's now handled for normalize-by-median -- but what about the other scripts? Not sure what load-graph and load-into-counting should do. filter-abund should be tolerant. abundance-dist... should probably fail. Systematic examination needed."

@ctb
Member

ctb commented Jul 27, 2013

On Fri, Jul 26, 2013 at 11:33:42AM -0700, cswelcher wrote:

> Have screed more gracefully handle format errors.
>
> For example, in fastq_iter, if a mangled read is encountered, an exception is raised and the calling program dies. It should make some attempt to resume parsing. Otherwise, when you're halfway through, say, a diginorm job, and there is one record in the fastq like
>
>     @HWI-blah blah
>     ATGTFGTGTAT
>     +
>     kdidfjj
>
> while the entire rest of the file is fine, the whole thing explodes.
>
> Yes, this happened -_-
>
> This could perhaps be handled in the calling scripts instead?

I'm skeptical that this is a good idea; this sort of situation should only happen due to file corruption, right? I would prefer to make diginorm itself (the handling script) robust to occasionally badly formatted files. I don't think it's sensible to try to recover in the middle of a given FASTQ file.

But I agree that diginorm shouldn't die.

--t

C. Titus Brown, ctb@msu.edu
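A file-level guard in the calling script, matching the position above, might look like the following sketch. `screed.open` is screed's real entry point; catching a broad `Exception` is an assumption standing in for whatever the parser actually raises on a mangled record:

```python
import sys
import screed


def normalize_file(filename):
    # Sketch of diginorm's per-file loop; the real logic would go
    # where the pass is.
    for record in screed.open(filename):
        pass


for filename in sys.argv[1:]:
    try:
        normalize_file(filename)
    except Exception as err:  # assumption: a mangled record surfaces here
        print('giving up on %s (%s); moving to the next file'
              % (filename, err), file=sys.stderr)
```

Note that no attempt is made to resume parsing inside a damaged file; the run simply survives it and continues with the remaining files.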

@fishjord
Contributor

Agreed: screed is a parsing library; it has no idea what to do when it encounters malformed data, so the only thing it can do is throw an exception.

While it is annoying for a long process to fail because of a bad sequence, I still think that should be the default behavior. A bad sequence could indicate any of several major errors, and still producing an output file in those situations is a dubious thing to do by default.

@camillescott
Member Author

I see your point. I think I'll write some sort of robust class that can sit on top of a screed parser and catalogue any bad reads. That way the user can choose whether the errors indicate a truly corrupted file, or just a random couple of mangled reads that can probably be dismissed.
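One shape such a wrapper could take, as a hypothetical standalone sketch (not screed's actual API): a reader that skips malformed records and catalogues them, so the user can decide afterwards.

```python
import re


class TolerantFastqReader:
    """Hypothetical sketch: iterate over FASTQ records, cataloguing
    malformed ones instead of aborting the whole run."""

    VALID_SEQ = re.compile(r'^[ACGTNacgtn]+$')

    def __init__(self, fileobj):
        self.fileobj = fileobj
        self.bad_records = []  # list of (record_number, reason)

    def __iter__(self):
        record_number = 0
        while True:
            lines = [self.fileobj.readline() for _ in range(4)]
            if not lines[0]:
                break  # end of file
            record_number += 1
            name, seq, plus, qual = (ln.rstrip('\n') for ln in lines)
            # NB: a record with the wrong number of lines desynchronizes
            # this naive 4-line framing -- which is the earlier point about
            # mid-file recovery not being sensible.
            if not name.startswith('@'):
                self.bad_records.append((record_number, 'bad header'))
            elif not self.VALID_SEQ.match(seq):
                self.bad_records.append((record_number, 'bad sequence'))
            elif not plus.startswith('+'):
                self.bad_records.append((record_number, 'missing + separator'))
            elif len(qual) != len(seq):
                self.bad_records.append((record_number, 'seq/qual length mismatch'))
            else:
                yield {'name': name[1:], 'sequence': seq, 'quality': qual}
```

Run over the mangled example above, this would catalogue the read as a 'bad sequence' (the stray F) and keep going; after the pass, the caller can inspect `bad_records` and decide whether a couple of mangled reads are dismissable or the whole file is suspect.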

@ctb
Member

ctb commented Jul 28, 2013


I'm still wondering why this is worthwhile :). The only source of FASTQ data should (ultimately) be a sequencing machine or some programs, and they should output correct data. If we get bad data, it's either a bug internally (which we should fix) or due to file system corruption (and we should go find a true copy of the file).

--t

C. Titus Brown, ctb@msu.edu

@mr-c
Contributor

mr-c commented Jul 28, 2013


Here is a user story that may be useful:

CDubb, a bioinformatician, is processing 70 files of FASTQ data through both Trimmomatic and diginorm. For reasons unknown, Trimmomatic mangles a single read in file 59, thus requiring the entire diginorm process to be re-run after the problem is fixed.

CDubb doesn't really care about that one read, as he is under a research deadline and would really like to continue with the analysis.

(all names have been changed to protect the innocent)


@ctb
Member

ctb commented Jul 28, 2013


In an earlier comment, I agreed that we should be graceful with respect to multiple files, which I think fits this scenario just fine. No?

--t

@mr-c
Contributor

mr-c commented Jul 28, 2013

Agreed

@mr-c
Contributor

mr-c commented Aug 1, 2013

Is this handled by @cswelcher's recent push?

@camillescott
Member Author

Yup.

@ctb
Member

ctb commented Aug 1, 2013

It's now handled for normalize-by-median -- but what about the other scripts? Not sure what load-graph and load-into-counting should do. filter-abund should be tolerant. abundance-dist... should probably fail. Systematic examination needed.

@mr-c mr-c added this to the 1.0 release milestone Feb 28, 2014
@mr-c mr-c modified the milestones: 1.1+ Release, 1.0 release Apr 2, 2014
@mr-c mr-c modified the milestones: 1.1.1+ Release, 1.1 + 2 Aug 1, 2014
@mr-c mr-c added the Python label Sep 30, 2014
@ctb
Member

ctb commented Jun 12, 2015

I've decided this is, in general, a horrible idea. See #1057 (comment) for full rationale.

@ctb ctb closed this as completed Jun 12, 2015