
khmer should be graceful with respect to errors while processing multiple files #87

Closed
camillescott opened this issue Jul 26, 2013 · 11 comments


@camillescott
Member

If we are processing multiple files and an error occurs then we should show that to the user and move on to the next file.

There should be a flag to disable this behavior, in which case we should tell the user there was an error in a specific file and quit gracefully.

"It's now handled for normalize-by-median -- but what about the other scripts? Not sure what load-graph and load-into-counting should do. filter-abund should be tolerant. abundance-dist... should probably fail. Systematic examination needed."

@ctb
Member

ctb commented Jul 27, 2013

On Fri, Jul 26, 2013 at 11:33:42AM -0700, cswelcher wrote:

> Have screed more gracefully handle format errors.
>
> For example, in fastq_iter, if a mangled read is encountered, an exception is raised and the calling program dies. It should make some attempt to resume parsing. Otherwise, when you're halfway through, say, a diginorm job, and there is one record in the fastq like
>
>     @HWI-blah blah
>     ATGTFGTGTAT
>     +
>     kdidfjj
>
> while the entire rest of the file is fine, the whole thing explodes.
>
> Yes, this happened -_-
>
> This could perhaps be handled in the calling scripts instead?

I'm skeptical that this is a good idea; this sort of situation should only happen due to file corruption, right? I would prefer to make diginorm itself (the handling script) robust to occasionally badly formatted files. I don't think it's sensible to try to recover in the middle of a given FASTQ file.

But I agree that diginorm shouldn't die.

--t

C. Titus Brown, ctb@msu.edu
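A file-level guard in the calling script, matching the position above, might look like the following sketch. `screed.open` is screed's real entry point; catching a broad `Exception` is an assumption standing in for whatever the parser actually raises on a mangled record:

```python
import sys
import screed


def normalize_file(filename):
    # Sketch of diginorm's per-file loop; the real logic would go
    # where the pass is.
    for record in screed.open(filename):
        pass


for filename in sys.argv[1:]:
    try:
        normalize_file(filename)
    except Exception as err:  # assumption: a mangled record surfaces here
        print('giving up on %s (%s); moving to the next file'
              % (filename, err), file=sys.stderr)
```

Note that no attempt is made to resume parsing inside a damaged file; the run simply survives it and continues with the remaining files.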

@fishjord
Contributor

Agreed: screed is a parsing library; it has no idea what to do when it encounters malformed data, so the only thing it can do is throw an exception.

While it is annoying for a long process to fail because of a bad sequence, I still think that should be the default behavior. A bad sequence could indicate any of several major errors, and still producing an output file in those situations is a dubious thing to do by default.

@camillescott
Member Author

I see your point. I think I'll write some sort of robust class that can sit on top of a screed parser and catalogue any bad reads. That way the user can choose whether the errors indicate a truly corrupted file, or just a random couple of mangled reads that can probably be dismissed.
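One shape such a wrapper could take, as a hypothetical standalone sketch (not screed's actual API): a reader that skips malformed records and catalogues them, so the user can decide afterwards.

```python
import re


class TolerantFastqReader:
    """Hypothetical sketch: iterate over FASTQ records, cataloguing
    malformed ones instead of aborting the whole run."""

    VALID_SEQ = re.compile(r'^[ACGTNacgtn]+$')

    def __init__(self, fileobj):
        self.fileobj = fileobj
        self.bad_records = []  # list of (record_number, reason)

    def __iter__(self):
        record_number = 0
        while True:
            lines = [self.fileobj.readline() for _ in range(4)]
            if not lines[0]:
                break  # end of file
            record_number += 1
            name, seq, plus, qual = (ln.rstrip('\n') for ln in lines)
            # NB: a record with the wrong number of lines desynchronizes
            # this naive 4-line framing -- which is the earlier point about
            # mid-file recovery not being sensible.
            if not name.startswith('@'):
                self.bad_records.append((record_number, 'bad header'))
            elif not self.VALID_SEQ.match(seq):
                self.bad_records.append((record_number, 'bad sequence'))
            elif not plus.startswith('+'):
                self.bad_records.append((record_number, 'missing + separator'))
            elif len(qual) != len(seq):
                self.bad_records.append((record_number, 'seq/qual length mismatch'))
            else:
                yield {'name': name[1:], 'sequence': seq, 'quality': qual}
```

Run over the mangled example above, this would catalogue the read as a 'bad sequence' (the stray F) and keep going; after the pass, the caller can inspect `bad_records` and decide whether a couple of mangled reads are dismissable or the whole file is suspect.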

@ctb
Member

ctb commented Jul 28, 2013


I'm still wondering why this is worthwhile :). The only source of FASTQ data should (ultimately) be a sequencing machine or some programs, and they should output correct data. If we get bad data, it's either a bug internally (which we should fix) or due to file system corruption (and we should go find a true copy of the file).

--t

C. Titus Brown, ctb@msu.edu

@mr-c
Contributor

mr-c commented Jul 28, 2013


Here is a user story that may be useful:

CDubb, a bioinformatician, is processing 70 files of FASTQ data through both Trimmomatic and diginorm. For reasons unknown, Trimmomatic mangles a single read in file 59, thus requiring the entire diginorm process to be re-run after the problem is fixed.

CDubb doesn't really care about that one read, as he is under a research deadline and would really like to continue with the analysis.

(all names have been changed to protect the innocent)


@ctb
Member

ctb commented Jul 28, 2013


In an earlier comment, I agreed that we should be graceful with respect to multiple files, which I think fits this scenario just fine. No?

--t

@mr-c
Contributor

mr-c commented Jul 28, 2013

Agreed

@mr-c
Contributor

mr-c commented Aug 1, 2013

Is this handled by @cswelcher's recent push?

@camillescott
Member Author

Yup.

@ctb
Member

ctb commented Aug 1, 2013

It's now handled for normalize-by-median -- but what about the other scripts? Not sure what load-graph and load-into-counting should do. filter-abund should be tolerant. abundance-dist... should probably fail. Systematic examination needed.

@mr-c mr-c added this to the 1.0 release milestone Feb 28, 2014
@mr-c mr-c modified the milestones: 1.1+ Release, 1.0 release Apr 2, 2014
@mr-c mr-c modified the milestones: 1.1.1+ Release, 1.1 + 2 Aug 1, 2014
@mr-c mr-c added the Python label Sep 30, 2014
@ctb
Member

ctb commented Jun 12, 2015

I've decided this is, in general, a horrible idea. See #1057 (comment) for full rationale.

@ctb ctb closed this as completed Jun 12, 2015