[ADAM-1635] Eliminate passing FASTQ splittable status via config. #1636

fnothaft · 2017-07-26T08:13:15Z

Resolves #1635. Instead of passing whether a FASTQ file is splittable via the config, this checks whether the compression codec is splittable, which is more reliable. For a .gz file, the BGZFEnhancedGZipCodec handles the edge case by checking the stream type; combined with our explicit check of the stream when picking splits, this ensures that we don't try to create an invalid GZIP split. Additionally, I identified and fixed an error in the old FASTQ code, which did a seek on the uncompressed input stream to backtrack when it saw a line of quality scores that began with @ while identifying the position of the first valid record in a split. Instead, we now check for two successive lines that start with an @, which indicates that the first line contains quality scores and the second line contains the read name.
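
To make the two-successive-@ check concrete, here is a minimal sketch of the idea. It is not the FastqRecordReader code from this patch: the class and method names are hypothetical, it works on a list of already-read lines rather than a Hadoop stream, and it assumes unwrapped four-line FASTQ records whose sequence lines never start with @.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical helper for illustration; not the FastqRecordReader code from this patch.
class FastqRecordStart {

    /**
     * Returns the index of the first line that begins a FASTQ record, or -1 if
     * none can be identified. Assumes unwrapped four-line records and that
     * `lines` holds only complete lines (any partial line at the start of the
     * split has already been discarded).
     */
    static int firstRecordLine(List<String> lines) {
        for (int i = 0; i < lines.size(); i++) {
            if (!lines.get(i).startsWith("@")) {
                // Sequence line, '+' separator, or quality line that doesn't start with '@'.
                continue;
            }
            if (i + 1 >= lines.size()) {
                // No following line to disambiguate with, and no complete record fits anyway.
                return -1;
            }
            // Sequence lines are assumed never to start with '@', so two successive
            // '@' lines mean quality scores followed by the next record's read name.
            return lines.get(i + 1).startsWith("@") ? i + 1 : i;
        }
        return -1;
    }

    public static void main(String[] args) {
        // The split begins on a quality line that happens to start with '@'.
        List<String> lines = Arrays.asList("@IIII", "@read2", "ACGT", "+", "IIII");
        System.out.println(firstRecordLine(lines)); // prints 1
    }
}
```

Because the decision only needs the line that follows the first @ line, the reader can keep moving forward through the stream and never has to seek backwards.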


coveralls · 2017-07-26T08:29:18Z

Coverage remained the same at 83.961% when pulling e64119b on fnothaft:issues/1635-no-splittable-fastq-config into 7449b14 on bigdatagenomics:master.

AmplabJenkins · 2017-07-26T08:39:01Z

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/ADAM-prb/2281/

heuermh · 2017-07-26T13:30:22Z

adam-core/src/main/java/org/bdgenomics/adam/io/FastqRecordReader.java

- // a contract where it will put the file's splittable status into the hadoop
- // configuration object.
- isSplittable = conf.getBoolean(FastqInputFormat.FILE_SPLITTABLE, false);
+ // if our codec is splittable, we can (tentatively) say that


Yer editor accidentally used tabs for some of these lines

fnothaft · 2017-07-26

Thanks for catching this; I was editing this patch on a different computer from my usual one, and I was wondering why the diff looked weird.

heuermh · 2017-07-26T13:32:34Z

adam-core/src/main/java/org/bdgenomics/adam/io/FastqRecordReader.java

 reader = new LineReader(stream);
 } else {
 // see above note about 
 // SplittableCompressionCodec.createInputStream needing the stream
 // to be at offset 0
- stream.seek(0);


the comment above this line can be removed

fnothaft · 2017-07-26

I think this is still useful info to keep around, but I'll update the comment to better reflect the changed code.

fnothaft · 2017-07-26T15:46:50Z

Pushed a commit addressing reviewer comments.

BTW @heuermh do you think it would be worthwhile to add something to our CI that would flag any tabs in our source and fail the build? I would've missed those if you hadn't caught them.

coveralls · 2017-07-26T15:59:17Z

Coverage remained the same at 83.961% when pulling a78b510 on fnothaft:issues/1635-no-splittable-fastq-config into 7449b14 on bigdatagenomics:master.

AmplabJenkins · 2017-07-26T16:08:04Z

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/ADAM-prb/2284/

heuermh · 2017-07-26T16:30:06Z

> do you think it would be worthwhile to add something to our CI that would flag any tabs in our source and fail the build? I would've missed those if you hadn't caught them.

We have a linter that runs on the Scala source; this made it through because it was a Java source file. I don't think we can put a CI check on the whole repo because some of our test resources require tab characters.

heuermh · 2017-07-26T16:30:53Z

Thank you, @fnothaft

heuermh · 2017-07-26T16:32:35Z

Sorry, wrong button, I should've squashed.

fnothaft · 2017-07-26T16:37:53Z

> We have a linter that runs on the Scala source; this made it through because it was a Java source file. I don't think we can put a CI check on the whole repo because some of our test resources require tab characters.

I mean, sure, but we could do something like:

find adam-*/src -name "*.java" -exec ./scripts/failIfHasTabs.sh {} \;
find adam-*/src -name "*.R" -exec ./scripts/failIfHasTabs.sh {} \;
find adam-*/src -name "*.py" -exec ./scripts/failIfHasTabs.sh {} \;

heuermh · 2017-07-26T16:46:51Z

+1, add *.pom, *.sh

fnothaft added this to the 0.23.0 milestone Jul 26, 2017

heuermh requested changes Jul 26, 2017

Addressing reviewer comments.

a78b510

heuermh approved these changes Jul 26, 2017

heuermh merged commit c8a2202 into bigdatagenomics:master Jul 26, 2017

fnothaft deleted the issues/1635-no-splittable-fastq-config branch July 26, 2017 16:30
