FastqInputFormat.FILE_SPLITTABLE in conf not getting passed properly #1635

fnothaft · 2017-07-26T03:54:48Z

Added in 985e5d8. I made a tweak between when I last tested this and when we merged it, and that broke this. A BGZF'ed file gets split properly by the input format, but the record reader then reads the config, sees false for the FILE_SPLITTABLE flag, and reads the whole file.


Resolves bigdatagenomics#1635. Instead of passing whether a FASTQ file is splittable via the config, this checks whether the compression codec is splittable, which is more reliable. For a .gz file, the BGZFEnhancedGZipCodec handles this edge case by checking the stream type; coupled with explicitly checking the stream when picking splits, this ensures that we don't try to create an invalid GZIP split. Additionally, I identified and fixed an error in the old FASTQ code, which did a seek on the uncompressed input stream to backtrack when it saw a line of quality scores beginning with @ while locating the first valid record in a split. Instead, we now check for two successive lines that start with @, which indicates that the first line contains quality scores and the second contains the read name.
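The two-successive-@ check described above can be sketched roughly as follows. This is a minimal illustration, not the code from the fix: the class name `FastqSplitStart` and the method `firstRecordLine` are hypothetical, it works on an in-memory list of lines rather than a Hadoop input stream, and it assumes four-line FASTQ records with no line-wrapped sequences.

```java
import java.util.List;

/** Hypothetical helper, for illustration only; not part of ADAM. */
public final class FastqSplitStart {
    private FastqSplitStart() {}

    /**
     * Returns the index of the first line that starts a FASTQ record,
     * or -1 if none can be identified. Assumes four-line records.
     */
    public static int firstRecordLine(List<String> lines) {
        // Stop one line early: a decision needs the following line too,
        // and an '@' line at the very end cannot start a full record.
        for (int i = 0; i + 1 < lines.size(); i++) {
            if (!lines.get(i).startsWith("@")) {
                continue; // sequence, '+' separator, or quality not starting with '@'
            }
            // Two successive '@' lines: the first is a quality line and the
            // second is the read name. Otherwise this '@' line is the read
            // name, because a quality line is always followed by a read name.
            return lines.get(i + 1).startsWith("@") ? i + 1 : i;
        }
        return -1;
    }
}
```

Because the decision only looks one line ahead, no backward seek on the uncompressed stream is needed, which is the property the fix relies on.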


fnothaft added the bug label Jul 26, 2017

fnothaft added this to the 0.23.0 milestone Jul 26, 2017

fnothaft self-assigned this Jul 26, 2017

fnothaft mentioned this issue Jul 26, 2017

[ADAM-1635] Eliminate passing FASTQ splittable status via config. #1636

Merged

heuermh closed this as completed in #1636 Jul 26, 2017
