FastqInputFormat.FILE_SPLITTABLE in conf not getting passed properly #1635

Closed
fnothaft opened this issue Jul 26, 2017 · 0 comments · Fixed by #1636

@fnothaft
Member

Introduced in 985e5d8. A tweak I made between my last test and the merge broke this. A BGZF'ed file is split properly by the input format, but the record reader then reads the config, sees false for the FILE_SPLITTABLE flag, and reads the whole file.
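
For context on why a BGZF'ed file can be split at all: BGZF is a series of small gzip blocks, each carrying a "BC" extra subfield in its gzip header. The sketch below shows one way to recognize that header in plain Java, assuming the block layout given in the SAM specification. The class and method names are hypothetical, and this is not claimed to be the check that ADAM or its codec performs.

```java
final class BgzfHeaderCheck {
    /**
     * Returns true if the given bytes start with a gzip header that carries
     * the BGZF "BC" extra subfield (subfield length 2).
     * Hypothetical helper for illustration only.
     */
    static boolean looksLikeBgzf(byte[] h) {
        // A gzip header with an extra field needs at least 12 bytes:
        // ID1, ID2, CM, FLG, MTIME(4), XFL, OS, XLEN(2).
        if (h.length < 12) {
            return false;
        }
        boolean gzipMagic = (h[0] & 0xff) == 0x1f && (h[1] & 0xff) == 0x8b;
        boolean deflate = h[2] == 8;
        boolean hasExtra = (h[3] & 0x04) != 0; // FEXTRA flag
        if (!gzipMagic || !deflate || !hasExtra) {
            return false;
        }
        // XLEN is little-endian and gives the total size of the extra field.
        int xlen = (h[10] & 0xff) | ((h[11] & 0xff) << 8);
        int pos = 12;
        int end = Math.min(h.length, 12 + xlen);
        // Walk the extra subfields: SI1, SI2, SLEN(2), then SLEN bytes of data.
        while (pos + 4 <= end) {
            int slen = (h[pos + 2] & 0xff) | ((h[pos + 3] & 0xff) << 8);
            if (h[pos] == 'B' && h[pos + 1] == 'C' && slen == 2) {
                return true;
            }
            pos += 4 + slen;
        }
        return false;
    }

    public static void main(String[] args) {
        // First 18 bytes of a BGZF block header.
        byte[] bgzf = {0x1f, (byte) 0x8b, 8, 4, 0, 0, 0, 0, 0, (byte) 0xff,
                       6, 0, 'B', 'C', 2, 0, 0x1b, 0};
        System.out.println(looksLikeBgzf(bgzf));
    }
}
```

A plain gzip file normally lacks this subfield, which is why a `.gz` extension alone does not say whether the file can be split.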

@fnothaft fnothaft added the bug label Jul 26, 2017
@fnothaft fnothaft added this to the 0.23.0 milestone Jul 26, 2017
@fnothaft fnothaft self-assigned this Jul 26, 2017
fnothaft added a commit to fnothaft/adam that referenced this issue Jul 26, 2017
Resolves bigdatagenomics#1635. Instead of passing whether a FASTQ file is splittable via
the config, we now check whether the compression codec is splittable, which is
more reliable. For a .gz file, the BGZFEnhancedGZipCodec handles this edge case
by checking the stream type; combined with our explicit check of the stream
when picking splits, this ensures that we don't try to create an invalid GZIP
split. Additionally, I identified and fixed an error in the old FASTQ code,
which did a seek on the uncompressed input stream to backtrack if it saw a line
of quality scores that began with @ while identifying the position of the first
valid record in a split. Instead, we now check for two successive lines that
start with an @, which indicates that the first line contains quality scores
and the second line contains a read name.
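
The two-successive-@ check described in the commit message can be sketched in plain Java as follows. This is a minimal illustration, not the code from the commit: it assumes four-line FASTQ records (name, sequence, `+`, qualities), works on lines already read into memory rather than on a Hadoop stream, and the class and method names are hypothetical.

```java
import java.util.Arrays;
import java.util.List;

final class FastqRecordStart {
    /**
     * Returns the index of the first line that begins a FASTQ record,
     * or -1 if no line starts with '@'.
     *
     * Assumes four-line records. A quality line may itself begin with '@',
     * but it is then immediately followed by the next record's name line,
     * which also begins with '@'. A name line is followed by a sequence
     * line, which does not.
     */
    static int firstRecordLine(List<String> lines) {
        for (int i = 0; i < lines.size(); i++) {
            if (!lines.get(i).startsWith("@")) {
                continue;
            }
            boolean nextStartsWithAt =
                i + 1 < lines.size() && lines.get(i + 1).startsWith("@");
            // Two successive '@' lines: the first is qualities, the second a name.
            // Note: if the '@' line is the last one available, this sketch
            // cannot tell the two cases apart and treats it as a name.
            return nextStartsWithAt ? i + 1 : i;
        }
        return -1;
    }

    public static void main(String[] args) {
        // A split that happens to begin on a quality line starting with '@'.
        List<String> lines = Arrays.asList("@IIII", "@read2", "ACGT", "+", "IIII");
        System.out.println(firstRecordLine(lines));
    }
}
```

Because the check only looks forward, it needs no seek back on the input stream, which matters when the underlying stream is compressed.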
heuermh pushed a commit that referenced this issue Jul 26, 2017