-
Notifications
You must be signed in to change notification settings - Fork 309
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[ADAM-1635] Eliminate passing FASTQ splittable status via config. #1636
[ADAM-1635] Eliminate passing FASTQ splittable status via config. #1636
Conversation
Resolves bigdatagenomics#1635. Instead of passing whether a FASTQ was splittable via config, checks to see if the compression codec is splittable. This is more reliable. In the case of a .gz file, the BGZFEnhancedGZipCodec properly handles this edge case by checking the stream type; this coupled with us explicitly checking the stream when split picking ensures that we don't try to create an invalid GZIP split. Additionally, I identified and fixed an error in the old FASTQ code that did a seek on the uncompressed input stream to backtrack if seeing a line of quality scores that began with @ when identifying the position of the first valid record in a split. Instead, we check for two successive lines that start with an @, which indicates that the first line contains quality scores, while the second line contains read names.
Test PASSed. |
// a contract where it will put the file's splittable status into the hadoop | ||
// configuration object. | ||
isSplittable = conf.getBoolean(FastqInputFormat.FILE_SPLITTABLE, false); | ||
// if our codec is splittable, we can (tentatively) say that |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yer editor accidentally used tabs for some of these lines
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for catching this; I was editing this patch on a different computer from my usual and I was wondering why the diff looked weird.
reader = new LineReader(stream); | ||
} else { | ||
// see above note about | ||
// SplittableCompressionCodec.createInputStream needing the stream | ||
// to be at offset 0 | ||
stream.seek(0); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the comment above this line can be removed
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this is still useful info to keep around, but I'll update the comment to better reflect the changed code.
Pushed a commit addressing reviewer comments. BTW @heuermh do you think it would be worthwhile to add something to our CI that would flag any tabs in our source and fail the build? I would've missed those if you hadn't caught them. |
Test PASSed. |
We have a linter that runs on the scala source, this made it through because it was a java source file. I don't think we can put a CI check on the whole repo because some of our test resources require tab characters. |
Thank you, @fnothaft |
Sorry, wrong button, I should've squashed. |
I mean, sure, but we could do something like:
|
+1, add |
Resolves #1635. Instead of passing whether a FASTQ was splittable via config, checks to see if the compression codec is splittable. This is more reliable. In the case of a .gz file, the BGZFEnhancedGZipCodec properly handles this edge case by checking the stream type; this coupled with us explicitly checking the stream when split picking ensures that we don't try to create an invalid GZIP split. Additionally, I identified and fixed an error in the old FASTQ code that did a seek on the uncompressed input stream to backtrack if seeing a line of quality scores that began with @ when identifying the position of the first valid record in a split. Instead, we check for two successive lines that start with an @, which indicates that the first line contains quality scores, while the second line contains read names.