Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[ADAM-1635] Eliminate passing FASTQ splittable status via config. #1636

Merged

Conversation

fnothaft
Copy link
Member

Resolves #1635. Instead of passing whether a FASTQ was splittable via config, checks to see if the compression codec is splittable. This is more reliable. In the case of a .gz file, the BGZFEnhancedGZipCodec properly handles this edge case by checking the stream type; this coupled with us explicitly checking the stream when split picking ensures that we don't try to create an invalid GZIP split. Additionally, I identified and fixed an error in the old FASTQ code that did a seek on the uncompressed input stream to backtrack if seeing a line of quality scores that began with @ when identifying the position of the first valid record in a split. Instead, we check for two successive lines that start with an @, which indicates that the first line contains quality scores, while the second line contains read names.

Resolves bigdatagenomics#1635. Instead of passing whether a FASTQ was splittable via config,
checks to see if the compression codec is splittable. This is more reliable.
In the case of a .gz file, the BGZFEnhancedGZipCodec properly handles this
edge case by checking the stream type; this coupled with us explicitly
checking the stream when split picking ensures that we don't try to create an
invalid GZIP split. Additionally, I identified and fixed an error in the old
FASTQ code that did a seek on the uncompressed input stream to backtrack if
seeing a line of quality scores that began with @ when identifying the position
of the first valid record in a split. Instead, we check for two successive lines
that start with an @, which indicates that the first line contains quality
scores, while the second line contains read names.
@fnothaft fnothaft added this to the 0.23.0 milestone Jul 26, 2017
@coveralls
Copy link

coveralls commented Jul 26, 2017

Coverage Status

Coverage remained the same at 83.961% when pulling e64119b on fnothaft:issues/1635-no-splittable-fastq-config into 7449b14 on bigdatagenomics:master.

@AmplabJenkins
Copy link

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/ADAM-prb/2281/
Test PASSed.

// a contract where it will put the file's splittable status into the hadoop
// configuration object.
isSplittable = conf.getBoolean(FastqInputFormat.FILE_SPLITTABLE, false);
// if our codec is splittable, we can (tentatively) say that
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yer editor accidentally used tabs for some of these lines

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for catching this; I was editing this patch on a different computer from my usual and I was wondering why the diff looked weird.

reader = new LineReader(stream);
} else {
// see above note about
// SplittableCompressionCodec.createInputStream needing the stream
// to be at offset 0
stream.seek(0);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the comment above this line can be removed

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is still useful info to keep around, but I'll update the comment to better reflect the changed code.

@fnothaft
Copy link
Member Author

Pushed a commit addressing reviewer comments.

BTW @heuermh do you think it would be worthwhile to add something to our CI that would flag any tabs in our source and fail the build? I would've missed those if you hadn't caught them.

@coveralls
Copy link

coveralls commented Jul 26, 2017

Coverage Status

Coverage remained the same at 83.961% when pulling a78b510 on fnothaft:issues/1635-no-splittable-fastq-config into 7449b14 on bigdatagenomics:master.

@AmplabJenkins
Copy link

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/ADAM-prb/2284/
Test PASSed.

@heuermh
Copy link
Member

heuermh commented Jul 26, 2017

do you think it would be worthwhile to add something to our CI that would flag any tabs in our source and fail the build? I would've missed those if you hadn't caught them.

We have a linter that runs on the scala source, this made it through because it was a java source file. I don't think we can put a CI check on the whole repo because some of our test resources require tab characters.

@heuermh heuermh merged commit c8a2202 into bigdatagenomics:master Jul 26, 2017
@heuermh
Copy link
Member

heuermh commented Jul 26, 2017

Thank you, @fnothaft

@fnothaft fnothaft deleted the issues/1635-no-splittable-fastq-config branch July 26, 2017 16:30
@heuermh
Copy link
Member

heuermh commented Jul 26, 2017

Sorry, wrong button, I should've squashed.

@fnothaft
Copy link
Member Author

We have a linter that runs on the scala source, this made it through because it was a java source file. I don't think we can put a CI check on the whole repo because some of our test resources require tab characters.

I mean, sure, but we could do something like:

find adam-*/src -name "*.java" -exec ./scripts/failIfHasTabs.sh {} \;
find adam-*/src -name "*.R" -exec ./scripts/failIfHasTabs.sh {} \;
find adam-*/src -name "*.py" -exec ./scripts/failIfHasTabs.sh {} \;

@heuermh
Copy link
Member

heuermh commented Jul 26, 2017

+1, add *.pom, *.sh

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

FastqInputFormat.FILE_SPLITTABLE in conf not getting passed properly
4 participants