Add BGZFCodec for reading and writing files with a .bgz suffix. #106

Merged: 1 commit, Jun 30, 2016

tomwhite
Member

This adds support for writing BGZF-compressed files.

The old BGZF codec has been renamed BGZFEnhancedGzipCodec and reads .gz files
as BGZF (so they are splittable) if possible, otherwise falls back to regular
gzip.
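
For readers unfamiliar with how a reader can tell BGZF from regular gzip, here is a minimal sketch. It is not the Hadoop-BAM implementation; it only relies on the BGZF block header layout described in the SAM specification (a gzip member with the FEXTRA flag set and a `BC` extra subfield of length 2). The class name `BgzfSniffer` and the method `looksLikeBgzf` are hypothetical and chosen for illustration.

```java
// Hypothetical helper, not part of Hadoop-BAM or htsjdk.
final class BgzfSniffer {
    private BgzfSniffer() {}

    /**
     * Returns true if the buffer starts with a gzip member whose extra field
     * carries the BGZF "BC" subfield. Anything else (including plain gzip)
     * returns false, so a caller could fall back to a regular gzip reader.
     */
    static boolean looksLikeBgzf(byte[] buf) {
        // The fixed gzip header is 10 bytes; XLEN (2 bytes) follows when FEXTRA is set.
        if (buf.length < 12) return false;
        if ((buf[0] & 0xff) != 0x1f || (buf[1] & 0xff) != 0x8b) return false; // gzip magic
        if ((buf[2] & 0xff) != 8) return false;   // CM must be deflate
        if ((buf[3] & 0x04) == 0) return false;   // FLG.FEXTRA is not set
        int xlen = u16(buf, 10);
        int pos = 12;
        int end = Math.min(buf.length, 12 + xlen);
        // Walk the extra subfields: SI1, SI2, SLEN (uint16), then SLEN payload bytes.
        while (pos + 4 <= end) {
            int slen = u16(buf, pos + 2);
            if (buf[pos] == 'B' && buf[pos + 1] == 'C' && slen == 2) return true;
            pos += 4 + slen;
        }
        return false;
    }

    /** Reads an unsigned little-endian 16-bit value. */
    private static int u16(byte[] b, int off) {
        return (b[off] & 0xff) | ((b[off + 1] & 0xff) << 8);
    }
}
```

A caller would read the first few bytes of a `.gz` input, pass them to `looksLikeBgzf`, and choose a block-aware (splittable) reader only when it returns true.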

@heuermh
Contributor

heuermh commented Jun 27, 2016

I'm generally in favor of this, though it may take us a while to test.

@tomwhite tomwhite mentioned this pull request Jun 28, 2016
2 tasks
@tomwhite
Member Author

I'd like this to go into the next release, so please take a look/try it out @heuermh.

private static VCFFileReader parseVcf(File vcf) throws IOException {
  File actualVcf;
  // work around TribbleIndexedFeatureReader not reading header from .bgz files
  if (vcf.getName().endsWith(".bgz")) {
Contributor

Is this only a workaround for unit tests, or something that we'll also need to do downstream?

Member Author

This is only needed for testing. VCFHeaderReader in Hadoop-BAM will correctly read a bgzf stream now (since #97).

Contributor

Great, thanks! Yep, VCFHeaderReader is now working for us.

Collaborator

@cmnbroad cmnbroad Jun 28, 2016

@tomwhite - is that (.bgz/TribbleIndexedFeatureReader) a known issue or is there a ticket? It seems like something we should fix for the future....

Member Author

I just created samtools/htsjdk#653 for it.

@heuermh
Contributor

heuermh commented Jun 28, 2016

+1, see downstream pull request referenced above

String extension = "";
if (isCompressed) {
  Class<? extends CompressionCodec> codecClass =
      getOutputCompressorClass(ctx, GzipCodec.class);
Collaborator

GzipCodec seems like an unnatural default, certainly for BCF. Wouldn't BGZFCodec be a better choice for both VCF and BCF?

Member Author

Agreed. Fixed.

@cmnbroad
Collaborator

A couple of minor comments. Not sure why the Travis push job failed. Otherwise LGTM. Back to @tomwhite.

@tomwhite
Member Author

Thanks for taking a look @heuermh and @cmnbroad. I think the earlier Travis job failed because I pushed the change to the wrong upstream repository then deleted it before the job had a chance to complete. It's passing now. I'll merge this shortly.

@tomwhite tomwhite merged commit 10dec49 into HadoopGenomics:master Jun 30, 2016
@tomwhite tomwhite deleted the bgzf-writes branch June 30, 2016 08:34
@heuermh
Contributor

heuermh commented Jun 30, 2016

Thank you, @tomwhite!

tomwhite added a commit to broadinstitute/gatk that referenced this pull request Jul 1, 2016
tomwhite added a commit to broadinstitute/gatk that referenced this pull request Jul 7, 2016
magicDGS pushed a commit to bioinformagik/gatk that referenced this pull request Jul 15, 2016