Add BGZFCodec for reading and writing files with a .bgz suffix. #106

tomwhite · 2016-06-15T17:17:34Z

This adds support for writing BGZF-compressed files.
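For readers unfamiliar with the format, here is a rough sketch of what writing a single BGZF block involves, following the block layout in the BGZF section of the SAM specification. It is not the `BGZFCodec` implementation from this PR. The class name `BgzfBlockSketch`, the method name, and the conservative 0xff00-byte payload limit are assumptions made for illustration. A real writer would also split the input into many blocks and append the empty end-of-file block.

```java
import java.io.ByteArrayOutputStream;
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.util.zip.CRC32;
import java.util.zip.Deflater;

// Hypothetical illustration class; not part of Hadoop-BAM or htsjdk.
final class BgzfBlockSketch {
    static final int HEADER_LEN = 18; // gzip header (12 bytes) + BC extra subfield (6 bytes)
    static final int FOOTER_LEN = 8;  // CRC32 + ISIZE
    static final int MAX_PAYLOAD = 0xff00; // assumed conservative per-block input limit

    /** Encodes one small payload as a single BGZF block (which is also a valid gzip member). */
    static byte[] encodeBlock(byte[] data) {
        if (data.length > MAX_PAYLOAD) {
            throw new IllegalArgumentException("payload too large for one block");
        }
        // nowrap = true gives raw deflate data without a zlib wrapper.
        Deflater deflater = new Deflater(Deflater.DEFAULT_COMPRESSION, true);
        deflater.setInput(data);
        deflater.finish();
        ByteArrayOutputStream cdata = new ByteArrayOutputStream();
        byte[] buf = new byte[8192];
        while (!deflater.finished()) {
            int n = deflater.deflate(buf);
            cdata.write(buf, 0, n);
        }
        deflater.end();

        CRC32 crc = new CRC32();
        crc.update(data); // checksum of the uncompressed bytes

        int total = HEADER_LEN + cdata.size() + FOOTER_LEN;
        if (total > 65536) {
            throw new IllegalArgumentException("compressed block exceeds 64 KiB");
        }
        ByteBuffer out = ByteBuffer.allocate(total).order(ByteOrder.LITTLE_ENDIAN);
        out.put((byte) 0x1f).put((byte) 0x8b); // gzip magic
        out.put((byte) 8).put((byte) 4);       // CM = deflate, FLG = FEXTRA
        out.putInt(0);                         // MTIME
        out.put((byte) 0).put((byte) 0xff);    // XFL, OS = unknown
        out.putShort((short) 6);               // XLEN
        out.put((byte) 'B').put((byte) 'C').putShort((short) 2); // SI1, SI2, SLEN
        out.putShort((short) (total - 1));     // BSIZE = total block size minus one
        out.put(cdata.toByteArray());
        out.putInt((int) crc.getValue());      // CRC32
        out.putInt(data.length);               // ISIZE
        return out.array();
    }
}
```

Because every block carries its own size in the `BSIZE` field, a reader can locate block boundaries without decompressing, which is what makes such files splittable.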

The old BGZF codec has been renamed BGZFEnhancedGzipCodec and reads .gz files
as BGZF (so they are splittable) if possible, otherwise falls back to regular
gzip.
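To illustrate the "BGZF if possible, otherwise regular gzip" decision, the sketch below peeks at the first 16 bytes of a stream and checks for the BGZF extra subfield. This is not the actual logic in BGZFEnhancedGzipCodec. The class and method names are hypothetical, and it assumes the `BC` subfield is the first entry in the gzip extra field, which is how BGZF writers conventionally lay it out.

```java
import java.io.IOException;
import java.io.InputStream;

// Hypothetical illustration class; not part of Hadoop-BAM or htsjdk.
final class BgzfSniffer {
    /** Bytes needed to see the gzip header plus the BC subfield id and length. */
    private static final int HEADER_PREFIX = 16;

    /** Returns true if the first len bytes of h look like the start of a BGZF block. */
    static boolean looksLikeBgzf(byte[] h, int len) {
        return len >= HEADER_PREFIX
            && (h[0] & 0xff) == 0x1f && (h[1] & 0xff) == 0x8b // gzip magic
            && h[2] == 8                                      // CM = deflate
            && (h[3] & 4) != 0                                // FLG.FEXTRA is set
            && h[12] == 'B' && h[13] == 'C'                   // BGZF subfield identifier
            && h[14] == 2 && h[15] == 0;                      // SLEN = 2 (little-endian)
    }

    /** Peeks at a mark-supporting stream and leaves its position unchanged. */
    static boolean isBgzf(InputStream in) throws IOException {
        if (!in.markSupported()) {
            throw new IllegalArgumentException("stream must support mark/reset");
        }
        in.mark(HEADER_PREFIX);
        byte[] h = new byte[HEADER_PREFIX];
        int n = 0;
        while (n < h.length) {
            int r = in.read(h, n, h.length - n);
            if (r < 0) break; // stream shorter than a BGZF header
            n += r;
        }
        in.reset();
        return looksLikeBgzf(h, n);
    }
}
```

A caller could wrap a raw stream in a `BufferedInputStream`, call `isBgzf`, and then choose between a block-aware reader and a plain `GZIPInputStream`.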

heuermh · 2016-06-27T20:57:21Z

I'm generally in favor of this, though it may take us a while to test.

tomwhite · 2016-06-28T09:01:20Z

I'd like this to go into the next release, so please take a look/try it out @heuermh.

heuermh · 2016-06-28T15:21:27Z

src/test/java/org/seqdoop/hadoop_bam/TestVCFRoundTrip.java

+    private static VCFFileReader parseVcf(File vcf) throws IOException {
+        File actualVcf;
+        // work around TribbleIndexedFeatureReader not reading header from .bgz files
+        if (vcf.getName().endsWith(".bgz")) {


Is this only a workaround for unit tests, or something that we'll also need to do downstream?

tomwhite · Jun 28, 2016

This is only needed for testing. VCFHeaderReader in Hadoop-BAM will correctly read a BGZF stream now (since #97).

heuermh · Jun 28, 2016

Great, thanks! Yep, VCFHeaderReader is now working for us.

cmnbroad · Jun 28, 2016

@tomwhite - is that (.bgz/TribbleIndexedFeatureReader) a known issue or is there a ticket? It seems like something we should fix for the future...

tomwhite · Jun 29, 2016

I just created samtools/htsjdk#653 for it.

heuermh · 2016-06-28T16:29:14Z

+1, see downstream pull request referenced above

cmnbroad · 2016-06-28T21:35:23Z

src/main/java/org/seqdoop/hadoop_bam/KeyIgnoringVCFOutputFormat.java

+		String extension = "";
+		if (isCompressed) {
+			Class<? extends CompressionCodec> codecClass =
+					getOutputCompressorClass(ctx, GzipCodec.class);


GzipCodec seems like an unnatural default, certainly for BCF. Wouldn't BGZFCodec be a better choice for both VCF and BCF?

tomwhite · Jun 29, 2016

Agreed. Fixed.

cmnbroad · 2016-06-28T22:12:12Z

A couple of minor comments. Not sure why the Travis push job failed. Otherwise LGTM. Back to @tomwhite.

tomwhite · 2016-06-29T14:19:15Z

Thanks for taking a look @heuermh and @cmnbroad. I think the earlier Travis job failed because I pushed the change to the wrong upstream repository then deleted it before the job had a chance to complete. It's passing now. I'll merge this shortly.

heuermh · 2016-06-30T14:34:44Z

Thank you, @tomwhite!

tomwhite force-pushed the bgzf-writes branch from 251d86c to 05cae06 on June 16, 2016 14:57

tomwhite mentioned this pull request Jun 16, 2016

Replace BGzipCodec with equivalent BGZFCodec from Hadoop-BAM hail-is/hail#426

Merged

tomwhite mentioned this pull request Jun 28, 2016

Release Hadoop-BAM 7.6.0 #108

Closed

tomwhite force-pushed the bgzf-writes branch from 05cae06 to ba8b9f0 on June 28, 2016 09:05

heuermh reviewed Jun 28, 2016

heuermh mentioned this pull request Jun 28, 2016

[ADAM-1057] Remove workaround for gzip/BGZF compressed VCF headers bigdatagenomics/adam#1060

Closed

cmnbroad reviewed Jun 28, 2016

Add BGZFCodec for reading and writing files with a .bgz suffix.

b972e8e

tomwhite force-pushed the bgzf-writes branch from ba8b9f0 to b972e8e on June 29, 2016 14:11

tomwhite merged commit 10dec49 into HadoopGenomics:master Jun 30, 2016

tomwhite deleted the bgzf-writes branch June 30, 2016 08:34

tomwhite added a commit to broadinstitute/gatk that referenced this pull request Jul 1, 2016

Add support for block gzipped files with a .bgz suffix (as well as a .gz suffix), see HadoopGenomics/Hadoop-BAM#106. Also remove spurious debug.

a4c8e5b

tomwhite mentioned this pull request Jul 1, 2016

Add support for block gzipped files with a .bgz suffix (as well as broadinstitute/gatk#1963

Merged

tomwhite added a commit to broadinstitute/gatk that referenced this pull request Jul 7, 2016

Add support for block gzipped files with a .bgz suffix (as well as a .gz suffix), see HadoopGenomics/Hadoop-BAM#106. Also remove spurious debug.

2167f0b

magicDGS pushed a commit to bioinformagik/gatk that referenced this pull request Jul 15, 2016

Add support for block gzipped files with a .bgz suffix (as well as a .gz suffix), see HadoopGenomics/Hadoop-BAM#106. Also remove spurious debug.

f334a5a
