-
Notifications
You must be signed in to change notification settings - Fork 52
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add BGZFCodec for reading and writing files with a .bgz suffix. #106
Conversation
I'm generally in favor of this, though it may take us a while to test. |
I'd like this to go into the next release, so please take a look/try it out @heuermh. |
private static VCFFileReader parseVcf(File vcf) throws IOException { | ||
File actualVcf; | ||
// work around TribbleIndexedFeatureReader not reading header from .bgz files | ||
if (vcf.getName().endsWith(".bgz")) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this only a workaround for unit tests, or something that we'll also need to do downstream?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is only needed for testing. VCFHeaderReader in Hadoop-BAM will correctly read a bgzf stream now (since #97).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Great, thanks! Yep, VCFHeaderReader is now working for us.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@tomwhite - is that (.bgz/TribbleIndexedFeatureReader) a known issue or is there a ticket? It seems like something we should fix for the future....
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I just created samtools/htsjdk#653 for it.
+1, see downstream pull request referenced above |
String extension = ""; | ||
if (isCompressed) { | ||
Class<? extends CompressionCodec> codecClass = | ||
getOutputCompressorClass(ctx, GzipCodec.class); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
GzipCodec seems like an unnatural default, certainly for BCF. Wouldn't BGZFCodec be a better choice for both for both VCF and BCF?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agreed. Fixed.
A couple of minor comments. Not sure why the Travis push job failed. Otherwise LGTM. Back to @tomwhite. |
The old BGZF codec has been renamed BGZFEnhancedGzipCodec and reads .gz files as BGZF (so they are splittable) if possible, otherwise falls back to regular gzip.
Thank you, @tomwhite! |
a .gz suffix), see HadoopGenomics/Hadoop-BAM#106. Also remove spurious debug.
a .gz suffix), see HadoopGenomics/Hadoop-BAM#106. Also remove spurious debug.
a .gz suffix), see HadoopGenomics/Hadoop-BAM#106. Also remove spurious debug.
This adds support for writing BGZF-compressed files.
The old BGZF codec has been renamed BGZFEnhancedGzipCodec and reads .gz files
as BGZF (so they are splittable) if possible, otherwise falls back to regular
gzip.