Files that are not in hdfs or ADAM format bypass Spark #494
Test FAILed.
Test PASSed.
@@ -0,0 +1,223 @@
/**
 * Licensed to Big Data Genomics (BDG) under one
@heuermh this entire file is copied from ADAM
Note this isn't fully tested yet, but it should be somewhat faster than what is in ADAM: bigdatagenomics/convert#71
And for your use case, if you are projecting away the attributes column, perhaps it would be useful to add a flag not to convert those, since they are lazily parsed in htsjdk.
Awesome thanks @heuermh! Would the flag be added in the convert library?
Yes, I could do so. I'm also considering adding a convert adapter layer to the actual converters in ADAM. That way the implementation classes can continue to be private to ADAM, and the convert adapter layer can be part of the public API.
Sorry, been busy with other things. I can implement the attributes column flag tomorrow. While thinking about the adapter layer in ADAM I found bigdatagenomics/adam#2156, which needs review and real-life testing to make sure there is no performance regression.
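To make the proposed flag concrete, here is a minimal sketch of the idea. It is not the convert library's API: `RawRead`, `ConvertedRead`, `convert`, and the `convertAttributes` parameter are all hypothetical names, and `RawRead` only stands in for a record whose attributes are parsed lazily, as described for htsjdk above.

```scala
object AttributesFlagSketch {

  // Hypothetical stand-in for a record whose attributes are parsed on first access.
  final class RawRead(val name: String, parseAttributes: () => String) {
    lazy val attributes: String = parseAttributes()
  }

  // Hypothetical converted record; attributes are optional so they can be skipped.
  final case class ConvertedRead(name: String, attributes: Option[String])

  // When convertAttributes is false, the lazy attributes are never touched,
  // so the parsing cost is not paid for a column that is projected away.
  def convert(read: RawRead, convertAttributes: Boolean = true): ConvertedRead =
    ConvertedRead(
      read.name,
      if (convertAttributes) Some(read.attributes) else None)
}
```

With `convertAttributes = false` the parse function passed to `RawRead` is never invoked, which is the saving the flag is meant to provide.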
}

// TODO already defined in ADAM in VariantContextConverter line 266
def getHeaderLines(header: VCFHeader): Seq[VCFHeaderLine] = {
@heuermh this is also copied from ADAM, although it is pretty small so I don't think copying is the worst thing here
Test FAILed.
}
}

if (isGzipped)
Does a distinction between GZIP and block-compressed GZIP (BGZF) need to be made here?
Not sure. Do you have a BGZF reference file I can play with?
ADAM adam-core/src/test/resources has:
- test.compressed.bcf
- test.uncompressed.bcf
- test.vcf
- test.vcf.bgz
- test.vcf.bgzf.gz
- test.vcf.gz

Disq src/test/resources has:
- HiSeq.10000.vcf.bgz
- HiSeq.10000.vcf.bgz.tbi
- HiSeq.10000.vcf.bgzf.gz
- HiSeq.10000.vcf.bgzf.gz.tbi
- test.vcf
- test.vcf.bgz
- test.vcf.bgzf.gz
- test.vcf.gz
Awesome, thanks @heuermh !
Thanks for the thoughts, @heuermh. I pushed some tests and a fix. It now works with bgz and bgzf.gz.
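For anyone following the GZIP vs. BGZF question in this thread: the two can be told apart by content rather than by file extension, since a BGZF block is a gzip member whose extra field carries a `BC` subfield. The sketch below is not the fix pushed in this PR; `CompressionSniffer` and `sniff` are hypothetical names, and it assumes the `BC` subfield is the first extra subfield, which is how BGZF writers commonly lay out the header.

```scala
import java.io.InputStream

object CompressionSniffer {

  sealed trait Compression
  case object Bgzf extends Compression
  case object Gzip extends Compression
  case object Uncompressed extends Compression

  // Reads at most the first 16 bytes of the stream and classifies them.
  def sniff(in: InputStream): Compression = {
    val h: Array[Int] =
      Iterator.continually(in.read()).takeWhile(_ != -1).take(16).toArray

    // gzip magic bytes (RFC 1952): ID1 = 0x1f, ID2 = 0x8b
    val isGzip = h.length >= 2 && h(0) == 0x1f && h(1) == 0x8b

    // BGZF: deflate method, FEXTRA flag set, and an extra subfield
    // with SI1 = 'B', SI2 = 'C', SLEN = 2 (little-endian).
    val isBgzf = isGzip && h.length == 16 &&
      h(2) == 8 && (h(3) & 4) != 0 &&
      h(12) == 'B'.toInt && h(13) == 'C'.toInt &&
      h(14) == 2 && h(15) == 0

    if (isBgzf) Bgzf else if (isGzip) Gzip else Uncompressed
  }
}
```

A check like this would classify `test.vcf.bgz` and `test.vcf.bgzf.gz` the same way regardless of suffix, whereas a `.gz` suffix test alone does not match `.bgz` files.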
private def createIndex(fp: String, codec: VCFCodec): String = {

val file = new java.io.File(fp)
val isGzipped = fp.endsWith(".gz")
...same here
mango-core/src/main/scala/org/bdgenomics/mango/models/AlignmentRecordMaterialization.scala
Test FAILed.
Test PASSed.
@akmorrow13 Go ahead and resolve conversations above that you feel have been resolved. After I wrap up all the ADAM post-release stuff I'll spend some time on
Thanks @heuermh! Just resolved them.
Test PASSed.