Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Files that are not in hdfs or ADAM format bypass Spark #494

Merged
merged 11 commits into from
Jun 28, 2019

Conversation

akmorrow13
Copy link
Contributor

No description provided.

@coveralls
Copy link

coveralls commented May 9, 2019

Coverage Status

Coverage increased (+7.9%) to 80.273% when pulling 39aa701 on akmorrow13:optimize_http into b9efbdb on bigdatagenomics:master.

@AmplabJenkins
Copy link

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/mango-prb/832/
Test FAILed.

@akmorrow13 akmorrow13 changed the title Optimize http Files that are not in hdfs or ADAM format bypass Spark May 10, 2019
@AmplabJenkins
Copy link

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/mango-prb/833/
Test PASSed.

@akmorrow13 akmorrow13 requested a review from heuermh May 10, 2019 18:41
@@ -0,0 +1,223 @@
/**
* Licensed to Big Data Genomics (BDG) under one
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@heuermh this entire file is copied from ADAM

Copy link
Member

@heuermh heuermh May 13, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note this isn't fully tested yet, but should be somewhat faster than what is in ADAM
bigdatagenomics/convert#71

And if for your use case, if you are projecting away the attributes column, perhaps it would be useful to add a flag not to convert those, since they are lazily parsed in htsjdk.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Awesome thanks @heuermh! Would the flag be added in the convert library?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I could do so. I'm also considering adding a convert adapter layer to the actual converters in ADAM. That way the implementation classes can continue to be private to ADAM and the convert adapter layer part of the public API.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry, been busy with other things. I can implement the attributes column flag tomorrow. While thinking about the adapter layer in ADAM I found bigdatagenomics/adam#2156, which needs review and real-life testing to make sure there is no performance regression.

}

// TODO already defined in ADAM in VariantContextConverter line 266
def getHeaderLines(header: VCFHeader): Seq[VCFHeaderLine] = {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@heuermh this is also copied from ADAM, although it is pretty small so I don't think copying is the worst thing here

@AmplabJenkins
Copy link

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/mango-prb/834/
Test FAILed.

}
}

if (isGzipped)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does a distinction between GZIP and block-compressed GZIP (BGZF) need to be made here?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

not sure, do you have a BGZF reference I can play with?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ADAM adam-core/src/test/resources has

test.compressed.bcf
test.uncompressed.bcf
test.vcf
test.vcf.bgz
test.vcf.bgzf.gz
test.vcf.gz

Disq src/test/resources has

HiSeq.10000.vcf.bgz
HiSeq.10000.vcf.bgz.tbi
HiSeq.10000.vcf.bgzf.gz
HiSeq.10000.vcf.bgzf.gz.tbi
test.vcf
test.vcf.bgz
test.vcf.bgzf.gz
test.vcf.gz

Copy link
Contributor Author

@akmorrow13 akmorrow13 May 13, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Awesome, thanks @heuermh !

Copy link
Contributor Author

@akmorrow13 akmorrow13 May 14, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the thoughts @heuermh I pushed some tests and a fix. It now works with bgz and bgzf.gz

private def createIndex(fp: String, codec: VCFCodec): String = {

val file = new java.io.File(fp)
val isGzipped = fp.endsWith(".gz")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

...same here

@AmplabJenkins
Copy link

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/mango-prb/835/
Test FAILed.

@AmplabJenkins
Copy link

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/mango-prb/836/
Test PASSed.

@heuermh
Copy link
Member

heuermh commented May 23, 2019

@akmorrow13 Go ahead and resolve conversations above that you feel have been resolved. After I wrap up all the ADAM post-release stuff I'll spend some time on convert stuff as described above.

@akmorrow13
Copy link
Contributor Author

Thanks @heuermh! just resolved them

@AmplabJenkins
Copy link

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/mango-prb/854/
Test PASSed.

@akmorrow13 akmorrow13 merged commit a989d74 into bigdatagenomics:master Jun 28, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants