ADAMFlatGenotype is a smaller, flat version of a genotype schema #259
Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/ADAM-prb/349/
class Vcf2FlatGenotypeArgs extends Args4jBase with ParquetArgs {
  @Argument(required = true, metaVar = "VCF", usage = "The VCF file to convert", index = 0)
  var bamFile: String = null
bamFile?
Whoops, a hold-over from copy-and-pasting the BAM2ADAM code as my starting point :-)
Fixed in the latest update. Thanks, Arun.
All automated tests passed.
This code looks good to me. Does anyone object to just merging it as is?
def run() = {

  // Quiet parquet...
  ParquetLogger.hadoopLoggerLevel(Level.SEVERE)
This should already be done, no? Is this needed because this is not a Spark command?
This is probably a copy-and-paste mistake.
@massie this needs to be squashed before merging. I'm OK with merging. Long term, I'd like to see:
@tdanford @carlyeks how do you see ADAMFlatGenotype relating to ADAMGenotype? Is ADAMFlatGenotype meant to be a lower-overhead representation for genotypes, a potential replacement for ADAMGenotype, or...?
Just to be clear, I see ADAMFlatGenotype as an experimental alternative to ADAMGenotype -- mostly focused on building a data structure that was easier (and faster) to create/write. That's why we didn't write all the converters that you describe above in your comments. I think in the long term, there's room for an alternative to ADAM[Flat]Genotype which doesn't aggregate alleles by position (as both Genotype and FlatGenotype do), and that's maybe something this model could evolve into, but at the moment this isn't that. The particular selection of fields that went into ADAMFlatGenotype, and the way they went in, is related to the work that we are doing to prototype building internal variant stores for a user (Jason Flannick) here at the Broad.
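To make the "aggregate alleles by position" distinction concrete, here is a minimal sketch in plain Scala. The record shapes and names (`PositionCall`, `AlleleCall`, `AlleleLayouts`) are hypothetical and are not the ADAMGenotype or ADAMFlatGenotype schemas; they only contrast a record that groups a sample's alleles at one position with a layout that stores one record per allele.

```scala
// Hypothetical record shapes for illustration; not the real ADAM schemas.

// One record per (site, sample): all called alleles are aggregated at the position.
final case class PositionCall(contig: String, position: Long, sample: String, alleles: Seq[String])

// One record per (site, sample, allele): nothing is aggregated by position.
final case class AlleleCall(contig: String, position: Long, sample: String, allele: String)

object AlleleLayouts {
  // Expand a position-aggregated record into one record per allele.
  def explode(call: PositionCall): Seq[AlleleCall] =
    call.alleles.map(a => AlleleCall(call.contig, call.position, call.sample, a))

  // Regroup per-allele records by (contig, position, sample),
  // keeping the alleles in their original order within each group.
  def aggregate(calls: Seq[AlleleCall]): Seq[PositionCall] =
    calls
      .groupBy(c => (c.contig, c.position, c.sample))
      .toSeq
      .sortBy(_._1)
      .map { case ((contig, pos, sample), group) =>
        PositionCall(contig, pos, sample, group.map(_.allele))
      }
}
```

Under these assumptions, `AlleleLayouts.explode` and `AlleleLayouts.aggregate` round-trip a single call, which shows that the two layouts carry the same information and differ only in how records are grouped.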
Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/ADAM-prb/369/
Jenkins, test this please.
Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/ADAM-prb/373/
Jenkins, test this please.
Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/ADAM-prb/379/
All automated tests passed.
@tdanford I think the only thing pending on this is to wrap the lines in ADAMFlatGenotypeField, and to squash the commit down and rebase on ToT. I'll merge when ready.
…hema. Also:
1. Added support for missing genotypes in the VCFLineConverter for ADAMFlatGenotype.
2. Added a ReferenceMapping for the ADAMFlatGenotype.
3. The flat genotype converter uses multiple writers for smaller output files.
4. Don't ignore VCF files in .gitignore.
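As a rough illustration of line-by-line VCF conversion with missing genotypes, the sketch below uses only the Scala standard library. It is not the VCFLineConverter from this pull request: `FlatCall`, `VcfLineSketch`, and all of their fields and methods are hypothetical names, and the record holds only a handful of columns. It assumes tab-separated VCF text in which GT, when present, is the first FORMAT key.

```scala
// Hypothetical flat record; the real ADAMFlatGenotype fields may differ.
final case class FlatCall(
  contig: String,
  position: Long,
  ref: String,
  alt: String, // raw ALT column, possibly comma-separated
  sample: String,
  alleles: Seq[Option[Int]]) // None marks a missing allele call (".")

object VcfLineSketch {
  // Sample names start at the tenth column of the "#CHROM" header line.
  def sampleNames(headerLine: String): IndexedSeq[String] =
    headerLine.split("\t").drop(9).toIndexedSeq

  // Parse a GT value such as "0/1", "1|1", "./." or ".".
  def parseGenotype(gt: String): Seq[Option[Int]] =
    gt.split("[/|]").toSeq.map(a => if (a == ".") None else Some(a.toInt))

  // Convert one data line into one flat record per sample.
  def parseLine(line: String, samples: IndexedSeq[String]): Seq[FlatCall] = {
    val cols = line.split("\t")
    val gtIndex = cols(8).split(":").indexOf("GT")
    samples.zip(cols.drop(9)).map { case (sample, field) =>
      val alleles =
        if (gtIndex < 0) Seq.empty[Option[Int]] // no GT key in FORMAT
        else parseGenotype(field.split(":")(gtIndex))
      FlatCall(cols(0), cols(1).toLong, cols(3), cols(4), sample, alleles)
    }
  }

  // Process lines lazily, so a large file is never held in memory at once.
  def convert(lines: Iterator[String]): Iterator[FlatCall] = {
    var samples = IndexedSeq.empty[String]
    lines.flatMap { line =>
      if (line.startsWith("##")) Iterator.empty // meta-information line
      else if (line.startsWith("#")) { samples = sampleNames(line); Iterator.empty }
      else parseLine(line, samples).iterator
    }
  }
}
```

Passing `scala.io.Source.fromFile(path).getLines()` to `VcfLineSketch.convert` would stream a file through this sketch. Writing the resulting records out, for example across several writers to keep output files small, is left out here.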
@fnothaft Actually, it looks like scalariform doesn't like that formatting -- it keeps putting it back to how it is now. Bad formatting rules? I've squashed and rebased on master.
@carlyeks I had noticed that yesterday as well. Alas! Thanks for squashing and rebasing; I'll merge after the tests run. Jenkins, test this please.
All automated tests passed.
ADAMFlatGenotype is a smaller, flat version of a genotype schema
Includes a rewrite of a non-Tribble VCF parser, which parses (large) VCF files line-by-line and includes a converter into the ADAMFlatGenotype format itself.
Also: