ADAMFlatGenotype is a smaller, flat version of a genotype schema #259

Merged

tdanford
Contributor

@tdanford tdanford commented Jun 6, 2014

Includes a rewrite of a non-Tribble VCF parser, which parses (large) VCF files line-by-line and includes a converter into the ADAMFlatGenotype format itself.

Also:

  1. Added support for missing genotypes in the VCFLineConverter for ADAMFlatGenotype.
  2. Added a ReferenceMapping for the ADAMFlatGenotype.
  3. The flat genotype converter uses multiple writers for smaller output files.
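
As a rough illustration of the line-by-line approach described above, and not the code in this PR, the following minimal sketch streams VCF lines and emits one flat record per sample. The names `FlatGenotypeSketch` and `FlatVcfSketch` and all of their fields are hypothetical and do not reflect the actual ADAMFlatGenotype schema or VCFLineConverter. Representing a missing call as `None` is likewise an assumption made only for this sketch.

```scala
// Hypothetical flat record; the field names are illustrative and are NOT the
// actual ADAMFlatGenotype schema.
final case class FlatGenotypeSketch(
  contig: String,
  position: Long,
  ref: String,
  sampleId: String,
  alleles: Option[Seq[String]]) // None marks a missing call such as "./."

object FlatVcfSketch {

  // Sample names start at the tenth column of the "#CHROM" header line.
  def sampleIds(headerLine: String): Vector[String] =
    headerLine.split("\t").drop(9).toVector

  // Resolve a GT value such as "0/1" or "1|1" against REF and ALT.
  // Any non-numeric allele (for example ".") makes the whole call missing.
  def resolveAlleles(gt: String, ref: String, alts: Array[String]): Option[Seq[String]] = {
    val indices = gt.split("[/|]")
    if (indices.exists(i => i.isEmpty || !i.forall(_.isDigit))) None
    else Some(indices.toSeq.map { i =>
      val idx = i.toInt
      if (idx == 0) ref else alts(idx - 1)
    })
  }

  // Convert one VCF data line into one flat record per sample.
  def parseLine(line: String, samples: Vector[String]): Vector[FlatGenotypeSketch] = {
    val cols = line.split("\t")
    val (contig, pos, ref) = (cols(0), cols(1).toLong, cols(3))
    val alts = cols(4).split(",")
    // Position of GT within the FORMAT column, or -1 if there is none.
    val gtIndex = cols.lift(8).map(_.split(":").indexOf("GT")).getOrElse(-1)
    samples.zip(cols.drop(9)).map { case (sample, field) =>
      val gt = if (gtIndex >= 0) field.split(":").lift(gtIndex) else None
      FlatGenotypeSketch(contig, pos, ref, sample, gt.flatMap(resolveAlleles(_, ref, alts)))
    }
  }

  // Consume lines lazily so that a large VCF is never held in memory at once.
  def parse(lines: Iterator[String]): Iterator[FlatGenotypeSketch] = {
    var samples = Vector.empty[String]
    lines.flatMap { line =>
      if (line.startsWith("##")) Iterator.empty
      else if (line.startsWith("#")) { samples = sampleIds(line); Iterator.empty }
      else parseLine(line, samples).iterator
    }
  }
}
```

The sketch could be driven with `FlatVcfSketch.parse(scala.io.Source.fromFile(path).getLines())`, where `path` is a placeholder. Because the result is an iterator, records can be handed to one or more writers as they are produced.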

@AmplabJenkins

Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/ADAM-prb/349/


class Vcf2FlatGenotypeArgs extends Args4jBase with ParquetArgs {
@Argument(required = true, metaVar = "VCF", usage = "The VCF file to convert", index = 0)
var bamFile: String = null
Contributor

bamFile?

Contributor Author

Whoops, a hold-over from copy-and-pasting the BAM2ADAM code as my starting point :-)

Contributor Author

Fixed in the latest update. Thanks, Arun.

@AmplabJenkins

All automated tests passed.
Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/ADAM-prb/350/

@AmplabJenkins

All automated tests passed.
Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/ADAM-prb/351/

@AmplabJenkins

All automated tests passed.
Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/ADAM-prb/354/

@AmplabJenkins

All automated tests passed.
Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/ADAM-prb/357/

@AmplabJenkins

All automated tests passed.
Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/ADAM-prb/362/

@AmplabJenkins

All automated tests passed.
Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/ADAM-prb/364/

@massie
Member

massie commented Jun 12, 2014

This code looks good to me. Does anyone object to just merging it as is?

def run() = {

// Quiet parquet...
ParquetLogger.hadoopLoggerLevel(Level.SEVERE)
Member

This should already be done, no? Is this needed because this is not a Spark command?

Contributor Author

This is probably a copy-and-paste mistake.

@fnothaft
Member

@massie this needs to be squashed before merging.

I'm OK with merging. Long term, I'd like to see:

  • A converter back to VCF
  • Conversion functions for going between the flat genotype and the main ADAMGenotype object

@tdanford @carlyeks how do you see ADAMFlatGenotype relating to ADAMGenotype? Is ADAMFlatGenotype meant to be a lower overhead representation for genotypes, a potential replacement for ADAMGenotype, or...?

@tdanford
Contributor Author

Just to be clear, I see ADAMFlatGenotype as an experimental alternative to ADAMGenotype -- mostly focused on building a data structure that was easier (and faster) to create/write. That's why we didn't write all the converters that you describe above in your comments.

I think in the long term, there's room for an alternative to ADAM[Flat]Genotype which doesn't aggregate alleles by position (as both Genotype and FlatGenotype do), and that's maybe something this model could evolve into, but at the moment this isn't that.

The particular selection of fields that went into ADAMFlatGenotype, and the way they went in, is related to the work that we are doing to prototype building internal variant stores for a user (Jason Flannick) here at the Broad.

@AmplabJenkins

Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/ADAM-prb/369/

@massie
Member

massie commented Jun 14, 2014

Jenkins, test this please.

@massie
Member

massie commented Jun 14, 2014

Jenkins, test this please.

@AmplabJenkins

Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/ADAM-prb/373/

@carlyeks
Member

Jenkins, test this please.

@AmplabJenkins

Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/ADAM-prb/379/

@AmplabJenkins

All automated tests passed.
Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/ADAM-prb/380/

@fnothaft
Member

@tdanford I think the only thing pending on this is to wrap the lines in ADAMFlatGenotypeField, and to squash the commit down and rebase on ToT. I'll merge when ready.

…hema.

Also:
1. Added support for missing genotypes in the VCFLineConverter for
   ADAMFlatGenotype.
2. Added a ReferenceMapping for the ADAMFlatGenotype
3. the flat genotype converter uses multiple writers for smaller output
   files.
4. Don't ignore VCF files in .gitignore
@carlyeks
Member

@fnothaft Actually, it looks like scalariform doesn't like that formatting -- it keeps putting it back to how it is now. Bad formatting rules?

I've squashed and rebased on master.

@fnothaft
Member

@carlyeks I had noticed that yesterday as well. Alas! Thanks for squashing and rebasing; I'll merge after the tests run.

Jenkins, test this please.

@AmplabJenkins

All automated tests passed.
Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/ADAM-prb/383/

fnothaft added a commit that referenced this pull request Jun 17, 2014
ADAMFlatGenotype is a smaller, flat version of a genotype schema
@fnothaft fnothaft merged commit 6827783 into bigdatagenomics:master Jun 17, 2014
@fnothaft
Member

Merged! Thanks @tdanford and @carlyeks!

@carlyeks carlyeks deleted the flat-genotype-rebased branch June 17, 2014 15:30