Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add schema for sample #84

Merged
merged 1 commit into from
Jun 27, 2016
Merged

Add schema for sample #84

merged 1 commit into from
Jun 27, 2016

Conversation

heuermh
Copy link
Member

@heuermh heuermh commented Jun 6, 2016

@AmplabJenkins
Copy link

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/bdg-formats-prb/101/
Test PASSed.

@heuermh heuermh changed the title Adding schema for sample Add schema for sample Jun 6, 2016
@jpdna
Copy link
Member

jpdna commented Jun 7, 2016

see my comment bigdatagenomics/adam#1039 (comment)
about moving sampleDescription and processingDescription into this Sample

@jpdna
Copy link
Member

jpdna commented Jun 7, 2016

@heuermh - if you want to keep this PR focused on schema you can assign an issue to me to update the ETL code to load and save using the new Sample and with these fields in Genotype moved.

@heuermh
Copy link
Member Author

heuermh commented Jun 7, 2016

How are those fields populated? What do they mean? Do they map to SRA metadata fields, which per earlier conversation and the doc comments all go into attributes?

@jpdna
Copy link
Member

jpdna commented Jun 7, 2016

How are those fields populated?

VCF spec, in context of "5.4.10 Sample mixtures" has

##SAMPLE=<ID=Blood,Genomes=Germline,Mixture=1.,Description="Patient germline genome">

but I'm not so sure that is going into one of these fields.
I'd suspect they don't get populated now.
However, these two fields sampleDescription and processingDescription seems reasonable enough in how they are currently described in comments in bdg-formats genotype and not expensive if they are moved to Sample - and can be null

@heuermh
Copy link
Member Author

heuermh commented Jun 7, 2016

The cardinality for Genomes, Mixture, and Description is 0..*, so they would need to go in array fields. Or alternatively a single array of Genome records, which would have name, mixture, and description fields.

However, the doc already specifically recommends those go in attributes, where cardinality doesn't matter (repeated keys are ok).

@jpdna
Copy link
Member

jpdna commented Jun 7, 2016

the doc already specifically recommends those go in attributes

putting in attributes is fine with me

@heuermh
Copy link
Member Author

heuermh commented Jun 7, 2016

Although now that I've said that repeated keys in attributes are ok, they are not, since we're using an avro map<string>. In which case last in wins.

@jpdna
Copy link
Member

jpdna commented Jun 7, 2016

I'm not so convinced that these two fields are coming from this sample mixture related tag anyhow, or that they have any source in VCF currently, so maybe wait to worry about modeling Genomes, Mixture, Description cardinality for now until we know more.

I don't think these fields get populated anyhow - I am more concerned for the principle that in case sample level metadata like this appears in future and somehow are shoe-horned into these fields that it not appear in our inner-inner loop inside of Genotype

@fnothaft
Copy link
Member

Do we want to get this into the next bdg-formats release? CC @heuermh @jpdna

@heuermh
Copy link
Member Author

heuermh commented Jun 15, 2016

Yes, and #83. I'll rebase to resolve conflicts in a sec

@heuermh
Copy link
Member Author

heuermh commented Jun 15, 2016

Attributes with duplicate keys are going to get clobbered. That wasn't a major issue in the Feature record, where this also came up, but will be here. The cardinality of the VCF sample attributes and nearly all of the SRA attributes are 0..*.

Is this a reasonable workaround?

  // ...
  array<Attribute> attributes = [];
}

record Attribute {
  string key;
  string value;
}

or do we want to reopen discussion about a bunch of array fields?

@AmplabJenkins
Copy link

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/bdg-formats-prb/102/
Test PASSed.

@AmplabJenkins
Copy link

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/bdg-formats-prb/104/
Test PASSed.

@heuermh heuermh modified the milestone: 0.8.1 Jun 22, 2016
@fnothaft fnothaft merged commit 244e736 into bigdatagenomics:master Jun 27, 2016
@heuermh heuermh deleted the sample branch June 27, 2016 19:48
@heuermh
Copy link
Member Author

heuermh commented Jun 27, 2016

Thanks!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants