VCF sample metadata - proposal for a GenotypedSampleMetadata object #1039
Comments
@heuermh voiced a -0 RE: having |
Maybe not the best place for this, but I hadn't tried writing such a chart before:
References
Reads
Variants
Variant annotation (VCF ANN spec)
Sequence features
Other
Currently GA4GH doesn't model Sample; rather, sampleId is used in CallSet and ReadGroup. bdg-formats uses sample name in RecordGroupMetadata and AlignmentRecord and sampleId in Genotype. The Sample model I'm most familiar with is from SRA, modeled in XML schema here. There are of course 15 other competing standards. I'm not suggesting we get carried away modeling samples, just add enough to support useful queries and get cardinality relationships right. |
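To make the cardinality point concrete, here is a minimal sketch in Python (not Avro) of the kind of query a sample record would support: grouping genotype calls under the one sample each refers to by id. The class names, field names and example values are hypothetical stand-ins, not the bdg-formats or GA4GH definitions.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Sample:
    # Hypothetical minimal sample record; only the id is needed for the join.
    sample_id: str
    name: str


@dataclass(frozen=True)
class Genotype:
    # Stand-in for a genotype call that refers to its sample by id only.
    sample_id: str
    variant: str


def genotypes_by_sample(samples, genotypes):
    """Group genotype calls under the sample they reference (one sample, many calls)."""
    known = {s.sample_id: s for s in samples}
    grouped = defaultdict(list)
    for g in genotypes:
        if g.sample_id not in known:
            # A call pointing at a missing sample breaks the relationship.
            raise KeyError(f"genotype refers to unknown sample {g.sample_id!r}")
        grouped[g.sample_id].append(g)
    return {known[sid]: calls for sid, calls in grouped.items()}
```

Keeping the sample as its own record means each genotype carries only the id, and anything describing the sample is stored once.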
Here's a proposal for flattening the SRA sample XSD to Avro:
Although after all that it is not clear where Example: ENA default sample checklist (values that should go in sample attributes, maybe some of these could move to fields): If you prefer reading Javadoc over XSD docs, we generated JAXB mappings here, though they might be out of date. |
We do want our sample metadata schema in bdg-formats to be able to map to the SRA schema, but IMO we may not want to adopt the SRA schema as fully as the proposal above. Many SRA fields only make sense when running a data archive/repository like NCBI/EBI, and for our more general audience they may create noise and ambiguity about which field a given piece of data belongs in. For example, I'd suggest a minimal schema for the bdg-formats SampleRecord with fields:
For data derived from SRA/ENA, we could provide suggested keys for the I think such a minimal schema for the AssayedSampleID, AssayedSampleIdAlias and SampleName provides clarity to that primary cardinality relationship, but at the same time leaves plenty of room in the |
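A rough sketch of what such a minimal record might look like, written as a Python dataclass rather than Avro. The three field names come from the comment above; the `attributes` map, the choice of which fields are nullable, and the example values are assumptions made for illustration only.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class SampleRecord:
    # The three identifying fields named in the comment above.
    assayed_sample_id: str
    assayed_sample_id_alias: Optional[str] = None  # assumed nullable
    sample_name: Optional[str] = None              # assumed nullable
    # Assumed free-form key/value map for everything else,
    # e.g. values carried over from SRA/ENA.
    attributes: Dict[str, str] = field(default_factory=dict)


record = SampleRecord(
    assayed_sample_id="sample-0001",       # made-up identifier
    sample_name="NA12878",
    attributes={"source_archive": "ENA"},  # hypothetical key
)
```

Archive-specific values stay out of the core record and travel in the map, so the identifying fields remain unambiguous.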
From what I understand, storing data in That would seem to lead to a design principle of preferring nullable fields to attributes. |
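A small sketch of the two options being weighed, in Python rather than Avro. The `collection_date` name and the ISO date format are hypothetical choices for illustration; neither is taken from the proposed schema.

```python
from dataclasses import dataclass, field
from datetime import date
from typing import Dict, Optional


@dataclass
class SampleWithField:
    sample_id: str
    # Option 1: a nullable, typed field declared in the schema.
    collection_date: Optional[date] = None


@dataclass
class SampleWithAttributes:
    sample_id: str
    # Option 2: everything optional goes into a string-to-string map.
    attributes: Dict[str, str] = field(default_factory=dict)


def collection_date_from_attributes(sample: SampleWithAttributes) -> Optional[date]:
    """Every reader must agree on the key and parse the string value itself."""
    raw = sample.attributes.get("collection_date")
    return date.fromisoformat(raw) if raw is not None else None
```

With the field, the name and type are fixed by the schema; with the map, they are a convention each reader has to know and re-implement.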
But isn't the size of data in these sample metadata records so trivially small that we need not worry about performance inefficiency in this case? If we feel SRA-sourced metadata is a major use case and we want to make nullable field names which map to SRA explicitly, rather than suggested key/value attributes, then I suggest prefixing the nullable names to be like: I'd still then like to have the basic three fields |
Yep. I'm still trying to figure out what the design principles are for our schema, so I'm trying argumentation. :)
I don't believe the fields of SRA SAMPLE are the interesting bits, rather what might be stored as attributes according to e.g. the ENA default sample checklist linked above. Starting from a minimal record of
would be fine with me. Then since the ENA checklist is the "minimum information required for the sample" for ENA, and I assume one can dig up minimal requirements for SRA, CGHub, dbGaP, etc., which should be similar, we might want to add some of those keyed values as nullable fields. |
+1 to the minimal |
+1 to the |
I was hoping that proposal would be the start of discussion, not the end of it. Is it correct that nullable fields are preferable to attributes? If so, are there any attributes of the ENA checklist or other minimal requirements that we should add as fields? |
Looking at the I'd suggest to add |
Strong +1 @jpdna |
I'm +1 removing them from Genotype. -0 adding them as fields to Sample. |
New schema for sample was added in bigdatagenomics/bdg-formats#84. That pull request did not remove any fields from |
Closed by commit 4b6e107. |
Proposal for a GenotypedSampleMetadata Avro schema, separate from RecordGroup, so as not to confound the concept of read/record group metadata with VCF sample metadata.
As discussed in #1015.