VCF sample metadata - proposal for a GenotypedSampleMetadata object #1039

jpdna · 2016-05-26T16:30:42Z

proposal for a GenotypedSampleMetadata avro schema separate from RecordGroup so as to not confound the concept of Read/Record group metadata and VCF sample metadata.

As discussed in #1015

fnothaft · 2016-05-26T16:42:30Z

@heuermh voiced a -0 RE: having Metadata in any names.

heuermh · 2016-05-26T17:57:04Z

Maybe not the best place for this, but I hadn't tried writing such a chart before:

References

GA4GH	bdg-formats	ADAM
Reference	Contig
ReferenceSet		SequenceDictionary
	NucleotideContigFragment
Position		ReferencePosition
		ReferencePositionPair
		ReferenceRegion

Reads

GA4GH	bdg-formats	ADAM
ReadGroup	RecordGroupMetadata
ReadGroupSet		RecordGroupDictionary
ReadStats
LinearAlignment
ReadAlignment	AlignmentRecord
CigarOperation
CigarElement
Program		ProgramRecord
	Fragment
		ReadBucket
		SingleReadBucket

Variants

GA4GH	bdg-formats	ADAM
Variant	Variant
VariantSetMetadata
VariantSet
CallSet
Call	Genotype
	GenotypeAllele
	GenotypeType
	VariantCallingAnnotations
	DatabaseVariantAnnotation
	StructuralVariant
	StructuralVariantType
		VariantContext

Variant annotation (VCF ANN spec)

GA4GH	bdg-formats	ADAM
VariantAnnotationSet
VariantAnnotation	VariantAnnotation
TranscriptEffect	TranscriptEffect
	VariantAnnotationMessage
AnalysisResult
AlleleLocation
HGVSAnnotation

Sequence features

GA4GH	bdg-formats	ADAM
Feature	Feature
FeatureSet
Strand	Strand
ExternalIdentifier	Dbxref
OntologyTerm	OntologyTerm
Attributes
		Gene
		Transcript
		Exon
		CDS
		UTR

Other

GA4GH	bdg-formats	ADAM
Analysis
Experiment
Dataset

Currently GA4GH doesn't model Sample, rather sampleId is used in CallSet and ReadGroup. bdg-formats uses sample name in RecordGroupMetadata and AlignmentRecord and sampleId in Genotype.

The Sample model I'm most familiar with is from SRA, modeled in XML schema here
https://www.ebi.ac.uk/ena/submit/metadata-model

There are of course 15 other competing standards. I'm not suggesting we get carried away modeling samples, just add enough to support useful queries and get cardinality relationships right.

heuermh · 2016-05-26T23:11:14Z

Here's a proposal for flattening the SRA sample XSD to avro

record Sample {

  // flattened IDENTIFIERS
  union { null, string } primaryId = null;
  union { null, QualifiedName } submitterId = null;
  array<QualifiedName> externalIds = [];
  array<string> secondaryIds = [];
  array<string> uuids = [];

  // flattened SAMPLE_NAME
  union { null, string } anonymizedName = null;
  union { null, string } commonName = null;
  union { null, string } individualName = null;
  union { null, string } scientificName = null;
  union { null, int } taxonId = null;

  // SAMPLE
  union { null, string } accession = null;
  union { null, string } alias = null;
  union { null, string } centerName = null;
  union { null, string } brokerName = null;
  union { null, string } title = null;
  union { null, string } description = null;

  // SAMPLE_LINKS
  array<XrefLink> xrefLinks = [];
  array<UrlLink> urlLinks = [];
  array<EntrezLink> entrezLinks = [];

  // SAMPLE_ATTRIBUTES
  map<string> attributes = {};
}

Although after all that it is not clear where sampleId or sampleName from the current schema should go, and often the stuff you'd most want to have for interesting queries would be on the experimental design side, not necessarily attributes of the sample.

Example:
https://www.ebi.ac.uk/ena/data/view/ERS573550&display=xml

ENA default sample checklist (values that should go in sample attributes, maybe some of these could move to fields):
https://www.ebi.ac.uk/ena/data/view/ERC000011&display=xml

If you might prefer reading javadoc over XSD docs, we generated jaxb mappings here, though they might be out of date.

jpdna · 2016-05-27T04:56:21Z

We do want our sample metadata schema in bdg-formats to be able to map to the SRA schema - but IMO we may not want to adopt the SRA schema as fully as the proposal above because many SRA fields only make sense in the use case of running a data archive/repository like NCBI/EBI - and may create noise and ambiguity for users in where to place data in which field for our more general audience.

For example - submitterId is the field for the "local" submitter name for an assayed sample - but really makes sense only in context of metadata being submitted to a public repository. Similarly primaryId is the NCBI/EBI assigned accession. We may want to capture those fields from datasets that come from public archives, but having a submitterId or primaryId centerName or brokerName may make little sense for some of our users and encourage ad hoc re-purposing of these fields and an ensuing mess interpreting these records. Similarly, capturing the species common name, scientific name, and taxonID - seems more a higher level archive meta-data function that we don't need to capture in the primary schema - if anything they can go in the key-value section.

I'd suggest for the bdg-formats SampleRecord a minimal schema with fields:

AssayedSampleId this represent the name for the physical sample that was assayed, and possibly also maps to the ReadGroup name in the upstream BAM file. In case of SRA maps to "submitterID".
AssayedSampleIdAlias meant to be used if the name in the upstream source file - like the readGroup name, is different than the above - as we want to capture a direct mapping to readGroup if at all possible.
SampleName - used to capture a unique name for a person/individual such as NA12878. Maps to individualName in the SRA schema.
possibly some standardized dbxref or uri array mean to link to external ids
the attributes key value map of custom fields

For data derived from SRA/ENA, we could provide suggested keys for the attributes map that directly capture all the fields from SRA that we want to explicitly map, without having to complicate our main schema with the archive repository specific details.

I think such a minimal schema for the AssayedSampleID, AssayedSampleIdAlias and SampleName provides clarity to that primary cardinality relationship, but at same time leaves plenty of room in the attributes key value section, with associated documentation and guidance we might provide, to capture the full glory of the SRA schema - and future other metadata sources.

heuermh · 2016-05-27T15:11:24Z

From what I understand, storing data in attributes is inefficient given how ADAM and Spark operate on schema records (projections, filters, column compression on disk, SparkSQL queries, etc.). For example, it is not possible to project only a single key from the attributes field.

That would seem to lead to a design principle of preferring nullable fields to attributes.

jpdna · 2016-05-27T15:27:19Z

But isn't the size of data in these sample metadata records so trivially small that we need not worry about performance inefficiency in this case?

If we feel SRA sourced metadata is a major use case and we want to make nullable field names which map to sra explicitly rather than suggested key/value attributes then I suggest prefixing the nullable names to be like:
-sra_submitterId
-sra_primaryId
etc. to make it clear that the meaning of these fields is for SRA mapping use case, and makes it easy for users who don't deal with SRA data to ignore these all.

I'd still then like to have the basic three fields AssayedSampleID AssayedSampleIDAlias SampleName as I proposed above for the more general audience, and we could suggest in the SRA use case that users map the appropriate field there as well so as to make combining/comparing SRA and non-SRA records easier.

heuermh · 2016-05-27T15:57:39Z

But isn't the size of data in these sample metadata records so trivially small that we need not worry about performance inefficiency in this case?

Yep. I'm still trying to figure out what the design principles are for our schema, so I'm trying argumentation. :)

If we feel SRA sourced metadata is a major use case ...

I don't believe the fields of SRA SAMPLE are the interesting bits, rather what might be stored as attributes according to e.g. the ENA default sample checklist linked above.

Starting from a minimal record of

record Sample {
  union { null, string } id = null;  // primary key
  union { null, string } alias = null;  // alternate key for joining with ReadGroups
  union { null, string } name = null;  // descriptive name, e.g. SRA individualName
  map<string> attributes = {};
}

would be fine with me.

Then since the ENA checklist is the "minimum information required for the sample" for ENA, and I assume one can dig up minimal requirements for SRA, CGHub, dbGap, etc., which should be similar, we might want to add some of those keyed values as nullable fields.

jpdna · 2016-05-27T16:08:30Z

Starting from a minimal record of

+1 to the minimal Sample record @heuermh proposes above above being added as a member of bdg-formats avro schema and being the type written by vcf2adam into the _samples.avro file alongside the parquet files.

fnothaft · 2016-05-28T01:24:10Z

+1 to the Sample @heuermh proposed as well!

heuermh · 2016-05-31T15:33:16Z

I was hoping that proposal would be the start of discussion, not the end of it.

Is it correct that nullable fields are preferable to attributes? If so, are there any attributes of the ENA checklist or other minimal requirements that we should add as fields?

jpdna · 2016-06-07T00:24:15Z

Looking at the genotype bdg-formats schema just now, I really don't like the repeated per-sample data such as sampleDescription and processingDescription. These should be normalized out into the new Sample record so as to not cause memory bloat we we actually materialize Genotype objects.

I'd suggest to add
sampleDescription
processingDescription
to the new Sample record and to remove those fields from Genotype

fnothaft · 2016-06-07T00:28:14Z

Strong +1 @jpdna

heuermh · 2016-06-07T00:49:18Z

I'm +1 removing them from Genotype. -0 adding them as fields to Sample.

heuermh · 2016-06-27T21:04:32Z

New schema for sample was added in bigdatagenomics/bdg-formats#84.

That pull request did not remove any fields from Genotype or add code to ADAM to read or write data in those formats. @jpdna would you like to take up those tasks?

heuermh · 2016-07-19T14:42:01Z

Closed by commit 4b6e107.

jpdna mentioned this issue May 26, 2016

[ADAM-909] Refactoring variation RDDs. #1015

Merged

heuermh modified the milestone: 0.20.0 Jun 5, 2016

heuermh mentioned this issue Jun 6, 2016

Add schema for sample bigdatagenomics/bdg-formats#84

Merged

This was referenced Jun 7, 2016

Update feature and related records to support GFF3 bigdatagenomics/bdg-formats#82

Merged

Release ADAM version 0.20.0 #1048

Closed

fnothaft mentioned this issue Jul 7, 2016

[ADAM-1039] Add basic support for Sample record. #1077

Closed

heuermh closed this as completed Jul 19, 2016

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

VCF sample metadata - proposal for a GenotypedSampleMetadata object #1039

VCF sample metadata - proposal for a GenotypedSampleMetadata object #1039

jpdna commented May 26, 2016 •

edited

Loading

fnothaft commented May 26, 2016

heuermh commented May 26, 2016 •

edited

Loading

heuermh commented May 26, 2016

jpdna commented May 27, 2016

heuermh commented May 27, 2016

jpdna commented May 27, 2016

heuermh commented May 27, 2016

jpdna commented May 27, 2016

fnothaft commented May 28, 2016

heuermh commented May 31, 2016

jpdna commented Jun 7, 2016

fnothaft commented Jun 7, 2016

heuermh commented Jun 7, 2016

heuermh commented Jun 27, 2016

heuermh commented Jul 19, 2016

VCF sample metadata - proposal for a GenotypedSampleMetadata object #1039

VCF sample metadata - proposal for a GenotypedSampleMetadata object #1039

Comments

jpdna commented May 26, 2016 • edited Loading

fnothaft commented May 26, 2016

heuermh commented May 26, 2016 • edited Loading

heuermh commented May 26, 2016

jpdna commented May 27, 2016

heuermh commented May 27, 2016

jpdna commented May 27, 2016

heuermh commented May 27, 2016

jpdna commented May 27, 2016

fnothaft commented May 28, 2016

heuermh commented May 31, 2016

jpdna commented Jun 7, 2016

fnothaft commented Jun 7, 2016

heuermh commented Jun 7, 2016

heuermh commented Jun 27, 2016

heuermh commented Jul 19, 2016

jpdna commented May 26, 2016 •

edited

Loading

heuermh commented May 26, 2016 •

edited

Loading