This repository has been archived by the owner on Oct 28, 2022. It is now read-only.

Dataset role not clearly defined #248

Closed
jeromekelleher opened this issue Mar 6, 2015 · 41 comments

Comments

@jeromekelleher
Contributor

Currently, the Dataset type is poorly specified, and it is defined in reads.avdl. We have some comments like

TODO: Reads and variants both want to have datasets. Are they the same object?

This needs to be clarified.

I propose:

  1. We change the name to DataSet, so that we are consistent with VariantSet, ReferenceSet, ReadGroupSet and others.
  2. We make the role of the DataSet much more explicit, and formally define the data model as a hierarchy. (This also relates to "Query semantics for parent IDs inconsistent and problematic" #247, since my proposal there assumes a hierarchy.) In this hierarchy, the DataSet is the root, and it has child VariantSets and ReadGroupSets. Each VariantSet has child Variants; each ReadGroupSet has child ReadGroups, and ReadGroups have child Reads. We don't need to make any changes to the API to do this; we just need to be more explicit about what the model means.

This formalisation should make reasoning about authorisation much more straightforward.
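
A minimal sketch of the proposed hierarchy, assuming each object stores a single parent ID. The `Node` type, the `owning_dataset` helper and all IDs are hypothetical and are not part of the schemas; the point is only that, with a strict hierarchy, any object resolves to exactly one DataSet, which is where an authorisation check could be attached.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass(frozen=True)
class Node:
    """One object in the proposed hierarchy (hypothetical helper type)."""
    id: str
    kind: str                 # e.g. "DataSet", "ReadGroupSet", "ReadGroup", "Read"
    parent_id: Optional[str]  # None only for the root DataSet


def owning_dataset(node_id: str, nodes: Dict[str, Node]) -> Node:
    """Walk parent links upwards until the root is reached."""
    node = nodes[node_id]
    while node.parent_id is not None:
        node = nodes[node.parent_id]
    if node.kind != "DataSet":
        raise ValueError(f"root of {node_id!r} is a {node.kind}, not a DataSet")
    return node


# Illustrative data only: one DataSet with a read branch and a variant branch.
nodes = {n.id: n for n in [
    Node("ds1", "DataSet", None),
    Node("rgs1", "ReadGroupSet", "ds1"),
    Node("rg1", "ReadGroup", "rgs1"),
    Node("read1", "Read", "rg1"),
    Node("vs1", "VariantSet", "ds1"),
    Node("var1", "Variant", "vs1"),
]}
```

Here `owning_dataset("read1", nodes)` and `owning_dataset("var1", nodes)` both resolve to the same root, so a permission decision made for `ds1` would cover both branches.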

Thoughts?

@jeromekelleher
Contributor Author

This hierarchy idea is broken somewhat by the Call and CallSet classes:

  1. CallSet has a field variantSetIds, so it can belong to more than one VariantSet. Do we really want CallSets to belong to more than one VariantSet? It would be a lot simpler if we had one call belonging to exactly one Variant and one CallSet.
  2. SearchCallsRequest is badly overloaded: we can search for calls contained in a list of callSetIds, variantSetIds and variantIds. We should replace all of these with a single callSetId, since (a) a CallSet belongs to a VariantSet, so specifying the variantSetId is redundant; and (b) obtaining the Calls associated with a given Variant should be done directly using a SearchVariantsRequest. We should ask what the point of a SearchCallsRequest is at all in this case, as the idea of getting all the Calls in a CallSet without reference to their parent Variants seems somewhat obscure...
  3. SearchAlleleCalls suffers from similar problems, having four different lists of IDs that we can provide as parameters.
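
To make the simplification concrete, here is a rough sketch in which every Call has exactly one parent Variant and one CallSet, and the search takes a single CallSet ID. The `Call` dataclass and `search_calls` function are hypothetical stand-ins for illustration, not the actual SearchCallsRequest definition.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Call:
    """A call with exactly one parent Variant and one CallSet (the proposal)."""
    variant_id: str
    call_set_id: str
    genotype: Tuple[int, ...]


def search_calls(calls: List[Call], call_set_id: str) -> List[Call]:
    """Hypothetical simplified search: one callSetId instead of three ID lists."""
    return [c for c in calls if c.call_set_id == call_set_id]


# Illustrative data only.
calls = [
    Call("var1", "cs1", (0, 1)),
    Call("var1", "cs2", (1, 1)),
    Call("var2", "cs1", (0, 0)),
]
```

With this shape, the owning VariantSet never needs to be passed in: it is implied by the CallSet.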

@dcolligan
Member

I would also like to know what function the Dataset was supposed to serve.

@richarddurbin
Contributor

It encapsulates a set of data with shared access control (most important) and, typically, shared provenance, for example the data from a single deposited VCF.
There may also be namespace implications, e.g. sample names are unique in a dataset, but not necessarily between.
Something like this is necessary in practice when data from more than one source is stored in a server.

Richard


The Wellcome Trust Sanger Institute is operated by Genome Research
Limited, a charity registered in England with number 1021457 and a
company registered in England with number 2742969, whose registered
office is 215 Euston Road, London, NW1 2BE.

@calbach
Contributor

calbach commented May 15, 2015

I agree with what @richarddurbin said. To add to it, the role is much clearer for reads where there would otherwise be no hierarchical collection beyond a ReadGroupSet. The value of a Dataset for variants was somewhat mitigated by the introduction of VariantSets, which serve a similar purpose. Also the intent was that they are the same object across VariantSets and ReadGroupSets. It is natural to have reads and related variants for a particular study in the same dataset, for ease of sharing and management.

CallSet has a field variantSetIds, so it can belong to more than one VariantSet. Do we really want CallSets to belong to more than one VariantSet? It would be a lot simpler if we had one call belonging to exactly one Variant and one CallSet.

I didn't realize this was the case. I don't think CallSets should be able to span VariantSets; maybe I'm missing some background information as to why this would be desirable.

@pgrosu
Contributor

pgrosu commented May 15, 2015

@calbach Will the same sample be part of different VariantSets? Can two VariantSets be technical replicates in a larger study?

@diekhans
Contributor

If it's specific to reads, then it should have a less generic name.

If it's intended to be more general, it should be moved out of reads.

Whatever is intended, this intention needs to be documented.

@pgrosu
Contributor

pgrosu commented May 15, 2015

I'll reference myself from what I posted previously under server: ga4gh/ga4gh-server#376 (comment)

So think of a Dataset as a study that was performed, which joins all the relevant data associated with it in a logical grouping. It is a collection of one or more of the following:

  • ReadGroup
  • ReadGroupSet
  • VariantSet
  • FeatureSet

It also encompasses what @richarddurbin is emphasizing. Remember this diagram I posted almost a year ago here:

[Image: sample, read, and variant connected workflow structure]

Notice how the permissions propagate down to the Variant level, starting from the permissions of the user who generated the Project/Study. At any time these can be made public so that everyone can view them. By using a controlled vocabulary to reference other samples, one can link multiple datasets together. So yes, the ability to query multiple datasets is important, since no single dataset can contain all the relevant data one will ever need. As previously suggested in #253, you will need the ability to query multiple datasets, especially for metadata among other things. Also, having one call belong to exactly one Variant and one CallSet might make indexing costly, though there are other approaches that would cut down those costs. We are still below version 1.0, and adding too many restrictions now would limit what we can do with the API.

Hope it helps,
Paul

@dglazer
Member

dglazer commented May 19, 2015

Agreeing with and elaborating on @richarddurbin and @calbach 's responses to @dcolligan 's question:

I would also like to know what function the Dataset was supposed to serve.

TL;DR: they're a convenient way for data providers to lump together related data of multiple types.

  • for access control: if a server wants to host data with heterogeneous access (e.g. a public copy of 1000genomes and a private copy of one researcher's study), there needs to be a place for data providers to specify who can see what. Datasets are a convenient level of granularity for doing so.
  • for billing: if a server wants to charge multiple data providers for the resources needed to store their data, it needs to be clear who gets charged for which data. Datasets are a convenient level of granularity for doing so.
  • for provenance: the server doesn't care about where data comes from (at least not with today's methods), but users do. Datasets are a convenient way to group all data from one logical source (e.g. a study).
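
As a rough illustration of the access-control point, the sketch below attaches a reader list to each dataset and filters what a user may see. The `PUBLIC` marker, the dataset IDs, the user names and the `visible_datasets` helper are all assumptions made for this example; the API itself defines no such mechanism.

```python
from typing import Dict, List, Set

PUBLIC = "*"  # hypothetical marker meaning "anyone may read"

# Hypothetical access lists, keyed by dataset ID.
dataset_readers: Dict[str, Set[str]] = {
    "1000genomes": {PUBLIC},
    "private-study": {"alice"},
}


def visible_datasets(user: str, acl: Dict[str, Set[str]]) -> List[str]:
    """Return the IDs of the datasets this user is allowed to read."""
    return sorted(
        dataset_id
        for dataset_id, readers in acl.items()
        if PUBLIC in readers or user in readers
    )
```

Everything inside a dataset inherits the decision made at this level, which is what makes the dataset a convenient granularity for a data provider.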

@mbaudis
Member

mbaudis commented May 19, 2015

@pgrosu's comment reflects my take on discussions we had in metadata (e.g. see the comment thread in https://docs.google.com/document/d/1Sl6FYwBHjWYYo2Ex29fqvmEiOQjeokgAUfXNGsVQwZ0/edit#heading=h.p2r9dh51rf8 ). I was proposing a generic group object, whatever it ends up being called, to represent a static or dynamically generated group of data objects (e.g. all samples or analyses matching a given set of criteria).
Currently, there is just an "IndividualGroup" record type, which is rather conservative in mainly referring to a "group of individuals e.g. a trio". I would love to see some traction on defining a consistent mechanism/object type for this.

@diekhans
Contributor

The problem is that the purpose is not documented in the schema.
There should be either a pull request with documentation or one
for its removal.

If it's to apply to more than reads, it needs to be moved out of
reads.

It also seems inadequate for any of these goals, so some use
cases really need to be investigated:

  • access control - one would have a very hard time
    implementing TCGA's access control policy with dataset as
    the key for access control. It would also mean that data would
    move between datasets during the life of the project.
  • billing - vendor-specific tasks, such as billing, should be
    outside of the scope of GA4GH. Vendors should have the
    flexibility to design policies of their choosing and the API
    should not be cluttered with things that might prove useful
    to someone.
  • provenance - an extremely important problem and something that
    needs a mechanism that has finer granularity than dataset.
    Having this in here because it might prove useful to
    provenance isn't a good way to implement this critical
    feature.

I have now convinced myself that dataset should be removed.

Also, having objects reference their containers (datasetId) is
inflexible and cumbersome. We would be far better off with a
functional representation of the data model where objects are
not bound to their containers.

Mark

@pgrosu
Contributor

pgrosu commented May 20, 2015

So maybe we might want to start discussing how do we really want to use this API and its ideal purpose? Would it be to enable better elucidation of diseases for research purposes, or what might be some of the billing and dataset-as-permission incentives? The reason I pose this question is that, usually it is better to look at all the data than just part of it. If different individuals/teams have different access to it, that would make it difficult to properly identify the causes of some diseases in order to properly perform the followup experiments or recommend treatment. Imagine if BLAST - when it was first introduced in the 90's - provided varied permission levels to different users with different billing structures. Ideally in our case, the individuals would be anonymized so that access to all the data can still be used for research purposes. This would be the ideal way to effectively enable the system for both research and personalized medicine.

Paul

@richarddurbin
Contributor

I am concerned about removing Dataset without replacing it with something that has already
been proved to work better.

I also found DataSet vague when I originally came across it, and wanted to change or remove it.
But the reason it is there is that Google, who have a working read and variant repository across
multiple studies/projects, needed this wrapper concept to support their real world system. Then
I realised that we have exactly the same thing within Sanger - we call it a Study. I conclude that
it is essential to have a wrapper concept for practical genetic variation data repositories, and that
despite all sorts of theorising about what might be ideal, a single layer wrapper as provided by DataSet
is practically sufficient in quite complex settings.

I vote strongly for experience, and not removing something known to work without replacing it
and convincing people who actually manage diverse repositories that the replacement is better.

Richard

@mbaudis
Member

mbaudis commented May 20, 2015

A flexible way to define datasets/studies would be an object containing query & access parameters:

  • list of uuids to be matched
  • attributes to be matched (e.g. species, metadata annotation fragment, diagnosis...)
  • list of record types to be returned
  • positive and/or negative attribute selection (e.g. exclude geolocs, age, ... depending on permission)

There is IMHO no real difference between a stored query and a "DataSet"; the latter would be a static version of this query style object (e.g. defined through the fixed object ids).
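
The equivalence claimed here can be sketched by treating both kinds of dataset as a membership test over records. All names below (`static_dataset`, `query_dataset`, `members`, the record fields and values) are hypothetical and chosen only to illustrate the idea; no query object is defined in the schemas.

```python
from typing import Callable, Dict, List

Record = Dict[str, str]
Selector = Callable[[Record], bool]

# Illustrative records only.
records: List[Record] = [
    {"id": "s1", "species": "human", "diagnosis": "glioma"},
    {"id": "s2", "species": "mouse", "diagnosis": "glioma"},
    {"id": "s3", "species": "human", "diagnosis": "none"},
]


def static_dataset(ids: List[str]) -> Selector:
    """A 'DataSet' in the static sense: membership fixed by object IDs."""
    wanted = set(ids)
    return lambda rec: rec["id"] in wanted


def query_dataset(**attributes: str) -> Selector:
    """A stored query: membership decided by matching attributes."""
    return lambda rec: all(rec.get(k) == v for k, v in attributes.items())


def members(selector: Selector, recs: List[Record]) -> List[str]:
    """Evaluate either kind of dataset against the current records."""
    return [rec["id"] for rec in recs if selector(rec)]
```

Both `static_dataset(["s1", "s3"])` and `query_dataset(species="human")` select the same members for this data, but only the query version would pick up a new human sample added later.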

@pgrosu
Contributor

pgrosu commented May 20, 2015

+1 to @richarddurbin and @mbaudis on using a dataset/object as a study, which matches my own experience in practice and is the reason I mentioned it last year. But then we should not limit the search across multiple datasets in #253.

@diekhans
Contributor

@richarddurbin I am all for experience; however, there is no
experience for us to learn from, because the purpose and scope
of DataSet are not documented.

The goal of the DWG needs to be limited to providing a data
model and API for data exchange if it's going to be successful.
Some things needed by vendors will be out of scope; however, their
requirements are also very valuable input.

The Study analogy is very compelling. However, there is no way
to know, since DataSet is buried in reads.avdl and not
documented. An API must be completely implementable from the
IDL, documentation, and conformance suite.

Mark

@lh3
Member

lh3 commented May 20, 2015

How about removing Dataset from reads.avdl and adding Study to common.avdl as a replacement? A study is user/submitter defined, loosely equivalent to a project. A central requirement of Study is that sample names must not be duplicated in a study. BTW, Study is a long existing concept in SRA/ENA.

@mbaudis
Member

mbaudis commented May 20, 2015

@lh3 +1 for removing Dataset from reads; but after that it starts to get hairy.

  • Using the name "Study" is IMHO okish, but "Dataset" seems even better (more neutral; Study implies an activity, whereas Dataset or something like this just is a wrapper for somehow related data objects).
  • Placing it (e.g. Dataset) into common.avdl seems at the moment o.k., but questions (probably rightfully!) the existence of metadata.avdl. This leads to the general design decision we have to make: Do we want to have a) several collections of loosely connected record definitions like now (-3), or b) do we want to have everything which is not very specifically bound to a certain object type (like e.g. [paraphrasing] library to experiment) in a single document (+3), or c) do we want to populate the schema space with per-record/object files (+2).

Writing this as one of the metadata people, salted with personal opinions.

@diekhans
Contributor

I actually agree that DataSet is the better term, but the
documentation should explain how this relates to study.

I would favor moving directly to metadata. Common needs to go
away and be broken into functional modules. Having a dumping
ground module does not help in understanding the API.

@mbaudis
Member

mbaudis commented May 20, 2015

@diekhans As I said: "metadata - everything but the sequence"...

So, we should define Dataset (DataSet?) inside metadata.avdl, also replacing IndividualGroup?! Absolutely in favo(u)r.

@fnothaft
Contributor

I actually agree that DataSet is the better term, but the documentation should explain how this relates to study.

+1

I would favor moving directly to metadata. Common needs to go away and be broken into functional modules. Having a dumping ground module does not help in understanding the API.

+1

@mbaudis
Member

mbaudis commented May 20, 2015

So, now that I seem to understand @diekhans: this is not against a "catch-all", but for the existence of an OTRTA (one to rule ...), that is, metadata.avdl?

@lh3
Member

lh3 commented May 20, 2015

I actually prefer this concept not to be too generic, but anyway, I don't really mind naming or where to put Dataset/Study. I only want to see: 1) a global Dataset/Study object and 2) sample names appearing in reads/refVar are not duplicated in a Dataset/Study.
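
A small sketch of the second requirement, assuming a server can list the sample names used across the reads and variants of one Dataset/Study. The helper name `duplicated_sample_names` is hypothetical, and `sample-2` is a made-up name used only for illustration.

```python
from collections import Counter
from typing import Iterable, List


def duplicated_sample_names(sample_names: Iterable[str]) -> List[str]:
    """Return, sorted, every sample name used more than once in one dataset."""
    counts = Counter(sample_names)
    return sorted(name for name, n in counts.items() if n > 1)
```

A server could run such a check when data is added to a dataset and reject the addition if the returned list is not empty; the same name appearing in two different datasets would not be flagged.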

@helenp

helenp commented May 21, 2015

+1 moving to Meta Data as well.
My experience is that you always need a container of some kind. I prefer
DataSet. Study has a design associated in my view.

@mbaudis
Member

mbaudis commented May 21, 2015

O.k., I'll submit something to the metadata team discussion before doing the PR on metadata.avdl. We'll have to discuss whether

  • this replaces IndividualGroup (IMHO yes)
  • this should be limited to one record type, or whether all record types associated with a given logical group can be wrapped in (my preference)
  • datasets should be fixed (defined through record ids) or flexible (defined through queries on metadata objects)

```
/**
Represents a group of data objects of one or more types (e.g. all Individuals, Samples, Experiments
associated with a clinical study; or e.g. a trio in genetic diagnostics.)
*/
record DataSet {
  /** The dataset UUID. This is globally unique. */
  string id;

  /** The name of the dataset. */
  union { null, string } name = null;

  /** A description of the dataset. */
  union { null, string } description = null;

  /**
  The time at which this record was created.
  Format: ISO 8601, YYYY-MM-DDTHH:MM:SS.SSS (e.g. 2015-02-10T00:03:42.123Z)
  */
  string recordCreateTime;

  /**
  The time at which this record was last updated.
  Format: ISO 8601, YYYY-MM-DDTHH:MM:SS.SSS (e.g. 2015-02-10T00:03:42.123Z)
  */
  string recordUpdateTime;

  /** The type of dataset. Examples could be "trio", "metaanalysis", "gwas" ... */
  union { null, string } type = null;

  /** The UUIDs of included records. */
  array<string> recordsIncluded = [];

  /** The query leading to a dynamic assignment of dataset members.
  This is just a placeholder for a yet-to-be-defined query object
  union { null, metadataQuery } datasetQuery = null;
  */

  /**
  A map of additional dataset information.
  */
  map<array<string>> info = {};
}
```
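
To show what an instance of this draft record might look like, here is a sketch that builds one as a plain Python dict, using the field names from the record above and the timestamp format given in its comments. The helpers `iso_millis` and `new_dataset` and the record IDs are hypothetical; the dict is not Avro's JSON encoding (which wraps non-null union values), just a readable illustration.

```python
import uuid
from datetime import datetime, timezone
from typing import Dict, List, Optional


def iso_millis(moment: datetime) -> str:
    """Format a timezone-aware datetime like 2015-02-10T00:03:42.123Z."""
    utc = moment.astimezone(timezone.utc)
    return utc.strftime("%Y-%m-%dT%H:%M:%S.") + f"{utc.microsecond // 1000:03d}Z"


def new_dataset(name: str,
                record_ids: List[str],
                dataset_type: Optional[str] = None,
                now: Optional[datetime] = None) -> Dict[str, object]:
    """Build a dict shaped like the draft DataSet record."""
    stamp = iso_millis(now or datetime.now(timezone.utc))
    return {
        "id": str(uuid.uuid4()),       # globally unique, assigned at creation
        "name": name,
        "description": None,
        "recordCreateTime": stamp,
        "recordUpdateTime": stamp,     # equal to the creation time until edited
        "type": dataset_type,
        "recordsIncluded": list(record_ids),
        "info": {},
    }


# Illustrative instance: a trio defined by three made-up record IDs.
example = new_dataset(
    "demo trio",
    ["rec-1", "rec-2", "rec-3"],
    dataset_type="trio",
    now=datetime(2015, 2, 10, 0, 3, 42, 123000, tzinfo=timezone.utc),
)
```

Passing `now` explicitly keeps the example reproducible; a server would normally leave it out and take the current time.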

@vadimzalunin

+1 on Dataset being closer to SRA Study
Generic Dataset definition is a potential leaky abstraction. Google's Datasets may be so different from SRA Datasets that clients would have to take this into account.

@richarddurbin
Contributor

  1. I am happy to call it DataSet
  2. I am happy for it to move to MetaData
  3. I support strongly Heng’s proposal that sample names are unique within a DataSet
  4. I strongly want it to cover more than one record type. In more detail:
    4a) Some types should only be able to be included in one DataSet. Perhaps they should be required to be in a DataSet. An example would be ReadGroup.
    4b) I think the same is true for Sample, or at least some sort of Sample record. But then there needs to be a mechanism to identify samples that correspond to the same individual across DataSets. This might be via some other sort of global Individual record, or via relationships. Probably a metadata question. Not immediately critical to resolve, though important.
    4c) Reference sequence records and Alleles should not have to belong to a DataSet, but the various types of Call should. I am not sure whether DataSets could declare additional private References/Alleles - perhaps.
  5. I would prefer it to be defined directly by specifying constituents, not through a dynamic query.
  6. I see this as some sort of data space in which a data owner/provider would have the rights to create entries, and control access to entries.
    I would also, at least for now, have access control at the level of DataSet. Perhaps people could specify different access controls for different record types in the same data set.
  7. Along these lines, I think Vadim’s suggestion of using DataSet for SRA Studies is important. We should develop the model so that SRA/ERA can provide their data via the GA4GH API using the DataSet model for their Studies. This is an important constraint that will help us get practical things in place. It does not constrain others who use the model to use it exactly the same way. e.g. different 1000 Genomes populations are in separate ERA studies (I believe), but the access controls are all the same (open) and someone else providing an analysis on the data might put them in one DataSet in their server.
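
Points 4a and 4c could be checked mechanically along the following lines. This is a sketch under stated assumptions: the set `SINGLE_DATASET_TYPES` reads "should only be in one DataSet, perhaps required" as "exactly one", and the function name, record IDs and type labels are hypothetical.

```python
from typing import Dict, List

# Assumption for illustration: record types that must sit in exactly one DataSet.
SINGLE_DATASET_TYPES = {"ReadGroup", "Call"}


def membership_errors(record_types: Dict[str, str],
                      datasets: Dict[str, List[str]]) -> List[str]:
    """Report records whose type requires exactly one owning DataSet."""
    owners: Dict[str, List[str]] = {rid: [] for rid in record_types}
    for dataset_id, member_ids in datasets.items():
        for rid in member_ids:
            owners.setdefault(rid, []).append(dataset_id)

    errors = []
    for rid, rtype in sorted(record_types.items()):
        count = len(owners[rid])
        if rtype in SINGLE_DATASET_TYPES and count != 1:
            errors.append(f"{rtype} {rid} is in {count} datasets, expected 1")
    return errors
```

Record types outside the set, such as references, pass regardless of how many datasets include them, which matches the idea that they need not belong to any DataSet.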

On 21 May 2015, at 10:41, Vadim Zalunin notifications@github.com wrote:

+1 on Dataset being closer to SRA Study
Generic Dataset definition is a potential leaky abstraction. Google's Datasets may be so different from SRA Datasets that clients would have to take this into account.



@mbaudis
Member

mbaudis commented May 21, 2015

Thanks @richarddurbin for the extensive comments. We'll limit the definition to a list of uuid, and leave the query option out of this - one can always have an extrinsic implementation for creating query based "dynamic" datasets, collections, without promoting the mechanism upfront.

Ad 4b: Individuals and samples (and every other record) have to receive their own uuid at creation time. An Individual will have the same uuid in all different datasets, as long as it is not re-created. Problems will arise, implementation-wise, in tracking relations between different records. While we (will) have mechanisms in place (e.g. collecting "derivedFrom" and such), one can easily foresee multiple entry points into the system, especially for samples (i.e. a new sample for an existing individual, and re-creation of a new individual). These are data management issues, which we can limit through good schema design but cannot completely avoid.

Ad 4c: While references should not be part of the DataSet, included records may point to specific references; but those pointers are part of the lower level records, not the DataSet itself (?).

@lh3
Member

lh3 commented May 21, 2015

Users would want to know: 1) given a Dataset, which samples does it have? and conversely 2) given a sample name (e.g. NA12878), which Datasets have this sample? In addition, do we allow a VariantSet to span multiple Datasets? What does a "record" mean in a Dataset?

@dglazer
Member

dglazer commented May 21, 2015

A few thoughts:

  • I think we're mixing a couple of conversations here -- cleaning up the current semantics, and potentially introducing new semantics. I suggest we do the first one quickly, to address the original concerns raised in this issue, and to let us take as much time as needed to do the second.
  • Specifically, I'd be happy to +1 (or write) a PR that a) moves Dataset definition from reads.avdl to metadata.avdl (or common.avdl; I'm neutral) and b) documents its current role and purpose.

For the longer discussion, I largely agree with @richarddurbin 's framing of the requirements.

  • I agree with point 6 -- I like us having "some sort of data space in which a data owner/provider would have the rights to create entries, and control access to entries", and I like the simplicity of starting with "access control at the level of Dataset"
  • I strongly agree with point 4, a Dataset should be able to contain more than one record type. I'm less sure about exactly which record types may/must be included in a Dataset; I lean towards it being a universal container, but am interested to see the discussion.
  • I like the idea of point 7, doing the thought experiment of making sure SRA Studies can map cleanly to Datasets. I'll be curious to see what if any new requirements that adds.

I don't know enough yet about the progress on metadata in general to have an opinion on whether Dataset should be merged with the new concepts proposed there, or should be left as an orthogonal administrative grouping, independent from the new semantic groupings.

Last and least:

  • let's separate spelling ("Dataset" vs. "DataSet") from substance. Once we agree on what it does and where it lives, we can debate spelling in a separate PR. (I have an opinion; but I'll follow my own suggestion and not raise it here.)

@lh3
Member

lh3 commented May 21, 2015

I have no opinions on using one or multiple PRs to resolve Dataset. As this is a discussion thread, I will raise this: with Dataset becoming global, does it make sense to lift CallSet in variants.avdl also to a global object? The current CallSet is essentially:

record VariantSet { string id; string datasetId; }
record CallSet {
  string id;
  union { null, string } name = null;
  union { null, string } sampleId;
  array<string> variantSetIds = [];
}

We have something different in reads.avdl (in that we don't have a dataset-specific name):

record ReadGroup {
  union { null, string } datasetId = null;
  union { null, string } sampleId;
}

The ReadGroup version is closer to SRA, but I don't think it is working in practice because sample matching across SRA studies is hard and inconsistent at present (for a simple example see the query result of NA12044 in the BioSample database; NA10851 and NA12878 are much more complicated). A proposal is to define in common.avdl or metadata.avdl:

record DatasetSample { // feel free to change the name
  string id; // internal ID of this DatasetSample; unique in an entire data repository
  string datasetId;
  string displayName; // sample name shown in BAM header or in VCF; can't be null
  union { null, string } sampleId = null; // link to a Sample object if available
}

Then in variants.avdl, we remove CallSet and use datasetSampleId instead of callSetId in other variant objects. In reads.avdl, we replace ReadGroup.{datasetId,sampleId} with a single ReadGroup.datasetSampleId.

The major benefit of this proposal is it decouples our simple practical need (get a sample by a short name in BAM/VCF) and the complex procedure of sample matching. The proposal achieves this by demanding a DataSet-specific submitter-defined sample name. It also directly connects variants, reads and possibly other sample-related objects in a dataset.

PS: I will explain why the NA12044 query result is problematic. NA12044 is a hapmap/1000g sample. The cell line is available from Coriell. When people use NA12044, it is almost certainly the same sample. In the result page, however, we see multiple NA12044 BioSamples with different sample names: NA12044, E-MTAB-197:NA12044, GEUV:NA12044 and CEU-NA12044.
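
To make the DatasetSample proposal above concrete, here is a minimal Python sketch of how a server might resolve a submitter-defined name within one Dataset while leaving cross-Dataset sample matching optional. The registry class, its method names, and the dataset IDs in the usage lines are hypothetical illustrations and not part of any GA4GH schema; uniqueness of displayName within a Dataset is treated as an assumption that the registry enforces.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple


@dataclass(frozen=True)
class DatasetSample:
    """Mirrors the proposed record; field names are adapted to Python style."""
    id: str                          # internal ID, unique in the repository
    dataset_id: str
    display_name: str                # name shown in the BAM header or VCF
    sample_id: Optional[str] = None  # link to a Sample object if available


class DatasetSampleRegistry:
    """Hypothetical lookup table keyed by (dataset_id, display_name)."""

    def __init__(self) -> None:
        self._by_name: Dict[Tuple[str, str], DatasetSample] = {}

    def register(self, ds: DatasetSample) -> None:
        key = (ds.dataset_id, ds.display_name)
        if key in self._by_name:
            # Assumption: a display name may appear only once per Dataset.
            raise ValueError(f"duplicate name in dataset: {key}")
        self._by_name[key] = ds

    def lookup(self, dataset_id: str, display_name: str) -> Optional[DatasetSample]:
        """Simple practical need: get a sample by its short name in one Dataset."""
        return self._by_name.get((dataset_id, display_name))

    def linked_to(self, sample_id: str) -> List[DatasetSample]:
        """Cross-Dataset matching, only for entries that carry a Sample link."""
        found = [d for d in self._by_name.values() if d.sample_id == sample_id]
        return sorted(found, key=lambda d: d.id)


registry = DatasetSampleRegistry()
# Dataset IDs and the Sample ID below are made up for illustration.
registry.register(DatasetSample("dss-1", "ds-1000g", "NA12044", sample_id="sample-42"))
registry.register(DatasetSample("dss-2", "ds-geuvadis", "GEUV:NA12044", sample_id="sample-42"))
registry.register(DatasetSample("dss-3", "ds-other", "CEU-NA12044"))  # not yet matched
```

In this sketch the name lookup never depends on sample matching: `registry.lookup("ds-1000g", "NA12044")` works whether or not a Sample link exists, while `registry.linked_to("sample-42")` returns only the entries that have been matched.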

@mbaudis
Member

mbaudis commented May 21, 2015

@lh3 (further up) These points will have to be solved through query/access implementations. In principle, Dataset would be defined through a list of its member records' UUIDs. Queries would then either go Dataset => UUID list => record, or use a record's UUID to search Datasets.
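
As a rough illustration of the two query directions just described, the following Python sketch keeps a Dataset as nothing more than a list of member UUIDs, plus a reverse map for searching Datasets by record UUID. The class, its methods, and the IDs used are hypothetical and only show the shape of the idea, not an agreed API.

```python
from collections import defaultdict


class DatasetIndex:
    """Hypothetical in-memory store; not part of the GA4GH schemas."""

    def __init__(self):
        self._records = {}                   # record UUID -> record payload
        self._members = {}                   # dataset ID -> ordered list of record UUIDs
        self._containing = defaultdict(set)  # record UUID -> IDs of datasets listing it

    def add_record(self, record_id, record):
        self._records[record_id] = record

    def add_to_dataset(self, dataset_id, record_id):
        if record_id not in self._records:
            raise KeyError(f"unknown record: {record_id}")
        members = self._members.setdefault(dataset_id, [])
        if record_id not in members:
            members.append(record_id)
        self._containing[record_id].add(dataset_id)

    def records_in(self, dataset_id):
        """Dataset => UUID list => records."""
        return [self._records[r] for r in self._members.get(dataset_id, [])]

    def datasets_containing(self, record_id):
        """Record UUID => IDs of the Datasets that list it."""
        return sorted(self._containing.get(record_id, ()))


index = DatasetIndex()
index.add_record("uuid-1", {"type": "Sample", "name": "NA12878"})
index.add_record("uuid-2", {"type": "ReadGroup", "name": "rg1"})
index.add_to_dataset("ds-A", "uuid-1")
index.add_to_dataset("ds-A", "uuid-2")
index.add_to_dataset("ds-B", "uuid-1")
```

Here `index.records_in("ds-A")` follows the Dataset => UUID list => record path, and `index.datasets_containing("uuid-1")` goes the other way. The same record can be listed in several Datasets because membership lives in the Dataset, not in the record.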

@pgrosu
Contributor

pgrosu commented May 21, 2015

To support all these ideas we would definitely need to enforce several naming nomenclatures as appropriate and have controlled vocabularies. I agree with @mbaudis that being able to reference similar samples would simplify the searches, which by standardizing would provide richer results. Also, that approach would allow for the possibility of implementing the mapping of samples <-> datasets - as @lh3 mentioned - including many other mappings that would be generated on-the-fly, which I previously recommended via inverted indices. Such indices would also allow for searches over key-value pairs such as the following, for even more general mapping:

Liver_Tissue -> DataSet_A_UUID.Sample1.ReadGroup1
Liver_Tissue -> DataSet_B_UUID.Sample44.ReadGroup3

Such mapping would associate (group) studies or datasets to specific underlying data. This would also satisfy what @vadimzalunin, @richarddurbin and @dglazer are looking for. Again, the pairs can be key-key or key-value, or even more complex ones such as key.key.key-value, where key.key.key is itself a key. So everything is possible.

We can even implement the n-gram concepts of search engines, where key-pairing would improve the results (i.e. a query of the terms "United" and "States" would be better associated as "United States"). In this example such an index would be as follows:

Lung_Tissue & adenocarcinoma_stage_1 -> DataSet_md5sum_3e9e77456ba.CallSet_id_324
Lung_Tissue & adenocarcinoma_stage_1 -> DataSet_md5sum_a275127a7b6.CallSet_id_993247

These types of searches are dynamic and are constantly being generated on-the-fly, since today's search engine queries run at almost 70000/second or more world-wide. They can even be generated by filtering through a model to improve the quality of the mapping with rank-scores. Thus our type of read-only access - which closely parallels web-search engine design - can be accommodated by a similar approach.
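
A minimal Python sketch of the inverted-index idea, assuming a simple in-memory structure: each term maps to the set of dotted paths it annotates, and a multi-term query (the "&" in the examples above) is the intersection of those sets. The class and method names are hypothetical, and ranking, n-gram handling and distribution are left out entirely.

```python
from collections import defaultdict


class InvertedIndex:
    """Hypothetical term -> paths index; paths look like Dataset.Sample.ReadGroup."""

    def __init__(self):
        self._postings = defaultdict(set)  # term -> set of dotted paths

    def add(self, path, terms):
        """Register one path under every term that describes it."""
        for term in terms:
            self._postings[term].add(path)

    def search(self, *terms):
        """Return the sorted paths annotated with ALL of the given terms."""
        if not terms:
            return []
        postings = [self._postings.get(term, set()) for term in terms]
        return sorted(set.intersection(*postings))


idx = InvertedIndex()
idx.add("DataSet_A_UUID.Sample1.ReadGroup1", ["Liver_Tissue"])
idx.add("DataSet_B_UUID.Sample44.ReadGroup3", ["Liver_Tissue"])
idx.add("DataSet_md5sum_3e9e77456ba.CallSet_id_324",
        ["Lung_Tissue", "adenocarcinoma_stage_1"])
idx.add("DataSet_md5sum_a275127a7b6.CallSet_id_993247",
        ["Lung_Tissue", "adenocarcinoma_stage_1"])
```

With this data, `idx.search("Liver_Tissue")` returns the two read group paths, and `idx.search("Lung_Tissue", "adenocarcinoma_stage_1")` returns the two call set paths.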

Paul

@diekhans
Contributor

Richard Durbin notifications@github.com writes:

  1. I support strongly Heng’s proposal that sample names are unique within a
    DataSet

I strongly feel that samples should have globally unique ids
throughout the world.

The problems of identification will haunt us forever if we don't
make it part of GA4GH.

@mdmiller53

+1 also for removing Dataset from reads; perhaps adding a GenomicDataset for this specific purpose
i don't agree that study is equivalent, at least not in the way i've seen dataset used, which is the data generated for a study from the participants' samples. the study design then allows structuring the relationship of the data in the dataset(s) of the study. the Google use case is simply for the genomic data, not the study that generated it.

@diekhans

  • provenance - an extremely important problem and something that
    needs a mechanism that has finer granularity than dataset.
    Having this in here because it might prove useful to
    provenance isn't a good way to implement this critical
    feature.

take a look at w3c dataset description

right now the GA4GH focus is on variants, but my primary interest is, as things move forward, in where the summarized data (level 3 data in TCGA terms) and the datasets from clustering and further analysis can be described using GA4GH approved standards. it is not a good idea to 'box' everything into GA4GH's initial efforts; better to allow room to grow throughout the entire functional genomics space

@mellybelly

The W3C dataset group would welcome test cases, evaluation, and improvement. They did a nice job reviewing a lot of existing standards; we are using it in our work and have found insufficiencies that we are feeding back. Would be good to synergize these efforts.

@rlesca01

rlesca01 commented Jun 3, 2015

okay sounds good


@mbaudis
Member

mbaudis commented Jun 3, 2015

@mellybelly But isn't the W3C work aimed more at the general description of a dataset as a resource (dataset=resource)? That's not the discussion here about dataset/study/... records; this is about aggregation of records sharing some common features (either intrinsically, e.g. clinical diagnosis; or procedurally, e.g. part of same study, provenance ...).

@pgrosu
Contributor

pgrosu commented Jun 3, 2015

@mdmiller53 I like the way you think :) Could you maybe generate a PR or create an issue - or even expand here - regarding the way reads/variants/etc., including downstream analysis models, would integrate with the other teams' components in optimizing for distributed over-the-wire data models that incorporate what you are envisioning? You have probably seen my previous post on how this would all fit together, but I would be interested in how you would tie it all together to allow room for growth in capabilities, which I'm happy to say has also been my message over the past year. Not sure if you have already seen the Priorities for the Data Working Group document, but I included it just in case :)

Thank you and look forward to it,
Paul

mbaudis pushed a commit that referenced this issue Jun 4, 2015
This PR introduces  ```Dataset``` record type (at the expense of the previous limited ```IndividualGroup```). Please see #248 for some background on this.
```Dataset``` is seen with inherent *evolvability*. There was consensus among the metadata task team that the best way would be to put this forward in a skeleton format with future feature refinement.
@mdmiller53

@pgrosu in looking at the priorities, it is "3. Expression, methylation, and other epigenetic data" that i am referring to, and the metadata will also need to eventually be suitable for describing that data in the summarized format (tsv usually, TCGA Level 3) once the WG gets to it, not just the initial BAM from sequencing that the summarized and corrected values are generated from. i've been trying to put together a document that describes the work we are doing in my lab at ISB for the metadata as part of our CGC contract. i'm off on vacation next week but will get to that on my return, near the end of the month

@pgrosu
Contributor

pgrosu commented Jun 5, 2015

@mdmiller53 I am very excited about what I hear and eagerly look forward to your document. There have been many discussions on the importance of metadata searching. You've probably seen some of my posts on inverted indices for implementing them in a distributed, replicated balanced data-structure for optimized retrieval on any (meta)data efficiently - basically one of the core concepts of how large search engines are implemented these days.

Regarding the expression portion, there is a RNA-Seq task team which you might want to connect with. The only people I know that are part of that team are Sean (@saupchurch) and Alastair (@afirth).

In any case, hope you have a wonderful vacation and look forward to reading your document - no rush :)

Thank you,
Paul

diekhans pushed a commit to diekhans/ga4gh-schemas that referenced this issue Jun 16, 2015
This PR introduces  ```Dataset``` record type (at the expense of the previous limited ```IndividualGroup```). Please see ga4gh#248 for some background on this.
```Dataset``` is seen with inherent *evolvability*. There was consensus among the metadata task team that the best way would be to put this forward in a skeleton format with future feature refinement.
@diekhans diekhans added this to the comprehensive doc milestone Sep 14, 2015
@dglazer
Member

dglazer commented Sep 16, 2015

@diekhans -- I believe that #389, which is now committed, means we can close this issue. Doing so now; feel free to reopen if you disagree.
