Dataset role not clearly defined #248
Comments
This hierarchy idea is broken somewhat by the
|
I would also like to know what function the Dataset was supposed to serve. |
It encapsulates a set of data with shared access control (most important) and typically provenance. For example from a single deposited VCF. Richard
The Wellcome Trust Sanger Institute is operated by Genome Research |
I agree with what @richarddurbin said. To add to it, the role is much clearer for reads where there would otherwise be no hierarchical collection beyond a
I didn't realize this was the case. I don't think CallSets should be able to span VariantSets; maybe I'm missing some background information as to why this would be desirable. |
@calbach Will the same sample be part of different VariantSets? Can two VariantSets be technical replicates in a larger study? |
If it's specific to reads, then it should have a less generic name. If it's intended to be more general, it should be moved out of reads. Whatever is intended, this intention needs to be documented. CH Albach notifications@github.com writes:
|
I'll reference myself from what I posted previously under So think of a Dataset as a study that was performed and which joins/groups all the relevant data associated with it in a logical grouping. It is a collection of one or more of the following:
It also encompasses what @richarddurbin is emphasizing. Remember this diagram I posted almost a year ago here: Notice how the permissions propagate down to the Hope it helps, |
Agreeing with and elaborating on @richarddurbin and @calbach 's responses to @dcolligan 's question:
TL;DR: they're a convenient way for data providers to lump together related data of multiple types.
|
@pgrosu's comment reflects my take on discussions we had in metadata (e.g. see comment thread in https://docs.google.com/document/d/1Sl6FYwBHjWYYo2Ex29fqvmEiOQjeokgAUfXNGsVQwZ0/edit#heading=h.p2r9dh51rf8 ). I was proposing a generic group object (whatever it ends up being called) to represent a static or dynamically generated group of data objects (e.g. all samples or analyses matching a given set of criteria). |
The problem is that the purpose is not documented in the schema. If it's to apply to more than reads, it needs to be moved out of It also seems inadequate for any of these goals, so there
I have now convinced myself that dataset should be removed. Also, having objects reference their containers (datasetId) is Mark David Glazer notifications@github.com writes:
|
So maybe we might want to start discussing how do we really want to use this API and its ideal purpose? Would it be to enable better elucidation of diseases for research purposes, or what might be some of the billing and dataset-as-permission incentives? The reason I pose this question is that, usually it is better to look at all the data than just part of it. If different individuals/teams have different access to it, that would make it difficult to properly identify the causes of some diseases in order to properly perform the followup experiments or recommend treatment. Imagine if BLAST - when it was first introduced in the 90's - provided varied permission levels to different users with different billing structures. Ideally in our case, the individuals would be anonymized so that access to all the data can still be used for research purposes. This would be the ideal way to effectively enable the system for both research and personalized medicine. Paul |
I am concerned about removing Dataset without replacing it with something that has already I also found DataSet vague when I originally came across it, and wanted to change or remove it. I vote strongly for experience, and not removing something known to work without replacing it Richard
A flexible way to define datasets/studies would be an object containing query & access parameters:
There is IMHO no real difference between a stored query and a "DataSet"; the latter would be a static version of this query style object (e.g. defined through the fixed object ids). |
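The equivalence suggested above, between a stored query and a static "DataSet", can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `StaticDataset`, `QueryDataset`, the `freeze` method, and the in-memory `records` store are hypothetical names that do not appear in the GA4GH schemas.

```python
from dataclasses import dataclass
from typing import Callable, Dict, FrozenSet, Set

# Hypothetical record store: record id -> attributes (not part of any schema).
Records = Dict[str, Dict[str, str]]


@dataclass(frozen=True)
class StaticDataset:
    """A dataset defined by a fixed set of record ids."""
    record_ids: FrozenSet[str]

    def members(self, records: Records) -> Set[str]:
        # Only ids that still exist in the store are returned.
        return {rid for rid in self.record_ids if rid in records}


@dataclass(frozen=True)
class QueryDataset:
    """A dataset defined by a stored query, evaluated on demand."""
    predicate: Callable[[Dict[str, str]], bool]

    def members(self, records: Records) -> Set[str]:
        return {rid for rid, attrs in records.items() if self.predicate(attrs)}

    def freeze(self, records: Records) -> StaticDataset:
        # Materialising the current query result yields the static form.
        return StaticDataset(frozenset(self.members(records)))


records: Records = {
    "r1": {"study": "A"},
    "r2": {"study": "B"},
}
dynamic = QueryDataset(lambda attrs: attrs.get("study") == "A")
static = dynamic.freeze(records)
records["r3"] = {"study": "A"}  # new data arrives after the freeze
```

After the last line, `dynamic.members(records)` includes `r3` while `static.members(records)` does not, which is the practical difference between the two styles.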
+1 with @richarddurbin, @mbaudis using a dataset/object-as-a-study - which I've had the same experience with in practice, and the reason I mentioned it last year - but then we should not limit the search across multiple datasets in #253. |
@richarddurbin I am all for experience, however there is no The goal of the DWG needs to be limited to providing a data The Study analogy is very compelling. However, there is no way Mark Richard Durbin notifications@github.com writes:
|
How about removing |
@lh3 +1 for removing
Writing this as one of the metadata people, salted with personal opinions. |
I actually agree that DataSet is better than term, but the I would favor moving directly to metadata. Common needs to go Michael Baudis notifications@github.com writes:
|
@diekhans As I said: "metadata - everything but the sequence"... So, we should define Dataset (DataSet?) inside metadata.avdl, also replacing |
+1
+1 |
So now seeming to understand @diekhans: This is not against a "catch it all", but for existence of OTRTA (one to rule ...), that is metadata.avdl? |
I actually prefer this concept not to be too generic, but anyway, I don't really mind naming or where to put Dataset/Study. I only want to see: 1) a global Dataset/Study object and 2) sample names appearing in reads/refVar are not duplicated in a Dataset/Study. |
+1 moving to Meta Data as well. On 20/05/2015 22:10, Michael Baudis wrote:
Helen Parkinson, PhD Samples, Phenotypes and Ontologies Team, European Bioinformatics Institute (EMBL-EBI) EBI 01223 494672 |
O.k., I'll submit something to the metadata team discussion, before doing the PR on metadata.avdl. We'll have to discuss if
|
+1 on Dataset being closer to SRA Study |
Thanks @richarddurbin for the extensive comments. We'll limit the definition to a list of uuid, and leave the query option out of this - one can always have an extrinsic implementation for creating query based "dynamic" datasets, collections, without promoting the mechanism upfront. Ad 4b: Individuals and samples (and every other record) have to receive their own uuid at creation time. An Individual will have the same uuid in all different datasets, as long as it is not re-created. Problems will arise implementation wise, in tracking relations between different records - while we (will) have mechanisms in place (e.g. collecting "derivedFrom" and such), one can easily foresee multiple entry points into the system, especially for samples (i.e. new sample for existing individual and re-creation of new individual). But these are data management issues, which we can limit through good schema design, but which cannot completely be avoided. Ad 4c: While references should not be part of the DataSet, included records may point to specific references; but those pointers are part of the lower level records, not the DataSet itself (?). |
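A rough Python sketch of the "list of uuid" idea may help here. The class and field names (`Individual`, `Dataset`, `record_ids`, `new_id`) are assumptions made for illustration only; the actual record layout is whatever the metadata PR defines.

```python
import uuid
from dataclasses import dataclass, field
from typing import List


def new_id() -> str:
    """Assign a uuid at record creation time."""
    return str(uuid.uuid4())


@dataclass
class Individual:
    name: str
    id: str = field(default_factory=new_id)


@dataclass
class Dataset:
    name: str
    # A dataset is just a list of record uuids; no query mechanism.
    record_ids: List[str] = field(default_factory=list)
    id: str = field(default_factory=new_id)


patient = Individual("individual-1")
study_a = Dataset("study A", [patient.id])
study_b = Dataset("study B", [patient.id])

# Re-creating the individual yields a different uuid: this is the
# tracking problem described above, not something the sketch solves.
duplicate = Individual("individual-1")
```

The same individual keeps one uuid across both datasets, while the re-created record gets a new one and would need something like "derivedFrom" to be tied back.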
Users would want to know: 1) given a Dataset, which samples does it have? and conversely 2) given a sample name (e.g. NA12878), which Datasets have this sample? In addition, do we allow a VariantSet to span multiple Datasets? What does a "record" mean in a Dataset? |
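The two lookups asked for above are the two directions of one membership relation. A minimal Python sketch, assuming membership is available as (dataset id, sample name) pairs; the function name `build_indexes` and the dataset ids `ds1`/`ds2` are made up for the example:

```python
from collections import defaultdict
from typing import Dict, Iterable, Set, Tuple


def build_indexes(memberships: Iterable[Tuple[str, str]]):
    """Build both lookup directions from (dataset_id, sample_name) pairs."""
    samples_by_dataset: Dict[str, Set[str]] = defaultdict(set)
    datasets_by_sample: Dict[str, Set[str]] = defaultdict(set)
    for dataset_id, sample_name in memberships:
        samples_by_dataset[dataset_id].add(sample_name)
        datasets_by_sample[sample_name].add(dataset_id)
    return samples_by_dataset, datasets_by_sample


# Hypothetical membership data.
memberships = [
    ("ds1", "NA12878"),
    ("ds1", "NA12044"),
    ("ds2", "NA12878"),
]
samples_by_dataset, datasets_by_sample = build_indexes(memberships)
```

Question 1 is then `samples_by_dataset["ds1"]` and question 2 is `datasets_by_sample["NA12878"]`. Whether a server should maintain both directions is a design question the sketch does not settle.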
A few thoughts:
For the longer discussion, I largely agree with @richarddurbin 's framing of the requirements.
I don't know enough yet about the progress on metadata in general to have an opinion on whether Dataset should be merged with the new concepts proposed there, or should be left as an orthogonal administrative grouping, independent from the new semantic groupings. Last and least:
|
I have no opinions on using one or multiple PRs to resolve Dataset. As this is a discussion thread, I will raise this: with Dataset becoming global, does it make sense to lift CallSet in variants.avdl also to a global object? The current CallSet is essentially:

    record VariantSet {
      string id;
      string datasetId;
    }
    record CallSet {
      string id;
      union { null, string } name = null;
      union { null, string } sampleId;
      array<string> variantSetIds = [];
    }

We have something different in reads.avdl (in that we don't have a dataset-specific name):

    record ReadGroup {
      union { null, string } datasetId = null;
      union { null, string } sampleId;
    }

The ReadGroup version is closer to SRA, but I don't think it is working in practice because sample matching across SRA studies is hard and inconsistent at present (for a simple example see the query result of NA12044 in the BioSample database; NA10851 and NA12878 are much more complicated). A proposal is to define in common.avdl or metadata.avdl:

    record DatasetSample { // feel free to change the name
      string id; // internal ID of this DatasetSample; unique in an entire data repository
      string datasetId;
      string displayName; // sample name shown in BAM header or in VCF; can't be null
      union { null, string } sampleId = null; // link to a Sample object if available
    }

Then in variants.avdl, we remove

The major benefit of this proposal is that it decouples our simple practical need (get a sample by a short name in BAM/VCF) from the complex procedure of sample matching. The proposal achieves this by demanding a DataSet-specific, submitter-defined sample name. It also directly connects variants, reads and possibly other sample-related objects in a dataset.

PS: I will explain why the NA12044 query result is problematic. NA12044 is a hapmap/1000g sample. The cell line is available from Coriell. When people use NA12044, it is almost certainly the same sample. In the result page, however, we see multiple NA12044 BioSamples with different sample names: NA12044, E-MTAB-197:NA12044, GEUV:NA12044 and CEU-NA12044. |
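To make the DatasetSample proposal concrete, here is a small Python sketch of how a server might use it. The dataclass mirrors the proposed record; everything else (`DatasetSampleIndex`, its methods, the ids such as `ds-geuvadis` and `sample-1`) is hypothetical and chosen only for the example. The display names are the ones quoted from the BioSample result above.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple


@dataclass(frozen=True)
class DatasetSample:
    id: str                           # internal id, unique in the repository
    dataset_id: str
    display_name: str                 # name shown in BAM header or VCF
    sample_id: Optional[str] = None   # link to a Sample object, if matched


class DatasetSampleIndex:
    """Look up samples by (dataset, display name); hypothetical helper."""

    def __init__(self, entries: List[DatasetSample]) -> None:
        self._by_name: Dict[Tuple[str, str], DatasetSample] = {}
        for entry in entries:
            key = (entry.dataset_id, entry.display_name)
            if key in self._by_name:
                # Display names must be unique within one dataset.
                raise ValueError(f"duplicate name {entry.display_name!r}")
            self._by_name[key] = entry

    def get(self, dataset_id: str, display_name: str) -> Optional[DatasetSample]:
        return self._by_name.get((dataset_id, display_name))

    def linked_to(self, sample_id: str) -> List[DatasetSample]:
        # Entries that sample matching has tied to the same Sample.
        return [e for e in self._by_name.values() if e.sample_id == sample_id]


index = DatasetSampleIndex([
    DatasetSample("dss-1", "ds-1000g", "NA12044", "sample-1"),
    DatasetSample("dss-2", "ds-geuvadis", "GEUV:NA12044", "sample-1"),
    DatasetSample("dss-3", "ds-other", "CEU-NA12044"),  # not yet matched
])
```

Lookup by short name works even for the unmatched entry, which is the decoupling the proposal is after; sample matching only affects what `linked_to` returns.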
@lh3 (further up) These points will have to be solved through query/access implementations. In principle, |
To support all these ideas we would definitely need to enforce several naming nomenclatures as appropriate and have controlled vocabularies. I agree with @mbaudis that being able to reference similar samples would simplify the searches, and standardizing them would provide richer results. Also in that approach it would allow for the possibility of implementing the mapping of
Such mapping would associate (group) studies or datasets to specific underlying data. This would also satisfy what @vadimzalunin, @richarddurbin and @dglazer are looking for. Again the We can even implement the n-gram concepts of search engines, where key-pairing would improve the results (i.e. a query of the terms "United" and "States" would be better associated as "United States"). In this example such an index would be as follows:
These types of searches are dynamic and are constantly being generated on-the-fly, since today's search engine queries are almost 70000/second or more world-wide. These can even be generated by filtering through a model to improve the quality of the mapping with rank-scores. Thus our type of read-only access - which is very parallel to web-search engine design - can be accommodated by a similar approach. Paul |
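For readers unfamiliar with the idea, a toy inverted index with word pairs (bigrams) might look like the Python sketch below. It is not the index from the comment above, only a minimal illustration; the function names and the two example documents are invented.

```python
from collections import defaultdict
from typing import Dict, List, Set


def tokens_with_bigrams(text: str) -> List[str]:
    """Return lowercase words plus adjacent word pairs."""
    words = text.lower().split()
    bigrams = [f"{a} {b}" for a, b in zip(words, words[1:])]
    return words + bigrams


def build_inverted_index(docs: Dict[str, str]) -> Dict[str, Set[str]]:
    """Map each term (word or bigram) to the ids of documents containing it."""
    index: Dict[str, Set[str]] = defaultdict(set)
    for doc_id, text in docs.items():
        for term in tokens_with_bigrams(text):
            index[term].add(doc_id)
    return index


# Hypothetical metadata descriptions.
docs = {
    "d1": "United States cohort",
    "d2": "States of the United Kingdom",
}
index = build_inverted_index(docs)
```

Both documents match "united" and "states" individually, but only `d1` is indexed under the pair "united states", which is the key-pairing improvement described above.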
Richard Durbin notifications@github.com writes:
I strongly feel that samples should have globally unique ids The problems of identification will haunt us forever if we don't |
+1 also for removing Dataset from reads; perhaps adding a GenomicDataset for this specific purpose
take a look at w3c dataset description right now the GA4GH focus is on variants but my primary interest is as things move forward, to where the summarized (level 3 data in TCGA terms) and the datasets from clustering and further analysis can be described using GA4GH approved standards. it is not a good idea to 'box' everything into GA4GH's initial efforts, but allow room to grow through out the entire functional genomics space |
The W3C dataset group would welcome test cases, evaluation, and improvement. They did a nice job reviewing a lot of existing standards; we are using it in our work and have found insufficiencies that we are feeding back. Would be good to synergize these efforts. |
okay sounds good On Wed, Jun 3, 2015 at 1:02 PM, Melissa Haendel notifications@github.com
|
@mellybelly But isn't the W3C aimed at the general description of more like dataset=resource? That's not the discussion here about dataset/study/... records; this is about aggregation of records sharing some common features (either intrinsically, e.g. clinical diagnosis; or procedurally, e.g. part of same study, provenance ...). |
@mdmiller53 I like the way you think :) Could you maybe generate a PR or create an issue - or even expand here - regarding the way reads/variants/etc including downstream analysis models would integrate with the other teams' components in optimizing for distributed over-the-wire data models that incorporate what you are envisioning. You have probably seen my previous post in terms of how this would all fit together, but I would be interested in how you would tie it all together to allow room for growth with regards to capabilities, which I'm happy to say has also been my message over the past year. Not sure if you might have already seen the Priorities for the Data Working Group document, but I included it just in case :) Thank you and look forward to it, |
This PR introduces ```Dataset``` record type (at the expense of the previous limited ```IndividualGroup```). Please see #248 for some background on this. ```Dataset``` is seen with inherent *evolvability*. There was consensus among the metadata task team that the best way would be to put this forward in a skeleton format with future feature refinement.
@pgrosu in looking at the priorities, it is 3. Expression, methylation, and other epigenetic data. that i am referring to and the metadata will also need to eventually be suitable for describing that data in the summarized format (tsv usually, TCGA Level 3) once the WG gets to it, not just the initial BAM from sequencing that the summarized and corrected values are generated from. i've been trying to put together a document that describes the work we are doing in my lab at ISB for the metadata as part of our CGC contract. i'm off on vacation next week but will get to that on my return, near the end of the month |
@mdmiller53 I am very excited about what I hear and eagerly look forward to your document. There have been many discussions on the importance of metadata searching. You've probably seen some of my posts on inverted indices for implementing them in a distributed, replicated balanced data-structure for optimized retrieval on any (meta)data efficiently - basically one of the core concepts of how large search engines are implemented these days. Regarding the expression portion, there is a RNA-Seq task team which you might want to connect with. The only people I know that are part of that team are Sean (@saupchurch) and Alastair (@afirth). In any case, hope you have a wonderful vacation and look forward to reading your document - no rush :) Thank you, |
Currently, the Dataset type is poorly specified, and included in reads.avdl. We have some comments like
This needs to be clarified.
I propose:
- DataSet, so that we are consistent with VariantSet, ReferenceSet, ReadGroupSet and others.
- DataSet much more explicit, and make a formal definition of the data model as a hierarchy. (This relates to Query semantics for parent IDs inconsistent and problematic #247 also, since my proposal there assumes a hierarchy.) In this hierarchy, the DataSet is the root, and it has child VariantSets and ReadGroupSets. Each VariantSet has child Variants; each ReadGroupSet has child ReadGroups, and ReadGroups have child Reads. We don't need to make any changes to the API to do this, we just need to be more explicit about what the model means.
This formalisation should make reasoning about authorisation much more straightforward.
Thoughts?
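As a rough illustration of why a strict hierarchy simplifies authorisation, the Python sketch below stores each object with a single parent and checks access at the root DataSet only. All names here (`Node`, `root_of`, `can_read`, the `acl` mapping and the ids) are assumptions for the example; the API itself defines none of them.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Set


@dataclass(frozen=True)
class Node:
    id: str
    kind: str                        # e.g. "DataSet", "VariantSet", "Read"
    parent_id: Optional[str] = None  # None only for the root DataSet


def root_of(node_id: str, nodes: Dict[str, Node]) -> Node:
    """Walk parent links until the root of the hierarchy is reached."""
    node = nodes[node_id]
    while node.parent_id is not None:
        node = nodes[node.parent_id]
    return node


def can_read(user: str, node_id: str, nodes: Dict[str, Node],
             acl: Dict[str, Set[str]]) -> bool:
    """Access to any object is decided by the ACL of its root DataSet."""
    return user in acl.get(root_of(node_id, nodes).id, set())


nodes = {n.id: n for n in [
    Node("ds1", "DataSet"),
    Node("vs1", "VariantSet", "ds1"),
    Node("v1", "Variant", "vs1"),
    Node("rgs1", "ReadGroupSet", "ds1"),
    Node("rg1", "ReadGroup", "rgs1"),
    Node("read1", "Read", "rg1"),
]}
acl = {"ds1": {"alice"}}  # hypothetical access control list
```

With one parent per object, `can_read("alice", "read1", nodes, acl)` needs no per-object permissions. The sketch would not work as written if an object could belong to several parents, which is the concern raised about CallSets spanning VariantSets.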