first-class ReadGroups + convenient ReadSets #32
Comments
I am +1 on doing step 1 and 2 simultaneously.
I've been hesitant to enter this conversation, but throwing caution to the wind now: I think ReadSets and ReadGroups are orthogonal groupings of reads, where:
That last point is where my understanding of a
Can ReadSets be created for already imported reads?
@vadimzalunin That would be my preference.
Angel, sounds like we're all in sync on what a I think I understand your definition of
Yes I agree with the proposals of step 1 and have no further comment on it.
Yes, which is why I brought it up :)
Yes, we do and I am working on it, but have some hard commitments for tomorrow that are delaying my progress. Suffice to say, I think of a ReadSet as the resulting array of Reads from a "select" (or query / search) operation. My notion of a ReadSet is defined by the selection criteria, as well as other metadata that help define the ReadSet. An ad-hoc ReadSet is just one you have not bothered to save for the long-term or applied other sorts of annotation on top of.
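To make that notion concrete, here is a minimal Python sketch of a ReadSet defined by its selection criteria plus metadata. Every name in it (`Read`, `ReadSet`, `criteria`, `saved_id`, `mapping_quality`) is hypothetical and chosen for illustration; none of it comes from the proposed IDL.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional


@dataclass(frozen=True)
class Read:
    """Hypothetical minimal read record; field names are illustrative only."""
    name: str
    read_group_id: str
    mapping_quality: int


@dataclass
class ReadSet:
    """A ReadSet defined by its selection criteria plus descriptive metadata."""
    criteria: Callable[[Read], bool]
    metadata: Dict[str, str] = field(default_factory=dict)
    saved_id: Optional[str] = None  # assumption: None marks an ad-hoc ReadSet

    def reads(self, all_reads: List[Read]) -> List[Read]:
        # The contents are whatever the criteria select from the store.
        return [r for r in all_reads if self.criteria(r)]

    @property
    def is_ad_hoc(self) -> bool:
        return self.saved_id is None


store = [
    Read("r1", "rg1", 60),
    Read("r2", "rg1", 10),
    Read("r3", "rg2", 45),
]

high_quality = ReadSet(
    criteria=lambda r: r.mapping_quality >= 30,
    metadata={"description": "MAPQ >= 30"},
)
selected_names = [r.name for r in high_quality.reads(store)]
```

In this sketch, saving a ReadSet only means giving it an identifier; the selection itself does not change.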
I don't think we should make this distinction. I have Bad Memories of over-specified objects from my previous foray into developing standards.
Ok, many thanks. This is useful. I support this, but the main pattern in the future will
// pseudo Java like code
rs = make_ReadStore_Factory("http://www.ebi.ac.uk/EGA/GA4GH/Root/");
// helper function to handle references, must go through indirection to make
// notice I could have come up with different rgl's
// here are some other rgl's
// you can merge etc RGLs
// you can iterate over rgls, delete etc
// This is the big call for the ReadStore
Thanks @birney. Making sure I understand:
Is that right? (In particular, I want to make sure that you're comfortable with step 1 as is.)
I just talked to @birney -- he confirmed my understanding above (yes to step 1; cool on step 2).
I am +1 for step 1 and -0 for step 2. As far as step 2 goes, if we assume arbitrary read groups according to @birney, we will frequently end up applying your final bullet:
I guess this might nullify the purpose of ReadSets. Also I am not that comfortable with your assumption:
Note that "best position" doesn't necessarily mean ground truth. Although I see your point of view, you might need a wider consensus from the GA before adopting that.
Yes I agree with Step 1. Looking at the IDL, I think the concept of "ReadSet" is being confused a bit with a factory "ReadStore" somewhat - and/or the meta-data about the dictionary for reference sequences etc with groupings of people. In my view we have 4 things (I know many people agree with this)

ReadStores - these are the root factory interfaces where the main calls to fetch things come from

ReadGroups - these are sets of reads which should be analysed together because the scientist who generated the data says that they are together. Most obviously these are from the same sequencing run. It will always be messy about modelling the precise lane/barcode/run/details of different institutes/companies/etc - inside of an institute there might be conventions or other schemes, but once this is exposed to API clients, the contract is that the client can consider these reads to be from (a) only one sample and (b) have relatively homogeneous characteristics.

Samples - A ReadGroup has one and only one Sample. However we can either allow Samples to have > 1 ReadGroup and/or allow Samples to be equivalent in some testing manner as for sure there will be >1 ReadGroup per Sample in a number of scenarios.

ReadGroupLists - these are sets of ReadGroups allowing functions to call across ReadGroups. It is up to the precise implementation (and therefore the client has to know outside of the API) about whether the ReadGroupList means 1 ReadGroup per Sample or >1 ReadGroup per Sample. I like using the name ReadGroupList as it is explicit about the arbitrary-ness of this, and how this can be
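The four objects described above could be modelled roughly as follows. This is a Python sketch under stated assumptions, not the proposed IDL: the class layout, the `merge` and `by_sample` helpers, and the `read_group_list` factory method are all hypothetical names introduced for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass(frozen=True)
class Sample:
    sample_id: str


@dataclass(frozen=True)
class ReadGroup:
    # Contract from the comment above: exactly one Sample per ReadGroup.
    read_group_id: str
    sample: Sample


@dataclass
class ReadGroupList:
    # An arbitrary collection of ReadGroups; it may hold more than one
    # ReadGroup per Sample, and the API itself does not say which.
    read_groups: List[ReadGroup] = field(default_factory=list)

    def merge(self, other: "ReadGroupList") -> "ReadGroupList":
        # Union of both lists, preserving order and dropping duplicates.
        merged = list(self.read_groups)
        for rg in other.read_groups:
            if rg not in merged:
                merged.append(rg)
        return ReadGroupList(merged)

    def by_sample(self) -> Dict[str, List[str]]:
        # Lets a client check whether there is one ReadGroup per Sample.
        grouped: Dict[str, List[str]] = {}
        for rg in self.read_groups:
            grouped.setdefault(rg.sample.sample_id, []).append(rg.read_group_id)
        return grouped


class ReadStore:
    """Root factory: the entry point from which ReadGroupLists are obtained."""

    def __init__(self, read_groups: List[ReadGroup]) -> None:
        self._read_groups = list(read_groups)

    def read_group_list(self, sample_id: Optional[str] = None) -> ReadGroupList:
        return ReadGroupList([
            rg for rg in self._read_groups
            if sample_id is None or rg.sample.sample_id == sample_id
        ])


s1, s2 = Sample("S1"), Sample("S2")
store = ReadStore([ReadGroup("rg1", s1), ReadGroup("rg2", s1), ReadGroup("rg3", s2)])
merged = store.read_group_list("S1").merge(store.read_group_list("S2"))
groups_per_sample = merged.by_sample()
```

Here `by_sample` makes the "1 ReadGroup per Sample or >1" question something the client can inspect, which is one possible reading of the arbitrariness the name ReadGroupList is meant to signal.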
Thanks Ewan. Three quick thoughts:
About ReadGroup IDs. I think there should be an internal ID and an external accession number. Each read store instance can freely use any internal IDs in any format, but it should only expose accession numbers to end users. In SRA, a ReadGroup-level accession looks like SRR012345, and in ENA it looks like ERR012345. The first two letters mark the source of the accession, so accessions never conflict between read store instances. A submitter submits data only once, to either SRA or ENA, and gets an accession that is synced routinely between SRA and ENA. We can get ENA data (at least accession numbers) from SRA and vice versa. This is the practice not only for short reads, but for all biological sequences in public repositories.

In future read stores, we should continue this practice and assign each read group a stable and globally unique accession across all read store instances. Google, for example, may give a read group submitted to Google Genomics an accession like GOR012345. We should not use a string hash as an accession (that is fine for internal IDs). For already accessioned public data, such as 1000g, all read store instances should use the same accession numbers. When a user queries /readgroups/search?id=SRR012345, (s)he should get the same data from all read store instances.

Implementing accessions should be simple. We may just add a "string accession;" field to ReadGroup. There are a few delicate issues, such as how to recognize existing accessions, what to do if reads are mapped with two mappers, and how to sync, but these should be solvable.
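A small Python sketch of the accession idea follows. It assumes an accession is three uppercase letters followed by digits, with the first two letters naming the source, which is inferred from the SRR/ERR/GOR examples above and is not a formal rule. The `ReadGroup` and `ReadStore` classes and the `search` method are hypothetical.

```python
import re
from dataclasses import dataclass
from typing import List, Optional

# Assumed shape, inferred from the examples: two source letters, one more
# letter, then digits (e.g. SRR012345). Not an official accession grammar.
ACCESSION_RE = re.compile(r"([A-Z]{2})[A-Z]\d+")


def accession_source(accession: str) -> Optional[str]:
    """Return the two-letter source prefix, or None if the shape is wrong."""
    match = ACCESSION_RE.fullmatch(accession)
    return match.group(1) if match else None


@dataclass(frozen=True)
class ReadGroup:
    internal_id: str  # store-specific, never exposed to end users
    accession: str    # stable and shared across read store instances


class ReadStore:
    def __init__(self, read_groups: List[ReadGroup]) -> None:
        self._by_accession = {rg.accession: rg for rg in read_groups}

    def search(self, accession: str) -> Optional[ReadGroup]:
        return self._by_accession.get(accession)


# Two stores keep different internal IDs for the same public read group.
store_a = ReadStore([ReadGroup("a-0001", "SRR012345")])
store_b = ReadStore([ReadGroup("7f3c9e", "SRR012345")])

hit_a = store_a.search("SRR012345")
hit_b = store_b.search("SRR012345")
```

Both stores resolve the same accession even though their internal IDs differ, which is the behaviour the comment asks for.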
Quick update from yesterday's Reads task team call:
I'll leave this open until all the step 1 changes are underway.
Closed -- all the step 1 changes to make ReadGroups a first-class object are now in. (And #52 has a proposal for addressing the remaining loose ends.)
After much discussion (mostly in #24 with @richarddurbin, @lh3, @cassiedoll, and @fnothaft), I have a suggestion on a way forward that hopefully addresses all the requirements people have raised. This writeup is meant to replace #24, which I'm closing, so people don't have to wade through all the history of how we got here.
The design principles are:
To get there, I suggest we proceed in two logical steps:
I believe that will get us an API that helps callers by making “simple things simple and complex things possible”. See below for a sketch of the details -- if people like this direction, we can turn it into pull requests quickly, since most of the work is already at least partly in flight.
Please comment on whether you’re comfortable taking that next step and putting this in code. If folks like both step 1 and step 2, we can turn it all into pull requests at once, which will be a bit more efficient. If folks are comfortable with step 1 but not sure about step 2, we can do them one at a time. (And of course details like object and field names can be hashed out in the pull requests themselves.)
Step 1: ReadGroups only; no ReadSets
Mental model:
GARead changes:
/reads/search method that takes an array of RGid's (pending in "Initial attempt at specifying 2 method definitions - read search and readset search", #26)
GAReadGroup changes:
/readgroups/search method, with RG-level params (e.g. id, library, sample, tags)
/readgroups/get method, and require RGid to be unique
Note on implementation-specific extensibility -- implementers:
Step 2: introduce ReadSets
Mental model:
GAReadSet object:
/readsets/search method
Use of GAReadSet in other places:
Note on implementation-specific flexibility:
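As a rough illustration of the Step 1 methods listed above (/reads/search taking an array of RGids, /readgroups/search with RG-level params, and /readgroups/get with a unique RGid), here is an in-memory Python sketch. The class `InMemoryReadStore`, its method names, and the fields on `GARead` and `GAReadGroup` are assumptions made for the example; the real objects are defined in the IDL and may differ.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, List, Optional


@dataclass(frozen=True)
class GAReadGroup:
    id: str
    sample: str
    library: str


@dataclass(frozen=True)
class GARead:
    name: str
    read_group_id: str


class InMemoryReadStore:
    def __init__(self, read_groups: Iterable[GAReadGroup],
                 reads: Iterable[GARead]) -> None:
        self._read_groups: Dict[str, GAReadGroup] = {}
        for rg in read_groups:
            # Step 1 requires RGid to be unique.
            if rg.id in self._read_groups:
                raise ValueError(f"duplicate RGid: {rg.id}")
            self._read_groups[rg.id] = rg
        self._reads = list(reads)

    def readgroups_get(self, rg_id: str) -> GAReadGroup:
        # Stand-in for /readgroups/get.
        return self._read_groups[rg_id]

    def readgroups_search(self, sample: Optional[str] = None,
                          library: Optional[str] = None) -> List[GAReadGroup]:
        # Stand-in for /readgroups/search; omitted params match everything.
        return [
            rg for rg in self._read_groups.values()
            if (sample is None or rg.sample == sample)
            and (library is None or rg.library == library)
        ]

    def reads_search(self, read_group_ids: Iterable[str]) -> List[GARead]:
        # Stand-in for /reads/search, which takes an array of RGids.
        wanted = set(read_group_ids)
        return [r for r in self._reads if r.read_group_id in wanted]


store = InMemoryReadStore(
    read_groups=[
        GAReadGroup("rg1", sample="NA12878", library="libA"),
        GAReadGroup("rg2", sample="NA12878", library="libB"),
        GAReadGroup("rg3", sample="NA12891", library="libA"),
    ],
    reads=[GARead("r1", "rg1"), GARead("r2", "rg2"), GARead("r3", "rg3")],
)

# Typical flow: find the read groups for a sample, then fetch their reads.
rg_ids = [rg.id for rg in store.readgroups_search(sample="NA12878")]
read_names = [r.name for r in store.reads_search(rg_ids)]
```

The two-call flow (search read groups, then search reads by RGid) is one way a caller could work with ReadGroups alone, before any ReadSet object exists.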