// The read group name.
string name = 3;
string name = 4;
You should almost never reuse existing tag numbers: https://developers.google.com/protocol-buffers/docs/proto3#assigning-tags
Technically we could decide to be more nitpicky about enforcing this rule; e.g. it's OK to change tag numbers which haven't yet made it into a formal release. In my experience, it's easier to not try to think of all the possible scenarios where a tag might already be in use by a client somewhere, and just disallow reusing tag numbers on master.
Yes, thanks! I've seen the future compatibility benefits of fixing field numbers and I'm happy to work with whatever pattern. I understand the reasoning for maintaining field numbers across released versions. Perhaps we can agree on a future release where the field numbers will be fixed?
In practice, do you pick large field numbers for fields that you might place towards the end of a message? For perhaps pedantic reasons I would like to place the `info` fields last in the order of tags. Does it become unavoidable to place tag numbers out of order from their vertical position in the .proto?
Sure, I think it's reasonable to 'fix' all of the tag numbers across two major versions of the API.
Leaving space between tag numbers is typically not done in my experience. Simplest is just to add fields with increasing tag numbers over time and use a logical organization vertically. Agree, it's slightly bothersome when they go out of order, but it is a small price to pay to avoid data corruption. With this pattern the only thing you really need to remember is "don't reuse a tag number". In my opinion, alternate schemes introduce too much complexity either by process or by subjectivity.
FYI: For field removal you can also use the reserved keyword to enforce that old tags aren't reused https://developers.google.com/protocol-buffers/docs/proto3#reserved
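The two rules discussed above (never reuse a tag, and use `reserved` to enforce that) can be sketched in proto3. The message and field names here are illustrative, not taken from this PR:

```proto
syntax = "proto3";

// Hypothetical message showing the docs' advice: after removing or
// renumbering a field, reserve its old tag number (and optionally its
// name) so the compiler rejects any later attempt to reuse them.
message ReadGroupExample {
  reserved 3;            // old tag of the removed/renumbered field
  reserved "old_name";   // optionally reserve the old field name too

  string name = 4;       // replacement field gets a fresh tag number
}
```

With the `reserved` statements in place, declaring `string something = 3;` in this message becomes a compile-time error rather than a silent wire-format hazard.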
Great thanks! I've updated the PR to not reuse any tag numbers.
message SearchReadGroupsRequest {
// The dataset to search.
Which of these fields are required? Why allow both search by dataset and read group set?
If a single sample is represented across many read group sets, one would like to return just those read group IDs for constructing the `readGroupIds` field of a `SearchReadsRequest`. Both fields are optional, and only read groups matching both will be returned.
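The request shape being discussed here could be sketched as follows; the tag numbers are assumptions for illustration, not taken from the PR diff:

```proto
syntax = "proto3";

// Sketch of the request at this point in the review: both filters are
// optional, and a result must match every filter that is set.
message SearchReadGroupsRequest {
  string dataset_id = 1;         // optional: restrict to a dataset
  string read_group_set_id = 2;  // optional: restrict to a read group set
}
```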
Something needs to be required - I'd say pick one or the other between dataset and read group set ID, make it required, and drop the other.
Searching across the entire repository without a parent ID is not going to be feasible for all implementations and is not consistent with the other endpoints.
Great, I removed `read_group_set_id` from the request, made `dataset_id` required, and then added a list of `read_group_ids` to the `ReadGroupSet` message.
I would be very careful of using `required` - please take a look at the following note, as `required` is forever:
❗ Required Is Forever You should be very careful about marking fields as required. If at some point you wish to stop writing or sending a required field, it will be problematic to change the field to an optional field – old readers will consider messages without this field to be incomplete and may reject or drop them unintentionally. You should consider writing application-specific custom validation routines for your buffers instead. Some engineers at Google have come to the conclusion that using required does more harm than good; they prefer to use only optional and repeated. However, this view is not universal.
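The quoted warning only applies to proto2, where the field label exists; a minimal proto2 sketch of the trap (hypothetical message, for illustration):

```proto
// proto2 only: once a field is marked `required`, old readers will
// reject or drop messages that omit it, so the label can never safely
// be relaxed to `optional` later.
syntax = "proto2";

message GetReadGroupRequestExample {
  required string read_group_id = 1;  // forever required for old readers
}
```

In proto3 this declaration is simply invalid: the `required` label was removed from the language, so "required" can only be a documentation-level contract enforced by server-side validation.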
Yes, it's not in the proto, just the documentation. Thanks!
Ok, just a friendly reminder in case folks plan to change it in the future. I reposted it since this message is hidden, and it is a critical part when working with Protocol Buffers.
This is not relevant to proto3; required and optional are no longer part of the specification.
As long as proto2 messages don't get imported, or vice versa, in mixed form. This is from the same page:

Using proto3 Message Types
It's possible to import proto3 message types and use them in your proto2 messages, and vice versa. However, proto2 enums cannot be used in proto3 syntax.

I'll add a note to the other post in this PR to make it clearer, in case people search for it and try to use it.
// Specifying the id of a BioSample record will return only readgroups
// with the given bioSampleId.
string bio_sample_id = 4;
Is there background on why `bio_sample_id` is per-`ReadGroup` and not per-`ReadGroupSet`? I don't see any discussion of this on the original PR. I would have thought the latter would be more natural (and would side-step this PR).
Within a single read group set, each read group can have its own `bio_sample_id`. This is to model BAMs that may have different samples per RG tag.
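The per-read-group sample modeling described here can be sketched as below; the tag numbers and the exact field set are assumptions, only `bio_sample_id` and the containment relationship come from this thread:

```proto
syntax = "proto3";

// Sketch: each ReadGroup carries its own bio_sample_id, mirroring the
// SM field of a BAM @RG header line, so a single ReadGroupSet (one BAM)
// may span multiple samples.
message ReadGroup {
  string id = 1;
  string bio_sample_id = 2;  // sample for this read group only
}

message ReadGroupSet {
  string id = 1;
  repeated ReadGroup read_groups = 2;  // may mix samples
}
```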
SampleId has always been in ReadGroup. The rationale for this was not written down. Perhaps someone who was there can explain. @delagoya, @lh3, @richarddurbin ???
Well, that brings up some old memories :) So there are too many examples to pick from, and below are a few snippets with their respective links, which go into more detail:
@delagoya posted on #32 (comment):

- A `ReadGroup` is defined on submission by the submitter and represents an atomic grouping of reads (could be library, could be barcode, could be source sample). A `Read` can only belong to one `ReadGroup`. A `ReadSet` is an arbitrary collection of Reads that represents a set of reads to be analyzed at the same time.

@birney posted on #32 (comment):

- ReadGroups - these are sets of reads which should be analysed together because the scientist who generated the data says that they are together.
- Samples - A ReadGroup has one and only one Sample. However we can either allow Samples to have > 1 ReadGroup and/or allow Samples to be equivalent in some testing manner, as for sure there will be > 1 ReadGroup per Sample in a number of scenarios.
- ReadGroupLists - these are sets of ReadGroups allowing functions to call across ReadGroups.

@lh3 posted on #24 (comment):

- A read group is the smallest set of homogeneous reads. Although there is still ambiguity in this definition, ReadGroup is in practice more clearly defined and more stable than ReadSet (see "add comment defining a GA4GH ReadSet" #16) and DataSet. It is a first-class concept in SAM widely used in data processing. Many downstream tools, such as GATK, samtools and Picard, focus on read groups and frequently ignore the concept of file (or ReadSet). As a matter of fact, all ReadGroups from 1000g are associated with SRA/ERA accession numbers. Most 1000g BAM files are not for now, because ReadSet is harder to define than ReadGroup.
- Sample and experimental info are attached to ReadGroup in SAM, not to a physical file or a ReadSet in the schema. When we query by sample, by library or by sequencing technology, we are directly searching for ReadGroups, not for ReadSets. In addition, it is possible that a ReadSet may contain ReadGroups from different samples - SAM is the same. When we query by sample, we will have to break a ReadSet. ReadGroup is the minimal unit.
Hope it helps,
~p
message SearchReadGroupsRequest {
// The dataset to search (required).
string dataset_id = 1;
I think we should add back in `read_group_set_id` (optional). I'm thinking of the case here where we need to find the ReadGroups that correspond to a particular ReadGroupSet so that the client can run a `SearchReadsRequest`. With this interface, the client will have to get back all of the ReadGroups in a dataset (potentially hundreds of thousands), and then filter for the read group set they want.

I guess the alternative to this is to add a `read_group_set` option to `SearchReadsRequest` to cover this use-case. Since Google already does this (AFAIK), this might be the simpler option.
We need to hold off on this change until we work through (and document) the use cases.

Obtaining all readgroups in a dataset (project) is something that one would almost never want to do. In my experience, the common analysis work flow is:

- Find BAMs (readgroupsets) meeting some criteria defined by the metadata. In many contexts, this is more refined than sample. For instance, a sample can have both RNA-Seq and genome sequencing.
- Retrieve reads from the readgroup, regardless of readgroup set.

It would simplify the API if we didn't push dataset down to the lower levels unless there is a compelling use case, and stick to the container hierarchy. The container hierarchy will serve the `find all' type of queries (walking the hierarchy), and criteria queries can build on this plus the metadata.

Again, we are up against dataset being an unclear concept.
It would simplify the API if we didn't push dataset down to the lower levels unless there is a compelling use case, and stick to the container hierarchy. The container hierarchy will serve the `find all' type of queries (walking the hierarchy), and criteria queries can build on this plus the metadata.
I completely agree, but I don't think we have a container hierarchy officially. Attempts to define the formal data model as a hierarchy of containers have failed because of an insistence on the importance of multiple occupancy (e.g., references belong to multiple reference sets, callsets belong to multiple variant sets, etc.). Given that servers can make shallow copies of objects, I have never understood the necessity of this. Until we actually understand what being "in" a dataset means, we're not going to solve these problems.

btw, I am +1 on adding the endpoint; it's only the search criteria that needs to be worked through.
Given the current data model, we could move ahead with presenting a more sensible offering based on the existing message fields, as the current metadata is hampered without this change. If other search request fields become desired, we can add them to the request as needed. The use case is specifically filtering read groups by their …
The point is to allow restricting to a readgroupset, not …
@diekhans So do you think making the …
Hi David,

Think of the bigger picture: what are we really interested in? We want to analyze clean, quality-controlled groupings of data (reads, variants, etc.) for clinical purposes (i.e. disease identification, personalized medicine, etc.). We want to search by metadata and not drill too closely on the bottom level unless really necessary; usually that is performed by reaching back to the sequencing center that submitted them, in case there are significant issues with the cloud-uploaded data. We have worked too long with (g)VCFs and BAMs, which slow us down and become sub-optimal for the type of large-scale analysis that we want to target. We want to query a system through macros like this (yes, I know it looks rough, but I omitted the other details for clarity to drive home the point):
So the key question is: what type of information and data-structures would most easily and correctly get us to perform such queries in the most intuitive way? That is really the key goal of GA4GH that matters the most. Hope it helps,
Thanks Paul, that's very interesting. The decision on whether to include the …
+1!
- Add comments to get readgroup
- Add read group set ID to search request
- Move NB comment to Read Group Set outer comment
- Remove read group set id from search read groups request
- Add array of read group ids to read group set
Hi David,

The issue is not that it is optional, but that a … Think about this scenario. A baby is born today with a predisposition to a specific cancer subtype. I want to sequence that baby every year for the rest of his/her life, including the parents at the time of birth. More refined samples of the cancer's molecular evolution will be biopsied on a more frequent basis that reflects the aggressiveness of the stage of the cancer. These might be many …

We are doing deep sequencing these days because we are limited in the capacity to quickly determine the active site in the form of a 3D conformation of protein sequences, and their interactions. Thus we are left with what we can inspect most quickly, which are the gene variations that become most probabilistically significant. Next-Generation Sequencing has always been an indirect analysis for understanding some types of biological processes. So in order to correlate with diseases and possible drug targets - and their downstream interactions - we want to perform queryable analysis on a large scale.

Do you see my dilemma? We can add a … If we add it now and large datasets get loaded this way at different repositories, and several years down the line we change the design to be more inter-connected to suit new analysis techniques, does that mean we have to reload the data, or would we be able to perform schema evolution with minimal updates to the data to fix it?

We have to start with the analysis goals, and tailor the data-structures to fit them. Hope you can see where I am coming from.

Thank you,
Complex queries for readgroups should be performed using the sequencing …
@david4096 will create a demo program to show this approach before merging.
I greatly welcome the idea of a demo, as it is my preferred approach. It will give us the opportunity to hash out and test ideas together, integrating the best ideas from everyone.
Hi David,

I'm assuming the demos will be presented via GitHub. While you are building the demo, try to see if generating temporary federated tables might help you in joining multiple data sources together. These would work as short-lived dynamic databases, where the transitory tables optimize the memoization, similar to a dynamic programming approach. With time we can have a cache scheduler where the tables become more permanent if they become highly accessed, and would have the possibility of talking to each other. It is basically a way of balancing the wire-protocol load against these temporary data-hubs' throughput. These federated tables should be auto-generated by the queries performed, and thus can live ephemerally but would aid in speeding up dynamic queries accessing multiple sites and/or data-sources.

Hope it helps,
@pgrosu That's very interesting, I agree! I currently have enough resources to approach this in such a way as to verify the basic access patterns; however, I am interested in whether you have some scaffolding to work with in approaching the federation and optimization problems. I'll let you know if I run into any issues putting together a minimal demonstration!
@david4096 I have ways to test things on a multi-node cluster where the distributed repositories would be simulated by throttling the bandwidth among running processes. There are tools like Pipe Viewer or RequestsThrottler with which one can tweak the bandwidth throughput among parallel running processes representing subsets of repositories. I would not rush it, as it is better to take the time to do it right. Let's all work together to find the optimal approach, as there will be multiple iterations we will have to go through. It'll be fun :) Feel free to ping me anytime if you need anything. It would be nice to work together to come up with an elegant solution for this, which I know would benefit everyone as well.

Cheers,
// `POST /readgroups/search` must accept a JSON version of
// `SearchReadGroupsRequest` as the post body and will return a JSON
// version of `SearchReadGroupsResponse`.
rpc SearchReadGroups(SearchReadGroupsRequest)
Needs to be updated to include URL endpoints.
The access patterns for the reads API would be improved by adding a read groups endpoint (#650). Currently the readgroupset endpoint returns a repeated list of readgroups matching the search criteria. This is confusing because the readgroupset record is modified at query time.
This PR adds the `/readgroups/search` endpoint, which exposes the available read groups using familiar strict matching on `name`, `bio_sample_id`, `dataset_id`, and `read_group_set_id`. Read group set messages no longer contain the list of readgroups, since they can be retrieved using `readgroups/<read_group_id>`. This simplifies the read group set endpoint by moving the `bio_sample_id` filter to the appropriate entity, `ReadGroup`, as a single read group set might contain reads for many samples.
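Putting the description above together, the final request could look roughly like the sketch below; the tag numbers and field ordering are assumptions for illustration, not taken from the merged proto:

```proto
syntax = "proto3";

// Illustrative sketch of the search request described above. All
// filters use strict (exact) matching; dataset_id is required by
// documentation only, since proto3 has no `required` label.
message SearchReadGroupsRequest {
  string dataset_id = 1;         // the dataset to search (required)
  string name = 2;               // exact-match filter on read group name
  string bio_sample_id = 3;      // exact-match filter on sample
  string read_group_set_id = 4;  // exact-match filter on read group set
}
```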