
Readgroups Endpoint #652

Closed
wants to merge 4 commits

Conversation

david4096
Member

The access patterns for the reads API would be improved by adding a read groups endpoint (#650). Currently the readgroupset endpoint returns a repeated list of readgroups matching the search criteria. This is confusing because the readgroupset record is modified at query time.

This PR adds the /readgroups/search endpoint, which exposes the available read groups using familiar strict matching on name, bio_sample_id, dataset_id, and read_group_set_id. Read group set messages no longer contain the list of readgroups since they can be retrieved using readgroups/<read_group_id>. This simplifies the read group set endpoint by moving the bio_sample_id filter to the appropriate entity, ReadGroup, as a single read group set might contain reads for many samples.
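A minimal sketch of the request message this PR introduces, assembled from the description above (dataset_id = 1 and bio_sample_id = 4 match the diff excerpts later in this thread; the other field numbers are illustrative):

// This request maps to the body of POST /readgroups/search as JSON.
message SearchReadGroupsRequest {
  // The dataset to search.
  string dataset_id = 1;

  // Only return read groups in this read group set.
  string read_group_set_id = 2;

  // Only return read groups with this name (strict matching).
  string name = 3;

  // Only return read groups for this biosample.
  string bio_sample_id = 4;
}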


 // The read group name.
-string name = 3;
+string name = 4;
Contributor

You should almost never reuse existing tag numbers: https://developers.google.com/protocol-buffers/docs/proto3#assigning-tags

Technically we could decide to be more nitpicky about enforcing this rule; e.g. it's OK to change tag numbers which haven't yet made it into a formal release. In my experience, it's easier to not try to think of all the possible scenarios where a tag might already be in use by a client somewhere, and just disallow reusing tag numbers on master.

Member Author

Yes, thanks! I've seen the future compatibility benefits of fixing field numbers and I'm happy to work with whatever pattern. I understand the reasoning for maintaining field numbers across released versions. Perhaps we can agree on a future release where the field numbers will be fixed?

In practice, do you pick large field numbers for fields that you might place towards the end of a message? For perhaps pedantic reasons I would like to place the info fields last in the order of tags. Does it become unavoidable to place tag numbers out of order from their vertical position in the .proto?

Contributor

@calbach Jul 7, 2016

Sure, I think it's reasonable to 'fix' all of the tag numbers across two major versions of the API.

Leaving space between tag numbers is typically not done in my experience. Simplest is just to add fields with increasing tag numbers over time and use a logical organization vertically. Agree, it's slightly bothersome when they go out of order, but it is a small price to pay to avoid data corruption. With this pattern the only thing you really need to remember is "don't reuse a tag number". In my opinion, alternate schemes introduce too much complexity either by process or by subjectivity.

FYI: For field removal you can also use the reserved keyword to enforce that old tags aren't reused https://developers.google.com/protocol-buffers/docs/proto3#reserved
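To illustrate both points, a minimal proto3 sketch assuming the tag history from the diff above (the id field and its number are placeholders):

message ReadGroup {
  // Tag numbers grow over time, so their numeric order need not match
  // the vertical order of the fields in the file.
  string id = 1;

  // Tag 3 once belonged to name; reserving it guarantees the compiler
  // rejects any attempt to recycle it for a different field.
  reserved 3;

  // The read group name, renumbered rather than reusing tag 3.
  string name = 4;
}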

Member Author

Great, thanks! I've updated the PR to not reuse any tag numbers.



message SearchReadGroupsRequest {
// The dataset to search.
Contributor

Which of these fields are required? Why allow both search by dataset and read group set?

Member Author

If a single sample is represented across many read group sets, one would like to return just those read group IDs for constructing the readGroupIds field of a SearchReadsRequest. Both fields are optional, and only read groups matching both will be returned.
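For example, a hypothetical request (protobuf text format, IDs invented) that narrows a biosample search to a single read group set; the IDs of the matching read groups would then populate readGroupIds in a SearchReadsRequest:

# SearchReadGroupsRequest, protobuf text format
dataset_id: "ds-1"
read_group_set_id: "rgs-42"
bio_sample_id: "sample-7"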

Contributor

Something needs to be required - I'd say pick one or the other between dataset and read group set ID, make it required, and drop the other.

Searching across the entire repository without a parent ID is not going to be feasible for all implementations and is not consistent with the other endpoints.

Member Author

Great, I removed read_group_set_id from the request and made dataset_id required and then added a list of read_group_ids to the ReadGroupSet message.
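A sketch of the revised ReadGroupSet as described, with illustrative field numbers:

message ReadGroupSet {
  // The read group set ID.
  string id = 1;

  // IDs of the read groups in this set, replacing the embedded
  // repeated ReadGroup messages returned previously.
  repeated string read_group_ids = 2;
}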

Contributor

I would be very careful of using required - please take a look at the following note, as required is forever:

Required Is Forever
You should be very careful about marking fields as required. If at some point you wish to stop writing or sending a required field, it will be problematic to change the field to an optional field – old readers will consider messages without this field to be incomplete and may reject or drop them unintentionally. You should consider writing application-specific custom validation routines for your buffers instead. Some engineers at Google have come to the conclusion that using required does more harm than good; they prefer to use only optional and repeated. However, this view is not universal.

Member Author

Yes, it's not in the proto, just the documentation. Thanks!

Contributor

Ok, just a friendly reminder in case folks plan to change it in the future. I reposted it since this message is hidden, and it is a critical point when working with Protocol Buffers.

Contributor

This is not relevant to proto3; required and optional are no longer part of the specification.
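In proto3, requiredness lives only in documentation and server-side validation, which is the pattern this PR follows; a minimal sketch:

message SearchReadGroupsRequest {
  // The dataset to search (required). proto3 has no required label, so
  // servers must validate that this field is non-empty.
  string dataset_id = 1;
}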

Contributor

It is still relevant when proto2 and proto3 messages are mixed via imports - this is from the same page:

Using proto3 Message Types

It's possible to import proto3 message types and use them in your proto2 messages, and vice versa. However, proto2 enums cannot be used in proto3 syntax.

I'll add a note to the other post in this PR to make it clearer, in case people search for it and try to use it.


// Specifying the id of a BioSample record will return only readgroups
// with the given bioSampleId.
string bio_sample_id = 4;
Contributor

Is there background on why bio_sample_id is per-ReadGroup and not per-ReadGroupSet? I don't see any discussion of this on the original PR. I would have thought the latter would be more natural (and would side-step this PR).

Member Author

Within a single read group set each read group can have its own bio_sample_id. This is to model BAMs that may have different samples per RG tag.

Contributor

SampleId has always been in ReadGroup. The rationale for this was not written down. Perhaps someone who was there can explain. @delagoya, @lh3, @richarddurbin ???


Contributor

Well, that brings up some old memories :) There are many examples to pick from; below are a few snippets with their respective links, which go into more detail:

@delagoya posted on #32 (comment):

  • A ReadGroup is defined on submission by the submitter and represents an atomic grouping of reads (could be library, could be barcode, could be source sample). A Read can only belong to one ReadGroup.
  • ReadSet is an arbitrary collection of Reads that represents a set of reads to be analyzed at the same time.

@birney posted on #32 (comment):
ReadGroups - these are sets of reads which should be analysed together because the scientist who generated the data say that they are together.
Samples - A ReadGroup has one and only one Sample. However we can either allow Samples to have > 1 ReadGroup and/or allow Samples to be equivalent in some testing manner as for sure there will be >1 ReadGroup per Sample in a number of scenarios.
ReadGroupLists - these are sets of ReadGroups allowing functions to call across ReadGroups.

@lh3 posted on #24 (comment):

  • A read group is the smallest set of homogeneous reads. Although there is still ambiguity in this definition, ReadGroup is in practice more clearly defined and more stable than ReadSet (see add comment defining a GA4GHReadSet #16) and DataSet. It is a first-class concept in SAM widely used in data processing. Many downstream tools, such as GATK, samtools and Picard, focus on read groups and frequently ignore the concept of file (or ReadSet). As a matter of fact, all ReadGroups from 1000g are associated with SRA/ERA accession numbers. Most 1000g BAM files are not for now, because ReadSet is harder to define than ReadGroup.
  • Sample and experimental info are attached to ReadGroup in SAM, not to a physical file or a ReadSet in the schema. When we query by sample, by library or by sequencing technology, we are directly searching for ReadGroups, not for ReadSets. In addition, it is possible that a ReadSet may contain ReadGroups from different samples - SAM is the same. When we query by sample, we will have to break a ReadSet. ReadGroup is the minimal unit.

Hope it helps,
~p

@pgrosu
Contributor

pgrosu commented Jul 8, 2016

I would be very careful when using required when defining a protobuf message - please take a look at the following note, as required is forever:

Required Is Forever
You should be very careful about marking fields as required. If at some point you wish to stop writing or sending a required field, it will be problematic to change the field to an optional field – old readers will consider messages without this field to be incomplete and may reject or drop them unintentionally. You should consider writing application-specific custom validation routines for your buffers instead. Some engineers at Google have come to the conclusion that using required does more harm than good; they prefer to use only optional and repeated. However, this view is not universal.

As long as we stay with proto3 message types, that should not be a problem, but if we mix them as follows, it is still relevant (this is also from the same page):

Using proto3 Message Types

It's possible to import proto3 message types and use them in your proto2 messages, and vice versa. However, proto2 enums cannot be used in proto3 syntax.


message SearchReadGroupsRequest {
// The dataset to search (required).
string dataset_id = 1;
Contributor

I think we should add back in read_group_set_id (optional). I'm thinking of the case here where we need to find the ReadGroups that correspond to a particular ReadGroupSet so that the client can run a SearchReadsRequest. With this interface, the client will have to get back all of the ReadGroups in a dataset (potentially hundreds of thousands), and then filter for the read group set they want.

I guess the alternative to this is to add a read_group_set option to SearchReadsRequest to cover this use-case. Since Google already does this (AFAIK), this might be the simpler option.
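A sketch of that alternative, with the existing request abridged; the new field and its tag number are purely hypothetical:

message SearchReadsRequest {
  // The read groups to search.
  repeated string read_group_ids = 1;

  // Hypothetical optional filter: search every read group in this set
  // without the client having to enumerate the IDs first.
  string read_group_set_id = 7;
}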

Contributor

We need to hold off on this change until we work through (and document) the use cases.

Obtaining all readgroups in a dataset (project) is something that one would almost never want to do.

In my experience, the common analysis workflow is:

  1. Find BAMs (readgroupsets) meeting some criteria defined by the metadata. In many contexts, this is more refined than sample. For instance, a sample can have both RNA-Seq and genome sequencing.
  2. Retrieve reads from the readgroup, regardless of readgroup set.

It would simplify the API if we didn't push dataset down to the lower levels unless there is a compelling use case, and stuck to the container hierarchy. The container hierarchy will serve the `find all' type of queries (walking the hierarchy), and criteria queries can build on this plus the metadata.

Again, we are up against dataset being an unclear concept.


Contributor

It would simplify the API if we didn't push dataset down to the lower levels unless there is a compelling use case, and stuck to the container hierarchy. The container hierarchy will serve the `find all' type of queries (walking the hierarchy), and criteria queries can build on this plus the metadata.

I completely agree, but I don't think we have a container hierarchy officially. Attempts to define the formal data model as a hierarchy of containers have failed because of an insistence on the importance of multiple occupancy (e.g., references belong to multiple reference sets, call sets belong to multiple variant sets, etc.). Given that servers can make shallow copies of objects, I have never understood the necessity of this. Until we actually understand what being "in" a dataset means, we're not going to solve these problems.

@diekhans
Contributor

diekhans commented Jul 8, 2016

btw, I am +1 on adding the endpoint; it's only the search criteria that need to be worked through.

@david4096
Member Author

david4096 commented Jul 8, 2016

Given the current data model, bio_sample_id, name, and dataset_id are all named fields in the ReadGroup message. This means they can be searched using these fields without restricting to a read group set. I believe the original schema designers were clever regarding this and left a path for accessing read groups at this granularity.

We could move ahead with presenting a more sensible offering based on the existing message fields, as the current metadata support is hampered without this change. If other search request fields become desired, we can add them to the request as needed.

The use case is specifically filtering read groups by their bioSampleId, although it seems sensible to provide name filtering as well. @jeromekelleher: for performing a search reads request, the read group IDs have been included in the ReadGroupSet message itself. That common access pattern is unchanged.
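For reference, a sketch of the ReadGroup fields the proposed request filters on (field numbers illustrative, kept consistent with the renumbering discussed above):

message ReadGroup {
  // Retired tag from the earlier renumbering (see the tag discussion).
  reserved 3;

  // The read group ID.
  string id = 1;

  // The ID of the dataset this read group belongs to.
  string dataset_id = 2;

  // The read group name.
  string name = 4;

  // The ID of the biosample the read group's reads were derived from.
  string bio_sample_id = 5;
}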

@diekhans
Contributor

The point is to allow restricting to a readgroupset, not require it. Readgroupsets are the most common data grouping for analysis (based on the largest cancer genomics research project to date).


@david4096
Member Author

david4096 commented Jul 11, 2016

@diekhans So do you think making the dataset_id required and the read_group_set_id optional makes more sense?

@pgrosu
Contributor

pgrosu commented Jul 12, 2016

Hi David,

Think of the bigger picture: what are we really interested in? We want to analyze clean, quality-controlled groupings of data (reads, variants, etc.) for clinical purposes (i.e. disease identification, personalized medicine, etc.). We want to search by metadata and not drill too closely into the bottom level unless really necessary, and usually that is performed by reaching back to the sequencing center that submitted the data, in case there are significant issues with the cloud-uploaded data. We have worked too long with (g)VCFs and BAMs, which slow us down and become sub-optimal for the type of large-scale analysis that we want to target. We want to query a system through macros like this (yes, I know it looks rough, but I omitted the other details for clarity to drive home the point):

prog GermlineAndCancer {
  qc.reads <- using.all.reads(qc="validated", filter="*carcinogenic*")
  variants.germline <- germline.pipeline.with.qc(qc.reads)
  variants.cancer   <- cancer.pipeline.with.qc(qc.reads)
  report.%date% <- saveAsHTML(getOverlapsAndRare(variants.germline, variants.cancer), globalStagingRepositoryByUser="user@email.com")
}

run.daily( "GermlineAndCancer" )  // Daily-streaming data will eventually trigger new key, rare variants.

So the key question is what type of information and data-structures would most easily and correctly get us to perform such queries in the most intuitive way. That is really the goal of GA4GH that matters most.

Hope it helps,
Paul

@david4096
Member Author

Thanks Paul, that's very interesting. The decision on whether to include the read_group_set_id in the request seems like the last point of contention. Based on @diekhans's observation, there is a good case for including it as an optional field of the request, which is reflected in the most recent commit. Can I get your +1?

@diekhans
Contributor

+1!

Add comments to get readgroup
Add read group set ID to search request
Move NB comment to Read Group Set outer comment
Remove read group set id from search read groups request
Add array of read group ids to read group set
@pgrosu
Contributor

pgrosu commented Jul 13, 2016

Hi David,

The issue is not that it is optional, but that a ReadGroup can belong to multiple ReadGroupSets. If it belongs to multiple ones, does that mean we have to search all ReadGroupSets that this ReadGroup belongs to?

A ReadGroup can be a sample, but I want to combine it with other samples that are high quality to perform a specific analysis. Then I want to find the other ReadGroupSets that this QC'ed sample was analyzed in, so that I do not duplicate results I can use for comparison. How natural would such a query be to perform?

Think about this scenario. A baby is born today with a predisposition to a specific cancer subtype. I want to sequence that baby every year for the rest of his/her life, including the parents at the time of birth. More refined samples of the cancer's molecular evolution will be biopsied on a more frequent basis that reflects the aggressiveness of the stage of the cancer. These might be many ReadGroupSets made up of multiple ReadGroups which I might - or might not - want to average out for my comparison.

We are doing deep sequencing these days because we are limited in the capacity to quickly determine the active site in the form of a 3D conformation of protein sequences, and their interactions. Thus we are left with what we can inspect most quickly, which are gene variations that become most probabilistically significant. Next-Generation Sequencing has always been an indirect analysis for understanding some types of biological processes.

So in order to correlate with diseases and possible drug targets - and their downstream interactions - we want to perform queryable analysis on a large scale.

Do you see my dilemma? We can add a ReadGroupSet id to a ReadGroup as optional, but then the data-structure is back to the form of a BAM, which we are trying to get away from. We have had the same mindset towards NGS for the past 7 years - and have reverted to it periodically over the past two years here in GA4GH - but the goal is to make the wire-protocol fit large-scale, global multi-repository analysis.

If we add it now and large datasets get loaded this way at different repositories, and several years down the line we change the design to be more inter-connected to suit new analysis techniques, does that mean we have to reload the data, or would we be able to perform schema evolution with minimal updates to the data to fix it?

We have to start with the analysis goals, and tailor the data-structures to fit it.

Hope you can see where I am coming from.

Thank you,
Paul

@diekhans
Contributor

Complex queries for readgroups should be performed using the sequencing metadata.


@kozbo
Contributor

kozbo commented Jul 14, 2016

@david4096 will create a demo program to show this approach before merging.

@pgrosu
Contributor

pgrosu commented Jul 14, 2016

I greatly welcome the idea of a demo, as it is my preferred approach. It will give us the opportunity to hash out and test ideas together, integrating the best ideas from everyone.

@pgrosu
Contributor

pgrosu commented Jul 15, 2016

Hi David,

I'm assuming the demos will be presented via Github. While you are building the demo, try to see whether generating temporary federated tables might help you in joining multiple data sources together. These would work as short-lived dynamic databases, where the transitory tables optimize the memoization, similar to a dynamic programming approach. With time we can have a cache scheduler where tables that become highly accessed are made more permanent and gain the ability to talk to each other. It is basically a way of balancing the wire-protocol load against these temporary data-hubs' throughput.

These federated tables should be auto-generated by the queries performed, and thus can live ephemerally but would aid in speeding up the dynamic query accessing multiple sites and/or data-sources.

Hope it helps,
Paul

@david4096
Member Author

@pgrosu That's very interesting, I agree! I currently have enough resources to approach this in a way that verifies the basic access patterns; however, I am interested in whether you have some scaffolding to work with for the federation and optimization problems.

I'll let you know if I run into any issues putting together a minimal demonstration!

@pgrosu
Contributor

pgrosu commented Jul 18, 2016

@david4096 I have ways to test things on a multi-node cluster where the distributed repositories would be simulated by throttling the bandwidth among running processes. There are tools like Pipe Viewer or RequestsThrottler with which one can tweak the bandwidth throughput among parallel running processes representing subsets of repositories.

I would not rush it, as it is better to take the time to do it right. Let's all work together to find the optimal approach, as there will be multiple iterations we will have to go through. It'll be fun :)

Feel free to ping me anytime if you need anything. It would be nice to work together to come up with an elegant solution for this, which I know would benefit everyone as well.

Cheers,
~p

@kozbo assigned kozbo and dcolligan and unassigned kozbo Feb 23, 2017
@kozbo added this to the 2017-02 milestone Feb 23, 2017
// `POST /readgroups/search` must accept a JSON version of
// `SearchReadGroupsRequest` as the post body and will return a JSON
// version of `SearchReadGroupsResponse`.
rpc SearchReadGroups(SearchReadGroupsRequest)
    returns (SearchReadGroupsResponse);
Member Author

Needs to be updated to include URL endpoints.
