Skip to content
This repository has been archived by the owner on Oct 28, 2022. It is now read-only.

Readgroups Endpoint #652

Closed
wants to merge 4 commits into from
Closed
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
68 changes: 61 additions & 7 deletions src/main/proto/ga4gh/read_service.proto
Original file line number Diff line number Diff line change
Expand Up @@ -9,12 +9,24 @@ service ReadService {
//
// `POST /readgroupsets/search` must accept a JSON version of
// `SearchReadGroupSetsRequest` as the post body and will return a JSON
// version of `SearchReadGroupSetsResponse`. Only readgroups that
// match an optionally supplied bioSampleId will be included in the
// response.
// version of `SearchReadGroupSetsResponse`.
rpc SearchReadGroupSets(SearchReadGroupSetsRequest)
returns (SearchReadGroupSetsResponse);

// Gets a list of `ReadGroup` matching the search criteria.
//
// `POST /readgroups/search` must accept a JSON version of
// `SearchReadGroupsRequest` as the post body and will return a JSON
// version of `SearchReadGroupsResponse`.
rpc SearchReadGroups(SearchReadGroupsRequest)
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Needs to be updated to include URL endpoints.

returns (SearchReadGroupsResponse);

// Gets a `ReadGroup` by ID.
//
// `GET /readgroups/{read_group_id}` will return a JSON version of
// `ReadGroup`.
rpc GetReadGroup(GetReadGroupRequest) returns (ReadGroup);

// Gets a `ReadGroupSet` by ID.
//
// `GET /readgroupsets/{read_group_set_id}` will return a JSON version of
Expand Down Expand Up @@ -60,10 +72,6 @@ message SearchReadGroupSetsRequest {
// Only return read group sets with this name (case-sensitive, exact match).
string name = 2;

// Specifying the id of a BioSample record will return only readgroups
// with the given bioSampleId.
string bio_sample_id = 3;

// Specifies the maximum number of results to return in a single page.
// If unspecified, a system default will be used.
int32 page_size = 4;
Expand Down Expand Up @@ -91,6 +99,52 @@ message GetReadGroupSetRequest {
string read_group_set_id = 1;
}

// ****************** /readgroups *********************/

// This request maps to the body of `POST /readgroups/search` as JSON.


message SearchReadGroupsRequest {
// The dataset to search (required).
string dataset_id = 1;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should add back in read_group_set_id (optional). I'm thinking of the case here where we need to find the ReadGroups that correspond to a particular ReadGroupSet so that the client can run a SearchReadsRequest. With this inferface, the client will have to get back all of the ReadGroups in a dataset (potentially hundreds of thousands), and then filter for the read group set they want.

I guess the alternative to this is to add a read_group_set option to SearchReadsRequest to cover this use-case. Since Google already does this (AFAIK), this might be the simpler option.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We need to hold off on this change until we work through (and
document) the use cases.

Obtaining all readgroups in a dataset (project) is something
that one would almost never want to do.

In my experience, the command analysis work flow is:

  1. Find BAMs (readgroupset) meeting meeting some criteria
    defined by the metadata. It many contexts, this is more
    refined than sample. For instance, a sample can have both
    RNA-Seq and genome sequencing.
  2. retrieve reads from the readgroup, regardless of readgroup set.

It would simplified the API if we didn't push dataset down to
the lower levels unless there is a compelling use case and stick
to the container hierarchy. The container hierarchy will serve
the `find all' type of queries (walking the hierarchy) and
criteria queries can build in this plus the metadata.

Again, we are up against dataset being an unclear concept.

Jerome Kelleher notifications@github.com writes:

In src/main/proto/ga4gh/read_service.proto:

@@ -91,6 +99,49 @@ message GetReadGroupSetRequest {
string read_group_set_id = 1;
}

+// ****************** /readgroups *********************/
+
+// This request maps to the body of POST /readgroups/search as JSON.
+
+
+message SearchReadGroupsRequest {

  • // The dataset to search (required).
  • string dataset_id = 1;

I think we should add back in read_group_set_id (optional). I'm thinking of the
case here where we need to find the ReadGroups that correspond to a particular
ReadGroupSet so that the client can run a SearchReadsRequest. With this
inferface, the client will have to get back all of the ReadGroups in a dataset
(potentially hundreds of thousands), and then filter for the read group set
they want.

I guess the alternative to this is to add a read_group_set option to
SearchReadsRequest to cover this use-case. Since Google already does this
(AFAIK), this might be the simpler option.


You are receiving this because you commented.
Reply to this email directly, view it on GitHub, or mute the thread.*

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would simplified the API if we didn't push dataset down to
the lower levels unless there is a compelling use case and stick
to the container hierarchy. The container hierarchy will serve
the `find all' type of queries (walking the hierarchy) and
criteria queries can build in this plus the metadata.

I completely agree, but I don't think we have a container hierarchy officially. Attempts to define the formal data model as a hierarchy of containers have failed because of an insistence on the importance of multiple occupancy (e.g., references belong to multiple reference sets, callsets belong to multiple variants sets, etc). Given that servers can make shallow copies of objects I have never understood the necessity of this. Until we actually understand what being "in" a dataset means, we're not going to solve these problems.


// The read_group_set to search (optional).
string read_group_set_id = 2;

// Only return read groups with this name (case-sensitive, exact match).
string name = 3;

// Specifying the id of a BioSample record will return only readgroups
// with the given bioSampleId.
string bio_sample_id = 4;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there background on why bio_sample_id is per-ReadGroup and not per-ReadGroupSet? I don't see any discussion of this on the original PR. I would have thought the latter would be more natural (and would side-step this PR).

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Within a single read group set each read group can have its own bio_sample_id. This is to model BAMs that may have different samples per RG tag.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

SampleId has always been in ReadGroup. The rationale for this
was not written down. Perhaps someone who was there can example.
@delagoya, @lh3, @richarddurbin ???

CH Albach notifications@github.com writes:

Is there background on why bio_sample_id is per-ReadGroup and not
per-ReadGroupSet? I don't see any discussion of this on the original PR. I
would have thought the latter would be more natural (and would side-step this
PR).

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well, that brings up some old memories :) So there are too many examples to pick from, and below are a few snippets with their respective links, which go into more detail:

@delagoya posted on #32 (comment):

  • A ReadGroup is defined on submission by the submitter and represents an atomic grouping of reads (could be library, could barcode, could be source sample). A Read can only belong to one ReadGroup.
  • ReadSet is an arbitrary collection of Reads that represents a set of reads to be analyzed at the same time.

@birney posted on #32 (comment):
ReadGroups - these are sets of reads which should be analysed together because the scientist who generated the data say that they are together.
Samples - A ReadGroup has one and only one Sample. However we can either allow Samples to have > 1 ReadGroup and/or allow Samples to be equivalent in some testing manner as for sure there will be >1 ReadGroup per Sample in a number of scenarios.
ReadGroupLists - these are sets of ReadGroups allowing functions to call across ReadGroups.

@lh3 posted on #24 (comment):

  • A read group is the smallest set of homogeneous reads. Although there is still ambiguity in this definition, ReadGroup is in practice more clearly defined and more stable than ReadSet (see add comment defining a GA4GHReadSet #16) and DataSet. It is a first-class concept in SAM widely used in data processing. Many downstream tools, such as GATK, samtools and Picard, focus on read groups and frequently ignore the concept of file (or ReadSet). As a matter of fact, all ReadGroups from 1000g are associated with SRA/ERA accession numbers. Most 1000g BAM files are not for now, because ReadSet is harder to define than ReadGroup.
  • Sample and experimental info are attached to ReadGroup in SAM, not to a physical file or a ReadSet in the schema. When we query by sample, by library or by sequencing technology, we are directly searching for ReadGroups, not for ReadSets. In addition, it is possible that a ReadSet may contain ReadGroups from different samples - SAM is the same. When we query by sample, we will have to break a ReadSet. ReadGroup is the minimal unit.

Hope it helps,
~p


// Specifies the maximum number of results to return in a single page.
// If unspecified, a system default will be used.
int32 page_size = 5;

// The continuation token, which is used to page through large result sets.
// To get the next page of results, set this parameter to the value of
// `next_page_token` from the previous response.
string page_token = 6;
}

// This is the response from `POST /readgroups/search` expressed as JSON.
message SearchReadGroupsResponse {
// The list of matching read groups.
repeated ReadGroupSet read_groups = 1;

// The continuation token, which is used to page through large result sets.
// Provide this value in a subsequent request to return the next page of
// results. This field will be empty if there aren't any additional results.
string next_page_token = 2;
}

// This request maps to the URL `GET /readgroups/{read_group_id}`.
message GetReadGroupRequest {
// The ID of the `ReadGroup` to be retrieved.
string read_group_id = 1;
}

// ****************** /reads *********************
// This request maps to the body of `POST /reads/search` as JSON.
//
Expand Down
18 changes: 10 additions & 8 deletions src/main/proto/ga4gh/reads.proto
Original file line number Diff line number Diff line change
Expand Up @@ -32,6 +32,9 @@ message ReadGroup {

// The ID of the dataset this read group belongs to.
string dataset_id = 2;

// The ID of the read group set this read group belongs to.
string read_group_set_id = 15;

// The read group name.
string name = 3;
Expand Down Expand Up @@ -64,6 +67,7 @@ message ReadGroup {
// Statistical data on reads in this read group.
ReadStats stats = 11;

// Program can be used to track the provenance of how read data was generated.
repeated Program programs = 12;

// The ID of the reference set to which the reads in this read group are
Expand All @@ -76,6 +80,8 @@ message ReadGroup {

// A ReadGroupSet is a logical collection of ReadGroups. Typically one
// ReadGroupSet represents all the reads from one experimental sample.
// All read groups require that all readgroups in the set are mapped to
// the same referenceSet.
message ReadGroupSet {
// The read group set ID.
string id = 1;
Expand All @@ -88,12 +94,9 @@ message ReadGroupSet {

// Statistical data on reads in this read group set.
ReadStats stats = 4;

// The read groups in this set.
repeated ReadGroup read_groups = 5;

// NB: we require that all readgroups in the set are mapped to the same
// referenceSet.

// IDs of the Read Groups in this ReadGroupSet
repeated string read_group_ids = 6;
}

// A linear alignment describes the alignment of a read to a Reference, using a
Expand All @@ -110,8 +113,7 @@ message LinearAlignment {
int32 mapping_quality = 2;

// Represents the local alignment of this sequence (alignment matches, indels,
// etc)
// versus the reference.
// etc) versus the reference.
repeated CigarUnit cigar = 3;
}

Expand Down