This repository has been archived by the owner on Oct 28, 2022. It is now read-only.

Rationalises the SearchObjectRequest semantics. #253

Merged
merged 1 commit into from
May 28, 2015

Conversation

jeromekelleher
Contributor

This PR changes the semantics of all SearchObjectRequests so they require the ID of a single container object, if applicable. Currently, a number of different conventions exist, making implementation and authorisation difficult. This solves issue #247 and is related to the issues in #248.

Most of the changes are straightforward, requiring only the change of a field like (for example) array<string> datasetIds = []; to string datasetId;. There are some problematic cases:

  1. SearchCallsRequest and SearchAlleleCallsRequest both had a large list of options for the containers that they could search over. As an initial strawman I've reduced this to just the parent CallSet to be consistent with all the other objects. Reasoning for these:
    1. Since a CallSet is part of a VariantSet, the variant set is implied by specifying callSetId.
    2. Finding Calls for a particular variantId is redundant, since Calls are embedded in Variant objects. If we wish to find the calls for a given variant, we can either find them directly using SearchVariantsRequest or make a call to GET /variants/<id>/
    3. We are then left with iterating over all calls in a CallSet, which seems like a peculiar thing to do without reference to the containing Variant object. However, this might be updated with more meaningful semantics later.
  2. SearchAlleles is problematic; what is the interplay between the variantSetId and the sequenceId? If sequenceId is required, then why do we need the variantSetId? Conceptually, it seems natural that an Allele belongs to a single VariantSet.

In terms of backwards compatibility, there are obviously some breaking changes here. These are minor, however, and easily worked around. Server-side, this is a considerable simplification, and so implementation can simply be a special case of the existing semantics. Client-side, users who rely on the multi-container search semantics will now need to iterate explicitly over the container IDs they are interested in.
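To make the client-side impact concrete, here is a minimal sketch of recovering multi-container search by iterating over container IDs. All function names and data shapes here are hypothetical stand-ins, not the real GA4GH client API:

```python
# Hypothetical sketch only: an in-memory stand-in for a GA4GH server, used to
# show how the old multi-container search can be recovered client-side.
FAKE_STORE = {
    "vs1": [{"id": "v1", "start": 10}, {"id": "v2", "start": 500}],
    "vs2": [{"id": "v3", "start": 20}],
}

def search_variants(variant_set_id, start, end):
    """Stand-in for one single-container SearchVariantsRequest round trip."""
    return [v for v in FAKE_STORE.get(variant_set_id, [])
            if start <= v["start"] < end]

def search_variants_multi(variant_set_ids, start, end):
    """Old multi-container semantics, recovered with one request per container."""
    for vs_id in variant_set_ids:
        yield from search_variants(vs_id, start, end)

print([v["id"] for v in search_variants_multi(["vs1", "vs2"], 0, 100)])  # ['v1', 'v3']
```

A client library could hide this loop entirely, so application code need not change.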

@jeromekelleher
Contributor Author

Any thoughts on this? The current situation with numerous different conventions for object search is a major impediment to implementation, so it would be good to resolve this soon.

@macieksmuga
Contributor

@adamnovak could you comment on Jerome's SearchAlleles question? I'd very much like to see the proposed simplifications as a server developer, but can't comment on the impact these will have on intended usage.

@lh3
Member

lh3 commented Mar 18, 2015

I am fine with the treatment of dataset, but I would recommend to delay the variant part until we are fully confident. For now, I am not. The ideal model in my head is very different from the current schema. I have not sent a pull request because we have some fundamental questions unanswered, such as "what is an allele?" "what is a Variant made of?". We have not reached a consensus yet.

More specifically, you said "Conceptually, it seems natural that an Allele belongs to a single VariantSet". This is also suggested by the current schema. However, I think the central role of the unary model is to break the boundaries between VariantSets. In that view, an Allele is global, not specific to any VariantSet. I have asked this question but have not got a definite answer from others.

On your point 1, my preferred data model is to embed the AlleleCall equivalent to another object and get rid of both SearchCallsRequest and SearchAlleleCallsRequest.

@pgrosu
Contributor

pgrosu commented Mar 18, 2015

+10 Heng! I definitely prefer the model you mentioned. We want to share as many unique Alleles as possible across multiple VariantSets, which will simplify searches and is also a more natural way to think. We had a similar discussion about reads way back in #24, which simplified our current implementation. I would definitely prefer to see a PR on this, and it would be great to hash this out via an in-depth discussion where more folks jump in.

/**
The ReadGroup to search.
*/
string readGroupId;
Contributor


This change seems to remove some significant functionality, illustrated by two use cases in particular:

  1. This regresses from BAM for the use case of reading all reads in coordinate sorted order for a ReadGroupSet, or a single coordinate sorted BAM file. My concerns would be satisfied if there were an additional option to query by readGroupSetId.
  2. Loss of the ability to do "multi-BAM slicing" on the server side. Of course there is a cost of implementation, but I do believe this cost was considered when defining the SearchReads semantics. I would be curious to know if opinions have changed on the subject.

In Google's implementation of v0.5.1, we support the readGroupIds field but we make the additional requirement that all provided read group IDs belong to the same read group set. We recommend querying by readGroupSetId instead (and we support searching by multiple IDs).

Contributor Author


Thanks @calbach, this makes sense. I'm perfectly happy to add a readGroupSetId field; this is logically consistent, and easily implementable. Would this resolve both (1) and (2)?

Member


Do you want to add string readGroupSetId or array<string> readGroupSetIds? We need the latter to achieve "multi-BAM slicing". I know it has a cost, as @calbach said. I would recommend keeping this part of the schema as it is and adopting the Google approach in implementation for now.

Contributor Author


I'd be very much in favour of string readGroupSetId, for all of the reasons given above. However, I don't really understand what "multi-BAM slicing" is: is it an important application?

Contributor


Adding readGroupSetId addresses (1) only. Reverting this change or adding array<string> readGroupSetIds addresses both (1) and (2).

By "multi-BAM slicing", I meant retrieval of reads across multiple ReadGroupSets at a specific genomic range, which is a relatively inefficient operation on traditional files. We've had this feature requested in the past, but others can probably provide better commentary on the use cases and how applicable it is today.
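As an illustration of what client-side "multi-BAM slicing" might look like under single-container semantics, here is a hedged sketch: the `fetch_reads` helper and read shapes are invented for the example, and a real client would have to page each coordinate-sorted stream independently.

```python
import heapq

# Invented data and helper, purely for illustration; not the real API.
FAKE_READS = {
    "rgs1": [{"pos": 5, "rgs": "rgs1"}, {"pos": 40, "rgs": "rgs1"}],
    "rgs2": [{"pos": 12, "rgs": "rgs2"}],
}

def fetch_reads(read_group_set_id, start, end):
    """Stand-in for a coordinate-sorted read stream from one ReadGroupSet."""
    return (r for r in FAKE_READS.get(read_group_set_id, [])
            if start <= r["pos"] < end)

def sliced_reads(read_group_set_ids, start, end):
    """k-way merge of per-ReadGroupSet streams back into coordinate order."""
    streams = [fetch_reads(i, start, end) for i in read_group_set_ids]
    return heapq.merge(*streams, key=lambda r: r["pos"])

print([r["pos"] for r in sliced_reads(["rgs1", "rgs2"], 0, 100)])  # [5, 12, 40]
```

The implementation cost mentioned above is real either way: doing this client-side means holding k concurrent paged cursors, which is why a server-side merge was requested.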

Member


If a ReadGroupSet contains multiple samples, we will sometimes want to retrieve only a subset of those samples. ReadGroupSet is not the right unit; ReadGroup is.

Contributor


We have to feel free to break compatibility when necessary.

The API should be versioned, but 0.5 is way too immature to be anything set in stone. We have a weak conformance suite, and virtually nothing real (at least that we have seen) has been built with it. If we don't validate by building real applications while the API is being designed, we will end up with a mess.

The world will have to live with this for probably 20 years. We can't do a design-by-committee, throw-it-over-the-wall specification and expect it to be right. That is an absolute recipe for failure. We have to build real applications and iterate.

This is why the reference service is so important to GA4GH's success.

Mark

Jerome Kelleher notifications@github.com writes:

Thanks for all the comments, I think this is a very useful conversation. Some responses:

@dglazer I appreciate your point about this not being a green-field site, I really do. Breaking compatibility is a big deal, and not something we should take lightly. On the other hand, this is a pre-1.0 API, and we must have the freedom to make some changes if we are to develop it. The changes here are simplifications, and as you say, if we don't make them now it'll be very difficult to make them later. Users can be shielded from these changes by the client-side libraries quite easily, so I don't think any serious existing applications need be affected at all. For example, api-client-java can detect which version of the API a server is running, and send sequential requests when existing code uses the multi-container search semantics.

@lh3 This is a good example. If we need to merge reads from 1000 ReadGroupSets, and the number of reads per ReadGroupSet is often zero or much less than the page size, then this would be a compelling performance argument in favour of doing the merge on the server side. I don't want to be dogmatic about this, and if this is a real requirement that can't be worked around then I'm happy to keep searches across ReadGroupSets. However, why would we want to search across 1000 ReadGroupSets? If we have reads for 1000 related samples, why not keep them all in a single ReadGroupSet? I'm happy to add the ability to search over a single ReadGroupSet by adding a readGroupSetId field as @calbach suggested at the start of this thread. Would this satisfy the most common requirements? As @richarddurbin pointed out, we can't solve this problem on the server side in general in any case, since we might need to merge reads from different servers.

@richarddurbin This is a very important question, I think. What is the real API: the low-level transport layer of HTTP and JSON, or the API that developers actually interact with in their language of choice when writing applications? As a client-side developer (in, say, C, Java, Python, R or JavaScript), I don't want to know about HTTP, JSON, or paging any more than I want to know what my TCP packet size is. What I want is a well-defined client-side API in my language, with documentation written specifically for that language. A client-side API will look very different in C and R, and they must have native documentation if developers are going to write applications. Unless we provide excellent client libraries that make writing applications substantially easier than using existing tools, there will be no significant reason for developers to adopt the GA4GH API.


Member


  • I think most of the dozen independent changes in this PR are non-controversial, since they don't break any published code, and they do simplify things. @jeromekelleher, if you want to divide and conquer, I think you could get consensus on over half of the PR quickly, while we continue to discuss the remainder. @calbach -- other than this readGroupIds[] proposal, do you know which other edits are potentially breaking changes?
  • @diekhans -- I agree the API shouldn't be set in stone; I also think that we shouldn't casually break compatibility -- that's why this is a good discussion. And I strongly agree that we need to build real applications as part of the design process -- I'm delighted to see the work on the new-and-improved reference server, in addition to the earlier work by EBI, NCBI, https://github.com/googlegenomics/api-provider-local-java, and Google's product.
  • we discussed ReadGroup and ReadGroupSet at great length the first time around. There was passionate support (including by @lh3) to allow retrieving from a single ReadGroup, and there was passionate support (including by me) to allow retrieving all reads from a single ReadGroupSet -- that's what led to the current method accepting an array of readGroupIds[]. There's more than one way to satisfy those two requirements, but I believe that any proposal needs to continue to do so.

Contributor


Hi David (@dglazer),

Thank you for the thoughtful and insightful comments, but with the highest esteem and most respect, why are we creating limits based on a design decision at Google? I read most of the Google papers - and many others - and I know what is possible both from them and the different subject areas in computer science.

I posted the following for a reason:

#131

For instance from the Mesa paper - which one can read from the following link - there is a clear line that says:

We show how schema changes for a large number of tables can be performed dynamically and efficiently without affecting correctness or performance of existing applications.

Data and schemas are two separate things. I'm not saying let's use Mesa - which would be nice - but let's implement concepts that utilize some of that functionality in a public setting. If you're up for it, let's re-engineer Google's backend for the GA4GH implementation to make it flexible enough for future implementations. It is just a service, so the code does not have to be public (it currently is not), which satisfies that requirement as well.

I have worked at too many places to see unnecessary limits set because of an earlier design decision. We have an opportunity here to do things right. Let's put our minds together and make it right.

Let me know what you think.

Thanks,
Paul

Contributor Author


@dglazer --- fair points! I've updated the PR, and will comment more below, as this discussion thread will be hidden by default by GitHub.

@richarddurbin
Contributor

I still need to catch up with the exchanges on the definition of Allele etc., but it also makes sense to me that Alleles are global top-level objects, not specific to a VariantSet. I guess they are in the context of a particular ReferenceSet, though in fact they are defined on sequences, and so may be identifiable across reference sets if the same underlying reference sequences are present.

Richard


The Wellcome Trust Sanger Institute is operated by Genome Research
Limited, a charity registered in England with number 1021457 and a
company registered in England with number 2742969, whose registered
office is 215 Euston Road, London, NW1 2BE.

@nlwashington
Contributor

+10 again for "Allele is global, not specific to any VariantSet." I have been wondering why this design decision was made in the first place because it seems topsy-turvy. For the initial G2P schema, I have chosen to use the proposed Feature in the sequence-annotation branch instead of Variant for this specific reason.

@lh3
Member

lh3 commented Mar 19, 2015

David also said allele is global. In principle, an allele could be species specific, but due to the way we find/name alleles, a GA4GH allele is specific to a ReferenceSet. Different ReferenceSets get different alleles.

@jeromekelleher
Contributor Author

Thanks for all the feedback. The consensus seems to be, "this change is OK, but there are some details to be worked out". Is this fair?

I'm perfectly happy for an Allele to be global, but we will need to think through the implications for search and authentication. Will we want to search for all the Alleles in a given VariantSet or ReferenceSet? (That is, an Allele can exist in multiple VariantSets and ReferenceSets.) This is fine by me as long as we limit the semantics to exactly one of the two.

In terms of authentication, how do we decide whether a user has permissions to see a given allele? I'd imagine the model of allowing users to search over a given VariantSet or ReferenceSet or not will give us the restrictions we require.
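One possible reading of this container-based authorisation model, sketched with entirely hypothetical names and an in-memory grant table (this is not a proposal for the schema, just an illustration of the restriction described above):

```python
# Entirely hypothetical tables; one way the container-based permission model
# could restrict Allele visibility.
USER_GRANTS = {"alice": {"vs1"}}                   # user -> searchable containers
ALLELE_CONTAINERS = {"allele42": {"vs1", "rs1"}}   # allele -> containing sets

def can_see_allele(user, allele_id):
    """Grant access iff the user may search at least one containing set."""
    granted = USER_GRANTS.get(user, set())
    return bool(granted & ALLELE_CONTAINERS.get(allele_id, set()))

print(can_see_allele("alice", "allele42"))  # True
print(can_see_allele("bob", "allele42"))    # False
```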

@lh3 - I'm happy to delete both SearchCall and SearchAlleleCall: I don't think they are particularly useful.

@pgrosu
Contributor

pgrosu commented Mar 19, 2015

Jerome, I think authentication was deemed to be resource-specific and not part of this API based on this latest post:

#148 (comment)

@pgrosu
Contributor

pgrosu commented Mar 26, 2015

I think the client/server and the reads/Multi-BAM slicing discussion will prove fruitful for future aspects of the API, so I put a link to it here in case folks want to refer to it later:

#253 (diff)

Paul

@jeromekelleher
Contributor Author

Thanks for all of the comments on this PR so far, it has been a very interesting discussion. Multi-BAM merge is important, so I accept that it should stay.

I've updated the PR to reintroduce the readGroupIds field in the SearchReadsRequest object, so that it will support merging reads across many ReadGroups. I have changed the semantics slightly, in that we no longer accept the empty list (interpreted as search all readGroups) as an argument. I think this is very important.
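The tightened semantics might be enforced with a trivial server-side check along these lines (the request is modelled as a plain dict purely for illustration, not the real Avro-generated class):

```python
# Illustrative only: reject the old "empty list means search all readGroups"
# convention, per the updated SearchReadsRequest semantics.
def validate_search_reads_request(request):
    """Require at least one readGroupId; an empty list is now an error."""
    read_group_ids = request.get("readGroupIds")
    if not read_group_ids:
        raise ValueError("readGroupIds must contain at least one ID")
    return read_group_ids

validate_search_reads_request({"readGroupIds": ["rg1"]})   # accepted
try:
    validate_search_reads_request({"readGroupIds": []})    # now rejected
except ValueError as err:
    print(err)
```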

Any comments on this or any of the other changes would be much appreciated.

@pgrosu
Contributor

pgrosu commented Mar 29, 2015

Hi Richard (@richarddurbin),

Regarding a reference server - which would be a big win - we should start simple, the way search engines are built, via an inverted index. Below are the general steps:

  1. The genome is split into equally sized regions.
  2. Each region contains a pointer to a list of sample IDs. A hash function (e.g. md5sum, sha256sum, etc.) is applied to the tuple (region, sample_id) for optimized lookup of the reads for that sample in the specified region.
  3. The hash is then accessed in O(1) for the collection of reads it contains for that region.

This would also parallelize efficiently across multiple machines. The storage can also be optimized via the Log-Structured Merge (LSM) Tree data structure with Differentiated Indexing (Diff-Index), for which I've included a reference below.
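The steps above can be sketched as a toy inverted index. Region size, hash choice, and data shapes here are all arbitrary illustrations, not a proposal for the actual schema:

```python
import hashlib
from collections import defaultdict

REGION_SIZE = 1000  # arbitrary region width for the illustration

def region_key(position):
    """Step 1: map a genomic position to its fixed-size region."""
    return position // REGION_SIZE

def index_key(region, sample_id):
    """Step 2: hash the (region, sample_id) tuple for O(1) lookup."""
    return hashlib.md5(f"{region}:{sample_id}".encode()).hexdigest()

def build_index(reads):
    """Build the inverted index from (sample_id, position, payload) tuples."""
    index = defaultdict(list)
    for sample_id, position, payload in reads:
        index[index_key(region_key(position), sample_id)].append(payload)
    return index

idx = build_index([("s1", 10, "readA"), ("s1", 1500, "readB"), ("s2", 20, "readC")])
print(idx[index_key(0, "s1")])  # ['readA']  (step 3: O(1) retrieval)
```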

References

A nice graphical description of an inverted index is shown at the following link:

https://developer.apple.com/library/mac/documentation/UserExperience/Conceptual/SearchKitConcepts/searchKit_basics/searchKit_basics.html

Regarding a deeper description on inverted indices, below is a nice paper from Google, that contains a description of it:

http://static.googleusercontent.com/media/research.google.com/en/us/pubs/archive/37367.pdf

Below is another nice paper from Google/IBM/GeorgiaTech regarding the Log-Structured-Merge (LSM) Tree data structure and with Differentiated Indexing (Diff-Index):

http://researcher.watson.ibm.com/researcher/files/us-wtan/DiffIndex-EDBT14-CR.pdf

Let me know what you think.

Thank you,
Paul

@dglazer
Member

dglazer commented Mar 29, 2015

@jeromekelleher, that multi-BAM change addresses the biggest concern I saw raised -- thanks.

There are 12 separate changes in this PR, and I was finding the diffs hard to read, so I made the summary below. Assuming I got it right, I:

  • am +0 on the 6 "not yet published" changes (they look reasonable, but I haven't been part of those design discussions)
  • am +1 on the 2 "published but not implemented" changes
  • would like to be sure there aren't any bad consequences to the 4 "already implemented" changes.

A few questions:

  • @jeromekelleher -- does it look like I correctly caught all twelve changes?
  • @calbach -- does it look like I binned them correctly into "not yet published", "published but not implemented", and "already implemented"? (I think I got them right, but please double-check.)
  • @calbach -- of the 4 potentially breaking changes, do you have any thoughts on which if any could cause real problems in existing code?

6 not yet published APIs

Backwards compatibility isn't an issue here.

src/main/resources/avro/metadatamethods.avdl

record SearchIndividualsRequest {
-  array<string> groupIds = [];
+  string groupId;

record SearchSamplesRequest {
-  array<string> individualIds = [];
+  string individualId;

src/main/resources/avro/variantmethods.avdl

record SearchVariantSetSequencesRequest {
-  array<string> parentIds = [];
+  string parentId;

record SearchAllelesRequest {
-  array<string> variantSetIds = [];
+  string variantSetId;

record SearchCallsRequest { 
-  array<string> callSetIds = [];
-  array<string> variantSetIds = [];
-  array<string> variantIds = [];
+  string callSetId;

record SearchAlleleCallsRequest { 
-  array<string> callSetIds = [];
-  array<string> variantSetIds = [];
-  array<string> alleleIds = [];
-  array<string> variantIds = [];
+  string callSetId;

2 published APIs, with unimplemented behavior

As far as I know these changes don't break any existing implementations, so again backwards compatibility isn't an issue.
src/main/resources/avro/variantmethods.avdl

record SearchVariantSetsRequest {
-  array<string> datasetIds = [];
+  string datasetId;

record SearchVariantsRequest {
-  array<string> variantSetIds = [];
+  string variantSetId;

4 published APIs, with implemented behavior

These changes are small, but might cause a backwards compatibility issue.
src/main/resources/avro/variantmethods.avdl

record SearchCallSetsRequest {
-  array<string> variantSetIds = [];
+  string variantSetId;

src/main/resources/avro/readmethods.avdl

record SearchReadsRequest {
-  array<string> readGroupIds = [];
+  array<string> readGroupIds;

record SearchReadGroupSetsRequest {
-  array<string> datasetIds = [];
+  string datasetId;

src/main/resources/avro/referencemethods.avdl

record SearchReferencesRequest {
-  union { null, string } referenceSetId = null;
+  string referenceSetId;

@jeromekelleher jeromekelleher force-pushed the rationalise-search-requests branch from 7d22ea1 to a665878 Compare March 30, 2015 16:07
@jeromekelleher
Contributor Author

Thanks for the summary @dglazer, this is very helpful. Your summary looks accurate to me. Since the original PR, another set of APIs has been merged in #246, so I've updated the methods here to follow the single-parent-container-unless-absolutely-necessary paradigm. These go into the not yet published APIs bucket:

src/main/resources/avro/sequenceAnnotationmethods.avdl

record SearchFeaturesRequest {
-    array<string> featureSetIds = [];
+    string featureSetId;

-    array<string> parentIds = [];
+    string parentId;

(Apologies for the whitespace noise in the diff --- I can get rid of this if you prefer).

@pgrosu
Contributor

pgrosu commented Mar 30, 2015

@jeromekelleher, @dglazer - One thing to keep in mind: it is not uncommon to merge variants or to search across many variant sets at the same time, so removing array<string> variantSetIds = []; would most likely be very temporary.

Just out of curiosity: if it is removed, does that mean that before querying across multiple VariantSets I would have to merge some or all of them into one gigantic VariantSet? There would be so many such merges that it would defeat the purpose of removing array<string> variantSetIds = [];. My recommendation is to always design the system to match the most natural and flexible means of asking the biological questions when such a search is possible, which in this case it is. This is the foundation of MapReduce and relational algebra.

Paul

@calbach
Contributor

calbach commented Mar 31, 2015

@dglazer yes, that summary looks correct to me.

Aside from the referenceSetId change (I'll comment inline), I'm +1 on the other breaking changes and unimplemented changes. I think they don't remove any important functionality, so the tradeoff for implementation simplicity is worth it.

I'm neutral on the unpublished methods.

*/
union { null, string } referenceSetId = null;
Contributor


Seems this breaks a bit from the many-to-many relationship between references and reference sets that has been discussed previously. Have the graph changes altered that discussion?

I'm thinking of the CRAM use-case, for example. Given an MD5, find the relevant reference.

Contributor


@calbach, see #247 regarding where this is stemming from. I agree that many-to-many is still important, though it probably requires more clarity, with examples, in the structure of the definitions. Just for reference, the original many-to-many design was discussed initially in the following issue:

#125 (comment)

You can still search with MD5 via this in the same method:

  /**
  If nonempty, return references which match any of the given `md5checksums`.
  See `Reference::md5checksum` for details.
  */
  array<string> md5checksums = [];

~p

Contributor Author


Ah, OK, I didn't realise there was a many-to-many relationship here. I'm happy to revert this if you think it's important.

On a side note, it would be very useful to have an overview of the various objects and how they relate to each other. It's not at all obvious from the current documentation...

Contributor


Pre-graph changes: it wasn't required but it wasn't disallowed. If a repository serves reference sets hg19 and GRCh37, it would be up to the implementation whether chr4 (MD5 is identical in both assemblies) is represented as one or two references. The former is easier to implement and might be required in some cases, e.g. if ACLs are involved, but the latter may be conceptually cleaner. But yes, it should probably be made explicit in the docs that this is allowed.

Things have also changed with the graph PR. Now you can nest reference sets within a reference set (I missed that when I was reviewing the changes). I guess that makes the semantics of this field slightly more complicated and I'm not sure how it plays with many : many.

If reference sets are still many:many with references after the graph changes, then I'd definitely prefer this line be reverted.

Contributor Author


OK, fair enough. @adamnovak or @richarddurbin could you clarify the status of the many-to-many relationships between references and reference sets in the graph world please?

(My opinion here would be to try and keep things simple and avoid many-to-many relationships if we can. Implementations are free to reuse the underlying objects representing the same references within their code, but can present them to the outside world as-if they were different objects.)

Contributor


We need to allow the same reference to be in many reference sets. For example, the same human mitochondrial sequence will be in many different versions of the human genome reference set. Even the same GRCh38 primary chromosome 1 will be in different reference sets, with and without EBV, alts, IMGD MHC versions etc. This is essential for practical use. This has nothing to do with the graph world; it is core to the way that people practically use genome reference sequences.

Slightly more radically, I think in the graph world we need the ability to create extended reference sets that include one previous reference set and add variants to it. This amounts to a single-inheritance tree. In the side graph level hierarchy I would require the included reference to be first, so that other stuff adds to it. I believe this cleanly captures how we want to build and extend graphs.

Richard


Contributor Author


Thanks @richarddurbin, that's helpful.

@calbach --- I've reverted the change in question.

@diekhans
Contributor

diekhans commented Apr 1, 2015

+1 [with great enthusiasm]

It's very important to make the GA4GH APIs more consistent and rational. The current state of the queries is the result of different committees rather than a single design effort. That process did a good job of defining requirements, but not of defining an overall consistent API.

This PR is an excellent step in the right direction. It can and will evolve once we get more experience with using it. However, we need to make incremental changes like this to move forward.

@jeromekelleher jeromekelleher force-pushed the rationalise-search-requests branch from a665878 to cdcd615 Compare April 3, 2015 12:16
@jeromekelleher
Contributor Author

By my count, we now have three +1s and all of the contentious issues have been resolved. Unless there are any objections, I think this is mergable.

@adamnovak
Contributor

I'm a bit concerned about losing the ability to search Calls and AlleleCalls by VariantSet. Isn't a CallSet allowed to occur in multiple VariantSets? Or did we drop that?

/**
If nonempty, return only `AlleleCall`s associated with one of these `Allele`s.
*/
array<string> alleleIds = [];
Contributor


How would you go about looking up how a particular CallSet calls a particular Allele without this field? Do you have to look at all the AlleleCalls?

@benedictpaten
Contributor

Once Adam's query is resolved, I'm +1.

@delagoya
Contributor

Major Enhancement

The commit touches readMethods.avdl, which was a major artifact from the Reads Task Team, which is no longer holding regular meetings. As such, the major enhancement review process will be applied to this pull request.

-1

To clear the vote, the following people need to respond with +1 votes after this message for explicit approval of the changes:

@jeromekelleher (done)
@fnothaft (done)
@dglazer (done)
@vadimzalunin (done)
@saupchurch (done)
@lh3

(P.S. if any of the people above do not want to be part of future explicit approval list, let me know).

@fnothaft
Contributor

I am +1.

@saupchurch
Contributor

+1 for me.

I agree with @adamnovak concerning early optimization. Single search (thus multiple requests) is fine for now, with the caveat that usage metrics, once we have them, may require us to revisit this in future.

@benedictpaten
Contributor

Hi Angel, appreciate you watching out for reads people, but why the need to get all the people from the retired reads group to +1 the pull request? Surely just a couple of +1s from ex-reads members would be sufficient? It would have been when the reads group still existed!


@delagoya
Contributor

Precisely because the changes affect the implementations made by a group that no longer meets regularly.

I'm really just acting as a box checker, so if there is a process change that needs to be made please put it on the agenda for the DWG meetings or for Leiden.

@benedictpaten
Contributor

Fair enough. I guess I'll make a pull request against the contributions doc!


@dglazer
Member

dglazer commented May 14, 2015

Thanks @delagoya -- I don't see any concerns; I'll take a closer look in the morning and vote officially.
(I wish GitHub made it easy to see all changes made between my earlier +1 at the end of March and the latest version.)

@pgrosu
Contributor

pgrosu commented May 14, 2015

@dglazer I think you are in luck, and it's a piece of cake :) Just follow the link below and adjust the link parameters accordingly. In this case I am comparing master from 63 days ago against @jeromekelleher's present pull request :)

https://github.com/ga4gh/schemas/compare/master@%7B63days%7D...jeromekelleher:rationalise-search-requests

To view the changes, go to the split view by clicking first on the File Changes tab and then selecting the Split button, as highlighted by the red arrow in the screenshot below:

[Screenshot: compare view with the Split button highlighted]

Let me know if this is what you were looking for, or if it was something different.

Thanks,
Paul

@dglazer
Member

dglazer commented May 14, 2015

+1 on the changes to already-published methods (which afaict are the same as the last time I looked closely in March), and +0 on the changes to not-yet-published methods. Thanks @jeromekelleher for pushing this forward.

Small process suggestion for next time -- break things up into smaller pieces. For example, I think the big changes to brand-new methods, like {search,get}VariantSetSequences and search[Allele]Calls, could easily have been split out, as could the changes to comment formatting. I expect that would make the discussion and approval process much quicker and easier.

P.S. In case it's useful to others, here are the 5 substantive edits to already-published methods that I found:
src/main/resources/avro/variantmethods.avdl (not yet implemented)

record SearchVariantSetsRequest {
-  array<string> datasetIds = [];
+  string datasetId;

record SearchVariantsRequest {
-  array<string> variantSetIds = [];
+  string variantSetId;

src/main/resources/avro/variantmethods.avdl

record SearchCallSetsRequest {
-  array<string> variantSetIds = [];
+  string variantSetId;

src/main/resources/avro/readmethods.avdl

record SearchReadsRequest {
-  array<string> readGroupIds = [];
+  array<string> readGroupIds;

record SearchReadGroupSetsRequest {
-  array<string> datasetIds = [];
+  string datasetId;
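As a rough illustration of the client-side impact these diffs imply, here is a minimal sketch under stated assumptions: `search_variant_sets` is a hypothetical stand-in for the new single-container `SearchVariantSetsRequest`, and a dict stands in for the server. It shows how a client that previously passed `datasetIds = [...]` recovers the old multi-container behaviour by explicit iteration.

```python
from typing import Dict, Iterable, Iterator, List

def search_variant_sets(dataset_id: str,
                        server: Dict[str, List[str]]) -> List[str]:
    """Stand-in for the new single-container SearchVariantSetsRequest."""
    return server.get(dataset_id, [])

def search_many(dataset_ids: Iterable[str],
                server: Dict[str, List[str]]) -> Iterator[str]:
    # Old multi-container semantics, recovered by one request per container.
    for dataset_id in dataset_ids:
        yield from search_variant_sets(dataset_id, server)

server = {"ds1": ["vs1", "vs2"], "ds2": ["vs3"]}
print(list(search_many(["ds1", "ds2"], server)))  # ['vs1', 'vs2', 'vs3']
```

This is the workaround the PR description anticipates: the server side becomes a special case of the old semantics, while clients that genuinely need multi-container search issue one request per ID.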

@jeromekelleher
Contributor Author

Thanks @dglazer, I appreciate the feedback. I agree about making PRs smaller when possible. Apologies for making it difficult to see the additional changes: this is the drawback of squashing commits before a PR is fully ready for merging.

@delagoya
Contributor

@vadimzalunin and @lh3 there were some changes since you last approved. Since there is an active working group and two out of three major implementations weighing in as +1, I'll give another day to respond before I remove the -1 process vote. Thanks!

@jeromekelleher
Contributor Author

bump... I think we need @lh3 and @vadimzalunin to weigh in here?

@delagoya
Contributor

@jeromekelleher let me reach out via email. If they have not responded by tomorrow, we should take a look at the process to see if there is a time limit and best effort for further review.

@vadimzalunin

+1

@jeromekelleher
Contributor Author

@delagoya, I think we're still waiting on @lh3's +1 here?

@delagoya
Contributor

Ample time and multiple contact attempts with no response, so I'm going to call it and merge.

delagoya added a commit that referenced this pull request May 28, 2015
Rationalises the SearchObjectRequest semantics.
@delagoya delagoya merged commit 3daaf79 into ga4gh:master May 28, 2015
@jeromekelleher jeromekelleher deleted the rationalise-search-requests branch May 28, 2015 09:41
@jeromekelleher
Contributor Author

Thanks @delagoya!
