Rationalises the SearchObjectRequest semantics. #253
Conversation
Any thoughts on this? The current situation with numerous different conventions for object search is a major impediment to implementation, so it would be good to resolve this soon.
@adamnovak could you comment on Jerome's
I am fine with the treatment of dataset, but I would recommend delaying the variant part until we are fully confident. For now, I am not. The ideal model in my head is very different from the current schema. I have not sent a pull request because we have some fundamental questions unanswered, such as "what is an allele?" and "what is a Variant made of?" We have not reached a consensus yet. More specifically, you said "Conceptually, it seems natural that an Allele belongs to a single VariantSet". This is also suggested by the current schema. However, I think the central role of the unary model is to break the boundaries between VariantSets. In that view, an Allele is global, not specific to any VariantSet. I have asked this question but have not got a definite answer from others. On your point 1, my preferred data model is to embed the
+10 Heng! I definitely prefer the model you mentioned. We want to share as many unique Alleles as possible across multiple VariantSets, which will simplify searches and is also a more natural way to think. We had a similar discussion about reads way back in #24, which simplified our current implementation. I would definitely prefer to see a PR on this, and it would be great to hash this out via an in-depth discussion where more folks jump in.
/**
The ReadGroup to search.
*/
string readGroupId;
This change seems to remove some significant functionality, illustrated by two use cases in particular:
1. This regresses from BAM for the use case of reading all reads in coordinate-sorted order for a ReadGroupSet, or a single coordinate-sorted BAM file. My concerns would be satisfied if there were an additional option to query by readGroupSetId.
2. Loss of the ability to do "multi-BAM slicing" on the server side. Of course there is a cost of implementation, but I do believe this cost was considered when defining the SearchReads semantics. I would be curious to know if opinions have changed on the subject.
In Google's implementation of v0.5.1, we support the readGroupIds field but we make the additional requirement that all provided read group IDs belong to the same read group set. We recommend querying by readGroupSetId instead (and we support searching by multiple IDs).
Thanks @calbach, this makes sense. I'm perfectly happy to add a readGroupSetId field; this is logically consistent, and easily implementable. Would this resolve both (1) and (2)?
Do you want to add string readGroupSetId or array<string> readGroupSetIds? We need the latter to achieve "multi-BAM slicing". I know it has a cost, as @calbach said. I would recommend keeping this part of the schema as it is and adopting the Google approach in implementation for now.
I'd be very much in favour of string readGroupSetId, for all of the reasons given above. However, I don't really understand what "multi-BAM slicing" is: is it an important application?
Adding readGroupSetId addresses (1) only. Reverting this change or adding array<string> readGroupSetIds addresses both (1) and (2).
By "multi-BAM slicing", I meant retrieval of reads across multiple ReadGroupSets at a specific genomic range, which is a relatively inefficient operation on traditional files. We've had this feature requested in the past, but others can probably provide better commentary on the use cases and how applicable it is today.
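If server-side multi-ReadGroupSet search were removed, a client could still reproduce "multi-BAM slicing" by issuing one single-container request per ReadGroupSet and merging the coordinate-sorted streams. The sketch below is a hypothetical illustration, not GA4GH client code: search_reads_single, the store dict, and the read records stand in for paged API calls.

```python
import heapq

def search_reads_single(read_group_set_id, reference_name, start, end, store):
    """Hypothetical single-ReadGroupSet search: yields reads in position order.
    `store` stands in for a server; a real client would issue paged requests."""
    for read in sorted(store.get(read_group_set_id, []),
                       key=lambda r: r["position"]):
        if read["reference"] == reference_name and start <= read["position"] < end:
            yield read

def search_reads_multi(read_group_set_ids, reference_name, start, end, store):
    """Client-side 'multi-BAM slicing': merge several coordinate-sorted
    per-set streams into one coordinate-sorted stream."""
    streams = [search_reads_single(rgs, reference_name, start, end, store)
               for rgs in read_group_set_ids]
    return heapq.merge(*streams, key=lambda r: r["position"])

# Toy data standing in for two ReadGroupSets.
store = {
    "rgs1": [{"reference": "chr1", "position": 10, "name": "a"},
             {"reference": "chr1", "position": 30, "name": "c"}],
    "rgs2": [{"reference": "chr1", "position": 20, "name": "b"}],
}
merged = list(search_reads_multi(["rgs1", "rgs2"], "chr1", 0, 100, store))
print([r["name"] for r in merged])  # coordinate order across both sets
```

The cost @calbach mentions is visible here: the client pays one request stream per ReadGroupSet, which is why pushing the merge server-side matters when sets are numerous and sparse.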
If a ReadGroupSet contains multiple samples, we will sometimes want to retrieve a subset of samples from a ReadGroupSet. It is not the right unit. ReadGroup is.
We have to feel free to break compatibility when necessary. The API should be versioned, but 0.5 is way too immature to be anything set in stone. We have a weak conformance suite and virtually nothing real (at least that we have seen) has been built with it. If we don't validate by building real applications while the API is being designed, we will end up with a mess.

The world will have to live with this for probably 20 years. We can't do a design-by-committee, throw-it-over-the-wall specification and expect it to be right. This is an absolute recipe for failure. We have to build real applications and iterate.

This is why the reference service is so important to GA4GH's success.

Mark
Jerome Kelleher notifications@github.com writes:
Thanks for all the comments, I think this is a very useful conversation. Some responses:

- @dglazer I appreciate your point about this not being a green-field site, I really do. Breaking compatibility is a big deal, and not something we should take lightly. On the other hand, though, this is a pre-1.0 API, and we must have the freedom to make some changes if we are to develop it. The changes here are simplifications, and as you say, if we don't make these now it'll be very difficult to make them later. Users can be shielded from these changes by the client-side libraries quite easily, so I don't think any serious existing applications need be affected at all. For example, api-client-java can detect which version of the API a server is running, and send sequential requests when existing code uses the multi-container search semantics.
- @lh3 This is a good example. If we have the need to merge reads from 1000 ReadGroupSets, and the number of reads per ReadGroupSet is often zero or much less than the page size, then this would be a compelling performance argument in favour of doing the merge on the server side. I don't want to be dogmatic about this, and if this is a real requirement that can't be worked around then I'm happy to keep searches across ReadGroupSets. However, why would we want to search across 1000 ReadGroupSets? If we have reads for 1000 related samples, why not keep them all in a single ReadGroupSet? I'm happy to add the ability to search over a single ReadGroupSet by adding a readGroupSetId field as @calbach suggested at the start of this thread. Would this satisfy the most common requirements? As @richarddurbin pointed out, we can't solve this problem on the server side in general in any case, since we might need to merge reads from different servers.
- @richarddurbin This is a very important question, I think. What is the real API: the low-level transport layer of HTTP and JSON, or the API that developers actually interact with in their language of choice when writing applications? As a client-side developer (in, say, C, Java, Python, R or JavaScript), I don't want to know about HTTP, JSON, or paging any more than I want to know what my TCP packet size is. What I want is a well-defined client-side API in my language with documentation written specifically for that language. A client-side API will look very different in C and R, and they must have native documentation if developers are going to write applications. Unless we provide excellent client libraries that make writing applications substantially easier than using existing tools, there will be no significant reason for developers to adopt the GA4GH API.
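The "client libraries should hide paging" point can be made concrete with a small sketch. This is a hypothetical illustration of the pattern, not GA4GH library code: fake_server and the field names variants/nextPageToken stand in for a real paged search endpoint.

```python
def search_variants(request_fn, page_size=2):
    """Generator that hides pagination from the caller: it follows
    `nextPageToken` until the server reports no more pages.
    `request_fn` is a stand-in for an HTTP POST to a search endpoint."""
    token = None
    while True:
        resp = request_fn(page_size=page_size, page_token=token)
        for variant in resp["variants"]:
            yield variant
        token = resp.get("nextPageToken")
        if not token:
            return

def fake_server(page_size, page_token):
    """In-memory stand-in for a paged server: five variants, sliced into pages."""
    data = ["v%d" % i for i in range(5)]
    start = int(page_token or 0)
    nxt = start + page_size
    return {"variants": data[start:nxt],
            "nextPageToken": str(nxt) if nxt < len(data) else None}

results = list(search_variants(fake_server))
print(results)  # all five variants, with paging entirely hidden
```

Application code simply iterates over search_variants(...); the transport details (page size, tokens) never surface, which is the separation Jerome is arguing for.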
- I think most of the dozen independent changes in this PR are non-controversial, since they don't break any published code, and they do simplify things. @jeromekelleher, if you want to divide and conquer, I think you could get consensus on over half of the PR quickly, while we continue to discuss the remainder. @calbach -- other than this readGroupIds[] proposal, do you know which other edits are potentially breaking changes?
- @diekhans -- I agree the API shouldn't be set in stone; I also think that we shouldn't casually break compatibility -- that's why this is a good discussion. And I strongly agree that we need to build real applications as part of the design process -- I'm delighted to see the work on the new-and-improved reference server, in addition to the earlier work by EBI, NCBI, https://github.com/googlegenomics/api-provider-local-java, and Google's product.
- we discussed ReadGroup and ReadGroupSet at great length the first time around. There was passionate support (including by @lh3) to allow retrieving from a single ReadGroup, and there was passionate support (including by me) to allow retrieving all reads from a single ReadGroupSet -- that's what led to the current method accepting an array of readGroupIds[]. There's more than one way to satisfy those two requirements, but I believe that any proposal needs to continue to do so.
Hi David (@dglazer),
Thank you for the thoughtful and insightful comments, but with the highest esteem and most respect, why are we creating limits based on a design decision at Google? I read most of the Google papers - and many others - and I know what is possible both from them and the different subject areas in computer science.
I posted the following for a reason:
For instance from the Mesa paper - which one can read from the following link - there is a clear line that says:
We show how schema changes for a large number of tables can be performed dynamically and efficiently without affecting correctness or performance of existing applications.
Data and schemas are two separate things. I'm not saying let's use Mesa - which would be nice - but let's implement concepts that utilize some of that functionality in a public setting. If you're up for it, let's re-engineer Google's backend for the GA4GH implementation to make it flexible enough for future implementations. It is just a service, and that means the code does not have to be public - which it currently is not - and thus satisfies that requirement as well.
I have worked at too many places to see unnecessary limits set because of an earlier design decision. We have an opportunity here to do things right. Let's put our minds together and make it right.
Let me know what you think.
Thanks,
Paul
@dglazer --- fair points! I've updated the PR, and will comment more below, as this discussion thread will be hidden by default by GitHub.
I still need to catch up with the exchanges on the definition of Allele etc., but it also makes sense to me that Alleles are global top-level objects, not specific to a VariantSet. Richard
+10 again for "Allele is global, not specific to any VariantSet." I have been wondering why this design decision was made in the first place, because it seems topsy-turvy. For the initial G2P schema, I have chosen to use the proposed
David also said allele is global. In principle, an allele could be species-specific, but due to the way we find/name alleles, a GA4GH allele is specific to a ReferenceSet. Different ReferenceSets get different alleles.
Thanks for all the feedback. The consensus seems to be, "this change is OK, but there are some details to be worked out". Is this fair? I'm perfectly happy for an Allele to be global, but we will need to think about what the implications for search and authentication are. Will we want to search for all the Alleles in a given

In terms of authentication, how do we decide whether a user has permissions to see a given allele? I'd imagine the model of allowing users to search over a given VariantSet or ReferenceSet or not will give us the restrictions we require.

@lh3 - I'm happy to delete both SearchCall and SearchAlleleCall: I don't think they are particularly useful.
Jerome, I think authentication was deemed to be resource-specific and not part of this API based on this latest post:
I think the client/server and reads/multi-BAM slicing discussion will prove fruitful for future aspects of the API, so I put a link to it here in case folks want to refer to it later:

Paul
Thanks for all of the comments on this PR so far, it has been a very interesting discussion. Multi-BAM merge is important, so I accept that it should stay. I've updated the PR to reintroduce the

Any comments on this or any of the other changes would be much appreciated.
Hi Richard (@richarddurbin), Regarding a reference server - which would be a big win - we should just start simple, the way search engines are built, via an inverted index. Below are the general steps:

This would also enable efficient parallelization across multiple machines. The storage can also be optimized via the Log-Structured Merge (LSM) Tree data structure with Differentiated Indexing (Diff-Index), which I put a reference for below.

References

A nice graphical description of an inverted index is shown at the following link:

Regarding a deeper description of inverted indices, below is a nice paper from Google that contains a description of it: http://static.googleusercontent.com/media/research.google.com/en/us/pubs/archive/37367.pdf

Below is another nice paper from Google/IBM/GeorgiaTech regarding the Log-Structured Merge (LSM) Tree data structure with Differentiated Indexing (Diff-Index): http://researcher.watson.ibm.com/researcher/files/us-wtan/DiffIndex-EDBT14-CR.pdf

Let me know what you think. Thank you,
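For readers unfamiliar with the inverted-index idea referenced above, here is a minimal generic sketch (it does not reconstruct Paul's elided steps; the document ids and tokenization are invented for illustration):

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Map each token to the set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in text.lower().split():
            index[token].add(doc_id)
    return index

def query(index, *terms):
    """AND query: return the ids of documents containing every term."""
    sets = [index.get(term.lower(), set()) for term in terms]
    return set.intersection(*sets) if sets else set()

# Toy corpus standing in for reference metadata records.
docs = {
    "ref1": "GRCh37 chr4 reference sequence",
    "ref2": "GRCh38 chr4 reference sequence",
    "ref3": "GRCh38 MT reference sequence",
}
idx = build_inverted_index(docs)
print(sorted(query(idx, "GRCh38", "chr4")))  # documents matching both terms
```

Because the index maps token -> posting set, the corpus can be sharded by document and each shard indexed independently, which is the parallelization property the comment points at.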
@jeromekelleher, that multi-BAM change addresses the biggest concern I saw raised -- thanks. There are 12 separate changes in this PR, and I was finding the diffs hard to read, so I made the summary below. Assuming I got it right, I:

A few questions:
6 not yet published APIs

Backwards compatibility isn't an issue here.

src/main/resources/avro/metadatamethods.avdl
src/main/resources/avro/variantmethods.avdl

2 published APIs, with unimplemented behavior

As far as I know these changes don't break any existing implementations, so again backwards compatibility isn't an issue.

4 published APIs, with implemented behavior

These changes are small, but might cause a backwards compatibility issue.

src/main/resources/avro/readmethods.avdl
src/main/resources/avro/referencemethods.avdl
Thanks for the summary @dglazer, this is very helpful. Your summary looks accurate to me. Since the original PR, another set of APIs has been merged in #246, so I've updated the methods here to follow the single-parent-container-unless-absolutely-necessary paradigm. These go into the not yet published APIs bucket:

src/main/resources/avro/sequenceAnnotationmethods.avdl

(Apologies for the whitespace noise in the diff --- I can get rid of this if you prefer.)
@jeromekelleher, @dglazer - One thing to keep in mind: it is not uncommon to merge variants or to search across many variant sets at the same time. So removing this

Just out of curiosity, if it is removed, does that mean that before querying across multiple VariantSets, I would merge some or all VariantSets into one gigantic one in order to search across multiple ones? These would be so many that it would defeat the purpose of removing

Paul
@dglazer yes, that summary looks correct to me. Aside from the

I'm neutral on the unpublished methods.
*/
union { null, string } referenceSetId = null;
Seems this is breaking a bit from the many-to-many relationship which has been discussed previously around references : referenceSets. Have the graph changes altered that discussion?
I'm thinking of the CRAM use-case, for example. Given an MD5, find the relevant reference.
@calbach, see #247 regarding where this is stemming from. I agree that many-to-many is still important, but it probably requires more clarity, with examples, in the structure of the definitions. Just for reference for others, the original many-to-many was discussed initially in the following issue:
You can still search with MD5 via this in the same method:
/**
If nonempty, return references which match any of the given `md5checksums`.
See `Reference::md5checksum` for details.
*/
array<string> md5checksums = [];
~p
Ah, OK, I didn't realise there was a many-to-many relationship here. I'm happy to revert this if you think it's important.
On a side note, it would be very useful to have an overview of the various objects and how they relate to each other. It's not at all obvious from the current documentation...
Pre-graph changes: it wasn't required but it wasn't disallowed. If a repository serves reference sets hg19 and GRCh37, it would be up to the implementation whether chr4 (MD5 is identical in both assemblies) is represented as one or two references. The former is easier to implement and might be required in some cases, e.g. if ACLs are involved, but the latter may be conceptually cleaner. But yes, it should probably be made explicit in the docs that this is allowed.
Things have also changed with the graph PR. Now you can nest reference sets within a reference set (I missed that when I was reviewing the changes). I guess that makes the semantics of this field slightly more complicated and I'm not sure how it plays with many : many.
If reference sets are still many:many with references after the graph changes, then I'd definitely prefer this line be reverted.
OK, fair enough. @adamnovak or @richarddurbin could you clarify the status of the many-to-many relationships between references and reference sets in the graph world please?
(My opinion here would be to try and keep things simple and avoid many-to-many relationships if we can. Implementations are free to reuse the underlying objects representing the same references within their code, but can present them to the outside world as-if they were different objects.)
We need to allow the same reference to be in many reference sets. For example, the same human mitochondrial sequence will be in many different versions of the human genome reference set. Even the same GRCh38 primary chromosome 1 will be in different reference sets, with and without EBV, alts, IMGD MHC versions etc. This is essential for practical use. This has nothing to do with the graph world; it is core to the way that people practically use genome reference sequences.

Slightly more radically, I think in the graph world we need the ability to create extended reference sets that include one previous reference set and add variants to it. This amounts to a single-inheritance tree. In the side graph level hierarchy I would require the included reference to be first, so that other stuff adds to it. I believe this cleanly captures how we want to build and extend graphs.

Richard
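Richard's single-inheritance idea can be sketched as a small data model. This is a hypothetical illustration only: the class and field names (ReferenceSet, parent, reference_ids) are invented for the sketch, not part of the GA4GH schema.

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class ReferenceSet:
    """Sketch of an 'extended' reference set: it names at most one parent
    set (single inheritance) and adds its own references on top."""
    id: str
    reference_ids: List[str] = field(default_factory=list)
    parent: Optional["ReferenceSet"] = None

    def all_reference_ids(self):
        """Accumulate references down the inheritance chain, parent first,
        so the included set comes before anything added to it."""
        inherited = self.parent.all_reference_ids() if self.parent else []
        return inherited + self.reference_ids

primary = ReferenceSet("GRCh38-primary", ["chr1", "chrMT"])
with_alts = ReferenceSet("GRCh38+alts", ["chr6_alt_MHC"], parent=primary)
print(with_alts.all_reference_ids())  # parent's references, then the additions
```

Note how the same parent object can sit under many extended sets, giving the many-to-many sharing of references that the thread argues is essential.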
Thanks @richarddurbin, that's helpful.
@calbach --- I've reverted the change in question.
+1 [with great enthusiasm] It's very important to try to make the GA4GH APIs more consistent and rational. The current state of the queries is a result of different committees as opposed to a single design effort. So it has done a good job of defining requirements, but not of defining an overall consistent API. This PR is an excellent step in the right direction. It can and will evolve once we get more experience with using it. However, we need to make incremental changes like this to move forward.
By my count, we now have three +1s and all of the contentious issues have been resolved. Unless there are any objections, I think this is mergeable.
I'm a bit concerned about losing the ability to search
/**
If nonempty, return only `AlleleCall`s associated with one of these `Allele`s.
*/
array<string> alleleIds = [];
How would you go about looking up how a particular CallSet calls a particular Allele without this field? Do you have to look at all the AlleleCalls?
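Adam's concern can be made concrete: without an alleleIds filter, the client's only recourse is a full scan of the CallSet's AlleleCalls. A hypothetical sketch (the dict field names callSetId/alleleId/genotype are invented stand-ins for the schema's records):

```python
def find_allele_call(allele_calls, call_set_id, allele_id):
    """Without a server-side `alleleIds` filter, the client must scan every
    AlleleCall, keeping only the one for the Allele of interest."""
    for allele_call in allele_calls:
        if (allele_call["callSetId"] == call_set_id
                and allele_call["alleleId"] == allele_id):
            return allele_call
    return None

# Toy data: every AlleleCall the server would return for these CallSets.
allele_calls = [
    {"callSetId": "cs1", "alleleId": "a1", "genotype": [0, 1]},
    {"callSetId": "cs1", "alleleId": "a2", "genotype": [1, 1]},
    {"callSetId": "cs2", "alleleId": "a1", "genotype": [0, 0]},
]
hit = find_allele_call(allele_calls, "cs1", "a2")
print(hit["genotype"])  # the call cs1 makes on allele a2
```

The scan is O(total calls) per lookup, which is exactly the cost the alleleIds field lets the server avoid.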
Once Adam's query is resolved, I'm +1.
Major Enhancement

The commit touches

-1

To clear the vote, the following people need to respond with +1 votes after this message for explicit approval of the changes: @jeromekelleher (done)

(P.S. if any of the people above do not want to be part of the future explicit approval list, let me know.)
I am +1.
+1 for me. I agree with @adamnovak concerning early optimization. Single search (thus multiple requests) is fine for now, with the caveat that usage metrics, once we have them, may require us to revisit this in future.
Hi Angel, appreciate you watching out for reads people, but why the need to
Precisely because the changes affect the implementations made by a group that no longer meets regularly. I'm really just acting as a box-checker, so if there is a process change that needs to be made, please put it on the agenda for the DWG meetings or for Leiden.
Fair enough. I guess I'll make a pull request against the contributions doc!
Thanks @delagoya -- I don't see any concerns; I'll take a closer look in the morning and vote officially.
@dglazer I think you are in luck, and it's a piece of cake :) Just follow the link below and adjust the link parameters accordingly. In this case I am comparing

To view the changes, just go to the split view by clicking first on the File Changes tab and then selecting the Split button, as highlighted by the red arrow in the screenshot below.

Let me know if this is what you were looking for, or if it was something different. Thanks,
+1 on the changes to already-published methods (which afaict are the same as the last time I looked closely in March), and +0 on the changes to not-yet-published methods. Thanks @jeromekelleher for pushing this forward.

Small process suggestion for next time -- break things up into smaller pieces. For example, I think the big changes to brand-new methods, like

P.S. In case it's useful to others, here are the 5 substantive edits to already-published methods that I found:

src/main/resources/avro/variantmethods.avdl
src/main/resources/avro/readmethods.avdl
Thanks @dglazer, I appreciate the feedback. I agree about making PRs smaller when possible. Apologies for making it difficult to see the additional changes: this is the drawback of squashing commits before a PR is fully ready for merging.
@vadimzalunin and @lh3, there were some changes since you last approved. Since there is an active working group and two out of three major implementations weighing in as +1, I'll give another day to respond before I remove the -1 process vote. Thanks!
bump... I think we need @lh3 and @vadimzalunin to weigh in here?
@jeromekelleher let me reach out via email. If they have not responded by tomorrow, we should take a look at the process to see if there is a time limit and best effort for further review.
+1
Ample time and multiple contact attempts with no response, so I'm going to call it and merge.
Rationalises the SearchObjectRequest semantics.
Thanks @delagoya!
This PR changes the semantics of all SearchObjectRequests so they require the ID of a single container object, if applicable. Currently, a number of different conventions exist, making implementation and authorisation difficult. This solves issue #247 and is related to the issues in #248.

Most of the changes are straightforward, and just require the change of a field like (for example) array<string> dataSetIds = []; to string dataSetId;. There are some problematic cases:

- SearchCallsRequest and SearchAlleleCallsRequest both had a large list of options for the containers that they could search over. As an initial strawman I've reduced this to just the parent CallSet, to be consistent with all the other objects. Reasoning for these:
  - Since a CallSet is part of a VariantSet, the variant set is implied by specifying callSetId.
  - Searching for Calls for a particular variantId is redundant, since Calls are embedded in Variant objects. If we wish to find the calls for a given variant, we can either find them directly using SearchVariantsRequest or make a call to GET /variants/<id>/ for the Variant object. However, this might be updated with more meaningful semantics later.
- SearchAlleles is problematic; what is the interplay between the variantSetId and the sequenceId? If sequenceId is required, then why do we need the variantSetId? Conceptually, it seems natural that an Allele belongs to a single VariantSet.

In terms of backwards compatibility, obviously there are some breaking changes here. These are minor, however, and easily worked around. Server-side, this is a considerable simplification, and so implementation can simply be a special case of the existing semantics. Client-side, users that are using the multi-container search semantics will need to now explicitly iterate over the container IDs that they are interested in.
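The client-side iteration described in the closing paragraph can be sketched as follows. This is a hypothetical illustration, not GA4GH client code: search_variant_sets and the catalog list stand in for a single-container search endpoint.

```python
from itertools import chain

def search_variant_sets(dataset_id, catalog):
    """Single-container search under the new semantics: one datasetId per
    request. `catalog` stands in for the server's data."""
    return [vs for vs in catalog if vs["datasetId"] == dataset_id]

def search_variant_sets_multi(dataset_ids, catalog):
    """The old multi-container semantics, reproduced client-side:
    one single-container request per dataset id, results chained."""
    return list(chain.from_iterable(
        search_variant_sets(dataset_id, catalog) for dataset_id in dataset_ids))

# Toy catalog of variant sets across two datasets.
catalog = [
    {"id": "vs1", "datasetId": "d1"},
    {"id": "vs2", "datasetId": "d2"},
    {"id": "vs3", "datasetId": "d1"},
]
print([vs["id"] for vs in search_variant_sets_multi(["d1", "d2"], catalog)])
```

A client library can expose the multi-container helper while speaking only the simplified single-container protocol to the server, which is how existing callers can be shielded from the breaking change.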