-
Notifications
You must be signed in to change notification settings - Fork 112
Conversation
} | ||
|
||
/** | ||
Lists bases by sequence UUID and optional range. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why UUID here? I think this should just be an implementation private ID value.
Thanks very much @adamnovak, this looks great. I can't see any backwards compatibility issues, so I would vote to merge this ASAP and start getting some real experience working with the graph model. A couple of minor, non-blocking issues:
|
@benedictpaten wanted UUIDs for sequences to facilitate merging across different API endpoints without having to rewrite or prefix IDs. If we are fine with having to manually construct truly global IDs (or just using accessions of some kind), I can get rid of the UUID thing. For the sequence/reference/segment confusion, here's the simplest way I have found to put it so far: A sequence is a component in a sequence graph, optionally joined at its ends to other sequences. Sequences are not directly retrievable through the API, but ranges of their bases can be downloaded. A A This all kind of depends on the explanation of sequence graphs at the top of common.avdl, but I feel like maybe it should come first to give a higher-level overview. Thoughts? |
OK, as a result of the discussion on the call this morning, I've introduced the concept of supported modes. API servers that support the "graph" mode will always populate the graph fields, and API endpoints that support the "classic" mode will always populate the non-graph-supporting fields that I made nullable (and also won't do confusing things like return If a server supports the "classic" mode, a client doesn't have to know anything about graphs in order to use it. If a server supports the "graph" mode, a client can just look at the new graph fields and not have to worry about doing its own up-conversion. A server can support both modes perfectly fine, although it won't be able to return things only expressible in graph terms. This doesn't require any state on the server side; a server is permanently fixed in the modes it supports and the fields it fills in, regardless of what a client actually wants to use. |
Re UUIDs - we've gone with hashes everywhere else for these sorts of requirements. I think we should change to ID here and add in a hash field of some sort later. |
For metadata we make the format for id => UUID (recommended):
This is from the working doc https://github.com/ga4gh/metadata-team/blob/master/avro-playground/metadata_redo.avdl |
I like the modes distinction @adamnovak --- I'm not sure about the exact mechanism for clients to discover the mode that a server is running in, but I think something like this is what we need. I'd like to clarify what I think is important about having these two modes, and why we need them. There are a lot of people out there who have VCF data relative to a linear reference who would like to share their data. There are a lot of tools out there that ingest data in VCF. The v0.5 classic API is a direct translation of the concepts that these tools understand, and allow us to serve VCF data in a relatively straightforward manner. Importantly, we can convert from VCF to the classic API easily allowing us to make data available. We can also convert data from the classic API back to VCF in a one-to-one manner, allowing us to use existing tools on data coming from a GA4GH server. There will be VCF data for quite a long time, and it's critical to the success of the protocol that these workflows are supported naturally. Ultimately, we hope that the field will move from a linear reference to a graph reference, but this will take a long time. We need to provide an evolutionary path for tools to follow: first they adopt the classic APIs, which translate fairly directly into the concepts that they work with today. Then, slowly, they start to adapt to the graph APIs and begin reaping the benefits (which we will demonstrate). Eventually, everyone (or at least a lot of people) will have moved over to the graph API, and we can then start to deprecate the classic API, eventually removing it altogether. The main reason we don't want two different protocols is to avoid splitting our resources, and wasting time keeping them in sync. The classic and graph mode APIs must be closely compatible for this strategy to work, and the best way to do this is by implementing both simultaneously. If we fork off two separate protocols, they will never again merge and we'll split our limited developer resources between working on one or the other. There are unarguably downsides to this dual protocol approach, but I think they can be alleviated fairly easily. Firstly, server implementations are in no way required to implement graph mode, and are free to continue using classic mode for as long as they want. Secondly, clients should be encouraged to use classic mode by default, as this is by far the most likely thing that they would want to do. Also, graph mode is experimental, and will probably change considerably over time and so new users of the protocol should not be exposed to it unless this is explicitly what they want. A graph mode capable server running in classic mode should not return any of the new attributes in the JSON responses (since they will be null, it is the same thing to just leave the fields out). The approach is a compromise, but it is hopefully one which allows us to both support users who want to work with existing datasets, and also to develop the graph based methods. |
As long as our data can be converted between VCF and the graph model, I think that a gradual development to using the graph is unlikely, and On Thu, Feb 5, 2015 at 2:31 PM, Jerome Kelleher notifications@github.com
|
@jeromekelleher The way I've done the modes now, it's perfectly fine for a server to support both the graph and classic modes, as long as they don't say things that can't be articulated in classic mode. I think that that would be the mode you want to run your servers in, because you have maximum compatibility with all clients. I guess it makes sense not to send any graph fields if you don't actually advertise support for graph mode; however, I didn't write that in as a requirement, and I think clients should have to handle it. A classic-mode-only client doesn't have to interpret graph mode fields, but it does have to correctly ignore them if they are present. Otherwise it wouldn't work on a server that sends both modes. While a client will probably want to support reading data in classic mode, I agree with @ekg and think new client development should be focused on graph mode. A client designed around graph mode can do its own trivial upconversion of classic mode replies, but the reverse is not true. @mbaudis I think standardizing all the GA4GH APIs on recommending the use of UUIDs for record IDs is probably a good idea, but I'd like to do it in a seperate pull request, since it would apply to a bunch of API objects I haven't touched in this one. Does anyone have any recommendations for things I should change? Or other reasons why they can't give this a +1? |
@adamnovak My UUID comment was just a reminder to push this as a global default, not necessarily here; though objects with clear & limited context/scope may probably do better without the generation/storage overhead (?). Don't feel entitled to +1 in though out of sheer personal incompetence ... |
58ae7a4
to
5efdc7b
Compare
I'm +1 on this (obviously), but we need at least a couple of non-UCSC people to +1 it. I think Adam should squash down the commits to make it a bit easier to digest the changes. |
5efdc7b
to
4737b00
Compare
I've squashed everything down to one commit for ease of evaluation. I'd appreciate it if people could take a look at this. |
itself is null, this is the MD5 of the upper-case sequence excluding all | ||
whitespace characters (this is equivalent to SQ:M5 in SAM). | ||
|
||
Otherwise, this is the MD5 of the above MD5 checksum, `segment.startJoin`'s |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How is the MD5 of the {start,end}Join
defined? Is this the md5checksum
fields of those segments? I believe that given the semantics of {start,end}Join, there cannot be a circular dependency here (please verify), but what's the worst case "cascading" MD5 checksum recalculation I could cause by adding a new reference to the graph?
I'm not sure that this is a real problem that should matter; the question is more just to test my understanding of how checksums are working here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The next paragraph in the comment describes the MD5 of a Position
(which is what startJoin
and endJoin
are) as the MD5 of the md5checksum
of the Reference
it is on, the position it is at, and the strand as "+" or "-".
You are right that there can't be circular dependencies. But it's also impossible to cause a recalculation by adding a new sequence. New sequences are only allowed to join onto existing sequences---existing sequences can't be made children of new sequences. So since a Reference
's md5checksum
depends only on itself and its parents, adding new sequences won't change the md5checksum
s of any existing Reference
s.
If you did somehow go back and change the parent of some existing sequence, it and all its children, and all their children, and so on, would all need their md5checksum
s recalculated.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
IMHO, these type of questions means the documentation needs enhanced.
adamnovak notifications@github.com writes:
In src/main/resources/avro/references.avdl:
/** The length of this reference's sequence. */
long length;/**
- MD5 of the upper-case sequence excluding all whitespace characters
- (this is equivalent to SQ:M5 in SAM).
- The MD5 checksum uniquely representing this
Reference
and its position in- the
ReferenceSet
's sequence graph, as a lower-case hexadecimal string.
- If
segment.startJoin
andsegment.endJoin
are both null, orsegment
- itself is null, this is the MD5 of the upper-case sequence excluding all
- whitespace characters (this is equivalent to SQ:M5 in SAM).
- Otherwise, this is the MD5 of the above MD5 checksum,
segment.startJoin
'sThe next paragraph in the comment describes the MD5 of a Position (which is
what startJoin and endJoin are) as the MD5 of the md5checksum of the Reference
it is on, the position it is at, and the strand as "+" or "-".You are right that there can't be circular dependencies. But it's also
impossible to cause a recalculation by adding a new sequence. New sequences are
only allowed to join onto existing sequences---existing sequences can't be made
children of new sequences. So since a Reference's md5checksum depends only on
itself and its parents, adding new sequences won't change the md5checksums of
any existing References.If you did somehow go back and change the parent of some existing sequence, it
and all its children, and all their children, and so on, would all need their
md5checksums recalculated.—
Reply to this email directly or view it on GitHub.*
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've added a comment to the effect that Reference
s are designed to be immutable, with ReferenceSet
s most easily extensible through the addition of new child references. Any other ideas on how the documentation could be enhanced?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the segments are explained clearly enough elsewhere. My only suggestion would be that this comment make it clearer that it's referring to the md5Checksum
field of the associated joined segment. Right now it's slightly ambiguous, and could be interpreted as the MD5 checksum of the position
record itself.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's not referring to the md5checksum
of the associated joined segment; it's referring to the MD5 of the joined segment's md5checksum
concatenated with the join position as a decimal string and a "+" or "-" character denoting the strand of the join.
I've rewritten this section entirely to specify exactly what you are supposed to stir into your hash. Please let me know if it's still unclear.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks, much clearer.
4737b00
to
aa9e614
Compare
Several comments.
|
@lh3 raises excellent points. I have no issue with creating new types In general, we seem to have two choices:
What do you think @lh3, which is the lesser of two evils? |
Heng is right. The big distinction here, the one that will affect genetics On Tue, Feb 10, 2015 at 3:52 PM, Heng Li notifications@github.com wrote:
|
Well, not really "deprecating". Genotypic is here to stay. Maybe making it On Wed, Feb 11, 2015 at 8:17 AM, David Haussler haussler@soe.ucsc.edu
|
Hi David, Heng, Having been involved in API design before, it is hard to stress We also need to have complete, comprehensive documentation for Mark haussler notifications@github.com writes:
|
I also think we should split |
Might I propose @lh3 and @jeromekeller that we explore splitting these On Wed, Feb 11, 2015 at 9:39 AM, Benedict Paten benedict@soe.ucsc.edu
|
On Wed, Feb 11, 2015 at 9:57 AM, Heng Li notifications@github.com wrote:
That was the intention. There is some weirdness. In the vcf/classic model a
|
|
||
If the API server supports the "graph" mode, this field must not be null. | ||
*/ | ||
union { null, array<string> } alleleIds; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I thought that in graph mode, the alternateBases
and referenceBases
needed to be null
? If so, then what does agreement mean?
If this field is set along with
referenceName
,start
,end
,referenceBases
, and/oralternateBases
, those fields must agree with theAllele
s given here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nope. They have to be non-null if the server supports "calssic" mode, but they don't have to be null if the server supports "graph" mode, because a server can support "classic" and "graph" mode at the same time.
You can have both groups of fields set; they just need to agree with each other (which in practice means you can't use some of the features of the graph reference).
1ba35ac
to
3bd1a68
Compare
OK, by popular demand I have split This is simpler in that It's more complicated in that if you want all the information for a |
Going back to last night's comments by Heng, I've tried to craft a simple The allelic model of genetic variation The field of genetics was born before we could read genomes, yet through The more accurate model is simple: each individual has a genome, where a The allelic model of genetic variation is an abstraction that sits at a Instead of sites, the elementary units of a reference structure are In the allelic model, an allele is defined as a sequence of positions that The allelic model does not have the classical genetics concepts of “site” The allelic model is a richer foundation for genetics. Using something like On Wed, Feb 11, 2015 at 11:34 AM, adamnovak notifications@github.com
|
Speaking of the importance of images, I've done up a UML class diagram of all the data model records, so we can see my proposed changes (on the right) as well as how the whole API fits together. |
@fnothaft , we don't need to use Avro IDL. We can either change it - which many companies do internally - or use something like WebIDL which provides inheritance capabilities: If we keep working around the limitations of Avro, the schema will become cumbersome to support and expand upon. Paul |
I'm in favour of merging this now. It seems clear that the consensus is that we want something like this, but it's also clear that it will take some time to resolve the remaining issues. I would suggest that the most productive approach now would be to merge what we have and start filing issues and PRs against it to resolve the outstanding problems. +1. |
Richard astutely pointed out that there were some obnoxious aspects to the The allelic model of genetic variation In the textbook genetic model, genetic information in a diploid population Genetic information is carried in a genome consisting of molecules of The allelic model of genetic variation is an abstraction that sits at a Instead of sites, the elementary units of a reference structure are In the allelic model, an allele is defined as a sequence of positions that The allelic model is a generalization because it does not include the Being more general, the allelic model is a broader foundation for genetics. On Wed, Feb 11, 2015 at 1:58 PM, David Haussler haussler@soe.ucsc.edu
|
+1, @adamnovak your presentation from today was fantastic |
bbd8b2b
to
4a62248
Compare
Can you squash the commits down please @adamnovak? Looks like we're ready to merge. |
There is now a `Path` and `Segment` in a graph namespace in common.avdl, and a `GraphAlignment` in the graph namespace in reads.avdl. We also have a `/graph/sequence/{id}` endpoint to get sequence bases in a generic way, and a couple more options to search by. Adding support for monoallelic calls. This adds an `Allele` type, and allows for `Call`s to be associated with `Allele`s, or with `Allele`s and `Variant`s. `Allele`s can be used to represent the ref and alt alleles of `Varaint`s. We also provide for associating new sequences with `VariantSet`s. Removing graph namespace and URL prefix Revising comments and field names Unifying checksums Revising variants comments Code reviewed by Prof. Haussler Changing comments to be more descriptive and mathematically tight. Also catching a few typos. Fixing oversights in new API calls Adding in primer on sequence graphs Revising David's sequence graph intro Explaining sequence search better Sequences are searched over, and results are returned as segments describing sequences. Also removing references to an out-of-date object name. Removing update timestamp from Allele Clarifying sequences, segments, and references Fixing up whitespace Introducing "graph" and "classic" modes. A server can support "graph" mode, "classic" mode, or both. If it supports "classic" mode, it will make sure to fill in all of the fields that non-graph clients require, and not use incompatible graph extensions like `GraphAlignment`. If it supports "graph" mode, it will make sure to fill in all of the new graph fields, like `Reference.segment` and `Variant.alleles`. Additionally, if it also does not support "classic" mode, it will not use older, less general things like `LinearAlignment` and single string phase sets. Making David's documentation changes Also adding more pictures and hammering on the parent/child distinction. Making startJoin and endJoin default to null Clarifying the order of segments in a path Fixing example typos Splitting out `AlleleCall`. There is now an `AlleleCall` to take care of calling copy numbers of `Alleles`, while the normal `Call` goes back to just handling genotypes. I also stripped some of the bachwards-compatibility warts off of `AlleleCall`, since it has no legacy fields to keep. Both kinds of calls are constrained to be consistent with each other; you can't call a genotype of [1, 1], and have a copy number of 0 for that allele in that variant. If you support "graph" mode, you have to serve `AlleleCall`s for every `Variant` you have `Call`s on, and visa versa if you support "classic" mode. More descriptive name for new method Fixing case where it would be impossible to get allele sequence. Since a `Position` can be specified by `Reference` name, there is no efficient way to search for a `Reference` by name, and the only way to get sequence bases (in the old and proposed APIs) is by ID, we need to be able to make sure that all `Position`s include IDs in the graph mode, so we can actually get the sequence. Fixing doc comment typos Clarifying computation of `Reference.md5checksum`. I now give an explicit procedure for how you are supposed to compute the `md5checksum` of a `Reference`. It no longer makes reference to the idea of a checksum for a `Position`, since that doesn't actually exist. It should be much clearer now what is to be hashed, which is important because everyone needs to do the same thing to get the same hashes.
4a62248
to
346bfc3
Compare
OK I have squashed everything down. I'm going to tag this as ready to merge. |
I'm merging this. |
that of the basepair itself, and the "+" confirms that we mean the default | ||
forward orientation. The right side is indicated using the "-" orientation. For | ||
example, the right side of the following G/C base pair is represented by the | ||
position (3,-). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
An slightly confusing coincidence in this example is that, in this sequence of length 7, the index 3
could be interpreted as being measured from either end.
The asymmetry between the start
index having the nick come just before it while the end
index has the nick after it (when both are read from left-to-right, as the indices are apparently intended to be read) made it hard for me to figure out all that was meant to be implied by the (3,-)
notation.
Let me know if that doesn't make sense.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
very good, thorough example overall though, thank you for writing this up!
This PR has been merged @ryan-williams, so the discussion thread has ended. Please open a PR if you'd like to make some updates, or open some issues corresponding to the individual comments. |
This PR adds graph support to the References, Reads, and Variant APIs in a backwards-compatible way.
common.avdl
is enhanced withSegment
andPath
types, as well as the notion of a sequence and a sequence graph, which is described at length at the top of the file.Reference
s can now have associatedSegment
s, placing them in a sequence graph for the ReferenceSet.Reads in the Reads API can now be aligned using
GraphAlignment
s as well asLinearAlignment
s.Variant
s in the Variants API can haveAllele
s defining their reference and alternate forms as paths in a sequence graph, andCall
s can call copy numbers for individualAllele
s as well as genotypes forVariant
s.Closes #233