This repository has been archived by the owner on Oct 28, 2022. It is now read-only.

Graph Support for GA4GH API #238

Merged 1 commit on Feb 21, 2015

adamnovak
Contributor

This PR adds graph support to the References, Reads, and Variant APIs in a backwards-compatible way.

common.avdl is enhanced with Segment and Path types, as well as the notion of a sequence and a sequence graph, which is described at length at the top of the file.

References can now have associated Segments, placing them in a sequence graph for the ReferenceSet.

Reads in the Reads API can now be aligned using GraphAlignments as well as LinearAlignments.

Variants in the Variants API can have Alleles defining their reference and alternate forms as paths in a sequence graph, and Calls can call copy numbers for individual Alleles as well as genotypes for Variants.

Closes #233

}

/**
Lists bases by sequence UUID and optional range.
Contributor

Why UUID here? I think this should just be an implementation private ID value.

@jeromekelleher
Contributor

Thanks very much @adamnovak, this looks great. I can't see any backwards compatibility issues, so I would vote to merge this ASAP and start getting some real experience working with the graph model. A couple of minor, non-blocking issues:

  1. Why UUIDs for sequences? This seems like an unnecessary requirement to me; implementation private IDs would be sufficient surely.
  2. It would be good to clarify the relationship between a sequence/reference/segment, as they are used somewhat interchangeably in the documentation and in the methods.

@adamnovak
Contributor Author

@benedictpaten wanted UUIDs for sequences to facilitate merging across different API endpoints without having to rewrite or prefix IDs. If we are fine with having to manually construct truly global IDs (or just using accessions of some kind), I can get rid of the UUID thing.

For the sequence/reference/segment confusion, here's the simplest way I have found to put it so far:

A sequence is a component in a sequence graph, optionally joined at its ends to other sequences. Sequences are not directly retrievable through the API, but ranges of their bases can be downloaded.

A Segment is a pointer to a range on a sequence, possibly including the joins at the sequence's ends. It does not include base data. When the server has to refer to some or all of the bases in a sequence, it uses a Segment to indicate them. A Path through a sequence graph is a list of Segments, while a sequence graph itself can be described using a set of Segments, one for each sequence.

A Reference is an object that marks a sequence as part of a ReferenceSet's sequence graph. It contains a Segment referring to the entirety of the sequence it annotates.
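To make the three terms concrete, here is a rough Python sketch of how they relate. It is only an illustration of the description above, not the Avro schema: the class layout and the field names `sequence_id`, `start`, `length`, and `reference_id` are assumptions, while `start_join` and `end_join` mirror the `startJoin` and `endJoin` fields on a Segment.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Position:
    """A base on one strand of a sequence; joins point at Positions."""
    sequence_id: str  # hypothetical field name
    position: int
    strand: str  # "+" or "-"


@dataclass(frozen=True)
class Segment:
    """A pointer to a range on a sequence. It carries no base data."""
    sequence_id: str  # hypothetical field name
    start: int        # hypothetical field name
    length: int       # hypothetical field name
    start_join: Optional[Position] = None
    end_join: Optional[Position] = None


@dataclass(frozen=True)
class Reference:
    """Marks a sequence as part of a ReferenceSet's sequence graph."""
    reference_id: str  # hypothetical field name
    segment: Segment   # refers to the entire sequence it annotates


# A Path through a sequence graph is an ordered list of Segments.
Path = List[Segment]

# Toy graph: a 10-base root sequence and a 4-base sequence joined onto it.
root = Reference("ref-root", Segment("seq-root", start=0, length=10))
child = Reference(
    "ref-alt",
    Segment(
        "seq-alt", start=0, length=4,
        start_join=Position("seq-root", 2, "+"),
        end_join=Position("seq-root", 7, "+"),
    ),
)

# A path that leaves the root, walks the child sequence, and rejoins the root.
path: Path = [Segment("seq-root", 0, 3), child.segment, Segment("seq-root", 7, 3)]
path_length = sum(segment.length for segment in path)
```

The sequence itself never appears as an object here, which matches the point that sequences are not directly retrievable: only Segments that point at them are.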

This all kind of depends on the explanation of sequence graphs at the top of common.avdl, but I feel like maybe it should come first to give a higher-level overview. Thoughts?

@adamnovak
Contributor Author

OK, as a result of the discussion on the call this morning, I've introduced the concept of supported modes. API servers that support the "graph" mode will always populate the graph fields, and API endpoints that support the "classic" mode will always populate the non-graph-supporting fields that I made nullable (and also won't do confusing things like return GraphAlignments in the reads API).

If a server supports the "classic" mode, a client doesn't have to know anything about graphs in order to use it. If a server supports the "graph" mode, a client can just look at the new graph fields and not have to worry about doing its own up-conversion. A server can support both modes perfectly fine, although it won't be able to return things only expressible in graph terms.

This doesn't require any state on the server side; a server is permanently fixed in the modes it supports and the fields it fills in, regardless of what a client actually wants to use.
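As a rough illustration of how a client might use the advertised modes, here is a minimal Python sketch. How a server actually advertises its modes is not settled here, so the plain list of mode strings and the function name `pick_mode` are assumptions made for the example.

```python
from typing import Iterable, Optional, Sequence

CLASSIC = "classic"
GRAPH = "graph"


def pick_mode(server_modes: Iterable[str],
              client_preference: Sequence[str]) -> Optional[str]:
    """Return the first mode in the client's preference order that the server supports.

    Returns None when the two sides share no mode at all.
    """
    supported = set(server_modes)
    for mode in client_preference:
        if mode in supported:
            return mode
    return None


# A server that supports both modes works with any client.
both = pick_mode([CLASSIC, GRAPH], [GRAPH, CLASSIC])
# A graph-aware client falls back to classic on a classic-only server.
fallback = pick_mode([CLASSIC], [GRAPH, CLASSIC])
# A classic-only client cannot use a graph-only server.
none_shared = pick_mode([GRAPH], [CLASSIC])
```

Because the server's modes are fixed, the client can make this choice once and reuse it for every request.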

@jeromekelleher
Contributor

Re UUIDs - we've gone with hashes everywhere else for these sorts of requirements. I think we should change to ID here and add in a hash field of some sort later.

@mbaudis
Member

mbaudis commented Feb 5, 2015

For metadata we make the format for id => UUID (recommended):

  /**
  The individual UUID. This is globally unique. 
  The recommendation is to use an anonymous UUID type with extremely low collision probability.
  The typical implementation would be UUIDv4.  
  This has to be provided by the implementation.
  */  
  string id;

This is from the working doc https://github.com/ga4gh/metadata-team/blob/master/avro-playground/metadata_redo.avdl

@jeromekelleher
Contributor

I like the modes distinction @adamnovak --- I'm not sure about the exact mechanism for clients to discover the mode that a server is running in, but I think something like this is what we need. I'd like to clarify what I think is important about having these two modes, and why we need them.

There are a lot of people out there who have VCF data relative to a linear reference who would like to share their data. There are a lot of tools out there that ingest data in VCF. The v0.5 classic API is a direct translation of the concepts that these tools understand, and allows us to serve VCF data in a relatively straightforward manner. Importantly, we can convert from VCF to the classic API easily, allowing us to make data available. We can also convert data from the classic API back to VCF in a one-to-one manner, allowing us to use existing tools on data coming from a GA4GH server. There will be VCF data for quite a long time, and it's critical to the success of the protocol that these workflows are supported naturally.

Ultimately, we hope that the field will move from a linear reference to a graph reference, but this will take a long time. We need to provide an evolutionary path for tools to follow: first they adopt the classic APIs, which translate fairly directly into the concepts that they work with today. Then, slowly, they start to adapt to the graph APIs and begin reaping the benefits (which we will demonstrate). Eventually, everyone (or at least a lot of people) will have moved over to the graph API, and we can then start to deprecate the classic API, eventually removing it altogether.

The main reason we don't want two different protocols is to avoid splitting our resources, and wasting time keeping them in sync. The classic and graph mode APIs must be closely compatible for this strategy to work, and the best way to do this is by implementing both simultaneously. If we fork off two separate protocols, they will never again merge and we'll split our limited developer resources between working on one or the other.

There are unarguably downsides to this dual protocol approach, but I think they can be alleviated fairly easily. Firstly, server implementations are in no way required to implement graph mode, and are free to continue using classic mode for as long as they want. Secondly, clients should be encouraged to use classic mode by default, as this is by far the most likely thing that they would want to do. Also, graph mode is experimental, and will probably change considerably over time and so new users of the protocol should not be exposed to it unless this is explicitly what they want. A graph mode capable server running in classic mode should not return any of the new attributes in the JSON responses (since they will be null, it is the same thing to just leave the fields out).

The approach is a compromise, but it is hopefully one which allows us to both support users who want to work with existing datasets, and also to develop the graph based methods.

@ekg

ekg commented Feb 5, 2015

As long as our data can be converted between VCF and the graph model,
people will continue to be able to use tools structured around the linear
reference. The same goes for BAM and graph-based alignments. Obviously
these models are not completely congruent, so conversion from graphs ->
linear systems will lose data, but in the vast majority of cases it seems
trivial to convert back and forth between a linear and a graph way of
working.

I think that a gradual development to using the graph is unlikely, and
instead big leaps will be made where it is adopted. Most old tools will
remain useful as long as conversion back to the linear model is easy. New
tools will be written against a graph model.


@adamnovak
Contributor Author

@jeromekelleher The way I've done the modes now, it's perfectly fine for a server to support both the graph and classic modes, as long as it doesn't say things that can't be articulated in classic mode. I think that is how you would want to run your servers, because it gives you maximum compatibility with all clients.

I guess it makes sense not to send any graph fields if you don't actually advertise support for graph mode; however, I didn't write that in as a requirement, and I think clients should have to handle it. A classic-mode-only client doesn't have to interpret graph mode fields, but it does have to correctly ignore them if they are present. Otherwise it wouldn't work on a server that sends both modes.
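Here is a minimal Python sketch of what "correctly ignore them" could look like in a classic-mode-only client. The classic field names are taken from my recollection of the v0.5 Variant record, and `alleleIds` is a made-up stand-in for a graph-mode field, so treat all of the names as assumptions.

```python
import json

# Fields a hypothetical classic-mode-only client understands (assumed names).
CLASSIC_VARIANT_FIELDS = (
    "id", "referenceName", "start", "end", "referenceBases", "alternateBases",
)


def read_classic_variant(raw: str) -> dict:
    """Parse a Variant reply, keeping classic fields and dropping anything else."""
    record = json.loads(raw)
    return {name: record.get(name) for name in CLASSIC_VARIANT_FIELDS}


# A reply from a server that fills in both modes; "alleleIds" is hypothetical.
reply = json.dumps({
    "id": "v1",
    "referenceName": "1",
    "start": 100,
    "end": 101,
    "referenceBases": "A",
    "alternateBases": ["T"],
    "alleleIds": ["allele-1", "allele-2"],
})
variant = read_classic_variant(reply)
```

The point of the design is that the client picks out what it knows rather than rejecting records containing fields it does not know.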

While a client will probably want to support reading data in classic mode, I agree with @ekg and think new client development should be focused on graph mode. A client designed around graph mode can do its own trivial upconversion of classic mode replies, but the reverse is not true.

@mbaudis I think standardizing all the GA4GH APIs on recommending the use of UUIDs for record IDs is probably a good idea, but I'd like to do it in a separate pull request, since it would apply to a bunch of API objects I haven't touched in this one.

Does anyone have any recommendations for things I should change? Or other reasons why they can't give this a +1?

@mbaudis
Member

mbaudis commented Feb 5, 2015

@adamnovak My UUID comment was just a reminder to push this as a global default, not necessarily here; though objects with clear & limited context/scope may probably do better without the generation/storage overhead (?).

I don't feel entitled to +1 it, though, out of sheer personal incompetence ...

@benedictpaten
Contributor

I'm +1 on this (obviously), but we need at least a couple of non-UCSC people to +1 it. I think Adam should squash down the commits to make it a bit easier to digest the changes.

@adamnovak
Contributor Author

I've squashed everything down to one commit for ease of evaluation. I'd appreciate it if people could take a look at this.

itself is null, this is the MD5 of the upper-case sequence excluding all
whitespace characters (this is equivalent to SQ:M5 in SAM).

Otherwise, this is the MD5 of the above MD5 checksum, `segment.startJoin`'s
Contributor

How is the MD5 of the {start,end}Join defined? Is this the md5checksum fields of those segments? I believe that given the semantics of {start,end}Join, there cannot be a circular dependency here (please verify), but what's the worst case "cascading" MD5 checksum recalculation I could cause by adding a new reference to the graph?

I'm not sure that this is a real problem that should matter; the question is more just to test my understanding of how checksums are working here.

Contributor Author

The next paragraph in the comment describes the MD5 of a Position (which is what startJoin and endJoin are) as the MD5 of the md5checksum of the Reference it is on, the position it is at, and the strand as "+" or "-".

You are right that there can't be circular dependencies. But it's also impossible to cause a recalculation by adding a new sequence. New sequences are only allowed to join onto existing sequences---existing sequences can't be made children of new sequences. So since a Reference's md5checksum depends only on itself and its parents, adding new sequences won't change the md5checksums of any existing References.

If you did somehow go back and change the parent of some existing sequence, it and all its children, and all their children, and so on, would all need their md5checksums recalculated.
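The dependency argument can be sketched in a few lines of Python. This is a toy model, not server code: the `parents` mapping from each sequence to the sequences it joins onto, and the function name `stale_after_change`, are assumptions made for illustration.

```python
from collections import defaultdict
from typing import Dict, List, Set


def stale_after_change(parents: Dict[str, List[str]], changed: str) -> Set[str]:
    """Return every sequence whose checksum depends on `changed`, including itself."""
    # Invert the parent map so we can walk from a sequence to its children.
    children = defaultdict(list)
    for sequence, sequence_parents in parents.items():
        for parent in sequence_parents:
            children[parent].append(sequence)

    stale: Set[str] = set()
    stack = [changed]
    while stack:
        sequence = stack.pop()
        if sequence not in stale:
            stale.add(sequence)
            stack.extend(children[sequence])
    return stale


# Toy graph: "a" and "c" join onto "root", and "b" joins onto "a".
parents = {"root": [], "a": ["root"], "b": ["a"], "c": ["root"]}

# Changing "a" after the fact invalidates "a" and everything below it.
after_edit = stale_after_change(parents, "a")

# Adding a new sequence "d" under "b" touches nothing that already existed.
parents["d"] = ["b"]
after_add = stale_after_change(parents, "d")
```

Since joins only ever point from a newer sequence to an older one, the walk from a newly added sequence never reaches an existing one.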

Contributor

IMHO, these types of questions mean the documentation needs to be enhanced.

adamnovak notifications@github.com writes:

In src/main/resources/avro/references.avdl:

/** The length of this reference's sequence. */
long length;

/**

  • MD5 of the upper-case sequence excluding all whitespace characters
  • (this is equivalent to SQ:M5 in SAM).
  • The MD5 checksum uniquely representing this Reference and its position in
  • the ReferenceSet's sequence graph, as a lower-case hexadecimal string.
  • If segment.startJoin and segment.endJoin are both null, or segment
  • itself is null, this is the MD5 of the upper-case sequence excluding all
  • whitespace characters (this is equivalent to SQ:M5 in SAM).
  • Otherwise, this is the MD5 of the above MD5 checksum, segment.startJoin's


Contributor Author

I've added a comment to the effect that References are designed to be immutable, with ReferenceSets most easily extensible through the addition of new child references. Any other ideas on how the documentation could be enhanced?

Contributor

I think the segments are explained clearly enough elsewhere. My only suggestion would be that this comment make it clearer that it's referring to the md5Checksum field of the associated joined segment. Right now it's slightly ambiguous, and could be interpreted as the MD5 checksum of the position record itself.

Contributor Author

It's not referring to the md5checksum of the associated joined segment; it's referring to the MD5 of the joined segment's md5checksum concatenated with the join position as a decimal string and a "+" or "-" character denoting the strand of the join.

I've rewritten this section entirely to specify exactly what you are supposed to stir into your hash. Please let me know if it's still unclear.
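For anyone who wants to see the join hash spelled out, here is a small Python sketch of the two pieces described in this thread. The function names are made up, and the concatenation order (checksum, then position, then strand) and the ASCII encoding are my reading of the sentence above, so check them against the rewritten comment in references.avdl before relying on them. How these pieces combine into a Reference's final md5checksum is deliberately left out.

```python
import hashlib


def sequence_md5(bases: str) -> str:
    """MD5 of the upper-case sequence with all whitespace removed (as for SQ:M5 in SAM)."""
    cleaned = "".join(bases.split()).upper()
    return hashlib.md5(cleaned.encode("ascii")).hexdigest()


def join_md5(joined_md5checksum: str, position: int, strand: str) -> str:
    """MD5 of the joined segment's md5checksum, the join position, and the strand.

    Assumed layout: checksum + decimal position + "+" or "-", encoded as ASCII.
    """
    if strand not in ("+", "-"):
        raise ValueError("strand must be '+' or '-'")
    material = f"{joined_md5checksum}{position}{strand}"
    return hashlib.md5(material.encode("ascii")).hexdigest()


# A sequence with no joins is hashed from its bases alone.
parent_checksum = sequence_md5("acgt ACGT\n")

# A join onto base 5 of that sequence, on each strand.
forward_join = join_md5(parent_checksum, 5, "+")
reverse_join = join_md5(parent_checksum, 5, "-")
```

Both functions return lower-case hexadecimal strings, which matches the format the md5checksum field is documented to use.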

Contributor

Thanks, much clearer.

@lh3
Member

lh3 commented Feb 10, 2015

Several comments.

  1. Are "classical" and "graph" the right names? This PR consists of two largely independent components: graph representation and allelic model. By independence, I mean we could in principle split this PR into the two with little interference. The graph mode is actually "graph+allelic", and the "classical" is "linear+genotypic". There could also be other combinations "linear+allelic" and "graph+genotypic" as well. In variants.avdl of this PR, we frequently say that some fields are mandatory in the graph mode, but more exactly, most of them are required by the allelic model, not by the graph representation. I think we need to rename the "graph mode" to something that is conceptually correct. I couldn't think of a clean name, though.
  2. The Call object is overloaded. In the genotypic model, a call is a genotype call. In the allelic model, a call is an allele call with a very different set of new fields (the only overlapping field is apparently callSetId). The overloading obscures the meaning of Call. I would suggest completely separating allelic and genotypic calls by adding an AlleleCall object and moving allele-associated fields to the new object. AlleleCall is conceptually different from the old Call object.
  3. Similarly the Variant object is overloaded to a lesser extent - some fields are allelic only and some are genotypic only. We might consider a similar treatment (see also point 4 below). On the other hand, Variant is a concept common to both allelic and genotypic models. Adding AlleleVariant seems inconsistent. I am not sure if there is a better solution.
  4. If we decide to use one Variant object in both the genotypic and allelic models, we need to be careful about the updated time. In the genotypic model, the allele sequences are members of Variant, which would imply that updating the allele sequences would change the updated time. In the allelic model, the IDs of alleles, rather than the alleles themselves, are members of Variant. If we change an allele sequence without changing its ID, do we change the updated time? In addition, for the referenceName, start and a few other fields, the API doc only says "If the API server supports the classic mode, this field must not be null", but it does not specify what these fields mean in the "graph" (more exactly allelic) mode. Do they have to be null? If we create an AlleleVariant object, this won't be a problem.
  5. Could someone write down code snippets to retrieve the genotype of some samples in a small region? I wrote some when the allelic model was on a separate branch. With both genotypic and allelic models mixed together and concepts overloaded, I am even less sure about what the right code is. This is critically important. If we creators of the APIs have difficulties in writing down the queries, end users will have more trouble.
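For the classic (linear+genotypic) side of point 5, the shape of such a snippet might be something like the Python sketch below. It only builds a request body and reads genotypes out of a reply that has already been parsed. The field names `variantSetIds`, `callSetIds`, `end`, `calls`, and `genotype` are assumptions based on my reading of the v0.5 variants API, and the function names are made up. What the equivalent looks like in the allelic model is exactly the open question.

```python
from typing import Iterable, List, Tuple


def build_variant_search(variant_set_id: str, reference_name: str,
                         start: int, end: int,
                         call_set_ids: Iterable[str]) -> dict:
    """Build a JSON-ready body for a classic-mode variant search over a small region."""
    return {
        "variantSetIds": [variant_set_id],  # assumed field name
        "referenceName": reference_name,
        "start": start,
        "end": end,                         # assumed field name
        "callSetIds": list(call_set_ids),   # assumed field name
    }


def extract_genotypes(response: dict) -> List[Tuple[str, int, str, List[int]]]:
    """Flatten a parsed search reply into (referenceName, start, callSetId, genotype) rows."""
    rows = []
    for variant in response.get("variants", []):
        for call in variant.get("calls", []):  # "calls" is an assumed field name
            rows.append((variant["referenceName"], variant["start"],
                         call["callSetId"], call["genotype"]))
    return rows


request = build_variant_search("vs1", "1", 1000, 2000, ["cs1", "cs2"])

# A hand-written reply in the assumed shape, standing in for a real server.
response = {
    "variants": [
        {"referenceName": "1", "start": 1500,
         "calls": [{"callSetId": "cs1", "genotype": [0, 1]},
                   {"callSetId": "cs2", "genotype": [1, 1]}]},
    ]
}
genotypes = extract_genotypes(response)
```

Paging through a multi-page reply and sending the request over HTTP are left out to keep the sketch short.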

@jeromekelleher
Contributor

@lh3 raises excellent points. I have no issue with creating new types AlleleCall and AlleleVariant, except it does complicate things further up the stack. For example, SearchVariantsResponse contains a list, variants, which has the type array<Variant>. This would then need to be a union of array<Variant> and array<AlleleVariant>, so that we can return either type. (Avro's lack of inheritance is getting increasingly annoying.) We're then just pushing the overloading further up the stack. I'm not sure which option is better: having the overloading occur at the query level, or at the object level.

In general, we seem to have two choices:

  1. Overload some types so that we can implement the graph+allelic model as well as the linear+genotypic model using the same queries;
  2. Make new types and queries exclusively for the graph+allelic model. This brings us back to the namespacing approach.

What do you think @lh3, which is the lesser of two evils?

@haussler

Heng is right. The big distinction here, the one that will affect genetics
for decades, is "genotypic" versus "allelic". It would be a mistake to try
to get the world to convert to a graph model without explaining the deep
motivation for adding an "allelic" mode and eventually deprecating the
"genotypic" mode.


@haussler

Well, not really "deprecating". Genotypic is here to stay. Maybe making it
more of a derived summary from the allelic information.


@diekhans
Contributor

Hi David, Heng,

Having been involved in API design before, I find it hard to overstate
how important it is to get the terminology and naming right now.
If the terminology is confusing, it will confuse people for the
next 20 years.

We also need to have complete, comprehensive documentation for
all of the APIs, including rationale, in the source tree.
Documentation not in the source tree will be out-of-sync, lost,
or just plain wrong.

Mark

haussler notifications@github.com writes:

Heng is right. The big distinction here, the one that will affect genetics
for decades, is "genotypic" versus "allelic". It would be a mistake to try
to get the world to convert to a graph model without explaining the deep
motivation for adding an "allelic" mode and eventually deprecating the
"genotypic" mode.

On Tue, Feb 10, 2015 at 3:52 PM, Heng Li notifications@github.com wrote:

Several comments.

1. Are "classical" and "graph" the right names? This PR consists of two
largely independent components: graph representation and allelic model. By
independence, I mean we could in principle split this PR into the two with
little interference. The graph mode is actually "graph+allelic", and the
"classical" is "linear+genotypic". There could also be other combinations,
"linear+allelic" and "graph+genotypic". In variants.avdl of
this PR, we frequently say that some fields are mandatory in the graph
mode, but more exactly, most of them are required by the allelic model, not
by the graph representation. I think we need to rename the "graph mode" to
something that is conceptually correct. I couldn't think of a clean name,
though.

2. The Call object is overloaded. In the genotypic model, a call is a
genotype call. In the allelic model, a call is an allele call with a very
different set of new fields (the only overlapping field is apparently
callSetId). The overloading obscures the meaning of Call. I would
suggest completely separating allelic and genotypic calls by adding an
AlleleCall object and moving allele-associated fields to the new
object. AlleleCall is conceptually different from the old Call object.

3. Similarly the Variant object is overloaded to a lesser extent - some
fields are allelic only and some are genotypic only. We might consider a
similar treatment (see also point 4 below). On the other hand, Variant
is a concept common to both allelic and genotypic models. Adding
AlleleVariant seems inconsistent. I am not sure if there is a better
solution.

4. If we decide to use one Variant object in both the genotypic and
allelic models, we need to be careful about the updated time. In the
genotypic model, the allele sequences are members of Variant, which
would imply that updating the allele sequences would change the updated
time. In the allelic model, the IDs of alleles, rather than the alleles
themselves, are members of Variant. If we change an allele sequence
without changing its ID, do we change the updated time? In addition,
for the referenceName, start and a few other fields, the API doc only
says "If the API server supports the classic mode, this field must not be
null", but it does not specify what these fields mean in the "graph" (more
exactly allelic) mode. Do they have to be null? If we create an
AlleleVariant object, this won't be a problem.

5. Could someone write down code snippets to retrieve the genotype of
some samples in a small region? I wrote some when the allelic model was on
a separate branch. With both genotypic and allelic models mixed together, I
am even less sure about what is the right code. This is critically
important. If we creators of the APIs have difficulties in writing down the
queries, end users will have more trouble.
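
As a rough illustration of the kind of snippet point 5 asks for, here is a minimal sketch of the genotypic case only. It works on in-memory records rather than a real server, and every name in it (the simplified `Variant` and `Call` classes, their fields, and `genotypes_in_region`) is a hypothetical stand-in, not the API defined in this PR. It also assumes 0-based, half-open coordinates.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Variant:
    # Hypothetical, simplified stand-in for a classic-mode Variant.
    id: str
    reference_name: str
    start: int  # assumed 0-based, inclusive
    end: int    # assumed exclusive
    reference_bases: str
    alternate_bases: List[str]


@dataclass
class Call:
    # Hypothetical, simplified genotype call: indexes into [ref] + alts.
    call_set_id: str
    variant_id: str
    genotype: List[int]


def genotypes_in_region(variants, calls, reference_name, start, end, call_set_ids):
    """Return {call_set_id: {variant_id: [bases, ...]}} for the region."""
    # Keep only variants that overlap the half-open query interval.
    hits = {
        v.id: v
        for v in variants
        if v.reference_name == reference_name and v.start < end and v.end > start
    }
    wanted = set(call_set_ids)
    result: Dict[str, Dict[str, List[str]]] = {}
    for call in calls:
        if call.variant_id in hits and call.call_set_id in wanted:
            variant = hits[call.variant_id]
            bases = [variant.reference_bases] + variant.alternate_bases
            # Translate genotype indexes into the bases they refer to.
            result.setdefault(call.call_set_id, {})[variant.id] = [
                bases[i] for i in call.genotype
            ]
    return result
```

Against a real server the two list arguments would be replaced by paged search requests; the filtering and index-to-bases translation would stay the same.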



@lh3
Member

lh3 commented Feb 11, 2015

I also think we should split Call. I would actually prefer to add GenotypeCall and gradually deprecate Call, but I wouldn't insist on this. As to Variant, are we regarding a variant in the VCF/classical mode as exactly the same as a variant in the graph+allelic mode?

@benedictpaten
Contributor

Might I propose @lh3 and @jeromekelleher that we explore splitting these
objects as a subsequent pull request? For sake of getting things done, we
really need to get something accepted soon - we have a developer hired now
to build this into the reference server! I suspect that as we implement it
we will be proposing changes to improve it. As I say, I'm currently the
only +1 on this, and it would be great for others to feel that this is at
least a good-enough starting point on which we can iterate.

On Wed, Feb 11, 2015 at 9:39 AM, Benedict Paten benedict@soe.ucsc.edu
wrote:

Thanks Heng, Jerome.

We debated splitting out the graph fields from both call and variant. Adam
has a nice UML diagram that he's working on that makes it easier to see the
relationships - @adamnovak, I think you should post this. Jerome wanted
backwards compatibility - well we've got it, but the cost is some apparent
duplication.

I agree with Heng about creating code snippets. Adam - can you get Maciek
to do this as an exercise?

To David's point, I think it's very important to distinguish the concept
of genotypes (sites) vs. allelic and linear reference vs. graph reference.
These are separate ideas. What we have now allows us to express, as Heng
says, all four combinations of model: genotypes+linear-ref,
genotypes+graph-ref, allelic+graph-ref, and allelic+linear-ref (because
the linear-reference is a special case of the graph model - so you can use
the new segment object to describe an allele).

I think this will be cleaner in the future, because we can slowly
deprecate the linear reference stuff, then we'll end up with a single
variant object, which allows us to express genotypes on the graph and a
single allele object.

My strong opinion is not to split variant into variant and allele variant,
it does not fit the semantics.

I am ambivalent about splitting the call object into allele and genotype
calls.

On Wed, Feb 11, 2015 at 6:44 AM, Jerome Kelleher notifications@github.com
wrote:

@lh3 https://github.com/lh3 raises excellent points. I have no issue
with creating new types AlleleCall and AlleleVariant, except it does
complicate things further up the stack. For example,
SearchVariantsResponse contains a list variants, which has the type
array<Variant>. This would then need to be union(array<Variant>,
array<AlleleVariant>), so that we can return either type. (Avro's lack
of inheritance is getting increasingly annoying.) We're then just pushing
the overloading further up the stack. I'm not sure which option is better:
have the overloading occurring at the query level, or at the object level.

In general, we seem to have two choices:

  1. Overload some types so that we can implement the graph+allelic
    model as well as the linear+genotype model using the same queries;
  2. Make new types and queries exclusively for the graph+allelic
    model. This brings us back to the namespacing approach.

What do you think @lh3 https://github.com/lh3, which is the lesser of
two evils?



@benedictpaten
Contributor

On Wed, Feb 11, 2015 at 9:57 AM, Heng Li notifications@github.com wrote:

I also think we should split Call. I would actually prefer to add
GenotypeCall and gradually deprecate Call, but I wouldn't insist on this.
As to Variant, are we regarding a variant in the VCF/classical mode
as exactly the same as a variant in the graph+allelic mode?

That was the intention. There is some weirdness. In the vcf/classic model a
site is a set of edits to a reference substring. In the graph model a
variant is a collection of alleles - they don't have to be homologous,
because we don't know how to define homologous in the graph world. I think
@richarddurbin has made the point repeatedly, but homology is difficult to
define and not transitive when talking about the relationship between
substrings (rather than individual bases). I'd welcome a proposal to
clarify this.




If the API server supports the "graph" mode, this field must not be null.
*/
union { null, array<string> } alleleIds;
Contributor

I thought that in graph mode, the alternateBases and referenceBases needed to be null? If so, then what does agreement mean?

If this field is set along with referenceName, start, end, referenceBases, and/or alternateBases, those fields must agree with the Alleles given here.

Contributor Author

Nope. They have to be non-null if the server supports "classic" mode, but they don't have to be null if the server supports "graph" mode, because a server can support "classic" and "graph" mode at the same time.

You can have both groups of fields set; they just need to agree with each other (which in practice means you can't use some of the features of the graph reference).
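
One possible reading of "agree", as a minimal sketch rather than the schema's definition: the bases spelled out by the listed Alleles match the classic fields. The `Variant` class below is a hypothetical simplification, each Allele is reduced to the bases it spells out (resolving a real Allele would need the sequence graph), and the convention that the first allele ID is the reference allele is an assumption made for this example.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Variant:
    # Hypothetical, simplified stand-in for the Variant record.
    reference_bases: Optional[str] = None
    alternate_bases: Optional[List[str]] = None
    allele_ids: Optional[List[str]] = None


def fields_agree(variant: Variant, allele_bases: Dict[str, str]) -> bool:
    """Return True if the classic fields and alleleIds do not contradict.

    allele_bases maps an allele ID to the bases that allele spells out.
    Assumption: allele_ids[0] is the reference allele and the rest are the
    alternates, in the same order as alternate_bases.
    """
    if variant.allele_ids is None or variant.reference_bases is None:
        return True  # only one group of fields is set, nothing to compare
    spelled = [allele_bases[allele_id] for allele_id in variant.allele_ids]
    classic = [variant.reference_bases] + list(variant.alternate_bases or [])
    return spelled == classic
```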

@adamnovak adamnovak force-pushed the graphBackport branch 2 times, most recently from 1ba35ac to 3bd1a68 Compare February 11, 2015 19:33
@adamnovak
Contributor Author

OK, by popular demand I have split Call into Call and AlleleCall, and added a searchAlleleCalls() method.

This is simpler in that Calls will always have genotypes, and AlleleCalls will always have Allele copy number assertions (possibly associated with Variants). This lets the objects be simpler and removes a bunch of constraints about what needs to be set in what mode.

It's more complicated in that if you want all the information for a CallSet you need to use both searchCalls() to download any genotypes and searchAlleleCalls() to download Allele copy number assertions. I've added the constraint that the two views need to be consistent. I've also required that you serve Calls for every Variant you have AlleleCalls for if you support "classic" mode, and vice versa if you support "graph" mode, so you should be able to just query the kind of call that you want and get a more or less complete view of the CallSet.
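
To make the consistency constraint concrete, here is a minimal sketch of one way a client could check it. The `Call` and `AlleleCall` classes and their fields are hypothetical simplifications of the records in this PR, and the sketch assumes each Allele belongs to at most one Variant, which the schema does not require.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Call:
    # Hypothetical simplification: genotype indexes into the Variant's alleles.
    call_set_id: str
    variant_id: str
    genotype: List[int]


@dataclass
class AlleleCall:
    # Hypothetical simplification: a copy number assertion for one Allele.
    call_set_id: str
    allele_id: str
    total_copies: int


def views_consistent(
    calls: List[Call],
    allele_calls: List[AlleleCall],
    variant_alleles: Dict[str, List[str]],
) -> bool:
    """Check that genotype-derived allele counts match the AlleleCalls."""
    expected = Counter()  # copies implied by the genotypes
    covered = set()       # (call set, allele) pairs the genotypes speak about
    for call in calls:
        alleles = variant_alleles[call.variant_id]
        for allele_id in alleles:
            covered.add((call.call_set_id, allele_id))
        for index in call.genotype:
            expected[(call.call_set_id, alleles[index])] += 1
    for allele_call in allele_calls:
        key = (allele_call.call_set_id, allele_call.allele_id)
        # Alleles outside any genotyped Variant cannot contradict a genotype.
        if key in covered and expected[key] != allele_call.total_copies:
            return False
    return True
```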

@haussler

Going back to last night's comments by Heng, I've tried to craft a simple
beginner's introduction to what we are doing for the benefit of those
outside our group. It is pasted below and attached in Word. Heng, Richard,
Adam, others, please replace or edit as needed, pointing to diagrams and
examples (I agree with Adam these are essential). I think we need something
that gives a high-level intro and is pointed to from the schemas. -D

The allelic model of genetic variation

The field of genetics was born before we could read genomes, yet through
exceedingly clever experiments and careful observations, the pioneers of
genetics were able to form a very accurate model of the basic structure of
genetic information. We refer to this model as the genotypic model. In the
genotypic model, (1) each individual in a diploid population has a set of
chromosomes that may be numbered 1, …, n, (2) each individual carries
two copies of each of these n chromosomes, the sex chromosomes and
mitochondria being considered a special case, and (3) each chromosome copy
consists of a defined linear (or circular) sequence of sites, where each
site carries one of a finite set of possible alleles. In its most
fine-grained form, each site consists of a single basepair of DNA. While
this model has enormous conceptual and practical advantages, it is an
abstraction that is not completely accurate, and this has been a problem
for the field.

The more accurate model is simple: each individual has a genome, where a
genome is simply a set of molecules of double-stranded DNA. In a real
genome, pieces of DNA that represent alleles of particular sites in the
genotypic model occasionally appear in different numbers, different
locations, and different orientations, and this matters. Also, completely
novel DNA is observed. As we learn to read genomes with increasing
fidelity, the inadequacies of the genotypic model become more apparent. We
are headed toward an era in which complete and highly accurate genomes will
be commonplace in genetics. We might contemplate simply replacing the
genotypic model with the direct observation of a complete genome. However,
it will never be practical to base all genetic analysis directly on the
reading of an entire genome. For one thing, a genome is too big and
complicated to be treated as a unit. Moreover, despite its shortcomings,
the genotypic model remains a very powerful and useful abstraction.

The allelic model of genetic variation is an abstraction that sits at a
level below the genotypic model, but above the model that merely represents
a genome as a set of molecules of double-stranded DNA. Instead of a single
linear coordinate system that is used to demarcate multi-allelic sites
along the chromosomes, in the allelic model the genome variation present in
the population is described by a network- or graph-like structure,
representing segments of DNA that appear in typical genomes in the
population, along with their common alternatives and common variations in
number, order and orientation [diagram]. We refer to this graph as a
(genome) reference structure. A chromosome is a path through this reference
structure, and a genome is a set of chromosomes.

Instead of sites, the elementary units of a reference structure are
positions. Each position consists of an orientation and a single base of
DNA. The orientation tells us whether that base is on the top or bottom
strand, and thus indicates on which side of it we expect to find the next
base when moving in the standard 5’ to 3’ direction. When a component of
the reference structure is linear, with no branching or “bubble”
alternatives, one can think of it in conventional terms as a series of
1-base sites defining the coordinate system plus the typical base content
of a reference chromosome, e.g. as in the “primary path” of the current
human genome reference of the International Nucleotide Sequence Database
Consortium, version GRCh38. Non-linear aspects of the reference structure,
e.g. “bubbles” that come off of the primary path, can be thought of as
alternate haplotypes, like the alternate haplotypes that are defined as an
adjunct to the primary paths of GRCh38. One difference, however, is that a
general reference structure allows cycles, so the path that defines a
particular individual’s chromosome may repeat a part of the reference
structure an arbitrary number of times.

In the allelic model, an allele is defined as a sequence of positions that
form a (sub)path in the graph structure, i.e. such that each position is
allowed by the edges of the graph structure to follow the previous
position. One can think of an allele roughly as the sequence of DNA spelled
out by a particular sequence of positions, but it is more than that. The
same sequence of DNA may be spelled out other places in the graph as well.
When we say that a genome has a particular allele, we are asserting that
the path for one of the DNA molecules in that genome passes through this
sequence of positions, contiguously, at some point along its extent, either
in one direction (strand) or the other. This means that the individual’s
genome contains whatever biological feature, such as a particular form of
an exon, gene, etc., that is represented by this sequence of positions in
the reference structure. If we say the allele appears twice in the
individual’s genome then this asserts that the set of paths representing
that individual’s chromosomes, considered in aggregate, pass through this
particular sequence of positions twice. It is not asserted that these
allele traversals occur in two separate DNA molecules, or in any particular
orientations. Two alleles, be they the same or different, are not assumed
to be either phased or unphased by default. Phasing information must be
explicitly specified in addition to allelic information.

The allelic model does not have the classical genetics concepts of “site”
or “genotype at a site”. These concepts must be added on top of the allelic
model if needed. In the allelic model, each position, and hence each
allele, has just one DNA sequence associated with it. One might define an
“allelotype” as a set of alleles that are asserted to be present in an
individual’s genome, each with a specified copy number, zero being an
allowed copy number. This asserts that the double-stranded DNA molecules of
the individual’s genome follow a set of paths that traverse the sequences
of positions defined by these alleles the required number of times. These
traversals may occur in any combination of order, orientation and phasing.
Other DNA may also be included in the individual’s genome that is not
listed in the allelotype. This contrasts with a “genotype” in which it is
asserted that the individual’s double-stranded DNA molecules include a
specific set of (classical) allelic variants in the default order and
orientation, with only the phasing (and possible additional DNA) left
unspecified. In this sense we claim that the classical genotypic model can
be built on top of the allelic model as a special case, by using additional
assumptions or assertions.

The allelic model is a richer foundation for genetics. Using something like
it becomes a requirement when one attempts to analyze and reason about
whole genomes. We hope that the new data schemas and application
programming interfaces we are defining for the allelic model will support
the further development of this model for both research and clinical use.


@adamnovak
Contributor Author

Speaking of the importance of images, I've done up a UML class diagram of all the data model records, so we can see my proposed changes (on the right) as well as how the whole API fits together.

http://hgwdev.sdsc.edu/~anovak/images/Global%20UML.png

@pgrosu
Contributor

pgrosu commented Feb 12, 2015

@fnothaft , we don't need to use Avro IDL. We can either change it - which many companies do internally - or use something like WebIDL which provides inheritance capabilities:

http://www.w3.org/TR/WebIDL/

If we keep working around the limitations of Avro, the schema will become cumbersome to support and expand upon.

Paul

@jeromekelleher
Contributor

I'm in favour of merging this now. It seems clear that the consensus is that we want something like this, but it's also clear that it will take some time to resolve the remaining issues. I would suggest that the most productive approach now would be to merge what we have and start filing issues and PRs against it to resolve the outstanding problems.

+1.

@haussler

Richard astutely pointed out that there were some obnoxious aspects to the
previous summary of the allelic model. Here is a revised and a bit shorter
version. Further comments, edits very welcome. D

The allelic model of genetic variation

In the textbook genetic model, genetic information in a diploid population
is approximated as follows: (1) each individual has a set of chromosomes
that may be numbered 1, …, n, (2) each individual carries two copies of
each of these n chromosomes, the sex chromosomes and mitochondria being
considered a special case, and (3) each chromosome copy consists of a
defined linear (or circular) sequence of sites, where each site carries one
of a finite set of possible alleles. We refer to this model as the
genotypic model. In its most fine-grained form, each site consists of a
single basepair of DNA. While this model has enormous conceptual and
practical advantages, even prior to genome sequencing, geneticists realized
that it is not always sufficient, and proposed various generalizations. The
allelic model is such a generalization.

Genetic information is carried in a genome consisting of molecules of
double-stranded DNA. Inadequacies in the genotypic model arise because in a
real genome, pieces of DNA that represent alleles of particular sites
occasionally appear in different numbers, different locations, and
different orientations, and this can matter. While sufficient for many
purposes, the genotypic model doesn't handle these cases. As it becomes
more commonplace to read genomes with increasing fidelity, the inadequacies
of the genotypic model become more apparent.

The allelic model of genetic variation is an abstraction that sits at a
level below the genotypic model, but above the trivial model that merely
represents a genome as a set of molecules of double-stranded DNA. Instead
of the linear coordinate system that is used in the genotypic model to
demarcate multi-allelic sites along the chromosomes, in the allelic model
the genome variation present in the population is described by a network-
or graph-like structure, representing segments of DNA that appear in
typical genomes in the population, along with their common alternatives and
common variations in number, order and orientation [diagram]. We refer to
this graph as a (genome) reference structure. A chromosome is a path
through this reference structure, and a genome is a set of chromosomes.

Instead of sites, the elementary units of a reference structure are
positions. Each position consists of an orientation and a single base of
DNA. The orientation tells us whether that base is on the top or bottom
strand, and thus indicates on which side of it we expect to find the next
base when moving in the standard 5’ to 3’ direction. When a component of
the reference structure is linear, with no branching or “bubble”
alternatives, one can think of it in conventional terms as a series of
1-base sites defining the coordinate system plus the typical base content
of a reference chromosome, e.g. as in the “primary path” of the current
human genome reference of the International Nucleotide Sequence Database
Consortium, version GRCh38. Non-linear aspects of the reference structure,
e.g. “bubbles” that come off of the primary path, can be thought of as
alternate haplotypes, like the alternate haplotypes that are defined as an
adjunct to the primary paths of GRCh38. One difference, however, is that a
general reference structure allows cycles, so the path that defines a
particular individual’s chromosome may repeat a part of the reference
structure an arbitrary number of times.
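
A small sketch may help make positions and paths concrete. It is an illustration under assumptions, not the representation used in the schemas: a position is modelled as a pair of a base node ID and a strand flag, and an edge may be walked either as written or backwards on the opposite strand. The names `Position`, `is_path` and `spell` are invented for this example.

```python
from typing import Dict, List, Set, Tuple

# Hypothetical encoding: (base node ID, True if read on the top strand).
Position = Tuple[str, bool]

COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def flip(position: Position) -> Position:
    """Same base, opposite strand."""
    return (position[0], not position[1])


def allowed(edges: Set[Tuple[Position, Position]], a: Position, b: Position) -> bool:
    # An edge can be walked as written, or backwards on the opposite strand.
    return (a, b) in edges or (flip(b), flip(a)) in edges


def is_path(edges: Set[Tuple[Position, Position]], positions: List[Position]) -> bool:
    """True if every consecutive pair of positions is joined by an edge."""
    return all(allowed(edges, a, b) for a, b in zip(positions, positions[1:]))


def spell(bases: Dict[str, str], positions: List[Position]) -> str:
    """DNA read 5' to 3' along the path, complementing bottom-strand bases."""
    return "".join(
        bases[node] if top else COMPLEMENT[bases[node]] for node, top in positions
    )
```

With an edge from `n3` back to `n2`, the path `n1, n2, n3, n2, n3` is valid and repeats part of the structure, which is the cycle case described above.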

In the allelic model, an allele is defined as a sequence of positions that
form a (sub)path in the graph structure, i.e. such that each position is
allowed by the edges of the graph structure to follow the previous
position. One can think of an allele roughly as the sequence of DNA spelled
out by a particular sequence of positions, but it is more than that. The
same sequence of DNA may be spelled out other places in the graph as well.
When we say that a genome has a particular allele, we are asserting that
the path for one of the DNA molecules of that genome passes through this
sequence of positions, contiguously, at some point along its extent, either
in one direction (strand) or the other. This means that the individual’s
genome contains whatever biological feature, such as a particular form of
an exon, gene, etc., that is represented by this sequence of positions in
the reference structure. If we say the allele appears twice in the
individual’s genome then this asserts that the set of paths representing
that individual’s chromosomes, considered in aggregate, pass through this
particular sequence of positions twice. It is not asserted that these
allele traversals occur in two separate DNA molecules, or in any particular
orientations. Two alleles, be they the same or different, are not assumed
to be either phased or unphased by default. Phasing information must be
explicitly specified in addition to allelic information.
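
The counting rule in this paragraph can be sketched as follows. This is a hedged illustration, not part of the proposed API: positions are encoded as hypothetical (node ID, top-strand flag) pairs, a chromosome is a list of positions, and `copy_number` is an invented helper that counts contiguous traversals of an allele in either orientation across all chromosomes of a genome.

```python
from typing import List, Tuple

# Hypothetical encoding: (base node ID, True if read on the top strand).
Position = Tuple[str, bool]
Path = List[Position]


def reverse(path: Path) -> Path:
    """The same path walked in the other direction, on the other strand."""
    return [(node, not top) for node, top in reversed(path)]


def count_traversals(chromosome: Path, allele: Path) -> int:
    """Count contiguous occurrences of the allele, in either orientation."""
    k = len(allele)
    targets = [allele, reverse(allele)]
    return sum(
        1
        for i in range(len(chromosome) - k + 1)
        if chromosome[i:i + k] in targets
    )


def copy_number(genome: List[Path], allele: Path) -> int:
    # Aggregate over all chromosomes; nothing is implied about which
    # molecule or which orientation each traversal is in.
    return sum(count_traversals(chromosome, allele) for chromosome in genome)
```

Note that the result says nothing about phasing, matching the text: two traversals may sit on one molecule or on two.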

The allelic model is a generalization because it does not include the
classical genetics concepts of “site” or “genotype at a site”. These
concepts must be added on top of the allelic model in order to derive the
genotypic model. In the allelic model, each position, and hence each
allele, has just one DNA sequence associated with it. One might define an
“allelotype” as a set of alleles that are asserted to be present in an
individual’s genome, each with a specified copy number. Zero is an allowed
copy number. An allelotype asserts that the double-stranded DNA molecules
of the individual’s genome follow a set of paths that traverse the
sequences of positions defined by its alleles the required number of times.
These traversals may occur in any combination of order, orientation and
phasing. Other DNA may also be included in the individual’s genome that is
not listed in the allelotype. This contrasts with a “genotype” in which it is
asserted that the individual’s double-stranded DNA molecules include a
specific set of (classical) allelic variants in the default order and
orientation, with only the phasing (and possible additional DNA) left
unspecified. Thus, the classical genotypic model can be built on top of the
allelic model as a special case, by using additional assumptions and
assertions.
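
As a rough sketch of the relationship just described, a genotype at a site can be turned into an allelotype by counting, while an allelotype only constrains the alleles it lists. The function names and the dictionary encoding (allele ID to copy number) are assumptions made for this illustration, not definitions from the schemas.

```python
from typing import Dict, List


def allelotype_from_genotype(site_alleles: List[str], genotype: List[int]) -> Dict[str, int]:
    """Copy number per allele at one site, including explicit zeros.

    site_alleles lists the allele IDs at the site; genotype holds indexes
    into that list, one per chromosome copy.
    """
    counts = {allele_id: 0 for allele_id in site_alleles}
    for index in genotype:
        counts[site_alleles[index]] += 1
    return counts


def satisfies(allelotype: Dict[str, int], observed: Dict[str, int]) -> bool:
    """True if observed copy numbers match every allele the allelotype lists.

    Alleles present in the genome but absent from the allelotype are
    permitted, since other DNA may also be included.
    """
    return all(observed.get(allele_id, 0) == n for allele_id, n in allelotype.items())
```

The reverse direction is not possible without extra assumptions: an allelotype carries no notion of site, default order or orientation.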

Being more general, the allelic model is a broader foundation for genetics.
Using something like it often becomes a requirement when one attempts to
analyze and reason about whole genomes. We hope that the new data schemas
and application programming interfaces we are defining for the allelic
model will support the further development of this model for both research
and clinical use.

On Wed, Feb 11, 2015 at 1:58 PM, David Haussler haussler@soe.ucsc.edu
wrote:

Going back to last night's comments by Heng, I've tried to craft a simple
beginner's introduction to what we are doing for the benefit of those
outside our group. It is pasted below and attached in Word. Heng, Richard,
Adam others, please replace or edit as needed, pointing to diagrams and
examples (I agree with Adam these are essential). I think we need something
that gives a high-level intro and is pointed to from the schemas. -D

The allelic model of genetic variation

The field of genetics was born before we could read genomes, yet through
exceedingly clever experiments and careful observations, the pioneers of
genetics were able to form a very accurate model of the basic structure of
genetic information. We refer to this model as the genotypic model. In
the genotypic model, (1) each individual in a diploid population has a set
of chromosomes that may be numbered 1, …, n, (2) each individual
carries two copies of each of these n chromosomes, the sex chromosomes
and mitochondria being considered a special case, and (3) each chromosome
copy consists of a defined linear (or circular) sequence of sites, where
each site carries one of a finite set of possible alleles. In its most
fine-grained form, each site consists of a single basepair of DNA. While
this model has enormous conceptual and practical advantages, it is an
abstraction that is not completely accurate, and this has been a problem
for the field.

The more accurate model is simple: each individual has a genome, where a
genome is simply a set of molecules of double-stranded DNA. In a real
genome, pieces of DNA that represent alleles of particular sites in the
genotypic model occasionally appear in different numbers, different
locations, and different orientations, and this matters. Also, completely
novel DNA is observed. As we learn to read genomes with increasing
fidelity, the inadequacies of the genotypic model become more apparent. We
are headed toward an era in which complete and highly accurate genomes will
be commonplace in genetics. We might contemplate simply replacing the
genotypic model with the direct observation of a complete genome. However,
it will never be practical to base all genetic analysis directly on the
reading of an entire genome. For one thing, a genome is too big and
complicated to be treated as a unit. Moreover, despite its shortcomings,
the genotypic model remains a very powerful and useful abstraction.

The allelic model of genetic variation is an abstraction that sits at a
level below the genotypic model, but above the model that merely represents
a genome as a set of molecules of double-stranded DNA. Instead of a single
linear coordinate system that is used to demarcate multi-allelic sites
along the chromosomes, in the allelic model the genome variation present in
the population is described by a network- or graph-like structure,
representing segments of DNA that appear in typical genomes in the
population, along with their common alternatives and common variations in
number, order and orientation [diagram]. We refer to this graph as a
(genome) reference structure. A chromosome is a path through this reference
structure, and a genome is a set of chromosomes.

Instead of sites, the elementary units of a reference structure are
positions. Each position consists of an orientation and a single base of
DNA. The orientation tells us whether that base is on the top or bottom
strand, and thus indicates on which side of it we expect to find the next
base when moving in the standard 5’ to 3’ direction. When a component of
the reference structure is linear, with no branching or “bubble”
alternatives, one can think of it in conventional terms as a series of
1-base sites defining the coordinate system plus the typical base content
of a reference chromosome, e.g. as in the “primary path” of the current
human genome reference of the International Nucleotide Sequence Database
Consortium, version GRCh38. Non-linear aspects of the reference structure,
e.g. “bubbles” that come off of the primary path, can be thought of as
alternate haplotypes, like the alternate haplotypes that are defined as an
adjunct to the primary paths of GRCh38. One difference, however, is that a
general reference structure allows cycles, so the path that defines a
particular individual’s chromosome may repeat a part of the reference
structure an arbitrary number of times.

In the allelic model, an allele is defined as a sequence of positions that
form a (sub)path in the graph structure, i.e. such that each position is
allowed by the edges of the graph structure to follow the previous
position. One can think of an allele roughly as the sequence of DNA spelled
out by a particular sequence of positions, but it is more than that. The
same sequence of DNA may be spelled out other places in the graph as well.
When we say that a genome has a particular allele, we are asserting that
the path for one of the DNA molecules in that genome passes through this
sequence of positions, contiguously, at some point along its extent, either
in one direction (strand) or the other. This means that the individual’s
genome contains whatever biological feature, such as a particular form of
an exon, gene, etc., that is represented by this sequence of positions in
the reference structure. If we say the allele appears twice in the
individual’s genome then this asserts that the set of paths representing
that individual’s chromosomes, considered in aggregate, pass through this
particular sequence of positions twice. It is not asserted that these
allele traversals occur in two separate DNA molecules, or in any particular
orientations. Two alleles, be they the same or different, are not assumed
to be either phased or unphased by default. Phasing information must be
explicitly specified in addition to allelic information.

The allelic model does not have the classical genetics concepts of “site”
or “genotype at a site”. These concepts must be added on top of the allelic
model if needed. In the allelic model, each position, and hence each
allele, has just one DNA sequence associated with it. One might define
an “allelotype” as a set of alleles that are asserted to be present in an
individual’s genome, each with a specified copy number, zero being an
allowed copy number. This asserts that the double-stranded DNA molecules of
the individual’s genome follow a set of paths that traverse the sequences
of positions defined by these alleles the required number of times. These
traversals may occur in any combination of order, orientation and phasing.
Other DNA may also be included in the individual’s genome that is not
listed in the allelotype. This contrasts with a “genotype” in which it
asserted that the individual’s double-stranded DNA molecules include a
specific set of (classical) allelic variants in the default order and
orientation, with only the phasing (and possible additional DNA) left
unspecified. In this sense we claim that the classical genotypic model can
be built on top of the allelic model as a special case, by using additional
assumptions or assertions.

The allelic model is a richer foundation for genetics. Using something
like it becomes a requirement when one attempts to analyze and reason about
whole genomes. We hope that the new data schemas and application
programming interfaces we are defining for the allelic model will support
the further development of this model for both research and clinical use.

On Wed, Feb 11, 2015 at 11:34 AM, adamnovak notifications@github.com
wrote:

OK, by popular demand I have split Call into Call and AlleleCall, and
added a searchAlleleCalls() method.

This is simpler in that Calls will always have genotypes, and AlleleCalls
will always have Allele copy number assertions (possibly associated with
Variants). This lets the objects be simpler and removes a bunch of
constraints about what needs to be set in what mode.

It's more complicated in that if you want all the information for a
CallSet you need to use both searchCalls() to download any genotypes and
searchAlleleCalls() to download Allele copy number assertions. I've
added the constraint that the two views need to be consistent. I've also
required that you serve Calls for every Variant you have AlleleCalls for
if you support "classic" mode, and vice versa if you support "graph" mode,
so you should be able to just query the kind of call that you want and get
a more or less complete view of the CallSet.



@hershman
Contributor

+1, @adamnovak your presentation from today was fantastic

@jeromekelleher
Contributor

Can you squash the commits down please @adamnovak? Looks like we're ready to merge.

There is now a `Path` and `Segment` in a graph namespace in common.avdl, and a
`GraphAlignment` in the graph namespace in reads.avdl. We also have a
`/graph/sequence/{id}` endpoint to get sequence bases in a generic way, and a
couple more options to search by.

Adding support for monoallelic calls.

This adds an `Allele` type, and allows for `Call`s to be associated with
`Allele`s, or with `Allele`s and `Variant`s.

`Allele`s can be used to represent the ref and alt alleles of `Variant`s.

We also provide for associating new sequences with `VariantSet`s.

Removing graph namespace and URL prefix

Revising comments and field names

Unifying checksums

Revising variants comments

Code reviewed by Prof. Haussler

Changing comments to be more descriptive and mathematically tight.

Also catching a few typos.

Fixing oversights in new API calls

Adding in primer on sequence graphs

Revising David's sequence graph intro

Explaining sequence search better

Sequences are searched over, and results are returned as segments describing
sequences.

Also removing references to an out-of-date object name.

Removing update timestamp from Allele

Clarifying sequences, segments, and references

Fixing up whitespace

Introducing "graph" and "classic" modes.

A server can support "graph" mode, "classic" mode, or both.

If it supports "classic" mode, it will make sure to fill in all of
the fields that non-graph clients require, and not use incompatible
graph extensions like `GraphAlignment`.

If it supports "graph" mode, it will make sure to fill in all of the new
graph fields, like `Reference.segment` and `Variant.alleles`. Additionally,
if it also does not support "classic" mode, it will not use older, less
general things like `LinearAlignment` and single string phase sets.
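
One way to read the two modes, sketched in Python. The function and its boolean parameters are hypothetical and only restate the rules in this commit message for alignment types; the field-population requirements are not modeled.

```python
def allowed_alignment_types(supports_classic, supports_graph):
    """Return the alignment record types a server may emit for its supported modes.

    This is an illustrative reading of the "graph" and "classic" mode rules,
    not code from the API.
    """
    if not (supports_classic or supports_graph):
        raise ValueError("a server must support at least one mode")
    if supports_classic:
        # Classic clients cannot read GraphAlignment, so it is never emitted,
        # even when the server also supports "graph" mode.
        return {"LinearAlignment"}
    # Graph-only servers drop the older, less general LinearAlignment.
    return {"GraphAlignment"}


print(allowed_alignment_types(True, True))
print(allowed_alignment_types(False, True))
```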

Making David's documentation changes

Also adding more pictures and hammering on the parent/child distinction.

Making startJoin and endJoin default to null

Clarifying the order of segments in a path

Fixing example typos

Splitting out `AlleleCall`.

There is now an `AlleleCall` to take care of calling copy numbers of `Alleles`,
while the normal `Call` goes back to just handling genotypes. I also stripped
some of the backwards-compatibility warts off of `AlleleCall`, since it has no
legacy fields to keep.

Both kinds of calls are constrained to be consistent with each other; you
can't call a genotype of [1, 1], and have a copy number of 0 for that allele
in that variant.
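
A rough sketch of such a consistency check follows. It assumes a strict reading in which an allele's copy number must equal the number of times its index appears in the genotype; the schema comments define the actual rule. The function name and argument shapes are hypothetical.

```python
from collections import Counter


def calls_consistent(genotype, allele_ids, copy_numbers):
    """Check a genotype against allele copy numbers for one variant.

    genotype: list of allele indexes, e.g. [1, 1] (0 = ref, 1 = first alt).
    allele_ids: the variant's allele IDs in the same index order.
    copy_numbers: dict of allele ID -> called copy number; alleles with no
        allele call are skipped.
    Assumption: copy number must equal the index count in the genotype.
    """
    counts = Counter(genotype)
    for index, allele_id in enumerate(allele_ids):
        if allele_id in copy_numbers and copy_numbers[allele_id] != counts[index]:
            return False
    return True


# The inconsistent case from the text: genotype [1, 1] but copy number 0.
print(calls_consistent([1, 1], ["ref", "alt"], {"alt": 0}))
# A consistent pair of calls.
print(calls_consistent([1, 1], ["ref", "alt"], {"ref": 0, "alt": 2}))
```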

If you support "graph" mode, you have to serve `AlleleCall`s for every
`Variant` you have `Call`s on, and vice versa if you support "classic" mode.

More descriptive name for new method

Fixing case where it would be impossible to get allele sequence.

A `Position` can be specified by `Reference` name, but there is no
efficient way to search for a `Reference` by name, and the only way to get
sequence bases (in the old and proposed APIs) is by ID. All `Position`s
must therefore include IDs in graph mode, so we can actually get the
sequence.

Fixing doc comment typos

Clarifying computation of `Reference.md5checksum`.

I now give an explicit procedure for how you are supposed to compute the
`md5checksum` of a `Reference`. It no longer makes reference to the idea of a
checksum for a `Position`, since that doesn't actually exist.

It should be much clearer now what is to be hashed, which is important because
everyone needs to do the same thing to get the same hashes.
@adamnovak
Contributor Author

OK I have squashed everything down. I'm going to tag this as ready to merge.

@skeenan
Member

skeenan commented Feb 21, 2015

I'm merging this.

that of the basepair itself, and the "+" confirms that we mean the default
forward orientation. The right side is indicated using the "-" orientation. For
example, the right side of the following G/C base pair is represented by the
position (3,-).
Member

A slightly confusing coincidence in this example is that, in this sequence of length 7, the index 3 could be interpreted as being measured from either end.

The asymmetry between the start index having the nick come just before it while the end index has the nick after it (when both are read from left-to-right, as the indices are apparently intended to be read) made it hard for me to figure out all that was meant to be implied by the (3,-) notation.

Let me know if that doesn't make sense.

Member

very good, thorough example overall though, thank you for writing this up!

@jeromekelleher
Copy link
Contributor

This PR has been merged @ryan-williams, so the discussion thread has ended. Please open a PR if you'd like to make some updates, or open some issues corresponding to the individual comments.
