Side graphs: `Sequence`s and `Join`s #250

adamnovak · 2015-03-06T22:34:09Z

Here's a draft of what our graph model would look like if instead of the "insert graphs" we are using now (where sequences come with associated end joins), we used a "side graph". In a side graph, we split out Sequences (which are just strings with IDs) and Joins (which bind two sides of two bases in two sequences together). Both kinds of objects, in this world, can occur in ReferenceSets and VariantSets.

I've made a bunch of API changes to see how that would actually work: creating a Sequence object, to start with. @richarddurbin made Joins, and then I changed the methods for searching to give us nice clean searchSequences() and searchJoins() methods to get them.

@richarddurbin: is this what you were going for?

@lh3: have I captured side graphs correctly?

All: do we actually like side graphs better than insert graphs?

I don't expect this to be merged for a while, but I think we need a concrete proposal to help anchor the discussion in #243 and its ensuing massive e-mail thread.

macieksmuga · 2015-03-17T04:40:14Z

src/main/resources/avro/common.avdl

-basepairs. In the default forward orientation a sequence is specified by the DNA
-letters of its top strand, e.g. in this case GGTGGNG, where N indicates an
+example where the top strand is GGTGGAG.
+


Rest of example below still uses GGTGGNG. Did you want to change A back to an N?

Yeah, it should probably be N. I will fix that.

Sorry, when I was editing I half edited N to A, then stopped. I think it would be good to remove N from
the examples, although to say somewhere that N is a legitimate character in a sequence. There are a few
Ns, but they are not typical.

Richard

On 17 Mar 2015, at 04:40, Maciek Smuga-Otto notifications@github.com wrote:

In src/main/resources/avro/common.avdl #250 (comment):

-```

| <- top strand nick at position (2,+)
-5' ------- ------------------- 3'

G G T G G N G

C C A C C N C
-3' ---------------- ---------- 5'

- | <- bottom strand nick at position (3,-)

+0- +1- +2- +3- +4- +5- +6- <- coordinates
-```

-A sequence is a piece of double-stranded DNA composed of a series of DNA
-basepairs. In the default forward orientation a sequence is specified by the DNA
-letters of its top strand, e.g. in this case GGTGGNG, where N indicates an
+example where the top strand is GGTGGAG.
+
Example below still uses GGTGGNG. Do you want to change N->A throughout the rest of the example?

—
Reply to this email directly or view it on GitHub https://github.com/ga4gh/schemas/pull/250/files#r26549104.

The Wellcome Trust Sanger Institute is operated by Genome Research
Limited, a charity registered in England with number 1021457 and a
company registered in England with number 2742969, whose registered
office is 215 Euston Road, London, NW1 2BE.

jeromekelleher · 2015-03-18T14:15:30Z

src/main/resources/avro/common.avdl

-  a terminal side. The value of `startJoin` or `endJoin`, if set, is the side to
-  which the corresponding side of this `Sequence` is connected.
+  A `Segment` may have zero length (for example, when it is being used to
+  specify a `Path` consisting onyl of a `Join`.


OK, I am fixing this.

jeromekelleher · 2015-03-18T15:28:57Z

Thanks for the updates @adamnovak. The model itself seems sensible enough to me, but I'm really struggling to understand how the API would work. How would a client do something useful with this? In particular, I don't understand what the client is supposed to do with Join objects. I'm obviously missing something important here! It would be really useful I think to try and write down some client code to do some very straightforward calculations. Perhaps we could flesh out some of the operations that we are going to do as part of the graph comparison, and use those as test-cases for the API?

fnothaft · 2015-03-18T16:54:45Z

+1

As an aside, I agree with @jeromekelleher:

How would a client do something useful with this?

But, I think we're in a bit of a chicken/egg situation (e.g., we can't reason about how to perform common query X, unless we've got a data model). I think this is a data model that we can work with, and we'll figure out what the query patterns are by using the data model.

benedictpaten · 2015-03-18T17:14:52Z

+1

Critical to "how do we do useful things" question is adding in variant
search queries, like "give me all the variants in the graph related to chr1
positions 1 to 100?". The rank on positions/joins makes it possible to
define this concretely. A related problem is agreeing on exactly what we
refer to as an allele in the graph. There has been a good deal of
conversation on that subject, but we need some pull requests (building on
this one), to make it concrete.

On Wed, Mar 18, 2015 at 9:54 AM, Frank Austin Nothaft <
notifications@github.com> wrote:

+1

As an aside, I agree with @jeromekelleher
https://github.com/jeromekelleher:

How would a client do something useful with this?

But, I think we're in a bit of a chicken/egg situation (e.g., we can't
reason about how to perform common query X, unless we've got a data model).
I think this is a data model that we can work with, and we'll figure out
what the query patterns are by using the data model.

—
Reply to this email directly or view it on GitHub
#250 (comment).

jeromekelleher · 2015-03-18T17:23:21Z

I'm basically +1 on the data model here, but I've still got issues with the API methods. Perhaps we could break these into separate PRs, so we're not holding up implementation of the data model?

macieksmuga · 2015-03-18T17:53:33Z

@jeromekelleher Interesting suggestion on splitting the PR. I know it would help me a lot to have the basic graph model stabilized, even as the discussion on API methods and Alleles carries on. @adamnovak what do you think?

fnothaft · 2015-03-18T17:57:09Z

I think @jeromekelleher's suggestion is a great suggestion.

macieksmuga · 2015-03-18T18:13:49Z

src/main/resources/avro/references.avdl

 */
+
 record ReferenceSet {
  /** The reference set ID. Unique in the repository. */


If ReferenceSets are now meant to be composeable, should this be explicitly added somewhere in the definition of the object here?

It is in the definition (see includedReferenceSets), but it should be in the doc comment.

@adamnovak oops... was accidentally searching on master, not your PR version of the file. Sorry!

adamnovak · 2015-03-18T18:15:17Z

I do see how it would be good to hash out things like searchJoins versus searchVariantSetJoins and so on independently from the data model change, but if we accept the data model change without either of those methods, there would be no way to retrieve connectivity information until that discussion was finished, which is definitely a regression.

Would people be OK with that?

fnothaft · 2015-03-18T18:16:02Z

Would people be OK with that?

I am OK with that.

adamnovak · 2015-03-19T00:12:23Z

OK, in response to @diekhans's comments about needing a good way to specify a Position without a strand, I now have a Position with no strand field, and a Side which is a position and a strand (which is required to be + or -, and cannot be null). If you don't know if you will have a strand or not, you would use a uinon { Position, Side }.

@diekhans, is that good for your use case? I didn't find anything in the metadata API to switch over to the new type.

jeromekelleher · 2015-03-19T09:51:13Z

Re. splitting the PR, I think it's totally fine if we're temporarily missing some methods from our git repo.

adamnovak · 2015-03-20T17:06:55Z

OK, here's a version with no changes to the methods files.

jeromekelleher · 2015-03-23T16:52:35Z

Thanks @adamnovak, +1.

We should squash the commits - I guess one Adam commit, one Richard commit, one Adam commit is the most appropriate?

adamnovak · 2015-03-23T17:05:13Z

OK I've squashed everything down (even though I have heard it is rude to rewrite other people's commits). What do you think?

jeromekelleher · 2015-03-23T17:27:19Z

Looks good Adam - we could probably tidy up the commit comments a bit though. Stuff like "Rolling methods back" doesn't really tell future-us very much about what this commit actually does.

benedictpaten · 2015-03-25T18:05:46Z

I'll merge this in the next 24 hours (unless Jerome feels strongly about tidying up the commit comments).

There has been much confusion caused by the lack of a `Sequence` type, and instead attempting to represent everything with `Segment`s. This is an attempt to add in a `Sequence` and see what things look like. `VariantSet`s now contain extra `Sequence`s instead of estra `Segment`s. `Segment`s don't have joins any more, since they now only are path components, and thus don't need them. `Path`s have optional joins at their ends, if they need to include adjacencies before and after the bases they cover.

…ct, David H and Heng. The graph model now has Sequences and Joins, which join two arbitrary oriented positions (sometimes known as sides) in sequences. We also remove the NO_STRAND option in the Strand enum - this should be done by setting the variable to null. Revised graph representation with Sequence and Join without start/endJoin. Also added isPrimary to Reference, and a couple more changes. Add includedReferenceSets to permit ReferenceSets to be based on others.

adamnovak · 2015-03-26T01:09:15Z

OK, I've cleaned up the concatenated commit messages slightly.

Describing sticky end handling Clarifying Segments and join semantics Making the new `ReferenceSet` composing system more fleshed out and described. Clarifying the semantics in `SearchSequencesRequest` and `SearchJoinsRequest`. Removing Position, calling it Side instead Adding a new, un-oriented Position I'm going to make the `Sequence`/`Join` method changes later.

jeromekelleher · 2015-03-26T09:28:06Z

Looks good, thanks @adamnovak. Merging.

Side graphs: `Sequence`s and `Join`s

macieksmuga reviewed Mar 17, 2015
View reviewed changes

adamnovak force-pushed the sequences branch 3 times, most recently from 39ae636 to a74921a Compare March 18, 2015 01:00

jeromekelleher reviewed Mar 18, 2015
View reviewed changes

adamnovak mentioned this pull request Mar 18, 2015

Generic Query Conventions Need Defining #260

Open

macieksmuga reviewed Mar 18, 2015
View reviewed changes

adamnovak force-pushed the sequences branch from 2b5c8a6 to 97b96d9 Compare March 23, 2015 17:04

benedictpaten added the RefVariationTaskTeam label Mar 25, 2015

adamnovak and others added 2 commits March 25, 2015 18:05

adamnovak force-pushed the sequences branch from 97b96d9 to 6551c8e Compare March 26, 2015 01:08

adamnovak force-pushed the sequences branch from 6551c8e to 5d7f800 Compare March 26, 2015 01:58

jeromekelleher added a commit that referenced this pull request Mar 26, 2015

Merge pull request #250 from adamnovak/sequences

4b7b422

Side graphs: `Sequence`s and `Join`s

jeromekelleher merged commit 4b7b422 into ga4gh:master Mar 26, 2015

adamnovak mentioned this pull request Mar 30, 2015

Updating methods to work with side graphs. #269

Merged

adamnovak deleted the sequences branch April 30, 2015 22:24

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Side graphs: `Sequence`s and `Join`s #250

Side graphs: `Sequence`s and `Join`s #250

adamnovak commented Mar 6, 2015

macieksmuga Mar 17, 2015

adamnovak Mar 17, 2015

richarddurbin Mar 19, 2015

jeromekelleher Mar 18, 2015

adamnovak Mar 18, 2015

jeromekelleher commented Mar 18, 2015

fnothaft commented Mar 18, 2015

benedictpaten commented Mar 18, 2015

jeromekelleher commented Mar 18, 2015

macieksmuga commented Mar 18, 2015

fnothaft commented Mar 18, 2015

macieksmuga Mar 18, 2015

adamnovak Mar 18, 2015

macieksmuga Mar 18, 2015

adamnovak commented Mar 18, 2015

fnothaft commented Mar 18, 2015

adamnovak commented Mar 19, 2015

jeromekelleher commented Mar 19, 2015

adamnovak commented Mar 20, 2015

jeromekelleher commented Mar 23, 2015

adamnovak commented Mar 23, 2015

jeromekelleher commented Mar 23, 2015

benedictpaten commented Mar 25, 2015

adamnovak commented Mar 26, 2015

jeromekelleher commented Mar 26, 2015

Side graphs: Sequences and Joins #250

Side graphs: Sequences and Joins #250

Conversation

adamnovak commented Mar 6, 2015

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

- | <- bottom strand nick at position (3,-)

Choose a reason for hiding this comment

Choose a reason for hiding this comment

jeromekelleher commented Mar 18, 2015

fnothaft commented Mar 18, 2015

benedictpaten commented Mar 18, 2015

jeromekelleher commented Mar 18, 2015

macieksmuga commented Mar 18, 2015

fnothaft commented Mar 18, 2015

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

adamnovak commented Mar 18, 2015

fnothaft commented Mar 18, 2015

adamnovak commented Mar 19, 2015

jeromekelleher commented Mar 19, 2015

adamnovak commented Mar 20, 2015

jeromekelleher commented Mar 23, 2015

adamnovak commented Mar 23, 2015

jeromekelleher commented Mar 23, 2015

benedictpaten commented Mar 25, 2015

adamnovak commented Mar 26, 2015

jeromekelleher commented Mar 26, 2015

Side graphs: `Sequence`s and `Join`s #250

Side graphs: `Sequence`s and `Join`s #250