Skip to content
This repository has been archived by the owner on Oct 28, 2022. It is now read-only.

Side graphs: Sequences and Joins #250

Merged
merged 3 commits into from
Mar 26, 2015
Merged

Conversation

adamnovak
Copy link
Contributor

Here's a draft of what our graph model would look like if instead of the "insert graphs" we are using now (where sequences come with associated end joins), we used a "side graph". In a side graph, we split out Sequences (which are just strings with IDs) and Joins (which bind two sides of two bases in two sequences together). Both kinds of objects, in this world, can occur in ReferenceSets and VariantSets.

I've made a bunch of API changes to see how that would actually work: creating a Sequence object, to start with. @richarddurbin made Joins, and then I changed the methods for searching to give us nice clean searchSequences() and searchJoins() methods to get them.

@richarddurbin: is this what you were going for?

@lh3: have I captured side graphs correctly?

All: do we actually like side graphs better than insert graphs?

I don't expect this to be merged for a while, but I think we need a concrete proposal to help anchor the discussion in #243 and its ensuing massive e-mail thread.

basepairs. In the default forward orientation a sequence is specified by the DNA
letters of its top strand, e.g. in this case GGTGGNG, where N indicates an
example where the top strand is GGTGGAG.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Rest of example below still uses GGTGGNG. Did you want to change A back to an N?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, it should probably be N. I will fix that.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry, when I was editing I half edited N to A, then stopped. I think it would be good to remove N from
the examples, although to say somewhere that N is a legitimate character in a sequence. There are a few
Ns, but they are not typical.

Richard

On 17 Mar 2015, at 04:40, Maciek Smuga-Otto notifications@github.com wrote:

In src/main/resources/avro/common.avdl #250 (comment):

-```

  •        | <- top strand nick at position (2,+)
    
    -5' ------- ------------------- 3'
  • G   G   T   G   G   N   G
    
  • C   C   A   C   C   N   C
    
    -3' ---------------- ---------- 5'

- | <- bottom strand nick at position (3,-)

  • +0- +1- +2- +3- +4- +5- +6- <- coordinates
    -```

-A sequence is a piece of double-stranded DNA composed of a series of DNA
-basepairs. In the default forward orientation a sequence is specified by the DNA
-letters of its top strand, e.g. in this case GGTGGNG, where N indicates an
+example where the top strand is GGTGGAG.
+
Example below still uses GGTGGNG. Do you want to change N->A throughout the rest of the example?


Reply to this email directly or view it on GitHub https://github.com/ga4gh/schemas/pull/250/files#r26549104.

The Wellcome Trust Sanger Institute is operated by Genome Research
Limited, a charity registered in England with number 1021457 and a
company registered in England with number 2742969, whose registered
office is 215 Euston Road, London, NW1 2BE.

@adamnovak adamnovak force-pushed the sequences branch 3 times, most recently from 39ae636 to a74921a Compare March 18, 2015 01:00
a terminal side. The value of `startJoin` or `endJoin`, if set, is the side to
which the corresponding side of this `Sequence` is connected.
A `Segment` may have zero length (for example, when it is being used to
specify a `Path` consisting onyl of a `Join`.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

typo: onyl

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, I am fixing this.

@jeromekelleher
Copy link
Contributor

Thanks for the updates @adamnovak. The model itself seems sensible enough to me, but I'm really struggling to understand how the API would work. How would a client do something useful with this? In particular, I don't understand what the client is supposed to do with Join objects. I'm obviously missing something important here! It would be really useful I think to try and write down some client code to do some very straightforward calculations. Perhaps we could flesh out some of the operations that we are going to do as part of the graph comparison, and use those as test-cases for the API?

@fnothaft
Copy link
Contributor

+1

As an aside, I agree with @jeromekelleher:

How would a client do something useful with this?

But, I think we're in a bit of a chicken/egg situation (e.g., we can't reason about how to perform common query X, unless we've got a data model). I think this is a data model that we can work with, and we'll figure out what the query patterns are by using the data model.

@benedictpaten
Copy link
Contributor

+1

Critical to "how do we do useful things" question is adding in variant
search queries, like "give me all the variants in the graph related to chr1
positions 1 to 100?". The rank on positions/joins makes it possible to
define this concretely. A related problem is agreeing on exactly what we
refer to as an allele in the graph. There has been a good deal of
conversation on that subject, but we need some pull requests (building on
this one), to make it concrete.

On Wed, Mar 18, 2015 at 9:54 AM, Frank Austin Nothaft <
notifications@github.com> wrote:

+1

As an aside, I agree with @jeromekelleher
https://github.com/jeromekelleher:

How would a client do something useful with this?

But, I think we're in a bit of a chicken/egg situation (e.g., we can't
reason about how to perform common query X, unless we've got a data model).
I think this is a data model that we can work with, and we'll figure out
what the query patterns are by using the data model.


Reply to this email directly or view it on GitHub
#250 (comment).

@jeromekelleher
Copy link
Contributor

I'm basically +1 on the data model here, but I've still got issues with the API methods. Perhaps we could break these into separate PRs, so we're not holding up implementation of the data model?

@macieksmuga
Copy link
Contributor

@jeromekelleher Interesting suggestion on splitting the PR. I know it would help me a lot to have the basic graph model stabilized, even as the discussion on API methods and Alleles carries on. @adamnovak what do you think?

@fnothaft
Copy link
Contributor

I think @jeromekelleher's suggestion is a great suggestion.

*/

record ReferenceSet {
/** The reference set ID. Unique in the repository. */
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If ReferenceSets are now meant to be composeable, should this be explicitly added somewhere in the definition of the object here?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is in the definition (see includedReferenceSets), but it should be in the doc comment.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@adamnovak oops... was accidentally searching on master, not your PR version of the file. Sorry!

@adamnovak
Copy link
Contributor Author

I do see how it would be good to hash out things like searchJoins versus searchVariantSetJoins and so on independently from the data model change, but if we accept the data model change without either of those methods, there would be no way to retrieve connectivity information until that discussion was finished, which is definitely a regression.

Would people be OK with that?

@fnothaft
Copy link
Contributor

Would people be OK with that?

I am OK with that.

@adamnovak
Copy link
Contributor Author

OK, in response to @diekhans's comments about needing a good way to specify a Position without a strand, I now have a Position with no strand field, and a Side which is a position and a strand (which is required to be + or -, and cannot be null). If you don't know if you will have a strand or not, you would use a uinon { Position, Side }.

@diekhans, is that good for your use case? I didn't find anything in the metadata API to switch over to the new type.

@jeromekelleher
Copy link
Contributor

Re. splitting the PR, I think it's totally fine if we're temporarily missing some methods from our git repo.

@adamnovak
Copy link
Contributor Author

OK, here's a version with no changes to the methods files.

@jeromekelleher
Copy link
Contributor

Thanks @adamnovak, +1.

We should squash the commits - I guess one Adam commit, one Richard commit, one Adam commit is the most appropriate?

@adamnovak
Copy link
Contributor Author

OK I've squashed everything down (even though I have heard it is rude to rewrite other people's commits). What do you think?

@jeromekelleher
Copy link
Contributor

Looks good Adam - we could probably tidy up the commit comments a bit though. Stuff like "Rolling methods back" doesn't really tell future-us very much about what this commit actually does.

@benedictpaten
Copy link
Contributor

I'll merge this in the next 24 hours (unless Jerome feels strongly about tidying up the commit comments).

adamnovak and others added 2 commits March 25, 2015 18:05
There has been much confusion caused by the lack of a `Sequence` type,
and instead attempting to represent everything with `Segment`s. This is
an attempt to add in a `Sequence` and see what things look like.

`VariantSet`s now contain extra `Sequence`s instead of estra `Segment`s.

`Segment`s don't have joins any more, since they now only are path
components, and thus don't need them. `Path`s have optional joins at
their ends, if they need to include adjacencies before and after the
bases they cover.
…ct, David H and Heng.

The graph model now has Sequences and Joins, which join two arbitrary oriented positions (sometimes
known as sides) in sequences.
We also remove the NO_STRAND option in the Strand enum - this should be done by setting the variable to null.

Revised graph representation with Sequence and Join without start/endJoin.
Also added isPrimary to Reference, and a couple more changes.

Add includedReferenceSets to permit ReferenceSets to be based on others.
@adamnovak
Copy link
Contributor Author

OK, I've cleaned up the concatenated commit messages slightly.

Describing sticky end handling

Clarifying Segments and join semantics

Making the new `ReferenceSet` composing system more fleshed out and
described.

Clarifying the semantics in `SearchSequencesRequest` and
`SearchJoinsRequest`.

Removing Position, calling it Side instead

Adding a new, un-oriented Position

I'm going to make the `Sequence`/`Join` method changes later.
@jeromekelleher
Copy link
Contributor

Looks good, thanks @adamnovak. Merging.

jeromekelleher added a commit that referenced this pull request Mar 26, 2015
Side graphs: `Sequence`s and `Join`s
@jeromekelleher jeromekelleher merged commit 4b7b422 into ga4gh:master Mar 26, 2015
@adamnovak adamnovak deleted the sequences branch April 30, 2015 22:24
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

8 participants