Adding sequence records, and cleaning up reference records. #162

fnothaft · 2014-10-13T23:08:34Z

@jeromekelleher and I plotted this one out this PM; we think this is a more consistent and general approach to serving references and sequences. Specifically, we:

Added GASequence and GASequenceFragment models. These models define a generic contig sequence, which could be used to describe a reference assembly, a de novo assembly, variant calls spliced back into a reference assembly, etc...
Cleaned up a few fields in GAReference and GAReferenceSet to make the relations of the models slightly cleaner/clearer.
Removed /references/{id}/bases GET request. Instead, we have added /sequences/{id} and /sequencefragments/{id}` GET requests, which are more general.

calbach · 2014-10-15T16:40:16Z

Hi Frank,

Can you give some more concrete examples of how you envision GASequence{Fragment,} being used? I'm not sure I'm yet understanding the benefits of their addition.
fragment : sequence; what's the relationship? Are fragments contiguous? Can they overlap? Can they have gaps? Or is this just your preferred way of implementing pagination over bases?

CH

fnothaft · 2014-10-15T16:53:43Z

@calbach Certainly! Right now, the API assumes that the only contig sequences which will accessed using the GA4GH APIs are reference sequences. In reality, reference sequences are a subset of possible contigs someone might want to access. For example, I could want to store contigs that represent:

A reference sequence
A non-reference assembly (e.g., I sequence a genome of interest to myself and then de novo assemble the reads)
A transcriptome assembly
Contig sequences where someone has merged variants into a reference (e.g., for comparing whether two complex alleles are the same)

Essentially, this provides a more generic approach for accessing contigs.

As for the fragment:sequence question, the assumption is that fragments would be non-overlapping continuous subsequences from a contig sequence. Essentially, this would be the preferred way of implementing pagination for contig sequences.

pgrosu · 2014-10-15T17:33:23Z

+1 I was wondering when we would pick up on this discussion from #133 (diff) - glad to see we're allowing for these possibilities :)

richarddurbin · 2014-10-15T17:44:38Z

Hello @fnothaft and @jeromekelleher,

I can appreciate the concept of a more generic Sequence. I see that building it from Fragments addresses practical issues in very long
sequences, and perhaps some other things. This is in fact how the reference genome is managed by the GRC, and also by Ensembl.
You might want to allow Fragments to be reverse complemented with respect to the Sequence.
I am not sure about the methods for getting actually bases back. The key method seems to be SearchSequenceFragmentsRequest.
But if you give an array of multiple sequenceIds and a start and end, do the same start and end apply to each sequenceId? And I get
back an array of fragments, but is that what I want? I would have thought I wanted to give a start and end on a sequence and for the
back end to piece together the literal base sequence between them from the fragments for me. That is what Ensembl APIs do for you.

I would be interested to discuss on Friday.

As a side effect, you have made a major change in that you now have Reference point to its referenceSet, rather than ReferenceSet contain a list of the references
contained in it. This changes the semantics. Now each reference can only be in one referenceSet. I don't think we want that. There are many different human referenceSets
that contain the same mitochondrial reference sequences for example, and even different versions of the GRCh37 reference used for different mapping that contain the
same basic chromosomes, but different additional sequences. So I would vote -1 on this change.

On the positive side, I like the change to an array of derivedFrom and sourceDivergence.

Richard

On 14 Oct 2014, at 00:08, Frank Austin Nothaft notifications@github.com wrote:

@jeromekelleher and I plotted this one out this PM; we think this is a more consistent and general approach to serving references and sequences. Specifically, we:

Added GASequence and GASequenceFragment models. These models define a generic contig sequence, which could be used to describe a reference assembly, a de novo assembly, variant calls spliced back into a reference assembly, etc...
Cleaned up a few fields in GAReference and GAReferenceSet to make the relations of the models slightly cleaner/clearer.
Removed /references/{id}/bases GET request. Instead, we have added /sequences/{id} and /sequencefragments/{id}` GET requests, which are more general.
You can merge this Pull Request by running

git pull https://github.com/fnothaft/ReadTaskTeam add-sequences
Or view, comment on, or merge it at:

#162

Commit Summary

Adding sequence records, and cleaning up reference records.
File Changes

M src/main/resources/avro/referencemethods.avdl (63)
M src/main/resources/avro/references.avdl (44)
A src/main/resources/avro/sequencemethods.avdl (133)
A src/main/resources/avro/sequences.avdl (61)
Patch Links:

https://github.com/ga4gh/schemas/pull/162.patch
https://github.com/ga4gh/schemas/pull/162.diff
—
Reply to this email directly or view it on GitHub.

The Wellcome Trust Sanger Institute is operated by Genome Research
Limited, a charity registered in England with number 1021457 and a
company registered in England with number 2742969, whose registered
office is 215 Euston Road, London, NW1 2BE.

jeromekelleher · 2014-10-16T01:34:27Z

After discussing this with @cassiedoll and @calbach today, I think we can simplify this a great deal and address @richarddurbin's concern. The key issue that we were trying to deal with here is to give a consistent way of paging over large numbers of objects. This is complicated in this case, as we are really just trying to page over a large string and not over a set of well defined objects. A more effective way to do this might be to simply stream these large strings over HTTP using a GET method (with optional bounds), and avoid dealing with this complication at the GA4GH protocol level. This gets rid of the need for SequenceFragments and the extra search methods.

However, there are some changes here that we still think are useful. @franknothaft, I'll email you offline with the details, and hopefully we can update the PR.

fnothaft · 2014-10-16T01:35:49Z

@jeromekelleher Sounds good to me; this:

This gets rid of the need for SequenceFragments and the extra search methods.

Sounds like a good idea here.

pgrosu · 2014-10-16T05:06:32Z

I wouldn't throw out the baby with the bathwater yet. Actually if the source type (i.e. GAReadSet, GAReference, GAVariantSet) is also specified in GASequence then we should be able to reconstruct what GAReference{Set} it is derived from. We should be able to add start/end as necessary, but by going up the chain to the reference it would be easy to regenerate the full sequence in a pipeline module.

Let's keep working on this, since we're on a good track and almost there.

cassiedoll · 2014-10-28T21:09:03Z

@jeromekelleher @fnothaft - any update on this PR? is there going to be a new one or an updated version?

delagoya · 2014-12-17T17:30:06Z

Bringing this issue back into notice. May have implications for #211 and #212.
Will also need to account for name-spacing introduced in #209

jeromekelleher · 2014-12-18T09:24:18Z

I'm happy to close the PR at this point - it's probably easier to make a new one when we hit the issues involved in the reference server. What do you think @fnothaft?

adamnovak · 2015-02-24T20:32:06Z

@fnothaft?

fnothaft · 2015-02-24T20:33:44Z

Is there interest in continuing with this, or closing it?

pgrosu · 2015-02-24T20:40:19Z

@fnothaft, of course there is interest! I think working with RNA-seq data would be quite important. What is the concern?

adamnovak · 2015-02-24T23:23:38Z

I think we should think about what we want to do with this in light of #238. Right now in the API we have a concept of a "sequence", which is a string with an associated ID and end joins onto other sequences, and we have a way to page through the bases of a sequence. We also still have the /references/{id}/bases call lying around.

The simplest thing would be to close this up and run with the sequence system we have, but I still think what we have is kinda ugly, and so would be open to changing it around.

jeromekelleher · 2015-02-25T11:26:07Z

I vote to close this - it would be simpler to bring in the aspects we want to keep as a new PR than to update this one.

delagoya · 2015-02-25T17:19:03Z

I agree with @jeromekelleher. There are a couple of issues that were discussed in the conversation, and due to the age of the issue, we should refresh ourselves on the motivation for the PR.

If no one objects, I will close this issue on Friday Feb 27th @ 8AM UTC.

fnothaft · 2015-02-25T17:19:48Z

I am in agreement about closing this.

update variant annotation test data with HGVSg, names and metadata

Adding sequence records, and cleaning up reference records.

d2bb3e8

cassiedoll added the ReadTaskTeam label Oct 15, 2014

benedictpaten closed this Mar 16, 2015

dcolligan pushed a commit to dcolligan/ga4gh-schemas that referenced this pull request Jul 20, 2016

Merge pull request ga4gh#162 from sarahhunt/testdata

c2ddb4c

update variant annotation test data with HGVSg, names and metadata

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Adding sequence records, and cleaning up reference records. #162

Adding sequence records, and cleaning up reference records. #162

fnothaft commented Oct 13, 2014

calbach commented Oct 15, 2014

fnothaft commented Oct 15, 2014

pgrosu commented Oct 15, 2014

richarddurbin commented Oct 15, 2014

jeromekelleher commented Oct 16, 2014

fnothaft commented Oct 16, 2014

pgrosu commented Oct 16, 2014

cassiedoll commented Oct 28, 2014

delagoya commented Dec 17, 2014

jeromekelleher commented Dec 18, 2014

adamnovak commented Feb 24, 2015

fnothaft commented Feb 24, 2015

pgrosu commented Feb 24, 2015

adamnovak commented Feb 24, 2015

jeromekelleher commented Feb 25, 2015

delagoya commented Feb 25, 2015

fnothaft commented Feb 25, 2015

Adding sequence records, and cleaning up reference records. #162

Adding sequence records, and cleaning up reference records. #162

Conversation

fnothaft commented Oct 13, 2014

calbach commented Oct 15, 2014

fnothaft commented Oct 15, 2014

pgrosu commented Oct 15, 2014

richarddurbin commented Oct 15, 2014

jeromekelleher commented Oct 16, 2014

fnothaft commented Oct 16, 2014

pgrosu commented Oct 16, 2014

cassiedoll commented Oct 28, 2014

delagoya commented Dec 17, 2014

jeromekelleher commented Dec 18, 2014

adamnovak commented Feb 24, 2015

fnothaft commented Feb 24, 2015

pgrosu commented Feb 24, 2015

adamnovak commented Feb 24, 2015

jeromekelleher commented Feb 25, 2015

delagoya commented Feb 25, 2015

fnothaft commented Feb 25, 2015