Skip to content
This repository has been archived by the owner on Oct 28, 2022. It is now read-only.

Adding sequence records, and cleaning up reference records. #162

Closed
wants to merge 1 commit into from

Conversation

fnothaft
Copy link
Contributor

@jeromekelleher and I plotted this one out this PM; we think this is a more consistent and general approach to serving references and sequences. Specifically, we:

  • Added GASequence and GASequenceFragment models. These models define a generic contig sequence, which could be used to describe a reference assembly, a de novo assembly, variant calls spliced back into a reference assembly, etc...
  • Cleaned up a few fields in GAReference and GAReferenceSet to make the relations of the models slightly cleaner/clearer.
  • Removed /references/{id}/bases GET request. Instead, we have added /sequences/{id} and /sequencefragments/{id}` GET requests, which are more general.

@calbach
Copy link
Contributor

calbach commented Oct 15, 2014

Hi Frank,

  1. Can you give some more concrete examples of how you envision GASequence{Fragment,} being used? I'm not sure I'm yet understanding the benefits of their addition.
  2. fragment : sequence; what's the relationship? Are fragments contiguous? Can they overlap? Can they have gaps? Or is this just your preferred way of implementing pagination over bases?

CH

@fnothaft
Copy link
Contributor Author

@calbach Certainly! Right now, the API assumes that the only contig sequences which will accessed using the GA4GH APIs are reference sequences. In reality, reference sequences are a subset of possible contigs someone might want to access. For example, I could want to store contigs that represent:

  • A reference sequence
  • A non-reference assembly (e.g., I sequence a genome of interest to myself and then de novo assemble the reads)
  • A transcriptome assembly
  • Contig sequences where someone has merged variants into a reference (e.g., for comparing whether two complex alleles are the same)

Essentially, this provides a more generic approach for accessing contigs.

As for the fragment:sequence question, the assumption is that fragments would be non-overlapping continuous subsequences from a contig sequence. Essentially, this would be the preferred way of implementing pagination for contig sequences.

@pgrosu
Copy link
Contributor

pgrosu commented Oct 15, 2014

+1 I was wondering when we would pick up on this discussion from #133 (diff) - glad to see we're allowing for these possibilities :)

@richarddurbin
Copy link
Contributor

Hello @fnothaft and @jeromekelleher,

I can appreciate the concept of a more generic Sequence. I see that building it from Fragments addresses practical issues in very long
sequences, and perhaps some other things. This is in fact how the reference genome is managed by the GRC, and also by Ensembl.
You might want to allow Fragments to be reverse complemented with respect to the Sequence.
I am not sure about the methods for getting actually bases back. The key method seems to be SearchSequenceFragmentsRequest.
But if you give an array of multiple sequenceIds and a start and end, do the same start and end apply to each sequenceId? And I get
back an array of fragments, but is that what I want? I would have thought I wanted to give a start and end on a sequence and for the
back end to piece together the literal base sequence between them from the fragments for me. That is what Ensembl APIs do for you.

I would be interested to discuss on Friday.

As a side effect, you have made a major change in that you now have Reference point to its referenceSet, rather than ReferenceSet contain a list of the references
contained in it. This changes the semantics. Now each reference can only be in one referenceSet. I don't think we want that. There are many different human referenceSets
that contain the same mitochondrial reference sequences for example, and even different versions of the GRCh37 reference used for different mapping that contain the
same basic chromosomes, but different additional sequences. So I would vote -1 on this change.

On the positive side, I like the change to an array of derivedFrom and sourceDivergence.

Richard

On 14 Oct 2014, at 00:08, Frank Austin Nothaft notifications@github.com wrote:

@jeromekelleher and I plotted this one out this PM; we think this is a more consistent and general approach to serving references and sequences. Specifically, we:

Added GASequence and GASequenceFragment models. These models define a generic contig sequence, which could be used to describe a reference assembly, a de novo assembly, variant calls spliced back into a reference assembly, etc...
Cleaned up a few fields in GAReference and GAReferenceSet to make the relations of the models slightly cleaner/clearer.
Removed /references/{id}/bases GET request. Instead, we have added /sequences/{id} and /sequencefragments/{id}` GET requests, which are more general.
You can merge this Pull Request by running

git pull https://github.com/fnothaft/ReadTaskTeam add-sequences
Or view, comment on, or merge it at:

#162

Commit Summary

Adding sequence records, and cleaning up reference records.
File Changes

M src/main/resources/avro/referencemethods.avdl (63)
M src/main/resources/avro/references.avdl (44)
A src/main/resources/avro/sequencemethods.avdl (133)
A src/main/resources/avro/sequences.avdl (61)
Patch Links:

https://github.com/ga4gh/schemas/pull/162.patch
https://github.com/ga4gh/schemas/pull/162.diff

Reply to this email directly or view it on GitHub.

The Wellcome Trust Sanger Institute is operated by Genome Research
Limited, a charity registered in England with number 1021457 and a
company registered in England with number 2742969, whose registered
office is 215 Euston Road, London, NW1 2BE.

@jeromekelleher
Copy link
Contributor

After discussing this with @cassiedoll and @calbach today, I think we can simplify this a great deal and address @richarddurbin's concern. The key issue that we were trying to deal with here is to give a consistent way of paging over large numbers of objects. This is complicated in this case, as we are really just trying to page over a large string and not over a set of well defined objects. A more effective way to do this might be to simply stream these large strings over HTTP using a GET method (with optional bounds), and avoid dealing with this complication at the GA4GH protocol level. This gets rid of the need for SequenceFragments and the extra search methods.

However, there are some changes here that we still think are useful. @franknothaft, I'll email you offline with the details, and hopefully we can update the PR.

@fnothaft
Copy link
Contributor Author

@jeromekelleher Sounds good to me; this:

This gets rid of the need for SequenceFragments and the extra search methods.

Sounds like a good idea here.

@pgrosu
Copy link
Contributor

pgrosu commented Oct 16, 2014

I wouldn't throw out the baby with the bathwater yet. Actually if the source type (i.e. GAReadSet, GAReference, GAVariantSet) is also specified in GASequence then we should be able to reconstruct what GAReference{Set} it is derived from. We should be able to add start/end as necessary, but by going up the chain to the reference it would be easy to regenerate the full sequence in a pipeline module.

Let's keep working on this, since we're on a good track and almost there.

@cassiedoll
Copy link
Member

@jeromekelleher @fnothaft - any update on this PR? is there going to be a new one or an updated version?

@delagoya
Copy link
Contributor

Bringing this issue back into notice. May have implications for #211 and #212.
Will also need to account for name-spacing introduced in #209

@jeromekelleher
Copy link
Contributor

I'm happy to close the PR at this point - it's probably easier to make a new one when we hit the issues involved in the reference server. What do you think @fnothaft?

@adamnovak
Copy link
Contributor

@fnothaft?

@fnothaft
Copy link
Contributor Author

Is there interest in continuing with this, or closing it?

@pgrosu
Copy link
Contributor

pgrosu commented Feb 24, 2015

@fnothaft, of course there is interest! I think working with RNA-seq data would be quite important. What is the concern?

@adamnovak
Copy link
Contributor

I think we should think about what we want to do with this in light of #238. Right now in the API we have a concept of a "sequence", which is a string with an associated ID and end joins onto other sequences, and we have a way to page through the bases of a sequence. We also still have the /references/{id}/bases call lying around.

The simplest thing would be to close this up and run with the sequence system we have, but I still think what we have is kinda ugly, and so would be open to changing it around.

@jeromekelleher
Copy link
Contributor

I vote to close this - it would be simpler to bring in the aspects we want to keep as a new PR than to update this one.

@delagoya
Copy link
Contributor

I agree with @jeromekelleher. There are a couple of issues that were discussed in the conversation, and due to the age of the issue, we should refresh ourselves on the motivation for the PR.

If no one objects, I will close this issue on Friday Feb 27th @ 8AM UTC.

@fnothaft
Copy link
Contributor Author

I am in agreement about closing this.

dcolligan pushed a commit to dcolligan/ga4gh-schemas that referenced this pull request Jul 20, 2016
update variant annotation test data with HGVSg, names and metadata
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

9 participants