Representing multiple assemblies of the same organism #149

Open
cybersiddhu opened this issue Jan 16, 2015 · 4 comments

Comments

@cybersiddhu
Member

Context

When the same organism is sequenced more than once, we need a way to capture that information. Once it is recorded, we can tell which build a given genome belongs to, whether it is the canonical build, and so on.
At dictyBase, this will happen when a strain is sequenced multiple times. It will be rare, but it is definitely possible; for example, multiple research groups have sequenced the canonical AX4 strain. It is wise to make provision for it in the data model.

Data model for implementation

As discussed on the chado mailing list, there are a few options, each with its upsides and downsides.

  • Each assembly has its own organism entry. This could be done by appending the assembly id to the species value. However, it creates fake organism entries, which is undesirable. You also have to do extra work to gather all the information for a particular organism across its different assemblies.
  • Create an assembly feature for grouping. This is modeled on how GenBank and Ensembl handle assemblies: you create a chado feature to represent the assembly. See an example genome assembly page from NCBI. The information gathered from it could easily be turned into a chado feature for the assembly. The assembled features, such as chromosomes and contigs, will be its biological descendants.
    Representing the entry: Use the assembly cvterm to type the feature and the member_of feature relationship to relate the chromosomes and contigs to it.
    Notes: The downside is the fake grouping feature, which will always be hacky. The other solution is to plug in the chado group module; however, it is not yet in final shape to be released.
  • Use the analysis and analysis_feature tables to model the assembly. Represent the assembly as an analysis entry and then link the required features through analysis_feature. This is clear, straightforward, less hacky and chado-centric. So, let's stick with this model.
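
To make the chosen model concrete, here is a minimal sketch of the analysis / analysis_feature linkage. It is only an illustration under stated assumptions: the tables are a heavily simplified subset of the real chado tables (plain text columns stand in for organism_id and type_id), an in-memory SQLite database stands in for the PostgreSQL chado instance, and the helper names, program names and feature names are all made up for the example.

```python
import sqlite3

# Simplified stand-ins for the chado analysis, feature and analysis_feature
# tables. Real chado has more columns and uses organism_id / type_id keys.
SCHEMA = """
CREATE TABLE analysis (
    analysis_id    INTEGER PRIMARY KEY,
    name           TEXT,
    program        TEXT NOT NULL,
    programversion TEXT NOT NULL,
    sourcename     TEXT,
    UNIQUE (program, programversion, sourcename)
);
CREATE TABLE feature (
    feature_id INTEGER PRIMARY KEY,
    uniquename TEXT NOT NULL,
    organism   TEXT NOT NULL,  -- stand-in for organism_id
    type       TEXT NOT NULL   -- stand-in for type_id (a cvterm)
);
CREATE TABLE analysis_feature (
    analysis_feature_id INTEGER PRIMARY KEY,
    feature_id  INTEGER NOT NULL REFERENCES feature (feature_id),
    analysis_id INTEGER NOT NULL REFERENCES analysis (analysis_id),
    UNIQUE (feature_id, analysis_id)
);
"""


def add_assembly(conn, name, program, version, source, organism, features):
    """Store one assembly as an analysis row and link its features to it."""
    cur = conn.execute(
        "INSERT INTO analysis (name, program, programversion, sourcename) "
        "VALUES (?, ?, ?, ?)",
        (name, program, version, source),
    )
    analysis_id = cur.lastrowid
    for uniquename, ftype in features:
        fcur = conn.execute(
            "INSERT INTO feature (uniquename, organism, type) VALUES (?, ?, ?)",
            (uniquename, organism, ftype),
        )
        conn.execute(
            "INSERT INTO analysis_feature (feature_id, analysis_id) VALUES (?, ?)",
            (fcur.lastrowid, analysis_id),
        )
    return analysis_id


def features_of_assembly(conn, name):
    """Return the uniquenames of all features linked to the named assembly."""
    rows = conn.execute(
        "SELECT f.uniquename FROM feature f "
        "JOIN analysis_feature af ON af.feature_id = f.feature_id "
        "JOIN analysis a ON a.analysis_id = af.analysis_id "
        "WHERE a.name = ? ORDER BY f.uniquename",
        (name,),
    )
    return [row[0] for row in rows]


def build_demo():
    """Load two hypothetical assemblies of the same organism."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    add_assembly(conn, "AX4 assembly A", "hypothetical-assembler", "1.0",
                 "group A", "Dictyostelium discoideum",
                 [("A_chr1", "chromosome"), ("A_chr2", "chromosome")])
    add_assembly(conn, "AX4 assembly B", "hypothetical-assembler", "1.0",
                 "group B", "Dictyostelium discoideum",
                 [("B_contig1", "contig")])
    return conn
```

Both assemblies share one organism value, so no fake organism entries are needed, and `features_of_assembly(build_demo(), "AX4 assembly A")` pulls back only the chromosomes of that build.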

Other avenues to explore

  • As mentioned before, the chado group module has been discussed and is in use, or planned for use, in a few places.
  • Karl Pinc's data model for storing genomes from different populations of the same organism.
@cybersiddhu changed the title from "Representing multiple assemblies" to "Representing multiple assemblies of the same organism" Jan 16, 2015
@cybersiddhu
Member Author

Tied to dictyBase/Migration#5

@cybersiddhu
Member Author

Software implementation

A GFF3 post-processor script that would extract information from the GenBank assembly page and load it into chado. The script will also create links between downstream features. It might take the assembly id or the taxon id as input; however, it will take a little trial and error to settle on the one that works.
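
As a rough sketch of the linking step only, the snippet below scans GFF3 lines for top-level features and pairs each one with an assembly id supplied by the caller. Everything here is an assumption for illustration: the function names, the set of feature types treated as top-level, and the sample assembly id are hypothetical, and fetching metadata from GenBank and writing to chado are not shown.

```python
# Feature types treated as top-level assembled sequences (an assumption).
TOP_LEVEL_TYPES = {"chromosome", "contig", "supercontig"}


def top_level_ids(gff3_lines):
    """Return the IDs of top-level features found in GFF3 lines."""
    ids = []
    for line in gff3_lines:
        line = line.rstrip("\n")
        if line.startswith("##FASTA"):
            break  # sequence data follows; no more feature lines
        if not line or line.startswith("#"):
            continue  # skip blank lines, directives and comments
        cols = line.split("\t")
        if len(cols) != 9 or cols[2] not in TOP_LEVEL_TYPES:
            continue
        # Column 9 holds semicolon-separated key=value attributes.
        attrs = dict(
            pair.split("=", 1) for pair in cols[8].split(";") if "=" in pair
        )
        # Fall back to the seqid column when no ID attribute is present.
        ids.append(attrs.get("ID", cols[0]))
    return ids


def link_to_assembly(gff3_lines, assembly_id):
    """Pair every top-level feature ID with the given assembly id."""
    return [(fid, assembly_id) for fid in top_level_ids(gff3_lines)]


SAMPLE = [
    "##gff-version 3",
    "chr1\tdemo\tchromosome\t1\t1000\t.\t.\t.\tID=chr1;Name=1",
    "chr1\tdemo\tgene\t10\t200\t.\t+\t.\tID=gene1",
    "ctg7\tdemo\tcontig\t1\t500\t.\t.\t.\tID=ctg7",
]
```

Each resulting pair could then become one analysis_feature row once the assembly itself has been stored as an analysis entry.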

@cybersiddhu cybersiddhu added this to the Sprint #6 milestone Jan 16, 2015
@cybersiddhu cybersiddhu self-assigned this Jan 20, 2015
@cybersiddhu
Member Author

Asked the chado schema group for ideas.

@cybersiddhu
Member Author

An implementation to look at, as suggested on the mailing list.
