Representing multiple assemblies of the same organism #149

cybersiddhu · 2015-01-16T21:36:42Z

Context

When the same organism is sequenced multiple times, there has to be a way to capture that information. Once it is captured, we can figure out which particular build a genome belongs to, whether it is the canonical build, and so on.
At dictyBase, this happens when a strain is sequenced multiple times. It will be rare, but it is definitely possible; for example, multiple research groups have sequenced the canonical AX4 strain. It is wise to have a provision for this in the data model.

Data model for implementation

As discussed on the chado mailing list, there are a few options, each with its ups and downs.

Give each assembly its own organism entry. This could be done by appending the assembly id to the species value. However, it creates fake organism entries, which is less desirable. It also takes extra work to gather all the information for a particular organism across its different assemblies.
Create an assembly feature for grouping. This is modeled on how GenBank and Ensembl handle assemblies: a chado feature is created to represent the assembly. An example is a genome assembly page from NCBI. The information gathered from such a page could easily be turned into a chado feature for the assembly. The assembled features, such as chromosomes and contigs, would be its biological descendants.
Representing the entry: use the assembly cvterm to type the feature and the member_of feature relationship to relate the chromosomes and contigs to it.
Notes: the downside is having that fake grouping feature, which will always be hacky. The other solution is to plug in the chado group module; however, it is not yet in final shape for release.
Use the analysis and analysisfeature tables to model the assembly. Represent the assembly as an analysis entry and then link the required features through analysisfeature. This is quite clear, straightforward, less hacky and chado-centric. So, let's stick with this model.
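
To make the chosen model concrete, here is a minimal sketch using Python's built-in sqlite3 module. The tables are heavily simplified stand-ins for the chado ones (the real schema has more columns, constraints, and a cvterm-based type), and the assembly names, program names, and feature names are all made up for illustration. It only shows the shape of the idea: one organism row, one analysis row per assembly, and analysisfeature rows tying chromosomes and contigs to their assembly.

```python
import sqlite3

# Simplified stand-ins for chado tables; not the full chado schema.
SCHEMA = """
CREATE TABLE organism (
    organism_id INTEGER PRIMARY KEY,
    genus TEXT,
    species TEXT
);
CREATE TABLE feature (
    feature_id INTEGER PRIMARY KEY,
    organism_id INTEGER REFERENCES organism(organism_id),
    uniquename TEXT,
    type TEXT  -- chado uses a cvterm type_id; plain text keeps the sketch short
);
CREATE TABLE analysis (
    analysis_id INTEGER PRIMARY KEY,
    name TEXT,
    program TEXT,
    programversion TEXT,
    sourcename TEXT
);
CREATE TABLE analysisfeature (
    analysisfeature_id INTEGER PRIMARY KEY,
    feature_id INTEGER REFERENCES feature(feature_id),
    analysis_id INTEGER REFERENCES analysis(analysis_id)
);
"""


def build_demo():
    """Create an in-memory database with two hypothetical assemblies of one organism."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    conn.execute("INSERT INTO organism VALUES (1, 'Dictyostelium', 'discoideum')")
    # One analysis row per assembly (names and versions are invented).
    conn.executemany(
        "INSERT INTO analysis VALUES (?, ?, ?, ?, ?)",
        [
            (1, "AX4 assembly A", "assembler", "1.0", "group A"),
            (2, "AX4 assembly B", "assembler", "2.0", "group B"),
        ],
    )
    # All features share the same organism_id; no fake organism entries needed.
    conn.executemany(
        "INSERT INTO feature VALUES (?, ?, ?, ?)",
        [
            (1, 1, "A_chr1", "chromosome"),
            (2, 1, "A_chr2", "chromosome"),
            (3, 1, "B_contig1", "contig"),
        ],
    )
    # Link each assembled feature to the assembly it came from.
    conn.executemany(
        "INSERT INTO analysisfeature (feature_id, analysis_id) VALUES (?, ?)",
        [(1, 1), (2, 1), (3, 2)],
    )
    return conn


def features_of_assembly(conn, assembly_name):
    """Return the uniquenames of features linked to the named assembly, sorted."""
    rows = conn.execute(
        """
        SELECT f.uniquename
        FROM feature f
        JOIN analysisfeature af ON af.feature_id = f.feature_id
        JOIN analysis a ON a.analysis_id = af.analysis_id
        WHERE a.name = ?
        ORDER BY f.uniquename
        """,
        (assembly_name,),
    )
    return [row[0] for row in rows]


if __name__ == "__main__":
    demo = build_demo()
    print(features_of_assembly(demo, "AX4 assembly A"))
```

Which assembly counts as canonical is not modeled here; that would presumably need an extra property on the analysis entry, which is left open.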

Other avenues to explore

As mentioned before, the chado group module has been discussed and is being used, or is planned to be used, in a few places.
Karl Pinc's data model for storing genomes from different populations of the same organism.


cybersiddhu · 2015-01-16T21:44:09Z

Tied to dictyBase/Migration#5

cybersiddhu · 2015-01-16T21:45:25Z

Software implementation

A GFF3 post-processor script that would extract information from a GenBank assembly page and load it into chado. The script will also create links between downstream features. It might take the assembly id or the taxon id as input; however, it needs a little trial and error before settling on the one that works.
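
As a rough sketch of one piece of such a post-processor, the snippet below collects the top-level sequence ids (chromosomes, contigs) from GFF3 text so they can later be linked to an assembly. It does not fetch anything from GenBank and does not touch chado. The function names and the assembly identifier passed in are hypothetical, and the pairing step merely stands in for whatever loading code the real script would use.

```python
def top_level_seqids(gff3_text):
    """Return sequence ids in order of first appearance.

    Ids are taken from ##sequence-region directives and from column 1
    of feature lines. Parsing stops at the ##FASTA directive.
    """
    seen = []
    for raw in gff3_text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line.startswith("##FASTA"):
            break  # everything after this is sequence data, not features
        if line.startswith("##sequence-region"):
            parts = line.split()
            if len(parts) < 2:
                continue  # malformed directive, skip it
            seqid = parts[1]
        elif line.startswith("#"):
            continue  # other directives and comments
        else:
            seqid = line.split("\t", 1)[0]
        if seqid not in seen:
            seen.append(seqid)
    return seen


def link_rows(assembly_id, seqids):
    """Pair a (hypothetical) assembly identifier with each sequence id."""
    return [(assembly_id, seqid) for seqid in seqids]
```

The pairs returned by `link_rows` would be the input for creating the assembly-to-feature links in the database; whether the assembly id or the taxon id is the better lookup key is still the open question noted above.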

cybersiddhu · 2015-01-20T22:52:22Z

Asked chado schema group for ideas.

cybersiddhu · 2015-01-21T19:02:03Z

An implementation to look at as suggested in the mailing list.

cybersiddhu added Data import overhaul labels Jan 16, 2015

cybersiddhu changed the title ~~Representing multiple assemblies~~ Representing multiple assemblies of the same organism Jan 16, 2015

cybersiddhu added the in progress label Jan 16, 2015

cybersiddhu added this to the Sprint #6 milestone Jan 16, 2015

cybersiddhu self-assigned this Jan 20, 2015

cybersiddhu added ready in progress and removed in progress ready labels Jan 23, 2015

cybersiddhu removed the in progress label Nov 23, 2015
