-
Notifications
You must be signed in to change notification settings - Fork 93
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Basic infrastructure for pulling genome data from Ensembl #563
Conversation
Not sure how to test this, btw. |
I guess you could test the |
26414c7
to
d3ae40c
Compare
There's some basic infrastructure here for pulling the genome/assembly data for a species from Ensembl here. The idea is that we use the REST API to get information on the assembly, and then write this info into a Python file. We can then easily read this in, and use it to populate the objects. I decided that writing a simple Python file was better than trying to write the actual object definitions here because (a) it's a bit faffy writing out Python source code; and (b) we're still combining information from multiple sources, as the mutation/recombination rates aren't going to be available from Ensembl. We could equally use a JSON file, it probably doesn't matter. I just thought that a Black formatting python file like this would be an unequivocal and diff-friendly way to stating the information. If we like this approach, then I think what we should do is to make each species module in the catalog a package directory, where the There's a little bit more infrastructure needed to go over all the species automatically, but it's straightforward. Do we like the general layout, before I go any further? Pinging @grahamgower @andrewkern @ndukler |
Oh yeah, Ensembl doesn't use the "chr" prefix and we currently do. Hooray. |
Heh. I use a package in R called |
Very nice @jeromekelleher, I like it a lot! |
Seconded! |
Is it worth me re-implementing the chromosome renaming functionality @jeromekelleher ? Or should we just agree that everyone always has to use ensembl nomenclature? Or just set up a specific ensembl to ucsc conversion when the genome data is imported and force everywhere else to be UCSC? |
I've pushed through the changes and it's mostly working pretty well. There's some details with some assemblies like PonAbe that we need to make decisions on, but there's no big issues. The only real roadblock is the chromosome naming conventions. We either need to
I think (a) would be the ideal solution. Anyone like to do a bit of digging to see if we can find the required data? I'd be surprised if it wasn't in Ensembl somewhere. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is looking very useful. I think using the karyotype field of the /info/assembly/
endpoint should clean up the Pongo chromosomes.
maintenance/ensembl.py
Outdated
data["chromosomes"] = { | ||
name: chromosomes[name] for name in sorted(chromosomes.keys())} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the "karyotype" output field gives a more natural ordering than sorting.
>>> sorted(["1", "2", "10"])
['1', '10', '2']
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, excellent idea. I'll change that.
stdpopsim/catalog/PonAbe/__init__.py
Outdated
# JK: I've no idea what this chr is. Is this the correct thing | ||
# to do? | ||
"Un": 0, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe "Un" means unplaced?
Do you want me to prototype this? I looked around for a web service but there genuinely does not seem to be one. There is something in Galaxy here but it seems pretty limited. There are also some lists here. I can just grab the tables from one of the places I mentioned then set up a system for automatically attaching the synonyms to each chromosome when they are constructed. Then |
That sounds good @ndukler, but I worry a bit about how extensive the species data will be and potential conflicts with the Ensembl data model. I'll ping a few more people to see if anyone knows, so maybe don't put too much effort into it. |
Some helpful comments on twitter here |
I've update this with your ideas @ndukler - it's still pretty messy tbh. I don't really know where we should get the data from - perhaps the best thing is to git submodule the R code? Or, maybe we should pull in the data using the R package by running R from Python? Then we can checkout a particular release tag from GenomeInfoDb. I had a go at getting this information from UCSC via the database API too, but everything so bloody inconsistent :( |
Codecov Report
@@ Coverage Diff @@
## master #563 +/- ##
=======================================
Coverage 99.64% 99.64%
=======================================
Files 23 29 +6
Lines 1973 1997 +24
Branches 203 206 +3
=======================================
+ Hits 1966 1990 +24
Misses 3 3
Partials 4 4
Continue to review full report at Codecov.
|
This is nearly done, I think, so could use some review from those in the know. We can get some synonym information from Ensembl. It's not that comprehensive, but it's good enough - we can just say we only allow Ensembl IDs for the things we don't have (like DroMel, weirdly). I fixed the the test cases that broke because of this. The only thing that seriously broken now is the recombination maps. These have UCSC IDs in them, which is fun. Hmm, I should be able to fix this.. wait on. |
OK! Recombination maps are fixed, so I think this is ready for review and final cleanups now. Pinging @andrewkern, @grahamgower, @ndukler, @apragsdale |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks great @jeromekelleher. The only blocker I see here is that DroMel needs a _recombination_rate_data
dict like the other species, to set the recombination rate to zero for Y and Mt.
Also, how do you feel about having a consistent synonym for Mt across all species?
Notes from the call 07/07
|
What information do we actually need for a sex chromosome? In my mutation calling software, what is important is to know the expected ploidy of an individual at a locus 0, 1, and 2 and how it is inherited. Chromosomes X, Y, etc. are used as shortcuts for determining the polidy of males and females and how it can be transmitted to offspring. This is then used to prune a family-history graph into the proper transmission structure. In that sense, stdpopsim would need a low level API that specifies the polidy and transmission rules for chromosomes/fragments in species that have defined sexes. On top of that would be user-friendly labels like Autosomal, X, Y, Z, etc. And on top of that would be annotating a genome based on these labels. |
Good questions @reedacartwright! I think how we actually simulate sex chrs is a big question. All we're interested in for now, I think, is to basically record the information that something is not an autosome, so that we can emit a warning. |
guys i've spent a bit of time playing with this code this AM. first off @jeromekelleher this is awesome. second, i think we should merge this as is and then fine tune things as we try to maintain it. undoubtedly we will come on myriad issues trying to align ourselves with ensembl (e.g. the synonym stuff), so i vote we merge this and give it a go while our user base is still small. |
This commit changes the basic data associated with species from being locally curated information to being derived from Ensembl.
5483631
to
97e6efa
Compare
I've updated this with the bug-fixes, and opened issues to track the outstanding stuff raised above. In the interest of making progress I suggest we merge this much. @grahamgower, would you mind taking a last look and merging if you're happy please? |
Looks good @jeromekelleher! Merging. |
This seems a like a reasonable place to start with integrating with upstream genome data providers.