-
Notifications
You must be signed in to change notification settings - Fork 195
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Inserting existing annotations on a genome 'path' into a vg graph #1381
Comments
I think if you express your annotations as GAM alignments to the graph, you can use the What you would need then would be a way to convert an annotation specified in path coordinates into a mapping in node coordinates. I don't think we have anything that does that right now. You can use You might be able to get away with extracting the region relevant to the gene, mapping the gene's reference sequence against that region with |
Is there some documentation for the GAM file available. My google/bing attempts are not getting me to any helpful pages. |
If I may step in, the ag (annotation graph) code from Computomics offers a way to store annotation. For more please checkout https://gitlab.codenic.de/computomics/ag/blob/master/src/ag.hpp. I think it could be worth to take a look at it. |
@subwaystation thank you for letting me know about ag, but it is a technical direction I am personally not interested in. My aim is to use VG for building variation graphs and for sequence orientated searches, but use SPARQL and RDF for annotation and integration with existing systems. Mostly my aim is to combine genomic variation graphs with protein variation graphs with all of the richness of Ensembl and UniProt annotation and I do not think it economically feasible to push this all into a VG/XG like system. |
@JervenBolleman That makes total sense from my point of view. Thanks for pointing it out. I am curious how you are going to implement this :) |
@subwaystation An old wiki page about the general concept is here https://github.com/vgteam/vg/wiki/Annotating-a-VG-graph-the-RDF-way and the specific hack attack I would like to pick up is here https://github.com/vgteam/vg/wiki/VG-RDF,-the-Ensembl-bacteria-E.-coli-genome-hack-attack Which was stopped at the time exactly at the point of needing to add "gene" paths to the VG. With the node breaking in the existing genome paths logic being the most significant issue. |
@JervenBolleman Thanks! |
@subwaystation its just an example to illustrate the thinking. Basically I see current existing gene annotations as sub paths of the reference genome path. So yes it would use the nodeIDs of VG (of course there is whole different set of problems with current node ID assignment in VG, that is the topic of different issues/pages). In the JSON syntax we have the existing VG graph of the two genomes {
"node": [{"sequence": "AA", "id": 1},
{"sequence": "C", "id": 2},
{"sequence": "C", "id": 3},
{"sequence": "AAGG", "id": 4},
{"sequence": "T", "id": 5},
{"sequence": "A", "id": 6},
{"sequence": "AA", "id": 7},
{"sequence": "AG", "id": 8}
],
"path": [ { "name" : "genome 1",
"mapping":[
{"position":{"nodeid":1}},
{"position":{"nodeid":2}},
{"position":{"nodeid":3}},
{"position":{"nodeid":4}},
{"position":{"nodeid":5}}]},
{ "name" : "genome 2",
"mapping":[
{"position":{"nodeid":1}},
{"position":{"nodeid":7}},
{"position":{"nodeid":8}},
{"position":{"nodeid":4}},
{"position":{"nodeid":9}}]},
{ "name" : "genome 1 gene JJ",
"mapping":[
{"position":{"nodeid":2}},
{"position":{"nodeid":3}}]},
] ,
}
} This is the level we can do today without any further changes required from VG. At least we can hack this from the RDF side. The challenge is when we want to add a new gene-path that requires us to split an existing node. e.g. Thus modifying the original graph into the following one. {
"node": [{"sequence": "AA", "id": 1},
{"sequence": "C", "id": 2},
{"sequence": "C", "id": 3},
{"sequence": "AAG", "id": 4a},
{"sequence": "G", "id": 4b},
{"sequence": "T", "id": 5},
{"sequence": "A", "id": 6},
{"sequence": "AA", "id": 7},
{"sequence": "AG", "id": 8}
],
"path": [ { "name" : "genome 1",
"mapping":[
{"position":{"nodeid":1}},
{"position":{"nodeid":2}},
{"position":{"nodeid":3}},
{"position":{"nodeid":4a}},
{"position":{"nodeid":4b}},
{"position":{"nodeid":5}}]},
{ "name" : "genome 2",
"mapping":[
{"position":{"nodeid":1}},
{"position":{"nodeid":7}},
{"position":{"nodeid":8}},
{"position":{"nodeid":4a}},
{"position":{"nodeid":4b}},
{"position":{"nodeid":9}}]},
{ "name" : "genome 1 gene JJ",
"mapping":[
{"position":{"nodeid":2}},
{"position":{"nodeid":3}}]},
{ "name" : "genome 1 gene KK",
"mapping":[
{"position":{"nodeid":3}},
{"position":{"nodeid":4a}}]},
] ,
}
} PS. JSON syntax adapted from example https://github.com/vgteam/vg/blob/master/test/cyclic/reverse_self.json which is either out of date or something is odd with the json vg. |
@JervenBolleman Thanks for the illustration. In the AG, this is actually solved as you mentioned above: The nodes are split according to the annotation. |
@subwaystation The problem is that there is a lack of an tool to add the genes as paths in the graph in VG today. Specifically using the existing tools in side VG itself. Which is why @adamnovak suggested building a GAM file and using that. Unfortunately I can't find a spec for a GAM file :( |
Ah. I thought that you would insert them as follows: |
I'd suggest spooling path entities directly into the protobuf, but I think it needs rather a lot of raw information about nodes in the path to be. |
@nerdstrike that is as complicated as direct fixing the RDF or JSON. Mostly it's about efficiently splitting nodes and fixing up existing paths. I think someone will need to write a new vg subcommand for it. I was thinking of naming it inject but that might get confusing with surject. |
The GAM "spec" right now is whatever is written by stream.hpp operating on Basically it ends up as a gzip-compressed series of chunks, with each chunk consisting of an int64 record count, followed by that many int32-length-prefixed Protobuf-serialized However, rather than just dumping that out manually, you might be better off generating JSON to describe the alignment you want and running it through |
Note that you can leave out everything from your Alignment records apart
from a Name and a Path if you want.
So if you can express your annotations as a list of name/path pairs in JSON
like
{"path": {"mapping": [{"position": {"node_id": 1, "offset": 813,
"is_reverse": true}, "edit": [{"from_length": 9, "to_length": 9}], "rank":
1}]}, "name": "annotation1"}
{"path": {"mapping": [{"position": {"node_id": 1}, "edit": [{"from_length":
9, "to_length": 9}], "rank": 1}]}, "name": "annotation2"}
etc.
You can convert the list into gam like Adam said with vg view -JGa. Then
you can embed that gam as paths in the graph with vg mod -i. This will
take care of breaking nodes and updating existing paths.
…On Thu, Jan 18, 2018 at 1:01 PM, Adam Novak ***@***.***> wrote:
The GAM "spec" right now is whatever is written by stream.hpp
<https://github.com/vgteam/vg/blob/1f9951060e8c6be34fbd2a313d1f748e5df859c4/src/stream.hpp>
operating on Alignment objects from vg.proto
<https://github.com/vgteam/vg/blob/b6b2fd52880db49f69cef571977d8a8f6a162eb3/src/vg.proto#L104-L141>
.
Basically it ends up as a gzip-compressed series of chunks, with each
chunk consisting of an int64 record count, followed by that many
int32-length-prefixed Protobuf-serialized Alignments. The lengths and
counts are encoded with Protobuf's variable-length int encoding.
However, rather than just dumping that out manually, you might be better
off generating JSON to describe the alignment you want and running it
through vg view -JGa to serialize it in binary GAM format.
—
You are receiving this because you are subscribed to this thread.
Reply to this email directly, view it on GitHub
<#1381 (comment)>, or mute
the thread
<https://github.com/notifications/unsubscribe-auth/AA2_7hSUHmC6k37hLzLhbgcAqUYmsLRMks5tL4b3gaJpZM4Rh-nO>
.
|
@glennhickey @adamnovak thanks that is very helpful. Will try to play with this in the next days. |
I'm sorry that I haven't joined this conversation yet. In the hackathon I built a function in
Then you can add the alignments to the graph and you will have embedded them. If GBWT would support lookup related to path names then we would have an annotation database on vg. (Does it yet?) |
@ekg that is very helpful. Unfortunately the Ensembl bacteria species I have access to do not have BED files. I did look at the annotate command and was thinking about using the --novelty option but was not sure if I would have enough information to make it work. |
@JervenBolleman I see. How are the genes annotated then? I'm a little confused. If you have an example data set for a small region we could include it in the distribution as a way to drive tests. I do not believe that node ids should be a stable concept. What should be stable are the paths. Through them we can construct a query for a particular graph position that will be stable as long as the graphs we are querying have the same embedded path. If node ids need to be stable then we can simply begin with 1bp nodes, then all the problems with node splitting and merging disappear. These are only there for performance reasons. Keeping a node id per graph position is expensive in most use cases, and I would not recommend it. @subwaystation I am somewhat confused by AG. Why do you need to add a new concept for annotation to the graph? Are you familiar with the GBWT and gPBWT models that allow us to encode large numbers of annotations (aka haplotypes / paths / alignments) into compact queryable indexes that relate these paths to the graph? To support the colored graph concept that you are developing, it would be enough to use single-node paths with a consistent naming pattern or prefix. Of course doing so will be no more efficient in GBWT than in simpler encodings, but the concept will be generic to longer and shorter paths. If you are interested in this technical development, I would suggest finding how these models are insufficient in their current implementations and then building on them. That said you are completely free to do as you wish with the codebase! This is just a suggestion that may improve the reach of your work and prevent us from duplicating effort. |
@JervenBolleman For proof of concept, you could consider extracting BED from rest.ensemblgenomes.org/overlap/$species/$region?feature=gene;content-type=text/x-bed Sadly it's capped on region size to 500kb out of necessity. @ekg Ensembl don't produce BED in bulk. You'd normally make them for genome browsers, but our browser runs direct off the database. Not-awful alternatives are sadly limited to GFF. Everything else would need code to unpack the correct coordinates. |
@nerdstrike but can't we convert GFF into BED? Am I missing something obvious? |
If you like. One more hurdle. Us service developers are averse to these circuitous processes - they get fragile in production. It's more fun to bug developers to expose the right interface for us :) |
Big discussion on side topics and I think we should split that out into independent topics. Also what I think off as annotations are things like go terms etc.. not so much gene locations as such. i.e. not what it is but what it does. Especially I am interested in derivative product annotation i.e. mRNA, proteins and even their 3D structures as how it relates to genomic variation. @nerdstrike yes totally! ruby goldberg machines are cool on youtube not on saturday morning when your timesenstive production cycle blows up because one of the domino's fell over to early. |
I would like to describe the usage of As long as annotations are the subpath of existing paths represented like following BED format, they can be converted into GAM format. This is the graph generated from
It means there is an annotation on the path x from 13 to 17 named "gene". It can be converted into GAM format following via
Moreover, it can be integrated on VG using In this case, node identifiers are preserved but the offset of annotations is not preserved. Otherwise, nodes can split according to alignments using I found bugs in this function, so I fixed it (#1386). And I think I can add an interface to accept GFF or GTF format in addition to BED format if you need. |
@ekg Honestly, I am not quite familiar with the annotation concept implemented in |
I think this issue can be closed since Ensembl Bacteria now produces BED files and this can be used with vg annotate, index and mod to merge the gene annotation from Ensembl genomes into a VG pan genome graph. How this can be done is discussed in the wiki |
I would like to add existing genome annotations to a vg graph. The conceptual straightforward way is to see each genome annotation as a new path in the graph.
assume linear genome:a=> aaccaaggcc
we have a gene:cc=>cc
we know from existing work that this gene matches the first 'cc' pair in the linear path.
if we have vg graph like
so we have genome:a 1,2,3,4,5
and genome:b 1,7,8,4,9
then our gene is a path of node 2 and node 3
Method wise is this something that can be easily done using existing vg commands? Especially introducing new nodes as required when adding new "gene" paths?
The text was updated successfully, but these errors were encountered: