Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Inserting existing annotations on a genome 'path' into a vg graph #1381

Closed
JervenBolleman opened this issue Jan 17, 2018 · 27 comments
Closed

Comments

@JervenBolleman
Copy link
Contributor

I would like to add existing genome annotations to a vg graph. The conceptual straightforward way is to see each genome annotation as a new path in the graph.

assume linear genome:a=> aaccaaggcc
we have a gene:cc=>cc
we know from existing work that this gene matches the first 'cc' pair in the linear path.

if we have vg graph like

aa-c-c-aagg-cc
  \t-a/    \gg

so we have genome:a 1,2,3,4,5
and genome:b 1,7,8,4,9
then our gene is a path of node 2 and node 3

Method wise is this something that can be easily done using existing vg commands? Especially introducing new nodes as required when adding new "gene" paths?

@adamnovak
Copy link
Member

I think if you express your annotations as GAM alignments to the graph, you can use the -i option of vg mod to embed them, breaking any nodes.

What you would need then would be a way to convert an annotation specified in path coordinates into a mapping in node coordinates. I don't think we have anything that does that right now. You can use vg find -p {path name}:{start}-{end} to extract the nodes relevant to a given path region, but then you would need to use vg find -n {node} -P {path name} to nget the actual coordinate of each returned node in the path, and use your own code to synthesize a perfect-match alignment to the relevant nodes.

You might be able to get away with extracting the region relevant to the gene, mapping the gene's reference sequence against that region with vg align, and then embedding that alignment in the original graph with vg mod -i, if you assume that the gene's sequence will align the right way to the extracted graph region.

@JervenBolleman
Copy link
Contributor Author

Is there some documentation for the GAM file available. My google/bing attempts are not getting me to any helpful pages.

@subwaystation
Copy link
Member

If I may step in, the ag (annotation graph) code from Computomics offers a way to store annotation.
That is each node can be annotated by a list of IDs. Each of these IDs correspond to a feature.
Of course the nodes as well as the annotation are still indexible.

For more please checkout https://gitlab.codenic.de/computomics/ag/blob/master/src/ag.hpp.
And for the modified xg: https://gitlab.codenic.de/computomics/xg

I think it could be worth to take a look at it.

@JervenBolleman
Copy link
Contributor Author

@subwaystation thank you for letting me know about ag, but it is a technical direction I am personally not interested in. My aim is to use VG for building variation graphs and for sequence orientated searches, but use SPARQL and RDF for annotation and integration with existing systems. Mostly my aim is to combine genomic variation graphs with protein variation graphs with all of the richness of Ensembl and UniProt annotation and I do not think it economically feasible to push this all into a VG/XG like system.

@subwaystation
Copy link
Member

@JervenBolleman That makes total sense from my point of view. Thanks for pointing it out. I am curious how you are going to implement this :)

@JervenBolleman
Copy link
Contributor Author

@subwaystation An old wiki page about the general concept is here https://github.com/vgteam/vg/wiki/Annotating-a-VG-graph-the-RDF-way and the specific hack attack I would like to pick up is here https://github.com/vgteam/vg/wiki/VG-RDF,-the-Ensembl-bacteria-E.-coli-genome-hack-attack

Which was stopped at the time exactly at the point of needing to add "gene" paths to the VG. With the node breaking in the existing genome paths logic being the most significant issue.

@subwaystation
Copy link
Member

@JervenBolleman Thanks!
One question: So we kind of annotated node 1-3,4-1,and 5-2 to be part of the reference genome. Why don't you take the actual (I think unique) node IDs of VG? Or is this just to exemplify what you have in mind?

@JervenBolleman
Copy link
Contributor Author

JervenBolleman commented Jan 18, 2018

@subwaystation its just an example to illustrate the thinking. Basically I see current existing gene annotations as sub paths of the reference genome path. So yes it would use the nodeIDs of VG (of course there is whole different set of problems with current node ID assignment in VG, that is the topic of different issues/pages).

In the JSON syntax we have the existing VG graph of the two genomes

{
  "node": [{"sequence": "AA", "id": 1},
		{"sequence": "C", "id": 2},
		{"sequence": "C", "id": 3},
		{"sequence": "AAGG", "id": 4},
		{"sequence": "T", "id": 5},
		{"sequence": "A", "id": 6},
		{"sequence": "AA", "id": 7},
		{"sequence": "AG", "id": 8}
	],
  "path": [ { "name" : "genome 1", 
                     "mapping":[ 
                       {"position":{"nodeid":1}},
                       {"position":{"nodeid":2}},
                       {"position":{"nodeid":3}},
                       {"position":{"nodeid":4}},
                       {"position":{"nodeid":5}}]},
               { "name" : "genome 2", 
                    "mapping":[ 
                       {"position":{"nodeid":1}},
                       {"position":{"nodeid":7}},
                       {"position":{"nodeid":8}},
                       {"position":{"nodeid":4}},
                       {"position":{"nodeid":9}}]},
             { "name" : "genome 1 gene JJ", 
                    "mapping":[ 
                       {"position":{"nodeid":2}},
                       {"position":{"nodeid":3}}]},
              ] ,
  }
}

This is the level we can do today without any further changes required from VG. At least we can hack this from the RDF side.

The challenge is when we want to add a new gene-path that requires us to split an existing node. e.g.
we add a gene "aaag" that uses node:6 and the first three nucleotides of node:4 requiring us to split node:4 into node:4a and node:4b and update the existing paths to have the new node identifiers etc...

Thus modifying the original graph into the following one.

{
  "node": [{"sequence": "AA", "id": 1},
		{"sequence": "C", "id": 2},
		{"sequence": "C", "id": 3},
		{"sequence": "AAG", "id": 4a},
		{"sequence": "G", "id": 4b},
		{"sequence": "T", "id": 5},
		{"sequence": "A", "id": 6},
		{"sequence": "AA", "id": 7},
		{"sequence": "AG", "id": 8}
	],
  "path": [ { "name" : "genome 1", 
                     "mapping":[ 
                       {"position":{"nodeid":1}},
                       {"position":{"nodeid":2}},
                       {"position":{"nodeid":3}},
                       {"position":{"nodeid":4a}},
                       {"position":{"nodeid":4b}},
                       {"position":{"nodeid":5}}]},
               { "name" : "genome 2", 
                    "mapping":[ 
                       {"position":{"nodeid":1}},
                       {"position":{"nodeid":7}},
                       {"position":{"nodeid":8}},
                       {"position":{"nodeid":4a}},
                       {"position":{"nodeid":4b}},
                       {"position":{"nodeid":9}}]},
             { "name" : "genome 1 gene JJ", 
                    "mapping":[ 
                       {"position":{"nodeid":2}},
                       {"position":{"nodeid":3}}]},
             { "name" : "genome 1 gene KK", 
                    "mapping":[ 
                       {"position":{"nodeid":3}},
                       {"position":{"nodeid":4a}}]},
              ] ,
  }
}

PS. JSON syntax adapted from example https://github.com/vgteam/vg/blob/master/test/cyclic/reverse_self.json which is either out of date or something is odd with the json vg.

@subwaystation
Copy link
Member

@JervenBolleman Thanks for the illustration. In the AG, this is actually solved as you mentioned above: The nodes are split according to the annotation.
But I am not quite sure if I understood your problem correctly: There are paths in the graph of both genomes and genes. And when you add these genes as paths in the graph, nodes have to be split and therefore the node IDs change. The JSON should be updated automatically, right? But you miss the mapping of this on the RDF site? Because the IDs changed?

@JervenBolleman
Copy link
Contributor Author

JervenBolleman commented Jan 18, 2018

@subwaystation The problem is that there is a lack of an tool to add the genes as paths in the graph in VG today. Specifically using the existing tools in side VG itself. Which is why @adamnovak suggested building a GAM file and using that. Unfortunately I can't find a spec for a GAM file :(

@subwaystation
Copy link
Member

Ah. I thought that you would insert them as follows:
GFF -> FASTA. Then align all sequences. But they could be aligned anywhere of course.
Actually I thought that what you want is possible, yet. Obviously it is not.
But the rest of your concept makes looks good.

@nerdstrike
Copy link

I'd suggest spooling path entities directly into the protobuf, but I think it needs rather a lot of raw information about nodes in the path to be.

@JervenBolleman
Copy link
Contributor Author

@nerdstrike that is as complicated as direct fixing the RDF or JSON. Mostly it's about efficiently splitting nodes and fixing up existing paths. I think someone will need to write a new vg subcommand for it. I was thinking of naming it inject but that might get confusing with surject.

@adamnovak
Copy link
Member

The GAM "spec" right now is whatever is written by stream.hpp operating on Alignment objects from vg.proto.

Basically it ends up as a gzip-compressed series of chunks, with each chunk consisting of an int64 record count, followed by that many int32-length-prefixed Protobuf-serialized Alignments. The lengths and counts are encoded with Protobuf's variable-length int encoding.

However, rather than just dumping that out manually, you might be better off generating JSON to describe the alignment you want and running it through vg view -JGa to serialize it in binary GAM format.

@glennhickey
Copy link
Contributor

glennhickey commented Jan 18, 2018 via email

@JervenBolleman
Copy link
Contributor Author

@glennhickey @adamnovak thanks that is very helpful. Will try to play with this in the next days.

@ekg
Copy link
Member

ekg commented Jan 19, 2018

I'm sorry that I haven't joined this conversation yet. In the hackathon I built a function in vg annotate that converts BED records into GAM records. I apologize that it is not well documented. It needs to be properly tested.

vg annotate -b x.bed -x x.xg >x.gam

Then you can add the alignments to the graph and you will have embedded them. If GBWT would support lookup related to path names then we would have an annotation database on vg. (Does it yet?)

@JervenBolleman
Copy link
Contributor Author

@ekg that is very helpful. Unfortunately the Ensembl bacteria species I have access to do not have BED files. I did look at the annotate command and was thinking about using the --novelty option but was not sure if I would have enough information to make it work.

@ekg
Copy link
Member

ekg commented Jan 19, 2018

@JervenBolleman I see. How are the genes annotated then? I'm a little confused. If you have an example data set for a small region we could include it in the distribution as a way to drive tests.

I do not believe that node ids should be a stable concept. What should be stable are the paths. Through them we can construct a query for a particular graph position that will be stable as long as the graphs we are querying have the same embedded path. If node ids need to be stable then we can simply begin with 1bp nodes, then all the problems with node splitting and merging disappear. These are only there for performance reasons. Keeping a node id per graph position is expensive in most use cases, and I would not recommend it.

@subwaystation I am somewhat confused by AG. Why do you need to add a new concept for annotation to the graph? Are you familiar with the GBWT and gPBWT models that allow us to encode large numbers of annotations (aka haplotypes / paths / alignments) into compact queryable indexes that relate these paths to the graph? To support the colored graph concept that you are developing, it would be enough to use single-node paths with a consistent naming pattern or prefix. Of course doing so will be no more efficient in GBWT than in simpler encodings, but the concept will be generic to longer and shorter paths. If you are interested in this technical development, I would suggest finding how these models are insufficient in their current implementations and then building on them. That said you are completely free to do as you wish with the codebase! This is just a suggestion that may improve the reach of your work and prevent us from duplicating effort.

@nerdstrike
Copy link

@JervenBolleman For proof of concept, you could consider extracting BED from rest.ensemblgenomes.org/overlap/$species/$region?feature=gene;content-type=text/x-bed

Sadly it's capped on region size to 500kb out of necessity.

@ekg Ensembl don't produce BED in bulk. You'd normally make them for genome browsers, but our browser runs direct off the database. Not-awful alternatives are sadly limited to GFF. Everything else would need code to unpack the correct coordinates.

@ekg
Copy link
Member

ekg commented Jan 19, 2018

@nerdstrike but can't we convert GFF into BED? Am I missing something obvious?

@nerdstrike
Copy link

If you like. One more hurdle. Us service developers are averse to these circuitous processes - they get fragile in production. It's more fun to bug developers to expose the right interface for us :)

@JervenBolleman
Copy link
Contributor Author

Big discussion on side topics and I think we should split that out into independent topics.
@ekg would you mind opening an issue on node identifiers and if they should be stable or deterministic? It distracts from this issue.

Also what I think off as annotations are things like go terms etc.. not so much gene locations as such. i.e. not what it is but what it does. Especially I am interested in derivative product annotation i.e. mRNA, proteins and even their 3D structures as how it relates to genomic variation.

@nerdstrike yes totally! ruby goldberg machines are cool on youtube not on saturday morning when your timesenstive production cycle blows up because one of the domino's fell over to early.

@6br
Copy link
Collaborator

6br commented Jan 20, 2018

I would like to describe the usage of vg annotate -b BED -x XG -p. A concept (but undocumented...) is here #1036.

As long as annotations are the subpath of existing paths represented like following BED format, they can be converted into GAM format.

This is the graph generated from test/tiny/tiny.vcf.gz and test/tiny/tiny.fa. x3
If you would like to get an alignment of GAAA from node 8 to node 9, which is a subpath of path x, BED file is described as following.

x      13        17      gene

It means there is an annotation on the path x from 13 to 17 named "gene". It can be converted into GAM format following via vg annotate -b example.bed -p -x tiny.xg.

{"path": {"mapping": [{"edit": [{"from_length": 1, "to_length": 1}], "position": {"node_id": 8}, "rank": 5}, {"edit": [{"from_length": 3, "to_length": 3}], "position": {"node_id": 9}, "rank": 6}]}, "name": "gene"}

Moreover, it can be integrated on VG using vg mod -P --include-aln.

result

In this case, node identifiers are preserved but the offset of annotations is not preserved. Otherwise, nodes can split according to alignments using vg mod --include-aln.

result2

I found bugs in this function, so I fixed it (#1386). And I think I can add an interface to accept GFF or GTF format in addition to BED format if you need.

@JervenBolleman
Copy link
Contributor Author

@ekg I forgot there is an already existing issue #112 which is about stable node identifiers.

@subwaystation
Copy link
Member

@ekg Honestly, I am not quite familiar with the annotation concept implemented in vg. What I got so far is that annotations are basically paths through the graph, right? Can you point out to me, where I could find deep information about GBWT and gPBWT? I did not code any part of AG, so sometimes I am confused about it, too ;) However, what I would have suspected is that the direct annotation of a single node is much faster in practice than the annotation of a single node with a path for each annotation. But that is pure theory here.
Currently I am not developing anything, neither code for the graph nor code for the VIZ. All of what you can observe in AG and AGV is as it is. Also, because I am not working for CTX. But, I completely am on your side about the duplication effort. I would not have wanted that either.
Thanks for the helpful tips!

@JervenBolleman
Copy link
Contributor Author

I think this issue can be closed since Ensembl Bacteria now produces BED files and this can be used with vg annotate, index and mod to merge the gene annotation from Ensembl genomes into a VG pan genome graph. How this can be done is discussed in the wiki

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

7 participants