VG RDF generate stablish identifiers for nodes and paths #112

Open
JervenBolleman opened this issue Oct 1, 2015 · 4 comments

@JervenBolleman
Contributor

Currently the IRI identifying a node in the RDF representation of a VG is determined by the order of serialization. This means that, at this time, the IRIs are not global identifiers that one can use to place a VG RDF document on the web.

A better IRI generation scheme would perhaps use a hash of the input so that identical input generates identical IRIs.
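
As a rough illustration of the hash idea, here is a minimal Python sketch. It assumes that the hash of the raw input becomes part of the namespace and that the serializer numbers nodes in the same order for the same input. The function names, the `https://example.org/vg/` base and the 16-character truncation are hypothetical choices, not anything VG provides.

```python
import hashlib


def graph_namespace(base: str, input_bytes: bytes) -> str:
    """Derive a namespace from the raw input (hypothetical scheme)."""
    digest = hashlib.sha256(input_bytes).hexdigest()
    # Truncating to 16 hex characters is an arbitrary choice for readability.
    return f"{base}{digest[:16]}/"


def node_iri(namespace: str, node_number: int) -> str:
    """Node IRI = input-derived namespace + serialization-order number."""
    return f"{namespace}node/{node_number}"


ns_first = graph_namespace("https://example.org/vg/", b">chr1\nACTGACTG\n")
ns_again = graph_namespace("https://example.org/vg/", b">chr1\nACTGACTG\n")
ns_other = graph_namespace("https://example.org/vg/", b">chr1\nACTGACTA\n")
print(node_iri(ns_first, 1))
```

Identical input gives an identical namespace, and any change to the input gives a different one, so node numbers only need to be stable within one build.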

UUIDs could be used for the IRIs, but they compress very badly and are hard to regenerate identically.

@ekg
Member

ekg commented Oct 23, 2015

We could use some kind of representation of the node and its context to do this. However, due to the breaky nature of vg nodes (editing breaks them) I'm not sure if this is the right approach. ga4gh/ga4gh-schemas#444 suggests one path forward. If you have ideas I'm all ears.

@JervenBolleman
Contributor Author

I am not a fan of the idea in ga4gh/ga4gh-schemas#444, mostly because the problem there is using linear coordinates to search a graph space, which is going to be tricky no matter what your graph representation is. But that does not really matter; I think I am getting confused by the in-group terminology again.

For now, a small braindump about possible solutions that work in the RDF setting.

The aim is that the same input gives the same ids; different input won't and can't give the same identifiers, and that is OK.

In RDF, as in XML, ids are namespaced, e.g. ftp://example.org/GRCh38/VG/assembledbyjerven/2015/Chr1/node/XXXXXX and ftp://example.org/GRCh38/VG/assembledbyekg/2015/Chr2/node/XXXXXX. The namespaces ensure (normally) that global id collisions don't occur. These namespaces are long but are 100% compressible in any sane file format and storage system.

We do need a way to number the nodes in the graph in a predictable way. This means we need a stable graph algorithm that visits each node at least once in a repeatable manner. Then we just number the nodes from 1 to ∞.
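
One way such a repeatable numbering could look is sketched below in Python. It is a simplification under loud assumptions: the graph is treated as a plain DAG (no cycles, no strand handling), and ties between nodes that become ready at the same time are broken by sequence and then by the input label. That last fallback means two ready nodes with the same sequence would still depend on the input labels. `number_nodes` is a hypothetical name, not a VG function.

```python
import heapq


def number_nodes(sequences, edges):
    """Number nodes 1..n with a deterministic topological traversal.

    sequences: dict mapping a node label to its sequence.
    edges: iterable of (from_label, to_label) pairs; assumed to form a DAG.
    """
    indegree = {node: 0 for node in sequences}
    successors = {node: [] for node in sequences}
    for src, dst in edges:
        successors[src].append(dst)
        indegree[dst] += 1

    # The heap pops the smallest (sequence, label) first, so the visiting
    # order does not depend on the order in which edges were listed.
    ready = [(sequences[n], n) for n in sequences if indegree[n] == 0]
    heapq.heapify(ready)

    numbering = {}
    next_id = 1
    while ready:
        _, node = heapq.heappop(ready)
        numbering[node] = next_id
        next_id += 1
        for nxt in successors[node]:
            indegree[nxt] -= 1
            if indegree[nxt] == 0:
                heapq.heappush(ready, (sequences[nxt], nxt))

    if len(numbering) != len(sequences):
        raise ValueError("graph has a cycle; this sketch only handles DAGs")
    return numbering


seqs = {"a": "ACT", "b": "G", "c": "T", "d": "GACTG"}
forward = number_nodes(seqs, [("a", "b"), ("a", "c"), ("b", "d"), ("c", "d")])
shuffled = number_nodes(seqs, [("c", "d"), ("a", "c"), ("b", "d"), ("a", "b")])
```

Both calls give the same numbering even though the edges arrive in a different order, which is the property the paragraph above asks for.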

Then, for every edit we make, we just increment the id sequence by 1 for each new node. Given the history of edits, we can then always rebuild the same node graph. The fact that one node is split into multiple nodes does not mean that the old node needs to be deleted; for many purposes it can just be disconnected from the global graph, maybe with new predicates, e.g. wasAttachedOn5PrimeEndOf. These disconnected nodes can be taken out of RAM and logged to disk.
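
A minimal Python sketch of that bookkeeping, assuming every edit is a split of one node at an offset and that the edit history is stored as a list of `(node_id, offset)` pairs. The `NodeIdAllocator` class and the `replay` function are hypothetical helpers for illustration only.

```python
class NodeIdAllocator:
    """Hands out new node ids sequentially and logs every split."""

    def __init__(self, initial_node_count):
        self.next_id = initial_node_count + 1
        self.detached = []  # old nodes are kept, only disconnected
        self.log = []       # (old_id, offset, left_id, right_id) per split

    def split(self, node_id, offset):
        left_id, right_id = self.next_id, self.next_id + 1
        self.next_id += 2
        self.detached.append(node_id)
        self.log.append((node_id, offset, left_id, right_id))
        return left_id, right_id


def replay(initial_node_count, edits):
    """Rebuild the id assignment from the edit history alone."""
    allocator = NodeIdAllocator(initial_node_count)
    for node_id, offset in edits:
        allocator.split(node_id, offset)
    return allocator.log


history = [(3, 2), (5, 1)]
first_run = replay(4, history)
second_run = replay(4, history)
```

Replaying the same history from the same starting graph always assigns the same ids, and the split nodes stay available in `detached` rather than being deleted.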

Another option: if the original node was named ftp://example.org/GRCh38/VG/byekg/2015/Chr2/node/12345 and it is split into two, we can name the new nodes after the split point. E.g. assume node ex:12345 has sequence actgactg and is split into act and gactg; then we name those nodes ex:12345s0l3 and ex:12345s3l5. In other words, once we have our first graph we have a tree encoding for determining new node ids that are proper splits of original nodes. The nice thing is that in a sorted list these kinds of ids compress really nicely.

The new variant from a patient that required the snip can be named in a patient namespace, e.g.
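
The naming rule can be written down in a few lines of Python. This sketch assumes the suffix is `s<start>l<length>` with a 0-based start, as in the ex:12345s0l3 / ex:12345s3l5 example; `split_node` is a hypothetical helper name.

```python
def split_node(node_id: str, sequence: str, offset: int):
    """Split a node at `offset` and name both halves after the split point."""
    if not 0 < offset < len(sequence):
        raise ValueError("offset must fall strictly inside the sequence")
    left, right = sequence[:offset], sequence[offset:]
    return [
        (f"{node_id}s0l{len(left)}", left),
        (f"{node_id}s{offset}l{len(right)}", right),
    ]


children = split_node("ex:12345", "actgactg", 3)
print(children)
```

Splitting a child again would, under this sketch, simply append another suffix to the child's id. Whether nested splits should be encoded that way or relative to the original node is left open here.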

patient:ABC a <Patient> ;
    <has_variant> variant:XYZ .
variant:XYZ a <SNP> ;
    <has_node> <http://example.org/patient/12354/XYZ/node/1> .
<http://example.org/patient/12354/XYZ/node/1> vg:linksOn35to53 ex:12345s0l3 .

The same goes for new major revisions of the reference VG. E.g. nodes in GRCh38 are in a new namespace when they are not simple splits of, or identical to, nodes in GRCh37.

Of course these are only external node ids; in memory they are just integers/pointers into a lookup table, which can be stored very efficiently using delta encoding for most nodes.
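
To illustrate why delta encoding helps, here is a small Python sketch. It assumes the in-memory ids are integers kept in sorted order, so that each id can be stored as the difference from the previous one. The function names are hypothetical.

```python
def delta_encode(sorted_ids):
    """Store each id as the difference from the previous id."""
    deltas = []
    previous = 0
    for value in sorted_ids:
        deltas.append(value - previous)
        previous = value
    return deltas


def delta_decode(deltas):
    """Rebuild the original ids with a running sum."""
    ids = []
    total = 0
    for delta in deltas:
        total += delta
        ids.append(total)
    return ids


encoded = delta_encode([1000, 1001, 1002, 1005])
decoded = delta_decode(encoded)
```

Consecutive ids turn into runs of small numbers, which a variable-length integer encoding or a general-purpose compressor handles well.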

The nice thing is that most node edits are data driven, e.g. by adding a new sample file you need new nodes for the side graph to work. Then, while all the nodes in the original graph as well as their split children are in the original namespace, all the new nodes can be put into a new namespace,
e.g. something like https://example.org/uk1000/patient1/Chr1/node/XXXXXX :linksOn35to53 ftp://example.org/official/GRCh38/Chr1/node/YYYYYY.

It definitely allows us to merge different VGs from different sources, and we can deal with weird cases such as cancer genomes, HeLa lines, etc. being merged into a super graph.

@JervenBolleman
Contributor Author

This has come up a few times in the last few years and I would like to note down my current thinking on this for future reference.

@ekg has convinced me that nodes are identified by the VG that they are in and that we should only care about path identification, i.e. it is OK that VG nodes are "blank nodes" in the RDF world.

However, the path identification is more critical than ever in this world.
Currently VG just stores a name for a path. This name is identical to the linear sequence that was inserted/mapped into the graph. The names need to be IRIs in the RDF output, which is easy enough to do/fix.

It is critical to note that the IRIs of the paths need to be distinct from the IRI of the linear sequence that each path is a representation of. I will try to explain why with a temporal example.

Assume that we build a VG graph from two linear genomes (FASTA files) A and B. If we do the same a year later using a new version of VG, the VG graph produced can be different, i.e. we have 4 paths through 2 VG graphs for only 2 genomes. If these paths are exported into different formats, we should give them different IRIs or extended names. Otherwise we end up with two path descriptions getting merged into one, leading to collisions and confusion.

To solve that, VG should maintain in the path name some minimal form of provenance of how the path was arrived at.
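
What "minimal provenance" could look like is sketched below in Python. It assumes the VG version and the build date are enough to tell two builds apart. The IRI layout, the `path_iri` function and the `https://example.org/` base are hypothetical; nothing here reflects how VG names paths today.

```python
from urllib.parse import quote


def path_iri(base: str, sequence_name: str, vg_version: str, build_date: str) -> str:
    """Build a path IRI that carries the build provenance (hypothetical layout)."""
    version, date, name = (
        quote(part, safe="") for part in (vg_version, build_date, sequence_name)
    )
    return f"{base}vg/{version}/{date}/path/{name}"


# IRI of the linear sequence itself, kept separate from any path IRI.
sequence_iri = "https://example.org/genome/A"
path_2015 = path_iri("https://example.org/", "A", "v1.0", "2015-10-01")
path_2016 = path_iri("https://example.org/", "A", "v2.0", "2016-10-01")
```

The two builds of genome A get two distinct path IRIs, and neither collides with the IRI of the linear sequence.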
