VG RDF: generate stable identifiers for nodes and paths #112
Comments
We could use some kind of representation of the node and its context to do this. However, because vg nodes are fragile (editing breaks them), I'm not sure this is the right approach. ga4gh/ga4gh-schemas#444 suggests one path forward. If you have ideas, I'm all ears.
I am not a fan of the idea in ga4gh/ga4gh-schemas#444, mostly because the problem there is using linear coordinates to search a graph space, which is going to be tricky no matter what your graph representation is. But that does not really matter; I think I am getting confused by the in-group terminology again.

For now, a small braindump about possible solutions that work in the RDF solution. The aim is that the same input gives the same ids; different input won't and can't give the same identifiers, and that is OK.

In RDF, as in XML, ids are namespaced, e.g. ftp://example.org/GRCh38/VG/assembledbyjerven/2015/Chr1/node/XXXXXX and ftp://example.org/GRCh38/VG/assembledbyekg/2015/Chr2/node/XXXXXX. The namespaces ensure (normally) that global id collisions don't occur. These namespaces are long, but they compress almost completely in any sane file format and storage system.

We do need a way to number the nodes in the graph in a predictable way. This means we need a stable graph algorithm that visits each node at least once in a repeatable manner. Then we just number the nodes from 1 upwards. For every edit we make, we increment the id sequence by 1 for each new node. Given the history of edits, we can then always rebuild the same node graph.

The fact that one node is split into multiple nodes does not mean that the old node needs to be deleted; for many purposes it can just be disconnected from the global graph, maybe with new predicates, e.g. was attachedOn5PrimeEndOf. These disconnected nodes can be taken out of RAM and logged to disk.

Another option: if the original node was named ftp://example.org/GRCh38/VG/byekg/2015/Chr2/node/12345 and it is split into two, we can name the new nodes after the split point, e.g. assume node …

The new variant from a patient that required the snip can be named in a patient namespace, e.g.

patient:ABC a <Patient> ;
    <has_variant> variant:XYZ .
variant:XYZ a <SNP> ;
    <has_node> <http://example.org/patient/12354/XYZ/node/1> .
<http://example.org/patient/12354/XYZ/node/1> vg:linksOn35to53 ex:12345s0l3 .

The same goes for new major revisions of the reference VG. E.g. nodes in GRCh38 are in a new namespace when they are not simple splits of, or identical to, nodes in GRCh37.

Of course these are only external node ids; in memory they are just integers/pointers into a lookup table, which can be stored very efficiently using delta encodings for most nodes.

The nice thing is that most node edits are data driven, e.g. by adding a new sample file you need new nodes for the side graph to work. Then, while all the nodes in the original graph as well as their split children are in the original namespace, all the new nodes can be put into a new namespace. It definitely allows us to merge different VGs from different sources, and we can deal with weird cases such as cancer genomes, HeLa lines, etc. being merged into a super graph.
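The numbering idea in this comment can be sketched roughly as follows. This is only an illustration, not how vg works: the graph layout (`sequences`, `edges`), the helper names, and the choice of a breadth-first walk ordered by node sequence are all assumptions. It also assumes every node is reachable from a node without predecessors and that sibling nodes have distinct sequences, otherwise the order is not fully determined.

```python
from collections import deque

# Namespace taken from the example IRIs in the comment above.
NAMESPACE = "ftp://example.org/GRCh38/VG/assembledbyjerven/2015/Chr1/node/"


def number_nodes(sequences, edges):
    """Number nodes 1..n with a repeatable breadth-first walk.

    sequences: {handle: DNA string}; edges: {handle: [successor handles]}.
    Handles are arbitrary in-memory names and never influence the order.
    """
    indegree = {handle: 0 for handle in sequences}
    for targets in edges.values():
        for target in targets:
            indegree[target] += 1

    def by_sequence(handle):
        return sequences[handle]

    # Start from nodes without predecessors, ordered by their sequence.
    heads = [h for h, count in indegree.items() if count == 0]
    queue = deque(sorted(heads, key=by_sequence))
    ids = {}
    while queue:
        handle = queue.popleft()
        if handle in ids:
            continue  # already numbered via another predecessor
        ids[handle] = len(ids) + 1
        queue.extend(sorted(edges.get(handle, ()), key=by_sequence))
    return ids


def add_nodes(ids, new_handles):
    """Give nodes created by an edit the next free ids, in edit order.

    Old nodes are kept (only disconnected), so len(ids) is the last id used.
    """
    for handle in new_handles:
        ids[handle] = len(ids) + 1
    return ids


def node_iri(node_id):
    """External identifier: namespace plus the stable number."""
    return f"{NAMESPACE}{node_id}"


sequences = {"x": "ACG", "y": "T", "z": "G", "w": "TTA"}
edges = {"x": ["y", "z"], "y": ["w"], "z": ["w"]}
ids = number_nodes(sequences, edges)
```

Replaying the same list of edits through `add_nodes` in the same order would reproduce the same ids, which is the "rebuild from edit history" property described above.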
This has come up a few times in the last few years, and I would like to note down my current thinking for future reference. @ekg has convinced me that nodes are identified by the VG that they are in, and that we should only care about path identification, i.e. it is OK that VG nodes are "blank nodes" in the RDF world. However, path identification is more critical than ever in this world.

It is important to note that the IRI of a path needs to be distinct from the IRI of the linear sequence that the path represents. I will try to explain why with a temporal example. Assume that we build a VG graph of two linear (FASTA) genomes, A and B. We then do the same a year later using a new version of VG, and the VG graph produced can be different, i.e. we have 4 paths through 2 VG graphs for only 2 genomes. If these paths are exported into different formats, we should give them different IRIs or extended names. Otherwise we end up with two path descriptions getting merged into one, leading to collisions and confusion. To solve that, VG should maintain in the path name some minimal form of provenance describing how the path was arrived at.
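A minimal sketch of what "provenance in the path name" could look like. The base IRI, the IRI layout, and the version and date labels are all hypothetical and chosen only for illustration; they are not real vg releases or an existing vg naming scheme.

```python
from urllib.parse import quote

BASE = "http://example.org/vg"  # hypothetical base IRI


def sequence_iri(genome):
    """IRI of the linear sequence itself, independent of any graph build."""
    return f"{BASE}/sequence/{quote(genome, safe='')}"


def path_iri(genome, vg_version, build_date):
    """IRI of one path: the genome name plus minimal build provenance."""
    provenance = f"{quote(vg_version, safe='')}/{quote(build_date, safe='')}"
    return f"{BASE}/build/{provenance}/path/{quote(genome, safe='')}"


# Two genomes built twice, a year apart, with different (made-up) versions.
builds = [("v1.0", "2015-06-01"), ("v1.1", "2016-06-01")]
path_iris = {path_iri(g, v, d) for g in ("A", "B") for v, d in builds}
sequence_iris = {sequence_iri(g) for g in ("A", "B")}
```

With this layout the two builds yield four distinct path IRIs for the two sequence IRIs, so exporting both graphs cannot merge two path descriptions into one.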
Currently the IRI identifying a node in the RDF representation of a VG is determined by the order of serialization. This means that, at this time, the IRIs are not global identifiers that one can use to place a VG RDF document on the web.
A better IRI generation scheme would perhaps use a hash of the input so that identical input generates identical IRIs.
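One way to read "a hash of the input" is to derive the namespace from a digest of the input files. The sketch below assumes SHA-256, a 16-character prefix, and a hypothetical base IRI; none of these come from vg. It only fixes the namespace, so the per-node number still needs a repeatable ordering.

```python
import hashlib

BASE = "http://example.org/vg"  # hypothetical base IRI


def graph_namespace(inputs):
    """Derive a node namespace from the bytes of the input files.

    Each input is length-prefixed so that moving bytes from one file to
    another changes the digest.
    """
    digest = hashlib.sha256()
    for data in inputs:
        digest.update(len(data).to_bytes(8, "big"))
        digest.update(data)
    return f"{BASE}/{digest.hexdigest()[:16]}/node/"


def node_iri(inputs, node_id):
    """Identical input and node number always give the identical IRI."""
    return f"{graph_namespace(inputs)}{node_id}"


first = node_iri([b">chr1\nACGT\n"], 1)
second = node_iri([b">chr1\nACGT\n"], 1)
other = node_iri([b">chr1\nACGA\n"], 1)
```

Here `first` and `second` are equal because the input bytes are equal, while `other` lands in a different namespace.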
UUIDs could be used for the IRIs, but they compress very badly and are hard to regenerate identically.
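The compression point can be checked with a quick experiment. This sketch assumes random (version 4) UUIDs, a hypothetical namespace, and zlib as a stand-in for whatever compression the storage format uses.

```python
import uuid
import zlib

NAMESPACE = "http://example.org/vg/node/"  # hypothetical namespace


def compressed_size(iris):
    """Size in bytes of the newline-joined IRIs after zlib compression."""
    return len(zlib.compress("\n".join(iris).encode("utf-8"), 9))


# 1000 sequentially numbered IRIs versus 1000 random-UUID IRIs.
sequential = [f"{NAMESPACE}{n}" for n in range(1, 1001)]
random_uuids = [f"{NAMESPACE}{uuid.uuid4()}" for _ in range(1000)]

sequential_size = compressed_size(sequential)
uuid_size = compressed_size(random_uuids)
print(sequential_size, uuid_size)
```

The shared namespace compresses away in both lists, but the random part of each UUID does not, and rerunning the script produces a different set of UUID IRIs each time.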