
Add local db for performance reasons #10

Open
Panaetius opened this issue May 4, 2020 · 5 comments

@Panaetius
Member

Instead of saving a flat list as a .json or .yaml file, it'd be nice to have a local db (key-value store) that saves objects by IRI.

This could be done with https://docs.python.org/3/library/dbm.html#module-dbm, https://github.com/coleifer/unqlite-python or https://github.com/RaRe-Technologies/sqlitedict (though the latter doesn't seem to be actively supported). We should also check whether there are other solutions that fit our needs better.
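A rough sketch of the key-value idea using the stdlib dbm module (file name and payload are made up, just to illustrate storing objects by IRI):

import dbm
import json

# store each object's JSON-LD representation under its IRI
with dbm.open("calamus-cache", "c") as db:
    iri = "https://example.com/datasets/1"
    db[iri] = json.dumps({"@id": iri, "@type": "http://schema.org/Dataset"})

    # later: look the object up by IRI instead of scanning a flat file
    data = json.loads(db[iri])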

@Panaetius Panaetius added this to the sprint-2020-05-28 milestone May 26, 2020
@Panaetius Panaetius self-assigned this May 27, 2020
@Panaetius
Member Author

After some more thought, adding a database to calamus is not really needed; we can use rdflib for that, since it already supports a multitude of DB backends through plugins.

Instead, we can just support an rdflib graph object, since that already allows reading/writing triples, and rdflib supports several backends (including SPARQL endpoints, solving #9). Then we don't need to worry about how to store things; we just need to interact with rdflib.

The main piece needed for this is the ability to deserialize rdflib triples into calamus objects and vice versa.

An easy implementation would be to just use the rdflib-jsonld plugin to convert python objects <-> JSON-LD <-> rdflib representation. But that plugin does not support JSON-LD 1.1 yet, and using rdflib types directly would likely perform better.

So in addition to JSON-LD, it'd be great if we could serialize to rdflib triples directly as well.
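For reference, the "easy" round trip via the rdflib-jsonld plugin would look roughly like this (a minimal sketch, assuming the plugin is installed so rdflib understands format="json-ld"; DatasetSchema/dataset are the usual calamus schema and model, and details like flattening/framing of the JSON-LD output are glossed over):

import json

from rdflib import Graph

# python object -> JSON-LD dict (existing calamus serialization)
jsonld_doc = DatasetSchema().dump(dataset)

# JSON-LD -> rdflib triples
graph = Graph()
graph.parse(data=json.dumps(jsonld_doc), format="json-ld")

# rdflib triples -> JSON-LD -> python object
jsonld_back = json.loads(graph.serialize(format="json-ld"))
dataset_again = DatasetSchema().load(jsonld_back[0])  # picking the right node is glossed over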

Proposal for an interface:

import rdflib

# open a persistent local store and bind it to a graph
graph = rdflib.Graph('Sleepycat', identifier='mygraph')
graph.open('/home/user/data/myRDFLibStore', create=True)

# load an entity by its IRI directly from the graph
dataset = DatasetSchema(graph=graph).load_by_id("https://example.com/1")

# write an entity back to the graph as triples
DatasetSchema(graph=graph).store(dataset)

# or run an arbitrary SPARQL query and deserialize the results
query = graph.query(
    """PREFIX schema: <http://schema.org/>
       SELECT DISTINCT ?a
       WHERE {
          ?a rdf:type schema:Dataset .
       }""")

datasets = DatasetSchema().load_triples(query)

We could also think about supporting multiple graphs, especially together with lazy loading:

from rdflib import Graph
from rdflib.plugins.stores.sparqlstore import SPARQLStore

import calamus

# a persistent local graph
local_graph = Graph('Sleepycat', identifier='mygraph')
local_graph.open('/home/user/data/myRDFLibStore', create=True)

# a read-only remote graph backed by a SPARQL endpoint
# (no open() needed, the store is configured with its endpoint directly)
remote_graph = Graph(store=SPARQLStore('http://dbpedia.org/sparql'))

# try the local graph first, fall back to the remote one
graph = calamus.FallbackGraph(local_graph, remote_graph)

DatasetSchema(graph=graph, lazy=True).load_by_id("https://example.com/1")

This would try to access objects in local_graph and, if they're not found, fall back to remote_graph. But this should probably be done in a follow-up ticket.
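The FallbackGraph itself could be little more than a thin wrapper that asks each graph in order; a minimal sketch (the class is hypothetical, not existing calamus API):

from rdflib import Graph, URIRef


class FallbackGraph:
    """Hypothetical wrapper that queries graphs in order until one has data."""

    def __init__(self, *graphs: Graph):
        self.graphs = graphs

    def triples(self, pattern):
        """Yield matching triples from the first graph that has any."""
        for graph in self.graphs:
            matches = list(graph.triples(pattern))
            if matches:
                yield from matches
                return

    def load_subject(self, iri: str):
        """Return all triples describing the given IRI, local graph first."""
        return list(self.triples((URIRef(iri), None, None)))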

@rokroskar
Member

rokroskar commented Jun 12, 2020

Graph above is rdflib.Graph?

I wonder if it could be made even simpler; it still reads a little verbose to me, but I really like the general idea.

I was just thinking that having a simple way to define what you want (e.g. a Person with a Name and a Birthplace), a really clean interface for getting that information from some valid endpoint, and then a means to transform that data (into e.g. a pandas dataframe?) would be really valuable beyond the immediate context in which we happen to envision using this. For an example, have a look at this blog post and note how complicated (ugly?) the "Retrieving SPARQL queries with Python" part gets. And this is a really simple use case.
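Roughly the kind of boilerplate such posts end up with (this is not the blog post's code, just an illustrative SPARQLWrapper query against DBpedia with the JSON bindings unpacked by hand into a dataframe):

import pandas as pd
from SPARQLWrapper import SPARQLWrapper, JSON

# run a SPARQL query, unpack the raw JSON bindings, build the dataframe manually
sparql = SPARQLWrapper("http://dbpedia.org/sparql")
sparql.setQuery("""
    SELECT ?person ?name ?birthPlace WHERE {
        ?person a dbo:Person ;
                foaf:name ?name ;
                dbo:birthPlace ?birthPlace .
    } LIMIT 10
""")
sparql.setReturnFormat(JSON)
results = sparql.query().convert()

rows = [
    {var: binding[var]["value"] for var in binding}
    for binding in results["results"]["bindings"]
]
df = pd.DataFrame(rows)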

@rokroskar
Member

This seems to be exactly what we need: https://github.com/RDFLib/rdflib-hdt

The interface seems to fit exactly into the use-case you sketched above.

Extra bonus: it seems to be actively supported, unlike 99% of RDF tools out there.
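Going by its README, it would slot into the graph-based interface sketched above roughly like this (the .hdt file path is made up, and load_by_id is the proposed calamus interface from above, not something that exists yet):

from rdflib import Graph
from rdflib_hdt import HDTStore

# an HDT file is a compressed, queryable RDF archive; rdflib-hdt exposes it
# as a regular rdflib store, so it could back the graph= interface above
store = HDTStore("datasets.hdt")
graph = Graph(store=store)

dataset = DatasetSchema(graph=graph).load_by_id("https://example.com/1")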

@Panaetius
Member Author

https://cayley.io/ could be an option

@rokroskar
Member

A PoC for this was implemented in #47 with an interface like this:

from calamus.backends.neo4j import CalamusNeo4JBackend
neo = CalamusNeo4JBackend()
neo.initialize()

book = BookSchema(session=neo).load(
    neo.fetch_by_id(
        "http://example.com/books/1"
    )
)

Passing session implies flattened=True, so if there are links to nodes whose data is not present, it is fetched automatically from the db via an additional query.
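So, with a BookSchema that links to an author (the attribute names here are assumed for illustration, not taken from the PoC), the extra fetch would be transparent to the caller:

# the author is only a link in the book's own data, so loading the book
# triggers an additional query against the backend behind the scenes
book = BookSchema(session=neo).load(neo.fetch_by_id("http://example.com/books/1"))
print(book.author.name)  # filled in by the automatic follow-up fetch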

An improvement to this functionality was discussed, which would enable something like this:

book = BookSchema().load({"_id": "http://example.com/books/1"}, session=neo)

which would translate into a query for a book by that id. The query building could be expanded to support simple matching on nested properties:

book = BookSchema().load(
    {
        "http://schema.org/author": {
            "http://schema.org/name": "Isaac Newton"
        }
    },
    session=neo,
)

which would create a query like

MATCH (n:`http://schema.org/Book`) -[:`http://schema.org/author`]-> ({`http://schema.org/name`: "Isaac Newton"})

This example uses Neo4j, but it should be fairly backend-independent.
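As a rough illustration of the query building (a hypothetical helper, not part of the #47 PoC), the nested filter dict could be walked into MATCH patterns like so:

def build_match(rdf_type, filters):
    """Hypothetical sketch: turn a nested property filter into a Cypher MATCH.

    Handles only one level of nesting and string equality, just enough to
    reproduce the example query above.
    """
    patterns = []
    for prop, value in filters.items():
        if isinstance(value, dict):
            # nested object: match a related node by its literal properties
            props = ", ".join(f'`{k}`: "{v}"' for k, v in value.items())
            patterns.append(f"(n:`{rdf_type}`) -[:`{prop}`]-> ({{{props}}})")
        else:
            # flat literal property on the node itself
            patterns.append(f'(n:`{rdf_type}` {{`{prop}`: "{value}"}})')
    return "MATCH " + ", ".join(patterns)


print(build_match(
    "http://schema.org/Book",
    {"http://schema.org/author": {"http://schema.org/name": "Isaac Newton"}},
))
# MATCH (n:`http://schema.org/Book`) -[:`http://schema.org/author`]-> ({`http://schema.org/name`: "Isaac Newton"})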

@Panaetius Panaetius moved this to Backlog in renku-python May 18, 2022
@Panaetius Panaetius removed their assignment Feb 20, 2023