
Add local db for performance reasons #10

Open
Panaetius opened this issue May 4, 2020 · 5 comments

@Panaetius
Member

Instead of saving a flat list as a .json or .yaml file, it'd be nice to have a local db (key-value store) that saves objects by IRI.

This could be done with https://docs.python.org/3/library/dbm.html#module-dbm, https://github.com/coleifer/unqlite-python or https://github.com/RaRe-Technologies/sqlitedict (though the latter doesn't seem to be actively supported). We should also check whether there are other solutions that fit our needs better.
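A rough sketch of the key-value idea using the stdlib dbm module (file name and payload are made up, just to illustrate storing objects by IRI):

import dbm
import json

# store each object's JSON-LD representation under its IRI
with dbm.open("calamus-cache", "c") as db:
    iri = "https://example.com/datasets/1"
    db[iri] = json.dumps({"@id": iri, "@type": "http://schema.org/Dataset"})

    # later: look the object up by IRI instead of scanning a flat file
    data = json.loads(db[iri])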

@Panaetius Panaetius added this to the sprint-2020-05-28 milestone May 26, 2020
@Panaetius Panaetius self-assigned this May 27, 2020
@Panaetius
Member Author

After some more thought, adding a database to calamus is not really needed; we can use rdflib for that, since it already supports a multitude of DB backends through plugins.

Instead, we can just support an rdflib graph object, since that already allows reading/writing triples, and rdflib supports several backends (including SPARQL endpoints, solving #9). Then we don't need to worry about how to store things; we just need to interact with rdflib.

The main piece needed for this is the ability to deserialize rdflib triples into calamus objects and vice versa.

An easy implementation would be to just use the rdflib-jsonld plugin to convert python objects <-> JSON-LD <-> rdflib representation. But that plugin does not support JSON-LD 1.1 yet, and using rdflib types directly would likely perform better.

So in addition to JSON-LD, it'd be great if we could serialize to rdflib triples directly as well.
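For reference, the "easy" round trip via the rdflib-jsonld plugin would look roughly like this (a minimal sketch, assuming the plugin is installed so rdflib understands format="json-ld"; DatasetSchema/dataset are the usual calamus schema and model, and details like flattening/framing of the JSON-LD output are glossed over):

import json

from rdflib import Graph

# python object -> JSON-LD dict (existing calamus serialization)
jsonld_doc = DatasetSchema().dump(dataset)

# JSON-LD -> rdflib triples
graph = Graph()
graph.parse(data=json.dumps(jsonld_doc), format="json-ld")

# rdflib triples -> JSON-LD -> python object
jsonld_back = json.loads(graph.serialize(format="json-ld"))
dataset_again = DatasetSchema().load(jsonld_back[0])  # picking the right node is glossed over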

Proposal for an interface:

import rdflib

# open a persistent local store and bind it to a graph
graph = rdflib.Graph('Sleepycat', identifier='mygraph')
graph.open('/home/user/data/myRDFLibStore', create=True)

# load an entity by its IRI directly from the graph
dataset = DatasetSchema(graph=graph).load_by_id("https://example.com/1")

# write an entity back to the graph as triples
DatasetSchema(graph=graph).store(dataset)

# or run an arbitrary SPARQL query and deserialize the results
query = graph.query(
    """PREFIX schema: <http://schema.org/>
       SELECT DISTINCT ?a
       WHERE {
          ?a rdf:type schema:Dataset .
       }""")

datasets = DatasetSchema().load_triples(query)

We could also think about supporting multiple graphs, especially together with lazy loading:

from rdflib import Graph
from rdflib.plugins.stores.sparqlstore import SPARQLStore

import calamus

# a persistent local graph
local_graph = Graph('Sleepycat', identifier='mygraph')
local_graph.open('/home/user/data/myRDFLibStore', create=True)

# a read-only remote graph backed by a SPARQL endpoint
# (no open() needed, the store is configured with its endpoint directly)
remote_graph = Graph(store=SPARQLStore('http://dbpedia.org/sparql'))

# try the local graph first, fall back to the remote one
graph = calamus.FallbackGraph(local_graph, remote_graph)

DatasetSchema(graph=graph, lazy=True).load_by_id("https://example.com/1")

This would try to access objects in local_graph and, if they're not found, fall back to remote_graph. But this should probably be done in a follow-up ticket.
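The FallbackGraph itself could be little more than a thin wrapper that asks each graph in order; a minimal sketch (the class is hypothetical, not existing calamus API):

from rdflib import Graph, URIRef


class FallbackGraph:
    """Hypothetical wrapper that queries graphs in order until one has data."""

    def __init__(self, *graphs: Graph):
        self.graphs = graphs

    def triples(self, pattern):
        """Yield matching triples from the first graph that has any."""
        for graph in self.graphs:
            matches = list(graph.triples(pattern))
            if matches:
                yield from matches
                return

    def load_subject(self, iri: str):
        """Return all triples describing the given IRI, local graph first."""
        return list(self.triples((URIRef(iri), None, None)))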

@rokroskar
Member

rokroskar commented Jun 12, 2020

Graph above is rdflib.Graph?

I wonder if it could be made even simpler; it still reads a little verbose to me, but I really like the general idea.

I was just thinking that having a simple way to define what you want (e.g. a Person with a Name and a Birthplace), a really clean interface for getting that information from some valid endpoint, and then a means to transform that data (into e.g. a pandas dataframe?) would be really valuable beyond the immediate context in which we happen to envision using this. For an example, have a look at this blog post and note how complicated (ugly?) the "Retrieving SPARQL queries with Python" part gets. And this is a really simple use case.
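Roughly the kind of boilerplate such posts end up with (this is not the blog post's code, just an illustrative SPARQLWrapper query against DBpedia with the JSON bindings unpacked by hand into a dataframe):

import pandas as pd
from SPARQLWrapper import SPARQLWrapper, JSON

# run a SPARQL query, unpack the raw JSON bindings, build the dataframe manually
sparql = SPARQLWrapper("http://dbpedia.org/sparql")
sparql.setQuery("""
    SELECT ?person ?name ?birthPlace WHERE {
        ?person a dbo:Person ;
                foaf:name ?name ;
                dbo:birthPlace ?birthPlace .
    } LIMIT 10
""")
sparql.setReturnFormat(JSON)
results = sparql.query().convert()

rows = [
    {var: binding[var]["value"] for var in binding}
    for binding in results["results"]["bindings"]
]
df = pd.DataFrame(rows)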

@rokroskar
Member

This seems to be exactly what we need: https://github.com/RDFLib/rdflib-hdt

The interface seems to fit exactly into the use-case you sketched above.

Extra bonus: it seems to be actively supported, unlike 99% of RDF tools out there.
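Going by its README, it would slot into the graph-based interface sketched above roughly like this (the .hdt file path is made up, and load_by_id is the proposed calamus interface from above, not something that exists yet):

from rdflib import Graph
from rdflib_hdt import HDTStore

# an HDT file is a compressed, queryable RDF archive; rdflib-hdt exposes it
# as a regular rdflib store, so it could back the graph= interface above
store = HDTStore("datasets.hdt")
graph = Graph(store=store)

dataset = DatasetSchema(graph=graph).load_by_id("https://example.com/1")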

@Panaetius
Member Author

https://cayley.io/ could be an option

@rokroskar
Member

A PoC for this was implemented in #47 with an interface like this:

from calamus.backends.neo4j import CalamusNeo4JBackend
neo = CalamusNeo4JBackend()
neo.initialize()

book = BookSchema(session=neo).load(
    neo.fetch_by_id(
        "http://example.com/books/1"
    )
)

Passing session implies flattened=True, so if there are links to nodes whose data is not present, it is fetched automatically from the db via an additional query.
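So, with a BookSchema that links to an author (the attribute names here are assumed for illustration, not taken from the PoC), the extra fetch would be transparent to the caller:

# the author is only a link in the book's own data, so loading the book
# triggers an additional query against the backend behind the scenes
book = BookSchema(session=neo).load(neo.fetch_by_id("http://example.com/books/1"))
print(book.author.name)  # filled in by the automatic follow-up fetch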

An improvement to this functionality was discussed, which would enable something like this:

book = BookSchema().load({"_id": "http://example.com/books/1"}, session=neo)

which would translate into a query for a book by that id. The query building could be expanded to support simple matching on nested properties:

book = BookSchema().load(
    {
        "http://schema.org/author": {
            "http://schema.org/name": "Isaac Newton"
        }
    },
    session=neo,
)

which would create a query like

MATCH (n:`http://schema.org/Book`) -[:`http://schema.org/author`]-> ({`http://schema.org/name`: "Isaac Newton"})

This example uses Neo4j, but it should be fairly backend-independent.
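As a rough illustration of the query building (a hypothetical helper, not part of the #47 PoC), the nested filter dict could be walked into MATCH patterns like so:

def build_match(rdf_type, filters):
    """Hypothetical sketch: turn a nested property filter into a Cypher MATCH.

    Handles only one level of nesting and string equality, just enough to
    reproduce the example query above.
    """
    patterns = []
    for prop, value in filters.items():
        if isinstance(value, dict):
            # nested object: match a related node by its literal properties
            props = ", ".join(f'`{k}`: "{v}"' for k, v in value.items())
            patterns.append(f"(n:`{rdf_type}`) -[:`{prop}`]-> ({{{props}}})")
        else:
            # flat literal property on the node itself
            patterns.append(f'(n:`{rdf_type}` {{`{prop}`: "{value}"}})')
    return "MATCH " + ", ".join(patterns)


print(build_match(
    "http://schema.org/Book",
    {"http://schema.org/author": {"http://schema.org/name": "Isaac Newton"}},
))
# MATCH (n:`http://schema.org/Book`) -[:`http://schema.org/author`]-> ({`http://schema.org/name`: "Isaac Newton"})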

@Panaetius Panaetius moved this to Backlog in renku-python May 18, 2022
@Panaetius Panaetius removed their assignment Feb 20, 2023