
RDF Support Roadmap #1570

Open

semihsalihoglu-uw opened this issue May 25, 2023 · 3 comments

Labels: rdf (Issues related to RDF support)

semihsalihoglu-uw commented May 25, 2023

This is an in-progress issue that describes our roadmap for implementing Resource Description Framework (RDF) support in Kùzu. In Kùzu, users model their records as a set of node and relationship tables, so we make an explicit distinction between tables that store nodes and tables that store relationships, and the storage/indexing of their properties/columns differ. We can call this the "structured property graph model". Ultimately this model is a simple variant of the standard relational model. RDF, on the other hand, is a separate graph-based data model that is specifically suitable for modeling heterogeneous, complex knowledge graphs. It has several advantages that don't exist in the relational model. Two of the important differences are (though there are others):

  • (i) Ability to define type/class hierarchies: In the relational model, the relation names can be interpreted as giving the types of the entities represented in your records, but one cannot natively define a class hierarchy when modeling records, as one would do in object-oriented programming languages (e.g., that Cow is a Mammal and Mammals are Animals, etc.). RDF has standard vocabulary, such as rdf:type and rdfs:subClassOf, to define simple type/class hierarchies (see the example after this list).

  • (ii) Ability to express schema and data in a homogeneous way: In the relational model there is a clear distinction between schemas, which are declared in CREATE TABLE Mammal(prop1, prop2, ...) statements, and the actual data, i.e., the records that are inserted into these tables. One can query the data records only with explicit references to the tables/schema of those records. In RDF, both schema and data are represented with the same kind of records: "triples" that represent facts (explained momentarily). This is quite powerful when the modeled application domain is very complex and very hard to tabulate as a set of tables.
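For example, using hypothetical DBPedia-style prefixes, the class hierarchy from (i) and an instance-level fact can sit side by side as ordinary triples, with no separate schema declaration:

dbr:Betsy, rdf:type, dbo:Cow .
dbo:Cow, rdfs:subClassOf, dbo:Mammal .
dbo:Mammal, rdfs:subClassOf, dbo:Animal .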

Overview of RDF

Subject, predicate, object triples and IRIs: In RDF, one represents a set of facts as (subject, predicate, object) triples. Subjects and predicates are always Resources (the "R" of RDF), which have globally unique identifiers called IRIs (often long strings that look like URLs).1 Two examples that state two facts about Justin Trudeau in the DBPedia RDF dataset are:

  • <https://dbpedia.org/page/Justin_Trudeau, https://dbpedia.org/ontology/almaMater, https://dbpedia.org/page/McGill_University>
  • <https://dbpedia.org/page/Justin_Trudeau, https://dbpedia.org/ontology/birthDate, 1971-12-25 (xsd:date)>

Objects in triples, on the other hand, can be either: (i) Resources; or (ii) Values, such as integers, strings, and dates. Let's refer to these as "Resource objects" vs. "Value objects". Triples naturally map to a graph abstraction as a set of edges. In the RDF community and in SPARQL, the query language that has been developed to query RDF databases, each (s, p, o) triple is thought of as an edge s-[p]->o with label p, where s is a Resource and o is either a Resource or a Value. Hence RDF databases, i.e., sets of triples, are referred to as knowledge graphs.
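For instance, in edge form, the two DBPedia triples above become one edge with a Resource object and one with a Value object (using the dbr:/dbo: prefix shorthand for the full IRIs):

dbr:Justin_Trudeau -[dbo:almaMater]-> dbr:McGill_University
dbr:Justin_Trudeau -[dbo:birthDate]-> "1971-12-25"^^xsd:date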

Other components of RDF are as follows:
Blank Nodes: Some Resources may lack an IRI, yet one may still want to state facts about them. As an example, in the Turtle/N3 raw file formats for storing triples, a blank node can be expressed as follows: dbr:Justin_Trudeau dbo:spouse [ dbo:birthDate "1975-04-24"^^xsd:date ] . The part between [ ] indicates a blank node (representing Sophie Trudeau) that has a dbo:birthDate predicate with value "1975-04-24"^^xsd:date.

Named Graphs: An extension of RDF labels each triple with the graph it belongs to; these are called named graphs. For example, if we had an RDF named graph http://dbpedia.org/ that contained the triple about Justin Trudeau's alma mater, we could represent it as a quadruple: <dbr:Justin_Trudeau, dbo:almaMater, dbr:McGill_University, http://dbpedia.org/>, where the last component is the IRI of the named graph. In SPARQL you can query over multiple named graphs and even bind variables to the named graphs.

Inference/Reasoning: The real power of RDF and the standards around it, such as RDFS and OWL, is that they provide a standard vocabulary over which one can develop a logical reasoner. This is quite advanced for us and beyond the scope of our initial roadmap (though arguably this is what makes RDF very powerful and super interesting), so I will avoid details and give only a simple example. In RDFS (RDF Schema), one can define the domains and ranges of predicates. You can say <dbo:almaMater, rdfs:range, dbo:EducationalInstitution> to specify the constraint that the range of dbo:almaMater is dbo:EducationalInstitution. From the triple about Justin Trudeau's alma mater above, a DBMS with RDFS inference capabilities can deduce that <dbr:McGill_University, rdf:type, dbo:EducationalInstitution>, even though this triple may never have been inserted into the system. Therefore, if you asked a query to return all triples <?, rdf:type, dbo:EducationalInstitution>, you'd get dbr:McGill_University. Far more advanced reasoning/inference can be done automatically with the standards around RDF, such as OWL. This is the real strength of storing both your records and their semantics in RDF, which is founded on a logical formalism that facilitates reasoning over those records.
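Spelled out in triple form (the derived triple is shown last; it is implied by the reasoner, not physically inserted):

dbo:almaMater, rdfs:range, dbo:EducationalInstitution .
dbr:Justin_Trudeau, dbo:almaMater, dbr:McGill_University .
# implied by RDFS range inference:
dbr:McGill_University, rdf:type, dbo:EducationalInstitution .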

Goals of RDF Support in Kùzu

RDF and Kùzu's data model differ, but RDF triples can still be ingested into Kùzu as a set of nodes, relationships, and possibly some node properties, and queried in Cypher. Our immediate goals for RDF support in Kùzu are twofold:

  1. Quick triple ingestion from standard raw file formats, such as N3, JSON-LD, Turtle, and RDF/XML: As a first step, we need to define how these triples get mapped to nodes, relationships, and properties. I will make a proposal, but we have some options, as I describe below.
  2. Fast querying of triples in Cypher: Once we can ingest these triples into Kùzu, we want to support querying them in Cypher, thus benefiting from our fast many-to-many join and recursive join capabilities and our efficient storage structures. This should also allow linking the Resources modeled in RDFGraphs with other node and relationship tables that store structured information, and querying them seamlessly.

Outline for the rest of the document:
We should proceed in iterations and implement increasingly advanced features. I will propose a rough design for the first two obvious steps below and also mention more advanced features, such as RDF-Star and inferencing, which should be on our roadmap.

  1. Basic RDF support with named graphs but only supporting "Resource objects".
  2. Support for "value objects".

Step 1: Basic RDF Support

In this version, we only support ingesting triples whose objects are Resources (so we do not ingest or query triples whose object is a value, e.g., <dbr:Sophie_Trudeau, dbo:birthDate, "1975-04-24"^^xsd:date>).

I will use the following example in this section (JT stands for Justin_Trudeau):

dbr:JT, dbo:spouse, [dbo:birthPlace, dbr:Quebec] .
dbr:JT, dbo:almaMater, dbr:McGill_University .
dbr:McGill_University, dbo:locatedIn, dbr:Quebec .
dbo:almaMater, rdf:type, rdf:Property .

There are 5 triples (note that the first line contains 2 triples: one from dbr:JT to the blank node representing Sophie Trudeau, and one from the blank node to dbr:Quebec). The objects of all of these triples are Resources. There is 1 blank node and there are 9 IRIs identifying 9 Resources, for 10 Resources in total.

RDFGraphs and Triple Ingestion Commands:

Users first need to create an RDFGraph in their database before they can insert triples into it. So far in Kùzu, users can create either a node table or a rel table; RDFGraphs would be the third logical storage abstraction supported in the system. Implicitly, an RDFGraph represents a pair of a Resources node table and a Triples rel table. However, RDFGraph is a new and separate abstraction that both the system and users know about. That is, an RDFGraph is not just a wrapper around a Resources node table and a Triples rel table (as will become clear).

From the user's perspective, ingesting triples into a named graph should be as simple as running these two commands:

CREATE RDFGraph DBPedia
COPY FROM dbpedia.ttl INTO DBPedia

Users can create multiple RDFGraphs and give each one an explicit name. For example, a user can run CREATE RDFGraph YAGO, so we should start off by supporting named graphs (see the sketch below).
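For instance, setting up a second, hypothetical YAGO graph alongside DBPedia would reuse the same proposed commands (yago.ttl is a hypothetical file name):

CREATE RDFGraph YAGO
COPY FROM yago.ttl INTO YAGO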

Mapping RDF Triples to Node/Rel Tables

Once these commands are executed, we should internally map these triples into Kùzu's property graph model. For the basic RDF support, where the objects of the ingested triples must be Resources (and not values), there is a very natural way to do this (and I doubt there is a reasonable alternative): internally, we create one node table and one rel table with the following schemas:

  • Node Table DBPedia.Resources(ID SERIAL, iri STRING, PRIMARY KEY ID): We cannot make iri the primary key because blank nodes do not have IRIs. We define an ID property with data type SERIAL, a new feature in Kùzu indicating that the values of the column are dense integers 0, ..., numNodes-1. This table will contain 10 nodes, one for each IRI and blank node. For example, it can look like this:
| ID  | IRI                   |
| --- | ----------------------|
| 0   | dbr:JT                |
| 1   | dbo:birthPlace        |
| 2   | dbo:spouse            |
| 3   | dbo:almaMater         |
| 4   | rdf:type              |
| 5   | null                  |
| 6   | rdf:Property          |
| 7   | dbr:McGill_University |
| 8   | dbr:Quebec            |
| 9   | dbo:locatedIn         |

Note that the Resource with ID 5 has null IRI and corresponds to the blank node representing Sophie Trudeau.

  • Rel Table DBPedia.Triples(FROM DBPedia.Resources, TO DBPedia.Resources, pIntIRI INT64): pIntIRI stands for "predicate integer IRI" and stores the integer ID of the predicate's IRI in the Resources table. Observe that every IRI appears as a node in the Resources table even if it never appears as a subject or an object of a triple. This will create the following rel table. Note that we store rel tables in our Lists storage structure, which is a disk-based CSR, but I'm showing the records as a regular table for simplicity.
| _relID | FROM | pIntIRI | TO |
|-------|-------|---------|-------|
| 0     | 0     | 2       | 5     |
| 1     | 0     | 3       | 7     |
| 2     | 3     | 4       | 6     |
| 3     | 5     | 1       | 8     |
| 4     | 7     | 9       | 8     |

Internal _relID is a system-level identifier property we give to every relationship in the graph. Ignoring the _relID property, consider the first tuple (0, 2, 5) as an example. This is the <dbr:JT, dbo:spouse, [dbo:birthPlace, dbr:Quebec]> triple, if you follow the mapping of IRIs/blank nodes to Resource IDs in the DBPedia.Resources node table.

Note 1: I highly suggest that we sort the triple edges in Lists by predicate value. This will be critical for quick searches of predicates over large adjacency lists, e.g., for predicate-filtered queries like the one sketched below.
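For instance, a predicate-filtered query of the following shape (using the querying syntax proposed below) would scan an adjacency list looking only for edges whose predicate is dbo:almaMater; with predicate-sorted lists, that scan becomes a binary search:

MATCH (s:DBPedia)-[p:DBPedia]->(o:DBPedia)
WHERE p.iri = "dbo:almaMater"
RETURN s.iri, o.iri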

Note 2: A big challenge will be that some of the backward adjacency lists will be very large. For example, in DBPedia there is a Resource that appears as the object of 16M triples, so we would have 2 backward adjacency lists of size 16M (one storing the pIntIRI values and one storing the neighbor node IDs of the DBPedia.Triples rel table). We need optimizations to improve the scalability of such very large adjacency lists.

Hash Index to store the IRI-to-Resource-ID Mapping: Along with these 2 tables, we need to use our Hash Index to map IRIs to Resource IDs. We already have the capability to do this for mapping string primary keys to system-level node offsets; however, this would be a case where we use the Hash Index for a non-primary-key column. We do not need the reverse map because it is already stored in the iri column of the Resources node table.

As a summary: internally we use 3 storage structures to store the triples: 1) the Resources node table; 2) the Triples rel table; and 3) the IRI Hash Index. Each RDFGraph gets its own Resources table, Triples table, and Hash Index, as sketched below.
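To make the mapping concrete, here is a hedged sketch of what CREATE RDFGraph DBPedia would set up internally, written as ordinary Kùzu-style DDL (the qualified DBPedia.Resources/DBPedia.Triples names are the proposal's; the exact DDL surface is illustrative only):

CREATE NODE TABLE DBPedia.Resources(ID SERIAL, iri STRING, PRIMARY KEY (ID))
CREATE REL TABLE DBPedia.Triples(FROM DBPedia.Resources TO DBPedia.Resources, pIntIRI INT64)
// plus a hash index mapping iri -> Resource ID, reusing the machinery we have for string primary keys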

Querying RDF Triples in Cypher

Users can then query triples in their RDFGraph using the ()-[]->() rel pattern in their MATCH or OPTIONAL MATCH clauses as follows:

MATCH (s1:DBPedia)-[p1:DBPedia]->(so:DBPedia)-[p2:DBPedia]->(o:DBPedia)
WHERE s1.iri = "dbr:JT"
RETURN *

Output:
| s1 | p1 | so | p2 | o |
|----|----|----|----|---|
| {ID: 0, iri: "dbr:JT"} | {_relID: 0, iri: "dbo:spouse"} | {ID: 5, iri: null} | {_relID: 3, iri: "dbo:birthPlace"} | {ID: 8, iri: "dbr:Quebec"} |
| {ID: 0, iri: "dbr:JT"} | {_relID: 1, iri: "dbo:almaMater"} | {ID: 7, iri: "dbr:McGill_University"} | {_relID: 4, iri: "dbo:locatedIn"} | {ID: 8, iri: "dbr:Quebec"} |

The query asks for 2-paths starting from the subject dbr:JT over a single RDFGraph. We could also ask queries over multiple RDFGraphs, e.g.:

MATCH (s1:DBPedia)-[p1:DBPedia]->(o1:DBPedia), (s2:YAGO)-[p2:YAGO]->(o2:YAGO)
WHERE s1.iri = "dbr:JT" and s1.iri = s2.iri
RETURN *

Or with a pattern that assigns multiple labels to nodes/relationships:

MATCH (s:DBPedia|YAGO)-[p:DBPedia|YAGO]->(o:DBPedia|YAGO) RETURN label(s), s, p, o

Here label(s) will return the name of the RDFGraph, similar to how users can bind a variable to the name of a named graph in SPARQL. On the face of it, supporting these query variants does not seem to require any changes to our query planner and executor.

Note 1: This example already shows that an RDFGraph is not just a wrapper that maps triples to node/rel tables. For example, the query processor needs to be aware that what users want is not the pIntIRI property that we store in the Triples rel table but the string IRI version of it. So, in the second output tuple, for the triple matching the (s1:DBPedia)-[p1:DBPedia]->(so:DBPedia) part of the query, we return ({ID: 0, iri: "dbr:JT"}, {_relID: 1, iri: "dbo:almaMater"}, {ID: 7, iri: "dbr:McGill_University"}) instead of (..., {_relID: 1, pIntIRI: 3}, ...). That is, we convert pIntIRI: 3, which is what we store in the second row of the Triples table, into iri: "dbo:almaMater". The query processor needs to be conscious of such RDF-specific processing when evaluating queries that refer to the Triples and Resources tables.

Note 2: Without any special treatment, we can support queries that mix RDFGraphs and explicitly defined node and rel tables; this should just work seamlessly. For example, one can have an Employee node table and a RelatedTo rel table from Employee nodes to DBPedia.Resources nodes. One can do this:

CREATE Rel Table RelatedTo(FROM Employee, TO DBPedia.Resources)
// Some CREATE statements to insert records into RelatedTo

MATCH (s:DBPedia)-[p:DBPedia]->(o:DBPedia)<-[p2:RelatedTo]-(a:Employee)
WHERE s1.iri = "dbr:JT"
RETURN a

This query asks for all employees that are related to some Resource that Justin Trudeau has a connection to in the DBPedia graph. We could also have asked:

MATCH (r:DBPedia.Resources)<-[p2:RelatedTo]-(a:Employee)
WHERE a.name = "Alice"
RETURN r

This query asks for all Resources that the Employee Alice is RelatedTo. Note that I used the DBPedia.Resources label for variable r instead of DBPedia. See my next note on this.

Note 3: In the first query we used the rel pattern (s1:DBPedia)-[p1:DBPedia]->(so:DBPedia) instead of the more verbose (s1:DBPedia.Resources)-[p1:DBPedia.Triples]->(so:DBPedia.Resources). However, in the query immediately above we used MATCH (r:DBPedia.Resources)<-[p2:RelatedTo]-(a:Employee), as it seemed more natural to be explicit that we were querying the Resources in DBPedia, and that is how we had defined the RelatedTo table. We need to settle on a consistent syntax here, but assuming intent is not hard to infer, we might just be flexible and allow both syntaxes whenever they make sense.

Note 4: We can also allow queries that only query the Resources nodes and not triples as follows:

MATCH (s:DBPedia.Resources)
RETURN *

For now, I suggest we support this even though it deviates from the RDF/SPARQL standard, in which one can query triples only through triple patterns (e.g., in SPARQL you must bind subject, predicate, and object, as in SELECT * WHERE { ?s ?p ?o }). So we would deviate from this norm, but it is also less code on our side not to restrict these queries.

Implementation Challenges: I won't go into many details here, but the main changes are to: (i) the Catalog; (ii) the execution of the COPY statement, which needs parsers for Turtle, N3, JSON-LD, etc. files; and (iii) the planner and join optimizer, to ensure we bind correct types, etc.

Updating RDF Triples in Cypher

We need to support CREATE and DELETE statements. There is a decision to make here about whether we force users to CREATE and DELETE Resources and Triples separately, or only as triples. I suggest that we force users to CREATE and DELETE as triples, and also ingest data in COPY FROM statements as triples (see the sketch below). This is another part where we might want to get feedback from users and the RDF community.
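As a sketch, and assuming (as in the querying section above) that relationships expose the predicate's iri as a property, triple-level updates could look as follows. The syntax is illustrative, not final, and the system would have to resolve existing Resources by IRI rather than creating duplicate nodes:

// create the triple <dbr:JT, dbo:birthPlace, dbr:Ottawa>
CREATE (:DBPedia {iri: "dbr:JT"})-[:DBPedia {iri: "dbo:birthPlace"}]->(:DBPedia {iri: "dbr:Ottawa"})

// delete the same triple; per the note below, only the relationship is removed
MATCH (s:DBPedia)-[p:DBPedia]->(o:DBPedia)
WHERE s.iri = "dbr:JT" AND p.iri = "dbo:birthPlace" AND o.iri = "dbr:Ottawa"
DELETE p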

Note: When deleting triples, we cannot remove nodes from the Resources table or the IRI Hash Index; we can only remove relationships from the Triples table. So the Resources table and IRI Hash Index of an RDFGraph can only grow. The reason is the following: when a triple (s, p, o) is deleted, we can quickly check whether the Resources of s or o still appear as a subject or an object of some triple by checking whether their adjacency lists in the Triples table are empty, but we cannot easily check whether they appear as a predicate, because we do not have an index that tells us whether an IRI is stored as a predicate in some triple.

Step 2: Value Objects

Our second step is to support storing and querying triples whose objects can also be values, such as <dbr:ST, dbo:birthDate, "1975-04-24"^^xsd:date>. This requires major changes to the system.

Let's change our example a bit:

dbr:JT, dbo:spouse, [dbo:birthPlace, dbr:Quebec;
                     dbo:birthDate, "1975-04-24"^^xsd:date] .
dbr:JT, dbo:almaMater, dbr:McGill_University .
dbr:McGill_University, dbo:locatedIn, dbr:Quebec .
dbo:almaMater, rdf:type, rdf:Property .

I added one new triple to the blank node, with dbo:birthDate as the predicate and "1975-04-24"^^xsd:date as the object.

Mapping RDF Triples to Node/Rel Tables

I will discuss two options for storing/mapping these triples to our node/rel tables. The second was suggested by Gaurav and Xiyang and is worth considering, but we would need to do some back-of-the-envelope calculations about its storage costs. My position is to go with the first storage design.

Option 1: We extend the Resources node table with a semiStructuredTriples column that stores a blob per node, containing that node's value triples in the form of (intPredicateIRI, dataType, value) entries. Note that this is semi-structured, as we store the type of the data along with the data. So we would have DBPedia.Resources(ID SERIAL, iri STRING, semiStructuredTriples BLOB, PRIMARY KEY ID).

| ID  | IRI      | semiStructuredTriples                   |
| --- | -------- | --------------------------------------- |
| ... | ...      | ...                                     |
| 4   | rdf:type | null                                    |
| 5   | null     | {dbo:birthDate, "1975-04-24"^^xsd:date} |
| ... | ...      | ...                                     |

Above, {dbo:birthDate, "1975-04-24"^^xsd:date} would not be stored as a string but in an efficient binary encoding of these values. For example, the date would be stored in 4 bytes, the data type xsd:date in 1 byte, and dbo:birthDate in 8 bytes as the integer version of its IRI.2

Option 2: For each possible data type of the objects we store, we extend the Resources table with a pair of columns, e.g., DBPedia.Resources(ID SERIAL, iri STRING, int64Predicates LIST<INT64>, int64Values LIST<INT64>, floatPredicates LIST<INT64>, floatValues LIST<FLOAT>, ..., PRIMARY KEY ID); see the sketch below. This may actually not be too bad in storage, given that most entries will be null and we will compress nulls very well with our Storage V2 design; it might be worth investigating as a side project. The real downside is that this type of storage can slow down queries that need to check whether a particular node has a particular predicate, because dozens of columns would need to be scanned, as we are destined to support dozens of data types.
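For reference, a sketch of the Option 2 schema written out as DDL, assuming one (predicates, values) pair of list columns per supported data type (column names and type syntax illustrative):

CREATE NODE TABLE DBPedia.Resources(
    ID SERIAL,
    iri STRING,
    int64Predicates LIST<INT64>, int64Values LIST<INT64>,  // objects of type INT64
    floatPredicates LIST<INT64>, floatValues LIST<FLOAT>,  // objects of type FLOAT
    datePredicates LIST<INT64>,  dateValues LIST<DATE>,    // objects of type DATE
    // ..., one pair of list columns per supported data type
    PRIMARY KEY (ID)
)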

Irrespective of which option we pick, we also need to store the IRIs of the predicates of these triples in the IRI Hash Index and allocate them an offset in the Resources table. For example, for the blank node's {dbo:birthDate, "1975-04-24"^^xsd:date} pair, the predicate dbo:birthDate still has an IRI, so it is a Resource and should be stored and mapped to an integer in the IRI Hash Index.

Querying RDF Triples in Cypher

Irrespective of how we store these triples, we have a choice when querying them. I'm a bit divided about this, but Option 1 below follows SPARQL more closely, so it is my preference for now. We should try to get some outside insights on this.

  1. We can interpret them as objects of triples, just as for triples whose objects are Resources. Queries and outputs would look as follows:
MATCH (s1:DBPedia)-[p1:DBPedia]->(so:DBPedia)-[p2:DBPedia]->(o:DBPedia)
WHERE s1.iri IS NULL
RETURN *

Output (scroll to the very right to see the last column value of the last output tuple):
| s1 | p1 | so | p2 | o |
|----|----|----|----|---|
| {ID: 0, iri: "dbr:JT"} | {_relID: 0, iri: "dbo:spouse"} | {ID: 5, iri: null} | {_relID: 3, iri: "dbo:birthPlace"} | {ID: 8, iri: "dbr:Quebec"} |
| {ID: 0, iri: "dbr:JT"} | {_relID: 1, iri: "dbo:almaMater"} | {ID: 7, iri: "dbr:McGill_University"} | {_relID: 4, iri: "dbo:locatedIn"} | {ID: 8, iri: "dbr:Quebec"} |
| {ID: 0, iri: "dbr:JT"} | {_relID: 0, iri: "dbo:spouse"} | {ID: 5, iri: null} | {_relID: 3, iri: "dbo:birthDate"} | **{value: "1975-04-24"}** |

Note: The variable o's type is sometimes a Resource and sometimes another type. We already have similar capabilities for returning variables that can bind to multiple node or rel tables; e.g., the query MATCH (a) RETURN * returns nodes from all node tables in the database bound to the variable a. In this case, for RDF value objects, we should probably define a new type called Value or Literal that can store a wide range of types. Another option is to make every object a Value and extend Value with a Resource variant, so that a single data type can hold either a Resource or any other value, and we return objects in that single data type.

  2. We can interpret them as properties of Resources. Queries and outputs would look as follows:
MATCH (s:DBPedia)-[p:DBPedia]->(o:DBPedia)
WHERE s.iri IS NULL
RETURN *

Output:
| s | p | o |
|---|---|---|
| {ID: 5, iri: null, dbo:birthDate: 1975-04-24} | {_relID: 3, iri: "dbo:birthPlace"} | {ID: 8, iri: "dbr:Quebec"} |

Note 1: The WHERE s.iri IS NULL predicate is a way to search for blank nodes, though we could support a separate function such as bnode(s) to be more explicit.

Note 2: dbo:birthDate:1975-04-24 is now a property on s. The obvious problem with this approach is that users no longer have a single way to query triples seamlessly, as they do in SPARQL. To check whether a Resource has a dbo:birthDate predicate, they have to query node properties, e.g., MATCH (s:DBPedia) WHERE s.dbo:birthDate = "1975-04-24" RETURN *, instead of writing a rel pattern.

Note 3: Our running example 2-hop query's output also changes if we model value objects as node properties. Recall that the query was: MATCH (s1:DBPedia)-[p1:DBPedia]->(so:DBPedia)-[p2:DBPedia]->(o:DBPedia). Now that dbo:birthDate:1975-04-24 is not part of the Triples table, the system would not return the output tuple ({ID: 0, iri: "dbr:JT"}, {_relID: 0, iri: "dbo:spouse"}, {ID: 5, iri: null}, {_relID: 3, iri: "dbo:birthDate"}, {value: "1975-04-24"}), because the last two variables, {_relID: 3, iri: "dbo:birthDate"} and {value: "1975-04-24"}, are now a property of so and not a triple. This is an important problem because it changes the meaning of the join. So, for example, if you wanted to query all properties of a node with IRI dbo:xyz, you'd have to run 2 queries: MATCH (s:DBPedia) WHERE s.iri = "dbo:xyz" RETURN *, and MATCH (s:DBPedia)-[p:DBPedia]->(o:DBPedia) WHERE s.iri = "dbo:xyz" RETURN *.

Implementation Challenge: Irrespective of how we model value objects, our query processor and expression/function evaluators will need to change significantly, because the system can no longer be fully typed; i.e., there will be variables whose types we cannot infer at compilation time. So if we support running expressions and functions on those variables, we need new unary and binary expression evaluators that assume the types of their operands may not be known a priori: they first check their operands' types and then dispatch to a function (and perhaps there are ways to cast Values into specific types for some queries).

Updating RDF Triples in Cypher

Assuming we pick Option 1, the CREATE and DELETE statements should stay the same, except that we need new operators to efficiently update the blobs that store the predicate-value pairs as a node property.

Optimizations

  • Native Namespace Prefix Support: The Catalog can keep the prefixes of the IRIs separately, and we can encode each IRI as an integer followed by a string in the Hash Index. Or we can have a new type IRI that is an integer prefix followed by a string, or we could store the prefix as an integer separately from the rest of the string. We may need to experiment with these different storage schemes.
  • FSST String Compression Support: We are keeping a global Hash Index that will serve as a string dictionary. However, the strings we will put in the Hash Index will be very long, so we would like to compress them. One problem is that this is likely to be a very large dictionary, i.e., it would be very difficult to update and/or change its codes. We can experiment with using a single FSST dictionary for all the IRIs. A more natural approach could be a trie index instead of a Hash Index, but that sounds like a lot of engineering; a trie naturally compresses common prefixes, though I'm not sure how effective it would be.
  • Inferring from rel patterns during compilation when a particular object can map only to Resources and not to values: Our primary performance problem will be trying to bind both a Resource and a Value to the same variable, e.g., o, that binds to objects. Therefore, whenever we can, we should infer during compilation that o can only bind to Resources, so that we can use faster query execution paths. For example, in the query MATCH (s:DBPedia)-[p:DBPedia]->(o:DBPedia)<-[p2:RelatedTo]-(a:Employee), o can only be a Resource, as it has an incoming edge from an Employee node.

  • HDT RDF Compression Format: This is something to look into for two reasons: (1) to explore whether it makes sense as a way to store RDF triples in the system (probably not); and (2) to be able to ingest and output results in HDT format, which might make sense.

Advanced Features:

RDF-Star

I am intentionally leaving this section to be designed later. RDF-Star is an extension to RDF for making statements about statements (as I understand, it is also soon to be part of the RDF 1.2 standard). Here, by a statement I refer to a triple, as a triple makes a "statement" about something in the modeled application domain. In RDF-Star, a subject or object resource can be a quoted triple. So <<dbr:JT, dbo:spouse, dbr:ST>, abc:reportedBy, abc:John_Doe> is a statement that John Doe reported that JT's spouse is ST. Importantly, <dbr:JT, dbo:spouse, dbr:ST> is not necessarily a triple in the database, so a query over all triples, such as SELECT * WHERE { ?s ?p ?o }, would not necessarily return it. My current thought is that we can support mapping RDF-Star by defining a new resourceType property for Resources stored in the Resources node table that can be: (i) Resource; or (ii) Implicit_Statement. But this needs to be designed, and we first need to gauge interest in these more advanced features.

Inference Features

It's also unclear what to do here. The DBMS community does not have a lot of expertise in inference algorithms, and we need to understand this space much better before we can propose a good design.

Footnotes

  1. Resources can be thought of as objects, as in object-oriented programming languages, about which one can make statements.

  2. This is effectively a simpler implementation of the "Semi-structured Node Property Lists" we had before we open-sourced Kùzu (for anyone interested: before November 2022, we had a feature to store arbitrary key-value properties on nodes called "Semi-structured Node Property Lists", which we removed).


th5 commented May 26, 2023

Semih –

This looks great. Today I'm busy preparing work so that I can take vacation next week. After I return I'll review this more in-depth and give any further feedback I may have. From a user perspective, this looks quite promising. I like the two main goals laid out.

There are a ton of RDF-related technologies. Instead of trying to support them all directly (e.g., implementing your own SHACL validation), it might be best to expose enough so that we can use existing external tools. I like the Kuzu strategy for supporting machine learning applications: be aware of popular use cases, don't reinvent the wheel, and expose enough to plug into external tools. Also, many existing RDF tools are not fully compliant with standards. If common simple cases work well, I'm less concerned about weird edge cases.

Do you envision being able to export back out to RDF?

What happens when imported RDF is mixed with native prop graph data?

I'm personally OK with the lack of SPARQL support. In my group I'm trying to build support for moving away from Gremlin and SPARQL, and towards Cypher as our main graph query language.

One common technique when converting from RDF to a property graph is to somehow mark which nodes should become properties. OBO has created several tools for this and others are floating around. If adding anything other than the most basic RDF support, I might suggest something like this.

Namespaces are mentioned. This is something I'm curious about in the pure prop graph space. If we ever get a standard for property graph schema, I would love for that to be supported.

Tom

semihsalihoglu-uw (Contributor, Author) commented

Hi Tom,

Yes, we would certainly support features to export out to different RDF formats. This should be fine when the results contain triples stored in RDFGraphs, but it's less clear what to do when the results mix triples stored in RDFGraphs with other node/rel tables. We would either restrict exporting to cases where the query results contain only triples, or adopt a set of exporting rules.

For Namespaces: Can you say more? I mentioned namespaces in the document in the context of an optimization to store the IRI strings more efficiently in the system. Beyond that we are not considering supporting a notion of "namespace" in the system but I may not know the features you have in mind.

Semih


ceteri commented Jun 12, 2023

Hi @semihsalihoglu-uw this looks really good -

FWIW, I did a quick implementation of RDF atop Kùzu by creating a custom RDFlib.Store plugin class where nodes and predicates get stored in Kùzu tables, much like you've described:
https://github.com/DerwenAI/kuzu-rdflib

This follows from work that our team at Derwen had been doing on kglab, which is an open-source PyData-ish abstraction layer that provides integration paths for different areas of graph technologies.

A benefit of using an RDFlib plugin is that so much of the W3C stack in Python comes along, nearly for free. In other words, this resolves the question of inference, validation, etc., since SPARQL, SHACL, OWL-RL, etc., become trivial to use.

Of course, requests to add or delete a single node or triple become more complex, although with a bit of juggling inside Cypher queries that's feasible.

In practice, something would need to manage the namespaces and prefixes, also as mentioned above. You might look at how we approached this in kglab — and I'm happy to talk through that and brainstorm approaches here.
