RDF Support Roadmap #1570
Semih – This looks great. Today I'm busy preparing work so that I can take vacation next week. After I return I'll review this more in-depth and give any further feedback I may have.

From a user perspective, this looks quite promising. I like the two main goals laid out. There are a ton of RDF-related technologies. Instead of trying to support them all directly (e.g., implementing your own SHACL validation), it might be best to expose enough so that we can use existing external tools. I like the Kuzu strategy for supporting machine learning applications: be aware of popular use cases, don't reinvent the wheel, expose enough to plug into external tools. Also, many existing RDF tools are not fully compliant with standards. If common simple cases work well, I'm less concerned about weird edge cases.

Do you envision being able to export back out to RDF? What happens when imported RDF is mixed with native prop graph data?

I'm personally OK with the lack of SPARQL support. In my group I'm trying to build support for moving away from Gremlin and SPARQL, and towards Cypher as our main graph query language.

One common technique when converting from RDF to a property graph is to somehow mark which nodes should become properties. OBO has created several tools for this, and others are floating around. If adding anything other than the most basic RDF support, I might suggest something like this.

Namespaces are mentioned. This is something I'm curious about in the pure prop graph space. If we ever get a standard for property graph schema, I would love for that to be supported.

Tom
Hi Tom,

Yes, certainly we would support features to export out in different RDF formats. This should certainly be fine if the results contain triples stored in RDFGraphs, but it's less clear what to do when the results mix triples stored in RDFGraphs with other Node/Rel Tables. We would either restrict exporting to cases when the query results contain only triples, or adopt a set of exporting rules.

For Namespaces: Can you say more? I mentioned namespaces in the document in the context of an optimization to store the IRI strings more efficiently in the system. Beyond that, we are not considering supporting a notion of "namespace" in the system, but I may not know the features you have in mind.

Semih
Hi @semihsalihoglu-uw, this looks really good. FWIW, I did a quick implementation of RDF atop Kùzu by creating a custom [...]. This follows from work that our team at Derwen had been doing on [...]. A benefit of using an [...]. Of course, requests to add or delete a single node or triple become more complex, although with a bit of juggling inside Cypher queries that's feasible. In practice, something would need to manage the namespaces and prefixes, as also mentioned above. You might look at how we approached this in [...].
This is an in-progress issue that describes our roadmap for implementing Resource Description Framework (RDF) support in Kùzu. In Kùzu, users model their records as a set of node and relationship tables. So we make an explicit distinction between tables that store nodes and tables that store relationships, and the storage/indexing of their properties/columns differ. We can call this the "structured property graph model". Ultimately, this model is a simple variant of the standard relational model. RDF, on the other hand, is a separate graph-based data model. RDF is specifically suitable for modeling heterogeneous, complex knowledge graphs. It has several advantages that don't exist in the relational model. Two of the important differences are (though there are others):
(i) Ability to define type/class hierarchies: In the relational model, the relation names can be interpreted as giving the types of the entities represented in your records, but one cannot natively define a class hierarchy when modeling one's records, as one would do in object-oriented programming languages (e.g., that Cow is a Mammal and Mammals are Animals, etc.). RDF has a standard vocabulary (e.g., `rdf:type` and `rdfs:subClassOf`) to define simple type/class hierarchies.

(ii) Ability to express schema and data in a homogeneous way: In the relational model there is a clear distinction between schemas, which are declared in `CREATE TABLE Mammal(prop1, prop2, ...)` statements, vs. the actual data, i.e., the records that are inserted into these tables. One can query only the data records, with explicit references to the tables/schemas of the data records. In RDF, both schema and data are represented in the same records, which are "triples" that represent facts (explained momentarily). This is quite powerful when the modeled application domain is very complex and very hard to tabulate in a set of tables.

Overview of RDF
Subject, predicate, object triples and IRIs: In RDF, one represents a set of facts as `(subject, predicate, object)` triples. Subjects and predicates are always Resources (the "R" of RDF) that have globally unique identifiers called IRIs (often long strings that look like URLs).1 Two examples that state two facts about Justin Trudeau in the DBPedia RDF datasets are:

<https://dbpedia.org/page/Justin_Trudeau, https://dbpedia.org/ontology/almaMater, https://dbpedia.org/page/McGill_University>
<https://dbpedia.org/page/Justin_Trudeau, https://dbpedia.org/ontology/birthDate, 1971-12-25 (xsd:date)>

Objects in triples, on the other hand, can either be: (i) Resources; or (ii) Values, such as integers, strings, and dates. Let's refer to these as "Resource objects" vs. "Value objects". Triples can naturally be mapped to a graph abstraction as a set of edges. In the RDF community and in the SPARQL query language that has been developed to query RDF databases, each (s, p, o) triple is thought of as an edge `s-[p]->o` with label `p`, where `s` is a Resource and `o` is either a Resource or a Value. Hence, RDF databases, i.e., sets of triples, are referred to as knowledge graphs.

Other components of RDF are as follows:
Blank Nodes: Some Resources may lack an IRI, yet one may still want to state facts about them. As an example, in the N3 raw file format for storing triples, blank nodes can be expressed as follows:

<dbr:Justin_Trudeau, dbo:spouse, [<dbo:birthDate, 1975-04-24 (xsd:date)>]>

(I might be getting the syntax a bit wrong.) The part between `[ ]` indicates a blank node (representing Sophie Trudeau) that has a dbo:birthDate property with value 1975-04-24 (xsd:date).

Named Graphs: An extension of RDF is to label each triple with the label of a graph the triple belongs to. These are called named graphs. For example, if we had an RDF named graph `http://dbpedia.org/` that contained the triple about the alma mater of Justin Trudeau, we could represent it as the quadruple `<dbr:Justin_Trudeau, dbo:almaMater, dbr:McGill_University, http://dbpedia.org/>`, where the last part is the IRI of the named graph. In SPARQL you can query over multiple named graphs and even attach variables to the named graphs.

Inference/Reasoning: The real power of RDF and the standards that come around RDF, such as RDFS and OWL, is that they provide a standard vocabulary over which one can develop a logical reasoner. This is quite advanced for us and beyond the scope of our initial roadmap (though arguably this is what makes RDF very powerful and super interesting), so I will avoid getting into details, but I will give a simple example. In RDFS (RDF Schema), one can define the domains and ranges of predicates. You can say
`<dbo:almaMater, rdfs:range, dbo:EducationalInstitution>` to specify the constraint that the range of `dbo:almaMater` is `dbo:EducationalInstitution`. From the triple about Justin Trudeau's alma mater above, a DBMS that has RDFS inference capabilities can deduce that `<dbr:McGill_University, rdf:type, dbo:EducationalInstitution>`, even though this triple may never have been inserted into the system. Therefore, if you asked a query to return all triples `<?, rdf:type, dbo:EducationalInstitution>`, you'd get `dbr:McGill_University`. A lot more advanced reasoning/inference can be done automatically with the standards around RDF, such as OWL. This is the real strength of storing your records and the semantics of your records in RDF, which is founded on a logical formalism that facilitates reasoning over these records.

Goals of RDF Support in Kùzu
RDF and Kùzu's data model have differences, but RDF triples can still be ingested into Kùzu as a set of nodes, relationships, and possibly some node properties, and then queried in Cypher. Our immediate goals for RDF support in Kùzu are twofold:
Outline for the rest of the document:
We should go in iterations, implementing increasingly advanced features. I will propose a rough design for the first two obvious steps and also mention more advanced features, such as RDF-Star and inferencing, which should be on our roadmap.
Step 1: Basic RDF Support
In this version we only support ingesting triples whose objects are Resources (so we do not ingest and query triples whose object is a value, e.g., <dbr:Sophie_Trudeau, dbo:birthDate, 1975-04-24 (xsd:date)>).
I will use the following example in this section (JT stands for Justin_Trudeau):
There are 5 triples (note the 1 triple for the blank node representing Sophie Trudeau). The objects of all of these triples are Resources. There is 1 blank node and 9 IRIs identifying 9 Resources.
RDFGraphs and Triple Ingestion Commands:
Users first need to create an RDFGraph in their database before they can insert triples into it. So far in Kùzu, users can create either a Node Table or a Rel Table; RDFGraphs would be the third logical storage abstraction supported in the system. Implicitly, an RDFGraph represents a pair of a Resources Node Table and a Triples Rel Table. However, RDFGraph is a new and separate abstraction that the system and users know about. That is, an RDFGraph is not just a wrapper around a Resources Node Table and a Triples Rel Table (as will become clear).
From the user's perspective, ingesting triples into a named graph should be as simple as running these two commands:
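A sketch of what these could look like; the `COPY ... FROM` syntax for RDF files is an assumption, mirroring Kùzu's existing `COPY` for node/rel tables:

```cypher
// Create an RDFGraph named DBPedia, then bulk-ingest triples into it
// from a Turtle file (hypothetical file name and COPY syntax).
CREATE RDFGraph DBPedia;
COPY DBPedia FROM "dbpedia.ttl";
```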
Users can create multiple RDF graphs and give each one an explicit name. For example, a user can run `CREATE RDFGraph YAGO`, so we should start off by supporting named graphs.

Mapping RDF Triples to Node/Rel Tables
Once these commands are executed, we should internally map these triples into Kùzu's property graph model. For the basic RDF support, where the objects of ingested triples must be Resources (and not values), there is a very natural way to do this (and I doubt there is a reasonable alternative): internally, we create one node table and one rel table with the following schemas:
`DBPedia.Resources(ID SERIAL, iri STRING, PRIMARY KEY ID)`: We cannot make the `iri` the primary key because blank nodes do not have IRIs. We define an `ID` property with data type `SERIAL`, a new feature in Kùzu indicating that the values of the column are dense integers from 0, ..., numNodes. This table will contain 9 nodes, one for each Resource and blank node. Note that the Resource with ID 5 has a null IRI and corresponds to the blank node representing Sophie Trudeau.
`DBPedia.Triples(FROM DBPedia.Resources, TO DBPedia.Resources, pIntIRI INT64)`: `pIntIRI` stands for "predicate integer IRI" and stores the integer ID of the IRI of the predicate in the Resources table. Observe that every IRI appears as a node in the Resources table even if it never appears as a subject or an object of a triple. This will create the following Rel Table. Note that we store Rel Tables in our Lists storage structure, which is a disk-based CSR, but I'm showing the records as a regular table for simplicity.

The internal `_relID` is a system-level identifier property we give to every relationship in the graph. Let's ignore the `_relID` property and look at the first tuple (0, 2, 5) as an example. This is the `<dbr:JT, dbo:spouse, [dbo:birthPlace, dbr:Quebec]>` triple if you follow the mapping of IRIs/blank nodes to Resource IDs in the DBPedia.Resources node table.

Note 1: I highly suggest that we sort the triple edges in `Lists` based on predicate value. This will be critical for quick search of predicates over large adjacency lists.

Note 2: A big challenge will be that some of the backward adjacency lists will be very large. For example, in DBPedia there is a Resource that appears as the object of 16M triples. So we would have 2 backward adjacency lists of size 16M (one storing the pIntIRI and one storing the TO properties of the DBPedia.Triples Rel Table). We need optimizations to improve the scalability of such very large adjacency lists.
Hash Index to store the IRI-to-Resource ID mapping: Along with these 2 tables, we need to use our Hash Index to map IRIs to Resource IDs. We already have the capability to do this for mapping string primary keys to system-level node offsets; however, this would be a case where we use the Hash Index for a non-primary-key column. We do not need the reverse map because it is already stored in the IRI column of the Resources Node Table.

In summary, internally we use 3 storage structures to store the triples: 1) the Resources Node Table; 2) the Triples Rel Table; and 3) the IRI Hash Index. Each RDFGraph creates its own Resources table, Triples table, and Hash Index.
Querying RDF Triples in Cypher
Users can then query triples in their RDFGraph using the `()-[]->()` rel pattern in their `MATCH` or `OPTIONAL MATCH` clauses. The running example query asks for a 2-hop path whose starting subject is dbr:JT, over a single RDFGraph.
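A minimal sketch of this query, mirroring the running example repeated later in this document (the `WHERE` predicate on the subject's IRI is an assumption):

```cypher
// 2-hop pattern over the DBPedia RDFGraph, anchored at dbr:JT.
MATCH (s1:DBPedia)-[p1:DBPedia]->(so:DBPedia)-[p2:DBPedia]->(o:DBPedia)
WHERE s1.iri = "dbr:JT"
RETURN *;
```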
We could also ask queries over multiple RDFGraphs, e.g., with a pattern that assigns multiple labels to nodes/relationships:
`MATCH (s:DBPedia|YAGO)-[p:DBPedia|YAGO]->(o:DBPedia|YAGO) RETURN label(s), s, p, o`. Here `label(s)` will return the name of the RDFGraph, similar to how users can bind a variable to the name of a named graph. At least on the face of it, supporting these query variants does not seem to require any changes to our query planner and executor.

Note 1: This example already shows us that RDFGraph is not just a wrapper to map triples to node/rel tables. For example, the query processor of the system needs to be aware that what users want is not the `pIntIRI` property that we stored in the Triples Rel Table but the string IRI versions of those. So, for example, in the second tuple, for the triple matching the `(s1:DBPedia)-[p1:DBPedia]->(o:DBPedia)` part of the query, we return `({ID: 0, iri: "dbr:JT"}, {_relID: 3, iri: "dbo:almaMater"}, {ID: 7, iri: "dbr:McGill_University"})` instead of `(..., {_relID: 3, pIntIri: 2}, ...)`. That is, we convert the `pIntIRI: 2`, which is what we store in the first row of the Triples table, into `iri: "dbo:almaMater"`. The query processor needs to be conscious of such RDF-specific query processing when evaluating queries that refer to the Triples and Resources tables.

Note 2: We can, without any special treatment, support querying RDFGraphs together with explicitly defined Node and Rel Tables. This should just work seamlessly. For example, one can have an `Employee` node table and a `RelatedTo` Rel Table from `Employee` nodes to `DBPedia.Resources` nodes. One can then write a query such as the sketch below.
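A minimal sketch, reusing the pattern that appears again in the Optimizations section (the `WHERE` predicate is an assumption):

```cypher
// All employees related to some Resource that dbr:JT has a connection to.
MATCH (s:DBPedia)-[p:DBPedia]->(o:DBPedia)<-[p2:RelatedTo]-(a:Employee)
WHERE s.iri = "dbr:JT"
RETURN a;
```

This query asks for all employees that are related to some Resource that Justin Trudeau has a connection to in the DBPedia graph. We could have also asked (again a sketch; the Employee `name` property is an assumption):

```cypher
// All Resources the Employee Alice is RelatedTo.
MATCH (a:DBPedia.Resources)<-[p2:RelatedTo]-(e:Employee)
WHERE e.name = "Alice"
RETURN a;
```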
This query asks for all Resources the Employee Alice is `RelatedTo`. Note that I used the `DBPedia.Resources` label for the variable `a` instead of `DBPedia`. See my next note on this.

Note 3: In the first query we used the rel pattern `(s1:DBPedia)-[p1:DBPedia]->(so:DBPedia)` instead of the more verbose `(s1:DBPedia.Resources)-[p1:DBPedia.Triples]->(so:DBPedia.Resources)`. However, in the immediately above query we used `MATCH (a:DBPedia.Resources)<-[p2:RelatedTo]-(e:Employee)`, as it seemed more natural to be explicit that we were querying the Resources in DBPedia, and that is how we had defined the RelatedTo table. We need to settle on a consistent syntax here, but assuming it is not hard to infer, we might just be flexible and allow both syntaxes whenever they make sense.

Note 4: We can also allow queries that only query the Resources nodes and not triples, as follows.
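A sketch of such a resources-only query (the `WHERE` predicate is an assumption):

```cypher
// Look up a single Resource node by IRI, without matching any triples.
MATCH (s:DBPedia)
WHERE s.iri = "dbr:JT"
RETURN s;
```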
For now, I suggest we support this even though it deviates from the RDF/SPARQL norm that one can query triples only using triple patterns (e.g., in SPARQL you have to bind subject, predicate, and object, as in `SELECT * WHERE { ?s ?p ?o }`). So we would deviate from this norm, but allowing these queries is also less code on our side than restricting them.

Implementation Challenges: I won't go into many details here, but the main changes are to: (i) the Catalog; (ii) the execution of the COPY statement, which needs parsers for Turtle, N3, JSON-LD files, etc.; and (iii) the planner and join optimizer, to ensure we bind correct types, etc.
Updating RDF Triples in Cypher
We need to support `CREATE` and `DELETE` statements. There is a decision to make here about whether we force users to `CREATE` and `DELETE` Resources and Triples separately, or only as triples. I would suggest that we force users to `CREATE` and `DELETE` as triples, and also ingest data in `COPY FROM` statements as triples. This is another part where we might want to get feedback from users and the RDF community.

Note: When deleting triples, we cannot remove nodes from the Resources table or entries from the IRI Hash Index; we can only remove relationships from the Triples table. So the Resources table and the IRI Hash Index of an RDFGraph can only grow. The reason is this: when a triple (s, p, o) is deleted, we can quickly check whether s.IRI or o.IRI, etc., still appear as a subject or an object by checking whether their adjacency lists in the Triples table are empty. But we cannot easily check whether they appear as a predicate, because we do not have an index to check whether an IRI is stored as a predicate in some triple.
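To make the triple-level suggestion concrete, here is a hypothetical sketch of such updates; the exact syntax is yet to be designed and everything below is an assumption:

```cypher
// Insert a triple: the Resource nodes and the Triples relationship
// would be created (or reused) by the system behind the scenes.
CREATE (:DBPedia {iri: "dbr:JT"})-[:DBPedia {iri: "dbo:birthPlace"}]->(:DBPedia {iri: "dbr:Ottawa"});

// Delete a triple: per the note above, only the relationship is removed;
// the Resource nodes and IRI Hash Index entries stay.
MATCH (s:DBPedia)-[p:DBPedia]->(o:DBPedia)
WHERE s.iri = "dbr:JT" AND p.iri = "dbo:birthPlace" AND o.iri = "dbr:Ottawa"
DELETE p;
```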
Step 2: Value Objects
Our second step is to support storing and querying triples whose objects can also be values, such as `<dbr:ST, dbo:birthDate, "1975-04-24"^^xsd:date>`. This requires major changes to the system.

Let's change our example a bit: I added one new triple to the blank node, with [dbo:birthDate, "1975-04-24"^^xsd:date] as predicate and object.
Mapping RDF Triples to Node/Rel Tables
I will discuss two options for storing/mapping these triples to our Node/Rel Tables. The second one was suggested by Gaurav and Xiyang and is worth considering, but we would need to do some back-of-the-envelope calculations about its storage costs. My position is to go with the first storage design.
Option 1: We extend the Resources node table with a `semiStructuredTriples` column, storing a `blob` property for each node that holds these triples in the form of intPredicateIRI:dataType:value entries. Note that this is semi-structured, as we store the type of the data along with the data. So we would have `DBPedia.Resources(ID SERIAL, iri STRING, semiStructuredTriples BLOB, PRIMARY KEY ID)`.

Above, {dbo:birthDate, "1975-04-24"^^xsd:date} would not be stored as a string but in an efficient binary encoding of these values. For example, the date would be stored in 4 bytes, the data type xsd:date in 1 byte, and dbo:birthDate in 8 bytes as the integer version of the IRI.2

Option 2: For each possible data type of the objects we store, we can extend the Resources table with a pair of columns, e.g., `DBPedia.Resources(ID SERIAL, iri STRING, int64Predicates List<INT64>, int64Values List<INT64>, floatPredicates List<INT64>, floatValues List<FLOAT>, ..., PRIMARY KEY ID)`. This may actually not be too bad in storage, given that most entries will be null and we will compress nulls very well with our Storage V2 design. It might be worth investigating as a side project. The real difference is that this type of storage can slow down queries that need to check whether a particular node has a particular predicate, because dozens of columns would need to be scanned, as we are destined to support dozens of data types.
Irrespective of which choice we take, we also need to store the IRIs of the predicates in these triples in the IRI Hash Index and allocate them an offset in the Resources table. For example, for the blank node's triple [{dbo:birthDate, "1975-04-24"^^xsd:date}], the predicate `dbo:birthDate` still has an IRI, so it is a Resource and should be stored and mapped to an integer in the IRI Hash Index.

Querying RDF Triples in Cypher
Irrespective of how we store these triples, when querying these triples we have a choice. I'm a bit divided about this, but Option 1 below follows SPARQL more closely, so that is my preference for now. We should try to get some insights on this from outside.
Note: The variable `o`'s type is sometimes a Resource and sometimes another type. We already have similar capabilities for returning variables that can bind to multiple Node or Rel Tables; e.g., the query `MATCH (a) RETURN *` would return nodes from all node tables in the database bound to the variable `a`. In this case, for RDF value objects, we should probably define a new type called `Value` or `Literal` that can store a wide range of types. Another option is to make every object always a `Value` and extend `Value` with a `Resource` type, so it can hold both a Resource and other values and return them in a single data type.

Note 1: The `WHERE s.iri IS NULL` predicate is a way to search for blank nodes, though we could support a separate function such as `bnode(s)` to be more explicit.

Note 2: `dbo:birthDate:1975-04-24` is now a property on `s1`. The obvious problem with this approach is that users no longer have a single way to query triples seamlessly, as they do in SPARQL. To query whether a Resource has a `dbo:birthDate` predicate, they have to query node properties, e.g., `MATCH (s:DBPedia) WHERE s.dbo:birthDate = "1975-04-24" RETURN *`, instead of writing a rel pattern.

Note 3: Our running example 2-hop query's output also changes if we model value objects as node properties. Recall that the query was:
`MATCH (s1:DBPedia)-[p1:DBPedia]->(so:DBPedia)-[p2:DBPedia]->(o:DBPedia)`. Now that `dbo:birthDate:1975-04-24` is not part of the `Triples` table, the system would not return the output tuple `({ID: 0, iri: "dbr:JT"}, {_relID: 0, iri: "dbo:spouse"}, {ID: 5, iri: null}, {_relID: 3, iri: "dbo:birthDate"}, {value: "1975-04-24"})`, because the last two components, `{_relID: 3, iri: "dbo:birthDate"}, {value: "1975-04-24"}`, are now a property of `so` and not a triple. This is an important problem because it changes the meaning of the join. So, for example, if you wanted to query all properties of a node with IRI `dbo:xyz`, you'd have to run 2 queries: one `MATCH (s:DBPedia) WHERE s.iri = dbo:xyz RETURN *`, and another `MATCH (s:DBPedia)-[p:DBPedia]->(o:DBPedia) WHERE s.iri = dbo:xyz RETURN *`.

Implementation Challenge: Irrespective of how we model value objects, our query processor and expression/function evaluators will need to change a lot, as our system can no longer be fully typed; i.e., there will be variables whose types we cannot infer at compilation time. So, if we support running expressions and functions on those variables, we need new unary and binary expression evaluators that assume the types of their operands may not be known a priori, so they first check their operands' types and then call a function (and maybe there are ways to cast `Value`s into specific types for some queries).
Updating RDF Triples in Cypher
Assuming we pick Option 1, the CREATE and DELETE statements should stay the same, except that we need new operators to efficiently update the blobs that store the predicate-value pairs as a node property.
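A hypothetical sketch of how such an update could look if value objects are exposed as node properties (the `SET` syntax and the backtick-quoted property name are assumptions):

```cypher
// Hypothetical: add a value-object triple by writing a node property;
// internally this would update the subject's semiStructuredTriples blob.
MATCH (s:DBPedia)
WHERE s.iri = "dbr:ST"
SET s.`dbo:birthDate` = date("1975-04-24");
```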
Optimizations
- Queries over RDFGraphs will contain variables, such as `o`, that bind to objects. Therefore, whenever we can, we should infer during compilation whether `o` can bind only to Resources, so we can use faster query execution paths. For example, in the query `MATCH (s:DBPedia)-[p:DBPedia]->(o:DBPedia)<-[p2:RelatedTo]-(a:Employee)`, `o` can only be a Resource, as it has an incoming edge from an `Employee` node.

- HDT RDF Compression Format: This is something to look into for two reasons: (1) to explore whether it makes sense as a way to store RDF triples in the system (probably not); and (2) to be able to ingest and output results in HDT format. This might make sense.
Advanced Features
RDF-Star
I am intentionally leaving this blank, to be designed later. RDF-Star is an extension to RDF for making statements about statements (as I understand it, also soon to be part of the RDF 1.2 standard). Here, by a statement I refer to a triple, as a triple makes a "statement" about something in the modeled application domain. In RDF-Star, a subject or object resource can be a quoted triple. So <<dbr:JT, dbo:spouse, dbr:ST>, abc:reportedBy, abc:John_Doe> is a statement that John Doe reported that JT's spouse is ST. Importantly, <dbr:JT, dbo:spouse, dbr:ST> is not necessarily a triple in the database, so a query over all triples, e.g., SPARQL's `WHERE { ?s ?p ?o }`, would not necessarily return it. My current thoughts are that we can support mapping RDF-Star by defining a new `resourceType` property for Resources stored in the Resources Node Table that can be: (i) `Resource` or (ii) `Implicit_Statement`. But this needs to be designed, and we first need to see the interest in these more advanced features.

Inference Features
It's also unclear what to do here. The DBMS community does not have a lot of expertise in inference algorithms, and we need to understand this space much, much better before we can propose a good design.
Footnotes
Resources can be thought of as objects, as in object-oriented programming languages, and one can make statements about them. ↩
This is effectively a simpler implementation of the "Semi-structured Node Property Lists" we had before we open sourced Kùzu. (For anyone interested: before November 2022, we had a feature to store arbitrary key-value properties on nodes called "Semi-structured Node Property Lists", which we removed.) ↩