Layered JSON schemas can be used instead of JSON-LD to translate JSON documents to knowledge graphs (and thus, to RDF). This is a proof-of-concept showing how that's done.
JSON-LD is a format for encoding linked data. It allows mapping of JSON properties to a web ontology to enable interoperability between different systems. JSON-LD also offers a JSON-based encoding of knowledge graphs. A JSON-LD document can be translated into RDF and vice versa.
A layered JSON schema is simply a JSON schema with additional overlays that annotate the schema. This annotated schema (the "schema variant") is used to ingest a JSON document. The added annotations encode the necessary terminology mappings and graph-shaping instructions to build a knowledge graph, and thus, RDF.
There are several advantages to using layered JSON schemas: JSON schemas are widely used in the industry to specify data standards. They describe the valid JSON documents expected by data exchange partners or API users in a machine-readable manner, so they can be used to validate JSON documents or to generate code for different languages.
The LSA tools are available here. Here's an overview of the idea:
A schema variant is composed of a schema that defines data structures, and zero or more overlays (hence the name "layered schemas") that annotate the schema with semantic information and metadata. These annotations adjust and enrich the schema by adding or removing constraints, metadata, processing information such as pointers to normalization tables, or mappings to an ontology.
Schema variants are useful in an environment where there are multiple varying implementations of a standard, or where there are multiple standards or proprietary data structures within an ecosystem. Different schema variants can be used to ingest and harmonize disparate data structures.
When a document is ingested using LSA tools, the ingestion process creates a labeled property graph (LPG) containing both the input data elements and the schema variant. Each node in this graph contains the ingested data value and the schema information corresponding to the data element. Thus, the ingested LPG is a self-describing object that contains the input values and schema annotations for each data element. The annotations can also contain graph-shaping instructions to fine-tune the ingestion process.
This proof-of-concept uses the LSA packages to compose schema variants and ingest JSON documents. The layers program from the LSA repository can also be used together with the json2rdf program in this repository.
Let's consider the following simple JSON-LD document describing a Person. It uses the https://schema.org ontology to describe a person object:
{
  "@context": "https://schema.org",
  "@type": "Person",
  "@id": "http://linkedin.com/jane-doe",
  "address": {
    "@type": "PostalAddress",
    "addressLocality": "Denver",
    "addressRegion": "CO"
  },
  "colleague": [
    "http://example.com/John.html",
    "http://example.com/Amy.html"
  ],
  "email": "jane@example.com",
  "name": "Jane Doe",
  "sameAs": [
    "https://facebook.com/jane-doe",
    "https://twitter.com/jane-doe"
  ]
}
JSON-LD uses the context to map JSON keys to RDF. This is done by mapping individual object properties to concepts in an ontology (the context defines the "semantics" of the data). The graphical RDF representation for this object is:
As you can see, the structure of the output graph depends on both the JSON-LD context mappings and the structure of the JSON input file. A JSON object in the input file is represented as a node in the output graph, and a JSON property is represented as an edge (predicate). If the JSON object does not have an @id, then it is translated into a blank node.
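For reference, an approximate N-Triples rendering of this JSON-LD document is shown below. This is only an illustration: the exact IRIs (http vs. https) and whether URL-valued properties such as colleague and sameAs become IRI nodes or string literals depend on the version of the schema.org context that is resolved.
<http://linkedin.com/jane-doe> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://schema.org/Person> .
<http://linkedin.com/jane-doe> <http://schema.org/name> "Jane Doe" .
<http://linkedin.com/jane-doe> <http://schema.org/email> "jane@example.com" .
<http://linkedin.com/jane-doe> <http://schema.org/address> _:b0 .
_:b0 <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://schema.org/PostalAddress> .
_:b0 <http://schema.org/addressLocality> "Denver" .
_:b0 <http://schema.org/addressRegion> "CO" .
<http://linkedin.com/jane-doe> <http://schema.org/colleague> <http://example.com/John.html> .
<http://linkedin.com/jane-doe> <http://schema.org/colleague> <http://example.com/Amy.html> .
<http://linkedin.com/jane-doe> <http://schema.org/sameAs> <https://facebook.com/jane-doe> .
<http://linkedin.com/jane-doe> <http://schema.org/sameAs> <https://twitter.com/jane-doe> .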
This process can be summarized as:
The goal is to produce RDF from a JSON document using a JSON schema, as opposed to using a JSON-LD document and context. For this, we will annotate the JSON schema using an overlay (remember: schema + overlay = schema variant), ingest the JSON document, and translate the resulting LPG to RDF using the annotations embedded in the ingested object.
First, we write a JSON schema to describe the data structures (Person and PostalAddress). Then we combine it with an overlay that describes how to translate JSON data points into RDF. For example, the JSON key-value pair:
"email": "jane@example.com"
should be translated as an RDF predicate http://schema.org/email and an RDF literal object jane@example.com. To do this, the overlay annotates the schema for the email property with the IRI http://schema.org/email. The overlay should also specify that it should be an RDF predicate. The following annotation serves this purpose:
"rdfPredicate": "http://schema.org/email"
As another example, consider the Person object:
{
  "@id": "http://linkedin.com/jane-doe",
  ...
}
This should be translated to an RDF IRI node http://linkedin.com/jane-doe, which is given in the @id attribute. So we need a way to say that an RDF IRI node should be created from a given data element to represent this object:
"rdfIRI": "ref:<pointer to the @id field>"
The following diagram summarizes the approach:
Based on this, we devise the following tags:
- rdfPredicate: Declares a JSON property as an RDF predicate (edge), with a mapping to a term. This is similar to mapping a JSON property using a JSON-LD context. The difference is that, using a layered schema, we can explicitly declare that a JSON property should be translated as a predicate instead of a node.
- rdfIRI: Declares a JSON property as an RDF node, while also providing its IRI mapping. The IRI can be a fixed value, or it can be collected from another node in the input document (like the @id property in our example).
- rdfType: Defines the type of the literal, or the type of the node.
- rdfLang: Defines the language of a literal.
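For illustration, overlay entries using rdfType and rdfLang (placed under the x-ls object described below) could look like the following. The birthDate entry mirrors the typed date that appears in the sample output later in this document; the jobTitle entry with rdfLang is purely hypothetical and is not part of the sample overlay.
"birthDate": {
  "x-ls": {
    "rdfPredicate": "http://schema.org/birthDate",
    "rdfType": "http://schema.org/Date"
  }
},
"jobTitle": {
  "x-ls": {
    "rdfPredicate": "http://schema.org/jobTitle",
    "rdfLang": "en"
  }
}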
The schema defines the object Person as follows:
{
  "definitions": {
    "Person": {
      "type": "object",
      "properties": {
        "name": {
          "type": "string"
        },
        ...
The overlay follows the same structure, but adds the tags under the x-ls object:
{
  "definitions": {
    "Person": {
      "x-ls": {
        "rdfType": "http://schema.org/Person",
        "rdfIRI": "ref:http://schema.org/Person/@id"
      },
      "properties": {
        "@id": {
          "type": "string"
        },
        "name": {
          "x-ls": {
            "rdfPredicate": "http://schema.org/name"
          }
        }
        ...
The @id property does not exist in the original schema; the overlay adds it.
All annotations go under the x-ls JSON object. This is the recommended way of adding extensions to a JSON schema: extension keywords start with x-, and ls stands for layered schema. The layered schema processor builds a composite schema by adding the x-ls objects to the corresponding places in the schema.
The schema variant is a composition of the two:
{
  "definitions": {
    "Person": {
      "type": "object",
      "x-ls": {
        "rdfType": "http://schema.org/Person",
        "rdfIRI": "ref:http://schema.org/Person/@id"
      },
      "properties": {
        "@id": {
          "type": "string"
        },
        "name": {
          "type": "string",
          "x-ls": {
            "rdfPredicate": "http://schema.org/name"
          }
        },
        ...
Here's the complete person schema and the overlay.
For rdfIRI, json2rdf uses these conventions:
- The following uses the given value as the IRI of the node:
"rdfIRI": "value"
Example:
Input:
"PostalAddress": {
  "x-ls": {
    "rdfIRI": "http://schema.org/PostalAddress"
  }
}
Output:
The RDF node corresponding to the "PostalAddress" property has the IRI "http://schema.org/PostalAddress".
- The following creates a blank node for the JSON property:
"rdfIRI": "blank"
Example:
Input:
"PostalAddress": {
  "x-ls": {
    "rdfIRI": "blank"
  }
}
Output:
The RDF node corresponding to the "PostalAddress" property will be a blank node.
- The following uses the referenced node value to create an IRI node. The node value must be an IRI. The first node accessible from the current node that has a schemaNodeId: <reference> value will be used.
"rdfIRI": "ref:<reference>"
Example:
Input:
"Person": {
  "x-ls": {
    "rdfIRI": "ref:http://schema.org/Person/@id"
  }
}
Output:
The RDF node corresponding to the "Person" property will have the IRI extracted from the "@id" property under the "Person" object (the LPG node with schemaNodeId: http://schema.org/Person/@id).
- The following uses the JSON property value to create an IRI node. The JSON property value must be an IRI:
"rdfIRI": "."
LSA already provides tools to ingest data and produce a labeled property graph, so we will use those. LSA ingests a data file based on a schema variant and produces a labeled property graph. The following image illustrates the data ingestion process.
The JSON schema person.schema.json defines the structure of the JSON objects, in this example Person and PostalAddress. The overlay person.ovl.json annotates this schema to define the mappings to schema.org terms using the above tags. The bundle file person.bundle.yaml combines the schema and the overlay, and defines the schema variant. The schema variant itself is an LPG that contains a node for every JSON data point (every object, array, and value). The data ingestion process takes the input data file person-sample.json and interprets it using the schema variant LPG, creating a new LPG for the data object. This LPG is a self-describing object that contains all input data values and the corresponding schema annotations. We then take this LPG, use the RDF annotations at each node, and produce the RDF output.
The LPG for the ingested data contains the schema annotations as well as the input data. We can process the annotations in each node to create the RDF output. For the Person object, this creates an RDF IRI node whose value is taken from the @id property under Person, with type http://schema.org/Person. Note that the original schema does not contain the @id property; it is added by the overlay. The output looks like:
For Person/name, we get:
The algorithm to convert an ingested data LPG into RDF using these annotations is implemented in graph2rdf.go. The algorithm sketch is as follows:
- We first build the top-level nodes. These are the nodes that have an rdfIRI annotation. We keep a mapping between input LPG nodes and the RDF nodes. This step builds a list of input graph nodes for which RDF nodes have been built.
- Using the list of graph nodes built in the previous step or iteration, we process all graph nodes that are connected with rdfPredicate, and extend the RDF graph in a breadth-first manner. We add every new input graph node for which a non-literal RDF node is generated to the list of nodes, and iterate as long as the list of nodes is nonempty.
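To make these two steps concrete, here is a minimal, self-contained Go sketch of the breadth-first translation. The Node type, its fields, and the way annotations are stored and looked up are hypothetical simplifications for illustration only; the actual implementation (including the ref: and . conventions, data types, and language tags) is in graph2rdf.go.

package main

// Minimal sketch of the breadth-first LPG-to-RDF translation described above.
// The Node type and annotation handling are hypothetical stand-ins; the real
// implementation lives in graph2rdf.go.

import "fmt"

type Node struct {
	Annotations map[string]string // schema annotations carried by the ingested node
	Value       string            // ingested data value, for literal nodes
	Children    []*Node           // outgoing edges of the LPG
}

var blankCount int

// rdfTerm builds the RDF term for a node annotated with rdfIRI.
// Only the "blank" and fixed-IRI conventions are handled in this sketch.
func rdfTerm(n *Node) string {
	if n.Annotations["rdfIRI"] == "blank" {
		blankCount++
		return fmt.Sprintf("_:b%d", blankCount-1)
	}
	return "<" + n.Annotations["rdfIRI"] + ">"
}

// toRDF prints N-Triples for the ingested graph rooted at root.
func toRDF(root *Node) {
	terms := map[*Node]string{} // mapping from input LPG nodes to RDF terms

	// Step 1: build the top-level RDF node(s) -- nodes with an rdfIRI annotation.
	queue := []*Node{}
	if _, ok := root.Annotations["rdfIRI"]; ok {
		terms[root] = rdfTerm(root)
		queue = append(queue, root)
	}

	// Step 2: extend the RDF graph breadth-first along rdfPredicate edges.
	for len(queue) > 0 {
		n := queue[0]
		queue = queue[1:]
		subj := terms[n]
		if t, ok := n.Annotations["rdfType"]; ok {
			fmt.Printf("%s <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <%s> .\n", subj, t)
		}
		for _, c := range n.Children {
			pred, ok := c.Annotations["rdfPredicate"]
			if !ok {
				continue
			}
			if _, isNode := c.Annotations["rdfIRI"]; isNode {
				// Non-literal object: build its term once and keep expanding from it.
				if _, seen := terms[c]; !seen {
					terms[c] = rdfTerm(c)
					queue = append(queue, c)
				}
				fmt.Printf("%s <%s> %s .\n", subj, pred, terms[c])
			} else {
				// Literal object.
				fmt.Printf("%s <%s> %q .\n", subj, pred, c.Value)
			}
		}
	}
}

func main() {
	// A tiny hand-built ingested graph mirroring part of the Person example.
	address := &Node{
		Annotations: map[string]string{
			"rdfIRI":       "blank",
			"rdfPredicate": "http://schema.org/address",
			"rdfType":      "http://schema.org/PostalAddress",
		},
		Children: []*Node{
			{Annotations: map[string]string{"rdfPredicate": "http://schema.org/addressLocality"}, Value: "Denver"},
		},
	}
	person := &Node{
		Annotations: map[string]string{
			"rdfIRI":  "http://linkedin.com/jane-doe",
			"rdfType": "http://schema.org/Person",
		},
		Children: []*Node{
			{Annotations: map[string]string{"rdfPredicate": "http://schema.org/name"}, Value: "Jane Doe"},
			address,
		},
	}
	toRDF(person)
}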
You can run the json2rdf program as follows:
json2rdf --bundle person.bundle.yaml --type http://schema.org/Person person-sample.json
Alternatively, you can first ingest the JSON file using LSA tools, and then pass the output graph to json2rdf:
layers ingest json --bundle person.bundle.yaml --type http://schema.org/Person person-sample.json | json2rdf
Which produces:
<http://linkedin.com/jane-doe> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://schema.org/Person> .
_:b0 <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <https://schema.org/PostalAddress> .
<http://linkedin.com/jane-doe> <http://schema.org/email> "info@example.com" .
<http://linkedin.com/jane-doe> <http://schema.org/name> "Jane Doe" .
<http://linkedin.com/jane-doe> <http://schema.org/birthPlace> "Boulder, CO" .
<http://linkedin.com/jane-doe> <http://schema.org/birthDate> "1972-11-12"^^<http://schema.org/Date> .
<http://linkedin.com/jane-doe> <http://schema.org/height> "71" .
<http://linkedin.com/jane-doe> <http://schema.org/gender> "female" .
<http://linkedin.com/jane-doe> <http://schema.org/address> _:b0 .
<http://linkedin.com/jane-doe> <http://schema.org/colleague> "http://www.example.com/John.html" .
<http://linkedin.com/jane-doe> <http://schema.org/colleague> "http://www.example.com/Jane.html" .
<http://linkedin.com/jane-doe> <http://schema.org/sameAs> "https://www.facebook.com/" .
<http://linkedin.com/jane-doe> <http://schema.org/sameAs> "https://www.linkedin.com/" .
<http://linkedin.com/jane-doe> <http://schema.org/sameAs> "http://twitter.com/" .
_:b0 <http://schema.org/addressLocality> "Denver" .
_:b0 <http://schema.org/addressRegion> "CO" .
_:b0 <http://schema.org/postalCode> "80123" .
_:b0 <http://schema.org/streetAddress> "100 Main Street" .
So this is how you can generate RDF from a JSON document using a JSON schema describing the format of the input documents, a JSON schema overlay that describes the RDF mapping, and a bundle that combines the schema and the overlay.
The important takeaways are:
- The shape of the RDF output can be better controlled by extending the tags and the translation algorithm. A JSON document can be translated in multiple ways to produce different RDF outputs.
- The input to json2rdf is an LPG. LSA supports other schema formats, so it is possible to translate an XML document or a CSV file to RDF using the same framework.
- LSA supports multiple data types. It is possible to normalize and translate non-standard date/time representations.
- Unlike a JSON-LD document, JSON schemas provide structural validation.