Layered Schema Architecture (LSA) enables semantic interoperability between heterogeneous systems. LSA uses a schema (such as FHIR schemas for health data) to define data structures, and overlays to add semantics, contextual metadata, and processing directives. A schema variant is the composition of the schema base and use-case specific overlays. Different schema variants can be used to ingest data from or export data to disparate systems. Each variant encodes source specific metadata and rules to ingest data into a knowledge graph, or target specific metadata and rules to translate the knowledge graph into another format.
A layered schema defines a mapping from structured data (in JSON, XML, tabular format) to a labeled property graph. For instance, when a JSON document is ingested using a layered schema, a labeled property graph is constructed by combining schema information with the input data elements. The resulting graph is self-contained, and includes both the input data and the schema. This allows processing data based on its semantic attributes instead of its attributes names or other structural properties. For example, it is possible to use layered schema to ingest health data, mark certain attributes as "identifiable", and then, remove all attributes that are marked as such. The algorithm to remove "identifiable" elements can be written without any knowledge of the input structure. Same algorithm can be used to de-identify health data as well as marketing data.
The main documentation site for layered schemas is:
This Go module contains the reference implementation of the layered schema specification. It contains a layered schema processor to slice/compose schemas, import JSON schemas, and ingest data from JSON/CSV/XML sources.
A traditional schema (a JSON schema, an XML schema, etc.) defines one or more data types. Such schemas are primarily used for documentation, data validation, and automated code generation. A schema includes definitions about the shape of the data object such as the list of fields, nested fields, or constraints on field contents. A layered schema adds additional interchangeable layers to a traditional schema so the same schema can be adjusted for different use cases, or can be processed with different processing directives to handle data variations. The layered schema architecture provides the framework to extend traditional schemas using such layers.
A schema defines a data structure (attributes). A schema has a
valueType
which is the the type of data structure defined by the
schema (e.g. Person
).
An overlay defines semantics for attributes, and may override schema
definitions. A Person
schema can only be composed with Person
overlays.
A schema variant is a schema that is the result of the composition of a schema with zero or more overlays.
A bundle defines the schema variants for multiple valueType
s. That
is, a bundle specifies all the schemas and overlays for each type of
object used in a particular use case. For instance, a bundle may
specify that Person
means the composition of a Person
schema and
an overlay containing terms from data privacy vocabulary. Data
ingested using this bundle will have annotations using data privacy
vocabulary terms.
Below is an example schema for a Person
data type containing
firstName
and lastName
.
person.schema.json:
{
"@context": "https://lschema.org/v1/ls.json",
"@id": "https://example.org/Person/schema",
"@type": "Schema",
"valueType": "https://example.org/Person",
"layer": {
"@type": "Object",
"@id": "https://example.org/Person",
"attributes": {
"https://example.org/Person/firstName": {
"@type": "Value"
},
"https://example.org/Person/lastName": {
"@type": "Value"
}
}
}
}
The Person
object is defined as an Object
that contains two
Value
attributes: firstName
and lastName
. These are the unique
attribute IDs. An overlay can be defined to add new semantics to these
attributes. The following overlay adds personal data category Name
to the lastName
attribute:
person-dpv.overlay.json:
{
"@context": "https://lschema.org/v1/ls.json",
"@id": "https://example.org/Person/dpv-ovl",
"@type": "Overlay",
"valueType": "https://example.org/Person",
"layer": {
"https://example.org/lastName": {
"@type": "Value",
"http://www.w3.org/ns/dpv#hasPersonalDataCategory": "http://www.w3.org/dpv#Name"
}
}
}
A bundle that defines Person
by composing the person schema and the
dpv overlay is:
{
"typeNames" : {
"https://example.org/Person": {
"schema" : "person.schema.json",
"overlays" : [
{
"schema" : "person-dpv.overlay.json"
}
]
}
}
}
A schema variant is a schema and zero or more overlays that modify the semantics of the schema for a use-case. A schema variant is composed by combining attributes of overlays with matching attributes of the schema. By default, the terms specified for the attributes of the overlays are combined as a set with the same terms of the schema. That is, if a schema has
{
"@id": <attributeId>,
"@type": "Value",
"someAnnotation": "someValue"
}
and the overlay has:
{
"@id": <attributeId>,
"@type": ["Value","someType"],
"someAnnotation": "otherValue"
}
the schema variant will have:
{
"@id": <attributeId>,
"@type": ["Value","someType"],
"someAnnotation": ["someValue", "otherValue"]
}
Some of the terms have override semantics, and can be used to modify schema behavior using overlays. For example:
{
"@id": <attributeId>,
"@type": "Value",
"valueType": "string"
}
The valueType
can be overridden by an overlay:
{
"@id": <attributeId>,
"@type": "Value",
"valueType": "xsd:date"
}
To compose schema variants using the layers
binary, use:
layers compose --bundle person-dpv.bundle.json --type https://example.org/Person
This will compose and output the schema variant for Person
as defined in the person-dpv.bundle
.
A schema variant may reference other schema variants. For example, the
Person
object in the examples/contact
directory refers to a
Contact
object to represent contact information. Compiling a schema
variant resolves these references and creates a single self-contained
schema. Such a schema can be used for data ingestion.
To compile the Person
schema, use:
layers compile --bundle person-dpv.bundle.json --type https://example.org/Person
This operation will attach the Contact
schema variant as configured
in the bundle to the Person
variant, resulting in a schema that has
no external references.
Data ingestion process parses an input object, combines the data attributes with the matching schema attributes and builds a labeled property graph (LPG). The LPG contains both the input object and the relevant schema information, so it can be processed using the annotations in the graph without any external references.
To ingest data using the layers
binary, use:
layers ingest json person_sample.json --bundle person-dpv.bundle.json --type https://example.org/Person
This operation will output a graph containing the nodes for the input
as well as nodes for the schema. All the input nodes will have the
label https://lschema.org/DocumentNode
, and all schema nodes will
have the label https://lschema.org/Attribute
. Every input node will
be connected to its corresponding schema node with a
https://lschema.org/instanceOf
edge. The input nodes will contain
https://lschema.org/schemaNodeId
annotation giving the ID of the
schema node for the attribute.
The schema nodes and input nodes can be combined using the
embedSchemaNodes
flag:
layers ingest json person_sample.json --bundle person-dpv.bundle.json --type https://example.org/Person --embedSchemaNodes
The output graph will have nodes containing only
https://lschema.org/DocumentNode
s that include all the annotations
given in the corresponding schema node. The original schema node can
still be found using the https://lschema.org/schemaNodeId
annotation.
Once you clone the repository, you can build the schema compiler using the Go build system.
cd layers
go build
This will build the layers
binary in the current directory.
The examples/
directory contains some example data processing
scenarios.
Commercial support for this library is available from Cloud Privacy Labs: support@cloudprivacylabs.com