Adds the "input blank node identifier map" (#100)
* Adds the "input blank node identifier map" and a way to initialize it from an input N-Quads document, otherwise uses arbitrary identifiers for each blank node in the input dataset.

Clears up some ambiguity about original blank node identifiers.

* Broken terms.

* Apply suggestions from code review

Co-authored-by: Ivan Herman <ivan@ivan-herman.net>
Co-authored-by: Dave Longley <dlongley@digitalbazaar.com>
Co-authored-by: Dan Yamamoto <dan@iij.ad.jp>

* In step 7, allow the normalized data set and input blank node identifier map to be used instead of the serialized N-Quads result.

* Apply suggestions from code review

Co-authored-by: Ted Thibodeau Jr <tthibodeau@openlinksw.com>

* Apply suggestions from code review

Co-authored-by: Dave Longley <dlongley@digitalbazaar.com>

* Grammar suggestion from Ted.

* Add input blank node identifier map and canonical issuer to the normalized dataset. This makes the RDF dataset portion of this immutable, initialized from the input dataset.

Updates some steps to clarify that updates are for map entries, and not adding quads, which simplifies step 6.

* Apply suggestions from code review

Co-authored-by: Dan Yamamoto <dan@iij.ad.jp>

* Update spec/index.html

Co-authored-by: Dave Longley <dlongley@digitalbazaar.com>

* Update spec/index.html

Co-authored-by: Dave Longley <dlongley@digitalbazaar.com>

* Update spec/index.html

Co-authored-by: Dave Longley <dlongley@digitalbazaar.com>

---------

Co-authored-by: Ivan Herman <ivan@ivan-herman.net>
Co-authored-by: Dave Longley <dlongley@digitalbazaar.com>
Co-authored-by: Dan Yamamoto <dan@iij.ad.jp>
Co-authored-by: Ted Thibodeau Jr <tthibodeau@openlinksw.com>
5 people authored May 24, 2023
1 parent 1a80f75 commit 5903d52
Showing 1 changed file with 120 additions and 38 deletions.
158 changes: 120 additions & 38 deletions spec/index.html
@@ -240,7 +240,25 @@ <h2>Uses of Dataset Canonicalization</h2>
RDF store or serialization is arbitrary,
and typically not relatable to the context within which it is used.</p>

<p>This specification defines an algorithm for creating stable <a>blank node identifiers</a> repeatably for different serializations possibly using individualized <a>blank node identifiers</a> of the same RDF graph (dataset) by grounding each <a>blank node</a> through the nodes to which it is connected, essentially creating <em>Skolem <a>blank node identifiers</a></em>. As a result, a graph signature can be obtained by hashing a canonical serialization of the resulting <a>normalized dataset</a>, allowing for the isomorphism and digital signing use cases. As blank node identifiers can be stable even with other changes to a graph (dataset), in some cases it is possible to compute the difference between two graphs (datasets), for example if changes are made only to ground triples, or if new blank nodes are introduced which do not create an automorphic confusion with other existing blank nodes. If any information which would change the generated blank node identifier, a resulting diff might indicate a greater set of changes than actually exists.</p>
<p>This specification defines an algorithm for creating stable
<a>blank node identifiers</a> repeatably for different serializations
possibly using individualized <a>blank node identifiers</a>
of the same RDF graph (dataset) by grounding each <a>blank node</a>
through the nodes to which it is connected.
As a result, a graph signature can be obtained by hashing a canonical serialization
of the resulting <a>normalized dataset</a>,
allowing for the isomorphism and digital signing use cases.
As blank node identifiers can be stable even with other changes to a graph (dataset),
in some cases it is possible to compute the difference between two graphs (datasets),
for example if changes are made only to ground triples,
or if new blank nodes are introduced which do not create an automorphic confusion
with other existing blank nodes.
If any information changes in a way that would alter the generated blank node identifiers,
a resulting diff might indicate a greater set of changes than actually exists.
Additionally, if the starting dataset is an N-Quads document,
it may be possible to correlate the original blank node identifiers
used within that N-Quads document with those issued in the
<a>normalized dataset</a>.</p>

<div class="ednote">
<p>Add descriptions for relevant historical discussions and prior art:</p>
@@ -295,15 +313,35 @@ <h3>Terms defined by this specification</h3>
<dt><dfn data-lt="input dataset|input datasets">input dataset</dfn></dt>
<dd>The abstract <a>RDF dataset</a> that is provided as input to
the algorithm.</dd>
<dt><dfn>input blank node identifier map</dfn></dt>
<dd>Records any blank node identifiers already assigned to the
<a>input dataset</a>.
If the <a>input dataset</a> is provided as an N-Quads document,
the <a>map</a> relates blank nodes in the abstract <a>input dataset</a>
to the blank node identifiers used within the N-Quads document;
otherwise, identifiers are assigned arbitrarily for
each blank node in the input dataset not previously identified.
<div class="note">Implementations or environments might deal with blank
node identifiers more directly; for example, some implementations might
retain blank node identifiers in the parsed or abstract dataset. Implementations
are expected to reuse these to enable usable mappings between input blank node
identifiers and output blank node identifiers outside of the algorithm.</div>
</dd>
<dt><dfn>normalized dataset</dfn></dt>
<dd>A <a>normalized dataset</a> is the combination of an <a>RDF dataset</a>
and a <a>map</a> where [=map/keys=]
are <a>blank nodes</a> from the dataset
and [=map/values=] are the associated canonical <a>blank node identifiers</a>.
<dd>A <a>normalized dataset</a> is the combination of the following:
<ul>
<li>an <a>RDF dataset</a>, namely
the <a>input dataset</a>,</li>
<li>the <a>input blank node identifier map</a>
mapping <a>blank nodes</a> in the input dataset to <a>blank node identifiers</a>, and</li>
<li>the <a>canonical issuer</a>
mapping identifiers in the input dataset to canonical identifiers</li>
</ul>
A concrete serialization of a <a>normalized dataset</a> MUST label
all <a>blank nodes</a> using these stable <a>blank node identifiers</a>.</dd>
all <a>blank nodes</a> using the canonical <a>blank node identifiers</a>.
</dd>
<dt><dfn>identifier issuer</dfn></dt>
<dd>An identifier issuer is used to issue new <a>blank node identifier</a>. It
<dd>An identifier issuer is used to issue new <a>blank node identifiers</a>. It
maintains a
<a href="#bn-issuer-state">blank node identifier issuer state</a>.</dd>
<dt><dfn>hash</dfn></dt>
@@ -476,7 +514,7 @@ <h2>Canonicalization State</h2>
<div class="ednote">
Mapping all <a>blank nodes</a> to use this
identifier spec means that an <a>RDF dataset</a> composed of two
different <a>RDF graphs</a> will use different
different <a>RDF graphs</a> will issue different
identifiers than that for the graphs taken independently. This may
happen anyway, due to <a
href="https://en.wikipedia.org/wiki/Automorphism">automorphisms</a>,
@@ -492,11 +530,10 @@ <h2>Canonicalization State</h2>
<section id="bn-issuer-state">
<h2>Blank Node Identifier Issuer State</h2>

<p>During the canonicalization algorithm, it is sometimes necessary to
issue new identifiers to <a>blank nodes</a>. The
<a href="#issue-identifier">Issue Identifier algorithm</a> uses an
<a>identifier issuer</a> to accomplish this task. The information
an <a>identifier issuer</a> needs to keep track of is described
<p>The canonicalization algorithm issues identifiers to <a>blank nodes</a>.
The <a href="#issue-identifier">Issue Identifier algorithm</a> uses an
<a>identifier issuer</a> to accomplish this task.
The information an <a>identifier issuer</a> needs to keep track of is described
below.</p>

<dl>
@@ -514,10 +551,10 @@ <h2>Blank Node Identifier Issuer State</h2>
create a <a>blank node identifier</a>. It is initialized to
<code>0</code>.</dd>
<dt><dfn>issued identifiers map</dfn></dt>
<dd>An <a>ordered map</a> that relates existing identifiers to issued identifiers,
<dd>An <a>ordered map</a> that relates <a>blank node identifiers</a> to issued identifiers,
to prevent issuance of more than one new identifier per existing identifier,
and to allow <a>blank nodes</a> to
be reassigned identifiers some time after issuance.</dd>
be assigned identifiers some time after issuance.</dd>
</dl>
</section>
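The issuer state described above can be illustrated with a minimal Python sketch. This is not normative text: the class name `IdentifierIssuer`, the method name `issue`, and the default prefix `c14n` are assumptions made for illustration. The only details taken from the text are a counter starting at `0` and an ordered map from existing identifiers to issued ones that prevents issuing more than one new identifier per existing identifier.

```python
class IdentifierIssuer:
    """Illustrative issuer state: a prefix, a counter, and an ordered map."""

    def __init__(self, prefix="c14n"):
        # The prefix value is an assumption for this sketch.
        self.prefix = prefix
        # The identifier counter is initialized to 0.
        self.counter = 0
        # Python dicts preserve insertion order, so a plain dict
        # can stand in for the ordered "issued identifiers map".
        self.issued = {}

    def issue(self, existing):
        """Return the identifier issued for `existing`, issuing one if needed."""
        if existing in self.issued:
            # Never issue a second identifier for the same existing identifier.
            return self.issued[existing]
        new_id = f"{self.prefix}{self.counter}"
        self.issued[existing] = new_id
        self.counter += 1
        return new_id


issuer = IdentifierIssuer()
first = issuer.issue("e0")
second = issuer.issue("e1")
repeat = issuer.issue("e0")  # same result as `first`, counter unchanged
```

Keeping the map insertion-ordered means the order of issuance can be replayed later, which is what allows identifiers to be assigned to blank nodes some time after issuance.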

@@ -561,7 +598,12 @@ <h3>Overview</h3>
<ol>
<li id="ca-hl.1"><strong>Initialization</strong>.
Initialize the state needed for the rest of the algorithm
using <a href="#canon-state" class="sectionRef"></a>.</li>
using <a href="#canon-state" class="sectionRef"></a>.
Also initialize the <a>normalized dataset</a> using the <a>input dataset</a>
(which remains immutable)
and the <a>input blank node identifier map</a>
(retaining blank node identifiers from the input if possible, otherwise assigning them arbitrarily);
the <a>canonical issuer</a> is added upon completion of the algorithm.</li>
<li id="ca-hl.2"><strong>Compute first degree hashes</strong>.
Compute the first degree hash for each blank node in the dataset using <a href="#hash-1d-quads" class="sectionRef"></a>.</li>
<li id="ca-hl.3"><strong>Canonically label unique nodes</strong>.
@@ -578,7 +620,9 @@ <h3>Overview</h3>
If more than one node produces the same N-degree hash,
the order in which these nodes receive a canonical identifier does not matter.</li>
<li id="ca-hl.6"><strong>Finish</strong>.
Return the <a>serialized canonical form</a> of the <a>normalized dataset</a>.</li>
Return the <a>serialized canonical form</a> of the <a>normalized dataset</a>.
Alternatively, return the <a>normalized dataset</a> containing
the <a>input blank node identifier map</a> and <a>canonical issuer</a>.</li>
</ol>
</section>
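The initialization in the first step of the overview can be sketched in Python as follows. This is a hedged illustration only: `BlankNode`, `build_input_blank_node_identifier_map`, the `parsed_labels` parameter, and the `b0`, `b1`, ... naming scheme for arbitrary identifiers are all hypothetical. Quads are modeled as tuples whose terms are either strings or opaque `BlankNode` objects.

```python
from itertools import count


class BlankNode:
    """Opaque stand-in for an abstract blank node (hypothetical type)."""


def build_input_blank_node_identifier_map(quads, parsed_labels=None):
    """Map each blank node in `quads` to an identifier.

    Labels retained from a parsed N-Quads document (`parsed_labels`) are
    reused; any other blank node receives an arbitrary, unused label.
    """
    id_map = dict(parsed_labels or {})
    used = set(id_map.values())
    fresh = (f"b{n}" for n in count())  # arbitrary naming scheme (assumption)
    for quad in quads:
        for term in quad:
            if isinstance(term, BlankNode) and term not in id_map:
                # Skip candidate labels that the input already uses.
                label = next(candidate for candidate in fresh
                             if candidate not in used)
                id_map[term] = label
                used.add(label)
    return id_map


a, b = BlankNode(), BlankNode()
quads = ((a, "http://example.org/p", b, None),)  # tuple: the input stays immutable

# Case 1: the N-Quads parser retained the label "b0" for node `a`.
retained = build_input_blank_node_identifier_map(quads, {a: "b0"})
# Case 2: no labels are known, so all identifiers are arbitrary.
arbitrary = build_input_blank_node_identifier_map(quads)
```

The input dataset is never rewritten in this sketch; only the map changes, which mirrors the immutability described in the step above.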

@@ -690,8 +734,8 @@ <h3>Examples</h3>
as there are no remaining blank nodes without canonical identifiers.</p>

<p><a href="#ca.6">Step 6</a> generates
the normalized dataset by replacing blank node identifiers in the original
input with their canonical identifiers:</p>
the normalized dataset by mapping blank node identifiers in the input dataset
to canonical identifiers:</p>

<pre id="ex-ca-unique-normalized-dataset" data-transform="updateExample">
<!--
@@ -901,9 +945,12 @@ <h3>Examples</h3>
</tbody>
</table>

<p><a href="#ca.6">Step 6</a> generates
the normalized dataset by replacing blank node identifiers in the original
input with their canonical identifiers:</p>
<p><a href="#ca.6">Step 6</a> updates
the <a>normalized dataset</a>
with the <a>canonical issuer</a>,
containing an <a>issued identifiers map</a>
mapping blank node identifiers from the input dataset
to their canonical identifiers:</p>

<pre id="ex-ca-normalized-shared-dataset" data-transform="updateExample">
<!--
@@ -922,12 +969,24 @@ <h3>Algorithm</h3>

<ol id="ca">
<li id="ca.1">Create the <a>canonicalization state</a>.
If the <a>input dataset</a> is an N-Quads document,
parse that document into a dataset in the <a>normalized dataset</a>,
retaining any blank node identifiers used within that document
in the <a>input blank node identifier map</a>;
otherwise arbitrary identifiers are assigned for each
blank node.
<details>
<summary>Explanation</summary>
<p>This has the effect of initializing the
<a>blank node to quads map</a>,
and the <a>hash to blank nodes map</a>,
as well as instantiating a new <a>canonical issuer</a>.</p>
<p>After this algorithm completes,
the <a>input blank node identifier map</a> state
and <a>canonical issuer</a> may be used to
correlate blank nodes used in the
<a>input dataset</a> with both their original identifiers,
and associated canonical identifiers.</p>
</details>
</li>
<li id="ca.2">For every <a>quad</a> <var>Q</var> in <a>input dataset</a>:
@@ -937,12 +996,15 @@ <h3>Algorithm</h3>
[= map/entry | map entry =] for the
<a>blank node identifier</a> <var>identifier</var>
in the <a>blank node to quads map</a>,
creating a new entry if necessary.
creating a new entry if necessary,
using the identifier for the blank node found in the
<a>input blank node identifier map</a>.
<details>
<summary>Explanation</summary>
<p>This establishes the <a>blank node to quads map</a>,
relating each <a>blank node</a> with the set of <a>quads</a>
of which it is a component.</p>
of which it is a component,
via the map from each blank node in the input dataset to its assigned identifier.</p>
<p class="note">
<a data-cite="RDF11-CONCEPTS#dfn-literal">Literal</a> components of
<a>quads</a> are not subject to any normalization.
@@ -1224,21 +1286,15 @@ <h3>Algorithm</h3>
</li>
</ol>
</li>
<li id="ca.6">For each <a>quad</a>, <var>q</var>, in <a>input dataset</a>:
<li id="ca.6">Add the <a>canonical issuer</a> to the
<a>normalized dataset</a>.
<details>
<summary>Explanation</summary>
<p>This step populates the <a>normalized dataset</a> with quads
substituting the original blank node identifiers,
with the newly established canonical blank node identifiers.</p>
<p>This step adds the <a>canonical issuer</a> to the
<a>normalized dataset</a>. The [= map/key | keys =] in the
<a>issued identifiers map</a> of the <a>canonical issuer</a> are the identifiers recorded in the [= map/entry | map entries =] of the
<a>input blank node identifier map</a>.</p>
</details>
<ol>
<li id="ca.6.1">Create a copy, <var>quad copy</var>, of <var>q</var> and replace any
existing <a>blank node identifier</a> <var>n</var> using the
canonical identifiers previously issued
by <a>canonical issuer</a>.</li>
<li id="ca.6.2">Add <var>quad copy</var> to the
<a>normalized dataset</a>.</li>
</ol>
<details>
<summary>Logging</summary>
<p>Log the state of the <a>canonical issuer</a> at the completion of the algorithm.</p>
@@ -1255,7 +1311,33 @@ <h3>Algorithm</h3>
</details>
</li>
<li id="ca.7">Return the <a>serialized canonical form</a>
of the <a>normalized dataset</a>.</li>
of the <a>normalized dataset</a>.
Upon request, alternatively (or additionally) return the
<a>normalized dataset</a> itself, which includes the
<a>input blank node identifier map</a>, and
<a>canonical issuer</a>.
<p class="note">Technically speaking, one implementation
might return a <a>normalized dataset</a> that maps
particular blank nodes to different identifiers than another
implementation; however, this only occurs when there are
isomorphisms in the dataset such that a canonically serialized
expression of the dataset would appear the same from either
implementation.</p>
<details>
<summary>Explanation</summary>
<p>The <a>serialized canonical form</a> is an N-Quads
document where the blank node identifiers are taken
from the canonical identifiers associated with each blank node.</p>
<p>The <a>normalized dataset</a> is composed of the original
<a>input dataset</a>, the <a>input blank node identifier map</a>,
containing identifiers for each blank node in the <a>input dataset</a>,
and the <a>canonical issuer</a>,
containing an <a>issued identifiers map</a>
mapping the identifiers in the <a>input blank node identifier map</a>
to their canonical identifiers.
</p>
</details>
</li>
</ol>
</section>
</section>
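The correlation made possible by returning the normalized dataset can be sketched in Python. This is an illustration under stated assumptions, not the algorithm itself: `BlankNode`, `correlate`, and `canonical_label` are hypothetical names, the two maps are plain dicts, and the sample identifiers (`e0`, `c14n0`, and so on) are made up for the example.

```python
class BlankNode:
    """Opaque stand-in for an abstract blank node (hypothetical type)."""


def correlate(input_map, issued_identifiers):
    """Compose the two maps: blank node -> (original id, canonical id)."""
    return {node: (original, issued_identifiers[original])
            for node, original in input_map.items()}


def canonical_label(term, input_map, issued_identifiers):
    """Label a term for serialization; blank nodes get canonical identifiers."""
    if isinstance(term, BlankNode):
        return "_:" + issued_identifiers[input_map[term]]
    return term  # IRIs and literals pass through unchanged


x, y = BlankNode(), BlankNode()
input_map = {x: "e0", y: "e1"}              # input blank node identifier map
issued = {"e1": "c14n0", "e0": "c14n1"}     # issued identifiers map of the canonical issuer

pairs = correlate(input_map, issued)
label_y = canonical_label(y, input_map, issued)
label_iri = canonical_label("<http://example.org/p>", input_map, issued)
```

Because the input dataset itself is left untouched, a caller can look up either the original or the canonical identifier for any blank node after the algorithm completes.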
@@ -2407,7 +2489,7 @@ <h3>Algorithm</h3>
</li>
<li id="hndq.5.4.4.2.2">Use the
<a href="#issue-identifier">Issue Identifier algorithm</a>,
passing <var>issuer copy</var> and <var>related</var>, and
passing <var>issuer copy</var> and the <var>related</var>, and
append the string <code>_:</code>, followed by the result, to <var>path</var>.</li>
</ol>
</li>
