Adds the "input blank node identifier map" #100

Merged: 12 commits, May 24, 2023
spec/index.html: 150 changes (112 additions, 38 deletions)
@@ -240,7 +240,25 @@ <h2>Uses of Dataset Canonicalization</h2>
RDF store or serialization is arbitrary,
and typically not relatable to the context within which it is used.</p>

<p>This specification defines an algorithm for creating stable <a>blank node identifiers</a> repeatably for different serializations possibly using individualized <a>blank node identifiers</a> of the same RDF graph (dataset) by grounding each <a>blank node</a> through the nodes to which it is connected, essentially creating <em>Skolem <a>blank node identifiers</a></em>. As a result, a graph signature can be obtained by hashing a canonical serialization of the resulting <a>normalized dataset</a>, allowing for the isomorphism and digital signing use cases. As blank node identifiers can be stable even with other changes to a graph (dataset), in some cases it is possible to compute the difference between two graphs (datasets), for example if changes are made only to ground triples, or if new blank nodes are introduced which do not create an automorphic confusion with other existing blank nodes. If any information which would change the generated blank node identifier, a resulting diff might indicate a greater set of changes than actually exists.</p>
<p>This specification defines an algorithm for creating stable
<a>blank node identifiers</a> repeatably for different serializations
possibly using individualized <a>blank node identifiers</a>
of the same RDF graph (dataset) by grounding each <a>blank node</a>
through the nodes to which it is connected.
As a result, a graph signature can be obtained by hashing a canonical serialization
of the resulting <a>normalized dataset</a>,
allowing for the isomorphism and digital signing use cases.
As blank node identifiers can be stable even with other changes to a graph (dataset),
in some cases it is possible to compute the difference between two graphs (datasets),
for example if changes are made only to ground triples,
or if new blank nodes are introduced which do not create an automorphic confusion
with other existing blank nodes.
If any information changes in a way that would alter the generated blank node identifiers,
a resulting diff might indicate a greater set of changes than actually exists.
Additionally, if the starting dataset is an N-Quads document,
it may be possible to correlate the original blank node identifiers
used within that N-Quads document with those issued in the
<a>normalized dataset</a>.</p>
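
As a non-normative illustration of the digital-signing use case above, here is a minimal TypeScript sketch. The canonicalize function is a hypothetical stand-in for the algorithm defined by this specification, and SHA-256 is an arbitrary choice of hash.

// Sketch only: `canonicalize` is a hypothetical stand-in for the
// canonicalization algorithm defined by this specification.
import { createHash } from "node:crypto";

declare function canonicalize(nquads: string): string; // returns the serialized canonical form

function graphSignature(nquads: string): string {
  // Isomorphic datasets share one canonical serialization,
  // so they produce the same signature.
  const canonical = canonicalize(nquads);
  return createHash("sha256").update(canonical, "utf8").digest("hex");
}
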

<div class="ednote">
<p>Add descriptions for relevant historical discussions and prior art:</p>
@@ -295,15 +313,35 @@ <h3>Terms defined by this specification</h3>
<dt><dfn data-lt="input dataset|input datasets">input dataset</dfn></dt>
<dd>The abstract <a>RDF dataset</a> that is provided as input to
the algorithm.</dd>
<dt><dfn>input blank node identifier map</dfn></dt>
<dd>Records any blank node identifiers already assigned to the
<a>input dataset</a>.
If the <a>input dataset</a> is provided as an N-Quads document,
the <a>map</a> relates blank nodes in the abstract <a>input dataset</a>
to the blank node identifiers used within the N-Quads document;
otherwise, identifiers are assigned arbitrarily for
each blank node in the input dataset not previously identified.
<div class="note">Implementations or environments might deal with blank
node identifiers more directly; for example, some implementations might
retain blank node identifiers in the parsed or abstract dataset. Implementations
are expected to reuse these to enable usable mappings between input blank node
identifiers and output blank node identifiers outside of the algorithm.</div>
</dd>
<dt><dfn>normalized dataset</dfn></dt>
<dd>A <a>normalized dataset</a> is the combination of an <a>RDF dataset</a>
and a <a>map</a> where [=map/keys=]
are <a>blank nodes</a> from the dataset
and [=map/values=] are the associated canonical <a>blank node identifiers</a>.
<dd>A <a>normalized dataset</a> is the combination of the following:
<ul>
<li>an <a>RDF dataset</a> —
the <a>input dataset</a>,</li>
<li>the <a>input blank node identifier map</a> —
mapping <a>blank nodes</a> in the input dataset to <a>blank node identifiers</a>, and</li>
<li>the <a>canonical issuer</a> —
mapping identifiers in the input dataset to canonical identifiers</li>
</ul>
A concrete serialization of a <a>normalized dataset</a> MUST label
all <a>blank nodes</a> using these stable <a>blank node identifiers</a>.</dd>
all <a>blank nodes</a> using the canonical <a>blank node identifiers</a>.
</dd>
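
A non-normative sketch of the three components just listed, with the input blank node identifier map and the canonical issuer's issued identifiers shown as plain maps; all of these type names are illustrative, not part of the specification.

// Illustrative shapes only; none of these names are normative.
interface BlankNode { termType: "BlankNode"; }
type Quad = object; // stand-in for an RDF quad

// Input blank node identifier map: blank nodes in the abstract input
// dataset -> identifiers already assigned (e.g., labels from N-Quads).
type InputBlankNodeIdentifierMap = Map<BlankNode, string>;

// Canonical issuer's issued identifiers map:
// input identifiers -> canonical identifiers, e.g., "e0" -> "c14n0".
type CanonicalIssuerMap = Map<string, string>;

interface NormalizedDataset {
  dataset: Quad[];                        // the input dataset (immutable)
  inputIds: InputBlankNodeIdentifierMap;  // input blank node identifier map
  canonicalIds: CanonicalIssuerMap;       // canonical issuer
}
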
<dt><dfn>identifier issuer</dfn></dt>
<dd>An identifier issuer is used to issue new <a>blank node identifier</a>. It
<dd>An identifier issuer is used to issue new <a>blank node identifiers</a>. It
maintains a
<a href="#bn-issuer-state">blank node identifier issuer state</a>.</dd>
<dt><dfn>hash</dfn></dt>
@@ -476,7 +514,7 @@ <h2>Canonicalization State</h2>
<div class="ednote">
Mapping all <a>blank nodes</a> to use this
identifier spec means that an <a>RDF dataset</a> composed of two
different <a>RDF graphs</a> will use different
different <a>RDF graphs</a> will issue different
identifiers than that for the graphs taken independently. This may
happen anyway, due to <a
href="https://en.wikipedia.org/wiki/Automorphism">automorphisms</a>,
@@ -492,11 +530,10 @@ <h2>Canonicalization State</h2>
<section id="bn-issuer-state">
<h2>Blank Node Identifier Issuer State</h2>

<p>During the canonicalization algorithm, it is sometimes necessary to
issue new identifiers to <a>blank nodes</a>. The
<a href="#issue-identifier">Issue Identifier algorithm</a> uses an
<a>identifier issuer</a> to accomplish this task. The information
an <a>identifier issuer</a> needs to keep track of is described
<p>The canonicalization algorithm issues identifiers to <a>blank nodes</a>.
The <a href="#issue-identifier">Issue Identifier algorithm</a> uses an
<a>identifier issuer</a> to accomplish this task.
The information an <a>identifier issuer</a> needs to keep track of is described
below.</p>

<dl>
@@ -514,10 +551,10 @@ <h2>Blank Node Identifier Issuer State</h2>
create a <a>blank node identifier</a>. It is initialized to
<code>0</code>.</dd>
<dt><dfn>issued identifiers map</dfn></dt>
<dd>An <a>ordered map</a> that relates existing identifiers to issued identifiers,
<dd>An <a>ordered map</a> that relates <a>blank node identifiers</a> to issued identifiers,
to prevent issuance of more than one new identifier per existing identifier,
and to allow <a>blank nodes</a> to
be reassigned identifiers some time after issuance.</dd>
be assigned identifiers some time after issuance.</dd>
</dl>
</section>
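
A minimal, non-normative sketch of an identifier issuer holding the state described above; the class and method names are illustrative, not mandated by the specification.

// Sketch of an identifier issuer (illustrative, non-normative).
class IdentifierIssuer {
  private counter = 0;                          // identifier counter
  readonly issued = new Map<string, string>();  // issued identifiers map

  constructor(private prefix: string) {}        // identifier prefix, e.g. "c14n"

  // Issue Identifier algorithm (sketch): reuse any identifier already
  // issued for `existing`; otherwise mint, record, and return a new one.
  issue(existing: string): string {
    let id = this.issued.get(existing);
    if (id === undefined) {
      id = this.prefix + this.counter++;
      this.issued.set(existing, id);
    }
    return id;
  }

  // Copies support speculative issuance of temporary identifiers.
  clone(): IdentifierIssuer {
    const copy = new IdentifierIssuer(this.prefix);
    copy.counter = this.counter;
    for (const [k, v] of this.issued) copy.issued.set(k, v);
    return copy;
  }
}
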

@@ -561,7 +598,12 @@ <h3>Overview</h3>
<ol>
<li id="ca-hl.1"><strong>Initialization</strong>.
Initialize the state needed for the rest of the algorithm
using <a href="#canon-state" class="sectionRef"></a>.</li>
using <a href="#canon-state" class="sectionRef"></a>.
Also initialize the <a>normalized dataset</a> using the <a>input dataset</a>
(which remains immutable) and the <a>input blank node identifier map</a>
(retaining blank node identifiers from the input, or otherwise assigning them arbitrarily);
the <a>canonical issuer</a> is added upon completion of the algorithm.</li>
<li id="ca-hl.2"><strong>Compute first degree hashes</strong>.
Compute the first degree hash for each blank node in the dataset using <a href="#hash-1d-quads" class="sectionRef"></a>.</li>
<li id="ca-hl.3"><strong>Canonically label unique nodes</strong>.
@@ -578,7 +620,9 @@ <h3>Overview</h3>
If more than one node produces the same N-degree hash,
the order in which these nodes receive a canonical identifier does not matter.</li>
<li id="ca-hl.6"><strong>Finish</strong>.
Return the <a>serialized canonical form</a> of the <a>normalized dataset</a>.</li>
Return the <a>serialized canonical form</a> of the <a>normalized dataset</a>.
Alternatively, return the <a>normalized dataset</a> containing
the <a>input blank node identifier map</a> and <a>canonical issuer</a>
(a sketch of this flow follows the overview).</li>
</ol>
</section>
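
The skeleton below restates the overview as a non-normative TypeScript sketch; every declared helper is a hypothetical stand-in for the state and sub-algorithms referenced above.

// Skeleton of the high-level flow; all helpers are hypothetical stand-ins.
type Quad = object;
interface CanonState {}  // canonicalization state plus the normalized dataset parts

declare function initializeState(input: Quad[], inputIds: Map<object, string>): CanonState; // step 1
declare function blankNodesOf(state: CanonState): string[];
declare function hashFirstDegreeQuads(state: CanonState, bn: string): void;                 // step 2
declare function labelUniqueNodes(state: CanonState): void;                                 // step 3
declare function sharedHashGroups(state: CanonState): string[][];                           // step 4
declare function assignNDegreeCanonicalIds(state: CanonState, group: string[]): void;      // step 5
declare function serializeCanonicalForm(state: CanonState): string;                         // step 6

function canonicalize(input: Quad[], inputIds: Map<object, string>): string {
  const state = initializeState(input, inputIds);
  for (const bn of blankNodesOf(state)) hashFirstDegreeQuads(state, bn);
  labelUniqueNodes(state);
  for (const group of sharedHashGroups(state)) assignNDegreeCanonicalIds(state, group);
  return serializeCanonicalForm(state); // or also return the normalized dataset parts
}
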

Expand Down Expand Up @@ -690,8 +734,8 @@ <h3>Examples</h3>
as there are no remaining blank nodes without canonical identifiers.</p>

<p><a href="#ca.6">Step 6</a> generates
the normalized dataset by replacing blank node identifiers in the original
input with their canonical identifiers:</p>
the normalized dataset by mapping blank node identifiers in the input dataset
to canonical identifiers:</p>

<pre id="ex-ca-unique-normalized-dataset" data-transform="updateExample">
<!--
@@ -901,9 +945,12 @@ <h3>Examples</h3>
</tbody>
</table>

<p><a href="#ca.6">Step 6</a> generates
the normalized dataset by replacing blank node identifiers in the original
input with their canonical identifiers:</p>
<p><a href="#ca.6">Step 6</a> updates
the <a>normalized dataset</a>
with the <a>canonical issuer</a>,
containing an <a>issued identifiers map</a>
mapping blank node identifiers from the input dataset
to their canonical identifiers:</p>

<pre id="ex-ca-normalized-shared-dataset" data-transform="updateExample">
<!--
@@ -922,12 +969,24 @@ <h3>Algorithm</h3>

<ol id="ca">
<li id="ca.1">Create the <a>canonicalization state</a>.
If the <a>input dataset</a> is an N-Quads document,
parse that document into a dataset in the <a>normalized dataset</a>,
retaining any blank node identifiers used within that document
in the <a>input blank node identifier map</a>;
otherwise, arbitrary identifiers are assigned for each
blank node.
<details>
<summary>Explanation</summary>
<p>This has the effect of initializing the
<a>blank node to quads map</a>,
and the <a>hash to blank nodes map</a>,
as well as instantiating a new <a>canonical issuer</a>.</p>
<p>After this algorithm completes,
the <a>input blank node identifier map</a> state
and <a>canonical issuer</a> may be used to
correlate blank nodes used in the
<a>input dataset</a> with both their original identifiers
and their associated canonical identifiers.</p>
</details>
</li>
<li id="ca.2">For every <a>quad</a> <var>Q</var> in <a>input dataset</a>:
@@ -937,12 +996,15 @@ <h3>Algorithm</h3>
[= map/entry | map entry =] for the
<a>blank node identifier</a> <var>identifier</var>
in the <a>blank node to quads map</a>,
creating a new entry if necessary.
creating a new entry if necessary,
using the identifier for the blank node found in the
<a>input blank node identifier map</a>.
<details>
<summary>Explanation</summary>
<p>This establishes the <a>blank node to quads map</a>,
relating each <a>blank node</a> with the set of <a>quads</a>
of which it is a component.</p>
of which it is a component,
via the mapping of each blank node in the input dataset to its assigned identifier
(a sketch of steps 1 and 2 follows the note below).</p>
<p class="note">
<a data-cite="RDF11-CONCEPTS#dfn-literal">Literal</a> components of
<a>quads</a> are not subject to any normalization.
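
A non-normative sketch of steps 1 and 2: identifiers recorded in the input blank node identifier map key the blank node to quads map. The term and quad shapes are hypothetical, and the sketch assumes each blank node is represented by a single shared term object so that it can key a Map.

// Sketch of steps 1-2 (hypothetical shapes; assumes each blank node is
// one shared term object, so it can key a Map).
interface Term { termType: string; value: string; }
interface Quad { subject: Term; predicate: Term; object: Term; graph: Term; }

function buildBlankNodeToQuadsMap(
  dataset: Quad[],
  inputIds: Map<Term, string>  // input blank node identifier map
): Map<string, Quad[]> {
  let next = 0;
  const quadsByBn = new Map<string, Quad[]>();
  for (const q of dataset) {
    for (const term of [q.subject, q.object, q.graph]) {
      if (term.termType !== "BlankNode") continue;
      // Reuse the identifier recorded on parse; otherwise assign arbitrarily.
      let id = inputIds.get(term);
      if (id === undefined) {
        id = "b" + next++;
        inputIds.set(term, id);
      }
      const quads = quadsByBn.get(id) ?? [];
      quads.push(q);
      quadsByBn.set(id, quads);
    }
  }
  return quadsByBn;
}
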
@@ -1224,21 +1286,15 @@ <h3>Algorithm</h3>
</li>
</ol>
</li>
<li id="ca.6">For each <a>quad</a>, <var>q</var>, in <a>input dataset</a>:
<li id="ca.6">Add the <a>canonical issuer</a> to the
<a>normalized dataset</a>.
<details>
<summary>Explanation</summary>
<p>This step populates the <a>normalized dataset</a> with quads
substituting the original blank node identifiers,
with the newly established canonical blank node identifiers.</p>
<p>This step adds the <a>canonical issuer</a> to the
<a>normalized dataset</a>; the [= map/key | keys =] in the
<a>canonical issuer</a> will be [= map/entry | map entries =] of the
<a>input blank node identifier map</a>
(a sketch of this final assembly follows the algorithm).</p>
</details>
<ol>
<li id="ca.6.1">Create a copy, <var>quad copy</var>, of <var>q</var> and replace any
existing <a>blank node identifier</a> <var>n</var> using the
canonical identifiers previously issued
by <a>canonical issuer</a>.</li>
<li id="ca.6.2">Add <var>quad copy</var> to the
<a>normalized dataset</a>.</li>
</ol>
<details>
<summary>Logging</summary>
<p>Log the state of the <a>canonical issuer</a> at the completion of the algorithm.</p>
Expand All @@ -1255,7 +1311,25 @@ <h3>Algorithm</h3>
</details>
</li>
<li id="ca.7">Return the <a>serialized canonical form</a>
of the <a>normalized dataset</a>.</li>
of the <a>normalized dataset</a>.
Alternatively, return the <a>normalized dataset</a> itself,
which includes the <a>input blank node identifier map</a>,
and <a>canonical issuer</a>.
Contributor:
Suggested change
- Alternatively, return the <a>normalized dataset</a> itself,
- which includes the <a>input blank node identifier map</a>,
- and <a>canonical issuer</a>.
+ Optionally, the algorithm may also return the <a>normalized dataset</a> as an auxiliary output,
+ which includes the <a>input blank node identifier map</a>,
+ and <a>canonical issuer</a>.

How about saying that the output of the c14n (single and deterministic) is the serialized form, whereas the normalized dataset (possibly non-deterministic) can be obtained as an auxiliary output?

Member Author:

So, would changing "Optionally" to "As an auxiliary output" satisfy this? I don't really follow how this is non-deterministic, as it would seem that with a given input, the same normalized dataset would be produced.

@dlongley (Contributor), May 20, 2023:
I'd rather not be so prescriptive in saying that you have to return both -- I'd be happy with a both / either. We don't want to force implementations to do extra work they don't need to.

I also don't think we should say what implementations can do as a proxy for indicating that two different implementations might technically map one blank node to ID A and another implementation might map it to ID B. This is @yamdan's point, I believe -- but this only happens when there are isomorphisms that make this difference irrelevant. A serialized version of the dataset would look the same. We should just say this, not impose restrictions on implementations.

Perhaps that's what we say in a note: "Technically speaking, one implementation might map particular blank nodes to different identifiers than another implementation, however, this only occurs when there are isomorphisms in the dataset such that a serialized expression of the dataset would appear the same from either implementation."

And then we can say that algorithms may return both the canonically serialized dataset and the normalized dataset or either of these as requested by the invoker of the algorithm.

Member Author:

It might be some time before I can do this update, but it seems simple enough. I'm traveling for the next week, and internet access is spotty. Feel free to update and commit, as this is really just informative, now.

Contributor:

I've added a suggestion below.

@yamdan (Contributor), May 22, 2023:

@gkellogg,

> I don't really follow how this is non-deterministic, as it would seem that with a given input, the same normalized dataset would be produced.

Non-deterministic choice may occur in step 5.3 of the 4.4.3 Algorithm, where ties in the hash path list make it non-deterministic which result is chosen first from the list (see the debug log from my implementation).
Even the same implementation can output different canonical issuers depending on the runtime environment or the input blank node identifiers.

The only thing I would like to eliminate here is the possibility of misuse of the normalized dataset:
believing that it is a single, deterministic canonical result and connecting it to the hash or signature input.
I think we can prevent this by clearly stating that the serialized form is the output of the canonicalization and the normalized dataset is an auxiliary output.

As @dlongley mentioned, I think this only happens when there are automorphisms in the input dataset.

Contributor:

I will continue this "non-deterministic" topic in a new separate PR.

<details>
<summary>Explanation</summary>
<p>The <a>serialized canonical form</a> is an N-Quads
document where the blank node identifiers are taken
from the canonical identifiers associated with each blank node.</p>
<p>The <a>normalized dataset</a> is composed of the original
<a>input dataset</a>, the <a>input blank node identifier map</a>,
containing identifiers for each blank node in the <a>input dataset</a>,
and the <a>canonical issuer</a>,
containing an <a>issued identifiers map</a>
mapping the identifiers in the <a>input blank node identifier map</a>
to their canonical identifiers.
</p>
</details>
Contributor:

I think it would be better to note that the normalized dataset should not be interpreted as a single canonical representation because the algorithm can output different canonical issuers depending on the implementation or runtime environment. (#89 (comment))

For example, an input dataset

_:e0 <http://example.org/vocab#next> _:e1 .
_:e1 <http://example.org/vocab#next> _:e0 .

can be transformed into the normalized dataset with either one of the following canonical issuers, depending on the implementation:

  1. { "e0": "c14n0", "e1": "c14n1" }
  2. { "e0": "c14n1", "e1": "c14n0" }

Both canonical issuers result in the same single serialized form:

_:c14n0 <http://example.org/vocab#next> _:c14n1 .
_:c14n1 <http://example.org/vocab#next> _:c14n0 .

So, we can only say that the serialized form is a single canonical representation; the normalized dataset is possibly not.

Member Author:

It doesn't say that the normalized dataset is a single canonical representation; as you point out, the association of blank nodes to input identifiers could be different for two otherwise isomorphic datasets, and therefore the map from input identifier to canonical identifier would differ. Note that this is in a non-normative explanation detail. Is there some specific text you'd like to add or change?

</li>
</ol>
</section>
</section>
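
A non-normative sketch of the final assembly in steps 6 and 7, referenced in the explanation above: relabel quads with canonical identifiers to produce the serialized canonical form, and expose the normalized dataset as an auxiliary output. toNQuad is a hypothetical serializer.

// Sketch of steps 6-7 (hypothetical helpers, non-normative).
type Quad = object;
declare function toNQuad(q: Quad, label: (inputId: string) => string): string; // one "\n"-terminated N-Quads line

function finish(
  dataset: Quad[],
  inputIds: Map<object, string>,        // input blank node identifier map
  canonicalIssuer: Map<string, string>  // issued identifiers map: input id -> canonical id
) {
  // Serialized canonical form: quads labeled with canonical identifiers,
  // serialized as N-Quads and sorted in code point order.
  const serialized = dataset
    .map(q => toNQuad(q, id => canonicalIssuer.get(id) ?? id))
    .sort()
    .join("");
  // The normalized dataset as an auxiliary output (see the discussion above).
  const normalized = { dataset, inputIds, canonicalIssuer };
  return { serialized, normalized };
}
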
@@ -2407,7 +2481,7 @@ <h3>Algorithm</h3>
</li>
<li id="hndq.5.4.4.2.2">Use the
<a href="#issue-identifier">Issue Identifier algorithm</a>,
passing <var>issuer copy</var> and <var>related</var>, and
passing <var>issuer copy</var> and the <var>related</var>, and
append the string <code>_:</code>, followed by the result, to <var>path</var>.</li>
</ol>
</li>
Expand Down