Adds the "input blank node identifier map" #100

gkellogg · 2023-05-15T23:07:46Z

and a way to initialize it from an input N-Quads document, otherwise uses arbitrary identifiers for each blank node in the input dataset.

Clears up some ambiguity about original blank node identifiers.

Fixes #89.

gkellogg · 2023-05-15T23:11:08Z

We've been pretty fuzzy about identifiers in the input dataset. This is a WIP, but it should make things more concrete. If identifiers were not previously assigned (i.e., by parsing an N-Quads document), the algorithm assigns arbitrary identifiers to the blank nodes in the input. This does not create any interoperability issues, so there is no need to use any particular algorithm for assigning input blank node identifiers. The input blank node identifier map can be extracted after running the algorithm and used along with the normalized dataset to correlate input identifiers with the resulting canonical identifiers.
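As a rough illustration of the behaviour described above, here is a minimal Python sketch. The function names, the `b0`, `b1`, ... label scheme, and the use of plain dictionaries are all assumptions made for this example; the spec does not prescribe any of them.

```python
from typing import Dict, Hashable, Iterable, Optional


def build_input_bnode_map(
    blank_nodes: Iterable[Hashable],
    parsed_labels: Optional[Dict[Hashable, str]] = None,
) -> Dict[Hashable, str]:
    """Map each blank node to an input identifier.

    Labels obtained while parsing (e.g. from an N-Quads document) are kept.
    Every other blank node receives an arbitrary identifier that is not
    already in use. The naming scheme is hypothetical.
    """
    id_map: Dict[Hashable, str] = dict(parsed_labels or {})
    used = set(id_map.values())
    counter = 0
    for node in blank_nodes:
        if node in id_map:
            continue  # identifier was assigned by the parser
        while f"b{counter}" in used:
            counter += 1
        id_map[node] = f"b{counter}"
        used.add(f"b{counter}")
    return id_map


def correlate(
    id_map: Dict[Hashable, str], canonical_issuer: Dict[str, str]
) -> Dict[Hashable, str]:
    """Compose the two maps: blank node -> canonical identifier."""
    return {node: canonical_issuer[label] for node, label in id_map.items()}
```

Because the arbitrary identifiers are only used as keys into the canonical issuer, any collision-free scheme should work equally well here.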

iherman

(GitHub does not allow me to attach this comment to an old, unedited area.)

Shouldn't §4.4.3/7 make it explicit that the algorithm also returns (or optionally returns?) the canonical issuer, which will contain the mapping of the input bnode identifiers to the canonical ones?

spec/index.html

yamdan · 2023-05-17T08:11:34Z

Thanks @gkellogg , I think the input blank node identifier map introduced here is helpful in practice.
Some N-Quads parsers intentionally randomize blank node identifiers after parsing. For example, when a parser is used to load N-Quads into a datastore (e.g., quadstore and oxigraph), such randomization can be used to avoid blank node identifier collisions. While canonicalization on loading may be rare, my implementation currently uses such a parser, so I have to implement this map later...

dlongley · 2023-05-17T14:14:19Z

@iherman,

Shouldn't §4.4.3/7 make it explicit that the algorithm also returns (or optionally returns?) the canonical issuer, which will contain the mapping of the input bnode identifiers to the canonical ones?

Yes, I think it should be explicit that it may optionally return it. However, we can do that without modifying the step numbering, as @gkellogg mentioned, and I'd be happy with that.

Co-authored-by: Ivan Herman <ivan@ivan-herman.net> Co-authored-by: Dave Longley <dlongley@digitalbazaar.com> Co-authored-by: Dan Yamamoto <dan@iij.ad.jp>

…ier map to be used instead of the serialized N-Quads result.

gkellogg · 2023-05-17T22:47:09Z

Another thing to consider:

Update the definition of a normalized dataset to be the input dataset, the input blank node identifier map, and the canonical issuer. This is useful, as the input blank node identifier map maps blank nodes in the input dataset to associated blank node identifiers, and the canonical issuer maps those identifiers to canonical identifiers. This allows an implementation to either take the N-Quads result, or the normalized dataset, without needing to keep this as two separate things.
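To make the proposed three-part definition concrete, here is a hedged Python sketch. The class and method names are hypothetical, quads are assumed to be tuples of already-serialized terms, and the serialization step is simplified (it does not implement canonical N-Quads escaping).

```python
from dataclasses import dataclass
from typing import Dict, Tuple


class BlankNode:
    """Opaque blank node; it carries no identifier of its own."""


@dataclass(frozen=True)
class NormalizedDataset:
    quads: Tuple[tuple, ...]               # the input dataset, left unchanged
    input_bnode_map: Dict[BlankNode, str]  # blank node -> input identifier
    canonical_issuer: Dict[str, str]       # input identifier -> canonical identifier

    def canonical_id(self, node: BlankNode) -> str:
        # Two lookups: blank node -> input identifier -> canonical identifier.
        return self.canonical_issuer[self.input_bnode_map[node]]

    def to_nquads(self) -> str:
        # Relabel blank nodes, emit one line per quad, sort in code point order.
        lines = []
        for quad in self.quads:
            terms = [
                f"_:{self.canonical_id(t)}" if isinstance(t, BlankNode) else t
                for t in quad
            ]
            lines.append(" ".join(terms) + " .\n")
        return "".join(sorted(lines))
```

With this shape, a caller can take either the serialized result from `to_nquads()` or the object itself, which is the choice the comment above describes.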

spec/index.html

TallTed · 2023-05-18T01:59:26Z

spec/index.html

          in the <a>input blank node identifier map</a>,
-          otherise arbitrary identifiers are assigned for each
+          otherwise arbitrary identifiers are assigned for each


Suggested change

-          in the <a>input blank node identifier map</a>,
-          otherwise arbitrary identifiers are assigned for each
+          in the <a>input blank node identifier map</a>;
+          otherwise, arbitrary identifiers are assigned for each

Co-authored-by: Ted Thibodeau Jr <tthibodeau@openlinksw.com>

spec/index.html

Co-authored-by: Dave Longley <dlongley@digitalbazaar.com>

TallTed · 2023-05-18T22:39:43Z

spec/index.html

@@ -969,8 +969,9 @@ <h3>Algorithm</h3>
              as well as instantiating a new <a>canonical issuer</a>.</p>
            <p>After this algorithm completes,
              the <a>input blank node identifier map</a> state
-              may be used to correlate blank node identifiers
-              used in the <a>input dataset</a> to those used
+              and / or <a>canonical issuer</a> may be used to


Suggested change

-              and / or <a>canonical issuer</a> may be used to
+              and/or <a>canonical issuer</a> may be used to

That text was changed, as it now includes both these things.

…lized dataset. This makes the RDF dataset portion of this immutable, initialized from the input dataset. Updates some steps to clarify that updates are for map entries, and not adding quads, which simplifies step 6.

gkellogg · 2023-05-18T23:02:53Z

Last commit adds the input blank node identifier map and canonical issuer to the normalized dataset. This makes the RDF dataset portion of this immutable, initialized from the input dataset.

Updates some steps to clarify that updates are for map entries, and not adding quads, which simplifies step 6.

spec/index.html

yamdan · 2023-05-19T08:42:22Z

spec/index.html

+              mapping the identifiers in the <a>input blank node identifier map</a>
+              to their canonical identifiers.
+            </p>
+          </details>


I think it would be better to note that the normalized dataset should not be interpreted as a single canonical representation because the algorithm can output different canonical issuers depending on the implementation or runtime environment. (#89 (comment))

For example, an input dataset

_:e0 <http://example.org/vocab#next> _:e1 .
_:e1 <http://example.org/vocab#next> _:e0 .

can be transformed into the normalized dataset with either one of the following canonical issuers, depending on the implementation:

{ "e0": "c14n0", "e1": "c14n1" }

{ "e0": "c14n1", "e1": "c14n0" }

Both canonical issuers result in the same single serialized form:

_:c14n0 <http://example.org/vocab#next> _:c14n1 .
_:c14n1 <http://example.org/vocab#next> _:c14n0 .

So, we can only say that the serialized form is a single canonical representation, but the normalized dataset is possibly not.
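The example above can be checked with a short Python sketch. The `relabel` helper is hypothetical and uses a deliberately simplistic regular expression for blank node labels; it is only meant for this two-quad dataset, not as a general N-Quads processor.

```python
import re

INPUT = (
    "_:e0 <http://example.org/vocab#next> _:e1 .\n"
    "_:e1 <http://example.org/vocab#next> _:e0 .\n"
)


def relabel(nquads: str, issuer: dict) -> str:
    """Replace each blank node label via `issuer`, then sort the lines."""
    def swap(match):
        return "_:" + issuer[match.group(1)]

    lines = [
        re.sub(r"_:([A-Za-z0-9]+)", swap, line) for line in nquads.splitlines()
    ]
    return "".join(line + "\n" for line in sorted(lines))


# The two canonical issuers from the example above.
issuer_a = {"e0": "c14n0", "e1": "c14n1"}
issuer_b = {"e0": "c14n1", "e1": "c14n0"}

serialized_a = relabel(INPUT, issuer_a)
serialized_b = relabel(INPUT, issuer_b)

# The issuers differ, but the sorted serializations are identical.
print(serialized_a == serialized_b)  # prints True
```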

It doesn't say that the normalized dataset is a single canonical representation; as you point out, the association of blank nodes to input identifiers could be different for two otherwise isomorphic datasets, and therefore the map from input identifier to canonical identifier would differ. Note that this is in a non-normative explanation detail. Is there some specific text you'd like to add or change?

Co-authored-by: Dan Yamamoto <dan@iij.ad.jp>

dlongley

Great, thanks! Approving now -- and I have no issue if there's some informative note added around the isomorphic datasets discussion between @yamdan and @gkellogg.

spec/index.html

Co-authored-by: Dave Longley <dlongley@digitalbazaar.com>

yamdan · 2023-05-20T02:20:42Z

spec/index.html

+          Alternatively, return the <a>normalized dataset</a> itself,
+          which includes the <a>input blank node identifier map</a>,
+          and <a>canonical issuer</a>.


Suggested change

-          Alternatively, return the <a>normalized dataset</a> itself,
-          which includes the <a>input blank node identifier map</a>,
-          and <a>canonical issuer</a>.
+          Optionally, the algorithm may also return the <a>normalized dataset</a> as an auxiliary output,
+          which includes the <a>input blank node identifier map</a>,
+          and <a>canonical issuer</a>.

How about saying that the output of the c14n (single and deterministic) is the serialized form, whereas the normalized dataset (possibly non-deterministic) can be obtained as an auxiliary output?

So, would changing "Optionally" to "As an auxiliary output" satisfy this? I don't really follow how this is non-deterministic, as it would seem that with a given input, the same normalized dataset would be produced.

I'd rather not be so prescriptive in saying that you have to return both -- I'd be happy with a both / either. We don't want to force implementations to do extra work they don't need to.

I also don't think we should say what implementations can do as a proxy for indicating that one implementation might technically map a blank node to ID A while another implementation might map it to ID B. This is @yamdan's point, I believe -- but this only happens when there are isomorphisms that make this difference irrelevant. A serialized version of the dataset would look the same. We should just say this, not impose restrictions on implementations.

Perhaps that's what we say in a note: "Technically speaking, one implementation might map particular blank nodes to different identifiers than another implementation; however, this only occurs when there are isomorphisms in the dataset such that a serialized expression of the dataset would appear the same from either implementation."

And then we can say that algorithms may return both the canonically serialized dataset and the normalized dataset or either of these as requested by the invoker of the algorithm.

It might be some time before I can do this update, but it seems simple enough. I'm traveling for the next week, and internet access is spotty. Feel free to update and commit, as this is really just informative, now.

I've added a suggestion below.

@gkellogg,

I don't really follow how this is non-deterministic, as it would seem that with a given input, the same normalized dataset would be produced.

A non-deterministic choice may occur in step 5.3 of the 4.4.3 Algorithm, where results in the hash path list can tie, so that which result is chosen first from the list is non-deterministic. (See the debug log from my implementation.)
Even the same implementation can output different canonical issuers depending on the runtime environment or the input blank node identifiers.

The only thing I would like to rule out here is misuse of the normalized dataset: believing that it is a single, deterministic canonical result and using it as the input to a hash or signature.
I think we can prevent this by clearly stating that the serialized form is the output of the canonicalization and the normalized dataset is an auxiliary output.
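A small Python sketch of the misuse being described, under stated assumptions: the `sha256_hex` helper and the JSON encoding of the issuer are hypothetical choices made for this illustration, and SHA-256 is used only as an example hash.

```python
import hashlib
import json


def sha256_hex(text: str) -> str:
    """Hex-encoded SHA-256 digest of a UTF-8 string."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


# The serialized form, which is the same for either canonical issuer.
serialized = (
    "_:c14n0 <http://example.org/vocab#next> _:c14n1 .\n"
    "_:c14n1 <http://example.org/vocab#next> _:c14n0 .\n"
)

# Two canonical issuers that different runs might produce for the same input.
issuer_a = {"e0": "c14n0", "e1": "c14n1"}
issuer_b = {"e0": "c14n1", "e1": "c14n0"}

# Safe: hash the serialized form.
stable_digest = sha256_hex(serialized)

# Unsafe: hashing anything derived from the issuer can differ between runs.
digest_a = sha256_hex(json.dumps(issuer_a, sort_keys=True))
digest_b = sha256_hex(json.dumps(issuer_b, sort_keys=True))
print(digest_a == digest_b)  # prints False
```

Only `stable_digest` would be suitable as hash or signature input in this sketch; the issuer-derived digests are shown purely to illustrate the divergence.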

As @dlongley mentioned, I think this only happens when there are automorphisms in the input dataset.

I will continue this "non-deterministic" topic in a new separate PR.

spec/index.html

Co-authored-by: Dave Longley <dlongley@digitalbazaar.com>

spec/index.html

Co-authored-by: Dave Longley <dlongley@digitalbazaar.com>

yamdan · 2023-05-24T14:22:57Z

We decided to merge this PR on the 2023-05-24 WG call. I will make a new PR for the "non-deterministic" canonical issuer topic.

Adds the "input blank node identifier map" and a way to initialize it…

4d65b09

… from an input N-Quads document, otherwise uses arbitrary identifiers for each blank node in the input dataset. Clears up some ambiguity about original blank node identifiers.

gkellogg requested a review from iherman May 15, 2023 23:11

Broken terms.

b76ebdb

iherman reviewed May 16, 2023

yamdan reviewed May 17, 2023

gkellogg and others added 2 commits May 17, 2023 15:36

Apply suggestions from code review

a9fa5e3

Co-authored-by: Ivan Herman <ivan@ivan-herman.net> Co-authored-by: Dave Longley <dlongley@digitalbazaar.com> Co-authored-by: Dan Yamamoto <dan@iij.ad.jp>

In step 7, allow the normalized data set and input blank node identif…

8b7a3d3

…ier map to be used instead of the serialized N-Quads result.

TallTed reviewed May 18, 2023

TallTed reviewed May 18, 2023

Apply suggestions from code review

0d96146

Co-authored-by: Ted Thibodeau Jr <tthibodeau@openlinksw.com>

dlongley reviewed May 18, 2023

gkellogg and others added 2 commits May 18, 2023 09:23

Apply suggestions from code review

10563be

Co-authored-by: Dave Longley <dlongley@digitalbazaar.com>

Grammar suggestion from Ted.

f957b13

TallTed reviewed May 18, 2023

gkellogg marked this pull request as ready for review May 18, 2023 23:02

yamdan reviewed May 19, 2023

yamdan reviewed May 19, 2023

Apply suggestions from code review

10f6c40

Co-authored-by: Dan Yamamoto <dan@iij.ad.jp>

dlongley approved these changes May 19, 2023

Update spec/index.html

7675b1f

Co-authored-by: Dave Longley <dlongley@digitalbazaar.com>

yamdan reviewed May 20, 2023

dlongley reviewed May 20, 2023

Update spec/index.html

61eecaf

Co-authored-by: Dave Longley <dlongley@digitalbazaar.com>

gkellogg requested a review from yamdan May 20, 2023 23:46

dlongley reviewed May 23, 2023

Update spec/index.html

a8e7f01

Co-authored-by: Dave Longley <dlongley@digitalbazaar.com>

yamdan merged commit 5903d52 into main May 24, 2023

yamdan deleted the ordered-input branch May 24, 2023 14:23
