Adds the "input blank node identifier map" #100

Merged
merged 12 commits into from
May 24, 2023

Conversation

Member

@gkellogg gkellogg commented May 15, 2023

and a way to initialize it from an input N-Quads document, otherwise uses arbitrary identifiers for each blank node in the input dataset.

Clears up some ambiguity about original blank node identifiers.

Fixes #89.


@gkellogg
Member Author

We've been pretty fuzzy about identifiers in the input dataset. This is a WIP, but it should make things more concrete. If identifiers were not previously assigned (i.e., by parsing an N-Quads document), the algorithm assigns arbitrary identifiers to the blank nodes in the input. This does not create any interoperability issues, so there is no need to use any particular algorithm for assigning input blank node identifiers. The input blank node identifier map can be extracted after running the algorithm and used along with the normalized dataset to correlate input identifiers with the resulting canonical identifiers.
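
To illustrate the idea, here is a minimal Python sketch, not text from the spec. The function names, the `b0`, `b1`, ... label scheme, and the use of plain dictionaries for the map and for the canonical issuer's issued identifiers are all assumptions made for illustration.

```python
from itertools import count


def build_input_bnode_map(blank_nodes, preassigned=None):
    """Map each blank node (any hashable object) to an input identifier.

    Identifiers that were already assigned, for example by an N-Quads
    parser, are kept. Every other blank node gets an arbitrary unused one.
    """
    id_map = dict(preassigned or {})
    used = set(id_map.values())
    counter = count()
    for node in blank_nodes:
        if node in id_map:
            continue
        candidate = f"b{next(counter)}"
        while candidate in used:  # skip labels that are already taken
            candidate = f"b{next(counter)}"
        id_map[node] = candidate
        used.add(candidate)
    return id_map


def correlate(id_map, issued):
    """Compose blank node -> input identifier -> canonical identifier."""
    return {node: issued[ident] for node, ident in id_map.items()}


# Hypothetical stand-ins for two blank nodes; only one came from a parser.
id_map = build_input_bnode_map(["nodeA", "nodeB"], preassigned={"nodeA": "e0"})
# Hypothetical contents of the canonical issuer after the algorithm has run.
issued = {"e0": "c14n0", "b0": "c14n1"}
print(correlate(id_map, issued))
```

Because the arbitrary labels never appear in the canonical output, any collision-free scheme would do in place of the counter used here.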

@gkellogg gkellogg requested a review from iherman May 15, 2023 23:11
Member

@iherman iherman left a comment

(GitHub does not allow me to attach this comment to an old, unedited area)

Shouldn't §4.4.3/7 make it explicit that the algorithm also returns (or optionally returns?) the canonical issuer, which will contain the mapping of the input bnode identifiers to the canonical ones?

@yamdan
Contributor

yamdan commented May 17, 2023

Thanks @gkellogg, I think the input blank node identifier map introduced here is helpful in practice.
Some N-Quads parsers intentionally randomize blank node identifiers after parsing. For example, when a parser is used to load N-Quads into a datastore (e.g., quadstore and oxigraph), such randomization can be used to avoid blank node identifier collisions. While canonicalization on loading may be rare, my implementation currently uses such a parser, so I have to implement this map later...

@dlongley
Contributor

@iherman,

Shouldn't §4.4.3/7 make it explicit that the algorithm also returns (or optionally returns?) the canonical issuer, which will contain the mapping of the input bnode identifiers to the canonical ones?

Yes, I think it should be explicit that it may optionally return it. However, we can do that without modifying the step numbering, as @gkellogg mentioned, and I'd be happy with that.

gkellogg and others added 2 commits May 17, 2023 15:36
Co-authored-by: Ivan Herman <ivan@ivan-herman.net>
Co-authored-by: Dave Longley <dlongley@digitalbazaar.com>
Co-authored-by: Dan Yamamoto <dan@iij.ad.jp>
…ier map to be used instead of the serialized N-Quads result.
@gkellogg
Member Author

Another thing to consider:

Update the definition of a normalized dataset to be the input dataset, the input blank node identifier map, and the canonical issuer. This is useful, as the input blank node identifier map maps blank nodes in the input dataset to associated blank node identifiers, and the canonical issuer maps those identifiers to canonical identifiers. This allows an implementation to either take the N-Quads result, or the normalized dataset, without needing to keep this as two separate things.
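
As a rough illustration of that proposal, here is a hypothetical Python sketch. The class name, the field names, and the tuple representation of quads are assumptions, not the spec's data model.

```python
from dataclasses import dataclass
from typing import Dict, FrozenSet, Tuple

# A quad as (subject, predicate, object, graph); this shape is an assumption.
Quad = Tuple[str, str, str, str]


@dataclass(frozen=True)
class NormalizedDataset:
    """Bundles the three pieces described above into one value."""

    input_dataset: FrozenSet[Quad]       # the RDF dataset portion, left unchanged
    input_bnode_id_map: Dict[str, str]   # blank node -> input identifier
    canonical_issuer: Dict[str, str]     # input identifier -> canonical identifier

    def canonical_id(self, blank_node: str) -> str:
        # Follow both maps: blank node -> input identifier -> canonical identifier.
        return self.canonical_issuer[self.input_bnode_id_map[blank_node]]


# "n0" and "n1" are hypothetical handles for two blank nodes.
nd = NormalizedDataset(
    input_dataset=frozenset({("_:e0", "<http://example.org/vocab#next>", "_:e1", "")}),
    input_bnode_id_map={"n0": "e0", "n1": "e1"},
    canonical_issuer={"e0": "c14n0", "e1": "c14n1"},
)
print(nd.canonical_id("n1"))
```

Keeping the two maps next to the dataset means a caller can look up a canonical identifier without re-parsing the serialized N-Quads.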

spec/index.html Outdated
Comment on lines 961 to 962
in the <a>input blank node identifier map</a>,
otherise arbitrary identifiers are assigned for each
otherwise arbitrary identifiers are assigned for each
Member

Suggested change
in the <a>input blank node identifier map</a>,
otherise arbitrary identifiers are assigned for each
otherwise arbitrary identifiers are assigned for each
in the <a>input blank node identifier map</a>;
otherwise, arbitrary identifiers are assigned for each

Co-authored-by: Ted Thibodeau Jr <tthibodeau@openlinksw.com>
gkellogg and others added 2 commits May 18, 2023 09:23
Co-authored-by: Dave Longley <dlongley@digitalbazaar.com>
spec/index.html Outdated
@@ -969,8 +969,9 @@ <h3>Algorithm</h3>
as well as instantiating a new <a>canonical issuer</a>.</p>
<p>After this algorithm completes,
the <a>input blank node identifier map</a> state
may be used to correlate blank node identifiers
used in the <a>input dataset</a> to those used
and / or <a>canonical issuer</a> may be used to
Member

Suggested change
and / or <a>canonical issuer</a> may be used to
and/or <a>canonical issuer</a> may be used to

Member Author

That text was changed, as it now includes both these things.

@gkellogg
Member Author

Last commit adds the input blank node identifier map and canonical issuer to the normalized dataset. This makes the RDF dataset portion of this immutable, initialized from the input dataset.

Updates some steps to clarify that updates are for map entries, and not adding quads, which simplifies step 6.

@gkellogg gkellogg marked this pull request as ready for review May 18, 2023 23:02
mapping the identifiers in the <a>input blank node identifier map</a>
to their canonical identifiers.
</p>
</details>
Contributor

I think it would be better to note that the normalized dataset should not be interpreted as a single canonical representation because the algorithm can output different canonical issuers depending on the implementation or runtime environment. (#89 (comment))

For example, an input dataset

_:e0 <http://example.org/vocab#next> _:e1 .
_:e1 <http://example.org/vocab#next> _:e0 .

can be transformed into the normalized dataset with either one of the following canonical issuers, depending on the implementation:

  1. { "e0": "c14n0", "e1": "c14n1" }
  2. { "e0": "c14n1", "e1": "c14n0" }

Both canonical issuers result in the same single serialized form:

_:c14n0 <http://example.org/vocab#next> _:c14n1 .
_:c14n1 <http://example.org/vocab#next> _:c14n0 .

So, we can only say that serialized form is a single canonical representation, but the normalized dataset is possibly not.
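
A small Python sketch can check this claim for the example above. The `relabel` helper and the tuple form of the quads are assumptions, and sorting the lines as Python strings is a simplification of the spec's serialization step.

```python
def relabel(quads, issuer):
    """Rewrite blank node labels via an input-id -> canonical-id map, then sort."""

    def term(t):
        # Only blank node labels ("_:" prefix) are rewritten.
        return "_:" + issuer[t[2:]] if t.startswith("_:") else t

    lines = [" ".join(term(t) for t in quad) + " ." for quad in quads]
    return "\n".join(sorted(lines)) + "\n"


NEXT = "<http://example.org/vocab#next>"
input_quads = [("_:e0", NEXT, "_:e1"), ("_:e1", NEXT, "_:e0")]

issuer_1 = {"e0": "c14n0", "e1": "c14n1"}
issuer_2 = {"e0": "c14n1", "e1": "c14n0"}

# The two maps differ, yet the serialized forms are identical.
assert issuer_1 != issuer_2
assert relabel(input_quads, issuer_1) == relabel(input_quads, issuer_2)
print(relabel(input_quads, issuer_1), end="")
```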

Member Author

It doesn't say that the normalized dataset is a single canonical representation; as you point out, the association of blank nodes to input identifiers could be different for two otherwise isomorphic datasets, and therefore the map from input identifier to canonical identifier would differ. Note that this is in a non-normative explanation detail. Is there some specific text you'd like to add or change?

Co-authored-by: Dan Yamamoto <dan@iij.ad.jp>
Contributor

@dlongley dlongley left a comment

Great, thanks! Approving now -- and I have no issue if there's some informative note added around the isomorphic datasets discussion between @yamdan and @gkellogg.

Co-authored-by: Dave Longley <dlongley@digitalbazaar.com>
spec/index.html Outdated
Comment on lines 1315 to 1317
Alternatively, return the <a>normalized dataset</a> itself,
which includes the <a>input blank node identifier map</a>,
and <a>canonical issuer</a>.
Contributor

Suggested change
Alternatively, return the <a>normalized dataset</a> itself,
which includes the <a>input blank node identifier map</a>,
and <a>canonical issuer</a>.
Optionally, the algorithm may also return the <a>normalized dataset</a> as an auxiliary output,
which includes the <a>input blank node identifier map</a>,
and <a>canonical issuer</a>.

How about saying that the output of the c14n (single and deterministic) is the serialized form, whereas the normalized dataset (possibly non-deterministic) can be obtained as an auxiliary output?

Member Author

So, would changing "Optionally" to "As an auxiliary output" satisfy this? I don't really follow how this is non-deterministic, as it would seem that with a given input, the same normalized dataset would be produced.

Contributor

@dlongley dlongley May 20, 2023

I'd rather not be so prescriptive in saying that you have to return both -- I'd be happy with a both / either. We don't want to force implementations to do extra work they don't need to.

I also don't think we should say what implementations can do as a proxy for indicating that two different implementations might technically map one blank node to ID A and another implementation might map it to ID B. This is @yamdan's point, I believe -- but this only happens when there are isomorphisms that make this difference irrelevant. A serialized version of the dataset would look the same. We should just say this, not impose restrictions on implementations.

Perhaps that's what we say in a note: "Technically speaking, one implementation might map particular blank nodes to different identifiers than another implementation, however, this only occurs when there are isomorphisms in the dataset such that a serialized expression of the dataset would appear the same from either implementation."

And then we can say that algorithms may return both the canonically serialized dataset and the normalized dataset or either of these as requested by the invoker of the algorithm.

Member Author

It might be some time before I can do this update, but it seems simple enough. I'm traveling for the next week, and internet access is spotty. Feel free to update and commit, as this is really just informative, now.

Contributor

I've added a suggestion below.

Contributor

@yamdan yamdan May 22, 2023

@gkellogg,

I don't really follow how this is non-deterministic, as it would seem that with a given input, the same normalized dataset would be produced.

A non-deterministic choice may occur in step 5.3 of the 4.4.3 Algorithm, where ties between results in the hash path list are possible, so which result is chosen first from the list is non-deterministic. (See the debug log from my implementation.)
Even the same implementation can output different canonical issuers depending on the runtime environment or the input blank node identifiers.

The only thing I would like to eliminate here is the possibility of the normalized dataset being misused by someone who believes it is a deterministic, single canonical result and feeds it into a hash or signature.
I think we can prevent this by clearly stating that the serialized form is the output of the canonicalization and the normalized dataset is an auxiliary output.

As @dlongley mentioned, I think this only happens when there are automorphisms in the input dataset.
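
The misuse described above can be sketched in Python using the two canonical issuers from the earlier example. SHA-256 and the JSON encoding of the issuer are chosen here only for illustration; they are assumptions, not something this PR specifies.

```python
import hashlib
import json

# The single serialized form that both issuers from the example produce.
canonical_nquads = (
    "_:c14n0 <http://example.org/vocab#next> _:c14n1 .\n"
    "_:c14n1 <http://example.org/vocab#next> _:c14n0 .\n"
)

# Two canonical issuers that implementations might legitimately produce.
issuer_a = {"e0": "c14n0", "e1": "c14n1"}
issuer_b = {"e0": "c14n1", "e1": "c14n0"}


def sha256_hex(text):
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def issuer_digest(issuer):
    # Hypothetical misuse: hashing the issuer map instead of the serialized form.
    return sha256_hex(json.dumps(issuer, sort_keys=True))


# Safe: the digest of the serialized form does not depend on the issuer chosen.
stable_digest = sha256_hex(canonical_nquads)

# Unsafe: the digests of the two issuers differ for the same dataset.
assert issuer_digest(issuer_a) != issuer_digest(issuer_b)
```

In this sketch, only the digest of the serialized N-Quads would be suitable as hash or signature input.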

Contributor

I will continue this "non-deterministic" topic in a new separate PR.

Co-authored-by: Dave Longley <dlongley@digitalbazaar.com>
@gkellogg gkellogg requested a review from yamdan May 20, 2023 23:46
Co-authored-by: Dave Longley <dlongley@digitalbazaar.com>
@yamdan
Contributor

yamdan commented May 24, 2023

We decided to merge this PR on the 2023-05-24 WG call. I will make a new PR for the "non-deterministic" canonical issuer topic.

@yamdan yamdan merged commit 5903d52 into main May 24, 2023
@yamdan yamdan deleted the ordered-input branch May 24, 2023 14:23
Development

Successfully merging this pull request may close these issues.

Support ordered input dataset or "list of quads" and optional mapping from input indices to output indices
5 participants