Adds the "input blank node identifier map" #100
Conversation
… from an input N-Quads document, otherwise uses arbitrary identifiers for each blank node in the input dataset. Clears up some ambiguity about original blank node identifiers.
We've been pretty fuzzy about identifiers in the input dataset; this is a WIP, but it should make things more concrete. If identifiers were not previously assigned (i.e., by parsing an N-Quads document), arbitrary identifiers are assigned to the blank nodes in the input. This does not create any interoperability issues, so there is no need to use any particular algorithm for assigning input blank node identifiers.
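To illustrate that last point, here is a minimal sketch (hypothetical names, not taken from the spec) of how an implementation might assign arbitrary input identifiers to blank nodes that did not arrive with identifiers from an N-Quads parse:

```typescript
// Hypothetical sketch (names invented for illustration): assign arbitrary
// identifiers to blank nodes that did not arrive with identifiers, e.g.
// when the input is an abstract dataset rather than a parsed N-Quads document.

type BlankNode = { termType: "BlankNode"; value: string };

// Maps blank nodes in the input dataset to their input identifiers.
const inputBlankNodeIdentifierMap = new Map<BlankNode, string>();
let counter = 0;

function issueInputId(bnode: BlankNode): string {
  let id = inputBlankNodeIdentifierMap.get(bnode);
  if (id === undefined) {
    // Any assignment scheme works; only canonical identifiers appear in the
    // final serialized output, so this choice has no interoperability impact.
    id = `b${counter++}`;
    inputBlankNodeIdentifierMap.set(bnode, id);
  }
  return id;
}
```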
(GitHub does not allow me to put this comment on an old, unedited area.)
Shouldn't §4.4.3/7 make it explicit that the algorithm also returns (or optionally returns?) the canonical issuer, which will contain the mapping of the input bnode identifiers to the canonical ones?
Thanks @gkellogg, I think the input blank node identifier map introduced here is helpful in practice.
Yes, I think it should be explicit that it may optionally return it. However, we can do that without modifying the step numbering, as @gkellogg mentioned, and I'd be happy with that.
Co-authored-by: Ivan Herman <ivan@ivan-herman.net>
Co-authored-by: Dave Longley <dlongley@digitalbazaar.com>
Co-authored-by: Dan Yamamoto <dan@iij.ad.jp>
…ier map to be used instead of the serialized N-Quads result.
Another thing to consider: update the definition of a normalized dataset to be the input dataset, the input blank node identifier map, and the canonical issuer. This is useful, as the input blank node identifier map maps blank nodes in the input dataset to associated blank node identifiers, and the canonical issuer maps those identifiers to canonical identifiers. This allows an implementation to either take the N-Quads result or the normalized dataset, without needing to keep them as two separate things.
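For concreteness, a rough sketch of what that proposed structure could look like (type and field names are hypothetical, not taken from the spec text):

```typescript
// Rough sketch of the proposed "normalized dataset" structure
// (type and field names are hypothetical).

type Term = { termType: string; value: string };
type Quad = { subject: Term; predicate: Term; object: Term; graph: Term };
type BlankNode = { termType: "BlankNode"; value: string };

interface NormalizedDataset {
  // The immutable RDF dataset portion, initialized from the input dataset.
  inputDataset: readonly Quad[];
  // Maps blank nodes in the input dataset to their input identifiers.
  inputBlankNodeIdentifierMap: Map<BlankNode, string>;
  // Maps input identifiers to canonical identifiers (e.g., "e0" -> "c14n0").
  canonicalIssuer: Map<string, string>;
}
```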
spec/index.html
Outdated
in the <a>input blank node identifier map</a>,
otherise arbitrary identifiers are assigned for each
otherwise arbitrary identifiers are assigned for each
in the <a>input blank node identifier map</a>,
otherise arbitrary identifiers are assigned for each
otherwise arbitrary identifiers are assigned for each
in the <a>input blank node identifier map</a>;
otherwise, arbitrary identifiers are assigned for each
Co-authored-by: Ted Thibodeau Jr <tthibodeau@openlinksw.com>
Co-authored-by: Dave Longley <dlongley@digitalbazaar.com>
spec/index.html
Outdated
@@ -969,8 +969,9 @@ <h3>Algorithm</h3>
as well as instantiating a new <a>canonical issuer</a>.</p>
<p>After this algorithm completes,
the <a>input blank node identifier map</a> state
may be used to correlate blank node identifiers
used in the <a>input dataset</a> to those used
and / or <a>canonical issuer</a> may be used to
and / or <a>canonical issuer</a> may be used to
and/or <a>canonical issuer</a> may be used to
That text was changed, as it now includes both these things.
…lized dataset. This makes the RDF dataset portion of this immutable, initialized from the input dataset. Updates some steps to clarify that updates are for map entries, and not adding quads, which simplifies step 6.
Last commit adds the input blank node identifier map and canonical issuer to the normalized dataset. This makes the RDF dataset portion of this immutable, initialized from the input dataset. Updates some steps to clarify that updates are for map entries, and not adding quads, which simplifies step 6.
mapping the identifiers in the <a>input blank node identifier map</a>
to their canonical identifiers.
</p>
</details>
I think it would be better to note that the normalized dataset should not be interpreted as a single canonical representation because the algorithm can output different canonical issuers depending on the implementation or runtime environment. (#89 (comment))
For example, an input dataset
_:e0 <http://example.org/vocab#next> _:e1 .
_:e1 <http://example.org/vocab#next> _:e0 .
can be transformed into the normalized dataset with either one of the following canonical issuers, depending on the implementation:
{ "e0": "c14n0", "e1": "c14n1" }
{ "e0": "c14n1", "e1": "c14n0" }
Both canonical issuers result in the same single serialized form:
_:c14n0 <http://example.org/vocab#next> _:c14n1 .
_:c14n1 <http://example.org/vocab#next> _:c14n0 .
So, we can only say that the serialized form is a single canonical representation; the normalized dataset is possibly not.
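A tiny sketch (hypothetical helper names, not from the spec) making this concrete: applying either issuer above to the example dataset and sorting the lines produces the identical serialization.

```typescript
// Demonstration that either canonical issuer yields the same serialized form
// for the example dataset, once the quads are relabeled and sorted.

const quads: [string, string, string][] = [
  ["_:e0", "<http://example.org/vocab#next>", "_:e1"],
  ["_:e1", "<http://example.org/vocab#next>", "_:e0"],
];

function serialize(issuer: Record<string, string>): string {
  const relabel = (t: string) =>
    t.startsWith("_:") ? `_:${issuer[t.slice(2)]}` : t;
  return quads
    .map(([s, p, o]) => `${relabel(s)} ${p} ${relabel(o)} .`)
    .sort()
    .join("\n");
}

const a = serialize({ e0: "c14n0", e1: "c14n1" });
const b = serialize({ e0: "c14n1", e1: "c14n0" });
console.log(a === b); // true: the automorphism hides which issuer was used
```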
It doesn't say that the normalized dataset is a single canonical representation; as you point out, the association of blank nodes to input identifiers could be different for two otherwise isomorphic datasets, and therefore the map from input identifier to canonical identifier would differ. Note that this is in a non-normative explanation detail. Is there some specific text you'd like to add or change?
Co-authored-by: Dan Yamamoto <dan@iij.ad.jp>
Co-authored-by: Dave Longley <dlongley@digitalbazaar.com>
spec/index.html
Outdated
Alternatively, return the <a>normalized dataset</a> itself,
which includes the <a>input blank node identifier map</a>,
and <a>canonical issuer</a>.
Alternatively, return the <a>normalized dataset</a> itself,
which includes the <a>input blank node identifier map</a>,
and <a>canonical issuer</a>.
Optionally, the algorithm may also return the <a>normalized dataset</a> as an auxiliary output,
which includes the <a>input blank node identifier map</a>,
and <a>canonical issuer</a>.
How about saying that the output of the c14n (single and deterministic) is the serialized form, whereas the normalized dataset (possibly non-deterministic) can be obtained as an auxiliary output?
So, would changing "Optionally" to "As an auxiliary output" satisfy this? I don't really follow how this is non-deterministic, as it would seem that with a given input, the same normalized dataset would be produced.
I'd rather not be so prescriptive in saying that you have to return both -- I'd be happy with a both / either. We don't want to force implementations to do extra work they don't need to.
I also don't think we should say what implementations can do as a proxy for indicating that two different implementations might technically map one blank node to ID A and another implementation might map it to ID B. This is @yamdan's point, I believe -- but this only happens when there are isomorphisms that make this difference irrelevant. A serialized version of the dataset would look the same. We should just say this, not impose restrictions on implementations.
Perhaps that's what we say in a note: "Technically speaking, one implementation might map particular blank nodes to different identifiers than another implementation, however, this only occurs when there are isomorphisms in the dataset such that a serialized expression of the dataset would appear the same from either implementation."
And then we can say that algorithms may return both the canonically serialized dataset and the normalized dataset or either of these as requested by the invoker of the algorithm.
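One shape such an invoker-driven interface could take, as an illustration only (API and option names are invented, not a spec requirement):

```typescript
// Hypothetical API sketch: the invoker asks for the normalized dataset only
// when it is needed, so implementations are not forced to do extra work.

type Quad = unknown;              // stand-in for an RDF quad representation
type NormalizedDataset = unknown; // see the structure sketched earlier

interface CanonicalizeOptions {
  // When true, also return the (possibly implementation-dependent)
  // normalized dataset as an auxiliary output.
  produceNormalizedDataset?: boolean;
}

interface CanonicalizeResult {
  canonicalNQuads: string;               // the deterministic serialized output
  normalizedDataset?: NormalizedDataset; // auxiliary output, if requested
}

declare function canonicalize(
  input: Quad[],
  options?: CanonicalizeOptions
): CanonicalizeResult;
```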
It might be some time before I can do this update, but it seems simple enough. I'm traveling for the next week, and internet access is spotty. Feel free to update and commit, as this is really just informative, now.
I've added a suggestion below.
I don't really follow how this is non-deterministic, as it would seem that with a given input, the same normalized dataset would be produced.
Non-deterministic choice may occur in step 5.3 of the 4.4.3 Algorithm, where it's possible to have tied results in the hash path list, so it's non-deterministic which result is chosen first from the list. (See the debug log from my implementation.)
Even the same implementation can output different canonical issuers depending on the runtime environment or the input blank node identifiers.
The only thing I would like to eliminate here is the possibility of misusing the normalized dataset: believing that it is a deterministic, single canonical result and connecting it to the hash or signature input.
I think we can prevent this by clearly stating that the serialized form is the output of the canonicalization and the normalized dataset is an auxiliary output.
As @dlongley mentioned, I think this only happens when there are automorphisms in the input dataset.
I will continue this "non-deterministic" topic in a new separate PR.
Co-authored-by: Dave Longley <dlongley@digitalbazaar.com>
Co-authored-by: Dave Longley <dlongley@digitalbazaar.com>
We decided to merge this PR on the 2023-05-24 WG call. I will make a new PR related to the "non-deterministic" canonical issuer topic.
and a way to initialize it from an input N-Quads document, otherwise uses arbitrary identifiers for each blank node in the input dataset.
Clears up some ambiguity about original blank node identifiers.
Fixes #89.