Support ordered input dataset or "list of quads" and optional mapping from input indices to output indices #89
To elaborate on the problem a bit more here: The goal is to support selective disclosure use cases with verifiable credentials. The main scenario is this: A holder of a verifiable credential must be able to reveal a subset of the quads from the verifiable credential to a verifier. The verifier must be able to reproduce the same bnode labels that were used when all of the quads from the verifiable credential were canonized and cryptographically signed by the issuer. The problem here lies in the fact that canonizing a subset of a dataset can produce different bnode labels (for the same blank nodes) from canonizing the entire dataset. The process the holder goes through to prepare to selectively disclose a verifiable credential to a verifier looks something like this:
Step 5 is the step we need to enable for the holder. Once we've enabled step 5 for the holder, then the verifier can do this:
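To make the label-instability problem above concrete, here is a minimal sketch in plain Python. It uses a toy relabeling routine, not the actual RDFC-1.0 algorithm, and all quads and names are illustrative; it only shows that the label a blank node receives can depend on which other quads are being canonized.

```python
# Toy illustration of the label-instability problem (NOT the RDFC-1.0 algorithm):
# blank nodes are relabeled in order of first appearance in the sorted quads,
# so the label a node receives depends on which other quads are present.

def toy_canonize(quads):
    """Relabel blank node tokens (starting with '_:') in sorted-quad order."""
    labels = {}
    out = []
    for quad in sorted(quads):
        terms = []
        for term in quad.split():
            if term.startswith("_:"):
                term = labels.setdefault(term, f"_:c14n{len(labels)}")
            terms.append(term)
        out.append(" ".join(terms))
    return out, labels

full = [
    '_:a <http://schema.org/address> _:z .',
    '_:a <http://schema.org/givenName> "Ali" .',
    '_:z <http://schema.org/addressCountry> "United States" .',
]
subset = full[2:]  # holder reveals only the addressCountry quad

_, full_labels = toy_canonize(full)
_, subset_labels = toy_canonize(subset)
print(full_labels)    # {'_:a': '_:c14n0', '_:z': '_:c14n1'}
print(subset_labels)  # {'_:z': '_:c14n0'} -- same blank node, different label
```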
@dlongley While one potential approach is (1) mapping from input indices to output indices, as you suggested, I believe a less intrusive alternative is (2) mapping from input blank node identifiers to output blank node identifiers. This mapping is already represented by the canonical issuer in the current specification draft. By using the canonical issuer for this mapping, we can avoid the need for ordered versions of both the input and normalized datasets. The only addition required to the current specification is to include the canonical issuer as an additional intermediate output of the canonicalization algorithm.

As we addressed in #4 (comment), canonical issuers might not be uniquely determined in certain cases. Note that this issue also arises in the case of index mapping (see the example below). Despite this, it does not pose a problem for the selective disclosure use case: we do not require deterministic and unique canonical issuers (or index mappings) for the selective disclosure you mentioned, as all possible issuers yield the same serialized canonical form. As a result, we must explicitly indicate that the canonical issuer serves as an intermediate output for selective disclosure (and other use cases, if any) and should not be considered canonical output, due to its non-deterministic nature.

Example

Assume that we have the following input dataset:
(This N-Quads representation should be interpreted as an ordered list of quads in the case of (1); we can interpret it as an unordered RDF dataset in the case of (2).) Then the output of the canonicalization algorithm (the serialized canonical form of the normalized dataset) should look like this:
As for (1), we have two possible index mappings:

As for (2), we also have two possible canonical issuers:

We can ensure that all of the above mappings result in the same serialized output (otherwise, there would be a flaw in the existing analysis). Therefore, the choice among these mappings does not matter for the selective disclosure usage mentioned above (#89 (comment)).
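As a minimal illustration of the point above that distinct canonical issuers can yield the same serialized canonical form, here is a small Python sketch; the helper function, blank node identifiers, and issuer maps are all illustrative and not taken from the specification.

```python
# Hedged sketch: a "canonical issuer" is, in effect, a map from input blank
# node identifiers to canonical identifiers. For automorphic inputs more than
# one issuer is possible, yet every valid issuer yields the same serialization.

def apply_issuer(quads, issuer):
    """Rewrite blank node identifiers using the issuer map, then sort."""
    rewritten = []
    for quad in quads:
        terms = [issuer.get(t, t) if t.startswith("_:") else t
                 for t in quad.split()]
        rewritten.append(" ".join(terms))
    return sorted(rewritten)

input_quads = [
    "_:e0 <http://example.org/p> _:e1 .",
    "_:e1 <http://example.org/p> _:e0 .",
]
issuer_a = {"_:e0": "_:c14n0", "_:e1": "_:c14n1"}
issuer_b = {"_:e0": "_:c14n1", "_:e1": "_:c14n0"}

# Both issuers produce the same serialized canonical form.
assert apply_issuer(input_quads, issuer_a) == apply_issuer(input_quads, issuer_b)
```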
The mapping of blank nodes to stable identifiers is now part of the definition of a normalized dataset, and this is effectively the same as the map maintained by the canonical issuer. One thing we could provision for is, if the input is a normalized dataset, initializing the canonical issuer from the map component of the dataset.
The lower level steps here are:
Note: A reversal of the mapping from step 4 will be sent along with the selectively disclosed dataset (the latter of which may have its bnode labels changed at any point, but can then have them transformed back via re-canonizing and applying the map). The above step 4 is what we must enable. We must allow another spec to describe the above process, where it references our spec here in steps 2 and 4.
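A minimal sketch of the verifier-side step described in the note above, assuming the holder has sent a map from subset canonical labels back to the original canonical labels; the helper function and data are illustrative only, and the re-canonization result is stubbed rather than computed.

```python
# Hedged sketch of the verifier-side mapping step. The label_map
# (subset canonical label -> original canonical label) is what the holder
# sends along with the disclosed quads.

def relabel(quads, mapping):
    """Replace blank node labels in N-Quads-like strings using `mapping`."""
    out = []
    for quad in quads:
        out.append(" ".join(mapping.get(t, t) if t.startswith("_:") else t
                            for t in quad.split()))
    return out

# The disclosed quads may arrive with arbitrary labels; the verifier
# re-canonizes them first. Here the re-canonization result is faked to keep
# the sketch short.
recanonized = ['_:c14n0 <http://schema.org/givenName> "Ali" .']

# Holder-supplied reverse map: subset canonical label -> original canonical label.
label_map = {"_:c14n0": "_:c14n1"}

restored = relabel(recanonized, label_map)
# ['_:c14n1 <http://schema.org/givenName> "Ali" .'] -- labels now match the
# ones used when the full dataset was canonized and signed.
```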
On the 10 May 2023 WG call, @dlongley, @gkellogg and @yamdan expressed interest in setting up a dedicated meeting to resolve this issue, perhaps via Doodle on the WG mailing list, so that other interested members can also participate. (UPDATE: see the message here: https://lists.w3.org/Archives/Public/public-rch-wg/2023May/0011.html)
Hmm, this is where my confusion sets in. What does "initialize the canonical issuer" mean, precisely? We don't want to use the values from the map component of the normalized dataset, otherwise we will not produce the new blank node labels (as just the original ones would be output again). We need the new canonical labels as well as the original ones.

While it's true that the holder (the party generating the selectively disclosed dataset) could run the algorithm as you suggest to get output with the original labels, this information would not produce the needed mapping to hand to the verifier (the party receiving the selectively disclosed dataset). What's key is that the verifier will not have access to the normalized dataset and cannot run the algorithm in this way. The holder also can't "cull" from the normalized dataset and then send it along, because it is abstract and a concrete serialization is required for transport (where the bnode labels could change and invalidate the abstract mapping).

So we have a situation with asymmetrical knowledge, where the party that knows the total dataset must produce a transportable mapping that can be applied to the selectively disclosed dataset, post-canonicalization. The holder thus needs to produce both the original canonical labels and the new labels that would be produced from canonizing just the selectively disclosed quads. So those new labels must be known -- as well as a way to map them back. So, we can't just output the same original labels by passing in the normalized dataset again on the verifier side -- because that's not a thing the verifier has (nor can have).
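A sketch of the bookkeeping the holder would need to do, assuming (hypothetically) that the canonicalization API returned an issuer map (input blank node identifier to canonical identifier) for each run; the function names and label values are illustrative, not spec-defined.

```python
# Hedged sketch of the holder-side bookkeeping described above, assuming two
# canonicalization runs each return an issuer map keyed by the input blank
# node identifiers. This only shows the shape of the transportable mapping
# the holder must hand to the verifier.

def build_label_map(full_issuer, subset_issuer):
    """Map each subset canonical label back to the original canonical label,
    joining on the shared input blank node identifiers."""
    reverse = {}
    for bnode, new_label in subset_issuer.items():
        reverse[new_label] = full_issuer[bnode]
    return reverse

# Issuer maps as they might come out of two runs (illustrative values only):
full_issuer = {"_:a": "_:c14n1", "_:z": "_:c14n0"}   # full credential
subset_issuer = {"_:a": "_:c14n0"}                   # disclosed subset

label_map = build_label_map(full_issuer, subset_issuer)
# {'_:c14n0': '_:c14n1'} -- sent to the verifier so the subset's labels can be
# mapped back to the labels that were signed.
```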
Okay, I've obviously remained confused about the steps involved, but as I understand it, the desire is to be able to take, as an additional input, a map of blank nodes in the input dataset to canonical identifiers previously established (which could be a normalized dataset, either retrieved from a previous run, or constructed). The problem is that both the map in the normalized dataset and the canonical issuer used within the algorithm take specific, not abstract, blank nodes from the dataset. If an input were constructed as a normalized dataset which includes that original mapping from blank nodes to canonical identifiers, but where the quads in the dataset represent a subset of the original input dataset, then we could maintain the mapping created when run against the original dataset and correlate it with a mapping from a run against just the subset of quads. I think you would get what you want, but it is important that the blank node objects, used as keys in both maps, are the same thing. One way this might be done would be as follows:
Note that the key is that the objects representing the blank nodes remain the same across different runs of the algorithm. Note that the normalized dataset can be considered to be a combination of the original dataset and a map from blank nodes in that dataset to calculated canonical identifiers.
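A small sketch of the point about object identity, assuming blank nodes are represented as in-memory objects without intrinsic labels; the BlankNode class and issuer maps are purely illustrative.

```python
# Hedged sketch: if the very same blank node objects are used in both runs,
# the two issuer maps can be joined on object identity rather than on labels.
# BlankNode is a stand-in class, not an API from any particular library.

class BlankNode:
    """A blank node with no intrinsic label; identity is the object itself."""
    pass

a, z = BlankNode(), BlankNode()

# Issuer maps from two runs of the algorithm over the SAME objects
# (illustrative labels only):
issuer_full = {a: "_:c14n1", z: "_:c14n0"}   # run over the full dataset
issuer_subset = {a: "_:c14n0"}               # run over the selected subset

# Join on the shared blank node objects to correlate old and new labels.
correlation = {issuer_subset[node]: issuer_full[node] for node in issuer_subset}
assert correlation == {"_:c14n0": "_:c14n1"}
```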
Ok, my read is that you're proposing this:

Holder does:
The above requires the same abstract blank nodes to be used throughout.

Verifier does:
If the above is what you meant, it has the unfortunate requirement that the same abstract blank nodes must be used throughout the process for the holder. This adds complexity to the quad selection mechanism. Today, quads can be selected by:
An additional step would be required, I think, which would involve comparing the selected N-Quads from step 3 with the original dataset -- and producing a set of matches to re-run through the canonicalization algorithm. I wonder if we can avoid that by exposing the canonical identifier issuer instead.
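A sketch of that additional matching step, assuming selection happens over canonical N-Quads strings; the data and index-matching logic are illustrative only.

```python
# Hedged sketch of the extra matching step described above: the selected
# N-Quads (strings carrying canonical labels) are matched back against the
# canonical serialization of the original dataset so that the corresponding
# quads can be re-run through the algorithm.

canonical_full = [
    '_:c14n1 <http://schema.org/familyName> "Jarrett" .',
    '_:c14n1 <http://schema.org/gender> "Female" .',
    '_:c14n1 <http://schema.org/givenName> "Ali" .',
]
selected = {
    '_:c14n1 <http://schema.org/givenName> "Ali" .',
    '_:c14n1 <http://schema.org/familyName> "Jarrett" .',
}

# Indices into the canonically ordered original dataset whose quads should be
# re-canonicalized as the disclosed subset.
matches = [i for i, quad in enumerate(canonical_full) if quad in selected]
assert matches == [0, 2]
```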
Suppose that we expose the canonical identifier issuer for external use. This seems like it would practically solve the problem, but might run afoul of some abstract RDF rules:

Holder does:
Verifier does:
Yes, it struck me that the map of blank nodes to canonical identifiers is effectively just the canonical issuer, which is now part of the definition of a normalized dataset. As we can't rely on JSON-LD Framing being the sole mechanism for removing quads, and we can't rely on systems preserving the blank node identifiers during transforms along the way, it seems that skolemization is the appropriate route.
This is somewhat different than re-canonicalizing.

Example:

Input dataset

# original dataset
_:b0 <http://schema.org/address> _:b1 .
_:b0 <http://schema.org/familyName> "Jarrett" .
_:b0 <http://schema.org/gender> "Female" . # gender === Female
_:b0 <http://schema.org/givenName> "Ali" .
_:b0 <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://schema.org/Person> .
_:b1 <http://schema.org/addressCountry> "United States" .
_:b1 <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://schema.org/PostalAddress> .

Canonicalized result

# normalized dataset
_:c14n0 <http://schema.org/addressCountry> "United States" .
_:c14n0 <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://schema.org/PostalAddress> .
_:c14n1 <http://schema.org/address> _:c14n0 .
_:c14n1 <http://schema.org/familyName> "Jarrett" .
_:c14n1 <http://schema.org/gender> "Female" . # gender === Female
_:c14n1 <http://schema.org/givenName> "Ali" .
_:c14n1 <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://schema.org/Person> .

Skolemized result

<https://w3c.org/ns/rch/skolem#c14n0> <http://schema.org/addressCountry> "United States" .
<https://w3c.org/ns/rch/skolem#c14n0> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://schema.org/PostalAddress> .
<https://w3c.org/ns/rch/skolem#c14n1> <http://schema.org/address> <https://w3c.org/ns/rch/skolem#c14n0> .
<https://w3c.org/ns/rch/skolem#c14n1> <http://schema.org/familyName> "Jarrett" .
<https://w3c.org/ns/rch/skolem#c14n1> <http://schema.org/gender> "Female" . # gender === Female
<https://w3c.org/ns/rch/skolem#c14n1> <http://schema.org/givenName> "Ali" .
<https://w3c.org/ns/rch/skolem#c14n1> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://schema.org/Person> .

Subset dataset

<https://w3c.org/ns/rch/skolem#c14n1> <http://schema.org/familyName> "Jarrett" .
<https://w3c.org/ns/rch/skolem#c14n1> <http://schema.org/gender> "Female" . # gender === Female
<https://w3c.org/ns/rch/skolem#c14n1> <http://schema.org/givenName> "Ali" .
<https://w3c.org/ns/rch/skolem#c14n1> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://schema.org/Person> .

Regenerate skolem identifiers and resulting map

_:c14n0 <http://schema.org/familyName> "Jarrett" .
_:c14n0 <http://schema.org/gender> "Female" . # gender === Female
_:c14n0 <http://schema.org/givenName> "Ali" .
_:c14n0 <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://schema.org/Person> .

Map of identifiers in
For step 4:
The first sentence here tripped me up. Can't this just be broken into these steps:
If so, it looks like it matches the outcome of what I said in step 3 in my comment above. All that would remain after this would be to build a reverse mapping from the canonical identifier issuer, such that the canonical identifiers in C2 could later be mapped back to C1 by the verifier. If you agree, then I think we're on the same page and we just need to see if it passes muster with others, and if so, add some spec text that makes it clear that implementations may expose internal state, such as the canonical identifier issuer, to enable other specs to reference / make use of it in their own custom algorithms.
The problem is that, formally, blank nodes in datasets do not have stable identifiers, even though many/most implementations may retain those from the serialization. This is why the step uses skolem IRIs, which are stable.
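A minimal sketch of skolemizing and de-skolemizing canonical labels along the lines discussed above; the regular expressions and helper names are illustrative (real blank node labels and literals would need more careful handling), and only the skolem base IRI is taken from the example above.

```python
# Hedged sketch: canonical blank node labels are swapped for stable skolem
# IRIs so they survive transforms outside the algorithm, and swapped back
# before re-canonicalizing.
import re

SKOLEM_BASE = "https://w3c.org/ns/rch/skolem#"

def skolemize(nquads_line):
    """Replace _:label with <SKOLEM_BASE + label>."""
    return re.sub(r"_:([A-Za-z0-9]+)",
                  lambda m: f"<{SKOLEM_BASE}{m.group(1)}>", nquads_line)

def deskolemize(nquads_line):
    """Replace <SKOLEM_BASE + label> with _:label."""
    return re.sub(re.escape("<" + SKOLEM_BASE) + r"([A-Za-z0-9]+)>",
                  r"_:\1", nquads_line)

line = "_:c14n1 <http://schema.org/address> _:c14n0 ."
assert deskolemize(skolemize(line)) == line  # round-trip preserves the quad
```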
Thank you @gkellogg and @dlongley; based on your discussion, I can now correct my idea of using the canonical issuer. I describe it below; it is somewhat redundant, but might be helpful for getting the picture (at least for me). The following example contains several elements that are not in the scope of this WG. What I think we have to define here additionally is an extended canonicalization algorithm (= "new algorithm," as @gkellogg said).

Holder
(this can also be represented as an array
Verifier
(here the verifier recovers the (original) first and third quads, and learns that the second quad has not been disclosed by the holder)
I think we should find a way to address that without creating a new algorithm in our spec here. It seems that, for the implementations you describe, the process can work like this:
Then, the verifier can do this:
So it seems that exposing the canonical identifier issuer and the normalized dataset with the bnode => label mapping can enable either style of implementation to perform the bnode mapping task. It's just that implementations that do not keep blank nodes stable may have an extra step.
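A sketch of what the exposed API surface might look like under this proposal; the type names and fields are illustrative, not proposed spec text.

```python
# Hedged sketch of the API shape suggested above: the canonicalization entry
# point exposes, alongside the canonical N-Quads, the canonical identifier
# issuer (input blank node identifier -> canonical identifier).
from typing import Dict, List, NamedTuple

class CanonicalizationResult(NamedTuple):
    canonical_nquads: List[str]   # serialized canonical form
    issuer: Dict[str, str]        # input bnode identifier -> canonical id

def canonicalize(input_nquads: List[str]) -> CanonicalizationResult:
    """Placeholder for the real algorithm; shown only for the return shape."""
    raise NotImplementedError

# A holder would call canonicalize() twice (full dataset, then subset) and
# join the two issuer maps to build the label mapping for the verifier;
# implementations that do not keep blank node labels stable would skolemize
# first, as discussed above.
```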
I just saw your comment now -- I'll take a look when I can to see if it matches what I just put in my comment; it looks similar at a high level.
@yamdan said:
The problem with this step is that, after deskolemizing C2, you lose any association with the quads from the original dataset and the subset, which is why I kept the skolemized versions until the very end. This might require updating the notion of the issuer to map nodes (either IRIs or blank nodes) to canonical labels, and simply invoking it on blank nodes in the primary algorithms, and on IRIs matching the skolem pattern in the de-skolemizing version. In hindsight, one way would have been to cast the algorithm as immediately transforming blank nodes to skolem IRIs, using those to create new skolem IRIs with canonical labels, and turning those back into blank nodes with their canonical labels when serializing to N-Quads. I'll respond to @dlongley later, as I'm tied up for most of the rest of the day.
IIRC, we introduced Skolemization to fix the blank node identifiers so that they do not change during the selective disclosure process, outside the canonicalization.
Summary of today's discussion

The objective is to allow a canonicalized dataset to be subsetted and re-canonized such that the canonical identifiers from the original dataset can be correlated with the canonical identifiers from the subsetted dataset.

The problem to overcome is that, presently, the input to the C14N algorithm is a dataset, which abstractly contains no blank node identifiers, although many/most implementations do retain such identifiers.

The solution that emerged from today's discussion was to allow a concrete N-Quads serialization of a dataset as an input, and to use this to seed the issued identifiers map in the canonical issuer. Language needs updating, but this effectively maps blank nodes in the input dataset (via their identifiers) to canonical identifiers. Doing so requires ensuring that each blank node in the input dataset has an identifier. This can be created when parsing an N-Quads document as an input, and could be maintained in something like the normalized dataset. If the input was a dataset without identifiers (or where identifiers are only partially assigned), the algorithm would ensure that each blank node had a unique identifier.

It is probably worth renaming the normalized dataset to something like stabilized dataset, and replacing the map component with an identifier issuer, which effectively records the same thing.

A system trying to use the C14N algorithm for something such as selective disclosure would need to ensure that the blank node identifiers resulting from the first canonicalization are preserved and used when the algorithm is called with a subset dataset, possibly by skolemizing the blank nodes making use of the canonical identifiers, so that they can be re-created when de-skolemizing the subset dataset.
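A sketch of what "seeding the issued identifiers map" could look like, assuming identifiers recovered from parsing an N-Quads input; the IdentifierIssuer class loosely mirrors the spec's identifier issuer concept but is illustrative only (a real implementation would also have to avoid collisions between seeded and newly issued identifiers).

```python
# Hedged sketch: when the input is a concrete N-Quads document, the blank node
# identifiers from that document can be recorded up front, so later runs
# (e.g. over a de-skolemized subset) refer back to the same identifiers.

class IdentifierIssuer:
    def __init__(self, prefix="c14n", issued=None):
        self.prefix = prefix
        self.issued = dict(issued or {})   # existing identifier -> issued id
        self.counter = len(self.issued)

    def issue(self, existing_id):
        if existing_id not in self.issued:
            self.issued[existing_id] = f"{self.prefix}{self.counter}"
            self.counter += 1
        return self.issued[existing_id]

# Identifiers recovered by parsing the N-Quads input (not abstract bnodes),
# e.g. from a previous canonicalization:
seed = {"b0": "c14n1", "b1": "c14n0"}
issuer = IdentifierIssuer(issued=seed)

assert issuer.issue("b0") == "c14n1"   # reuses the seeded identifier
assert issuer.issue("b2") == "c14n2"   # new blank nodes still get fresh ids
```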
Great, thanks for the summary! Is this ready for a PR, or does it still need some more discussion?
No, ready for PR, I believe. Still some things to be worked out in the process, but the goal seems clear.
Starting with my comment here: #86 (comment)
Some discussion spawned around the need for an optional output: a mapping of quad input indices to quad output indices for selective disclosure use cases.
@gkellogg made this comment:
Which implies that we might want to also take an ordered list of quads as an optional alternative input to the algorithm. Or perhaps we can describe the RDF abstract dataset as being optionally represented as such -- for the case where this mapping output is desirable. Notably, the presence (or lack thereof) of input blank node labels in this case is not relevant.
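A sketch of the optional index-mapping output being asked for here, assuming an ordered list of quads as input; the relabeling is a toy stand-in for the real algorithm, and only the index bookkeeping is the point.

```python
# Hedged sketch of the original ask in this issue: if the input is an ordered
# list of quads, the algorithm could additionally return a mapping from input
# indices to output indices in the serialized canonical form.

def canonicalize_with_index_map(input_quads):
    labels = {}

    def relabel(quad):
        # Toy relabeling of blank node tokens; NOT the real canonical labeling.
        return " ".join(labels.setdefault(t, f"_:c14n{len(labels)}")
                        if t.startswith("_:") else t for t in quad.split())

    relabeled = [relabel(q) for q in input_quads]   # same order as input
    output = sorted(set(relabeled))                 # canonical (sorted) order
    index_map = {i: output.index(q) for i, q in enumerate(relabeled)}
    return output, index_map

quads = [
    '_:x <http://schema.org/givenName> "Ali" .',
    '_:x <http://schema.org/familyName> "Jarrett" .',
]
output, index_map = canonicalize_with_index_map(quads)
# index_map records where each input quad ended up in the canonical output,
# e.g. {0: 1, 1: 0} for this input order.
```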