What is the output of the c14n algorithm? #4
Looking at this document: https://w3c-ccg.github.io/rdf-dataset-canonicalization/spec/#canonicalization-algorithm-terms
If the c14n algorithm itself doesn't produce the serialization, then that is a separate step that needs to be specified elsewhere.
Absolutely. For me this would be part of the 'hashing' specification.
To clarify my initial point, consider the following Turtle:

@prefix s: <http://schema.org/>.
_:b a s:Person.

It does not represent a specific graph. It represents any graph containing exactly one triple, with predicate http://www.w3.org/1999/02/22-rdf-syntax-ns#type, with object http://schema.org/Person, and with some blank node as its subject. All these graphs are isomorphic to each other, so it does not really matter which one we "get" when we parse that Turtle; changing the blank node label in the Turtle would not change which graphs it represents.

So strictly speaking, the c14n algorithm is not producing "the" canonical abstract dataset, but a deterministic labeling of the blank nodes that is the same for all isomorphic graphs. This labeling can in turn be used to produce a canonical serialization of the dataset. But I consider that the labeling is more primitive than the serialization (one could devise several canonical serializations with the same labeling, e.g. using the expanded form of JSON-LD rather than N-Quads). That's why I advocate for the labeling to be the output of the c14n algorithm specified in the RDC deliverable.

Writing this, I even realize that the labeling is based on an ordering of the blank nodes in the dataset, and that ordering may also be considered the most primitive output.
As I discussed on the call today, I think returning a mapping is problematic, as that mapping only makes sense in a context where the original blank nodes can be referenced, which would be in some internal representation. Once taken away from the context where a parsed RDF serialization has specific blank nodes allocated (i.e., when re-serialized), the mapping is not useful. As an alternative to serializing the result as sorted N-Quads, using the c14n labels for blank nodes, we could serialize an ordered array of such statements. For example, in the CG test suite, https://w3c-ccg.github.io/rdf-dataset-canonicalization/tests/test020-in.nq has the following form:
After canonicalization, the expected N-Quads might be:
A hypothetical JSON (not JSON-LD) serialization could be the following:
This can be easily turned back into the canonical N-Quads by joining the array elements with newlines.
Agree that a mapping is problematic - the input may not have labels! - and the mapping only works for one input form, e.g. N-Triples vs RDF/XML. Outputting N-Quads (with a clear syntactic canonical form - single spaces etc.) is sensible. Would splitting canonical N-Quads on "\n" to produce the array for JSON be problematic? The only issue I can think of is integrity: a slice of missing lines is not detectable in either version.
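To make that split/join round trip concrete, here is a minimal Python sketch (my own illustration, not from the thread): the sample quads and function names are hypothetical, and it assumes the canonical N-Quads document uses "\n" as its line terminator, as discussed above.

```python
import json

def nquads_to_json_array(canonical_nquads: str) -> str:
    """Split a canonical N-Quads document on "\\n" into a JSON array of quad strings."""
    quads = [line for line in canonical_nquads.split("\n") if line]
    return json.dumps(quads, indent=2)

def json_array_to_nquads(json_array: str) -> str:
    """Concatenate the array elements back into the canonical N-Quads document."""
    return "".join(quad + "\n" for quad in json.loads(json_array))

# Hypothetical canonical output with one blank node labeled _:c14n0.
canonical = (
    '_:c14n0 <http://schema.org/name> "Foo" .\n'
    '_:c14n0 <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://schema.org/Person> .\n'
)
# Note: as observed above, a slice of missing lines is not detectable in either form.
assert json_array_to_nquads(nquads_to_json_array(canonical)) == canonical
```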
Agreed, and that was my implicit assumption. We could define the C14N algorithm as working purely on the abstract syntax -- meaning that each implementation would be working on its internal representation of the abstract syntax. Granted, writing a test-suite for such an abstract algorithm would be challenging, compared to having serialized inputs and outputs. But if some use-cases require the ability to link back to the original internal blank-node label (as suggested by @dlongley during the call), then it might be worth the trouble.
I think it would work ok. But isn't it cleaner to produce a list of strings, and leave it to the consumer to concatenate them if they need to?
It is a new format in JSON syntax with JSON escaping and quoting rules. It's really a new RDF syntax (cf. N-Triples, which originated in the RDF Core WG). N-Quads (with its quoting and escape rules) can be fed into existing RDF systems and has a MIME type. Many JSON parsers prefer not to stream because they guarantee valid JSON, i.e. they check for a valid end of the list before returning results. What is being hashed? #92 talks about hashing the canonicalized RDF document (top hashing path). Even if hashing is on the abstract data model, then the input is parsed, and RDF is the common format for that.
In the simple case (blue path in the figure at #92), hashing is definitely on some normalized N-Quads serialization (IIUC). But other paths of the figure (and possibly other use-cases) need more granular data. Also, to be clear: I don't suggest that the C14N produces a serialization of that granular output, but something more at the data level (potentially based on [INFRA]).
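For the simple whole-document path, a minimal hashing sketch (again my own illustration, not from the thread): SHA-256 and UTF-8 encoding of the canonical N-Quads document are assumptions, not decisions recorded in this discussion.

```python
import hashlib

def hash_canonical_document(canonical_nquads: str) -> str:
    """Hash the entire canonical N-Quads document (the simple, whole-dataset path)."""
    return hashlib.sha256(canonical_nquads.encode("utf-8")).hexdigest()
```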
The WG test suite will need a document format. If you want structure - use the SPARQL JSON results format :-)
So the use case I mentioned on the call is for selective disclosure of statements (i.e., quads) in a dataset. I'll talk about that a bit here in an effort to help tease out the requirements for the inputs / outputs of the canonicalization algorithm.

Selective disclosure is the ability for someone to share only some of the statements from a signed dataset, without harming the ability of the recipient to verify the authenticity of those selected statements. Most cryptographic schemes that enable selective disclosure will need to apply a hash and signature to each individual component that can be selectively disclosed. This means, at least, signing each statement individually (note: I've seen at least one paper that seems to go further in signing individual elements of the statement as well, but I'm not familiar enough with it to say more at this time).

Selective disclosure, ideally, only reveals the information disclosed (no other information leaks). This is not always possible, but leakage should be minimized to the greatest extent it can be. For example, if you are revealing information from a credential about one of your children, ideally the recipient does not also necessarily learn how many children you have. Selective disclosure also ideally does not introduce any other units of correlation beyond what information is revealed. Again, this isn't always possible, but should be minimized. For example, if you reveal that you're a citizen of country X in one interaction, revealing this information again should not uniquely identify you, but rather, could indicate that you are any person in a group of 1000+ individuals. This property is called "unlinkability", because one cannot easily link the same individual across multiple interactions.

Now, a canonicalization algorithm produces canonical blank node labels using all data in a dataset -- to ensure that any differences in the data are accounted for. This means that, for selective disclosure, the canonical labels themselves depend on statements that may never be disclosed, and can therefore leak information or act as a correlation vector. In light of this, we should endeavor to make it easy for people to make modifications to the blank node labels output by the canonicalization algorithm. The goal for selective disclosure here is to minimize the possible number of blank node labelings (whilst also not introducing some other significant burdens). By reducing the number of labelings, the number of datasets (i.e., the number of individuals here) that have the same labeling increases, improving herd privacy and reducing information leakage.

Note that introducing a secret randomizing factor (e.g., a salt or HMAC) that is applied to the canonized blank node labels is insufficient for solving this problem. It will prevent recipients from being able to "guess" what some non-disclosed statements contain, because it decouples the labels chosen from the data; however, it will not help increase unlinkability. In fact, it may make it even easier to uniquely identify an individual, depending on the randomized output size vs. the number of possible blank node labelings.

So far, at least two mechanisms have been considered to minimize possible blank node labelings. The first imposes a "template" that must be shared between the author of the dataset, the presenter of selectively disclosed statements, and the recipient. This template, with dummy values, is what is fed into the canonicalization algorithm so that every individual will have the same blank node labels. There are a number of issues with this approach, including the overhead, storage, and retrieval of such a template, as well as the leakage of information (e.g., selectively disclosing information about your 3rd child using such a template will always reveal you have at least 3 children).

The second is the one I mentioned on the call. It seeks to eliminate the overhead of a template and reduce (at least in some cases) the information leakage problem. It looks something like this:
Given the above, I don't know that we're looking for a direct mapping from the original labels to the canonized labels, but we are looking to preserve some abstract structure and references / mappings from canonized labels to blank nodes (and / or vice versa) to avoid having to perform extra parsing steps. Of course, any implementation could surface the features needed to do the things above, but what is needed is the ability to reference and reuse the canonicalization algorithm in spec text for the selective disclosure schemes. Those specifications need to be able to say something along the lines of: "Use canonicalization algorithm X [spec Y] to do step 2 and pass the output to step 3".
The timing won't be right, but the JSON-LD CG is starting work on NDJSON-LD, or Newline Delimited JSON-LD, mostly to have a stream format for YAML-LD, where YAML supports streams including zero or more YAML documents. This may end up being based on JSON Text Sequences (RFC 7464) or NDJSON, where the difference largely comes down to what the record separator between documents is. Conceptually, this is pretty simple, and having a format that allows easy extraction/introspection of elements of each quad could be useful. That said, I think that the hash itself, either for the complete document or for selective disclosure, is based on the sorted canonical N-Quads representation; but as an intermediate form, something like NDJSON-LD to represent each quad with canonicalized blank node identifiers would be useful. The SPARQL JSON Results format could certainly be used, and could use something like the SPARQL-star results representation to represent triples, but would need a way to fit graph names in, which isn't otherwise specified as a Quads format, IIRC.
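Purely to illustrate the kind of per-quad record such a format could expose (this is not the actual NDJSON-LD design, which does not exist yet; the field names below are invented for the sketch):

```python
import json

def quad_to_ndjson_record(subject, predicate, obj, graph=None):
    """One JSON record per quad (newline-delimited), exposing each component separately."""
    record = {"subject": subject, "predicate": predicate, "object": obj}
    if graph is not None:
        record["graph"] = graph  # the graph-name slot that triple-only formats lack
    return json.dumps(record)

# Hypothetical canonical quad with blank node label _:c14n0, in the default graph.
print(quad_to_ndjson_record(
    "_:c14n0",
    "http://www.w3.org/1999/02/22-rdf-syntax-ns#type",
    "http://schema.org/Person",
))
```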
Just want to be sure I understand this, working through an example using test002 from the test suite (dataset X):
Using the URDNA2015 algorithm to canonicalize the input generates the following (CX):
(By "value", I presume you mean Literal, including potential language and datatype information). (I'm unclear on when/how additional dummy statements might be injected). Using the same statement ordering, the "herd-privacy" dataset Y would be:
IIRC, this would result in CY:
(edited; the hash of the resulting N-Quads would be "…") This generates the following hashes for each statement, in order:
The only thing that should be different in what you said above would be that CY uses the original values (not the dummy ones):
You can see how it changed from CX:
And, presumably, if you were to change "Foo" to a variety of other values, you'd also get a variety of different blank node labelings in CX -- that would each be reduced in CY, thereby increasing herd privacy.
So, what does the phrase 'and every value with the same "dummy" value' mean, then? Is there some case where values are replaced with dummy values?
Also, given that the intermediate step replaces all blank node labels with …
Ok, in order to (hopefully) answer your questions, I thought it would be easiest if I also wrote down all of the steps below as I intended to communicate earlier (but my description wasn't clear enough). Note that I didn't actually run any canonize algorithm here, so the canonize outputs are just hypothetical. The tricky part to all this is ensuring that every implementation will swap the blank node labels in and out in the same way. Let's say we start with this input:
Which hypothetically canonizes to this:
We then create a new set of statements with every blank node label replaced with the same placeholder label:
We remember the order of these statements (so that we can break ties later) and we remember the actual uniqueness of each blank node (even though their labels are the same here for sorting purposes). In other words:
We then sort the statements, breaking ties using the original sort order:
We then assign blank nodes in the order that the blank nodes appear, using their original uniqueness:
Producing:
Finally this is sorted to produce the final output:
Now, the more ties that have to be broken, the less herd privacy there will be. But consider other data sets such as this optimal (for the purposes here) one:
And suppose it hypothetically canonizes to:
When the above algorithm is applied, we'd be sorting a data set that looks like this:
This sort order is always the same here (no ties) and would always result in the same blank node labels being used in the final output, regardless of the literal values in the input. So the purpose of this algorithm is to make the final labels be influenced by literal values only in the case that a tie needs to be broken.
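A rough Python sketch of my reading of the relabeling steps described above, treating canonical quads as plain strings. The placeholder token, the "_:b" prefix for reissued labels, and the regex-based substitution are illustrative assumptions only (the naive regex would misbehave if a literal happened to contain something that looks like a blank node label); this is not the algorithm from any spec.

```python
import re

BNODE = re.compile(r"_:[A-Za-z0-9]+")
PLACEHOLDER = "_:x"  # assumed placeholder label; the exact token is elided in the thread

def herd_privacy_relabel(canonical_quads: list[str]) -> list[str]:
    """Re-label canonical blank nodes so that literal values influence the final
    labels only where ties between masked statements must be broken."""
    # 1. Mask every blank node label, remembering the original statement order
    #    (the tie-breaker) alongside each masked statement.
    masked = [(BNODE.sub(PLACEHOLDER, quad), index)
              for index, quad in enumerate(canonical_quads)]

    # 2. Sort the masked statements, breaking ties with the original order.
    masked.sort()

    # 3. Walk the sorted statements and issue fresh labels in the order in which
    #    the (still distinct) original blank nodes first appear.
    issued = {}
    def reissue(match):
        return issued.setdefault(match.group(0), f"_:b{len(issued)}")
    relabeled = [BNODE.sub(reissue, canonical_quads[index]) for _, index in masked]

    # 4. Sort the relabeled statements to produce the final output.
    return sorted(relabeled)
```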
@dlongley, Thank you for your example that helps me understand your privacy-enhancing blank node labeling. If I am not mistaken, the example includes two statements
that should have been
@dlongley, regardless of the details of your extra step described above, let us come back to the original question also discussed in the WG call. The question at the call and of this issue is "what is the output of the c14n algorithm?". The answer may be that it is an RDF graph in the abstract sense with a new set of bnode labels, unless the correspondence between the old and the new bnode labels should be maintained (the answer may also be that the output of the algorithm is a sorted set of triples in n-quads, but that does not change the arguments). What you describe above does not contradict, as far as I can see, that possible answer. The extra algorithm is a post-processing step on the output of c14n which, temporarily, has to keep some correspondence between bnode labels, but not to the original bnode labels, and the final output is "merely" a modified version of the c14n abstract graph (or a serialization thereof). Do I misunderstand something?
Yes, thank you for the correction! I edited the original post. |
No, nothing was misunderstood there. After writing out the use case above (instead of me just muddling through it on the call), I said that the labels for the blank nodes passed in as original input would likely not be needed for it. I mentioned that here, but given that I wrote a wall of text and it was near the end, it was probably easily missed -- and I wasn't as clear as I could be:
Having further written out some example steps above for the use case, what would be preferred as output would be access to the statements (and their components) in the canonized output, in the canonized statement order, ideally without having to reparse anything. |
Yes, this expression of the steps makes sense. However, I think it adds more complexity for the RDF-star case, where it's more difficult to enumerate the order of blank nodes within a statement. Before this, I think the current algorithm's "Hash First Degree Quads" and "Hash N-Degree Quads" algorithms are fairly easily adapted to the RDF-star case. I'm sure there is a way in which this step of referring from statements in Y to CX, keeping track of blank node assignments, can be adapted to the potentially greater number of blank nodes contained within any given statement, as a traversal ordering can be defined.
@iherman said:
It would be worth considering the description of the normalized dataset in the Draft CG report, which attempts to describe such an abstract dataset where blank node identifiers are stable. Being able to consider this as the output of the C14N algorithm as input to the Hashing algorithm(s), aside from how it might be serialized, would be useful, if it doesn't raise too many semantic red-flags.
There is an order to RDF terms (incl. blank nodes) in an RDF-star quoted triple term, because the terms are trees with each branch node having an order S-P-O (1-2-3). There is a path given by the positions used to get to any point. There is only one path because it's a tree. There is an enumeration by depth-first traversal. There are other ways to get a deterministic enumeration of just the blank nodes.
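A small sketch of that depth-first, S-P-O enumeration (my illustration; quoted triples are modeled here as plain 3-tuples, which is an assumption rather than any RDF-star API):

```python
def blank_nodes_in_order(term):
    """Enumerate blank nodes depth-first, visiting subject, predicate, object (1-2-3)."""
    if isinstance(term, tuple):          # a quoted triple: recurse over S, P, O in order
        for component in term:
            yield from blank_nodes_in_order(component)
    elif isinstance(term, str) and term.startswith("_:"):
        yield term

# << _:a <http://example.org/p> << _:b <http://example.org/q> "1" >> >>
quoted = ("_:a", "<http://example.org/p>", ("_:b", "<http://example.org/q>", '"1"'))
assert list(blank_nodes_in_order(quoted)) == ["_:a", "_:b"]
```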
Intuitively and based on what I have read above (without fully grasping the details of the herd privacy part), returning an abstract graph with mapped labels seems the proper solution to me here. But in case we decide for the first solution ("return one particular serialization") for practical reasons, is the understanding here that this serialization would also be the direct input of the hashing step? Or can the hashing step choose to preprocess that serialization into a different form (possibly another of the existing RDF serializations) before applying the hash function?
This is still an open issue. Presently, the test suite tests that the canonical N-Quads serialization of the normalized dataset is used as the expected result. But part of the motivation of having an intermediate normalized dataset is to allow for other uses, for instance a hash of each individual quad in the normalized dataset. Thus far, canonical N-Quads seems like the only real alternative for serializing these quads to make use of them, but in theory they could be serialized to another form (e.g., the reduced Turtle/TriG syntax used within the examples in the spec). For testing purposes, at least, some serialization of the normalized dataset is required, but the abstract normalized dataset is certainly useful for other algorithms.
It seems that there is an agreement for returning an ordered list of statements, using the canonical bnode labels -- against my proposal of returning a hard-to-define-and-to-serialize mapping ;-) So if that ship has sailed, canonical N-Quads is probably a good way of conveying that thing. Retrieving each individual triple is trivially done by splitting on "\n".
Yeah well... for an application that does not want to do signing but, for example, compare graphs, returning canonical n-quads is not the best idea imho. I would opt to return a dataset (in any format that the given environment holds datasets in) with all bnodes labeled canonically. Converting this into nquads should be considered as a separate step, conceptually. |
N-Triples section 2.4 and Turtle 2.6 RDF Blank Nodes use "blank node label" in normative sections. It is the string after the "_:".
which is great. I think using the term "label" is then consistent for us, too! Thx |
As a name for the part after the "_:", fine. As output of c14n, no, not on its own. You don't want the baggage that comes with the term, such as scope, renaming possibilities, and parsing. For (1), datasets do not have "blank node labels" or "blank node identifiers". Datasets are abstract, so the output is too: it is a mapping to unique values, a 1-1 function from blank nodes to strings. These strings may be used as labels in syntax, but they aren't guaranteed to be preserved if parsed or combined with other data.
Let the WG vote. I won't lie down in the road over any decision on this; I guess we'll have to agree to disagree on this.
I agree that it is worth taking the abstract dataset with labeling as an output of rdf-canon, but we seem to need to patch step 5.3 in the main algorithm to handle the "automorphism" issue to do so. More specifically, if the input dataset has a nontrivial automorphism, e.g., in the duplicated paths and double circle examples, multiple blank node labelings can yield the same canonical n-quads; the duplicated paths example is one case where this happens. I have no theoretical analysis yet, but in such a case there appear to be multiple results with the same hash values in the hash path list in step 5.3 of the main algorithm (see the debug log from my implementation). In those cases, we should output an unordered set of all the legitimate labelings, since it seems impossible to deterministically select only one of them.
@yamdan, would it be better to raise that as a separate issue? I am not sure whether it influences the current issue's outcome...
That's a very good point, and that's one additional practical issue raised by the "returning a mapping" option! Although I was initially leaning towards this option, I am now convinced by the "returning a serialization" option, as the only practically viable thing to standardize. We could add a note explaining that implementations may return some "intermediate" results, such as
but that this is out of scope of the specification.
Ah. Good point indeed. |
In the "abstract dataset option w/canonical blank node labels", the aim wasn't to return "a mapping" that specified, for example, input blank node 1 => output blank node 2. I agree that multiple options could be generated that way. So, since such a "mapping" wasn't part of the output there, I don't think it matters that there are multiple different ways to map input blank nodes to output labels, as the end result (abstract dataset with stable blank node labels) will look the same, as will a concrete serialization.
This was discussed during today's call: |
Thank you @dlongley for pointing out my misunderstanding. I now understand that we do not have to consider the additional sub-steps to get multiple mappings described in (#4 (comment)); instead, we can just focus on the formal definition of the normalized dataset, based on the RDF/JS data model for example.
Personally, I'd like to avoid using WebIDL definitions, and stick with Infra, which is already used to some degree in the spec. JSON-LD uses Infra to define the Internal Representation. |
One more argument for sticking with Infra: the usage of IDL is often (mis?)represented as an "obligation" to implement those data types in a browser, too. (I ran into this type of objection in another spec when using IDL to define some datatypes.) Using Infra avoids this problem.
This issue was discussed (again) on 2023-03-01, see https://www.w3.org/2023/03/01-rch-minutes.html#t02. My understanding of that discussion is that the output can be defined in several steps. There is the abstract dataset with labelled bnodes. For some people that's enough (they may want to change those labels, BBS for example). Then we can say that it can be serialized as a set of n-quads, perhaps as an array. Those quads can be hashed individually or the array can be joined and the whole set hashed as one. IMO that's clear and sounds like the end section(s) of the c14n spec, not a separate document. We also talked about the canonicalization of n-quads. An open PR in the RDF-N-Quads spec is very relevant here. See @gkellogg's email to the RCH WG. |
+1. I think there should be a new section on representation. One representation would be the dataset represented in N-Quads canonical form, with lines in code point order. A second would use this to create a hash, using the description for hashing values already in the spec. Another representation would be as an ordered Infra array, where each entry is a tuple composed of the serialized N-Quad and the hash of that N-Quad, and where the array is ordered by the code points of the N-Quad. Alternatively, this could be a map representation. Is there a need to represent more than just the quad and its hash? We might consider an IANA section to define the textual representations of these, one extending an existing media type.
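A sketch of that ordered array-of-tuples representation (only an illustration: SHA-256, UTF-8 encoding, and hashing each quad with its trailing newline are my assumptions, not decisions of the group):

```python
import hashlib

def quads_with_hashes(canonical_nquads: str) -> list[tuple[str, str]]:
    """Return (canonical N-Quad, hash) entries, ordered by the code points of the quad."""
    quads = sorted(line + "\n" for line in canonical_nquads.split("\n") if line)
    return [(quad, hashlib.sha256(quad.encode("utf-8")).hexdigest()) for quad in quads]
```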
The group discussed this today and came to the following conclusions:
A PR should be raised that accomplishes all of the above (where some of the above might already be defined in the specification). |
I think the spec already covers all of the four points except maybe point 2. See this PR that was merged a few weeks ago: #90 |
The Serialization section added in #90 says the following:
So the N-Quads are ordered. |
* Update definition of "normalized dataset" to be an RDF dataset and an unordered map relating blank nodes to their canonical identifiers.
* Update the result of the canonicalization algorithm to be the serialized canonical form of that normalized dataset.

Fixes #4. Fixes #92.

Co-authored-by: Ted Thibodeau Jr <tthibodeau@openlinksw.com>
Following the discussion that just happened at the TPAC joint meeting with the VC.
One option is to return one particular serialization.
Another option is to return a mapping from bnodes to labels.
I prefer the second solution as it is more generic. It does not preclude which serialization to use downstream. It actually does not impose that you serialize the dataset (imagine storing a dataset in a triple store with canonical blank node labels).