
What is the output of the c14n algorithm? #4

Closed · pchampin opened this issue Sep 16, 2022 · 65 comments · Fixed by #97

Labels: pending-close, ready for pr (This issue is ready for a PR to be created.)

Comments

@pchampin (Contributor)

Following the discussion that just happened at the TPAC joint meeting with the VC WG.

One option is to return one particular serialization.
Another option is to return a mapping from bnodes to labels.

I prefer the second solution as it is more generic. It does not preclude which serialization to use downstream. It actually does not require that you serialize the dataset at all (imagine storing a dataset in a triple store with canonical blank node labels).

@peacekeeper (Contributor)

Looking at this document: https://w3c-ccg.github.io/rdf-dataset-canonicalization/spec/#canonicalization-algorithm-terms

normalized dataset = The immutable, abstract RDF dataset and set of normalized blank node identifiers that are produced as output by the algorithm.

If the c14n algorithm doesn't itself produce the serialization, then that is a separate step that needs to be specified elsewhere.

@pchampin (Contributor, Author)

If the c14n algorithm doesn't itself produce the serialization, then that is a separate step that needs to be specified elsewhere.

Absolutely. For me this would be part of the 'hashing' specification.

@pchampin (Contributor, Author) commented Sep 27, 2022

To clarify my initial point:
blank node labels are a convenience introduced by concrete syntaxes (i.e. serialization formats) such as Turtle or JSON-LD, but they have no counterpart in the abstract syntax (i.e. the underlying data model). In the abstract syntax, blank nodes have only an abstract, unspeakable identity, unlike IRIs, which we can spell out explicitly.

Consider the following Turtle:

@prefix s: <http://schema.org/>.
_:b a s:Person.

It does not represent a specific graph. It represents any graph containing exactly one triple, with predicate http://www.w3.org/1999/02/22-rdf-syntax-ns#type, with object http://schema.org/Person, and with some blank node as its subject. All these graphs are isomorphic to each other, so it does not really matter which one we "get" when we parse that Turtle.

Changing _:b to _:c14n0 in the Turtle snippet above does not make it any different. We still have a serialization that represents any of those isomorphic graphs. If we were to designate, among all these abstract graphs, one as canonical, there would be no way to specifically describe that one rather than another one.

So strictly speaking, the c14n algorithm is not producing "the" canonical abstract dataset, but a deterministic labeling of the blank nodes that is the same for all isomorphic graphs. This labeling can in turn be used to produce a canonical serialization of the dataset. But I consider that the labeling is more primitive than the serialization (one could devise several canonical serializations with the same labeling, e.g. using the expanded form of JSON-LD rather than N-Quads). That's why I advocate for the labeling being the output of the c14n algorithm specified in the RDC deliverable.

Writing this, I even realize that the labeling is based on an ordering of the blank nodes in the dataset, and that ordering may also be considered the most primitive output. (The _:c14n prefix in the canonical labels is arbitrary, as is the use of base 10 numbering...)
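As a rough illustration (a Python sketch with hypothetical internal identifiers, not spec text): once the canonical ordering exists, deriving the labels is a one-liner, and everything about the labels themselves is convention.

# The ordering is the primitive output; the "_:c14n" prefix and the
# base-10 numbering are arbitrary conventions layered on top of it.
canonical_order = ["bnodeA", "bnodeB", "bnodeC"]  # hypothetical internal ids
labels = {b: "c14n%d" % i for i, b in enumerate(canonical_order)}
print(labels)  # {'bnodeA': 'c14n0', 'bnodeB': 'c14n1', 'bnodeC': 'c14n2'}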

@gkellogg (Member)

One option is to return one particular serialization. Another option is to return a mapping from bnodes to labels.

I prefer the second solution as it is more generic. It does not preclude which serialization to use downstream. It actually does not require that you serialize the dataset at all (imagine storing a dataset in a triple store with canonical blank node labels).

As I discussed on the call today, I think returning a mapping is problematic, as that mapping only makes sense in a context where the original blank nodes can be referenced, which would be in some internal representation. Once taken away from the context where a parsed RDF serialization has specific blank nodes allocated (i.e., when re-serialized), the mapping is not useful.

As an alternative to serializing the result as sorted N-Quads, using the c14n labels for blank nodes, we could serialize an ordered array of such statements.

For example, in the CG test suite, https://w3c-ccg.github.io/rdf-dataset-canonicalization/tests/test020-in.nq has the following form:

<http://example.org/vocab#test> <http://example.org/vocab#A> _:b0 .
<http://example.org/vocab#test> <http://example.org/vocab#B> _:b1 .
_:b0 <http://example.org/vocab#next> _:b2 .
_:b1 <http://example.org/vocab#next> _:b2 .

After canonicalization, the expected N-Quads might be:

<http://example.org/vocab#test> <http://example.org/vocab#A> _:c14n2 .
<http://example.org/vocab#test> <http://example.org/vocab#B> _:c14n0 .
_:c14n0 <http://example.org/vocab#next> _:c14n1 .
_:c14n2 <http://example.org/vocab#next> _:c14n1 .

A hypothetical JSON (not JSON-LD) serialization could be the following:

[
  "<http://example.org/vocab#test> <http://example.org/vocab#A> _:c14n2 .",
  "<http://example.org/vocab#test> <http://example.org/vocab#B> _:c14n0 .",
  "_:c14n0 <http://example.org/vocab#next> _:c14n1 .",
  "_:c14n2 <http://example.org/vocab#next> _:c14n1 ."
]

This can be easily turned back into the canonical N-Quads using an operator such as join("\n"), but allows each statement to be considered on its own without splitting the N-Quads into separate statements.
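For instance, rebuilding the canonical N-Quads document from that array (a Python sketch; treating a trailing newline as part of the document is an assumption, not something specified here):

statements = [
    "<http://example.org/vocab#test> <http://example.org/vocab#A> _:c14n2 .",
    "<http://example.org/vocab#test> <http://example.org/vocab#B> _:c14n0 .",
    "_:c14n0 <http://example.org/vocab#next> _:c14n1 .",
    "_:c14n2 <http://example.org/vocab#next> _:c14n1 .",
]
# join("\n") as described above; the final newline is an assumption
nquads = "\n".join(statements) + "\n"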

@afs commented Sep 28, 2022

Agree that a mapping is problematic - the input may not have labels! - and the mapping only works for one input form, e.g. N-Triples vs RDF/XML.

Outputting N-Quads (with a clear syntactic canonical form - single spaces, etc.) is sensible.
N-Quads can simply be fed into existing RDF systems.

Would splitting canonical N-Quads on "\n" to produce the array for JSON be problematic?

The only issue I can think of is integrity: a slice of missing lines is not detectable in either version.
But that's what hashing is for :-)

@pchampin (Contributor, Author)

that mapping only makes sense in a context where the original blank nodes can be referenced, which would be in some internal representation

Agreed, and that was my implicit assumption. We could define the C14N algorithm as working purely on the abstract syntax -- meaning that each implementation would be working on its internal representation of the abstract syntax.

Granted, writing a test suite for such an abstract algorithm would be challenging, compared to having serialized inputs and outputs. But if some use cases require the ability to link back to the original internal blank-node label (as suggested by @dlongley during the call), then it might be worth the trouble.

Would splitting canonical N-quads on "\n" to produce the array for JSON be problematic?

I think it would work ok. But isn't it cleaner to produce a list of strings, and leave it to the consumer to concatenate them if they need to?

@afs commented Sep 29, 2022

But isn't it cleaner to produce a list of strings,

It is a new format in JSON syntax with JSON escaping and quoting rules. It's really a new RDF syntax (cf. N-Triples originating in the RDF Core WG). N-Quads (with its quoting and escape rules) can be fed into existing RDF systems and has a MIME type.

Many JSON parsers prefer not to stream because they guarantee valid JSON, i.e. they check for a valid end of the list before returning results.

Issue #92 ("What is being hashed?") talks about hashing the canonicalized RDF document (the top hashing path).

Even if hashing is done on the abstract data model, the input still has to be parsed, and RDF is the common format for that.

@pchampin (Contributor, Author)

In the simple case (blue path in the figure at #92), hashing is definitely on some normalized N-Quads serialization (IIUC). But other paths of the figure (and possibly other use cases) need more granular data.
That's why I would rather have the C14N algorithm produce something granular from which all required structures can be generated (from the monolithic canonical N-Quads serialization to the array of arrays of terms).

Also to be clear: I don't suggest that the C14N produces a serialization of that granular output, but something more at the data level (potentially based on [INFRA]).

@afs commented Sep 29, 2022

The WG test suite will need a document format.

If you want structure - use the SPARQL JSON results format :-)

@dlongley (Contributor)

So the use case I mentioned on the call is for selective disclosure of statements (i.e., quads) in a dataset. I'll talk about that a bit here in an effort to help tease out the requirements for the inputs / outputs of the canonicalization algorithm.

Selective disclosure is the ability for someone to share only some of the statements from a signed dataset, without harming the ability of the recipient to verify the authenticity of those selected statements.

Most cryptographic schemes that enable selective disclosure will need to apply a hash and signature to each individual component that can be selectively disclosed. This means, at least, signing each statement individually (note: I've seen at least one paper that seems to go further in signing individual elements of the statement as well, but I'm not familiar enough with it to say more at this time).

Selective disclosure, ideally, only reveals the information disclosed (no other information leaks). This is not always possible, but should be minimized to the greatest extent it can be. For example, if you are revealing information from a credential about one of your children, ideally the recipient does not also necessarily learn how many children you have.

Selective disclosure also ideally does not introduce any other units of correlation beyond what information is revealed. Again, this isn't always possible, but should be minimized. For example, if you reveal that you're a citizen of country X in one interaction, revealing this information again should not uniquely identify you, but rather, could indicate that you are any person in a group of 1000+ individuals. This property is called "unlinkability", because one cannot easily link the same individual across multiple interactions.

Now, a canonicalization algorithm produces canonical blank node labels using all data in a dataset -- to ensure that any differences in the data are accounted for. This means that for N blank nodes, there are up to N! possible labelings, which depend on the rest of the data. The canonicalization algorithm doesn't care which labeling is chosen, only that the same one is always chosen for the same data. In other words, as the number of blank nodes in a dataset increases, a particular labeling is more likely to uniquely identify an individual. This needn't necessarily be the case, however, as in many cases, some other labeling (or at least a partial labeling) that is common to more individuals could have been chosen instead.

In light of this, we should endeavor to make it easy for people to make modifications to the blank node labels output by the canonicalization algorithm.

The goal for selective disclosure here is to minimize the possible number of blank node labelings (whilst also not introducing some other significant burdens). By reducing the number of labelings, the number of datasets (i.e., the number of individuals here) that have the same labeling increases, improving herd privacy and reducing information leakage.

Note that introducing a secret randomizing factor (e.g., a salt or HMAC) that is applied to the canonized blank node labels is insufficient for solving this problem. It will prevent recipients from being able to "guess" what some non-disclosed statements contain because it decouples the labels chosen from the data, however, it will not help increase unlinkability. In fact, it may make it even easier to uniquely identify an individual, depending on the randomized output size vs. the number of possible blank node labelings.

So far, at least two mechanisms have been considered to minimize possible blank node labelings.

The first imposes a "template" that must be shared between the author of the dataset, the presenter of selectively disclosed statements, and the recipient. This template, with dummy values, is what is fed into the canonicalization algorithm so that every individual will have the same blank node labels. There are a number of issues with this approach, including the overhead, storage, and retrieval of such a template as well as the leakage of information (e.g., selectively disclosing information about your 3rd child using such a template will always reveal you have at least 3 children).

The second is the one I mentioned on the call. It seeks to eliminate the overhead of a template and reduce (at least in some cases) the information leakage problem. It looks something like this:

  1. Run the canonize algorithm on dataset X to obtain canonized output, CX (a set of ordered statements with canonical blank node labels).
  2. Create a new "herd-privacy" set of statements Y, where, for each statement in CX, we replace every blank node with the same label (e.g., _:a) and every value with the same "dummy" value, producing an "on-the-fly" template (note: This process may also inject additional dummy statements to stabilize the number of blank nodes across the herd). We keep a mapping of each statement in Y to the statement in CX.
  3. Sort Y, breaking ties using the CX statements they map to.
  4. Assign new blank node labels in the order that blank nodes appear in Y, producing CY.
  5. Hash and sign the individual statements in CY.

Given the above, I don't know that we're looking for a direct mapping from the original labels to the canonized labels, but we are looking to preserve some abstract structure and references / mappings from canonized labels to blank nodes (and / or vice versa) to avoid having to perform extra parsing steps.

Of course, any implementation could surface the features needed to do the things above, but what is needed is the ability to reference and reuse the canonicalization algorithm in spec text for the selective disclosure schemes. Those specifications need to be able to say something along the lines of: "Use canonicalization algorithm X [spec Y] to do step 2 and pass the output to step 3".

@gkellogg (Member)

The timing won't be right, but the JSON-LD CG is starting work on NDJSON-LD, or Newline Delimited JSON-LD, mostly to have a stream format for YAML-LD, where YAML supports streams including zero or more YAML documents.

This may end up being based on JSON Text Sequences – RFC7464, or NDJSON where the difference largely comes down to what the record separator between documents is. Conceptually, this is pretty simple, and having a format that allows easy extraction/introspection of elements of each quad could be useful.

That said, I think that the Hash itself, either for the complete document, or selective disclosure, is based on the sorted canonical N-Quads representation, but as an intermediate form, something like NDJSON-LD to represent each quad with canonicalized blank node identifiers, would be useful.

SPARQL JSON Results format could certainly be used, and could use something like the SPARQL-star results representation to represent triples, but would need a way to fit graph names in, which isn't otherwise specified as a Quads format, IIRC.

@gkellogg (Member) commented Oct 10, 2022

Just want to be sure I understand this, working through an example using test002 from the test suite (dataset X):

_:b0 <http://example.org/vocab#p1> _:b1 .
_:b1 <http://example.org/vocab#p2> "Foo" .
_:b2 <http://example.org/vocab#p1> _:b3 .
_:b3 <http://example.org/vocab#p2> "Foo" .
  • Run the canonize algorithm on dataset X to obtain canonized output, CX (a set of ordered statements with canonical blank node labels).

Using the URDNA2015 algorithm to canonicalize the input generates the following (CX):

_:c14n0 <http://example.org/vocab#p1> _:c14n1 .
_:c14n1 <http://example.org/vocab#p2> "Foo" .
_:c14n2 <http://example.org/vocab#p1> _:c14n3 .
_:c14n3 <http://example.org/vocab#p2> "Foo" .
  • Create a new "herd-privacy" set of statements Y, where, for each statement in CX, we replace every blank node with the same label (e.g., _:a) and every value with the same "dummy" value, producing an "on-the-fly" template (note: This process may also inject additional dummy statements to stabilize the number of blank nodes across the herd). We keep a mapping of each statement in Y to the statement in CX.

(By "value", I presume you mean Literal, including potential language and datatype information).

(I'm unclear on when/how additional dummy statements might be injected).

Using the same statement ordering, the "herd-privacy" dataset Y would be:

_:a <http://example.org/vocab#p1> _:a .
_:a <http://example.org/vocab#p2> "dummy" .
_:a <http://example.org/vocab#p1> _:a .
_:a <http://example.org/vocab#p2> "dummy" .
  • Sort Y, breaking ties using the CX statements they map to.
  • Assign new blank node labels in the order that blank nodes appear in Y, producing CY.

IIRC, this would result in CY:

_:c14n0 <http://example.org/vocab#p1> _:c14n1 .
_:c14n2 <http://example.org/vocab#p2> "dummy" .
_:c14n3 <http://example.org/vocab#p1> _:c14n4 .
_:c14n5 <http://example.org/vocab#p2> "dummy" .
  • Hash and sign the individual statements in CY.

(edited, the hash of the resulting N-Quads would be "e25fe023701228da9afd37deb415b6d55c07a048f05aef81a6254c98b9cb043b").

This generates the following hashes for each statement, in order:

  • "af38a5a242a3fda01e601a3c06bfc076acf776c0ccc442b3ae36fc960d6676bd"
  • "cdd3336a47c2c61795ad7222bf495be23b6fbb09b104fc8b386e758642ae5b5a"
  • "c0a64c2ff2224bab50bdf88ac34bb84407c632283d0e6855b6e7361e5360482a"
  • "20b3eb6a0a503ccde8e11fec4a8970c82762f3eac3c6b86b473d7fc08c0b90bb"

@dlongley (Contributor) commented Oct 10, 2022

@gkellogg,

The only thing that should be different in what you said above would be that CY uses the original values (not the dummy ones):

_:c14n0 <http://example.org/vocab#p1> _:c14n1 .
_:c14n2 <http://example.org/vocab#p2> "Foo" .
_:c14n3 <http://example.org/vocab#p1> _:c14n4 .
_:c14n5 <http://example.org/vocab#p2> "Foo" .

You can see how it changed from CX:

_:c14n0 <http://example.org/vocab#p1> _:c14n1 .
_:c14n1 <http://example.org/vocab#p2> "Foo" .
_:c14n2 <http://example.org/vocab#p1> _:c14n3 .
_:c14n3 <http://example.org/vocab#p2> "Foo" .

And, presumably, if you were to change "Foo" to a variety of other values, you'd also get a variety of different blank node labelings in CX -- that would each be reduced in CY, thereby increasing herd privacy.

@gkellogg (Member)

So, what does the phrase 'and every value with the same "dummy" value' mean, then? Is there some case where values are replaced with dummy values?

@gkellogg (Member)

Also, given that the intermediate step replaces all blank node labels with _:a, it would seem that step 4 would relabel all blank nodes in Y with unique labels, so there would be no shared labels in CY.

@dlongley (Contributor) commented Oct 10, 2022

@gkellogg,

Ok, in order to (hopefully) answer your questions I thought it would be easiest if I also wrote down all of the steps below as I intended to communicate earlier (but my description wasn't clear enough). Note that I didn't actually run any canonize algorithm here, so the canonize outputs are just hypothetical. The tricky part to all this is ensuring that every implementation will swap in and out the dummy values in the same order. This can be guaranteed by canonizing the data (like usual, without the dummy values) to produce the same order for every implementation -- and then using that order to map everything later.

Let's say we start with this input:

_:b0 <http://example.org/vocab#p1> _:b1 .
_:b1 <http://example.org/vocab#p2> "Foo" .
_:b2 <http://example.org/vocab#p1> _:b3 .
_:b3 <http://example.org/vocab#p2> "Bar" .

Which hypothetically canonizes to this:

_:c14n0 <http://example.org/vocab#p2> "Foo" .
_:c14n1 <http://example.org/vocab#p1> _:c14n0 .
_:c14n2 <http://example.org/vocab#p2> "Bar" .
_:c14n3 <http://example.org/vocab#p1> _:c14n2 .

We then create a new set of statements with every blank node replaced with _:a and every value (literal) replaced with "dummy":

_:a <http://example.org/vocab#p2> "dummy" .
_:a <http://example.org/vocab#p1> _:a .
_:a <http://example.org/vocab#p2> "dummy" .
_:a <http://example.org/vocab#p1> _:a .

We remember the order of these statements (so that we can break ties later) and we remember the actual uniqueness of each blank node (even though their labels are the same here for sorting purposes).

In other words:

Statement 1, 1st _:a => _:c14n0
Statement 2, 1st _:a => _:c14n1
Statement 2, 2nd _:a => _:c14n0
Statement 3, 1st _:a => _:c14n2
Statement 4, 1st _:a => _:c14n3
Statement 4, 2nd _:a => _:c14n2

We then sort the statements, breaking ties using the original sort order:

_:a <http://example.org/vocab#p1> _:a . # Statement 2
_:a <http://example.org/vocab#p1> _:a . # Statement 4
_:a <http://example.org/vocab#p2> "dummy" . # Statement 1
_:a <http://example.org/vocab#p2> "dummy" . # Statement 3

We then assign blank nodes in the order that the blank nodes appear, using their original uniqueness:

Statement 2, 1st _:a => _:c14n1 => _:c14n0 (new)
Statement 2, 2nd _:a => _:c14n0 => _:c14n1 (new)
Statement 4, 1st _:a => _:c14n3 => _:c14n2 (new)
Statement 4, 2nd _:a => _:c14n2 => _:c14n3 (new)
Statement 1, 1st _:a => _:c14n0 => _:c14n1 (new)
Statement 3, 1st _:a => _:c14n2 => _:c14n3 (new)

Producing:

_:c14n0 <http://example.org/vocab#p1> _:c14n1 .
_:c14n2 <http://example.org/vocab#p1> _:c14n3 .
_:c14n1 <http://example.org/vocab#p2> "Foo" .
_:c14n3 <http://example.org/vocab#p2> "Bar" .

Finally this is sorted to produce the final output:

_:c14n0 <http://example.org/vocab#p1> _:c14n1 .
_:c14n1 <http://example.org/vocab#p2> "Foo" .
_:c14n2 <http://example.org/vocab#p1> _:c14n3 .
_:c14n3 <http://example.org/vocab#p2> "Bar" .

Now, the more ties that have to be broken, the less herd privacy there will be. But consider other data sets such as this optimal (for the purposes here) one:

_:b0 <http://example.org/vocab#p1> "A" .
_:b1 <http://example.org/vocab#p2> "B" .
_:b2 <http://example.org/vocab#p3> "C" .
_:b3 <http://example.org/vocab#p4> "D" .

And suppose it hypothetically canonizes to:

_:c14n0 <http://example.org/vocab#p4> "D" .
_:c14n1 <http://example.org/vocab#p1> "A" .
_:c14n2 <http://example.org/vocab#p2> "B" .
_:c14n3 <http://example.org/vocab#p3> "C" .

When the above algorithm is applied, we'd be sorting a data set that looks like this:

_:a <http://example.org/vocab#p1> "dummy" .
_:a <http://example.org/vocab#p2> "dummy" .
_:a <http://example.org/vocab#p3> "dummy" .
_:a <http://example.org/vocab#p4> "dummy" .

This sort order is always the same here (no ties) and would always result in the same blank node labels being used in the final output, regardless of the literal values in the input. So the purpose of this algorithm is to let literal values influence the final labels only when a tie needs to be broken.
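To make the procedure concrete, here is a minimal Python sketch of the relabeling described above (the term-tuple representation and helper names are my assumptions; graph names and the dummy-statement injection mentioned earlier are omitted for brevity). Applied to the first example, it reproduces the final output shown:

def is_bnode(term):
    return term.startswith("_:")

def is_literal(term):
    return term.startswith('"')

def herd_relabel(cx):
    # cx: list of (s, p, o) tuples with canonical blank node labels.
    # Step 2: build the template Y, remembering which CX statement
    # each entry came from (its index i).
    template = []
    for i, (s, p, o) in enumerate(cx):
        ts = "_:a" if is_bnode(s) else s
        to = "_:a" if is_bnode(o) else ('"dummy"' if is_literal(o) else o)
        template.append(((ts, p, to), i))
    # Step 3: sort Y, breaking ties with the original statement order.
    template.sort()
    # Step 4: assign new labels in the order blank nodes appear in the
    # sorted Y, tracking uniqueness via the original canonical labels.
    new_label = {}
    for _, i in template:
        s, _p, o = cx[i]
        for term in (s, o):
            if is_bnode(term) and term not in new_label:
                new_label[term] = "_:c14n%d" % len(new_label)
    # Relabel CX and sort to produce the final output CY.
    return sorted((new_label.get(s, s), p, new_label.get(o, o))
                  for (s, p, o) in cx)

cx = [
    ("_:c14n0", "<http://example.org/vocab#p2>", '"Foo"'),
    ("_:c14n1", "<http://example.org/vocab#p1>", "_:c14n0"),
    ("_:c14n2", "<http://example.org/vocab#p2>", '"Bar"'),
    ("_:c14n3", "<http://example.org/vocab#p1>", "_:c14n2"),
]
for s, p, o in herd_relabel(cx):
    print(s, p, o, ".")
# _:c14n0 <http://example.org/vocab#p1> _:c14n1 .
# _:c14n1 <http://example.org/vocab#p2> "Foo" .
# _:c14n2 <http://example.org/vocab#p1> _:c14n3 .
# _:c14n3 <http://example.org/vocab#p2> "Bar" .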

@yamdan (Contributor) commented Oct 11, 2022

@dlongley, Thank you for your example that helps me understand your privacy-enhancing blank node labeling. If I am not mistaken, the example includes two occurrences of the statement

_:c14n2 <http://example.org/vocab#p1> _:c14n2 .

that should have been

_:c14n2 <http://example.org/vocab#p1> _:c14n3 . 

@iherman (Member) commented Oct 11, 2022

@dlongley, regardless of the details of your extra step described above, let us come back to the original question also discussed in the WG call. The question at the call and in this issue is "what is the output of the c14n algorithm?". The answer may be that it is an RDF graph in the abstract sense with a new set of bnode labels, unless the correspondence between the bnode labels should be maintained (the answer may also be that the output of the algorithm is a sorted set of triples in N-Quads, but that does not change the arguments).

What you describe above does not contradict, as far as I can see, that possible answer. The extra algorithm is a post-processing step on the output of c14n which, temporarily, has to keep some correspondence between bnode labels, though not to the original bnode labels, and the final output is "merely" a modified version of the c14n abstract graph (or a serialization thereof). Do I misunderstand something?

@dlongley (Contributor)

@yamdan,

Yes, thank you for the correction! I edited the original post.

@dlongley (Contributor)

@iherman,

...regardless of the details of your extra step described above, let us come back to the original question also discussed in the WG call. The question at the call and in this issue is "what is the output of the c14n algorithm?". The answer may be that it is an RDF graph in the abstract sense with a new set of bnode labels, unless the correspondence between the bnode labels should be maintained (the answer may also be that the output of the algorithm is a sorted set of triples in N-Quads, but that does not change the arguments).

What you describe above does not contradict, as far as I can see, that possible answer. The extra algorithm is a post-processing step on the output of c14n which, temporarily, has to keep some correspondence between bnode labels, though not to the original bnode labels, and the final output is "merely" a modified version of the c14n abstract graph (or a serialization thereof). Do I misunderstand something?

No, nothing was misunderstood there. After writing out the use case above (instead of me just muddling through it on the call), I said that the labels for the blank nodes passed in as original input would likely not be needed for it. I mentioned that here, but given that I wrote a wall of text and it was near the end, it was probably easily missed -- and I wasn't as clear as I could be:

Given the above, I don't know that we're looking for a direct mapping from the original labels to the canonized labels, but we are looking to preserve some abstract structure and references / mappings from canonized labels to blank nodes (and / or vice versa) to avoid having to perform extra parsing steps.

Having further written out some example steps above for the use case, what would be preferred as output would be access to the statements (and their components) in the canonized output, in the canonized statement order, ideally without having to reparse anything.
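For instance, such an output might look like the following (a hypothetical Python shape using the test002 example from earlier, not anything the spec defines): statements in canonical order, already split into component terms, so nothing needs reparsing.

canonized = [
    ("_:c14n0", "<http://example.org/vocab#p1>", "_:c14n1"),
    ("_:c14n1", "<http://example.org/vocab#p2>", '"Foo"'),
    ("_:c14n2", "<http://example.org/vocab#p1>", "_:c14n3"),
    ("_:c14n3", "<http://example.org/vocab#p2>", '"Foo"'),
]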

@gkellogg (Member)

We remember the order of these statements (so that we can break ties later) and we remember the actual uniqueness of each blank node (even though their labels are the same here for sorting purposes).

In other words:

Statement 1, 1st _:a => _:c14n0
Statement 2, 1st _:a => _:c14n1
Statement 2, 2nd _:a => _:c14n0
Statement 3, 1st _:a => _:c14n2
Statement 4, 1st _:a => _:c14n3
Statement 4, 2nd _:a => _:c14n2

Yes, this expression of the steps makes sense.

However, I think it adds more complexity for the RDF-star case, where it's more difficult to enumerate the order of blank nodes within a statement. Before this, I think the current algorithm's "Hash First Degree Quads" and "Hash N-Degree Quads" algorithms are fairly easily adapted to the RDF-star case. I'm sure there is a way that this step of referring from statements in Y to CX while keeping track of blank node assignments can be adapted to the potentially greater number of blank nodes contained within any given statement, as a traversal ordering can be defined.

@gkellogg (Member)

@iherman said:

... The question at the call and of this issue is "what is the output of the c14n algorithm?". The answer may be that it is an RDF graph in the abstract sense with a new set of bnode labels unless that correspondence between the bnode labels should be maintained (the answer may also be that the output of the algorithm is a sorted set of triples in n-quads, but that does not change the arguments).

It would be worth considering the description of the normalized dataset in the Draft CG report, which attempts to describe such an abstract dataset where blank node identifiers are stable. Being able to consider this as the output of the C14N algorithm as input to the Hashing algorithm(s), aside from how it might be serialized, would be useful, if it doesn't raise too many semantic red-flags.

@afs commented Oct 12, 2022

However, I think it adds more complexity for the RDF-star case, where it's more difficult to enumerate the order of blank nodes within a statement.

There is an order to RDF terms (including blank nodes) in an RDF-star quoted triple term because the terms are trees, with each branch node having the order S-P-O (1-2-3). There is a path given by the positions used to get to any point, and there is only one path because it's a tree. This gives an enumeration by depth-first traversal.

There are other ways to get a deterministic enumeration of just the blank nodes.
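A sketch of that enumeration (Python; representing a quoted triple as a hypothetical (s, p, o) tuple and all other terms as strings):

# Depth-first S-P-O traversal of an RDF-star term tree; yields blank
# nodes in a deterministic order (deduplicate on first occurrence to
# get an enumeration of distinct blank nodes).
def bnodes_in_order(term):
    if isinstance(term, tuple):  # quoted triple: (s, p, o), i.e. 1-2-3
        for part in term:
            yield from bnodes_in_order(part)
    elif term.startswith("_:"):  # blank node
        yield term

quoted = ("_:b0", "<http://example.org/p>",
          ("_:b1", "<http://example.org/q>", "_:b0"))
print(list(bnodes_in_order(quoted)))  # ['_:b0', '_:b1', '_:b0']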

@tkuhn commented Dec 9, 2022

Intuitively and based on what I have read above (without fully grasping the details of the herd privacy part), returning an abstract graph with mapped labels seems the proper solution to me here.

But in case we decide for the first solution ("return one particular serialization") for practical reasons, is the understanding here that this serialization would also be the direct input of the hashing step? Or can the hashing step choose to preprocess that serialization into a different form (possibly another of the existing RDF serializations) before applying the hash function?

@gkellogg (Member) commented Dec 9, 2022

But in case we decide for the first solution ("return one particular serialization") for practical reasons, is the understanding here that this serialization would also be the direct input of the hashing step? Or can the hashing step choose to preprocess that serialization into a different form (possibly another of the existing RDF serializations) before applying the hash function?

This is still an open issue. Presently, the test suite tests that the canonical N-Quads serialization of the normalized dataset is used as the expected result. But, part of the motivation of having an intermediate normalized dataset is to allow for other uses, for instance a hash of each individual quad in the normalized dataset.

Thus far, canonical N-Quads seems like the only real alternative for serializing these quads to make use of them, but in theory, they could be serialized to another form (e.g., Turtle/TriG using a reduced syntax is used within the examples in the spec).

For testing purposes, at least, some serialization of the normalized dataset is required, but the abstract normalized dataset is certainly useful for other algorithms.

@pchampin (Contributor, Author)

It seems that there is an agreement for returning an ordered list of statements, using the canonical bnode labels -- against my proposal of returning a hard-to-define-and-to-serialize mapping ;-)

So if that ship has sailed, canonical N-Quads is probably a good way of conveying that thing. Retrieving each individual triple is trivially done by splitting on \n as suggested by @afs. Retrieving each individual term is a bit more challenging (literals can contain spaces), but not much, I guess.
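For illustration, a rough Python sketch of splitting one line into terms (an approximation of the N-Quads grammar, not a validator; it handles escaped quotes inside literals but not every corner case):

import re

# IRI | literal (with optional datatype or language tag) | blank node label
TERM = re.compile(
    r'<[^>]*>'
    r'|"(?:[^"\\]|\\.)*"(?:\^\^<[^>]*>|@[A-Za-z0-9-]+)?'
    r'|_:\S+'
)

line = '_:c14n1 <http://example.org/vocab#p2> "Foo Bar" .'
print(TERM.findall(line))
# ['_:c14n1', '<http://example.org/vocab#p2>', '"Foo Bar"']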

@iherman (Member) commented Dec 16, 2022

Yeah well... for an application that does not want to do signing but, for example, compare graphs, returning canonical N-Quads is not the best idea imho. I would opt to return a dataset (in any format that the given environment holds datasets in) with all bnodes labeled canonically. Converting this into N-Quads should be considered a separate step, conceptually.

@afs commented Jan 3, 2023

N-Triples section 2.4 and Turtle section 2.6 ("RDF Blank Nodes") use "blank node label" in normative sections. It is the string after the _:.

@iherman (Member) commented Jan 4, 2023

N-Triples section 2.4 and Turtle section 2.6 ("RDF Blank Nodes") use "blank node label" in normative sections. It is the string after the _:.

which is great. I think using the term "label" is then consistent for us, too!

Thx

@afs commented Jan 4, 2023

As a name for the part after _:, yes.

As output of c14n, no, not on its own. You don't want the baggage that comes with the term, such as scope, renaming possibilities, and parsing.

For (1), datasets do not have "blank node labels" or "blank node identifiers". Datasets are abstract, so the output is too - it is a mapping to unique values - a 1-1 function - from blank nodes to strings. These strings may be used as labels in syntax - but they aren't guaranteed to be preserved if parsed or combined with other data.

@iherman (Member) commented Jan 6, 2023

As output of c14n, no, not on it's own. You don't want the baggage that comes with the term such as scope, renaming possibilities and parsing.

Let the WG vote. I won't lie down in the road over any decision on this; I guess we'll have to agree to disagree on this.

@yamdan (Contributor) commented Jan 16, 2023

I agree that it is worth taking the abstract dataset with labeling as an output of rdf-canon, but we seem to need to patch step 5.3 in the main algorithm to handle the "automorphism" issue to do so.

More specifically, if the input dataset has nontrivial automorphism, e.g., in the duplicated paths and double circle examples, multiple blank node labelings can yield the same canonical n-quads.

For example, in the duplicated paths example, in addition to the {e0: c14n0, e1: c14n1, e2: c14n2, e3: c14n3} shown, another labeling {e0: c14n2, e1: c14n3, e2: c14n0, e3: c14n1} also results in the same canonical n-quads (after sorted).
Also, the double circle example gives the same sorted canonicalized n-quads for both {e0: c14n0, e1: c14n1} and {e0: c14n1, e1: c14n0}.

I have no theoretical analysis yet, but in such a case, there appear to be multiple results with the same hash values in the hash path list in step 5.3 of the main algorithm (see the debug log from my implementation).

In those cases, we should output an unordered set of all the legitimate labelings, since it seems impossible to deterministically select only one of them.
Concretely, step 5.3 should include additional sub-steps to respect all the temporary issuers with the same hash values; I still need more time to formalize this as a PR.
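To illustrate with the double circle example (a Python sketch; terms represented as hypothetical tuples): two distinct labelings produce identical sorted quads, so no single blank-node mapping can be singled out as canonical.

quads = [("_:e0", "<http://example.org/p>", "_:e1"),
         ("_:e1", "<http://example.org/p>", "_:e0")]

def relabel(quads, mapping):
    sub = lambda t: mapping.get(t, t)
    return sorted((sub(s), p, sub(o)) for s, p, o in quads)

a = relabel(quads, {"_:e0": "_:c14n0", "_:e1": "_:c14n1"})
b = relabel(quads, {"_:e0": "_:c14n1", "_:e1": "_:c14n0"})
assert a == b  # the same canonical quads from two different labelings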

@iherman (Member) commented Jan 17, 2023

@yamdan, would it be better to raise that as a separate issue? I am not sure whether it influences the current issue's outcome...

@pchampin (Contributor, Author)

More specifically, if the input dataset has nontrivial automorphism, e.g., in the duplicated paths and double circle examples, multiple blank node labelings can yield the same canonical n-quads.

That's a very good point, and that's one additional practical issue raised by the "returning a mapping" option!

Although I was initially leaning towards this option, I am now convinced by the "returning a serialization" option, as the only practically viable thing to standardize. We could add a note explaining that implementations may return some "intermediate" results, such as

  • a list of strings, each representing one quad in N-Quads
  • a list of lists of strings, each sublist representing a quad, and each string representing a term in N-Quads
    (could be useful, as splitting N-Quads lines into terms is not trivial, as literals can contain spaces)
  • an internal representation of the Dataset, where blank nodes have the canonical label
    (this is actually what my implementation produces)

but that this is out of scope of the specification.

@iherman (Member) commented Jan 18, 2023

More specifically, if the input dataset has nontrivial automorphism, e.g., in the duplicated paths and double circle examples, multiple blank node labelings can yield the same canonical n-quads.

That's a very good point, and that's one additional practical issue raised by the "returning a mapping" option!

Although I was initially leaning towards this option, I am now convinced by the "returning a serialization" option, as the only practically viable thing to standardize.

Ah. Good point indeed.

@dlongley (Contributor) commented Jan 18, 2023

@yamdan,

I agree that it is worth taking the abstract dataset with labeling as an output of rdf-canon, but we seem to need to patch step 5.3 in the main algorithm to handle the "automorphism" issue to do so.

More specifically, if the input dataset has nontrivial automorphism, e.g., in the duplicated paths and double circle examples, multiple blank node labelings can yield the same canonical n-quads.

In the "abstract dataset option w/canonical blank node labels", the aim wasn't to return "a mapping" that specified, for example, input blank node 1 => output blank node 2. I agree that multiple options could be generated that way.

So, since such a "mapping" wasn't part of the output there, I don't think it matters that there are multiple different ways to map input blank nodes to output labels, as the end result (abstract dataset with stable blank node labels) will look the same, as will a concrete serialization.

@pchampin (Contributor, Author)

This was discussed during today's call:
https://www.w3.org/2023/01/18-rch-minutes.html#t03

@yamdan (Contributor) commented Jan 18, 2023

Thank you @dlongley for pointing out my misunderstanding. I now understand that we do not have to consider the additional sub-steps to get multiple mappings described in (#4 (comment)); instead, we can just focus on the formal definition of the normalized dataset, based on the RDF/JS data model for example.

@gkellogg (Member)

Personally, I'd like to avoid using WebIDL definitions, and stick with Infra, which is already used to some degree in the spec. JSON-LD uses Infra to define the Internal Representation.

@iherman (Member) commented Jan 19, 2023

Personally, I'd like to avoid using WebIDL definitions, and stick with Infra, which is already used to some degree in the spec. JSON-LD uses Infra to define the Internal Representation.

One more argument for sticking with Infra: the usage of IDL is often (mis?)represented as an "obligation" to implement those data types in a browser, too. (I ran into this type of objection in another spec when using IDL to define some datatypes.) Using Infra avoids this problem.

@philarcher (Contributor)

This issue was discussed (again) on 2023-03-01, see https://www.w3.org/2023/03/01-rch-minutes.html#t02. My understanding of that discussion is that the output can be defined in several steps. There is the abstract dataset with labelled bnodes. For some people that's enough (they may want to change those labels, BBS for example). Then we can say that it can be serialized as a set of N-Quads, perhaps as an array. Those quads can be hashed individually, or the array can be joined and the whole set hashed as one.

IMO that's clear and sounds like the end section(s) of the c14n spec, not a separate document.

We also talked about the canonicalization of N-Quads. An open PR in the RDF-N-Quads spec is very relevant here. See @gkellogg's email to the RCH WG.

@gkellogg (Member) commented Mar 1, 2023

+1 I think there should be a new section on representation. One representation would be the dataset represented in N-Quads canonical form with lines in code point order. A second would use this to create a hash, using the description for hashing values already in the spec.

Another representation would be as an ordered Infra array, where each entry is a tuple composed of the serialized N-Quad and the hash of that N-Quad, with the array ordered by the code points of the N-Quad. Alternatively, this could be in a map representation. Is there a need to represent more than just the quad and its hash?

We might consider an IANA section to define the textual representations of these, one extending application/n-quads with a profile IRI, similar to how JSON-LD defined the profile parameter. The other would define an encoding for the array representation. But, these may be overkill. Perhaps we need a separate issue on media types.

@msporny (Member) commented Apr 26, 2023

The group discussed this today and came to the following conclusions:

  1. The output is a normalized dataset (the normalized dataset has stable blank node identifiers).
  2. Implementations MAY change stable blank node identifiers given a normalized dataset before serialization.
  3. The N-Quads serialization of the normalized dataset is sorted (using canonical N-Quads).
  4. Implementations MUST use the canonical N-Quads serialization (to address code point issues).

A PR should be raised that accomplishes all of the above (where some of the above might already be defined in the specification).

@msporny added the "ready for pr" label on Apr 26, 2023
@peacekeeper (Contributor)

I think the spec already covers all of the four points except maybe point 2.

See this PR that was merged a few weeks ago: #90

@gkellogg (Member)

The Serialization section added in #90 says the following:

The serialized canonical form of a normalized dataset is an N-Quads document [N-QUADS] created by representing each quad from the normalized dataset in canonical n-quads form, sorting them into code point order,

So the N-Quads are ordered.

gkellogg added a commit that referenced this issue May 2, 2023
… ordered map mapping canonical labels to blank nodes within that dataset.

Update the result of the canonicalization algorithm to be the serialized canonical form of that normalized dataset.

Fixes #4.
gkellogg added a commit that referenced this issue May 5, 2023
* Update definition of "normalized dataset" to be an RDF dataset and an unordered map that relates blank nodes to their canonical identifiers.
* Update the result of the canonicalization algorithm to be the serialized canonical form of that normalized dataset.

Fixes #4. Fixes #92.

---------

Co-authored-by: Ted Thibodeau Jr <tthibodeau@openlinksw.com>