Re-consider use of escapes in canonical N-Quads #16

gkellogg · 2023-02-14T22:48:08Z

Re-consider the use of UCHAR and ECHAR escapes in N-Triples/N-Quads canonicalization. The 1.1-based recommendation prohibits the use of UCHAR (\uXXXX) and allows ECHAR only for U+0022 (quote \"), U+005C (backslash \\), U+000A (LF \n), and U+000D (CR \r). However, leaving control characters unescaped can obfuscate text when it is presented, creating a potential security concern.

A future version may consider requiring all characters between U+0000 and U+001F (other than U+000A (LF) and U+000D (CR)) and U+007F (DEL) to be represented using UCHAR. Characters that can be represented using ECHAR MUST use that representation. All other code points MUST NOT be represented by UCHAR.
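A minimal sketch of the rule proposed above, for illustration only. The function name `escape_literal` is hypothetical, and two points are assumptions rather than anything settled in this issue: ECHAR takes precedence over UCHAR for tab, backspace, and form feed (which matches the test060 output confirmed later in this thread), and UCHAR hex digits are written in uppercase.

```python
# Characters that have an ECHAR form: \t \b \n \r \f \" \' \\
ECHAR = {
    "\t": "\\t", "\b": "\\b", "\n": "\\n", "\r": "\\r", "\f": "\\f",
    '"': '\\"', "'": "\\'", "\\": "\\\\",
}


def escape_literal(value: str) -> str:
    """Escape the lexical form of a literal per the proposal (sketch)."""
    out = []
    for ch in value:
        cp = ord(ch)
        if ch in ECHAR:
            # Assumption: ECHAR wins whenever one exists.
            out.append(ECHAR[ch])
        elif cp <= 0x1F or cp == 0x7F:
            # Remaining C0 controls and DEL use UCHAR (uppercase hex assumed).
            out.append(f"\\u{cp:04X}")
        else:
            # Every other code point is emitted as-is, never as UCHAR.
            out.append(ch)
    return "".join(out)
```

With these assumptions, a literal holding a raw NUL comes out as `\u0000`, while non-control characters such as `∞` pass through unchanged.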

gkellogg · 2023-02-14T22:48:36Z

See #2 (comment).

gkellogg · 2023-02-16T22:33:31Z

Probably also need U+007F (DEL) to be UCHAR escaped.

gkellogg · 2023-02-16T22:43:16Z

From @afs in https://github.com/w3c/rdf-n-quads/pull/17/files#r1108143303:

Unfortunately, the IRI rule allows more than the characters of a legal IRI so if the goal is a canonical form that corresponds to the grammar (I think it should), it has to cope with this situation.

Raw backspace, tabs, null and form-feed anywhere in the canonical form lead to obfuscation attacks. If this canonical form is used for RCH, these will need to be addressed or a positive justification made for them. Otherwise, the security considerations sections will be advising differently from the canonical representation.

It would be good if there were a simple check to see whether a claimed canonical form is actually hiding an obfuscation. One way is to check for the presence of any of the most dubious codepoints.
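One possible shape for such a check, as a rough sketch. The name `dubious_codepoints` is hypothetical, and treating every raw C0 control character and DEL inside a line as "dubious" is an assumption; the thread does not fix the exact set.

```python
def dubious_codepoints(line: str) -> list[tuple[int, str]]:
    """Return (offset, code point) pairs for raw control characters in a line.

    The single trailing LF that terminates an N-Quads line is ignored.
    """
    body = line[:-1] if line.endswith("\n") else line
    hits = []
    for offset, ch in enumerate(body):
        cp = ord(ch)
        if cp <= 0x1F or cp == 0x7F:
            hits.append((offset, f"U+{cp:04X}"))
    return hits


# An escaped tab is fine; a raw tab inside the literal is flagged.
clean = '<urn:ex:s> <urn:ex:p> "a\\tb" .\n'
suspect = '<urn:ex:s> <urn:ex:p> "a\tb" .\n'
print(dubious_codepoints(clean))
print(dubious_codepoints(suspect))
```

A line that passes this check can still be non-canonical for other reasons; it only rules out the raw control characters discussed here.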

gkellogg · 2023-02-16T22:54:58Z

Unfortunately, the IRI rule allows more than the characters of a legal IRI so if the goal is a canonical form that corresponds to the grammar (I think it should), it has to cope with this situation.

This treads on erratum 29 (which was actually more about terminology) where the grammar for IRI is from a spec that never made it to standard. We may need to define the parts of RFC3987 in RDF Concepts to be able to fall back on it.

Given that, I'd suggest saying that the value matched by the IRIREF terminal MUST be a valid IRI based on that ABNF grammar. (We also need to comb through the use of IRI terminology, as @pchampin actually suggested).

afs · 2023-02-17T10:10:24Z

re: RFC3987.

Yes, good point. It would be good to do something about that.
The simplest would be to adopt the text from the proposed standard (sections 2 & 5). We then don't have any conversion text nearby.

Issue created: w3c/rdf-concepts#15, tracked by w3c/sparql-query#13.

afs · 2023-02-17T10:24:50Z

IRIREF terminal MUST be a valid IRI

We have the liberal grammar to remove implementation burden. Provided a dubious IRI conforms to IRIREF everything should work (and there are often dubious IRIs in some public datasets).

There are parts of RFC 3986 which are "should". One example is "no empty port" (because they weren't banned originally?).

Valid IRIs come in layers - URI schemes define additional syntax constraints which a general processor can't always know about.

Other specs we aren't touching also have the same grammar rule.

I think SHOULD is better.
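To make the "liberal grammar" point concrete, here is a small sketch of the IRIREF terminal from the N-Triples grammar as a regular expression. The variable name `IRIREF` and the sample IRIs are illustrative assumptions; the pattern only checks the terminal, not RFC 3987 validity.

```python
import re

# IRIREF ::= '<' ([^#x00-#x20<>"{}|^`\] | UCHAR)* '>'
# UCHAR  ::= '\u' HEX HEX HEX HEX | '\U' HEX HEX HEX HEX HEX HEX HEX HEX
IRIREF = re.compile(
    r'<(?:[^\x00-\x20<>"{}|^`\\]'
    r'|\\u[0-9A-Fa-f]{4}'
    r'|\\U[0-9A-Fa-f]{8})*>'
)

samples = [
    "<http://example.org/ok>",
    "<http://example.org:/empty-port>",  # dubious, but matches the terminal
    "<::::>",                            # not a sensible IRI, still matches
    "<http://example.org/a b>",          # raw space is excluded by the terminal
    "<http://example.org/\\u0020>",      # the same space written as UCHAR matches
]
for s in samples:
    print(s, bool(IRIREF.fullmatch(s)))
```

The terminal accepts strings that a strict IRI validator would reject, which is the gap a MUST or SHOULD on "valid IRI" would have to cover.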

gkellogg · 2023-02-18T22:30:02Z

Just a datapoint for RDF Canonicalization. If we change to escape \f and ' it will break expected results for test060. All other tests pass.

input:

<urn:ex:s> <urn:ex:000:empty> "" .
<urn:ex:s> <urn:ex:001:simple> "simple" .
<urn:ex:s> <urn:ex:002:quote> "\"" .
<urn:ex:s> <urn:ex:003:backslash> "\\" .
<urn:ex:s> <urn:ex:004:nl> "\n" .
<urn:ex:s> <urn:ex:005:cr> "\r" .
<urn:ex:s> <urn:ex:006:all> "\"\\\n\r" .
<urn:ex:s> <urn:ex:007:uchar> "\u0022\u005c" .
<urn:ex:s> <urn:ex:008:echar> "\t\b\n\r\f\"\'\\" .
<urn:ex:s> <urn:ex:009> "\\u0039" .
<urn:ex:s> <urn:ex:010> "\\n" .
<urn:ex:s> <urn:ex:011> "\\\\" .
<urn:ex:s> <urn:ex:012> "\"\"" .
<urn:ex:s> <urn:ex:013> "\\\\\\" .
<urn:ex:s> <urn:ex:014> "\"\"\"" .
<urn:ex:s> <urn:ex:015> "\u221e" .
<urn:ex:s> <urn:ex:016> "∞" .

expected results:

<urn:ex:s> <urn:ex:000:empty> "" .
<urn:ex:s> <urn:ex:001:simple> "simple" .
<urn:ex:s> <urn:ex:002:quote> "\"" .
<urn:ex:s> <urn:ex:003:backslash> "\\" .
<urn:ex:s> <urn:ex:004:nl> "\n" .
<urn:ex:s> <urn:ex:005:cr> "\r" .
<urn:ex:s> <urn:ex:006:all> "\"\\\n\r" .
<urn:ex:s> <urn:ex:007:uchar> "\"\\" .
<urn:ex:s> <urn:ex:008:echar> "\n\r
                                   \"'\\" .
<urn:ex:s> <urn:ex:009> "\\u0039" .
<urn:ex:s> <urn:ex:010> "\\n" .
<urn:ex:s> <urn:ex:011> "\\\\" .
<urn:ex:s> <urn:ex:012> "\"\"" .
<urn:ex:s> <urn:ex:013> "\\\\\\" .
<urn:ex:s> <urn:ex:014> "\"\"\"" .
<urn:ex:s> <urn:ex:015> "∞" .
<urn:ex:s> <urn:ex:016> "∞" .
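To see which test060 entries carry control characters once their escapes are decoded, the input literals can be unescaped and inspected. This is a sketch with hypothetical names (`unescape`, `has_control`); it handles only the ECHAR and UCHAR forms of the N-Quads grammar and does no error checking.

```python
import re

# Matches one UCHAR (\uXXXX or \UXXXXXXXX) or one ECHAR (\t \b \n \r \f \" \' \\).
_ESCAPE = re.compile(r"""\\(?:u([0-9A-Fa-f]{4})|U([0-9A-Fa-f]{8})|([tbnrf"'\\]))""")
_ECHAR = {"t": "\t", "b": "\b", "n": "\n", "r": "\r", "f": "\f",
          '"': '"', "'": "'", "\\": "\\"}


def unescape(lexical: str) -> str:
    """Decode ECHAR and UCHAR escapes in a literal's lexical form."""
    def repl(m: re.Match) -> str:
        if m.group(3) is not None:
            return _ECHAR[m.group(3)]
        return chr(int(m.group(1) or m.group(2), 16))
    return _ESCAPE.sub(repl, lexical)


def has_control(value: str) -> bool:
    """True if value holds a C0 control other than LF/CR, or DEL."""
    return any((ord(c) <= 0x1F and c not in "\n\r") or ord(c) == 0x7F
               for c in value)


# A few of the test060 input literals, copied as raw strings.
inputs = {
    "007:uchar": r"\u0022\u005c",
    "008:echar": r"\t\b\n\r\f\"\'\\",
    "009": r"\\u0039",
    "015": r"\u221e",
}
for name, lexical in inputs.items():
    print(name, has_control(unescape(lexical)))
```

Of the entries sampled, only 008 decodes to tab, backspace, and form feed, which is consistent with it being the one expected result that changes.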

cc/ @dlongley

philarcher · 2023-03-15T15:03:43Z

Just to note here that the RDF Canonicalization & Hash WG resolved to support the approach to escaping proposed in this issue, see https://www.w3.org/2023/03/15-rch-minutes.html#r01

yamdan · 2023-03-15T16:33:03Z

Please allow me to confirm this for better understanding.
If we adopt the proposed escape method, I understand that the expected result for test060 will change to the following. Is this correct?

expected results (after the new escaping):

<urn:ex:s> <urn:ex:000:empty> "" .
<urn:ex:s> <urn:ex:001:simple> "simple" .
<urn:ex:s> <urn:ex:002:quote> "\"" .
<urn:ex:s> <urn:ex:003:backslash> "\\" .
<urn:ex:s> <urn:ex:004:nl> "\n" .
<urn:ex:s> <urn:ex:005:cr> "\r" .
<urn:ex:s> <urn:ex:006:all> "\"\\\n\r" .
<urn:ex:s> <urn:ex:007:uchar> "\"\\" .
<urn:ex:s> <urn:ex:008:echar> "\t\b\n\r\f\"\'\\" .
<urn:ex:s> <urn:ex:009> "\\u0039" .
<urn:ex:s> <urn:ex:010> "\\n" .
<urn:ex:s> <urn:ex:011> "\\\\" .
<urn:ex:s> <urn:ex:012> "\"\"" .
<urn:ex:s> <urn:ex:013> "\\\\\\" .
<urn:ex:s> <urn:ex:014> "\"\"\"" .
<urn:ex:s> <urn:ex:015> "∞" .
<urn:ex:s> <urn:ex:016> "∞" .

gkellogg · 2023-03-15T22:12:29Z

Yes, that's what my implementation produces.

* Improve canonicalization section.
* Reference issue #16 as a future direction for canonicalization.
* Add prohibition on using a datatype IRI if the datatype is xsd:string when canonicalizing.
* Update change note on PN_CHARS_U to describe the change in blank node representation.
* White space updates.
* Change note motivating the use of canonical N-Quads.
* Sync recent changes to w3c/rdf-concepts#16.
* Fix IRI term references.
* Update note motivating canonical N-Quads.

Co-authored-by: Ted Thibodeau Jr <tthibodeau@openlinksw.com>
Co-authored-by: Andy Seaborne <andy@apache.org>
Co-authored-by: Dan Yamamoto <yamdan@gmail.com>

Fixes #16.

* Update the use of ECHAR and UCHAR in canonical N-Quads. Fixes #16.
* Add paragraph saying that Canonical N-Quads extends Canonical N-Triples.

Co-authored-by: Andy Seaborne <andy@apache.org>
Co-authored-by: Dan Yamamoto <yamdan@gmail.com>
Co-authored-by: Ted Thibodeau Jr <tthibodeau@openlinksw.com>

gkellogg mentioned this issue Feb 14, 2023

zero current implementations of Turtle/TriG/N-Triples/N-Quads functionality are spec compliant, because none can "ensure that malignant strings may not be used to mislead the reader" — there's just no way to do so! w3c/rdf-concepts#11

Closed

gkellogg added a commit that referenced this issue Feb 14, 2023

Separate out Security Considerations from media type and add a Privacy Considerations stub. Reference issue #16 as a future direction for canonicalization.

74e3f39

gkellogg mentioned this issue Feb 14, 2023

Canonicalization #17

Merged

gkellogg added the spec:substantive label Mar 22, 2023

gkellogg added a commit that referenced this issue Mar 30, 2023

Separate out Security Considerations from media type and add a Privacy Considerations stub. Reference issue #16 as a future direction for canonicalization.

62ed776

gkellogg added a commit that referenced this issue Apr 5, 2023

Update the use of ECHAR and UCHAR in canonical N-Quads.

0498947

Fixes #16.

gkellogg mentioned this issue Apr 5, 2023

Update the use of ECHAR and UCHAR in canonical N-Quads. #27

Merged

gkellogg closed this as completed in #27 Apr 20, 2023
