Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Re-consider use of escapes in canonical N-Quads #16

Closed
gkellogg opened this issue Feb 14, 2023 · 10 comments · Fixed by #27
Closed

Re-consider use of escapes in canonical N-Quads #16

gkellogg opened this issue Feb 14, 2023 · 10 comments · Fixed by #27
Labels
spec:substantive Change in the spec affecting its normative content (class 3) –see also spec:bug, spec:new-feature

Comments

@gkellogg
Copy link
Member

gkellogg commented Feb 14, 2023

Re-consider the use of UCHAR and ECHAR escapes in N-Triples/N-Quads canonicalization. The 1.1-based recommendation prohibits the use of UCHAR (U+XXXX) and allows ECHAR only for U+0022 (quote \"), U+005C (backslash \\), U+000A (LF \n), and U+000D (CR \r). However, the use of control characters can obfuscate text when presented, creating a potential security concern.

A future version may consider requiring all characters between U+0000 and U+001F (other than U+000A (LF) and U+000D (CR)) and U+007F (DEL) to be represented using UCHAR. Characters that can be represented using ECHAR MUST use that representation. All other code points MUST NOT be represented by UCHAR.

@gkellogg
Copy link
Member Author

See #2 (comment).

@gkellogg
Copy link
Member Author

Probably also need U+007F DEL to be UCHAR escaped.

@gkellogg
Copy link
Member Author

From @afs in https://github.com/w3c/rdf-n-quads/pull/17/files#r1108143303:

Unfortunately, the IRI rule allows more than the characters of a legal IRI so if the goal is a canonical form that corresponds to the grammar (I think it should), it has to cope with this situation.

Raw backspace, tabs, null and form-feed anywhere in the canonical form leads to obfuscation attacks. If this canonical form is used for RCH, these will need to be addresses or a positive justification made for them. Otherwise, the security considerations sections will be advising differently to the canonical representation.

It would be good if there were a simple check to see whether a claimed canonical form actually hiding an obfuscation. One way is the presences of any the most dubious codepoints.

@gkellogg
Copy link
Member Author

Unfortunately, the IRI rule allows more than the characters of a legal IRI so if the goal is a canonical form that corresponds to the grammar (I think it should), it has to cope with this situation.

This treads on erratum 29 (which was actually more about terminology) where the grammar for IRI is from a spec that never made it to standard. We may need to define the parts of RFC3987 in RDF Concepts to be able to fall back on it.

Given that, I'd say that saying the the value matched by the IRIREF terminal MUST be a valid IRI based on that ABNF grammar. (We also need to comb through the use of IRI terminology, as @pchampin actually suggested).

@afs
Copy link
Contributor

afs commented Feb 17, 2023

re: RFC3987.

Yes, good point. It would be good to do something about that.
The simplest would be to adopt the text from the proposed standard (sections 2 & 5). We then don't have any conversion text nearby

Issue created: w3c/rdf-concepts#15, tracked by w3c/sparql-query#13.

@afs
Copy link
Contributor

afs commented Feb 17, 2023

IRIREF terminal MUST be a valid IRI

We have the liberal grammar to remove implementation burden. Provided a dubious IRI conforms to IRIREF everything should work (and there are often dubious IRIs in some public datasets).

There are parts of RFC 3986 which are "should". One example is "no empty port" (because they weren't banned originally?).

Valid IRIs come in layers - URI schemes define addition syntax constraints which a general processor can't always know about.

Other specs we aren't touching also have the same grammar rule.

I think SHOULD is better.

@gkellogg
Copy link
Member Author

Just a datapoint for RDF Canonicalization. If we change to escape \f and ' it will break expected results for test060. All other tests pass.

input:

<urn:ex:s> <urn:ex:000:empty> "" .
<urn:ex:s> <urn:ex:001:simple> "simple" .
<urn:ex:s> <urn:ex:002:quote> "\"" .
<urn:ex:s> <urn:ex:003:backslash> "\\" .
<urn:ex:s> <urn:ex:004:nl> "\n" .
<urn:ex:s> <urn:ex:005:cr> "\r" .
<urn:ex:s> <urn:ex:006:all> "\"\\\n\r" .
<urn:ex:s> <urn:ex:007:uchar> "\u0022\u005c" .
<urn:ex:s> <urn:ex:008:echar> "\t\b\n\r\f\"\'\\" .
<urn:ex:s> <urn:ex:009> "\\u0039" .
<urn:ex:s> <urn:ex:010> "\\n" .
<urn:ex:s> <urn:ex:011> "\\\\" .
<urn:ex:s> <urn:ex:012> "\"\"" .
<urn:ex:s> <urn:ex:013> "\\\\\\" .
<urn:ex:s> <urn:ex:014> "\"\"\"" .
<urn:ex:s> <urn:ex:015> "\u221e" .
<urn:ex:s> <urn:ex:016> "" .

expected results:

<urn:ex:s> <urn:ex:000:empty> "" .
<urn:ex:s> <urn:ex:001:simple> "simple" .
<urn:ex:s> <urn:ex:002:quote> "\"" .
<urn:ex:s> <urn:ex:003:backslash> "\\" .
<urn:ex:s> <urn:ex:004:nl> "\n" .
<urn:ex:s> <urn:ex:005:cr> "\r" .
<urn:ex:s> <urn:ex:006:all> "\"\\\n\r" .
<urn:ex:s> <urn:ex:007:uchar> "\"\\" .
<urn:ex:s> <urn:ex:008:echar> "\n\r
                                   \"'\\" .
<urn:ex:s> <urn:ex:009> "\\u0039" .
<urn:ex:s> <urn:ex:010> "\\n" .
<urn:ex:s> <urn:ex:011> "\\\\" .
<urn:ex:s> <urn:ex:012> "\"\"" .
<urn:ex:s> <urn:ex:013> "\\\\\\" .
<urn:ex:s> <urn:ex:014> "\"\"\"" .
<urn:ex:s> <urn:ex:015> "" .
<urn:ex:s> <urn:ex:016> "" .

cc/ @dlongley

@philarcher
Copy link

Just to note there that the RDF Canonicalization & Hash WG resolved to support the approach to escaping proposed in this issue, see https://www.w3.org/2023/03/15-rch-minutes.html#r01

@yamdan
Copy link
Contributor

yamdan commented Mar 15, 2023

Please allow me to confirm this for better understanding.
If we adopt the proposed escape method, I understand that the expected result for test060 will change to the following. Is this correct?

expected results (after the new escaping):

<urn:ex:s> <urn:ex:000:empty> "" .
<urn:ex:s> <urn:ex:001:simple> "simple" .
<urn:ex:s> <urn:ex:002:quote> "\"" .
<urn:ex:s> <urn:ex:003:backslash> "\\" .
<urn:ex:s> <urn:ex:004:nl> "\n" .
<urn:ex:s> <urn:ex:005:cr> "\r" .
<urn:ex:s> <urn:ex:006:all> "\"\\\n\r" .
<urn:ex:s> <urn:ex:007:uchar> "\"\\" .
<urn:ex:s> <urn:ex:008:echar> "\t\b\n\r\f\"\'\\" .
<urn:ex:s> <urn:ex:009> "\\u0039" .
<urn:ex:s> <urn:ex:010> "\\n" .
<urn:ex:s> <urn:ex:011> "\\\\" .
<urn:ex:s> <urn:ex:012> "\"\"" .
<urn:ex:s> <urn:ex:013> "\\\\\\" .
<urn:ex:s> <urn:ex:014> "\"\"\"" .
<urn:ex:s> <urn:ex:015> "" .
<urn:ex:s> <urn:ex:016> "" .

@gkellogg
Copy link
Member Author

Yes, that's what my implementation produces.

@gkellogg gkellogg added the spec:substantive Change in the spec affecting its normative content (class 3) –see also spec:bug, spec:new-feature label Mar 22, 2023
gkellogg added a commit that referenced this issue Mar 30, 2023
…y Considerations stub.

Reference issue #16 as a future direction for canonicalization.
gkellogg added a commit that referenced this issue Mar 30, 2023
* Improve canonicalization section.
  * Reference issue #16 as a future direction for canonicalization.
* Add prohibition on using a datatype IRI if the datatype is xsd:string when canonicalizing.
* Update change note on PN_CHARS_U to describe the change in blank node representation.
* White space updates.
* Change note motivating the use of canonical N-Quads.
* Sync recent changes to w3c/rdf-concepts#16.
* Fix IRI term references.
* Update note motivating canonical N-Quads.

---------

Co-authored-by: Ted Thibodeau Jr <tthibodeau@openlinksw.com>
Co-authored-by: Andy Seaborne <andy@apache.org>
Co-authored-by: Dan Yamamoto <yamdan@gmail.com>
gkellogg added a commit that referenced this issue Apr 20, 2023
* Update the use of ECHAR and UCHAR in canonical N-Quads. Fixes #16.
* Add paragraph saying that Canonical N-Quads extends Canonical N-Triples.

---------

Co-authored-by: Andy Seaborne <andy@apache.org>
Co-authored-by: Dan Yamamoto <yamdan@gmail.com>
Co-authored-by: Ted Thibodeau Jr <tthibodeau@openlinksw.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
spec:substantive Change in the spec affecting its normative content (class 3) –see also spec:bug, spec:new-feature
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants