Re-consider use of escapes in canonical N-Quads #16
See #2 (comment). |
…y Considerations stub. Reference issue #16 as a future direction for canonicalization.
Probably also need |
From @afs in https://github.com/w3c/rdf-n-quads/pull/17/files#r1108143303:
|
This treads on erratum 29 (which was actually more about terminology) where the grammar for IRI is from a spec that never made it to standard. We may need to define the parts of RFC3987 in RDF Concepts to be able to fall back on it. Given that, I'd suggest saying that the value matched by the IRIREF terminal MUST be a valid IRI based on that ABNF grammar. (We also need to comb through the use of IRI terminology, as @pchampin actually suggested). |
re: RFC3987. Yes, good point. It would be good to do something about that. Issue created: w3c/rdf-concepts#15, tracked by w3c/sparql-query#13. |
We have the liberal grammar to remove implementation burden. Provided a dubious IRI conforms to There are parts of RFC 3986 which are "should". One example is "no empty port" (because they weren't banned originally?). Valid IRIs come in layers - URI schemes define additional syntax constraints which a general processor can't always know about. Other specs we aren't touching also have the same grammar rule. I think SHOULD is better. |
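To make the "liberal grammar" point concrete, here is a minimal sketch that checks a token against the IRIREF terminal of the N-Triples/N-Quads grammar only. The function name `matches_iriref` is hypothetical, and no RFC 3986/3987 or scheme-specific validation is attempted; that is exactly the layer the comment above argues a general processor cannot always supply.

```python
import re

# IRIREF terminal from the N-Triples/N-Quads grammar:
#   '<' ([^#x00-#x20<>"{}|^`\] | UCHAR)* '>'
# This is the liberal, purely lexical check; it says nothing about
# whether the content is a valid IRI per RFC 3987.
_IRIREF = re.compile(
    r'<(?:[^\x00-\x20<>"{}|^`\\]|\\u[0-9A-Fa-f]{4}|\\U[0-9A-Fa-f]{8})*>'
)

def matches_iriref(token: str) -> bool:
    """Return True if token matches the IRIREF terminal (hypothetical helper)."""
    return _IRIREF.fullmatch(token) is not None

# An empty port passes the grammar even though RFC 3986 discourages it.
print(matches_iriref('<http://example.org:/path>'))
# A space is excluded by the terminal itself.
print(matches_iriref('<http://example.org/a b>'))
```

Under this sketch a MUST on IRI validity would need a second, stricter check layered on top, whereas a SHOULD leaves the grammar match as the only hard requirement.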
Just a datapoint for RDF Canonicalization. If we change to escape input:
<urn:ex:s> <urn:ex:000:empty> "" .
<urn:ex:s> <urn:ex:001:simple> "simple" .
<urn:ex:s> <urn:ex:002:quote> "\"" .
<urn:ex:s> <urn:ex:003:backslash> "\\" .
<urn:ex:s> <urn:ex:004:nl> "\n" .
<urn:ex:s> <urn:ex:005:cr> "\r" .
<urn:ex:s> <urn:ex:006:all> "\"\\\n\r" .
<urn:ex:s> <urn:ex:007:uchar> "\u0022\u005c" .
<urn:ex:s> <urn:ex:008:echar> "\t\b\n\r\f\"\'\\" .
<urn:ex:s> <urn:ex:009> "\\u0039" .
<urn:ex:s> <urn:ex:010> "\\n" .
<urn:ex:s> <urn:ex:011> "\\\\" .
<urn:ex:s> <urn:ex:012> "\"\"" .
<urn:ex:s> <urn:ex:013> "\\\\\\" .
<urn:ex:s> <urn:ex:014> "\"\"\"" .
<urn:ex:s> <urn:ex:015> "\u221e" .
<urn:ex:s> <urn:ex:016> "∞" .
expected results:
<urn:ex:s> <urn:ex:000:empty> "" .
<urn:ex:s> <urn:ex:001:simple> "simple" .
<urn:ex:s> <urn:ex:002:quote> "\"" .
<urn:ex:s> <urn:ex:003:backslash> "\\" .
<urn:ex:s> <urn:ex:004:nl> "\n" .
<urn:ex:s> <urn:ex:005:cr> "\r" .
<urn:ex:s> <urn:ex:006:all> "\"\\\n\r" .
<urn:ex:s> <urn:ex:007:uchar> "\"\\" .
<urn:ex:s> <urn:ex:008:echar> "\n\r
\"'\\" .
<urn:ex:s> <urn:ex:009> "\\u0039" .
<urn:ex:s> <urn:ex:010> "\\n" .
<urn:ex:s> <urn:ex:011> "\\\\" .
<urn:ex:s> <urn:ex:012> "\"\"" .
<urn:ex:s> <urn:ex:013> "\\\\\\" .
<urn:ex:s> <urn:ex:014> "\"\"\"" .
<urn:ex:s> <urn:ex:015> "∞" .
<urn:ex:s> <urn:ex:016> "∞" .
cc/ @dlongley |
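The input/expected pairs above can be reproduced with a small sketch that decodes ECHAR and UCHAR escapes in a literal's lexical form and then re-escapes it under the 1.1-based rule (only `"`, `\`, LF and CR are written as ECHAR, and UCHAR is never used). The helper names `unescape` and `canonical_1_1` are hypothetical, and the code handles only the text between the quotes, not full N-Quads parsing.

```python
import re

# ECHAR decoding table from the N-Triples/N-Quads grammar.
_ECHAR_DECODE = {'t': '\t', 'b': '\b', 'n': '\n', 'r': '\r', 'f': '\f',
                 '"': '"', "'": "'", '\\': '\\'}

# Matches \uXXXX, \UXXXXXXXX, or a backslash followed by any one character.
_ESCAPE = re.compile(r'\\(?:u([0-9A-Fa-f]{4})|U([0-9A-Fa-f]{8})|(.))', re.DOTALL)

def unescape(lexical: str) -> str:
    """Decode ECHAR/UCHAR escapes; raises KeyError on an unknown escape."""
    def repl(m):
        hex_digits = m.group(1) or m.group(2)
        if hex_digits:
            return chr(int(hex_digits, 16))
        return _ECHAR_DECODE[m.group(3)]
    return _ESCAPE.sub(repl, lexical)

# 1.1-based canonical form: only these four characters are escaped.
_ECHAR_1_1 = {'"': '\\"', '\\': '\\\\', '\n': '\\n', '\r': '\\r'}

def canonical_1_1(lexical: str) -> str:
    """Decode, then re-escape with the 1.1-based rule (hypothetical helper)."""
    return ''.join(_ECHAR_1_1.get(ch, ch) for ch in unescape(lexical))

print(canonical_1_1(r'\u0022\u005c'))  # the 007:uchar input
print(canonical_1_1(r'\u221e'))        # the 015 input
```

Note that under this rule tab, backspace and form feed come out as raw control characters, which is the presentation problem this issue is about.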
Just to note here that the RDF Canonicalization & Hash WG resolved to support the approach to escaping proposed in this issue, see https://www.w3.org/2023/03/15-rch-minutes.html#r01 |
Please allow me to confirm this for better understanding.
expected results (after the new escaping):
<urn:ex:s> <urn:ex:000:empty> "" .
<urn:ex:s> <urn:ex:001:simple> "simple" .
<urn:ex:s> <urn:ex:002:quote> "\"" .
<urn:ex:s> <urn:ex:003:backslash> "\\" .
<urn:ex:s> <urn:ex:004:nl> "\n" .
<urn:ex:s> <urn:ex:005:cr> "\r" .
<urn:ex:s> <urn:ex:006:all> "\"\\\n\r" .
<urn:ex:s> <urn:ex:007:uchar> "\"\\" .
<urn:ex:s> <urn:ex:008:echar> "\t\b\n\r\f\"\'\\" .
<urn:ex:s> <urn:ex:009> "\\u0039" .
<urn:ex:s> <urn:ex:010> "\\n" .
<urn:ex:s> <urn:ex:011> "\\\\" .
<urn:ex:s> <urn:ex:012> "\"\"" .
<urn:ex:s> <urn:ex:013> "\\\\\\" .
<urn:ex:s> <urn:ex:014> "\"\"\"" .
<urn:ex:s> <urn:ex:015> "∞" .
<urn:ex:s> <urn:ex:016> "∞" . |
Yes, that's what my implementation produces. |
* Improve canonicalization section.
* Reference issue #16 as a future direction for canonicalization.
* Add prohibition on using a datatype IRI if the datatype is xsd:string when canonicalizing.
* Update change note on PN_CHARS_U to describe the change in blank node representation.
* White space updates.
* Change note motivating the use of canonical N-Quads.
* Sync recent changes to w3c/rdf-concepts#16.
* Fix IRI term references.
* Update note motivating canonical N-Quads.

Co-authored-by: Ted Thibodeau Jr <tthibodeau@openlinksw.com>
Co-authored-by: Andy Seaborne <andy@apache.org>
Co-authored-by: Dan Yamamoto <yamdan@gmail.com>

* Update the use of ECHAR and UCHAR in canonical N-Quads. Fixes #16.
* Add paragraph saying that Canonical N-Quads extends Canonical N-Triples.

Co-authored-by: Andy Seaborne <andy@apache.org>
Co-authored-by: Dan Yamamoto <yamdan@gmail.com>
Co-authored-by: Ted Thibodeau Jr <tthibodeau@openlinksw.com>
Re-consider the use of UCHAR and ECHAR escapes in N-Triples/N-Quads canonicalization. The 1.1-based recommendation prohibits the use of UCHAR (U+XXXX) and allows ECHAR only for U+0022 (quote \"), U+005C (backslash \\), U+000A (LF \n), and U+000D (CR \r). However, the use of control characters can obfuscate text when presented, creating a potential security concern.

A future version may consider requiring all characters between U+0000 and U+001F (other than U+000A (LF) and U+000D (CR)) and U+007F (DEL) to be represented using UCHAR. Characters that can be represented using ECHAR MUST use that representation. All other code points MUST NOT be represented by UCHAR.
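A minimal sketch of the proposed rule, applied to an already-decoded literal value, might look like the following. Several points are assumptions rather than anything this issue fixes: the function name `escape_proposed` is hypothetical; ECHAR is taken to win over UCHAR for tab, backspace and form feed; the apostrophe is escaped as `\'` to match the expected results confirmed in this thread; and the UCHAR hex digits are written in upper case.

```python
# Characters that have an ECHAR form. Assumption: ECHAR takes precedence
# over UCHAR, and the apostrophe is included, following the thread's
# confirmed expected output rather than any final specification text.
ECHAR = {'\t': '\\t', '\b': '\\b', '\n': '\\n', '\r': '\\r', '\f': '\\f',
         '"': '\\"', "'": "\\'", '\\': '\\\\'}

def escape_proposed(value: str) -> str:
    """Escape a decoded literal value under the proposed rule (sketch)."""
    out = []
    for ch in value:
        cp = ord(ch)
        if ch in ECHAR:
            out.append(ECHAR[ch])
        elif cp <= 0x1F or cp == 0x7F:
            # Remaining control characters and DEL use \uXXXX.
            # Upper-case hex is an assumption.
            out.append('\\u%04X' % cp)
        else:
            # All other code points are written natively, never as UCHAR.
            out.append(ch)
    return ''.join(out)

print(escape_proposed('\t\b\n\r\f"\'\\'))  # the decoded 008:echar value
print(escape_proposed('\x00\x7f\u221e'))
```

With this rule no raw control character other than those with an ECHAR form can reach the serialized output, which addresses the obfuscation concern raised above.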