Unicode ordering #18
Comments
It strikes me that string sorting requirements are the same as for SPARQL, particularly ORDER BY, which uses the Perhaps @afs has some recollection on how this issue was considered in SPARQL.
Defn in SPARQL 1.1: sparql11-query/#modOrderBy which leads to XPath/XQuery fn:compare with no collation. Section 17.3 says The discussion seems to be happening on #17.
Left some comments over here: #17 (comment) I think the discussion should move over here now.
Unicode collation is not helpful. It is locale-sensitive, so you must know the locale. It is not a total order. And it involves Unicode Normalization, which equates multiple ways of writing the same appearance.
For @dlongley points 1-4, the conclusion is one of codepoint sorting (logically convert to 21-bit values, then sort) or code unit sorting for a specific choice of code unit. SPARQL is (strictly) codepoint (the abstract character) ordering. No normalization. Section 17.3 says use collation
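To make the distinction between code point ordering and code unit ordering concrete, here is a minimal Python sketch. The helper names and the two sample characters are hypothetical choices for illustration and do not come from any implementation discussed in this thread. It relies on the fact that characters above U+FFFF are encoded in UTF-16 as surrogate pairs in the range D800-DFFF, which sort below BMP characters such as U+FF5E when comparing code units.

```python
def codepoint_key(s):
    """Sort key: the sequence of Unicode code points in the string."""
    return [ord(ch) for ch in s]


def utf16_code_unit_key(s):
    """Sort key: UTF-16 code units.

    Big-endian bytes compare byte-wise in the same order as the
    16-bit code units compare numerically.
    """
    return s.encode("utf-16-be")


# Hypothetical sample values chosen to expose the difference.
bmp = "\uff5e"         # U+FF5E, a single UTF-16 code unit: FF5E
astral = "\U0001f600"  # U+1F600, a surrogate pair in UTF-16: D83D DE00

by_codepoint = sorted([astral, bmp], key=codepoint_key)
by_code_unit = sorted([astral, bmp], key=utf16_code_unit_key)

print([hex(ord(s)) for s in by_codepoint])
print([hex(ord(s)) for s in by_code_unit])
```

The two sorts disagree on these inputs, which is why a specification needs to name one ordering explicitly rather than say "sort the strings".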
I made an issue in our JS implementation of URDNA2015 and commented with some example code that might do Unicode code point sorting in JS:
This would seem like a good reference to cite, then. If this doesn’t incorporate locale, it would seem to satisfy our requirements.
Resolution is to use "Unicode point order" with a reference to "total order".
I think it would be better to cite https://www.w3.org/TR/xpath-functions-31/#codepoint-collation which normatively defines "The Unicode Codepoint Collation" which is identified by http://www.w3.org/2005/xpath-functions/collation/codepoint, rather than to cite http://www.w3.org/2005/xpath-functions/collation/codepoint which contains none of the standard W3 TR status notes and merely points back to https://www.w3.org/TR/xpath-functions-31/ in toto. I do not see any reference in either of these documents to "Unicode point order", nor to "total order" with regard to strings.
The correct terminology is "Unicode code point order". I think this is what W3C I18N will recommend to you. The XPath function pointed to by @TallTed is one good reference. You might glance at our guidance in INTERNATIONAL-SPECS. The one potential alternative is what HTML and the Infra spec define, which is "code unit order" of UTF-16 (e.g. Note: The I18N WG tends to shy away from the phrase "lexicographical order" because it implies an alphabetical or dictionary type ordering and Unicode code point order is definitely not that overall. Code point order is a stable, deterministic, efficient order and is "lexicographical-enough" to be relatively unsurprising to users (as long as the limitations are called out).
We (I18N) always appreciate early review.
… the normative reference in XPATH-FUNCTIONS. Fixes #18.
Thanks for the review, @aphillips. I've made an update, as you and @TallTed suggested, in PR #33. I'll raise an issue to be addressed in JSON-LD 1.1 API referencing this issue so we can eventually fix the wording there, as well.
URDNA2015 depends on a reliable sort order for lexicographically sorting/ordering values to guarantee that quads are traversed in a predictable order. As revealed in discussions on PR #17 (#17 (comment), #17 (comment), #17 (comment), #17 (comment), #17 (comment), #17 (comment), and #17 (comment)) this is a more complicated problem than it would first appear, and affects any algorithm that would be used for C14N.
Our problem is somewhat simplified by N-Quads requiring UTF-8, but even so, different implementations may give different results.
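One reason UTF-8 simplifies matters is a property of the encoding itself, not something stated in this thread: comparing well-formed UTF-8 byte sequences byte by byte gives the same order as comparing code points. The following Python sketch checks this on a handful of hypothetical sample strings. It is only an illustration under that assumption, not a conformance test.

```python
# Hypothetical sample strings: ASCII, Latin-1 range, BMP, and supplementary.
samples = ["b", "a", "\u00e9", "\uff5e", "\U0001f600", "ab", ""]

# Order by the sequence of code points in each string.
by_codepoint = sorted(samples, key=lambda s: [ord(ch) for ch in s])

# Order by the raw UTF-8 bytes, as they would appear in an N-Quads file.
by_utf8 = sorted(samples, key=lambda s: s.encode("utf-8"))

utf8_matches_codepoint = by_utf8 == by_codepoint
print(utf8_matches_codepoint)
```

An implementation that sorts its in-memory strings with a native comparison, rather than the serialized UTF-8 bytes, may still use a different ordering depending on the language's internal string representation, which is one way results can diverge.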
The problem is generic enough that it must hit other specifications as well (certainly JSON-LD) and may call for an early TAG or I18N WG review.