Unicode ordering #18

gkellogg · 2022-10-21T18:02:43Z

URDNA2015 depends on a reliable sort order for lexicographically ordering values, to guarantee that quads are traversed in a predictable order. As revealed in several comments on PR #17, this is a more complicated problem than it would first appear, and it affects any algorithm that would be used for C14N.

Our problem is somewhat simplified by N-Quads requiring UTF-8, but even so, different implementations may give different results.

The problem is generic enough that it must hit other specifications as well (certainly JSON-LD) and may call for an early TAG or I18N WG review.

gkellogg · 2022-10-21T23:02:17Z

It strikes me that the string sorting requirements are the same as for SPARQL, particularly ORDER BY, which uses the < operator. In our case it would apply to literals of type xsd:string, which is effectively how a quad is serialized. So the term "lexicographical order" could be defined using op:numeric-equal(fn:compare(STR(A), STR(B)), -1).
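To make the proposed definition concrete, here is a minimal Python sketch of what that expression computes, assuming the Unicode codepoint collation discussed later in this thread. The names fn_compare and sparql_less_than are hypothetical stand-ins for illustration, not an API of any SPARQL library.

```python
def fn_compare(a: str, b: str) -> int:
    """Return -1, 0, or 1, mimicking fn:compare under the codepoint collation."""
    # Python compares str values code point by code point, so the built-in
    # operators already give Unicode code point order.
    return (a > b) - (a < b)


def sparql_less_than(a: str, b: str) -> bool:
    """A < B for string operands: true when fn:compare yields -1."""
    return fn_compare(a, b) == -1


print(fn_compare("apple", "apple"))       # 0
print(sparql_less_than("Z", "a"))         # True: U+005A sorts before U+0061
print(sparql_less_than("\u00e9", "z"))    # False: U+00E9 sorts after U+007A
```

Note that the result is not alphabetical: every uppercase ASCII letter sorts before every lowercase one, and accented letters sort after "z".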

Perhaps @afs has some recollection of how this issue was considered in SPARQL.

afs · 2022-10-22T19:52:35Z

Definition in SPARQL 1.1: sparql11-query/#modOrderBy

which leads to XPath/XQuery fn:compare with no explicit collation.

Section 17.3 says to use the collation http://www.w3.org/2005/xpath-functions/collation/codepoint (credit to Eric Prud'hommeaux).

The discussion seems to be happening on #17.

dlongley · 2022-10-22T20:12:02Z

Left some comments over here: #17 (comment)

I think the discussion should move over here now.

afs · 2022-10-22T20:22:21Z

Unicode collation is not helpful. It is locale-sensitive, so you must know the locale. It is not a total order. And it involves Unicode normalization, which equates multiple ways of writing text with the same appearance.
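The normalization point can be illustrated with a small Python sketch using the standard unicodedata module. It only shows the effect of NFC normalization on equality; it does not implement the Unicode Collation Algorithm, and the helper name nfc is introduced here for illustration.

```python
import unicodedata

composed = "\u00e9"      # LATIN SMALL LETTER E WITH ACUTE
decomposed = "e\u0301"   # "e" followed by COMBINING ACUTE ACCENT


def nfc(s: str) -> str:
    """Normalize a string to Unicode Normalization Form C."""
    return unicodedata.normalize("NFC", s)


# Code point comparison sees two different strings with a definite order.
print(composed == decomposed)             # False
print(decomposed < composed)              # True: U+0065 < U+00E9

# A comparison that normalizes first treats them as the same string.
print(nfc(composed) == nfc(decomposed))   # True
```

For canonicalization, the two spellings are distinct literals, so an ordering that cannot tell them apart leaves their relative position undefined.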

https://unicode.org/reports/tr10/#Common_Misperceptions

"Swedish and German share most of the same characters, for example, but have very different sorting orders."
"Collation is not a property of strings."
"Collation order is not preserved when comparing sort keys generated from different collation sequences."

For @dlongley's points 1-4, the conclusion is either code point sorting (logically convert each character to its 21-bit value, then sort) or code unit sorting for one specific choice of code unit.

SPARQL is (strictly) codepoint (the abstract character) ordering. No normalization. Section 17.3 says use collation http://www.w3.org/2005/xpath-functions/collation/codepoint (credit to Eric Prud'hommeaux).
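As a rough Python illustration of code point ordering, the sketch below sorts a few single-character strings by their code point values and checks that sorting the UTF-8 encoded bytes gives the same result, since UTF-8 preserves code point order under bytewise comparison. The sample values and the helper name code_points are chosen here for illustration only.

```python
def code_points(s: str) -> list[int]:
    """Logically convert a string to its sequence of code point values."""
    return [ord(ch) for ch in s]


values = ["\U0001F600", "\uFFFD", "z", "\u00e9", "Z"]

by_code_point = sorted(values, key=code_points)
by_utf8_bytes = sorted(values, key=lambda s: s.encode("utf-8"))

print([f"U+{ord(v):04X}" for v in by_code_point])
# ['U+005A', 'U+007A', 'U+00E9', 'U+FFFD', 'U+1F600']
print(by_code_point == by_utf8_bytes)   # True
```

No locale and no normalization are involved at any step, which is what makes the order stable across implementations.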

dlongley · 2022-10-22T20:52:33Z

I made an issue in our JS implementation of URDNA2015 and commented with some example code that might do Unicode code point sorting in JS:

digitalbazaar/rdf-canonize#52 (comment)

gkellogg · 2022-10-24T02:28:17Z

"SPARQL is (strictly) codepoint (the abstract character) ordering. No normalization. Section 17.3 says use collation http://www.w3.org/2005/xpath-functions/collation/codepoint (credit to Eric Prud'hommeaux)."

This would seem like a good reference to cite, then. If it doesn't incorporate locale, it would seem to satisfy our requirements.

gkellogg · 2022-10-31T16:01:36Z

Resolution is to use "Unicode point order" with a reference to "total order".

TallTed · 2022-11-10T20:54:10Z

I think it would be better to cite https://www.w3.org/TR/xpath-functions-31/#codepoint-collation, which normatively defines "The Unicode Codepoint Collation" (identified by http://www.w3.org/2005/xpath-functions/collation/codepoint), rather than to cite http://www.w3.org/2005/xpath-functions/collation/codepoint itself. The latter contains none of the standard W3C TR status notes and merely points back to https://www.w3.org/TR/xpath-functions-31/ in toto.

I do not see any reference in either of these documents to "Unicode point order", nor to "total order" with regard to strings.

aphillips · 2022-11-11T18:21:39Z

The correct terminology is "Unicode code point order". I think this is what W3C I18N will recommend to you. The XPath function pointed to by @TallTed is one good reference. You might glance at our guidance in INTERNATIONAL-SPECS.

The one potential alternative is what HTML and the Infra spec define, which is "code unit order" of UTF-16 strings (e.g. DOMString, or JavaScript strings as ordered by the default Array.prototype.sort()). This is defined in the Infra spec. However, I think this would be inconvenient for you to adopt vs. code point order, and the results (while in a slightly different order) aren't materially better.
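The "slightly different order" shows up only when a code point in the range U+E000 to U+FFFF is compared with a supplementary code point (U+10000 and above), because the latter is encoded in UTF-16 as surrogate code units in the range D800 to DFFF. A minimal Python sketch, where utf16_code_units is a hypothetical helper written for this illustration:

```python
def utf16_code_units(s: str) -> list[int]:
    """Return the UTF-16 code units of s as a list of integers."""
    data = s.encode("utf-16-be")
    return [int.from_bytes(data[i:i + 2], "big") for i in range(0, len(data), 2)]


values = ["\uFFFD", "\U0001F600"]

# Code point order: U+FFFD comes before U+1F600.
print(sorted(values) == ["\uFFFD", "\U0001F600"])                        # True

# UTF-16 code unit order: U+1F600 is D83D DE00, and 0xD83D < 0xFFFD.
print(sorted(values, key=utf16_code_units) == ["\U0001F600", "\uFFFD"])  # True
```

An implementation that sorts with a UTF-16 based string type would therefore need an explicit code point comparison to match implementations that compare code points or UTF-8 bytes.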

Note: The I18N WG tends to shy away from the phrase "lexicographical order" because it implies an alphabetical or dictionary-type ordering, and Unicode code point order is definitely not that overall. Code point order is a stable, deterministic, efficient order and is "lexicographical-enough" to be relatively unsurprising to users (as long as the limitations are called out).

"The problem is generic enough that it must hit other specifications as well (certainly JSON-LD) and may call for an early TAG or I18N WG review."

We (I18N) always appreciate early review.

gkellogg · 2022-11-11T20:57:33Z

Thanks for the review, @aphillips. I've made an update, as you and @TallTed suggested, in PR #33.

I'll raise an issue to be addressed in JSON-LD 1.1 API referencing this issue so we can eventually fix the wording there, as well.

gkellogg added the help wanted and question labels Oct 21, 2022

gkellogg mentioned this issue Oct 21, 2022

Descriptive content #17

Merged

gkellogg linked a pull request Oct 21, 2022 that will close this issue

Descriptive content #17

Merged

gkellogg removed the help wanted label Oct 26, 2022

gkellogg closed this as completed Oct 31, 2022

gkellogg reopened this Nov 10, 2022

gkellogg added the i18n-tracker and spec:enhancement labels and removed the question label Nov 10, 2022

gkellogg self-assigned this Nov 10, 2022

w3cbot mentioned this issue Nov 11, 2022

Unicode ordering w3c/i18n-activity#1613

Closed

This was referenced Nov 11, 2022

Provide guidance on sorting/collation w3c/charmod-norm#222

Open

Add guidance on Unicode code point order w3c/bp-i18n-specdev#83

Closed

gkellogg added a commit that referenced this issue Nov 11, 2022

Change "Unicode point order" to "Unicode code point order" and update…

5bde224

… the normative reference in XPATH-FUNCTIONS. Fixes #18.

gkellogg mentioned this issue Nov 11, 2022

Change "Unicode point order" to "Unicode code point order" #33

Merged

This was referenced Nov 11, 2022

Change "Lexicographical Order" (and related) to "Unicode code point order". w3c/json-ld-api#552

Closed

Change "Lexicographical Order" (and related) to "Unicode code point order". w3c/json-ld-framing#141

Open

gkellogg closed this as completed in #33 Nov 14, 2022

gkellogg added a commit that referenced this issue Nov 14, 2022

Change "Unicode point order" to "Unicode code point order" and update…

c83c4e8

… the normative reference in XPATH-FUNCTIONS. Fixes #18.

gkellogg mentioned this issue Aug 8, 2023

Change "Lexicographical Order" (and related) to "Unicode code point order". w3c/json-ld-syntax#416

Open
