Unicode ordering #18

Closed
gkellogg opened this issue Oct 21, 2022 · 10 comments · Fixed by #17 or #33
Labels: i18n-tracker, spec:enhancement

Comments

@gkellogg
Member

URDNA2015 depends on a reliable sort order when lexicographically ordering values, so that quads are traversed in a predictable order. As revealed in the discussion on PR #17 (across several comments there), this is a more complicated problem than it first appears, and it affects any algorithm that would be used for C14N.

Our problem is somewhat simplified by N-Quads requiring UTF-8, but even so different implementations may give different results.

The problem is generic enough that it must hit other specifications as well (certainly JSON-LD) and may call for an early TAG or I18N WG review.

@gkellogg added the "help wanted" and "question" labels on Oct 21, 2022
@gkellogg linked a pull request on Oct 21, 2022 that will close this issue
@gkellogg
Member Author

It strikes me that string sorting requirements are the same as for SPARQL, particularly ORDER BY, which uses the < operator, in our case over literals of type xsd:string, which is effectively how a Quad is serialized. So the term lexicographical order could be defined using op:numeric-equal(fn:compare(STR(A), STR(B)), -1).

Perhaps @afs has some recollection on how this issue was considered in SPARQL.

@afs

afs commented Oct 22, 2022

Definition in SPARQL 1.1: sparql11-query/#modOrderBy

which leads to XPath/XQuery fn:compare with no collation.

Section 17.3 says http://www.w3.org/2005/xpath-functions/collation/codepoint (credit to Eric Prud'hommeaux).

The discussion seems to be happening on #17.
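A minimal sketch of the comparison described above, assuming the codepoint collation simply compares strings code point by code point. The names `codepoint_compare` and `less_than` are hypothetical and come from neither SPARQL nor XPath; they only mirror the shape of `fn:compare` returning -1, 0, or 1 and the `<` test built on top of it.

```python
from functools import cmp_to_key


def codepoint_compare(a: str, b: str) -> int:
    """Hypothetical helper: three-way comparison by Unicode code point.

    Returns -1, 0, or 1, in the style of fn:compare with the codepoint
    collation. No normalization and no locale are involved.
    """
    ka = [ord(ch) for ch in a]  # code point values of a
    kb = [ord(ch) for ch in b]  # code point values of b
    return (ka > kb) - (ka < kb)


def less_than(a: str, b: str) -> bool:
    """Hypothetical helper: A < B holds when the comparison yields -1."""
    return codepoint_compare(a, b) == -1


terms = ["b", "B", "\u00e9", "a"]
ordered = sorted(terms, key=cmp_to_key(codepoint_compare))
print(ordered)
```

Uppercase letters sort before lowercase ones and accented letters sort after both, which is deterministic but not a dictionary order.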

@dlongley
Contributor

Left some comments over here: #17 (comment)

I think the discussion should move over here now.

@afs

afs commented Oct 22, 2022

Unicode collation is not helpful. It is locale sensitive, so you must know the locale. It is not a total order. And it involves Unicode Normalization, which equates multiple ways of writing text that looks the same.

https://unicode.org/reports/tr10/#Common_Misperceptions

"Swedish and German share most of the same characters, for example, but have very different sorting orders."
"Collation is not a property of strings."
"Collation order is not preserved when comparing sort keys generated from different collation sequences."
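The normalization point can be illustrated with a small sketch. It uses Python's standard `unicodedata` module; the sample strings are chosen here for illustration and do not come from the thread. Two spellings of "é" become equal after NFC normalization, while plain code point comparison keeps them distinct and places them on either side of "z".

```python
import unicodedata

composed = "\u00e9"     # "é" as the single code point U+00E9
decomposed = "e\u0301"  # "e" followed by U+0301 COMBINING ACUTE ACCENT

# Normalization equates the two spellings.
same_after_nfc = (
    unicodedata.normalize("NFC", composed)
    == unicodedata.normalize("NFC", decomposed)
)

# Python compares str values by code point, with no normalization,
# so the two spellings stay distinct and sort far apart.
ordered = sorted(["z", composed, decomposed])

print(same_after_nfc)
print([[hex(ord(ch)) for ch in s] for s in ordered])
```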

For @dlongley's points 1-4, the conclusion is either code point sorting (logically convert to 21-bit values, then sort) or code unit sorting for one specific choice of code unit.

SPARQL is (strictly) codepoint (the abstract character) ordering. No normalization. Section 17.3 says use collation http://www.w3.org/2005/xpath-functions/collation/codepoint (credit to Eric Prud'hommeaux).

@dlongley
Contributor

I made an issue in our JS implementation of URDNA2015 and commented with some example code that might do Unicode code point sorting in JS:

digitalbazaar/rdf-canonize#52 (comment)

@gkellogg
Member Author

> SPARQL is (strictly) codepoint (the abstract character) ordering. No normalization. Section 17.3 says use collation http://www.w3.org/2005/xpath-functions/collation/codepoint (credit to Eric Prud'hommeaux).

This would seem like a good reference to cite, then. If this doesn’t incorporate Locale, it would seem to satisfy our requirements.

@gkellogg removed the "help wanted" label on Oct 26, 2022
@gkellogg
Member Author

Resolution is to use "Unicode point order" with a reference to "total order".

@TallTed
Member

TallTed commented Nov 10, 2022

I think it would be better to cite https://www.w3.org/TR/xpath-functions-31/#codepoint-collation which normatively defines "The Unicode Codepoint Collation" which is identified by http://www.w3.org/2005/xpath-functions/collation/codepoint, rather than to cite http://www.w3.org/2005/xpath-functions/collation/codepoint which contains none of the standard W3 TR status notes and merely points back to https://www.w3.org/TR/xpath-functions-31/ in toto.

I do not see any reference in either of these documents to "Unicode point order", nor to "total order" in regards to strings.

@gkellogg reopened this on Nov 10, 2022
@gkellogg added the "i18n-tracker" and "spec:enhancement" labels and removed the "question" label on Nov 10, 2022
@gkellogg self-assigned this on Nov 10, 2022
@aphillips

The correct terminology is "Unicode code point order". I think this is what W3C I18N will recommend to you. The XPath function pointed to by @TallTed is one good reference. You might glance at our guidance in INTERNATIONAL-SPECS.

The one potential alternative is what HTML and the Infra spec define, which is "code unit order" of UTF-16 strings (e.g. DOMString, or JavaScript's default Array.prototype.sort() on strings). This is defined in the Infra spec. However, I think this would be inconvenient for you to adopt vs. code point order, and the results (while in a slightly different order) aren't materially better.
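To make the difference concrete, here is a hedged sketch comparing three orderings of the same strings: code point order, UTF-8 byte order, and UTF-16 code unit order. The helper `utf16_code_units` and the sample values are hypothetical, chosen so that one character lies above the surrogate range in the BMP and another lies outside the BMP, which is where the orders diverge.

```python
from typing import List


def utf16_code_units(s: str) -> List[int]:
    """Hypothetical helper: the UTF-16 code units of s as integers."""
    data = s.encode("utf-16-be")
    return [int.from_bytes(data[i:i + 2], "big") for i in range(0, len(data), 2)]


fullwidth_tilde = "\uFF5E"   # U+FF5E, in the BMP above the surrogate range
grinning_face = "\U0001F600"  # U+1F600, encoded in UTF-16 as D83D DE00
values = [grinning_face, fullwidth_tilde, "a"]

# Python's default str comparison is by code point.
by_code_point = sorted(values)
# UTF-8 byte order gives the same result as code point order.
by_utf8_bytes = sorted(values, key=lambda s: s.encode("utf-8"))
# UTF-16 code unit order puts the surrogate pair before U+FF5E.
by_utf16_units = sorted(values, key=utf16_code_units)

for name, result in [
    ("code point", by_code_point),
    ("UTF-8 bytes", by_utf8_bytes),
    ("UTF-16 units", by_utf16_units),
]:
    print(name, [f"U+{ord(s):04X}" for s in result])
```

Because N-Quads is serialized as UTF-8, sorting the encoded bytes and sorting by code point agree; only implementations whose native strings are UTF-16 need extra care.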

Note: The I18N WG tends to shy away from the phrase "lexicographical order" because it implies an alphabetical or dictionary type ordering and Unicode code point order is definitely not that overall. Code point order is a stable, deterministic, efficient order and is "lexicographical-enough" to be relatively unsurprising to users (as long as the limitations are called out).

> The problem is generic enough that it must hit other specifications as well (certainly JSON-LD) and may call for an early TAG or I18N WG review.

We (I18N) always appreciate early review.

@gkellogg
Member Author

gkellogg commented Nov 11, 2022

Thanks for the review, @aphillips. I've made an update, as you and @TallTed suggested, in PR #33.

I'll raise an issue to be addressed in JSON-LD 1.1 API referencing this issue so we can eventually fix the wording there, as well.

5 participants