Descriptive content #17
Conversation
This is looking good to me -- my only concern is with properly crediting @msporny's contributions. We should figure out an appropriate way to do that -- whether that be with "former editors" or something along those lines.
Yes, that's totally fine. Thanks!
I'm fine with keeping @msporny as a former editor, although that may be intended for former editors of Recs, more formally. I also suggested elsewhere that emerging practice may also list chairs as having a role someplace in the document header.
@msporny, I believe I got your w3cid right, but let me know. I used "41758" as your w3cid.
Mostly minor, except the i18n and L10n concerns
spec/index.html (outdated)
Compute the first degree hash for each blank node in the dataset using <a href="#hash-1d-quads" class="sectionRef"></a>.</li>
<li><strong>Canonically label unique nodes</strong>.
Assign canonical identifiers via <a href="#issue-identifier" class="sectionRef"></a>,
in lexicographical order, to each blank node whose first degree hash is unique.</li>
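As a rough illustration of the step being reviewed (a hypothetical sketch, not the normative algorithm, with invented helper and identifier names): after computing a first-degree hash for each blank node, canonical identifiers are assigned in sorted hash order, but only to blank nodes whose hash is unique.

```javascript
// Hypothetical sketch of "canonically label unique nodes".
// firstDegreeHashes: Map of blank node label -> hex hash string.
function labelUniqueNodes(firstDegreeHashes) {
  // Group blank nodes by their first-degree hash.
  const byHash = new Map();
  for (const [node, hash] of firstDegreeHashes) {
    if (!byHash.has(hash)) byHash.set(hash, []);
    byHash.get(hash).push(node);
  }
  const canonical = new Map();
  let counter = 0;
  // Sort the hashes; default string sort compares UTF-16 code units,
  // which coincides with code point order for ASCII hex digests.
  for (const hash of [...byHash.keys()].sort()) {
    const nodes = byHash.get(hash);
    if (nodes.length === 1) {
      // Only nodes with a unique hash get a canonical identifier here.
      canonical.set(nodes[0], `c14n${counter++}`);
    }
  }
  return canonical;
}
```

Nodes sharing a hash are left for the later, more expensive disambiguation steps.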
How is "lexicographical order" defined? If I understand this doc correctly as it stands, I think including this definition is important, especially because we're working with global data and users, so we need to consider the complications inherent in i18n and L10n (i.e., internationalization and localization, or internationalisation and localisation, depending on your locale ... noting also that W3C uses US English).
((tangent ... it's too bad that the "append ",spell" to a W3C URI to invoke W3C's spell checker" trick can't be brought to bear on GitHub-hosted documents. I wonder whether W3C's tech magicians could provide the dictionary on which that checker is based for use on GitHub and/or local docs and/or etc?))
The Unicode Collation Algorithm 15.0 defines a sorting algorithm, which can be pretty complex for encodings other than UTF-8.
Although I've always understood the meaning of "lexicographical order" as clear (there is a Wikipedia entry on it), we can certainly define the term and reference Unicode ordering.
Note that we used the term "lexicographical order" and similar in JSON-LD API with no apparent misunderstanding.
[@gkellogg]
The Unicode Collation Algorithm 15.0 defines a sorting algorithm, which can be pretty complex for encodings other than UTF-8.
Given that we're already going to require a number of transformations, I think requiring other encodings to be converted to UTF-8 is not unreasonable. Having skimmed the Unicode sorting algorithm, I'm afraid it will be pretty complex even for UTF-8 encoded data. (I certainly don't fully grasp it yet!) Their discussion of Deterministic Sorting seems particularly important for implementers of our work.
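To make the concern above concrete (an illustrative snippet for any modern JS engine with `Intl` support, not anything from the spec): a UCA/CLDR-backed, locale-aware comparison such as `localeCompare` disagrees with raw code unit order even for plain ASCII.

```javascript
// Locale-aware collation typically ranks letters case-insensitively
// first, so "a" sorts before "B" under the English collation.
const localeOrder = "a".localeCompare("B", "en"); // negative

// Raw comparison uses UTF-16 code units: "B" is 0x42, "a" is 0x61,
// so by code units "B" sorts before "a".
const codeUnitOrder = "a" < "B"; // false
```

This is exactly the kind of divergence a canonicalization spec has to pin down, since two conforming implementations using different notions of "order" would produce different canonical forms.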
[@gkellogg]
there is a Wikipedia entry on [Lexicographical order]
Indeed there is ... and its second paragraph starts with —
[Wikipedia]
There are several variants and generalizations of the lexicographical ordering.
— which is reinforced by the several See Also items —
[Wikipedia]
- Collation
- Kleene–Brouwer order
- Lexicographic preferences
- Lexicographic order topology on the unit square
- Lexicographic ordering in tensor abstract index notation
- Lexicographically minimal string rotation
- Long line (topology)
- Lyndon word
- Star product, a different way of combining partial orders
- Orders on the Cartesian product of totally ordered sets
[@gkellogg]
Note that we used the term "lexicographical order" and similar in JSON-LD API with no apparent misunderstanding.
This is the sort of thing that tends not to get raised as an issue until two people with different interpretations of the meaning of a term trade data and discover their unexpected differences with varying severity of impact. I regret that I was not involved sufficiently with the JSON-LD API work to have caught this as an issue.
[@gkellogg]
The Unicode Collation Algorithm 15.0 defines a sorting algorithm, which can be pretty complex for encodings other than UTF-8. Given that we're already going to require a number of transformations, I think requiring other encodings to be converted to UTF-8 is not unreasonable. Having skimmed the Unicode sorting algorithm, I'm afraid it will be pretty complex even for UTF-8 encoded data. (I certainly don't fully grasp it yet!) Their discussion of Deterministic Sorting seems particularly important for implementers of our work.
I expect String.sort() as implemented on most platforms to do the right thing. It's beyond our scope to exhaustively explore problem areas that belong to the Unicode specs.
See if the definitions I added adequately satisfy your concerns. I really think that we could go overboard trying to specify what lexicographical ordering of Unicode strings means.
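For what it's worth, the default JS sort behavior alluded to above is already locale-independent: `Array.prototype.sort` with no comparator converts elements to strings and compares UTF-16 code units, regardless of the system locale.

```javascript
// Default sort compares UTF-16 code units, so all uppercase ASCII
// letters (0x41-0x5A) sort before all lowercase ones (0x61-0x7A),
// no matter what the platform locale is.
const sorted = ["b", "A", "a", "B"].sort();
// -> ["A", "B", "a", "b"]
```

For strings containing no surrogate pairs, this code unit order also coincides with code point order, which is part of why the default often "does the right thing" in practice.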
Also, note that sorting is done after conversion to N-Quads, which requires quads to be encoded in UTF-8, which simplifies the collation problem.
Well, it looks like you can specify ducet (Default Unicode Collation Element Table (DUCET)) as the locale to use the Unicode Collation Algorithm in JavaScript ... if it's supported.
Unfortunately, it also looks like it's perhaps not supported in browsers (well, not in Firefox anyway, if MDN docs are accurate).
Also, for our particular use case, it's important that we don't sort based on platform / system locale, but rather, ensure that the sort order is the same regardless of platform / system locale. This points to using code points or code units to me.
IMO, our requirements are:
- Consistent sort order no matter the platform / system locale.
- Speed.
- Simplicity to implement.
- Matching (or getting as close as possible) to existing implementations.
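One locale-independent option meeting these requirements (a hypothetical helper for illustration, not text from the spec) is to compare strings code point by code point:

```javascript
// Compare two JS strings by Unicode code point rather than by UTF-16
// code unit. Deterministic and identical on every platform and locale.
function codePointCompare(a, b) {
  let i = 0, j = 0;
  while (i < a.length && j < b.length) {
    const ca = a.codePointAt(i);
    const cb = b.codePointAt(j);
    if (ca !== cb) return ca < cb ? -1 : 1;
    // Supplementary code points occupy two UTF-16 code units.
    i += ca > 0xFFFF ? 2 : 1;
    j += cb > 0xFFFF ? 2 : 1;
  }
  // Equal prefix: the shorter string sorts first.
  return (a.length - i) - (b.length - j);
}
```

The two orders only disagree when a supplementary character (requiring a surrogate pair) is compared against a BMP character above U+D7FF, e.g. U+FF61 vs. U+10000.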
Unicode collation is not helpful. It is locale sensitive and you must know the locale. It is not total.
https://unicode.org/reports/tr10/#Common_Misperceptions
"Swedish and German share most of the same characters, for example, but have very different sorting orders."
"Collation is not a property of strings."
"Collation order is not preserved when comparing sort keys generated from different collation sequences."
For @dlongley's points 1-4, the conclusion is either codepoint sorting (logically convert to 21-bit values, then sort) or code unit ordering for a specific code unit choice.
SPARQL is (strictly) codepoint (the abstract character) ordering.
Section 17.3 says use collation http://www.w3.org/2005/xpath-functions/collation/codepoint
(credit to Eric Prud'hommeaux).
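The Swedish/German divergence quoted above is directly observable via `Intl.Collator` (an illustration; results depend on the ICU/CLDR data shipped with the runtime):

```javascript
// Swedish treats "ä" as a distinct letter sorting after "z";
// German treats "ä" as a variant of "a", sorting well before "z".
const sv = new Intl.Collator("sv").compare("\u00E4", "z"); // positive
const de = new Intl.Collator("de").compare("\u00E4", "z"); // negative
```

Two conforming implementations picking up different system locales would therefore canonicalize the same dataset differently, which is why collation is unsuitable here.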
I also suspect we want to compare using code points. The only reason we'd choose code units would be if most languages already natively sort by UTF-16 code units, but I think this is unlikely.
For those languages that don't offer some fast, native support for code point sorting, but instead compare natively using UTF-16 code units, I imagine the input could be scanned just once to look for surrogate pairs. Only if any were detected would special comparison code need to be used.
This discussion is somewhat split between here and #18 -- I recommend we say in this PR that lexicographical order refers to Unicode code point order and then continue the discussion over there. My guess is that existing implementations already sort by Unicode code point order or are very close and just need a slight adjustment for some data because they compare using UTF-16 code units today.
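The scan-once idea above can be sketched as follows (an assumed optimization strategy, not normative text): strings containing no surrogates compare identically by UTF-16 code unit and by code point, so the slower code point comparison is only needed when surrogates are detected.

```javascript
// Returns true if any UTF-16 code unit is in the surrogate range.
function hasSurrogates(s) {
  for (let i = 0; i < s.length; i++) {
    const c = s.charCodeAt(i);
    if (c >= 0xD800 && c <= 0xDFFF) return true;
  }
  return false;
}

function compareByCodePoint(a, b) {
  if (!hasSurrogates(a) && !hasSurrogates(b)) {
    // Without surrogates, native code unit order equals code point order.
    return a < b ? -1 : a > b ? 1 : 0;
  }
  // Fall back to an explicit per-code-point comparison.
  const as = Array.from(a), bs = Array.from(b); // split into code points
  const n = Math.min(as.length, bs.length);
  for (let k = 0; k < n; k++) {
    const d = as[k].codePointAt(0) - bs[k].codePointAt(0);
    if (d !== 0) return d < 0 ? -1 : 1;
  }
  return as.length === bs.length ? 0 : as.length < bs.length ? -1 : 1;
}
```

The fast path covers the vast majority of real-world N-Quads data, which rarely contains supplementary-plane characters.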
Does DUCET canonicalize codepoints? (https://unicode.org/reports/tr10/#Contractions_DUCET)
If different ways of writing the same character (e.g. via combining characters, per Unicode Normalization) have the same weight, then two different xsd:strings are not totally ordered when one contains the character as one codepoint and the other uses two (the second being a combining character).
Not simple any more!
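The pitfall described above, concretely: "é" can be written as one code point (U+00E9) or as "e" plus a combining acute accent (U+0065 U+0301). The two render identically but are distinct strings under code point comparison unless one normalizes first.

```javascript
const precomposed = "\u00E9"; // é as a single code point
const decomposed = "e\u0301"; // e + combining acute, two code points

precomposed === decomposed;                  // false
precomposed === decomposed.normalize("NFC"); // true
```

So a pure code point order is total over code point sequences, but it distinguishes canonically-equivalent strings; whether that matters depends on whether inputs are normalized before canonicalization.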
Co-authored-by: Ted Thibodeau Jr <tthibodeau@openlinksw.com>
I've incorporated most of the text I thought was straightforward from Arnold/Longley. Still need examples and images, but I think this is a good point to finish this particular PR.
Either the ordering is important and necessary, or it isn't. If it isn't, then all reference to it should be removed. If it is, either we need to provide sufficient detail that the ordering is predictable and consistent, or we need to explain why (and how much of) the unpredictability and/or inconsistency is acceptable. Presuming the ordering is important and necessary, I think @dlongley's suggestion of UTF-8 codepoint order is likely to play an important part. (It was in my own thinking yesterday, but not firm enough to write up, given the other questions still in the air.) I'm pretty sure we'll need more discussion on this, but it probably does need its own issue(s). For now, I'm happy enough that it's on more people's radar than my own.
Ensuring proper ordering, whatever it may be, would be important regardless of what algorithm is chosen for actually navigating the quads to create hashes. I hadn't looked at the documents closely enough to be sensitive to the localization issues, and wasn't aware of the JavaScript implementation issues @dlongley pointed out. Given that this is something many different specifications would face, it's surprising that there is not a single solution useable across domains. This may be a case where we call for early feedback from either the Internationalization WG or the TAG. I'll create an issue to track (#18).
A couple of small tweaks.
@msporny I added the |
Oxford comma. Co-authored-by: Ted Thibodeau Jr <tthibodeau@openlinksw.com>
"Approved", but I think this is a good time to start removing the terminology "lexicographic order". It is confusing because it may be taken to imply the ordering works on characters, not codepoints (combining characters etc.), and hence that Unicode normalization and possibly locale-sensitive ordering is being specified when it's not.
I think we settled on needing a fixed total ordering for any strings (URIs or literals) by using codepoint ordering, which has no string-value or collation semantics.
For speed, add at least an issue box saying we will remove the use of "lexicographic order" and replace it with "total order of RDF terms".
I can do a commit changing "lexicographical order" (and related) to "total order" (or "codepoint order") later this weekend. @dlongley WDYT?
Yes, I think it's a good idea to change the language. My vote is for "Unicode code point order" for clarity. I think "total order" would only work if that's a term itself that we define in the terminology section as Unicode code point order.
No need to define it -- "total ordering" (wikipedia link) is the term for the requirement we need (and not some weird lattice comparison). "Unicode code point order" seems the best choice for a web standard.
… reference to "total ordering" from Wikipedia.
Approving with some suggestions to consider. Thanks!
Co-authored-by: Dave Longley <dlongley@digitalbazaar.com>
cc/ @dlongley