Descriptive content #17

Merged
merged 25 commits into from
Oct 31, 2022


gkellogg
Member

@gkellogg gkellogg commented Oct 17, 2022

  • Update editors and authors, using w3cid. Adds @gkellogg, removes @msporny.
  • Normalize spec references.
  • Reference specific GitHub issues from some of the issue notes in the spec.
  • Move localBiblio to a common file.
  • Add typographical conventions.
  • Add explicit section identifiers, updating some references.
  • Minor intro updates.
  • Add some descriptive information from the Arnold/Longley paper.

cc/ @dlongley



@gkellogg
Member Author

@msporny, @dlongley, I'm pulling from the Arnold/Longley paper, as appropriate. I presume that this is reasonable use, but please let me know if this is okay.

@gkellogg gkellogg changed the title Descriptive content WIP: Descriptive content Oct 17, 2022
@dlongley
Contributor

This is looking good to me -- my only concern is with properly crediting @msporny's contributions. We should figure out an appropriate way to do that -- whether that be with "former editors" or something along those lines.

@dlongley
Contributor

dlongley commented Oct 19, 2022

@gkellogg,

I'm pulling from the Arnold/Longley paper, as appropriate. I presume that this is reasonable use, but please let me know if this is okay.

Yes, that's totally fine. Thanks!

@gkellogg
Member Author

I’m fine with keeping @msporny as a former editor, although that may be intended for former editors of Recs, more formally. I also suggested elsewhere that emerging practice may also list chairs as having a role someplace in the document header.

@msporny, I believe I got your w3cid right, but let me know.
@gkellogg
Member Author

@msporny, I used "41758" as your w3cid, let me know if this is not correct, or suggest a change.

Member

@TallTed TallTed left a comment

Mostly minor, except the i18n and L10n concerns

spec/index.html Outdated
Compute the first degree hash for each blank node in the dataset using <a href="#hash-1d-quads" class="sectionRef"></a>.</li>
<li><strong>Canonically label unique nodes</strong>.
Assign canonical identifiers via <a href="#issue-identifier" class="sectionRef"></a>,
in lexicographical order, to each blank node whose first degree hash is unique.</li>
Member

How is lexicographical order defined? If I understand this doc correctly as it stands, I think including this definition is important, especially because we're working with global data and users, so need to consider the complications inherent in i18n and L10n (i.e., internationalization and localization, or internationalisation and localisation, depending on your locale ... noting also that W3C uses US English).

((tangent ... it's too bad that W3C's spell checker, invoked by appending ",spell" to a W3C URI, can't be brought to bear on GitHub-hosted documents. I wonder whether W3C's tech magicians can provide the dictionary on which that checker must be based for use on GitHub and/or local docs and/or etc?))

Member Author

The Unicode Collation Algorithm 15.0 defines a sorting algorithm, which can be pretty complex for encodings other than UTF-8.

Although I've always understood the meaning of "Lexicographical order" as clear (there is a Wikipedia entry on it), we can certainly define the term and reference Unicode ordering.

Note that we used the term "lexicographical order" and similar in JSON-LD API with no apparent misunderstanding.

Member

[@gkellogg]
The Unicode Collation Algorithm 15.0 defines a sorting algorithm, which can be pretty complex for encodings other than UTF-8.

Given that we're already going to require a number of transformations, I think requiring other encodings to be converted to UTF-8 is not unreasonable. Having skimmed the Unicode sorting algorithm, I'm afraid it will be pretty complex even for UTF-8 encoded data. (I certainly don't fully grasp it yet!) Their discussion of Deterministic Sorting seems particularly important for implementers of our work.

[@gkellogg]
there is a Wikipedia entry on [Lexicographical order]

Indeed there is ... and its second paragraph starts with —

[Wikipedia]
There are several variants and generalizations of the lexicographical ordering.

— which is reinforced by the several See Also items

[Wikipedia]

[@gkellogg]
Note that we used the term "lexicographical order" and similar in JSON-LD API with no apparent misunderstanding.

This is the sort of thing that tends not to get raised as an issue until two people with different interpretations of the meaning of a term trade data and discover their unexpected differences with varying severity of impact. I regret that I was not involved sufficiently with the JSON-LD API work to have caught this as an issue.

Member Author

@gkellogg gkellogg Oct 20, 2022

[@gkellogg]
The Unicode Collation Algorithm 15.0 defines a sorting algorithm, which can be pretty complex for encodings other than UTF-8.

Given that we're already going to require a number of transformations, I think requiring other encodings to be converted to UTF-8 is not unreasonable. Having skimmed the Unicode sorting algorithm, I'm afraid it will be pretty complex even for UTF-8 encoded data. (I certainly don't fully grasp it yet!) Their discussion of Deterministic Sorting seems particularly important for implementers of our work.

I expect String.sort(), as implemented on most platforms, to do the right thing. It's beyond our scope to exhaustively explore problem areas that belong to the Unicode specs.

See if the definitions I added adequately satisfy your concerns. I really think that we could go overboard trying to specify what lexicographical ordering of Unicode strings means.

Member Author

Also, note that sorting is done after conversion to N-Quads, which requires quads to be encoded in UTF-8, which simplifies the collation problem.
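
One reason UTF-8 simplifies things: comparing UTF-8 encoded strings byte by byte gives the same order as comparing the strings by Unicode code point. A minimal Python sketch of that property follows; the sample strings are illustrative assumptions and are not taken from the spec.

```python
# Sample strings are illustrative assumptions, not taken from the spec.
samples = ["z", "\u00e9", "a", "\uff5e", "\U0001f600", "Z", "ab"]

# Python compares str values one code point at a time.
by_code_point = sorted(samples)

# bytes objects compare byte by byte, so this sorts on the raw UTF-8 encoding.
by_utf8_bytes = sorted(samples, key=lambda s: s.encode("utf-8"))

# Both orderings agree for these samples.
print(by_code_point == by_utf8_bytes)
```

An implementation that already holds N-Quads as UTF-8 bytes could therefore, under this assumption, sort the bytes directly without decoding.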

Contributor

Well, it looks like you can specify ducet (Default Unicode Collation Element Table (DUCET)) as the locale to use the Unicode Collation Algorithm in JavaScript ... if it's supported:

https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/Intl/Locale/collation

Unfortunately, it also looks like it's perhaps not supported in browsers (well, not in Firefox anyway, if MDN docs are accurate).

Contributor

@dlongley dlongley Oct 21, 2022

Also, for our particular use case, it's important that we don't sort based on platform / system locale, but rather, ensure that the sort order is the same regardless of platform / system locale. This points to using code points or code units to me.

IMO, our requirements are:

  1. Consistent sort order no matter the platform / system locale.
  2. Speed.
  3. Simplicity to implement.
  4. Matching (or getting as close as possible) to existing implementations.
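
As a rough illustration of requirements 1 to 3, a code point comparison can be written without consulting any locale. The sketch below is only one possible shape; the function name `compare_code_points` and the sample labels are hypothetical and do not come from the spec or from any existing implementation.

```python
from functools import cmp_to_key


def compare_code_points(a: str, b: str) -> int:
    """Return -1, 0, or 1 by comparing a and b one code point at a time."""
    for x, y in zip(a, b):
        if x != y:
            return -1 if ord(x) < ord(y) else 1
    # One string is a prefix of the other: the shorter one sorts first.
    return (len(a) > len(b)) - (len(a) < len(b))


# Hypothetical labels, used only to show the resulting order.
labels = ["_:b10", "_:b2", "_:B1", "_:b1"]
ordered = sorted(labels, key=cmp_to_key(compare_code_points))
print(ordered)
```

In Python the explicit comparator is redundant, because the built-in string comparison already works on code points and ignores the system locale. It is spelled out here only to show how little logic the rule needs.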

Unicode collation is not helpful. It is locale sensitive and you must know the locale. It is not total.

https://unicode.org/reports/tr10/#Common_Misperceptions

"Swedish and German share most of the same characters, for example, but have very different sorting orders."
"Collation is not a property of strings."
"Collation order is not preserved when comparing sort keys generated from different collation sequences."

For @dlongley's points 1-4, the conclusion is either codepoint sorting (logically convert to 21-bit values, then sort) or
code unit sorting for a specific choice of code unit.

SPARQL is (strictly) codepoint (the abstract character) ordering.

Section 17.3 says use collation http://www.w3.org/2005/xpath-functions/collation/codepoint (credit to Eric Prud'hommeaux).

Contributor

I also suspect we want to compare using code points. The only reason we'd choose code units would be if most languages already natively sort by UTF-16 code units, but I think this is unlikely.

For those languages that don't offer some fast, native support for code point sorting, but instead compare natively using UTF-16 code units, I imagine the input could be scanned just once to look for surrogate pairs. Only if any were detected would special comparison code need to be used.

This discussion is somewhat split between here and #18 -- I recommend we say in this PR that lexicographical order refers to Unicode code point order and then continue the discussion over there. My guess is that existing implementations already sort by Unicode code point order or are very close and just need a slight adjustment for some data because they compare using UTF-16 code units today.
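
To make the difference concrete, here is a small Python sketch comparing the two orders. The helper names `utf16_code_unit_key` and `needs_code_point_compare` are hypothetical. The orders can only diverge when a character above U+FFFF (a surrogate pair in UTF-16) is compared against a character in the range U+E000 to U+FFFF.

```python
def utf16_code_unit_key(s: str) -> bytes:
    # Big-endian UTF-16 bytes compare in the same order as the 16-bit code units.
    return s.encode("utf-16-be")


def needs_code_point_compare(s: str) -> bool:
    # Code points above U+FFFF are the ones UTF-16 encodes as surrogate pairs.
    return any(ord(ch) > 0xFFFF for ch in s)


bmp_high = "\uff5e"           # U+FF5E: a single UTF-16 code unit, 0xFF5E
supplementary = "\U0001f600"  # U+1F600: surrogate pair 0xD83D 0xDE00

# Code point order puts U+FF5E first; code unit order puts 0xD83D first.
by_code_point = sorted([supplementary, bmp_high])
by_code_unit = sorted([supplementary, bmp_high], key=utf16_code_unit_key)

print(by_code_point == by_code_unit)
```

The one-pass scan suggested above corresponds to `needs_code_point_compare`: if it returns false for every input string, the UTF-16 code unit order and the code point order coincide for that data.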

Does DUCET canonicalize codepoints? (https://unicode.org/reports/tr10/#Contractions_DUCET)

So different ways of writing the same character (e.g. combining characters, Unicode Normalization) have the same weight, but then two different xsd:strings are not totally ordered if one contains the character as one codepoint and the other uses two (the second being a combining character).

Not simple any more!
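
For contrast, plain code point order treats the two spellings as distinct strings with a definite position each, at the cost of ignoring their canonical equivalence. A minimal Python sketch, assuming no normalization is applied before sorting:

```python
import unicodedata

composed = "\u00e9"     # "é" as one code point, U+00E9
decomposed = "e\u0301"  # "e" followed by U+0301 COMBINING ACUTE ACCENT

# The two spellings are canonically equivalent under NFC normalization...
same_after_nfc = unicodedata.normalize("NFC", decomposed) == composed

# ...but as raw code point sequences they differ, and an unrelated
# string ("f") sorts between them.
ordered = sorted([composed, "f", decomposed])

print(same_after_nfc, ordered == [decomposed, "f", composed])
```

The order is total and deterministic, but it says nothing about whether two strings represent the same text.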

gkellogg and others added 6 commits October 20, 2022 12:54
@gkellogg gkellogg changed the title WIP: Descriptive content Descriptive content Oct 20, 2022
@gkellogg gkellogg marked this pull request as ready for review October 20, 2022 23:18
@gkellogg
Member Author

I've incorporated most of the text I thought was straightforward from Arnold/Longley. Still need examples and images, but I think this is a good point to finish this particular PR.

Co-authored-by: Ted Thibodeau Jr <tthibodeau@openlinksw.com>
@TallTed
Member

TallTed commented Oct 21, 2022

we could go overboard trying to specify what lexicographical ordering of Unicode strings means

Either the ordering is important and necessary, or it isn't. If it isn't, then all reference to it should be removed. If it is, either we need to provide sufficient detail that the ordering is predictable and consistent, or we need to explain why (and how much of) the unpredictability and/or inconsistency is acceptable.

Presuming the ordering is important and necessary, I think @dlongley's suggestion of UTF-8 codepoint order is likely to play an important part. (It was in my own thinking yesterday, but not firm enough to write up, given the other questions still in the air.)

I'm pretty sure we'll need more discussion on this, but it probably does need its own issue(s). For now, I'm happy enough that it's on more people's radar than my own.

@gkellogg
Member Author

gkellogg commented Oct 21, 2022

Ensuring proper ordering, whatever it may be, would be important regardless of what algorithm is chosen for actually navigating the quads to create hashes. I hadn't looked at the documents closely enough to be sensitive to the Localization issues, and wasn't aware of the JavaScript implementation issues @dlongley pointed out.

This is something that many different specifications face, and there does not appear to be a single solution usable across domains. This may be a case where we call for early feedback from either the Internationalization WG or the TAG.

I'll create an issue to track (#18).

@gkellogg gkellogg mentioned this pull request Oct 21, 2022
@gkellogg gkellogg requested review from dlongley, TallTed and afs and removed request for dlongley and TallTed October 26, 2022 16:36
Member

@TallTed TallTed left a comment

A couple of small tweaks.

@gkellogg
Member Author

@msporny I added the issue-summary section, but it looks like it just picked up the issues explicitly mentioned in the spec. Couldn't see anything about configuring to pull all issues from the repo.

Oxford comma.

Co-authored-by: Ted Thibodeau Jr <tthibodeau@openlinksw.com>
@afs afs left a comment

"Approved" but I think this is a good time to start to remove the terminology "lexicographic order". It is confusing because it may be taken to imply it works on characters, not codepoints (combining characters etc), and hence unicode normalization and possibly locale sensitive ordering is being specified when it's not.

I think we settled on needing a fixed total ordering for any strings (URIs or literals) by using codepoint ordering, which carries no string-value or collation semantics.

For speed, at least an issue box saying we will remove the use of "lexicographic order" and replace it with "total order of RDF terms".

@gkellogg
Member Author

I can do a commit changing "lexicographical order" (and related) with "total order" (or "codepoint order") later this weekend.

@dlongley WDYT?

@dlongley
Contributor

dlongley commented Oct 28, 2022

@gkellogg,

Yes, I think it's a good idea to change the language. My vote is for "Unicode code point order" for clarity. I think "total order" would only work if that's a term itself that we define in the terminology section as Unicode code point order.

@afs

afs commented Oct 29, 2022

No need to define it -- "total ordering" (wikipedia link) is the term for the requirement we need (and not some weird lattice comparison).

"Unicode code point order" seems the best choice for a web standard.

… reference to "total ordering" from Wikipedia.
@gkellogg gkellogg requested review from afs and dlongley and removed request for afs October 31, 2022 00:30
Contributor

@dlongley dlongley left a comment

Approving with some suggestions to consider. Thanks!

gkellogg and others added 2 commits October 31, 2022 07:27
Co-authored-by: Dave Longley <dlongley@digitalbazaar.com>
@gkellogg gkellogg merged commit 75279ca into main Oct 31, 2022
@gkellogg gkellogg deleted the descriptive-content branch October 31, 2022 19:17
Development

Successfully merging this pull request may close these issues.

Unicode ordering
5 participants